Box of Tricks is the final product of the SPECO Project (1999-2002) which was funded by the EU through the INCO-COPERNICUS program (Contract no. 977126). The project’s head was Klara Vicsi (Technical University of Budapest, Hungary) who developed the Hungarian version. Box of Tricks was also developed in three other languages: English by Peter Roach & Anna Sfakianaki (University of Reading, United Kingdom), Swedish by Anne-Marie Oster (Kungl. Tekniska Hogskolan, Sweden) and Slovenian by Zdravko Kacic (University of Maribor, Slovenia). There was also a commercial partner, Peter Barczikay (Robot Control Software, Hungary) who was involved in the programming and is still involved in the marketing and sales of Box of Tricks.
Introduction
Box of Tricks is a workstation that provides a real-time visual display of acoustic information for children who need assistance with various aspects of speech production. While learning to speak, children with normal hearing follow a product-oriented approach: they discover how to control their speech organs by reference to acoustic speech signals. In this way they develop the ability to generate all the acoustic effects occurring in speech. Naturally, this process is problematic for speech-impaired people. In traditional speech therapy a process-oriented approach is generally used; the speech therapist gives instructions on how to use the speech organs while forming sounds. During normal speech development, however, children never receive instructions on how to move or where to place their speech articulators.
Instead of the process-oriented approach, or to supplement it, Box of Tricks hopes to offer a product-oriented one. In speech communication it is not the process of articulation that is important, but the quality of the produced sound by which the information is transmitted to the other person. In Box of Tricks, developed for hearing-impaired children, the produced sound is measured and visualised. The user discovers how to control his or her speech organs by comparing the visual patterns (speech pictures) of the normal acoustic speech signal with those of the defective one. Additionally, the acoustic pre-processing of the system uses a special filter bank imitating the filtering characteristics of the inner ear, so the speech picture should be much closer to what is perceived than one produced by a simple bank of traditional filters or by FFT spectra.
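The idea of an ear-like filter bank can be illustrated with a small sketch. The text does not say which auditory scale or filter shapes the SPECO filter bank uses, so the Bark scale (in Traunmüller's approximation), the function names, and the frequency range below are all assumptions chosen for illustration only.

```python
# Illustrative only: band edges spaced evenly on the Bark scale, one common
# approximation of the ear's frequency resolution. The actual SPECO filter
# bank design is not described in the text.

def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    """Inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def band_edges(f_low, f_high, n_bands):
    """Return n_bands + 1 edge frequencies (Hz), equally spaced in Bark."""
    z_low, z_high = hz_to_bark(f_low), hz_to_bark(f_high)
    step = (z_high - z_low) / n_bands
    return [bark_to_hz(z_low + i * step) for i in range(n_bands + 1)]

if __name__ == "__main__":
    for lo, hi in zip(band_edges(100.0, 8000.0, 20), band_edges(100.0, 8000.0, 20)[1:]):
        print(f"{lo:7.0f} Hz to {hi:7.0f} Hz")
```

Bands built this way are narrow at low frequencies and wide at high frequencies, unlike the equal-width bins of an FFT spectrum.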
The components of the system
The system consists of two basic parts. The first part consists of a language-independent editor and measuring system which is used to construct the modules for all SPECO languages. This language-independent editor can be adapted to any European language. The second part consists of language-dependent speech databases. The participating languages are English, Hungarian, Slovenian and Swedish, thus there are four reference speech databases, which the system uses in order to make a decision about the microphone input.
The Child Speech Database
Each language has two databases: the reference-speaker database and the multi-speaker database. The four language versions are divided into two packages: the fricative and affricate support and the vowel support. Regarding the English version, the fricative and affricate support includes the sibilants s, z, S, Z and the affricates tS, and dZ. The vowel support includes the five long vowels i:, 3:, A:, O: and u:, and the six short vowels I, e, Q, {, V and U (symbols are in SAMPA).
The fricative and affricate support was recorded with our reference speaker, Charlie, when he was eight years old, and the vowel support was recorded about a year later. All recordings were carried out in the sound-deadened recording room in the speech lab at the University of Reading, using the special editor incorporated in the SPECO system. Each utterance was recorded three times and the best one was saved and chosen to appear as the reference example in the exercises. The reference database was segmented using a special application within the SPECO editor. The reference examples were segmented so as to feed the system with information about the normal range of each phoneme and to demonstrate the arbitrary limits of each phoneme in the exercise window to assist the speech therapist and the client in training.
The multi-speaker database contains a portion of the reference material. 36 children aged between 7 and 11 were recorded. Each recording session took approximately 8-12 minutes, depending mostly on how fast the child could read the utterances from the cards. The speakers were selected from three different schools. Two of these schools are situated in or near Reading and the third one is in a suburb of London. It may be worth noting that some children had problems articulating certain fricative and affricate sounds, most commonly [Z] and [dZ], especially in isolation. There were also some articulation problems concerning the sounds [r] and [T]. The multi-speaker database was also segmented but this time using software (WASP) not incorporated in the editor itself.
Both databases have been used to establish norms which guide the teaching or remediation process. The segmented material was used to construct fricative and vowel spectra (“spreadlines”, as we call them) and to determine the allowed spectral deviation. The spreadlines are constructed for each language separately and constitute the actual background of the exercise.
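One way to picture a norm with an allowed spectral deviation is a per-band tolerance region computed from several speakers. The sketch below is an assumption, not the SPECO method: the text does not say how the spreadlines or the allowed deviation are calculated, so the mean plus or minus k standard deviations rule, the value of k, and the function names are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical sketch: a tolerance band per frequency band, taken as
# mean +/- k standard deviations over several speakers' spectra.
# The real spreadline computation is not specified in the text.

def build_spreadline(spectra, k=2.0):
    """spectra: list of equal-length spectra (one per speaker), in dB."""
    per_band = list(zip(*spectra))  # regroup values band by band
    return [(mean(b) - k * stdev(b), mean(b) + k * stdev(b)) for b in per_band]

def fraction_within(spectrum, spreadline):
    """Share of bands in which a production falls inside the tolerance band."""
    hits = sum(lo <= v <= hi for v, (lo, hi) in zip(spectrum, spreadline))
    return hits / len(spreadline)

if __name__ == "__main__":
    speakers = [[10.0, 30.0, 42.0], [12.0, 28.0, 40.0], [11.0, 32.0, 44.0]]
    norm = build_spreadline(speakers)
    print(fraction_within([11.0, 29.0, 60.0], norm))
```

A production that stays inside the band in every frequency band scores 1.0; bands where it strays outside lower the score.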
Types of display
The concept of the SPECO system is to visualise speech at a low level of speech processing and to let clients use their high-level information processing ability to work on this. Teaching children how to obtain information from speech pictures is preferable to giving articulation instructions. A detailed examination was carried out to decide what scale of loudness, pitch contour, spectral distribution, etc., gives the most informative visual presentation (speech pictures) of these parameters. How can we draw children’s attention to the areas of maximum energy in the spectrogram? How can we encourage them to use correct loudness and intonation levels? How can children recognise whether their rhythm is appropriate? Generally we use different amusing background drawings to help children find the important parts of the speech pictures. First of all, each phoneme is assigned its own symbolic picture so that the child very quickly finds out which are the significant parts of the screen (Figure 1).
Figure 1 The top picture shows typical cochleagrams of the English fricatives and affricates trained by the system. Each sound corresponds to a particular drawing (bottom picture) so that the client can make the necessary association when looking at the speech picture. For example, the correct production of an [s] (which is symbolised by a snake) would cover most of the eggs with dots.
Some examples of types of speech pictures are the following: energy changing with time (Figure 2); pitch; voiced - unvoiced detection; intonation; spectrum; spectrogram (cochleagram); spectrogram differences.
Figure 2 By saying pi pi pi, the child must make the yellow ball jump over the heads of the worms with the appropriate rhythm.
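An "energy changing with time" picture can be derived from a frame-by-frame energy contour. The following is a minimal sketch under stated assumptions: the frame length, the decibel floor, and the function name are illustrative choices, and the text does not describe how Box of Tricks computes its energy display.

```python
import math

# Illustrative sketch: mean-square energy per non-overlapping frame, in dB.
# Frame length and the silence floor are assumed values, not SPECO settings.

def frame_energy_db(samples, frame_len, floor_db=-80.0):
    """Return one energy value (dB re full scale) per complete frame."""
    contour = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        if mean_square > 0.0:
            contour.append(max(floor_db, 10.0 * math.log10(mean_square)))
        else:
            contour.append(floor_db)  # digital silence
    return contour

if __name__ == "__main__":
    # Two bursts separated by silence, loosely like "pi pi".
    signal = [0.5] * 160 + [0.0] * 160 + [0.5] * 160
    print(frame_energy_db(signal, 160))
```

Plotted against time, such a contour rises for each syllable and falls in the pauses, which is the kind of pattern a rhythm exercise can use to drive an on-screen object.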
The system is based on up-to-date technology, but we follow the steps of traditional speech therapy in both modules. These are sound preparation, sound development, followed by training in words and automation (meaning the achievement of a reliable production not requiring further instruction).
At the sound preparation stage, children are trained to pay the necessary attention to the screen. They start to familiarise themselves with the way curves form on the screen according to sound energy and the position of the speech organs. It is possible to train the adjustment of different speech parameters: loudness, rhythm, spectrum, pitch, voicing and intonation.
In sound development we start with the forming of individual phonemes. This stage includes working with articulation pictures, isolated pronunciation practice and syllable training. The articulation pictures (Figure 3) show the child the correct position of all the organs (mouth, tongue, teeth etc.) that play a role in sound production. After being taught the correct articulation, children attempt to produce sustained sounds.
Figure 3 Articulation picture for fricative [z]; the little bell ringing indicates that there must be voicing when producing this sound.
Figure 4 The spectrum of the English fricative [S] presented as a speech picture by the program. The objective is to produce and sustain a line within the limits of the “green field”.
For syllable training, the vocabulary contains sound sequences constructed so that the phonemes being practised occur in different positions and contexts. These syllables appear on the screen in the form of cochleagrams. For the English fricative and affricate support, fricatives and affricates are presented in CV, VCV, VC and VC-VC-VC position and combined with the five long vowels. The English vowel support, in contrast, contains all vowels in syllables along with front stops such as [p], [t] and [b]. The order of presentation of sound sequences could be important, so we grade them from the easier pronunciations to the more difficult ones. In this exercise, the reference syllable is shown in the upper half of the screen, while the syllable the client produces appears in the bottom half (Figure 5). The client attempts to match his or her picture with the reference one as closely as possible.
Figure 5 The reference syllable [3:s] appears on the top half of the screen and the client’s production below. The blue dots correspond to the vowel and the red dots to the fricative. The aim is to cover as much of the eggs as possible with red dots and leave the snake uncovered.
In training in words, the grouping of words differs between fricative support and vowel support. In fricative support all phonemes are presented in initial, medial and final position in words. In vowel support all phonemes occur in one-syllable words and in words of two or more syllables. Again the upper half of the screen shows the cochleagram of the reference word, and the client has to produce the same word so as to fill in the right parts of the bottom half of the screen (Figure 6).
Figure 6 The reference speech picture (cochleagram of the English word ‘kitchen’) above and the client’s production below. The phoneme trained here is [tS] word-medially, and its symbolic picture is a station (for the closure), a train (for the lower part of the cochleagram) and its smoke (for the actual release). The objective is to cover most of the smoke (release of [tS]) with red dots and leave the station and the train clear.
The automation (or “continuity” for the English version of the system) consists of two parts: contrast pairs and phrases. These exercises work on the basis of cochleagrams as well. The contrast pairs are presented to the child to show the differences between the speech pictures of two phonemes in similar words. For example one of the word pairs chosen to train the phoneme /z/ word-initially is “zip-dip”. The phrases contain the trained phoneme at least once and they are specially designed and graded from simple and short to complex and longer ones.
Gradually the clients learn how to interpret their spreadlines or the dots in the cochleagrams and can easily compare their productions with the model. Until that happens, or for very young children who cannot readily compare the two screens, Box of Tricks gives another type of feedback which can be easily understood. This automatic feedback is placed under the cochleagrams and can take different forms. Every form has five stages in order to show any subtle improvement or deterioration in each production. The feedback can take the form of a duck which moves to the right and lifts its head in joy when the production is correct (Figure 6), a number which changes from 1 to 5 depending on the production (Figure 7), a flower which comes gradually out of its pot as the production improves, or a colour which changes from red to green when the production is correct.
Figure 7 The automatic feedback can be changed through ‘Settings’ and take the form of a duck, a number, a flower or a colour.
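The five-stage feedback described above can be sketched as a simple quantisation of a match score. This is an assumption for illustration: the text states only that each feedback form has five stages, not how a production is scored or where the stage boundaries lie, so the 0 to 1 score and equal-width stages below are hypothetical.

```python
# Hypothetical sketch: map a match score in [0, 1] to one of five feedback
# stages. How Box of Tricks scores a production is not given in the text.

def feedback_stage(score, n_stages=5):
    """Return a stage from 1 (poor) to n_stages (correct)."""
    score = min(1.0, max(0.0, score))  # clamp out-of-range scores
    return min(n_stages, int(score * n_stages) + 1)

if __name__ == "__main__":
    for s in (0.0, 0.3, 0.5, 0.7, 1.0):
        print(s, feedback_stage(s))
```

The same stage number could then select the duck's position, the displayed digit, the flower's height or the colour, so all four forms share one underlying measure.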
An additional feature of Box of Tricks which could be very helpful for the speech therapist is the ‘User Management’ tool. The therapist can have a login name and password and create files for all his or her clients. These files can be created during therapy by saving the client’s productions. Thus a database is created which contains the date of the recording, the type of exercise, the exact utterance (which can also be played back with the ‘Say’ button in Figure 8), the mark, and the therapist’s comments at that time.
Figure 8 This shows the selected files of a certain client. These files can be edited and can be kept confidential if the therapist so chooses. Thus a database is created and the therapist can easily keep track of the client’s progress.
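The kind of record the ‘User Management’ tool keeps can be sketched as a small data structure. The field names, the integer mark, and the helper function are hypothetical; the text lists only what is stored (date, exercise type, utterance, mark and comments), not how the product stores it.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sketch of one saved production. Field names and types are
# assumptions; only the list of stored items comes from the text.

@dataclass
class SessionRecord:
    recorded_on: date
    exercise: str       # e.g. "syllable training"
    utterance: str      # the exact utterance that was saved
    mark: int           # therapist's mark for this production
    comment: str = ""   # therapist's comment at the time

def progress(records, utterance):
    """Marks for one utterance in date order, to follow a client's progress."""
    relevant = sorted(
        (r for r in records if r.utterance == utterance),
        key=lambda r: r.recorded_on,
    )
    return [(r.recorded_on, r.mark) for r in relevant]

if __name__ == "__main__":
    history = [
        SessionRecord(date(2002, 3, 12), "training in words", "kitchen", 4),
        SessionRecord(date(2002, 2, 5), "training in words", "kitchen", 2, "weak release"),
    ]
    print(progress(history, "kitchen"))
```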
For more information about the project SPECO and the product Box of Tricks you can visit the official website: