Automatic Phonological Transcription of Speech Corpora


Abstract and Keywords

This chapter provides an overview of the state of the art in automatic phonological transcription. It discusses the most relevant methodological issues with regard to automatic transcription, and presents in detail the various (semi-)automatic procedures that are currently in use to obtain, evaluate, and optimize automatic transcriptions.

5.1 Introduction

Within the framework of Corpus Phonology, spoken language corpora are used for conducting research on speakers’ and listeners’ knowledge of the sound system of their native languages, and on the laws underlying such sound systems as well as their role in first and second language acquisition. Many of these studies require a phonological annotation of the speech data contained in the corpus. The present chapter discusses how such phonological annotations can be obtained (semi-)automatically.
An important distinction that should be drawn is whether or not the audio signals come with a verbatim transcription (orthographic transcription) of the spoken utterances. If an orthographic annotation is available, the (semi-)automatic phonological annotation could in theory be derived directly from the verbatim transcription through a simple conversion procedure without an automatic analysis of the original speech signals. In this case, the strings of symbols representing the graphemes in words are replaced by corresponding strings of symbols representing phonemes. This can be achieved by resorting to a grapheme–phoneme conversion algorithm, through a lexicon look-up procedure in which each individual word is replaced by its phonological representation as found in a pronunciation dictionary, or by a combination of the two (Binnenpoorte 2006). The term ‘phonological representation’ is often used for this kind of annotation, as suggested by Oostdijk and Boves (2008: 650). It is important to realize that such a phonological representation does not provide information on the speech sounds that were actually realized, but relies on what is already known about the possible ways in which words can be pronounced (pronunciation variants). Such pronunciation variants may also refer to sandhi phenomena and can be used to model processes such as cross-word assimilation or phoneme intrusions, but the choice of the variants will be based on phonological knowledge and not on an analysis of the speech signals.
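As an illustration, the sketch below implements the look-up-with-fallback procedure just described. The toy lexicon, the naive letter-to-phone table, and the function names are our own illustrative assumptions, not an existing tool; a real system would use a full pronunciation dictionary and a trained grapheme–phoneme converter.

```python
# A minimal sketch of lexicon look-up with a grapheme-to-phoneme fallback.

# Toy pronunciation lexicon: orthography -> phone symbols (SAMPA-like).
LEXICON = {
    "amsterdam": ["A", "m", "s", "t", "@", "R", "d", "A", "m"],
    "niet": ["n", "i", "t"],
}

def g2p_fallback(word):
    """Hypothetical letter-to-phone rules, for illustration only; a real
    system would use a trained G2P model."""
    letter_to_phone = {"a": "A", "e": "@", "i": "I", "o": "O", "u": "Y"}
    return [letter_to_phone.get(ch, ch) for ch in word]

def phonological_representation(orthographic_transcription):
    """Replace each word by its lexicon transcription; fall back to G2P
    for out-of-vocabulary words."""
    phones = []
    for word in orthographic_transcription.lower().split():
        phones.extend(LEXICON.get(word, g2p_fallback(word)))
    return phones

print(phonological_representation("Amsterdam niet"))
```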
Alternatively, a phonological annotation of the words in a spoken language corpus can be obtained automatically through an analysis of the original speech signals by means of automatic speech recognition algorithms. In this case, previously trained acoustic models of the phonemes to be identified are used together with the speech signal and the corresponding orthographic transcription, if available, to provide the most likely string of phonemes that reflects the speech sounds that were actually realized. In this chapter we will reserve the term ‘automatic phonological transcription’ for this latter type of analysis, as suggested by Oostdijk and Boves (2008: 650). It is also this form of phonological annotation, and the various (semi-)automatic procedures to obtain, evaluate, and optimize them, that will be the focus of the present chapter.
Phonological transcriptions have long been used in linguistic research, for both explorative and hypothesis testing purposes. More recently, phonological transcriptions have proven to be very useful for speech technology too, for example, for automatic speech recognition and for speech synthesis. In addition, the development of multi-purpose speech corpora that we have witnessed in the last decades—e.g. TIMIT (Zue et al. 1990), Switchboard (Godfrey et al. 1992), Verbmobil (Hess et al. 1995), the Spoken Dutch Corpus (Oostdijk 2002), the Corpus of Spontaneous Japanese (Maekawa 2003), Buckeye (Pitt et al. 2005), and ‘German Today’ (Brinckmann et al. 2008)—has underlined the importance of phonological transcriptions of speech data, because these considerably increase the value of such corpora for scientific research and application development.
Both orthographic transcriptions and phonological transcriptions are known to be time-consuming and costly. In general, the more detailed the transcription, the higher the cost. Orthographic transcriptions appear to be produced at speeds varying from three to five times real-time, depending on the quality of the recording, the speech style, the quality of the transcription desired, and the skill of the transcriber (Hazen 2006). Highly accurate transcriptions that account for all speech events (filled pauses, partial words, etc.) as well as other meta-data (speaker identities and changes, non-speech artefacts and noises, etc.) can take up to 50 times real-time depending on the nature of the data and the level of detail of the meta-data (Barras et al. 2001; Strassel and Glenn 2004).
Making phonological transcriptions from scratch can take up to 50–60 times real-time, because in this case transcribers have to compose the transcription and choose a symbol for every single speech sound. An alternative, less time-consuming procedure consists in having transcribers correct an example transcription, i.e. a transcription of the word in question taken from a lexicon or a dictionary, which transcribers can edit and improve after having listened to the corresponding utterance. Both for Switchboard and for the Spoken Dutch Corpus, transcription costs were restricted by presenting trained students with an example transcription. The students were asked to verify this transcription rather than transcribing from scratch (Greenberg et al. 1996; Goddijn and Binnenpoorte 2003). Although such a check-and-correct procedure is very attractive in terms of cost reduction, it has been suggested that it may bias the resulting transcriptions towards the example transcription (Binnenpoorte 2006). In addition, the costs involved in such a procedure are still quite substantial. Demuynck et al. (2002) reported that the manual verification process took 15 minutes for one minute of speech recorded in formal lectures and 40 minutes for one minute of spontaneous speech.
Because of the problems involved in obtaining phonological transcriptions—the time required, the high costs incurred, the often limited accuracy obtained, and the need to transcribe large amounts of data—researchers have been looking for ways of automating this process, for example by employing speech recognition algorithms. The advantages of automatic phonological transcriptions become really evident when it comes to exploring large speech databases. First, automatic phonological transcriptions make it possible to achieve uniformity in transcription. With manual transcription this aim would be utopian: large amounts of speech data cannot possibly be transcribed by one person, and the more transcribers are involved, the less uniform the transcriptions are going to be. Eliminating part of this subjectivity in transcriptions can be very advantageous, especially when analysing large amounts of data. Second, with automatic methods it is possible to generate phonological transcriptions of huge amounts of data that would otherwise remain unexplored. The fact that large amounts of material can be analysed in a relatively short time, and at relatively low cost, makes automatic phonological transcription even more interesting. The importance of this aspect for the generalizability of the results cannot be overestimated. And although the automatic procedures used to generate automatic phonological transcriptions are not infallible, the advantages of a very large dataset might very well outweigh the errors that these procedures introduce.
In this chapter we provide an overview of the state of the art in automatic phonological transcription, paying special attention to the most relevant methodological issues and the ways they have been approached.

5.1.1 Types of Phonological Transcription

Before we discuss the various ways in which phonological transcriptions can be obtained (semi-)automatically, it is worth clarifying a number of terms that will be used in the remainder of this chapter. First of all, a distinction should be drawn between segmental and suprasegmental phonological transcriptions. In this chapter the focus will be on the former, and in particular on (semi-)automatic ways of obtaining segmental annotations of speech; but this is not to say that there has been no research on (semi-)automatic transcription of suprasegmental processes. For instance, important work in this direction was carried out in the 1990s within the framework of the Multext project (Gibbon and Llisterri 1994; Llisterri 1996), and more recently by Mertens (2004b), Tamburini and Caini (2005), Obin et al. (2009), Avanzi, Lacheret et al. (2010), and Lacheret et al. (2010). A discussion of these approaches is, however, beyond the scope of this chapter.
Even within the category of segmental phonological transcriptions, different types can be distinguished depending on the symbols used and the degree of detail recorded in the transcriptions. In the literature, various classifications have been provided on the basis of these parameters (for a brief overview, see Cucchiarini 1993). With respect to the notation symbols, in this chapter we will be concerned only with alphabetic notations, in particular with computer-readable notations, as this is a prerequisite for automatic transcription.
Over the years different computer-readable notation systems have been developed. A widely used one in Europe is SAMPA (Wells 1997; http://www.phon.ucl.ac.uk/home/sampa/), a mapping between symbols of the International Phonetic Alphabet and ASCII codes, which was established through a consultation process among international speech researchers. X-SAMPA is an extended version of SAMPA intended to cover every symbol of the IPA Chart, so as to make it possible to provide a machine-readable phonetic transcription for every known human language.
However, many other systems have been introduced. Arpabet was developed by the Advanced Research Projects Agency (ARPA) and consists of a mapping between the phonemes of General American English and ASCII characters. Worldbet (Hieronymus 1994) is a more extended mapping between ASCII codes and IPA symbols intended to cover a wider set of the world’s languages. Besides the computer phonetic alphabets mentioned here, many others exist (see e.g. Hieronymus 1994; EAGLES 1996; Wells 1997; Draxler 2005).
With regard to the degree of detail to be recorded in transcription, it appears that in general two types of transcription are distinguished: broad phonetic, or phonemic, and narrow phonetic, or allophonic. A broad phonetic transcription indicates only the distinctive units of an utterance, thus presupposing knowledge of the precise realization of the sounds transcribed. A narrow phonetic transcription attempts to provide such details. For transcriptions made by human transcribers, it holds that the more detailed the transcription, the more time-consuming and costly it will be. In addition, more detailed transcriptions are likely to be less consistent. Also, for automatically generated transcriptions it holds that recording more details requires more effort, albeit not in the same way and to the same extent as for manually generated transcriptions.
The type of phonological transcription that is generally contained in spoken language corpora is broad phonetic, although narrow transcriptions are sometimes also provided, for example, in the Switchboard corpus (Greenberg et al. 1996). In general, the degree of detail required of a transcription will essentially depend on the aim for which the transcription is made. In the case of multi-purpose corpora of spoken language, it is therefore often decided to adopt a broad phonetic transcription, and to make it possible for individual users to add the details that are relevant for their own research at a later stage.
An important question that is partly related to the degree of detail recorded in phonological transcription concerns the accuracy of transcriptions. As a matter of fact, there is a trade-off between degree of detail and accuracy (Gut and Bayerl 2004), defined in terms of reliability and validity (see section 1.3): the more details the transcription contains, the less likely it is to be accurate. This may imply that, although a high level of detail may be considered essential for a specific linguistic study, for instance on degree of devoicing or aspiration, it may nevertheless turn out to be extremely difficult or impossible to obtain transcriptions of that specific phenomenon that are sufficiently accurate. This brings us to another important aspect of phonological transcription in general and of automatic phonological transcription in particular: that of its accuracy. This will be discussed below.

5.1.2 Evaluation of Phonological Transcriptions

Before phonological transcriptions can be used for research or applications, it is important to know how accurate they are. This problem of transcription quality assessment is not new: it has long been recognized for manual phonological transcriptions as well (Shriberg and Lof 1991; Cucchiarini 1993, 1996; Wesenick and Kipp 1996). Phonological transcriptions, whether they are obtained automatically or produced by human transcribers, are generally used as a basis for further processing (research, ASR (automatic speech recognition) training, etc.). They can be viewed as representations or measurements of the speech signal, and it is therefore legitimate to ask to what extent they achieve the standards of reliability and validity that are required of any form of measurement. With respect to automatic transcriptions, the problem of quality assessment is complex because comparison with human performance, which is customary in many fields, is not straightforward, owing to the subjectivity of human transcriptions and to a series of methodologically complex issues that will be explained below.

5.1.3 Reliability and Validity of Phonological Transcriptions

In general terms, the reliability of a measuring instrument represents the degree of consistency observed between repeated measurements of the same object made with that instrument. It is an indication of the degree of accuracy of a measuring device. Validity, on the other hand, is concerned with whether the instrument measures what it purports to measure. In fact, the definitions of reliability and validity used in test theory are much more complex and will not be treated in this chapter. The description provided above points to an important difference between human-made and automatic transcriptions: human transcriptions suffer from intra-subject and inter-subject variation, so repeated measurements of the same object will differ from each other. With automatic transcriptions this can be prevented, because a machine can be programmed in such a way that repeated measurements of the same object always give the same result, thus yielding a reliability coefficient of 1, the highest possible. It follows that with respect to the quality of automatic transcription, only one (albeit not trivial) question needs to be answered, viz. that concerning validity.

5.1.4 Defining a Reference Phonological Transcription

The description of validity given above suggests that any validation activity implies the existence of a correct representation of what is to be measured, a so-called benchmark or ‘true’ criterion score (as in test theory), a gold standard. The difficulties in obtaining such a benchmark transcription are well known, and it is generally acknowledged that there is no absolute truth of the matter as to what phones a speaker produced in an utterance (Cucchiarini 1993, 1996; Wester et al. 2001). For instance, in an experiment we asked nine experienced listeners to judge whether a phone was present or not for 467 cases (Wester et al. 2001). The results showed that all nine listeners agreed in only 246 of the 467 cases, which is just under 53 per cent (see section 2.2.2.2). Furthermore, a substantial amount of variation was observed between the nine listeners: the values of Cohen’s kappa varied from 0.49 to 0.73 for the various listener pairs. It follows that one cannot establish the validity of an automatic transcription simply by comparing it with an arbitrarily chosen human transcription, because the latter would inevitably contain errors. Unfortunately, this seems to be the practice in many studies on automatic transcription. To circumvent the problems caused by the lack of a reference point as much as possible, different procedures have been devised to obtain reference transcriptions. One possibility consists of using a consensus transcription, i.e. a transcription made by at least two experienced phoneticians after having reached a consensus on each symbol contained in the transcript (Shriberg et al. 1984). The fact that different transcribers are involved and that they have to reach a consensus before writing down the symbols can be seen as an attempt to minimize errors of measurement, thus approaching ‘true’ criterion scores. Another option is to have more than one transcriber transcribe the material and to use only that part of the material on which all transcribers, or at least a majority of them, agree (Kuipers and van Donselaar 1997; Kessens et al. 1998).
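The majority-vote and pairwise-agreement computations used in such procedures are easy to make concrete. The sketch below, with toy judgements standing in for the real 467 cases, keeps only the cases on which at least N raters agree and computes Cohen’s kappa for each rater pair; the function names and data are illustrative, not taken from the studies cited.

```python
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa for two raters over binary present/absent judgements."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each rater's marginal probabilities.
    pa1, pb1 = sum(a) / n, sum(b) / n
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (p_obs - p_exp) / (1 - p_exp)

def majority_reference(judgements, n_required):
    """Keep only cases where at least n_required raters agree; return
    (case index, majority label) pairs as a reference set."""
    reference = []
    for i, case in enumerate(zip(*judgements)):   # one tuple per case
        ones = sum(case)
        if ones >= n_required or len(case) - ones >= n_required:
            reference.append((i, int(ones > len(case) - ones)))
    return reference

# Toy data: three raters, five cases (1 = phone judged present).
raters = [[1, 0, 1, 1, 0],
          [1, 0, 1, 0, 0],
          [1, 1, 1, 1, 0]]
print(majority_reference(raters, n_required=3))   # unanimous cases only
for (i, a), (j, b) in combinations(enumerate(raters), 2):
    print(f"kappa raters {i}-{j}: {cohens_kappa(a, b):.2f}")
```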

5.1.5 Comparing Phonological Transcriptions

Another issue that has to be addressed in automatic phonological transcription is how to determine whether the quality of a given transcription is satisfactory. Once a reference phonological transcription has been defined, the obvious choice is to carry out some sort of alignment between the reference phonological transcription and the automatic phonological transcription, with a view to determining a distance measure, which in turn provides a measure of transcription quality.
For this purpose, dynamic programming (DP) algorithms with different weightings have been used by various authors (Wagner and Fischer 1974; Picone et al. 1986; Hanna et al. 1999). Several of these DP algorithms are compared in Kondrak and Sherif (2006).
In the current chapter we will refer to dynamic programming, agreement scores, error rates, and related issues. Some explanation of these issues is provided here.
A standard (simple) DP algorithm is one in which the penalty for an insertion, deletion, or substitution is 1. However, when using such DP algorithms we often obtained suboptimal alignments, such as the following:
Tref = /A m s t @ R d A m/
Tbu  = /A m s # @ t a: n #/   (# = gap symbol inserted by the alignment)
For this reason, we decided to make use of a more sophisticated DP alignment procedure. In this second DP algorithm, the distance between two phones is not just 0 (when they are identical) or 1 (when they are not), but more gradual: it is calculated on the basis of the articulatory features defining the speech sounds the symbols stand for; more details about this DP algorithm can be found in Cucchiarini (1993: 96) and Elffers et al. (2005). Using this second DP algorithm, the following alignment was found for the example above:
Tref = /A m s t @ R d A m/
Tbu  = /A m s # @ # t a: n/
It is obvious that the second alignment is better than the first. Since, in general, the alignments obtained with DP algorithm 2 were more plausible than those obtained with DP algorithm 1, DP algorithm 2 was used to determine the alignments. Similar feature-based alignment algorithms have been proposed and used by others (see Kondrak and Sherif 2006).
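A minimal sketch of such a feature-based DP alignment follows. The small feature table and the Jaccard-style distance are illustrative assumptions, not the actual feature set or weighting of Cucchiarini (1993) or Elffers et al. (2005); with these toy values the algorithm nevertheless recovers the second, more plausible alignment of the example above.

```python
# Feature-based DP alignment: substitution cost is graded by shared
# articulatory features instead of a flat 0/1 penalty.

FEATURES = {  # phone -> articulatory features (toy fragment, SAMPA-like)
    "t":  {"plosive", "alveolar", "voiceless"},
    "d":  {"plosive", "alveolar", "voiced"},
    "R":  {"approximant", "uvular", "voiced"},
    "n":  {"nasal", "alveolar", "voiced"},
    "m":  {"nasal", "bilabial", "voiced"},
    "A":  {"vowel", "open", "back"},
    "a:": {"vowel", "open", "front"},
    "@":  {"vowel", "mid", "central"},
    "s":  {"fricative", "alveolar", "voiceless"},
}

def sub_cost(p, q):
    """Graded distance in [0, 1]: proportion of non-shared features."""
    if p == q:
        return 0.0
    f, g = FEATURES[p], FEATURES[q]
    return 1.0 - len(f & g) / len(f | g)

def align(ref, hyp, gap_cost=1.0):
    """Weighted Levenshtein DP plus traceback ('#' marks a gap)."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap_cost
    for j in range(1, m + 1):
        d[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + gap_cost,        # deletion
                          d[i][j - 1] + gap_cost,        # insertion
                          d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]))
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + sub_cost(ref[i-1], hyp[j-1]):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + gap_cost:
            pairs.append((ref[i - 1], "#")); i -= 1
        else:
            pairs.append(("#", hyp[j - 1])); j -= 1
    return d[n][m], pairs[::-1]

ref = ["A", "m", "s", "t", "@", "R", "d", "A", "m"]   # /Amst@RdAm/
hyp = ["A", "m", "s", "@", "t", "a:", "n"]            # /Ams@ta:n/
cost, pairs = align(ref, hyp)
print(cost)                       # total distance with these toy features
for r, h in pairs:
    print(f"{r:>2} : {h}")
```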
These dynamic programming algorithms can be used to align not only automatic phonological transcriptions and reference phonological transcriptions, but also other phonological transcriptions. Besides being used to assess the quality of phonological transcriptions, they can also be used to study in what respects the compared phonological transcriptions deviate from each other, and to obtain information on pronunciation variation. For instance, our DP algorithms compare the two transcriptions and return various data, such as an overall distance measure, the number of insertions, deletions, and substitutions of phonemes, and more detailed data indicating to which features substitutions are related. This kind of information can be extremely valuable if one wants to know how the automatic phonological transcription differs from the reference phonological transcription, and how the automatic phonological transcription could be improved (see e.g. Cucchiarini et al. 2001; Cucchiarini and Binnenpoorte 2002; Binnenpoorte and Cucchiarini 2003).

5.1.6 Determining when an Automatic Phonological Transcription is of Satisfactory Quality

After having established how much an automatic phonological transcription differs from a reference phonological transcription, one needs some reference data to determine whether the observed distance is acceptable. In other words, how can we determine whether the quality of a given automatic phonological transcription is satisfactory? Again, human transcriptions could be used as a point of reference. For instance, one could compare the degree of agreement between the automatic phonological transcription and the reference phonological transcription with the degree of agreement observed between human transcriptions of the same utterances, of the same level of detail, made under similar conditions, because this agreement level constitutes the upper bound (as in the study reported in Wesenick and Kipp 1996). If the degree of agreement between the automatic phonological transcription and the reference phonological transcription is comparable to what is usually observed between human transcriptions, one could accept the automatic phonological transcription as is; if it is lower, the automatic phonological transcription should first be improved. However, the problem with this approach is that it is difficult to find data on human transcriptions to be used as reference (for more information on this point, see Cucchiarini and Binnenpoorte 2002). Whether a transcription is of satisfactory quality will also depend on the purpose of the transcription. Some differences in transcriptions can be important for one application, but less important for another. Therefore, application, goal, and context should be taken into account for meaningful transcription evaluation (van Bael et al. 2003, 2007).

5.2 Obtaining Phonological Transcriptions

In the current section we look at (semi-)automatic methods for obtaining phonological transcriptions. We start with completely automatic methods, distinguishing between cases in which orthographic transcriptions are not available and cases in which they are. We then discuss comparing (combinations of) methods, including methods in which a (small) part of the material is manually transcribed and subsequently used to improve automatic phonological transcriptions for a larger amount of speech. Finally, we look at automatic phonological transcription optimization. However, before discussing how automatic phonological transcriptions can be obtained, we provide a brief explanation of how automatic speech recognition works.

5.2.1 Automatic Speech Recognition (ASR)

Standard ASR systems are generally employed to recognize words. An ASR system consists of a decoder (the search algorithm) and three ‘knowledge sources’: the language model (LM), the lexicon, and the acoustic models. The LM contains probabilities of words and sequences of words. Acoustic models are models of how the sounds of a language are pronounced; in most cases so-called hidden Markov models (HMMs) are used, but it is also possible to use artificial neural networks (ANNs). The lexicon is the connection between the language model and the acoustic models. It contains information on how the words are pronounced, in terms of sequences of sounds. The lexicon therefore contains two representations for every entry: an orthographic and a phonological transcription. Most lexicons contain words with more than one entry, i.e. pronunciation variants.
ASR is a probabilistic procedure. In a nutshell, ASR (with HMMs) works as follows. The LM defines which sequences of words are possible, for each word the possible variants and their transcriptions are retrieved from the lexicon, and for each sound in these transcriptions the appropriate HMM is retrieved. Everything is represented by means of a huge probabilistic network: an LM is a network of words, each word is a network of pronunciation variants and their transcriptions, and for each of the sounds in these transcriptions the corresponding HMM is a network of its own. In this huge and complex network, paths have probabilities attached to them. For a given (incoming, unknown) speech signal, the task of the decoder is to find the optimal global path in this network, using all the probabilistic information. In standard word recognition the output then consists of the labels of the words on this optimal path: the recognized words. However, the optimal path can contain more information than just that concerning the word labels, e.g. information on pronunciation variants, the phone symbols in these pronunciation variants, and even the segmentation at phone level.
The description provided above is a short description of a standard ASR system, i.e. one that is used for word recognition. However, it is also possible to use ASR systems in other ways. For instance, it is possible to perform phone recognition by using only the acoustic models, i.e. without the top-down constraints of language model and lexicon. Alternatively, the lexicon may contain phones instead of words. If there are no further restrictions (in the LM), we are dealing with so-called free or unrestricted phone recognition, whereas if the LM contains a model with probabilities of phone sequences (a kind of phonotactic constraint), then we have restricted phone recognition. These phone LMs are generally trained on canonical phonological transcriptions. For instance, in some experiments on automatic phonological transcription that we carried out (see section 2.3), it turned out that 4-gram models outperformed 2-gram, 3-gram, 5-gram, and 6-gram models (van Bael et al. 2006, 2007). Such a 4-gram model contains the probabilities of sequences of 4 phones.
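To make the notion of a phone LM concrete, the sketch below trains an unsmoothed maximum-likelihood 4-gram model on canonical transcriptions; all names are ours, and a real system would add smoothing and work with log probabilities.

```python
from collections import defaultdict

def train_phone_ngram(transcriptions, n=4):
    """Maximum-likelihood phone n-gram model trained on canonical
    transcriptions (no smoothing, for illustration only)."""
    counts, context_counts = defaultdict(int), defaultdict(int)
    for phones in transcriptions:
        padded = ["<s>"] * (n - 1) + phones + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            counts[history + (padded[i],)] += 1
            context_counts[history] += 1

    def prob(history, phone):
        """P(phone | last n-1 phones of history)."""
        h = tuple(history[-(n - 1):])
        if context_counts[h] == 0:
            return 0.0
        return counts[h + (phone,)] / context_counts[h]

    return prob

# Toy training data: two canonical transcriptions.
data = [["A", "m", "s", "t", "@", "R", "d", "A", "m"],
        ["A", "m", "s", "t", "@", "l"]]
p = train_phone_ngram(data, n=4)
print(p(["A", "m", "s"], "t"))   # P(t | A m s) = 1.0 in this toy corpus
```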

5.2.2 Automatic Methods


5.2.2.1 Automatic Methods: No Orthography


In general, if orthographic transcriptions are present, it is better to derive the phonological transcriptions not only from the audio signals, but from a combination of audio signals and orthographic transcriptions. But what to do if no orthographic transcriptions are present?

5.2.2.1.1 No Orthography, ASR

If no orthographic annotation is present, an obvious solution would be to use ASR to obtain it. However, since ASR is not flawless, this orthographic transcription is likely to contain ASR errors. These errors in the orthographic transcription would counterbalance the positive effects of using the orthographic representation for obtaining automatic phonological transcriptions. Whether the net effect is positive depends on the task. For some tasks that are relatively easy for ASR, such as isolated digit recognition, the net effect may even be positive, but for most tasks this will probably not be the case.

5.2.2.1.2 No Orthography, Phone Recognition

Another option, if no orthographic representation is available, is to use phone recognition (see section 2.1 on ASR). For this purpose, completely unrestricted phone recognition can be used, but usually some (phonotactic) constraints are employed in the form of a phone language model. Phone accuracy can be highly variable, roughly between 30 and 70 per cent, depending on factors such as speech style and quality of the speech (amount of background noise) (see e.g. Chang 2002). For instance, for one of our ASR systems we measured a phone accuracy level of 63 per cent for extemporaneous speech (Wester et al. 1998). In general, high accuracy values can be obtained for relatively easy tasks (e.g. carefully read speech), and by carefully tuning the ASR system for specific tasks (i.e. speech style, dialect or accent, gender, or speaker).
In general, such levels of phone accuracy are too low, and thus the resulting automatic phonological transcriptions cannot be used directly for most applications. Still, phone recognition can be useful. For our ASR system with a phone accuracy of 63 per cent we examined the resulting phone strings by comparing them to canonical transcriptions (Wester et al. 1998). Canonical transcriptions can be obtained by means of lexicons, grapheme-to-phoneme conversion tools (for an overview, see Bisani and Ney 2008), or a combination of the two. Since the quality of the phonological transcriptions in lexicons is usually better than that of grapheme-to-phoneme conversion tools, in many applications one first looks up the phonological transcriptions of words in lexicons, and only applies grapheme-to-phoneme conversion to words not found there. In Wester et al. (1998) it was found that the number of insertions (4 per cent) was much smaller than the number of deletions (17 per cent) and substitutions (15 per cent). Furthermore, the vowels remained identical more often than the consonants, mainly because they were deleted less often. Finally, we studied the most frequently observed processes, which were all deletions. These frequent processes turned out to be plausible connected speech processes (see Wester et al. 1998): some are related to Dutch phonological processes that have been described in the literature (e.g. /n/-deletion, /t/-deletion, and /@/-deletion are described in Booij 1995), while others had not been described before.
Phone recognition can thus be used for hypothesis generation (Wester et al. 1998). However, owing to the considerable number of inaccuracies in unsupervised phone recognition, it is often necessary to check or filter the output of phone recognition. The latter can be done by applying decision trees (Fosler-Lussier 1999; van Bael et al. 2006, 2007) or forced recognition (Kessens and Strik 2001, 2004). The results of phone recognition can be described in terms of context-dependent rewrite rules. Various criteria can be employed for filtering these rewrite rules: straightforward criteria are, for example, the absolute frequency with which changes (insertions, deletions, or substitutions) occur, or the relative frequency, i.e. the absolute frequency divided by the number of times the conditions of the rule are met (see e.g. Kessens et al. 2002). Of course, combinations of criteria are also possible.
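The rule extraction and filtering steps can be sketched as follows, in the spirit of the criteria just mentioned. The rule representation (left context, canonical phone, right context, realized phone), the thresholds, and the toy alignments are illustrative assumptions.

```python
from collections import Counter

def extract_rules(aligned_pairs):
    """Collect context-dependent rewrite rules from aligned (canonical,
    recognized) phone pairs, where '#' marks a gap."""
    rule_counts, context_counts = Counter(), Counter()
    for pairs in aligned_pairs:
        ref = [r for r, _ in pairs]
        for i, (r, h) in enumerate(pairs):
            left = ref[i - 1] if i > 0 else "<w>"
            right = ref[i + 1] if i + 1 < len(pairs) else "</w>"
            context_counts[(left, r, right)] += 1
            if r != h:  # deletion (h == '#') or substitution
                rule_counts[(left, r, right, h)] += 1
    return rule_counts, context_counts

def filter_rules(rule_counts, context_counts, min_abs=2, min_rel=0.5):
    """Keep rules that are frequent enough, both in absolute terms and
    relative to how often the rule's context occurs."""
    kept = []
    for (l, r, c, h), n_abs in rule_counts.items():
        n_rel = n_abs / context_counts[(l, r, c)]
        if n_abs >= min_abs and n_rel >= min_rel:
            kept.append(((l, r, c, h), n_abs, round(n_rel, 2)))
    return kept

# Toy alignments for 'lopen' /lo:p@n/: word-final /n/ deleted twice.
alignments = [
    [("l", "l"), ("o:", "o:"), ("p", "p"), ("@", "@"), ("n", "#")],
    [("l", "l"), ("o:", "o:"), ("p", "p"), ("@", "@"), ("n", "#")],
    [("l", "l"), ("o:", "o:"), ("p", "p"), ("@", "@"), ("n", "n")],
]
rules, contexts = extract_rules(alignments)
print(filter_rules(rules, contexts))   # word-final /n/-deletion after /@/
```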

5.2.2.2 Automatic Methods: With Orthography


Above, we described methods to derive phonological transcriptions when no orthographic transcriptions are available. Such methods are often called bottom-up or data-driven methods. In the current subsection we describe methods to obtain phonological transcriptions when orthographic transcriptions are present. In the latter case, top-down information can also be applied.

5.2.2.2.1 With Orthography, Canonical Transcriptions

Probably the simplest way to derive phonological transcriptions in this case is by using canonical transcriptions (see above). Once a phonological (canonical) transcription is obtained for every word, the orthographic representations are simply replaced by the corresponding phonological representations (Binnenpoorte and Cucchiarini 2003; van Bael et al. 2006, 2007).

5.2.2.2.2 With Orthography, Forced Recognition

Words are not always pronounced in the same way, and thus representing all occurrences of a word with the same phonological transcription will result in phonological transcriptions containing errors. The quality of the phonological transcriptions can be improved by modelling pronunciation variation. One way to do this is to use so-called forced recognition. In forced recognition the goal is not to recognize the string of words that was spoken, as in standard ASR; instead, this string of words (the orthographic transcription) has to be known. Given the orthographic transcription, and multiple pronunciations of some words, forced recognition automatically determines which of the pronunciation variants of a word best matches the audio signal corresponding to that word. The words are thus fixed, and for every word the recognizer is forced to choose one of the pronunciation variants of that word, hence the term ‘forced recognition’. The search space can also be represented as a network or lattice of pronunciation variants. The goal then is to find the optimal path in that network, the optimal alignment, by means of the Viterbi algorithm; this is why this procedure is also referred to as ‘Viterbi’ or ‘forced’ alignment. In any case, if hypotheses (pronunciation variants) are present, forced recognition can be used for hypothesis verification, i.e. to find the variant that most closely matches the audio signal. It is important to note that through the use of pronunciation variants it is also possible to model sandhi phenomena and cross-word phenomena such as assimilation or intrusion.
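A minimal sketch of the selection step in forced recognition is given below, under the simplifying assumption that the utterance is short enough to enumerate all variant combinations (a real decoder instead searches the variant lattice with the Viterbi algorithm). The acoustic scorer is deliberately left abstract: the dummy scorer plugged in at the end only makes the demo runnable, whereas a real score function would return the acoustic log likelihood of a Viterbi alignment against trained models.

```python
from itertools import product

# Toy variants (e.g. generated by optional /t/- and /l/-deletion rules).
VARIANTS = {
    "niet": [["n", "i", "t"], ["n", "i"]],
    "als":  [["A", "l", "s"], ["A", "s"]],
}

def forced_recognition(words, audio, score_fn):
    """Enumerate per-word variant combinations and return the combination
    whose concatenated phone string scores best against the audio."""
    best, best_score = None, float("-inf")
    for combo in product(*(VARIANTS[w] for w in words)):
        phones = [p for variant in combo for p in variant]
        score = score_fn(phones, audio)
        if score > best_score:
            best, best_score = combo, score
    return best

# Dummy scorer for demonstration only: prefers shorter phone strings.
# A real score_fn would be the acoustic log likelihood of a Viterbi
# alignment of the phone string with the audio signal.
best = forced_recognition(["niet", "als"], audio=None,
                          score_fn=lambda phones, audio: -len(phones))
print([" ".join(v) for v in best])
```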
We evaluated how well forced recognition performs by comparing its performance to that of human annotators (Wester et al. 2001). Nine experts, who often carried out phonetic transcriptions for their own research, carried out exactly the same task as the computer program: they had to indicate which pronunciation variants of a word best matched the audio signal. The variants of 467 cases were generated by means of 5 frequent phonological rules: /n/-, /r/-, /t/-, /@/-deletion, and /@/-insertion (Booij 1995). In all 467 cases the machine and the human transcribers thus had to determine whether a phone was present or not. The results of these experiments were evaluated in different ways; some of these results are presented here.
Table 5.1 shows how often N of the 9 transcribers agree. For 5 out of 9 this is obviously 100 per cent, but for larger N this percentage drops rapidly, and all 9 experts agree in only 53 per cent of the cases. Note that these results concern decisions on whether a phone was present or not, i.e. insertions and deletions, and not substitutions, where a phone could be substituted by many other phones and thus the number of possibilities is much larger. Determining whether a phone is present or not can be very difficult both for humans and for machines, because very often we are dealing with gradual processes in which phones are neither completely present nor completely absent, and even if a phone is (almost) not present some traces can remain in the context. Furthermore, human listeners could be biased by their knowledge of the language.
Table 5.1 Forced Recognition: majority vote results, i.e. the number of times that N of the 9 transcribers agree

N out of 9    Number of cases (%)
5 out of 9    467 (100%)
6 out of 9    435 (93%)
7 out of 9    385 (82%)
8 out of 9    335 (72%)
9 out of 9    246 (53%)
Figure 5.1 Percentage agreement of the reference transcription compared to the transcriptions made by humans and machine.
We then compared the transcriptions made by humans and machines to the same reference transcription, the majority vote with different degrees of strictness (N out of 9, N = 5–9) mentioned above. The results are shown in Figure 5.1.
For the 246 cases in which all transcribers agree, the percentage agreement between listeners and reference transcription obviously is 100 per cent. If N decreases, the percentage agreement with the reference transcription decreases, both for the judgements of the listeners and for the forced recognition program. Note that the behaviour is similar: the average percentage agreement of the listeners almost runs parallel to the agreement of the ASR system.
We also carried out pairwise comparisons between transcriptions. We obtained inter-listener percentage agreement scores (for Dutch spontaneous speech) in the range of 75–87 per cent (with an average of 82 per cent) (Wester et al. 2001). Similar results were obtained for German spontaneous speech: 79–83 per cent (Kipp et al. 1997; Schiel et al. 1998), and for American English (Switchboard): 72–80 per cent (Greenberg 1999). The ASR-listener pairwise comparisons yielded slightly lower percentage agreement scores: 76–80 per cent (with an average of 78 per cent) for Dutch (Wester et al. 2001), and 72–80 per cent for German (Kipp et al. 1997; Schiel et al. 1998).
Forced recognition appears to perform well: the results are comparable to those of human transcribers, and the percentage agreement scores for the ASR are only slightly lower than those between human annotators. However, note that these results were obtained by comparing the judgements by humans and machine to the judgements by human annotators. If we had based the reference transcription(s) on a combination of the judgements by listeners and by ASR systems, the differences would have been (much) smaller. In any case, forced recognition seems to be a useful technique for hypothesis verification, i.e. for obtaining information regarding the phonological transcription.
Forced recognition, for hypothesis verification, can thus be used in combination with other methods that generate hypotheses. Examples of the latter are phone recognition (see section 2.2.1.2) and rule-based methods (e.g. using phonological rules, such as the five rules mentioned above). A method to obtain information regarding reduction processes is to simply generate (many) variants by making (many) phones optional, and to use forced recognition to select variants. In Kessens et al. (2000) we showed that there is a large overlap between the results of the latter method and those obtained with phone recognition in combination with forced recognition. Both methods are useful for automatically detecting connected speech processes, and it turned out that only about half of these connected speech processes had been described in the literature at the time (Kessens et al. 2000).

5.2.3 Comparing (Combinations of) Methods

In the previous sections we have presented several methods for obtaining phonological transcriptions. The question that arises is which of these methods performs best. In addition, many of these methods can be combined. In the research by van Bael et al. (2006, 2007) several (combinations of) methods were implemented, tested, and compared. Ten automatic procedures were used to generate broad phonetic transcriptions of well-prepared speech (read-aloud texts) and spontaneous speech (telephone dialogues) from the Spoken Dutch Corpus (see Table 5.2). The resulting transcriptions were compared to manually verified phonetic transcriptions from the same corpus. These ten methods are briefly described here (for more information, see van Bael et al. 2006, 2007). In methods 3–10, multiple-pronunciation lexicons were used, and the best variant was chosen by means of forced recognition (in methods 1 and 2 this was not the case).
Table 5.2 Accuracy of the ten transcription methods for read speech and telephone dialogues: percentage of Substitutions (Subs), Deletions (Dels), Insertions (Ins), and percentage disagreement (%dis, the summation of Subs, Dels, and Ins)
Comparison with RT     Read speech                  Telephone dialogues
                       Subs  Dels  Ins   %dis       Subs  Dels  Ins   %dis
CAN-PT                  6.3   1.2   2.6  10.1        9.1   1.1   8.1  18.3
DD-PT                  16.1   7.4   3.6  27.0       26.0  18.0   3.8  47.8
KB-PT                   6.3   3.1   1.5  10.9        9.0   2.5   5.8  17.3
CAN/DD-PT              13.1   2.0   4.8  19.9       21.5   6.2   7.1  34.7
KB/DD-PT               12.8   3.1   3.6  19.5       20.5   7.8   5.4  33.7
[CAN-PT]d               4.8   1.6   1.7   8.1        7.1   3.3   4.2  14.6
[DD-PT]d               15.7   7.4   3.5  26.7       26.0  18.6   3.8  48.3
[KB-PT]d                5.0   3.2   1.2   9.4        7.1   3.5   4.2  14.8
[CAN/DD-PT]d           12.0   2.3   4.3  18.5       20.1   7.2   5.5  32.8
[KB/DD-PT]d            11.6   3.1   3.1  17.8       19.3   9.4   4.5  33.1

5.2.3.1 Canonical Transcription: CAN-PT

The canonical transcriptions (CAN-PTs) were generated through a lexicon look-up procedure. Cross-word assimilation and degemination were not modelled. Canonical transcriptions are easy to obtain, since many corpora feature an orthographic transcription and a canonical lexicon of the words in the corpus.

5.2.3.2 Data-Driven Transcription: DD-PT

The data-driven transcriptions (DD-PTs) were derived from the audio signals through constrained phone recognition: an ASR system segmented and labelled the speech signal using as a language model a 4-gram phonotactic model trained with the reference transcriptions of the development data in order to approximate human transcription behaviour. Transcription experiments with the data in the development set indicated that for both speech styles 4-gram models outperformed 2-gram, 3-gram, 5-gram, and 6-gram models.

5.2.3.3 Knowledge-Based Transcription: KB-PT

We generated so-called knowledge-based transcriptions (KB-PTs) in three steps.
  a. First, a list of 20 prominent phonological processes was compiled from the linguistic literature on the phonology of Dutch (Booij 1995). These processes were implemented as context-dependent rewrite rules modelling both within-word and cross-word contexts in which phones from a CAN-PT can be deleted, inserted, or substituted with another phone.
  b. In the second step, the phonological rewrite rules were ordered and used to generate optional pronunciation variants from the CAN-PTs of the speech chunks. The rules applied to the chunks rather than to the words in isolation, to account for cross-word phenomena (a sketch of this variant-generation step is given after this list).
  c. In the third step of the procedure, chunk-level pronunciation variants were listed. The optimal knowledge-based transcription (KB-PT) was identified through forced recognition.
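A sketch of the variant-generation step (b) follows. The two optional rules, /n/-deletion after schwa and schwa-insertion in /lm/ clusters, are illustrative stand-ins for the twenty processes taken from Booij (1995), and the function names are ours.

```python
def apply_optional_rule(variants, rule):
    """Each rule maps a phone string to a list of alternatives (always
    including the unchanged string); apply it to every variant so far."""
    out = []
    for v in variants:
        for alt in rule(v):
            if alt not in out:
                out.append(alt)
    return out

def n_deletion(phones):
    """/n/ after schwa may optionally be deleted (cf. Booij 1995)."""
    alts = [phones]
    for i in range(1, len(phones)):
        if phones[i] == "n" and phones[i - 1] == "@":
            alts.append(phones[:i] + phones[i + 1:])
    return alts

def schwa_insertion(phones):
    """A schwa may optionally be inserted in an /lm/ cluster ('film')."""
    alts = [phones]
    for i in range(1, len(phones)):
        if phones[i - 1] == "l" and phones[i] == "m":
            alts.append(phones[:i] + ["@"] + phones[i:])
    return alts

chunk = ["f", "I", "l", "m", "@", "n"]        # 'filmen'
variants = [chunk]
for rule in (n_deletion, schwa_insertion):    # ordered rule application
    variants = apply_optional_rule(variants, rule)
for v in variants:                            # four variants in total
    print(" ".join(v))
```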
Methods 4 and 5 are combinations of data-driven transcription (DD-PT) with canonical transcription (CAN-PT) and knowledge-based transcription (KB-PT):

5.2.3.4 Combined CAN-DD Transcription: CAN/DD-PT


5.2.3.5 Combined KB-DD Transcription: KB/DD-PT

For each of these two methods, the variants generated by the two procedures were combined, and the optimal variant was chosen by means of forced recognition.
Methods 1–5 are completely automatic methods—no manual phonological transcriptions are used. However, manual phonological transcriptions may already be available, at least for a (small) subset of the corpus. The question then is whether these manual phonological transcriptions can be used to improve the quality of the automatic phonological transcriptions obtained for the rest of the corpus. A possible way to do this is to align automatic and manual phonological transcriptions for the subset of the corpus, and to use these alignments to train decision trees. Roughly speaking, these decision trees learn the (systematic) differences between manual phonological transcriptions and automatic phonological transcriptions. If the same decision trees are then used to transform the automatic phonological transcriptions of the rest of the corpus, these transformed automatic phonological transcriptions might be closer to the reference transcriptions. We applied these decision trees in each of the five methods described above, thus obtaining five new transcriptions, i.e. methods 6–10. For each of these methods, the decision trees and the automatic phonological transcriptions were used to generate new variants, and the optimal variants were selected by means of forced recognition.
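This correction step can be sketched as follows, with scikit-learn’s DecisionTreeClassifier as an assumed stand-in for the decision trees of van Bael et al. (2006, 2007). Each position in the automatic transcription is represented by the phone and its neighbours, and the tree learns which phone (or gap) the manual transcription has there; the toy alignment mimics the schwa deletion of Table 5.3.

```python
# scikit-learn is an assumed dependency here.
from sklearn.tree import DecisionTreeClassifier

def encode(phone, inventory):
    """Map phone symbols to integer codes (sklearn needs numeric features)."""
    return inventory.setdefault(phone, len(inventory))

def to_features(auto, i, inventory):
    """Represent position i of an automatic PT by (left, current, right)."""
    left = auto[i - 1] if i > 0 else "<w>"
    right = auto[i + 1] if i + 1 < len(auto) else "</w>"
    return [encode(left, inventory), encode(auto[i], inventory),
            encode(right, inventory)]

# Toy aligned subset: in the manual PT the canonical schwa of /@t/ is
# deleted ('#' = gap), as in the [CAN-PT]d example of Table 5.3.
aligned = [(["@", "t", "I", "s"], ["#", "t", "I", "s"]),
           (["@", "t", "O", "x"], ["#", "t", "O", "x"])]

inventory = {}
X = [to_features(auto, i, inventory)
     for auto, _ in aligned for i in range(len(auto))]
y = [m for _, manual in aligned for m in manual]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# For illustration we re-apply the tree to one of the training chunks; in
# practice it is applied to the automatic PTs of the rest of the corpus.
auto = ["@", "t", "I", "s"]
pred = [tree.predict([to_features(auto, i, inventory)])[0]
        for i in range(len(auto))]
print([str(p) for p in pred if p != "#"])   # schwa predicted as deleted
```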
To summarize, the ten methods are:
  1. Canonical transcription: CAN-PT
  2. Data-driven transcription: DD-PT
  3. Knowledge-based transcription: KB-PT
  4. Combined CAN-DD transcription: CAN/DD-PT
  5. Combined KB-DD transcription: KB/DD-PT
6–10 = 1–5 with decision trees
The results are presented in Table 5.2. It can be observed that applying the decision trees improves the results. Therefore, if manual phonological transcriptions are available for part of the corpus, they can be used to improve the automatic phonological transcriptions for the rest of the corpus. And if no manual phonological transcriptions are available, one could consider obtaining such transcriptions for (only a small) part of the corpus.
The best results are obtained for method 6: [CAN-PT]d, a canonical transcription that, through the use of a small sample of manual transcriptions and decision trees, was modelled towards the target transcription. This method does not require the use of an ASR system, only canonical transcriptions obtained by means of a lexicon look-up, some manual phonological transcriptions, and decision trees trained on these manual transcriptions. For these (best) transcriptions, the number and the nature of the remaining disagreements with the reference transcriptions are similar to inter-labeller disagreement values reported in the literature.
Some examples, including those for the best method (i.e. method 6: [CAN-PT]d), are provided in Table 5.3. It can be observed that the decision trees have ‘learned’ some patterns that lead to improvements: the deletion of the schwa (@) of /@t/, the deletion of the ‘l’ in /Als/, and the devoicing of ‘v’ (see the last word). In order to determine what the errors in the transcriptions are, they are aligned with the reference. The number of errors for [CAN-PT]d (i.e. 1 Del and 2 Subs) is much smaller than for CAN-PT (3 Dels and 3 Subs).
Table 5.3 Examples of utterances with different phonological transcriptions (in SAMPA). From top to bottom: orthography, the manually verified phonetic transcription from the Spoken Dutch Corpus used as reference, CAN-PT (method 1) followed by its disagreements with the reference, and [CAN-PT]d (method 6) followed by its disagreements with the reference

Orthography   maar het is niet handig als je nou … verbinding
Reference     mar t Is nid hAnd@x A S@ nA+ … f@-bIndIN
CAN-PT        mar @t Is nit hAnd@x Als j@ nA+ … v@rbIndIN
              D S DD S S (3 Dels and 3 Subs)
[CAN-PT]d     mar t Is nit hAnd@x As j@ nA+ … f@-bIndIN
              S D S (1 Del and 2 Subs)

5.2.4 Optimizing Automatic Phonological Transcriptions

Deriving automatic phonological transcriptions according to the methods described above is usually done by using ASR systems. Since standard ASR systems are primarily intended for recognizing words, for automatic phonological transcription it is necessary to apply the ASR systems in nonstandard, modified ways (as was described above for various methods). For many decades, efforts in ASR research were directed at reducing the word error rate (WER), a measure of the accuracy of ASR systems in recognizing words. If an ASR system is used for deriving automatic phonological transcriptions, one generally takes an ASR system for which the WER is low. However, it is questionable whether the ASR system with the lowest WER is also the best choice for obtaining automatic phonological transcriptions. Given that automatic phonological transcriptions are increasingly used, it is remarkable that relatively little research has been conducted on optimizing automatic phonological transcriptions and on optimizing ASR systems for this purpose. In one of our studies we investigated the effect of changing the properties of the ASR system on the quality of the resulting transcriptions and on the WER (Kessens and Strik 2001).
As a criterion we used the percentage agreement between the automatic phonological transcriptions and reference phonological transcriptions. The study concerned 1,237 instances of the five Dutch phonological rules mentioned above (see section 2.2.2.2): the 467 cases mentioned in section 2.2.2.2, in which the reference phonological transcription was obtained by means of a majority vote procedure, and an extra 770 cases, in which the reference phonological transcription was a consensus transcription. By means of a DP alignment of automatic phonological transcriptions with reference phonological transcriptions, we obtained agreement scores, expressed either as percentage agreement or as kappa. A higher percentage agreement or kappa indicates better transcription quality.
We showed that the relation between WERs and transcription quality is not straightforward (Kessens and Strik 2001, 2004). For instance, using context-dependent HMMs usually leads to lower WERs, but not always to higher-quality transcriptions. In other words, lower WERs do not always guarantee better transcriptions. Therefore, in order to increase the quality of automatic phonological transcriptions, one should not simply take the ASR system with the lowest WER. Instead, specific ASR systems have to be optimized for this task (i.e. to generate optimal automatic phonological transcriptions). Our research made clear that by combining the right properties of an ASR, the resulting automatic phonological transcriptions can be improved. In Kessens and Strik (2001) this was achieved by training the HMMs on read speech (instead of spontaneous speech), by shortening the topology of the HMMs, and by means of pronunciation variation modelling.
Related to the issue above, i.e. which ASR system to use to obtain automatic phonological transcriptions of high quality, is the issue of which phonological transcriptions to use to obtain an ASR system with a low WER. The question is whether higher-quality transcriptions—e.g. manual phonological transcriptions—always yield ASR systems with lower WERs. We used different phonological transcriptions for training ASR systems, measured transcription quality by comparing these transcriptions to a reference phonological transcription, and also measured the WERs of the resulting ASR systems (van Bael et al. 2006, 2007). The phonological transcriptions we used were: a manual phonological transcription, a canonical transcription (APT1), and an improved transcription (APT2) obtained by modelling pronunciation variation. In this case too, no straightforward relation was observed between transcription quality and WER; for example, manual phonological transcriptions do not always yield ASR systems with lower WERs.
The overall conclusion of these experiments is therefore that, since ASR systems with lower WERs do not always yield better phonological transcriptions, and better phonological transcriptions do not always yield lower WERs, if ASR systems are to be used to obtain automatic phonological transcriptions, they should be optimized for this specific task.

5.3 Concluding Remarks

In the previous sections we have discussed the possibilities and advantages offered by automatic methods for phonological annotation. Ceteris paribus, the quality of the transcriptions is likely to be higher for careful (e.g. read) than for sloppy (e.g. spontaneous) speech, and also higher for high-quality audio signals than for lower-quality ones (more noise, distortions, lower sampling frequency, etc.). If there is no orthographic transcription, it will be difficult to obtain phonological transcriptions of high quality automatically, since the output of ASR and phone recognition generally contains a substantial number of errors. If there are orthographic transcriptions, a good possibility might be to use method 6 of section 2.3: obtain some manual transcriptions, use them to train decision trees, and apply these decision trees to the canonical transcriptions. Another good option is to use method 3 of section 2.3: use ‘knowledge’ (e.g. a list of pronunciation variants, or rules for creating them) to generate variants, and apply forced recognition to select the variants that best match the audio signal.
In practice, automatic phonological transcriptions can be used in all research situations in which phonological transcriptions have to be made by one person. Given that an ASR system does not suffer from tiredness or loss of concentration, it could assist the transcriber, who is likely to make mistakes for exactly those reasons. By comparing his/her own transcriptions with those produced by the ASR system, a transcriber could spot possible errors that are due to absent-mindedness.
Furthermore, this kind of comparison could be useful for other reasons. For instance, a transcriber may be biased by his/her own hypotheses and expectations, with obvious consequences for the transcriptions, while the biases in automatic phonological transcription can be controlled. Checking the automatic transcriptions may help discover possible biases in the listener’s data. In addition, automatic phonological transcription can be employed in situations in which more than one transcriber is involved, in order to resolve possible doubts about what was actually realized. It should be noted that using automatic phonological transcription will be less expensive than having an extra transcriber carry out the same task.
Automatic phonological transcription could also play a useful role within the framework of agile corpus creation as proposed by Voormann and Gut (2008; see also the chapter on corpus design in this volume). Agile corpus creation advocates the adoption of a query-driven approach that envisages small, rapid iterations of the various cycles in corpus creation (querying, annotation schema development, corpus annotation, and corpus analysis) to enhance the quality of corpora. In this approach, automatic phonological transcription can be employed in a step-by-step bootstrap procedure as proposed by Binnenpoorte (2006), so that improved automatic phonological transcriptions are obtained after each step.
Finally, we would like to reiterate the clear advantage of using automatic phonological transcription when it comes to transcribing large amounts of speech data that otherwise would probably remain unexplored.

Appendix: Phonetic Transcription Tools

Below a list of some (pointers to) phonetic transcription tools is provided. Since much more is available for English than for other languages, we first list the tools for English, and then the tools for other languages.

English

Other Languages

  • http://mickey.ifp.illinois.edu/speechWiki/index.php?title=Phonetic_Transcription_Tool&oldid=3011
    • This is a tool that maps strings of letters (words) to their phonetic transcriptions via a hidden Markov model. It can also give phonetic transcriptions for partial words or words not in a dictionary. If a transcription dictionary is provided, the tool can align letters with their corresponding phones. It has been trained on American English pronunciations, but models for other languages can also be created.
  • http://tom.brondsted.dk/text2phoneme/
    • Tom Brøndsted: Phonemic transcription. An automated phonetic/phonemic transcriber supporting English, German, and Danish. Outputs transcriptions in the International Phonetic Alphabet (IPA) or the SAMPA alphabet designed for speech recognition technology.
  • http://ilk.uvt.nl/g2p-www-demo.html (last accessed 26/06/2011)
    • The TreeTalk demo converts Dutch or English words to their phonetic transcription in the SAMPA (Dutch) or DISC (English) phonetic alphabet, and also generates speech audio.
  • http://hstrik.ruhosting.nl/tqe/
    • Automatic Transcription Quality Evaluation (TQE) tool. Input is a corpus with audio files and phone transcriptions (PTs). Audio and PTs are aligned, phone boundaries are derived, and for each segment-phone combination it is determined how well they fit together: for each phone a TQE measure (a confidence measure) is determined, e.g. ranging from 0 to 100 per cent, indicating how good the fit is, i.e. what the quality of the phone transcription is.
  • http://www.fon.hum.uva.nl/praat/
    • Praat: doing phonetics by computer.
  • http://latlcui.unige.ch/phonetique/
    • EasyAlign: a friendly automatic phonetic alignment tool under Praat.
  • http://korpling.german.hu-berlin.de/~amir/phon.php
    • Automatic Phonetic Transcription and Syllable Analysis for German and Polish.
  • http://www.webalice.it/sandro.carnevali2011/indice.htm
    • Program for IPA phonetic transcription of Italian, Japanese and English.
  • http://www.ipanow.com/
    • PhoneticSoft automatically transcribes Latin, Italian, German and French texts into IPA symbols.
  • http://billposer.org/Software/earm2ipa.html
    • This program translates Armenian in UTF-8 Unicode to the International Phonetic Alphabet, assuming that the dialect represented is Eastern Armenian.