Kernerman Dictionary News • Number 15 • July 2007

Phonetics of EFL Dictionary Definitions

Włodzimierz Sobkowiak



Poznań: Wydawnictwo Poznańskie. 2006

249 pp.

ISBN 83-7177-450-8


The aim of Phonetics of EFL Dictionary Definitions is to provide lexicographers with phonetically based insights into their choice of words in dictionary definitions, so that these definitions can be more easily understood by second language learners.


The book concentrates on a rather neglected area of lexicography, namely, the application of phonetic principles to dictionary writing. Why is this type of research relatively rare? The answer becomes clear when one considers the breadth of knowledge needed, across disparate and, to many, largely inaccessible areas, to tackle this issue. The areas of expertise include phonetics, computational linguistics, statistics, corpus linguistics, contrastive phonology, and natural language processing (NLP). It is unusual to find one person who can enter such a large arena and handle such immense diversity. Sobkowiak is an exception. His knowledge of all of these spheres is impressive, and his ability to integrate these outlying strands into one woven piece of lexicographic cloth is indeed admirable. In fact, Sobkowiak’s work over recent years amounts to a nearly one-man crusade for the inclusion of phonetic analyses in lexicographic research (Sobkowiak 2002, 2003, 2004).


The same qualities that make this work truly notable, namely its breadth and its attention to detailed analyses, also present the major obstacles to its wider acceptance and fuller understanding. To comprehend the book, you must be familiar with concepts ranging from statistical frequency analyses and N-grams in corpus linguistics to more esoteric and specialized notions in phonetics. Concepts such as phonological interference and sandhi phenomena, as well as the various phonetic terms used in the book (e.g. devoicing, syllabic sonorants, palatalization, overnasalization, to name just a few), may be rather obscure to someone from a purely lexicographic background. Meanwhile, the detailed tables and charts sometimes bog the reader down in so many intricacies that one is often left looking for the forest while navigating the many trees.


One helpful addition would be a glossary of terms. This is particularly important for the many abbreviations that are used throughout. It would also help to pinpoint small editing problems, such as the use of the abbreviation POS (‘parts of speech’) in Table 7 [p.31] and COS (presumably ‘categories of speech’) in Table 10 [p.44] for the same concept. A glossary would also help unite the disparate areas needed for understanding this material.


If one is inclined to think that this area has been left largely untreated simply because of its relative unimportance, then a re-examination is clearly in order. The historical lack of attention to the most basic element of reading – sound-to-meaning correspondence – is an oversight in current dictionary design that should not be taken lightly. After all, if you cannot read or understand a definition, then why have the definition in the first place?


From a pedagogical standpoint, what can be learned from definitions? As Sobkowiak notes, incidental learning of vocabulary is well-known, but attention paid to learning from dictionary definitions is a rather neglected area of vocabulary-acquisition research: “During definition reading and processing by learners, incidental learning can occur, just like in any other reading activity…however [], I could find no research devoted to definition reading itself.” [p.78]. If definitions can be improved so that sub-vocal reading is made easier (presumably leading to greater understanding of the definition), this would clearly be an improvement in dictionary development.


The book is a collection of several large-scale studies, compacted into one overall treatise. It provides a multitude of in-depth research programs that include:


  1. An analysis of grapho-phonemic problems and inter-lingual phonological interference patterns encountered by Polish speakers learning English. 
  2. The development of a scale of the “Phonetic Difficulty Index” (PDI) – a coded metric of how difficult an English word would be for native Polish speakers to pronounce, based on the above analyses, and its application by algorithmic assignment to each entry of a reference wordlist database (a machine-readable version of the OALD wordlist).
  3. Detailed general language and phonetic modeling, including an impressive array of statistical analyses, to act as baselines for comparing to dictionary-specific content.
  4. Detailed empirical investigations of the PDI metric, used for measuring the inherent phonologically-related difficulty of the following dictionary content:
    1. the defining vocabularies (DV) of four leading EFL dictionaries (LDOCE, OALD, CALD, and MEDAL)
    2. the definitions of the MEDAL dictionary
    3. 100-word samples of definitions from five EFL dictionaries (LDOCE, OALD, CALD, MEDAL, and COBUILD).


The basic findings indicate that these major dictionaries do not differ significantly from one another in terms of the PDIs of their defining vocabularies and definitions. Thus, no dictionary is ‘phonetically harder’ than any other. The question is, however, whether some improvements could be made to make the dictionaries ‘phonetically easier’, and on what basis.


The comprehensive statistical analysis of MEDAL shows some differences in comparison to a reference lexicon. Some could be explained by the choice of DV or the use of particular definition-specific words that boost the incidence of hard-to-pronounce phonemes. Sobkowiak points out that dictionary writers and editors could judiciously choose DV items or particular words in the definitions, keeping in mind the PDI metric. For example, the word ‘whether’, with the medial /ð/ (‘th’) sound that is hard for Polish speakers to pronounce, could be replaced by the easier-to-pronounce ‘if’, while providing similar functionality in the definition (e.g. in the definition of screen: “to decide whether someone is suitable” vs. “to decide if someone is suitable”, p. 90). Sobkowiak analyses dictionary microstructure and provides other such phonolapsology-based suggestions for making dictionary definitions easier for Polish learners.
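
As a rough illustration of how a word-based difficulty score could be used to compare two wordings of a definition, the following Python sketch counts assumed hard phonemes in each word and sums them over the definition. The set of hard phonemes, the tiny lexicon, its rough transcriptions, and the function names are all hypothetical and chosen only for this example; they are not the PDI codes, values, or algorithm used in the book.

```python
# Toy word-based difficulty score, loosely inspired by the idea of a PDI.
# HYPOTHETICAL: the feature set and lexicon below are illustrative
# assumptions, not the codes or values used in the book.

# Assumed set of phonemes that are hard for the learner group.
HARD_PHONEMES = {"ð", "θ", "ŋ", "æ"}

# Tiny hand-made lexicon with rough phonemic transcriptions (assumed).
LEXICON = {
    "to": ["t", "uː"],
    "decide": ["d", "ɪ", "s", "aɪ", "d"],
    "whether": ["w", "e", "ð", "ə"],
    "if": ["ɪ", "f"],
    "someone": ["s", "ʌ", "m", "w", "ʌ", "n"],
    "is": ["ɪ", "z"],
    "suitable": ["s", "uː", "t", "ə", "b", "l"],
}


def word_score(word: str) -> int:
    """Count assumed hard phonemes in one word (0 for unknown words)."""
    phonemes = LEXICON.get(word.lower(), [])
    return sum(1 for p in phonemes if p in HARD_PHONEMES)


def definition_score(text: str) -> int:
    """Sum the word scores; word-based, so across-word effects are ignored."""
    return sum(word_score(w) for w in text.split())


original = "to decide whether someone is suitable"
revised = "to decide if someone is suitable"
print(definition_score(original), definition_score(revised))
```

Under these assumptions the wording with ‘if’ scores lower than the wording with ‘whether’, which is the kind of comparison a definition writer might make when choosing between near-equivalent words.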


Another major finding is that the PDI metric, being word-based, does not capture across-word phenomena that are evident when words occur in various contexts. Having been involved in the application of phonetics to Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) for several decades, I can attest to the fact that this is not a trivial issue. Capturing contextually-variable coarticulation and vowel-reduction effects is a major obstacle in creating accurate acoustic models for speech recognition engines. Adequate across-word modeling, including intonation and other suprasegmental factors, is at the basis of providing natural sounding synthetic speech in TTS.


In the ELT sphere, it is apparent that ‘a word spoken in isolation’ is only the beginning of pronunciation learning: the real test is whether the word can be intelligibly spoken by the non-native speaker in varying contexts, with proper stress, using varying intonation and dynamic syntactic patterning. As a former teacher-trainer involved in the technological hunt for the ultimate ‘teacher-free’ automatic program for teaching pronunciation to foreign learners, I can testify that this test is highly complex and reflects the intricate phonetic inter-dependencies that occur in the production of variable speech. The current international craze for accent reduction programs and the high attention paid to across-word contextual phenomena indicate that aiding such pronunciation problems addresses a real need; the success rate of such programs shows that even human teachers (not only automatic instruction programs) find these difficult to teach adequately.


In this regard, Sobkowiak must be commended for his academic honesty in outlining such problems with the proposed PDI and the influence they may have had on some of the results. However, nobody has yet produced a perfect metric the first time around, and this is where subsequent studies have their work cut out. It must also be noted that there are many possible applications for such a metric if it could be perfected, notably in the field of linguistic resource development for speech applications involving foreign accents, currently a pressing problem for ASR. Procedures for collecting databases that are relevant for speech recognition simply do not take into account difficulty of pronunciation. ASR databases typically record hundreds, if not thousands, of speakers using prompt sheets that include linguistically designed material that covers phoneme variability related to contextual factors (e.g. the phonetically balanced sentences in the TIMIT database, Fisher et al. 1986). Such collections of foreign speakers of English are difficult to create, since non-native speakers find it hard to read aloud the required material that must be recorded to create phoneme models (e.g. the ‘Orientel’ collection for several types of Arabic-accented English or French speech; Zitouni et al. 2002, Siemund et al. 2002).


Looking into foreseeable dictionary development, one can surmise that in the not-too-distant future it may be possible to have ‘read-aloud’ programs packaged into the dictionary itself, using real speech recordings or natural sounding TTS, for aiding the second-language learner to read dictionary definitions. Until such time, however, users must still sub-vocalize, read, and understand these definitions. This research indicates that some type of ‘phonetic control’ can be accomplished to make the task easier, without impacting on other important lexicographic needs.


What can now be studied is the actual degradation of vocabulary learning that would presumably take place if very difficult phonetic material (based on the PDI metric) were used in the dictionary. Subsequent studies could model vocabulary learning difficulties based on PDI challenges, both within dictionary definitions and elsewhere, and, of course, for speakers of other languages learning English. It is hoped that in the future more attention will be paid to researching dictionary usage and effectiveness of definitions in terms of phonetic factors.



Fisher et al. (1986). W. Fisher, G. Doddington, K. Goudie-Marshall, The DARPA Speech Recognition Research Database: Specifications and Status (TIMIT database), in Proceedings of DARPA Workshop on Speech Recognition. 93-99.

Siemund et al. (2002). R. Siemund, D. Iskra, H. van den Heuvel, O. Gedge, S. Shammass, Multilingual Access to Interactive Communication Services for the Mediterranean and the Middle East, in Specification of Validation Criteria, Deliverable D6.2.

Sobkowiak (2002). W. Sobkowiak, Phonetic keywords in learner’s dictionaries, in U. Heid et al. (eds.), Euralex 2000 Proceedings. Stuttgart: IMS, 1.237-246.

Sobkowiak (2003). W. Sobkowiak, Pronunciation in Macmillan English Dictionary for Advanced Learners on CD-ROM, International Journal of Lexicography, 16.4.423-441.

Sobkowiak (2004). W. Sobkowiak, Phonetic keywords in EFL dictionaries revisited: MED, in M.C. Campoy and P. Safont (eds.), Computer-Mediated Lexicography in the Foreign Language Learning Context, Colleccio Estudis Filologics, 18. Castello: Universitat Jaume I, 123-32.

Zitouni et al. (2002). I. Zitouni, J. Olive, D. Iskra, C. Choukri, O. Emam, O. Gedge, E. Maragoudakis, H. Tropft, A. Moreno, A. N. Rodriguez, B. Houft, R. Siemund, Orientel: Speech-Based Interactive Communication Applications for the Mediterranean and Middle East, ICSLP-2002, 325-328.



Shaunie Shammass

K Dictionaries