Kernerman Dictionary News • Number 10 • July 2002


The Benefits of a Product-Independent Lexical Database with Formal Word Features

Janneke Froon and Franciska de Jong

Dictionaries can be used as a basis for lexicon development for NLP applications. However, it often takes a lot of pre-processing before they are usable. In the last 5 years a product-independent database of formal word features has been developed on the basis of the Van Dale dictionaries for Dutch. The database has proven to be useful in various NLP applications. This paper describes the history, some advantages and the constraints in the development of this database.

1. Introduction
Traditional dictionaries are an obvious starting point for constructing lexicons for NLP applications. Dictionaries can be deployed both in end-user applications, such as spelling-correction tools, and in development tools, for instance phonological lexicons for automatic speech recognition.

Several attempts to apply machine-readable dictionaries have been reported in the literature, for instance Boguraev and Briscoe (1989), Binot and Jensen (1993), Braden-Harder (1993) and Wilks et al (1996). The focus of this work is mostly on the application of semantic information in the dictionaries. Semantic information is only one type of possible information in dictionaries. They can also be used to derive formal word features like hyphenation, pronunciation, word structure and inflection.

This paper describes the development of a lexical database on the basis of the Van Dale dictionaries for Dutch containing formal word features. Reusability of the data has been a major goal while developing this database. Reusability has always been an important concept in the development of lexical databases (cf. Calzolari 1990). For the Van Dale publishing house this concept is important since the information in the database is meant to be used as a source for various dictionaries and for the development of other applications such as text-to-speech systems.

Rather than adapting the dictionary resources to particular applications, a resource is created from which the content needed to realize new products and applications can be extracted. This paper describes the history of the database, the advantages that the database proved to have, and the limitations of the development.

2. History
Van Dale Lexicografie is a major lexicographic publisher in the Netherlands. The Van Dale dictionaries are commonly considered to be the most authoritative dictionaries in the Dutch-speaking community. The development of a database using the dictionary files of Van Dale has been a gradual process. In this section, the background of this process is sketched.

2.1 Twenty years ago
Until the early 1980s, paper dictionaries published by Van Dale each had their own author. The authors were responsible for the contents of the dictionary. The role of the editors was to check the data for textual correctness and ensure that the books got printed and sold. Since each author was responsible only for one dictionary, the contents of the dictionaries were not related, causing unintended differences. At best, the same implicit lexical standards were adhered to by all authors.

2.2 Fifteen years ago
The situation changed during the mid-80s. A new series of bilingual dictionaries from Dutch to three languages (English, French and German) was derived from the same dictionary file of common Dutch words. The potential advantages were immediately recognized by the NLP-community and attempts were made to use this high quality source of data for the development of NLP-systems, e.g. in the machine translation project Rosetta (Rosetta 1994).

However, the in-principle ideal situation didn't last long. The different dictionaries got different authors who were again responsible for their own dictionary only. Although editors checked the material that the authors produced, the dictionaries started to drift apart and therefore many defects were reintroduced, such as inconsistencies of products and the fact that products were not taking advantage of each other's corrections. As a result of the independent editing of the dictionaries the same work was sometimes done more than once and some work could never be done, because it was too expensive for a single dictionary and it was impossible to combine the various efforts.

2.3 Seven years ago
In 1995 the Dutch government changed the rules of spelling. For Van Dale this implied a spelling adaptation of Dutch words in about 50 books, containing 6 million words. This had to be done in a very short period, since users would have stopped buying if new editions were expected soon. Besides the time pressure, the job to be done caused a problem too. The editors of dictionaries may be experts in the lexicographic area, but not in the new spelling regulation that had to be adhered to and internalized.

These new circumstances forced the publisher to adopt a new working method: creating a product-independent spelling database. For this aim, it was still necessary to look at the spelling of about half a million different words. However, at least, this had to be done only once, and not over and over again for all books.

The use of such a product-independent database of spellings proved to have many advantages and was soon followed by other product-independent databases, such as one with hyphenation information and another with pronunciation information. These databases were integrated into a single database with formal word features.

The database had to overcome problems like those described in Quazza and Van den Heuvel (2000), as phonemic information in dictionaries has a limited usability because it is available only for exceptional words and base words, not for all related words.

2.4 Current applications
The Van Dale database with formal word features has been used in various applications. The Dutch text-to-speech system Fluency[1] uses the phonemic transcriptions in the database. The speech synthesis of Fluency has been used in several of Van Dale's electronic dictionaries and in the Fluency e-mail-reader, a tool which automatically announces and reads aloud e-mail messages[2]. Furthermore, the database is the basis of the Van Dale Spellingcorrector (VDS 2000), a spell-checker for Dutch. These two applications are examples of end-user applications.

Another type of use that illustrates the importance of product-independent databases for the NLP research community is the application in development projects. For instance, the Druid project[3] and the ECHO project[4] use the pronunciation information of the database to build an acoustic model for a system for Dutch speech recognition. This speech recognition module is meant to play a role in the development of technology for spoken document retrieval, particularly in video retrieval (see Ordelman et al. 1999).

A similar product-independent approach has been used in the development of the VLIS database[5] from an earlier database which contains semantic word features. This semantic database has been used in the Dutch version of EuroWordNet (see Vossen et al. 1999), and is now available under license for commercial use. In cross-language retrieval tools the semantic database has proven to be valuable, in particular in the development of the disambiguation method applied in the Twenty-One search engine, which has been evaluated at several TREC-conferences[6] (cf. Hiemstra and Kraaij 1999, Hiemstra and De Jong 1999).

3. Advantages
The database with formal word features proved to have many advantages, three of which are illustrated here. Firstly, consistency across products is easily attainable. Secondly, the information in the database is richly encoded. Finally, the information is flexible. As the illustrations below underline, these aspects are beneficial for the production of book dictionaries, for the development of NLP-products targeting the end-user market, and for the level of support for NLP research teams.

3.1 Consistency of products
For Van Dale it is important that the products are consistent. As explained above, Van Dale is an authority in the field of lexicographic information. Therefore, its credibility and authority can be damaged if different products manifest different information. If, for instance, the 'Groot Woordenboek der Nederlandse Taal' (large dictionary of the Dutch language, Geerts and Den Boon, 1999) contradicts the dictionary 'Hedendaags Nederlands' (contemporary Dutch, Van Sterkenburg, 1996) in the pronunciation of a word, the user of these dictionaries cannot rely on the information anymore.

Word information like hyphenation and pronunciation is sometimes difficult to describe and causes differences. Even spelling, which seems to be strictly regulated and therefore unambiguous, has many uncertainties. For instance, the use of hyphens in words like on-line-verbinding (on line connection) is not unambiguously prescribed.

The consistency of dictionaries is guaranteed if the information is drawn from the same source every time a new dictionary file is assembled. The dictionary doesn't contain the information itself, but only a dynamic link to the information that is in the central database. When a new edition is prepared, information from the central database is imported into the dictionary. The imported information cannot be edited in the product file itself. If changes are needed, for instance because errors are found, they have to be stored in the central database. All dictionaries will profit from the corrections.

Not only do book dictionaries thereby become consistent, but so do all applications derived from the database. New insights are shared in every product. The overall quality of the products can reach a higher level.
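The linking mechanism described in this section can be pictured with a minimal sketch. It is not Van Dale's actual system: the names central_db, ProductEntry and assemble, the word id "w1" and the two product files are all hypothetical. The only point is that a product entry stores a link rather than a copy, so one correction in the central store reaches every product at the next assembly.

```python
from dataclasses import dataclass

# Hypothetical central database: word id -> formal word features.
# The spelling of "w1" deliberately contains an error to be corrected later.
central_db = {
    "w1": {"spelling": "kontakt"},
    "w2": {"spelling": "ademen"},
}


@dataclass(frozen=True)
class ProductEntry:
    word_id: str  # link into the central database, not a copy of its data
    note: str     # product-specific content, edited per dictionary


def assemble(entries, db):
    """Build a product file by resolving every link against the central database."""
    return [{**db[e.word_id], "note": e.note} for e in entries]


# Two hypothetical products that both link to the same word.
large = [ProductEntry("w1", "entry in a large dictionary")]
pocket = [ProductEntry("w1", "entry in a pocket dictionary")]

before = [assemble(p, central_db)[0]["spelling"] for p in (large, pocket)]

# The error is corrected once, in the central database only.
central_db["w1"]["spelling"] = "contact"

after = [assemble(p, central_db)[0]["spelling"] for p in (large, pocket)]
print(before, after)
```

Because the product entries are frozen and hold no formal features themselves, there is no place in a product file where the imported information could be edited, which mirrors the constraint described above.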

3.2 Richness of data
The second advantage is the richness of the database. A product-independent database will tend to represent data on a more abstract level than when the data are assembled for a special product, thereby resulting in a richer resource. The most important reason is that while working on the database, it is often not clear at first which information will be needed in which product. It is not desirable to leave information out just because it is not needed at the moment when the database is constructed.

The next two examples illustrate the benefits of rich codes in phonemic representations and in hyphenation marks.

The first example is the representation of underlying phonemes while representing pronunciation. In the pronunciation of bezettoon (busy signal) only one t is heard. The t of bezet (busy) disappears because of degemination with the t in toon (signal). If the representation is needed for the phonemic transcription in a dictionary, one t will do. If the representation is used to synthesize the pronunciation, a single t will sound unnatural, and the presence of a second t has to be indicated. A code indicating such a special t can cause the dictionary generator to delete it, while causing the speech synthesis tool to pronounce the t's in a special way. The same code can be used in different products.
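A small sketch may make the idea of one code serving two products concrete. The article does not give the actual code used in the database, so the marker "*" is an assumption, and plain letters stand in for phonemes here; the string is not a real phonemic transcription.

```python
DEGEMINATED = "*"  # hypothetical marker placed after a segment lost to degemination


def for_dictionary(coded: str) -> str:
    """Dictionary view: delete the marked segment, since one t will do in print."""
    out = []
    for ch in coded:
        if ch == DEGEMINATED:
            out.pop()  # drop the segment that precedes the marker
        else:
            out.append(ch)
    return "".join(out)


def for_synthesis(coded: str):
    """Synthesis view: keep every segment and flag the marked one for special treatment."""
    segments = []
    for ch in coded:
        if ch == DEGEMINATED:
            symbol, _ = segments[-1]
            segments[-1] = (symbol, True)
        else:
            segments.append((ch, False))
    return segments


coded = "bezet*toon"  # letters as stand-ins for phonemes; the first t is marked
print(for_dictionary(coded))
print(for_synthesis(coded))
```

The stored representation is the same in both cases; only the generator that reads it decides what the marker means.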

A second example originates from the hyphenation of words. An investigation of hyphenation for Dutch showed that it was better to indicate syllable boundaries instead of hyphenation, although syllable boundaries coincide often - but not always - with hyphenation positions. The reason is that there is a Dutch hyphenation rule that prohibits hyphenating on a position that would cause a syllable of one letter to be separate from the rest of the word on a new line. So radi-o (id.) and a-demen (breathe) are not allowed. The rule also applies when a single-letter-syllable is separated from the rest of a compounding part or derivational part. Therefore radi-otoestel (radio+toestel: radio set) and bea-demen (be+ademen: insufflate) are not allowed. Lexicographers who enter the hyphenation marks serve two aims: firstly, they have to indicate the syllable boundaries; secondly, they have to check whether a single-letter-syllable will be created. In the encoding, this distinction has to be kept. For ademen this causes the encoding a:de-men, where a colon indicates a syllable boundary that doesn't coincide with a hyphenation position.
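The two-level encoding can be read back in different ways, as the following sketch suggests. It assumes only the convention stated above ("-" for a syllable boundary that is also a hyphenation position, ":" for a syllable boundary that is not); the function names are hypothetical, and the encoding "ra-di:o" is inferred from the rule rather than taken from the database.

```python
import re


def syllables(coded: str) -> list[str]:
    """All syllables: both '-' and ':' mark a syllable boundary."""
    return re.split(r"[-:]", coded)


def hyphenated(coded: str) -> str:
    """Hyphenation view: only '-' survives, ':' boundaries are dropped."""
    return coded.replace(":", "")


def line_breaks(coded: str) -> list[tuple[str, str]]:
    """Every permitted way to split the word across two lines."""
    parts = hyphenated(coded).split("-")
    return [
        ("".join(parts[:i]) + "-", "".join(parts[i:]))
        for i in range(1, len(parts))
    ]


print(syllables("a:de-men"))    # syllable view
print(hyphenated("a:de-men"))   # hyphenation view
print(line_breaks("ra-di:o"))   # assumed encoding for radio
```

Storing the richer syllable encoding keeps both views derivable; storing only the hyphenation positions would lose the syllable structure.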

In conclusion, every stage in the production of the word information should be represented. If every step of the thinking process is explicitly encoded, it is possible to correct the result without having to recall what was going on. Besides, a rich representation has an advantage in itself. The maintainability of the database greatly improves if rich representations are being exploited. Using rich codes, it is possible to infer which processes are responsible for the formation of, for instance, pronunciation or hyphenation. By checking the soundness of these processes, the quality of the data can be improved.

3.3 Flexibility of data
The third advantage of a database is the flexibility of the data. When needed for new products, the information in the databases is readily available, and because of the richness of the data there will be no obstacles in adapting it to a new product. The database therefore makes it possible to do things with the available data that would otherwise be unattainable.

Gibbon (2000) points out that phonemic transcripts from machine-readable dictionaries require "extensive pre-processing" before they can be used in system lexicons. However, in the Van Dale database the phonemic information is simply there, readily applicable in a variety of products.

An example is the use of phonemic information in the Van Dale Spellingcorrector, a spell-checker for Dutch which benefits from this information in two different ways. The first is the use of phonemic information in assembling a list of predicted errors for the spell-checker. This list is used to detect the listed spelling errors quickly and to correct them properly. Many predictions about errors can be made on the basis of problematic spelling patterns. For instance, the c in Dutch is often confused with k, resulting in well-known errors like kontakt for contact and aktie for actie. A large group of errors is caused by writers staying too close to modern pronunciation, disregarding historical aspects of the spelling of certain words. For instance, the b in ambtenaar is often incorrectly omitted, because it cannot be heard. Another example is the word quitte, whose spelling is very different from its pronunciation /kit/, resulting in the erroneous spelling kiet. These spelling errors can be predicted if phonemic representations are used. A whole class of plausible errors can thus be incorporated in the spell-checker that would be beyond reach if phonological information were not available.
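One way to picture the prediction of such errors is to spell each word "as it sounds" and record the result whenever it differs from the correct spelling. The sketch below is an illustration only, not the method used in the Van Dale Spellingcorrector: the phoneme-to-letter table P2G is a hypothetical, heavily simplified mapping, and the transcription given for contact is a simplified illustrative one rather than a value taken from the database.

```python
# Hypothetical, minimal phoneme-to-spelling table (assumption, not from the database).
P2G = {"k": "k", "i": "ie", "t": "t", "ɔ": "o", "n": "n", "ɑ": "a"}

# Correct spelling -> phoneme sequence. /kit/ for quitte is given in the text;
# the transcription of contact is a simplified, illustrative one.
LEXICON = {
    "quitte": ["k", "i", "t"],
    "contact": ["k", "ɔ", "n", "t", "ɑ", "k", "t"],
}


def naive_spelling(phonemes, p2g):
    """Spell a word the way a writer might who stays close to the pronunciation."""
    return "".join(p2g[p] for p in phonemes)


def predicted_errors(lexicon, p2g):
    """Map each predicted misspelling to the correct spelling it should be replaced by."""
    errors = {}
    for word, phonemes in lexicon.items():
        guess = naive_spelling(phonemes, p2g)
        # Skip guesses that are correct or that happen to be other valid words.
        if guess != word and guess not in lexicon:
            errors[guess] = word
    return errors


print(predicted_errors(LEXICON, P2G))
```

A real table would need several spellings per phoneme and context-dependent rules; the sketch only shows why pronunciation data makes such predictions possible at all.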

The second way pronunciation information is used in the Van Dale Spellingcorrector is in finding homophones. Homophones are words that are pronounced the same but have different spellings. Examples in Dutch are biljart (game of billiards) and biljard (number, thousand billion), boxer (type of dog) and bokser (someone who boxes), and in English discrete (separate) and discreet (tactful). These words cause problems, because writers tend to mix them up, writing for instance biljart when the number is meant. A spell-checker can be improved if attention is paid to the difficulties with homophones, by using the pronunciation information in the databases.
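Finding homophones in a lexicon with pronunciation information amounts to grouping spellings by pronunciation. The following sketch shows that grouping under simple assumptions: the function name is hypothetical, and the pronunciation keys "P1" to "P3" are opaque placeholders standing in for real phonemic transcriptions, which are not reproduced here.

```python
from collections import defaultdict

# Spelling -> pronunciation key. The keys are placeholders, not real transcriptions.
LEXICON = {
    "biljart": "P1",
    "biljard": "P1",
    "boxer": "P2",
    "bokser": "P2",
    "contact": "P3",
}


def homophone_sets(lexicon):
    """Group spellings that share a pronunciation; keep groups with more than one member."""
    by_pronunciation = defaultdict(list)
    for spelling, pronunciation in lexicon.items():
        by_pronunciation[pronunciation].append(spelling)
    groups = [sorted(g) for g in by_pronunciation.values() if len(g) > 1]
    return sorted(groups)


print(homophone_sets(LEXICON))
```

A spell-checker could use such groups to draw the writer's attention to a word whenever another member of the same group also exists.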

Owing in part to the use of pronunciation information in these various ways, the Van Dale Spellingcorrector can compete with the spell-checkers for Dutch that are provided with word processors. Without the database, pronunciation information would have been out of reach because of the high costs. The database provides an affordable opportunity to incorporate this valuable information source into a spell-checker.

4. Constraints
Although the development of a multi-purpose database has many advantages, it has a price in both time and money. If information has only one purpose and can be used in one product only, the cost effectiveness is not optimal. However, every time the information is reused, the return on investment potentially increases and for some applications the use of a product-independent database may be the only source of data that is affordable.

For a company, making a profit is crucial, and it is tempting to favor short-term returns. The development of a multi-purpose database undoubtedly has advantages, but mainly in the long run. There is thus always the risk that developments are stopped for economic reasons just before the end-goal is reached, because the remaining work isn't profitable enough. Collaboration with non-profit institutions, such as NLP-research groups with research capacity and/or knowledge, can then be an incentive for sustained resource development.

5. Conclusion
Building a multi-purpose database for formal word features in Dutch, or any other language, is a difficult and expensive task. However, the more the information is used in applications, the cheaper the information gets per application. The product-independence of the database pays for itself in the long run.

The advantages of such a product-independent database are indisputable. The information in the dictionaries is easily available, more consistent and rich, which benefits any application using the database. However, endurance is demanded of the developing companies to make these advantages commercially viable.

5 Vlis is the Van Dale lexicographic information system and semantic network, in existence since 1992
6 The Twenty-One search engine is distributed in the Netherlands by Irion Technologies

Binot J. and K. Jensen. 1993. 'A Semantic Expert Using an Online Standard Dictionary.' In Natural Language Processing: The PLNLP Approach, K. Jensen et al. (ed.). Dordrecht: Kluwer Academic Publishers, pp. 135-149.
Boguraev B. and E. Briscoe (ed.). 1989. Computational Lexicography for Natural Language Processing. Harlow: Longman Group UK.
Braden-Harder L. 1993. 'Sense Disambiguation Using Online Dictionaries.' In Natural Language Processing: The PLNLP Approach, K. Jensen et al. (ed.). Dordrecht: Kluwer Academic Publishers, pp. 247-263.
Calzolari N. 1990. 'Lexical Databases and Textual Corpora: Perspectives of Integration for a Lexical Knowledge-Base.' In Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, U. Zernik (ed.). Hillsdale, NJ: Lawrence Erlbaum, pp. 191-208.
Geerts G. and Den Boon C. 1999. Van Dale Groot Woordenboek der Nederlandse Taal. Utrecht: Van Dale Lexicografie.
Gibbon D. 2000. 'Computational Lexicography.' In Lexicon Development for Speech and Language Processing, F. Van Eynde and D. Gibbon (ed.). Dordrecht: Kluwer Academic Publishers, pp. 1-42.
Hiemstra D. and De Jong F. 1999. 'Disambiguation Strategies for Cross-language Information Retrieval.' In Proceedings of the third European Conference on Research and Advanced Technology for Digital Libraries: ECDL'99. Heidelberg: Springer-Verlag, pp. 274-293.
Hiemstra D. and W. Kraaij. 1999. 'Twenty-One at TREC-7: Ad-hoc and Cross-Language Track.' In Proceedings of the seventh Text Retrieval Conference TREC-7. NIST Special Publication 500-242, pp. 227-238.
Ordelman R., A. Van Hessen and D. Van Leeuwen. 1999. 'Dealing with Phrase Level Coarticulation (PLC) in Speech Recognition: A First Approach.' In Proceedings of the ESCA ETRW Workshop on Accessing Information in Spoken Audio. Cambridge: Cambridge University Press, pp. 64-68.
Quazza S. and H. Van den Heuvel. 2000. 'The Use of Lexica in Text-to-Speech Systems.' In Lexicon Development for Speech and Language Processing, F. Van Eynde and D. Gibbon (ed.). Dordrecht: Kluwer Academic Publishers, pp. 207-233.
Rosetta M.T. 1994. Compositional Translation. Dordrecht: Kluwer Academic Publishers.
Van Sterkenburg P. 1996. Van Dale Groot Woordenboek van Hedendaags Nederlands. Utrecht: Van Dale Lexicografie.
VDS 2000. Van Dale Spellingcorrector voor MS-Word. Utrecht: Van Dale Lexicografie.
Vossen P., L. Bloksma and P. Boersma. 1999. The Dutch Wordnet. Amsterdam: The University of Amsterdam.
Wilks Y., B. Slator and L. Guthrie. 1996. Electric Words - Dictionaries, Computers, and Meanings. Cambridge, MA: MIT Press.

About the authors
Janneke Froon is a language technology coordinator for Van Dale Data. In 1997 she graduated as a computational linguist at the University of Utrecht, and has since been working for Van Dale using language technology to enhance lexicographic information for dictionaries and language products. She is preparing her PhD thesis, researching the improvement of spell-checkers using lexicographic information, and is working on the development of a large lexicographical database that integrates formal and semantic features.

Franciska M.G. de Jong teaches language technology at the Computer Science Department of the University of Twente, Enschede, and works for TNO-TPD in Delft as a consultant in the area of multimedia technology. Her background is in theoretical and computational linguistics, and she worked as an assistant researcher at the Faculty of Arts of the University of Utrecht (1980-1985) and as a senior researcher at Philips Research on the Rosetta machine translation project (1985-1992). She is frequently involved in international program committees, expert groups and review panels, and has initiated a number of EU projects. Professor de Jong is currently coordinating several projects aimed at multimedia indexing and retrieval, and chairs the Advisory Board of Van Dale Lexicografie.

About Van Dale Data
For over 100 years Van Dale Lexicography has been recognized as the foremost dictionary publishing source in the Netherlands. Since 1989 it has been publishing electronic dictionary applications. Van Dale Data BV has been an independent enterprise of Van Dale Lexicografie BV since 1999, focusing on the management and commercial operation of linguistic databases and their applications within language and speech technology. Van Dale is part of the Veen Bosch en Keuning publishing group.

Van Dale Lexicographical Information System (VLIS)
o semantic hierarchical network
o multilanguage information
o phraseology, idioms
o classification
o word attributes
o 170,000 Dutch word definitions, 1,070,000 translations
o 145,000 semantic relationships
o 225,000 examples, 525,000 translations
o 250 different thematic labels
o lexicographical products for different media
o multilingual dictionaries
o multiple text retrieval and analysis techniques
o automatic classification and summarizing of texts
o development of databases
o building of indexing tools

Word attributes database and language technology
o spelling and hyphenation
o expansions
o word class
o frequency
o context relationships
o pronunciation
o transcriptions
o morphology
o 250,000 Dutch keywords
o 1,250,000 expansions
o checks using language rules
o relationships between words
o inheritance of attributes
o editorial expertise
o parameterization
o corpus
o frequency
o reverse engineering
o hyphenation, expansions
o electronic dictionaries
o language tools
o speech applications
o rhyme engine
o games

Speech technology
o pronunciation indication for all words
o lexicon of more than 180,000 word forms
o rules for unknown words
o rules for interpreting numbers, punctuation, etc.
o prosody generation
o rules for length of sounds in context
o rules for sentence melody
Diphone synthesis
o diphone
o diphone database
o MBROLA synthesizer
o talking dictionaries
o tools for handicapped persons
o games
o fluency e-mail reader
o telephony and Internet

K Dictionaries Ltd
10 Nahum Street, Tel Aviv 63503 Israel
tel: 972-3-5468102 • fax: 972-3-5468103