Kernerman Dictionary News • Number 12 • July 2004
Rav-Milim – a Modern Dictionary for an Ancient but Thriving
Yaacov Choueka was born in Cairo in 1936, immigrated to Israel in 1957, and got a PhD in Mathematics from the Hebrew University in Jerusalem in 1971. His computational work on the Hebrew language has pioneered innovations on a global scale, including the development of a computerized lemmatizer and morphological analyzer (1964), leading the Responsa corpus project for a full-text system for Rabbinical literature (1974–86), and laying the infrastructure for and directing the Rav-Milim dictionary project and the Nakdan computerized program for automatic vocalization of texts (1989–97). He was twice recipient of the annual award of the Israel Association for Information Technology, and received the Prime Minister Prize for computing in 1997. Since 1974, Professor Choueka has been teaching Computer Sciences at Bar Ilan University, where he heads the Institute for Information Retrieval and Computational Linguistics.
1. Hebrew is probably one of the oldest languages in current usage, and its dictionary-making history goes back more than a thousand years. It is not our intention here to trace the whole history of Hebrew dictionaries or lexicographic compendia, but rather – as a contrastive background to this brief presentation of the Rav-Milim dictionary of Modern Hebrew (hereafter MH) – to mention the modern ones, i.e. those in vogue in the twentieth century before the Rav-Milim publication in 1997.
By universal opinion, the scene for the Hebrew dictionaries in the twentieth century was dominated by three major and influential works. First and foremost is the Eliezer Ben-Yehuda 16-volume dictionary, a monumental work of erudition and scholarship by “the reviver of the Hebrew language”, the publication of which started in 1908 but ended only in 1959. This is an OED-type of historical dictionary, whose glossary included not only (though mostly) Biblical and Rabbinical terms but also whatever MH ones were available then, especially those coined by Ben-Yehuda himself. The Gur dictionary (1934-36) was really the first general dictionary of MH, quite popular in the late thirties and forties. Finally, the enormously popular Even-Shoshan one – first published in 1947-52, then reprinted countless times in various formats and numbers of volumes (with only one major revision in 1970) – completely dominated the scene in Israel for almost 50 years, being present, virtually, in most local households.
To complete this picture, one should mention three other dictionaries whose impact was rather negligible: Cnaani’s 18-volume dictionary (1960-82), Alcalay (1969-71) and Medan (1954).
All in all, then, just six dictionaries in a whole century, one of them – updated and revised in only minor ways – exclusively dominating the scene for most of the second half of that century, and none of them with any computerized components. Thus, during a period when not only the State of Israel and the Hebrew language were undergoing extraordinary dynamic cycles of changes and expansion, but the whole world was – and still is – exploring new frontiers (and devising new terms and semantic fields to describe them) in technology and science, and in intellectual and social life, dictionary making in MH was in practice frozen for some fifty years.
This was the state of affairs in late 1992, when it was decided to compile and publish – both in print and in electronic form – a new and up-to-date illustrated dictionary of MH, Rav-Milim [Master-Words], with a shorter companion – richly annotated and copiously illustrated in color, specially adapted to young children and teenagers in elementary and secondary schools – Junior Rav-Milim. Here we restrict ourselves to the description of the unabridged Rav-Milim printed version, its underlying philosophy and some of its salient features.
2. Although not a purely corpus-based dictionary, the Rav-Milim design was deeply influenced by computerized methodologies and techniques of natural language processing developed since the mid-1980s, not only in its production and in its extensive cross-checking algorithms, but also in its very structure and editing method. Indeed, since the late eighties, computers have altered the way we view dictionaries, their functionality, their aims, and the degree of thoroughness, coverage, accuracy, precision and methodical writing we have come to expect from them. These influences were masterly described by Krishnamurthy in a previous issue of this newsletter (2002). True enough, the Krishnamurthy paper was about EFL dictionaries; taking into account, however, that for a great majority of the population of Israel, immigrants from all over the world, Hebrew is indeed, to a certain extent at least, a “foreign language”, such insights are highly relevant to a general dictionary of MH as well.
From its very inception, it was decided that Rav-Milim (RM) will be developed along completely different – in fact, radically different – lines than previously published dictionaries of Hebrew (PPDH in short), constituting an “anti-thesis” – so to speak – to them on almost each and every methodological issue of dictionary designing and editing. It differs from PPDH in the list of entries, in the entry’s structure, in the entry’s “explanation”, in the detailed and fine analysis of the various meanings of the entry and their order, in the usage examples, in the usage directives, in the registers’ annotations, in the “etymological” notes, and in the thorough and detailed processing of collocations and when, where and how to include them. At the risk of being somewhat simplistic, we can state schematically that RM is intended to be synchronic and not diachronic, descriptive and not normative, explanatory and not definitional, contemporary and not archival, illustrative and not quotation-minded. Furthermore, a maximum of uniformity and consistency in the dictionary compilation was assured (and continuously checked by the computer) by having all editorial questions discussed, decided and recorded formally by the editorial committee, which counted among its members five prominent professors of Hebrew.
In the following we shall briefly present the main features of RM, most of which were “firsts” in Hebrew lexicography, and some of which have since been adapted in several Hebrew dictionaries that were published after it.
3. The written form of Hebrew – as that of other Semitic languages – is an essentially unvocalized one, vocalization being marked by diacritical points that may appear below, above, or inside the word’s letters. Such a vocalization is however rarely used in everyday writing, except for Biblical texts, poetry or (more recently) for children’s books. To alleviate some of the annoying ambiguity that would thereby result in many different “readings” of a given word, it has been customary to add in appropriate positions of the word some mater lectionis: Vav for the vowels O and U, Yod for E and I, and Aleph for A, thus producing the so-called plene script. Still, most PPDH were edited in the formally vocalized grammatical script, and the entries were also given – and therefore sorted – in this form. We thought that such a vocalized script would seem totally out of context to any reader who never encounters such texts elsewhere, not to mention its childish (on one hand) and somewhat paternalistic (on the other hand) projection. RM is therefore edited in the plene spelling, and the headwords are given in that script, since this is exactly how a user will usually see it in a publication and look for it in the dictionary. Following the plene headword, its grammatical vocalized form is given, so as to assist the user in pronouncing it correctly and recognizing its pattern. Additionally, a pointer is given from that form, in its alphabetical position, to the plene one, just in case the user encounters that form or is extrapolating from the given plene one and looking for it in RM. Incidentally, the number of spelling variants in Hebrew is rather large, also because of different ways of transcribing loan words from many languages over the ages, whether from Aramaic in ancient times or mainly from English most recently; since having pointers from these variants in the main dictionary page would have hopelessly encumbered it (indeed many pages in PPHD consist mostly of such pointers!), all pointers pertinent to a given page were collected and printed in a separate section at the bottom of that page.
The list of entries in RM is distinguished both by what it contains and by what it omits. Besides listing virtually every (Hebrew) Biblical word and most terms from early Rabbinical sources (except Hapax Legomena, whose meaning is not well understood and is inferred only from the context), the list contains every word in current usage, from all registers – from the highest literary ones to the most colloquial and vulgar ones. The only criterion for inclusion was whether such an utterance can be read or heard somewhere; if so, then we must help the user understand it, by including it and its meanings in the dictionary (this was indeed the first time ever that such terms were included in a general dictionary). On the other hand, the word’s register is always clearly marked; from the highly literary (to warn the reader against using such a word in – say – asking directions) to the colloquial, vulgar or obscene, as well as corrupt form of, etc.
We included as entries also utterances that are not, linguistically, “words” of the language, but are used in certain ways specific to Hebrew, such as tsvits tsvits for denoting a bird song, miaou for a cat call, koukourikou for the rooster call, sha for requesting silence, etc.
Special consideration had to be given to the inclusion of “encyclopedic” terms and knowledge, and of terms from various scientific and technological domains. A dictionary is neither an encyclopedia nor a complete guide to the fauna and flora of the world or even of a certain region of it. As a rule-of-the-thumb, any term that may potentially occur in a general publication was included, and any term that occurs only in the relevant professional publications was excluded.
For various types of “non-linguistic” terms, the decision on whether to include them as entries in RM was made by the editorial committee, and rigorously implemented.
Following are some examples of such decisions:
· No proper name of anyone (living or dead) is to be included; literary or mythological figures are mentioned to the extent that they are used metaphorically (Samson, Venus, Casanova) or in collocations (Richter’s scale, Columbus egg).
· Country names are included, along with the language(s), capital and up to three cities, and two denominations of currency – the minimal one and the main one (cent and dollar, penny and pound).
· Places in Israel are included if they have more than 5000 inhabitants as per the last Israel census.
· No specific “creations” (books, theater, arts, etc) are included, with the exception of the 24 books of the Bible and the canonical early Rabbinical sources.
· All elements of the cyclical table are included, with a uniformly designed explanation.
On the other hand, we omitted from RM thousands of obsolete entries that appeared in PPHD: words coined from the late nineteenth century and loan words from other languages that were almost never used, even words officially coined by the Academy of the Hebrew Language that did not enjoy wide acceptance, etc. Our policy was that not every word used once or twice by a writer, as great as he or she may be, should be automatically recorded in the dictionary. Delicate editorial considerations sometimes have to be applied in such cases.
Another issue that well illustrates the spirit of RM is the following. Because of the peculiar history of the Hebrew language, many words have persisted and are in current usage in certain conjugated or derived forms, while the original variant is – and was – never in use (hav [give], only in the imperative; be'etyo [because of him/it/that], only with the preposition and the pronoun). PPHD used, in such cases, to “extrapolate” and invent the presumed original form and list it as a dictionary entry. We refrained from inventing words, and such terms were given as entries “as are”, which is anyway the form in which the user will encounter such words and look for them in the dictionary.
4. “The principal reason for the existence of a general monolingual dictionary is its definitions. All the art and all the scholarship and all the scientific methods that the editors can command are required to study meanings and write definitions” (Gove, 1961).
Contrary to Gove’s wise dictum, one cannot but notice that in most PPHD this aspect of dictionary compilation has been quite neglected, usually with the justification of offering one or more synonyms of the entry. In RM, however, we fully endorsed this statement, with all its consequences and ramifications, except, maybe, for replacing “definitions” by “explanations”, since our aim was not to give an Aristotelian definition of an entry, but to explain it completely and precisely. According to the RM concept, the ultimate test of a good explanation is whether a user who has never encountered the word before can now understand it as fully and precisely as possible. On the one hand, we painstakingly analyzed and checked every word in the explanation to assure its appropriateness and pertinent coverage. On the other hand, we aimed at detailing explicitly all the nuances and shades of the basic meaning of the entry, as manifested in the different contexts in which it actually occurs. Indeed, as stated by Firth (1957), “you recognize a word by the company it keeps”.
One example should suffice to clarify this approach. The adjective ham [hot, warm] is defined in Even-Shoshan only as “having a more or less high temperature”. In RM, this entry details some 11 different meanings or usages in various contexts (that may well translate into different words in other languages), which an innocent reader would not be able to guess on her/his own. Thus, besides the basic meaning as in “hot soup” (vs. “cold soup”), we have “hot news” (but not “cold news”), “hot temper”, “warm heart” (the former with a negative connotation, the later with a positive one), “warm voice” (specific voice texture), “warm clothes” (the clothes themselves are not warm, they warm the body), “he is hot” (which doesn’t mean he has “a more or less high temperature”, he is not sick, he just feels hot and would like to open the window), etc. Even the “Hot! Hot!” call in the hide-and-seek children’s game deserves and gets its own numbered meaning. Indeed, the fine analysis of the extremely rich spectrum of the nuances of almost every word, according to the contexts in which it appears, is one of the greatest achievements and benefits of the application of computers to the processing of large corpora, and the lexicographer’s efforts for collecting, classifying, sorting and adequately explaining these nuances is probably the most exciting and satisfying part of the dictionary making process.
When the meanings of an entry have changed throughout its history, they were always ordered, traditionally, chronologically. In RM, which has always had the user in mind, meanings are ordered by decreasing frequency; the most frequent sense given first, and adequate period labels attached when necessary.
Finally, an explanation is almost always followed in RM by one or more example of usage, which only rarely are quotations from canonical writing. In nearly all cases, examples were carefully crafted to add interesting and useful details to the explanation.
5. One of the impacts of large corpora processing on linguistic studies in general, and on dictionary making in particular, since the mid-eighties, has been the recognition of the critical importance of collocations in defining the language elements and structure.
If this is true for European languages, how much more so for Hebrew! Indeed, with the world dynamically revolving around us, the Hebrew language has constantly had to acquire and absorb numerous new words from the various domains of modern life activities. Although some new terms are adapted as loan-words “as is” and easily become part of current Hebrew, in many cases, however, Hebrew – being a Semitic language with a structure of 3- (or 4-) letter roots and derivation patterns – is quite resistant to such assimilation. A common productive solution is to have a two- (or three-) word Hebrew sequence to represent a new concept. A large number of single-word nouns in English, for example, such as school, hospital, lawyer, accountant, are represented in Hebrew by a two-word sequence.
In spite of that, the treatment of collocations in PPHD has been rather poor, to say the least. Very few collocations found their way into these dictionaries; phrasal collocations, idioms and even proverbs (!) were all mixed up; no clear guidelines were respected in terms of where and how to have the collocation’s main entry (in fact, in an extreme example, a 4-word collocation could appear in 4 different entries with 4 different explanations!), or in terms of how to deal with and uniformly represent the “empty places” in some of these collocations, etc.
Having researched the problem of collocations already in the eighties (see 1983, 1988), I was strongly biased in favor of a comprehensive, systematic, rigorous and consistent treatment of the collocational part of RM. A small sample of the new features introduced in this endeavor now follows:
· To the question of when does a sequence of two or more words deserve its own entry in the dictionary as a collocation, a common answer is: when the meaning of the sequence is not the total sum of its components’ meanings, and cannot be guessed from it. This is indeed an important criterion, but it is far from being unique. We delineated 12 different criteria that can justify such an inclusion, and every potential collocation was tested accordingly.
· Almost 10,000 new collocations were added in RM that never appeared before in PPHD. This is an extremely high figure when taking into account that the total number of (single-word) entries in PPHD is of the order of 35,000 entries only.
· Proverbs (e.g. ‘not all that glitters is gold’) were completely banned from the dictionary; phrasal collocations and idioms were sharply separated.
· Strict rules were set up and followed on where to introduce the main entry of a collocation and its explanation. The explanation appears, of course, only once, but pointers to that occurrence are given from every word of the collocation.
· Collocations were tagged by part-of-speech tags: nominal, verbal, adjectival, adverbial, etc. When necessary, morphological variants were added.
· Possible additions, omissions, replacements, etc, in the collocation text were marked clearly, in a uniform way.
To sum up: RM was a bold step taken to bring modern methodologies, trends and techniques to Hebrew dictionary making, applying overwhelmingly a computerized approach to its compilation and checking procedures. We believe that it has thus set a new standard of precision, coverage, methodology and systematization that will be hard to ignore.
Alcalay, Reuven. 1969-1971. Milon 'Ivri Shalem (The Complete Hebrew Dictionary). 3 vol. Ramat-Gan: Massada.
Ben-Yehuda, Eliezer. 1908-1959. Milon ha-Lashon ha-'Ivrit ha-Yeshana ve-ha-Hadasha (A Complete Dictionary of Ancient and Modern Hebrew). 16 vol.
Cnaani, Yaacov. 1960-1982. Otsar ha-Lashon ha-'Ivrit (Treasure of the Hebrew Language). 18 vol. Ramat-Gan: Massada.
Even-Shoshan, Avraham. 1947-1952. Ha-Milon he-Hadash (The New Dictionary). 5 vol; 1966-1970, 7 vol; 1997, 5 vol. Jerusalem: Qiryat Sefer.
Grazovsky (Gur), Yehuda. 1934-1936. Milon ha-Safa ha-'Ivrit (Dictionary of the Hebrew Language). 3 vol; 1947, 1 vol. Tel Aviv: Dvir.
Medan, Meir. 1954. Me-'Aleph 'ad Tav – Milon 'Ivri Shimushi (From A to Z – a Practical Hebrew Dictionary). Jerusalem: Achiasaf.
Choueka, Y., S.T. Klein and E. Neuwitz. 1983. ‘Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus’. ALLC Journal, 4.34-38.
Choueka, Y. 1988. ‘Looking for needles in a haystack or: locating interesting expressions in large textual databases’. Proceedings of the RIAO Conference (Cambridge, MA). 609-623.
Firth, J.R. 1957. ‘A Synopsis of Linguistic Theory, 1930-1955’. Studies in Linguistic Analysis, 1-32. Oxford: Blackwell.
Gove, P.B. 1961. ‘Linguistic Advances and Lexicography’. Word Study, October 1961, 3-8.
Krishnamurthy, R. 2002. ‘The Corpus Revolution in EFL Dictionaries.’ Kernerman Dictionary News, 10.23-27 (http://kdictionaries.com/newsletter/kdn10-9.html).
K Dictionaries Ltd