Kernerman Dictionary News • Number 13 • June 2005
A Large-Scale Lexical Database of Danish for Language
Technology Applications and Other Purposes
Lexical Category | No. of Lemmas | Morphology only | Morphology & Syntax | Morphology
Noun | 64735 | 47% | |
Adjective | 9773 | 32% | |
Verb | 5775 | 2% | |
Adverb | 771 | 81% | |
Interjection | 158 | 100% | |
Preposition | 80 | 100% | |
Conjunction | 60 | 100% | |
Pronoun | 44 | 100% | |
Misc. | 128 | 100% | |
Total | 81524 | | |
Table 1: The composition of the entire STO vocabulary
The large number of
lemmas with only morphological information is especially useful in
applications such as shallow parsers, taggers, spell checkers, etc.
The words for the
syntactic encoding were selected on a frequency basis; all verbs are provided
with syntax, whereas only nouns and adjectives above a certain frequency
threshold are provided with syntactic information.
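The selection rule just described can be sketched in a few lines of Python. This is only an illustration: the threshold value, the function name and the treatment of the remaining word classes are assumptions, since the article does not state the actual cut-off.

```python
# Hypothetical sketch of the frequency-based selection rule.
# The threshold value is an assumption, not an STO figure.
SYNTAX_THRESHOLD = 50  # assumed corpus frequency cut-off


def gets_syntax(category: str, frequency: int,
                threshold: int = SYNTAX_THRESHOLD) -> bool:
    """Return True if a lemma would be selected for syntactic encoding."""
    if category == "verb":
        # All verbs receive a syntactic description.
        return True
    if category in ("noun", "adjective"):
        # Nouns and adjectives are encoded only above the threshold.
        return frequency > threshold
    # Other word classes are not covered by the rule as stated;
    # this sketch treats them as unselected (an assumption).
    return False
```

For example, `gets_syntax("verb", 1)` is true regardless of frequency, while a noun is selected only if its frequency exceeds the assumed threshold.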
Tables 2 and 3 break the vocabulary down by source, i.e. by whether the
lemmas were selected from general language texts or from domain language
texts.
Lexical Category | Number of Lemmas
Noun | 52840
Adjective | 8568
Verb | 5410
Adverb | 771
Interjection | 158
Preposition | 80
Conjunction | 60
Pronoun | 44
Misc. | 128
Total | 68059
Table 2: General language vocabulary (all closed word classes belong to this category)
Domain | Nouns | Verbs | Adjectives | Total of Domain
IT | 1730 | 160 | 115 | 2005
Environment | 1770 | 50 | 300 | 2120
Commerce | 1800 | 60 | 160 | 2020
Administration | 2430 | 25 | 220 | 2675
Health | 2285 | 40 | 250 | 2575
Finance | 1880 | 30 | 160 | 2070
Total | 11895 | 365 | 1205 | 13465
Table 3: Domain language vocabularies in the STO database with part of speech distribution
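As a quick arithmetic check, the general language counts in Table 2 and the domain counts in Table 3 should add up to the figures in Table 1. The short Python sketch below uses only numbers copied from the three tables; the variable and function names are illustrative.

```python
# Lemma counts copied from Tables 1-3 of the article.
GENERAL = {"Noun": 52840, "Adjective": 8568, "Verb": 5410, "Adverb": 771,
           "Interjection": 158, "Preposition": 80, "Conjunction": 60,
           "Pronoun": 44, "Misc.": 128}
DOMAIN = {
    "IT": {"Noun": 1730, "Verb": 160, "Adjective": 115},
    "Environment": {"Noun": 1770, "Verb": 50, "Adjective": 300},
    "Commerce": {"Noun": 1800, "Verb": 60, "Adjective": 160},
    "Administration": {"Noun": 2430, "Verb": 25, "Adjective": 220},
    "Health": {"Noun": 2285, "Verb": 40, "Adjective": 250},
    "Finance": {"Noun": 1880, "Verb": 30, "Adjective": 160},
}
TABLE1_OPEN_CLASSES = {"Noun": 64735, "Adjective": 9773, "Verb": 5775}
TABLE1_TOTAL = 81524


def domain_totals(domain):
    """Sum the domain vocabularies per lexical category."""
    totals = {}
    for counts in domain.values():
        for category, n in counts.items():
            totals[category] = totals.get(category, 0) + n
    return totals


by_category = domain_totals(DOMAIN)

# General + domain lemmas should reproduce the open-class rows of Table 1.
for category, expected in TABLE1_OPEN_CLASSES.items():
    assert GENERAL[category] + by_category[category] == expected

# The two sub-vocabularies together should give the overall total.
grand_total = sum(GENERAL.values()) + sum(by_category.values())
assert grand_total == TABLE1_TOTAL
```

The assertions pass, so the three tables are mutually consistent: 68059 general language lemmas plus 13465 domain lemmas give the 81524 lemmas of Table 1.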
Lexicon model
The establishment of the descriptive model and the
linguistic specifications for STO benefited greatly from the experience
acquired at CST within the framework of the multilingual (LE2-4017) PAROLE
project (1996-98) of the European Commission. The PAROLE lexicons were built
around a generic model, an instantiation of the EAGLES recommendations in an
enriched GENELEX model (see side notes). Thus, the Danish STO lexicon is well
integrated in the multilingual infrastructure of European computational
language resources, which ensures its compatibility with other resources
developed for Human Language Technology (HLT).
The STO lexicon is corpus-based as regards both the
selection and the description of lemmas. The linguistic descriptions are
based on corpus analysis, and all lemma types are treated in a uniform way.
The linguistic information content of the STO lexicon is
organized according to the traditional practice in computational linguistics
of division into three independent descriptive layers, i.e. the
morphological, syntactic and semantic layer. Each descriptive layer is made
up of a comprehensive system of the characteristic linguistic properties. The
linguistic description of a lemma is structured in different sets of
information, the so-called units; each unit represents a particular
morphological, syntactic or semantic behaviour of the lemma at the layer
concerned. From the computational point of view, a unit
is a structured object containing a feature-based description expressed in
attribute/value pairs.
• Morphology: lexical category, inflectional patterns, spelling variants, agreement features, compounding properties, etc.
• Syntax: syntactic patterns comprising subcategorisation frame (categorical and functional valence), diathesis and alternation phenomena, reflexivity of verbs, etc.
• Semantics: the information is provided at three specificity levels. Level 1 contains domain reference only (all entries). Level 2 comprises domain information, ontological type, argument structure and selectional restrictions (about 2,000 entries). Level 3 is identical with the SIMPLE semantics. Information types of level 3 are the ontological type, semantic relation, argument structure, selectional restrictions, qualia structure, event structure, domain information, etc. (about 7,000 entries). The subdivision of the semantic information into three levels is introduced for practical reasons. Levels 1 and 2 are proper subsets of level 3, representing a relatively lean semantics.
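The layered organisation described above, with units as feature-based objects made of attribute/value pairs, might be modelled roughly as follows. This is a minimal sketch and not the actual STO or PAROLE data model: the class names, attribute names and values are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Unit:
    """One morphological, syntactic or semantic behaviour of a lemma."""
    layer: str  # "morphology", "syntax" or "semantics"
    # Feature-based description as attribute/value pairs.
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Lemma:
    form: str
    units: List[Unit] = field(default_factory=list)

    def units_at(self, layer: str) -> List[Unit]:
        """Return the units describing this lemma at one layer."""
        return [u for u in self.units if u.layer == layer]


# Illustrative entry; attribute names and values are hypothetical.
spise = Lemma("spise", [
    Unit("morphology", {"category": "verb"}),
    Unit("syntax", {"frame": "subject+object"}),
    Unit("syntax", {"frame": "subject"}),
    Unit("semantics", {"level": "1", "domain": "general"}),
])
```

Here the lemma has two syntactic units, one per syntactic behaviour, so `spise.units_at("syntax")` returns both, while each layer can be queried independently of the others.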
In a collaborative lexicon project like STO, ensuring inter-coder
consistency is key to achieving homogeneous linguistic content. To
this end, the lexicographers were guided by detailed encoding guidelines and
worked with encoding tools that support consistency checks. The work
proceeded in three steps: the lemmas were encoded by one lexicographer or
team, then checked and revised by another, and finally all data were
validated at CST before being uploaded to the STO database. Experience
reported by external users and their relevant comments were also taken
into consideration during the process.
STO is currently the largest and most comprehensive
computational lexicon for the Danish language, and the demand for this
resource is growing. The material is already being used in a number of
projects and applications, for a variety of purposes. Data subsets were
extracted from the lexicon according to users' specifications and adapted
to various format requirements, and the linguistic content was exploited
for particular research and development purposes. In this way, both
the linguistic content and the formal properties of the lexicon were judged
from the users' point of view. The examples below illustrate some typical
uses of the STO data.
In research:
• evaluation of search engine behaviour in a multilingual environment
• computational analysis and processing of complex sentence structures from the point of view of potential reading speed
• conversion of verb entries into the lexicon format of the Danish Dependency Treebank
• testing of a computational grammar for Danish
• using the qualia structure information to calculate semantic relations in compounds
In practical applications:
• Machine Translation (MT) for a specific domain
• lemmatizer for Danish
• information retrieval system prototype
• preparatory work with the aim of exploitation of verb descriptions in a construction dictionary for humans
• ongoing development for speech technology applications, extension with pronunciation of all word forms
Perspectives for further applications
Reports on successful
experimental applications and positive responses from users provide a
promising basis for marketing the STO resource to both the research
community and commercial NLP/HLT tool developers.
Currently, only a few industrial products are developed for Danish at all,
partly because the lack of a lexical resource has been a bottleneck.
Because of its comprehensive and detailed content, STO can meet very
different demands and be exploited as a lexicon component in monolingual
tools (parsers, taggers, authoring tools, browsers, spelling/grammar
checkers), in multilingual applications (MT systems, search engines, etc.)
and in HLT tasks such as developing computer-aided language learning tools
for Danish as a second language.
In addition to
various NLP applications, STO offers a valuable resource to linguistic
researchers, teachers and learners of the Danish language. To facilitate
access, there is a web interface that enables various searches and corpus
investigations [2]. However, the database contains more linguistic
information than shown on the screen for human users.
Search options
• Word Search displays all the inflected forms and syntactic constructions of the lemma.
• Compound Search displays all the compounds containing the search lemma as one of their elements.
• Corpus Search establishes links from each result of a Word Search to direct searches in corpora (corpus instances are displayed in KWIC format).
• Parameterized Search uses a combination of the lexical category and the value(s) of all its selected prevalent properties (up to 30 lemmas meeting the combination of search parameters are displayed).
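The behaviour of Parameterized Search can be illustrated with a small Python sketch. It is not the web interface's implementation: the entry layout, the property names and the function name are hypothetical, and only the limit of 30 displayed lemmas comes from the description above.

```python
from typing import Dict, Iterable, List

MAX_RESULTS = 30  # the interface displays up to 30 matching lemmas


def parameterized_search(entries: Iterable[Dict[str, str]],
                         category: str,
                         properties: Dict[str, str],
                         limit: int = MAX_RESULTS) -> List[str]:
    """Return lemmas of one lexical category matching all property values."""
    hits: List[str] = []
    for entry in entries:
        if entry.get("category") != category:
            continue
        # Every selected property must have the requested value.
        if all(entry.get(name) == value
               for name, value in properties.items()):
            hits.append(entry["lemma"])
            if len(hits) == limit:
                break
    return hits


# Hypothetical sample entries; the property names are invented.
sample = [
    {"lemma": "hus", "category": "noun", "gender": "neuter"},
    {"lemma": "bil", "category": "noun", "gender": "common"},
    {"lemma": "spise", "category": "verb"},
]
```

With this sample, `parameterized_search(sample, "noun", {"gender": "neuter"})` returns only `hus`, and any query stops collecting once the limit of 30 lemmas is reached.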
Additional facilities
The web interface
provides links to other on-line Danish language resources, such as electronic
dictionaries for human use (Retskrivningsordbogen, the Official
Spelling Dictionary and NetOrdbogen, the Internet Dictionary) and
corpora (Korpus2000 and Berlingske Tidende, a newspaper corpus).
In addition, Danish websites can be searched through a link to Google. These
facilities make it possible to run direct searches in a user-friendly
way, e.g. to compare the STO data with information in the electronic
dictionaries and to supplement the STO data with corpus evidence. The URL is: http://cst.dk/sto/webinterface/index.html
(presently in Danish only).
The lexicon material is now available for both commercial use and research
purposes. Starting this summer, the licensing and distribution will be
handled by the Evaluations & Language
resources Distribution Agency (ELDA).
Standard packages available:
• 81,000 entries with morphological description only, provided with full documentation of the morphological layer;
• 81,000 entries with morphological description, of which 45,000 entries are provided with syntactic information, including full documentation of both layers.
Standard formats of delivery:
• Morphology in comma-separated plain text files
• Syntax in XML format
• ORACLE dump database files (8.I on request)
User defined data package types and delivery
formats can be produced on demand at CST.
Center for Sprogteknologi
The CST (Centre for Language
Technology) is a research institute at the
http://cst.dk/uk