C O N C E D E

Public progress report: deliverable 0.2.1



Introduction

Language Engineering needs lexicons. For most languages, lexical resources are not currently available in any usable form. The CONCEDE project is developing lexical databases, based on the information in published dictionaries, for six CEE languages: Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene. The CONCEDE consortium comprises partners for each of the languages, from the associated countries, and XRCE (Grenoble) and ITRI, University of Brighton (co-ordinators). The basic design for the project is that each CEE partner works on its own dictionary, with Western partners developing the framework, advising and co-ordinating.

The format for the lexical databases will be TEI-conformant SGML. The TEI proposals for dictionaries are designed to support accurate description of existing dictionaries, rather than the production of new resources in an optimal form for language engineering use. It is a secondary goal of the project to develop a highly constrained variant of the TEI dictionaries DTD which is suitable for language engineering use.

First year goals

The goals for the first year of the project were:

Change of partners

At an early stage, University of Aix expressed a wish to withdraw from the project. Following some negotiations, this was agreed, and Xerox Research Center Europe (based in Grenoble, France) agreed to join the project. This involved a certain amount of adjustment to the work program.

The dictionaries

Clearly, there was much about the project's output that would be determined by the nature of the input dictionaries. Each CEE partner investigated the dictionaries available for them to use, bearing in mind issues such as the kind and size of dictionary it was; the availability of an electronic version; copyright; the co-operation of authors and/or publishers, etc. For each language a dictionary was chosen, as described in brief below and in detail in deliverable 1.1.

For the Bulgarian, Hungarian and Estonian partners, there is overlap between the CONCEDE team and the dictionary producers. In Estonia, the dictionary is a very large one, with 18 volumes out of an envisaged 24 produced to date. The lexicographers are at the Institute of the Estonian Language, which is a sub-contractor in CONCEDE. The CONCEDE partner thus has full access to the database representing the dictionary-in-progress.

For Hungarian, both the CONCEDE team and the dictionary team for the medium-sized Hungarian Explanatory Hand Dictionary are at the Hungarian Academy of Sciences, and the original plan was that the CONCEDE work would be integral to the production of a new edition of the dictionary.

The Romanian and Czech partners are working from medium-sized school or college dictionaries. For both of these, as for Hungarian, the CONCEDE partners fully intend to convert the whole dictionary for language engineering use. (The undertaking within CONCEDE is to produce a 3,000-entry database for all languages except Slovene, where the database size is 500 entries.) For Czech, relations with the copyright-holders are good, but for Romanian, problems over copyright were encountered, with the outcome that the dictionary was re-keyed for CONCEDE.

All of these dictionaries were monolingual. The Slovene dictionary is by contrast an English-Slovene bilingual dictionary, currently in preparation. The method for the production of the dictionary was that the Slovenian company, DZS, licensed the "English framework" from Oxford University Press. The "English framework" is a suitable starting point for the English-source-language part of any "English-to-X" bilingual dictionary, giving all the specifications of English words and phrases, their grammar and their meanings, which require translation into the target language. The framework used for the English-Slovene dictionary was also the one used for the extensively-researched Oxford-Hachette English-French dictionary.

Headword selection

CONCEDE undertook to produce 500-headword lexical databases in the first phase of lexicography, and a further 2,500 headwords (for all languages but Slovene) in the main phase. A procedure was needed for selecting the words. Note that English, rather than Slovene, is the salient language here for the Slovene partner, as the headwords are English words.

A strategy based on finding translation-equivalents across the six languages was considered, and rejected. Identifying translation equivalents across languages raises many theoretical and practical difficulties, and equivalences were not essential to the project's work. However, it was desirable to have comparable coverage from one database to another, and for the headwords to be selected in a principled way.

A strategy was developed that used the "Orwell" corpus, which had been developed in EU project MULTEXT-EAST. MULTEXT-EAST (which had most of the same partners as CONCEDE) had identified a text of George Orwell's novel, 1984, in each of the six CEE languages (and English), and each of these texts had been produced as a well-structured, lemmatised, TEI-compliant corpus. (The project had also produced model "Corpus Encoding standards", playing a role in relation to corpus encoding similar to the one CONCEDE aims to play in relation to dictionary encoding.) From Orwell, a lemmatised frequency list could be produced for each language. The simple strategy would then have been to take, first, the 500, and later, the 3,000 highest-frequency words from this list as the CONCEDE headword list. However, high-frequency words are atypical of the lexicon as a whole: they are overwhelmingly closed-class grammatical words, whereas the bulk of the lexicon comprises open-class content words. Even where open-class words are of very high frequency, they tend to be highly polysemous and to occur in many fixed phrases. The simple strategy would have made the first phase of the work unlike subsequent phases (and disproportionately difficult).

A strategy was developed in which the sampling was by word class, with the proportion of the sample allocated to each word class determined by modified frequency (which discounted very-high-frequency items). For each word class, words were automatically classified as high, medium or low frequency (according to their count in Orwell) and samples were taken from each word class and each frequency band. The procedure is fully specified in deliverable 1.3.

Thus there was an automatic procedure which was, with minor modifications, applied to each language.
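
The sampling idea described above can be illustrated with a minimal Python sketch. It is not the procedure of deliverable 1.3, whose details are not reproduced here. The function name select_headwords, the cap used to discount very-high-frequency items, the three equal-sized frequency bands and the even split of each quota across bands are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def select_headwords(tokens, sample_size, cap=50, seed=0):
    """Pick headwords from (lemma, word_class) tokens of a lemmatised corpus.

    Hypothetical sketch: 'cap' and the banding scheme are assumptions,
    not values taken from the CONCEDE deliverables.
    """
    rng = random.Random(seed)
    freq = Counter(tokens)  # (lemma, word_class) -> corpus count

    by_class = defaultdict(list)
    modified = Counter()
    for (lemma, wclass), n in freq.items():
        by_class[wclass].append((lemma, n))
        # Assumed discount: a lemma contributes at most 'cap' occurrences.
        modified[wclass] += min(n, cap)

    total = sum(modified.values())
    selected = []
    for wclass, items in sorted(by_class.items()):
        # Share of the sample proportional to the class's modified frequency.
        quota = round(sample_size * modified[wclass] / total)
        # Rank by frequency, then cut into up to three bands (high/medium/low).
        items.sort(key=lambda item: (-item[1], item[0]))
        band_size = -(-len(items) // 3)  # ceiling division
        bands = [items[i:i + band_size] for i in range(0, len(items), band_size)]
        per_band = -(-quota // len(bands))
        picked = []
        for band in bands:
            picked.extend(rng.sample(band, min(per_band, len(band))))
        selected.extend((lemma, wclass) for lemma, _ in picked[:quota])
    return selected
```

Because each quota is rounded and a small word class may not fill its quota, the result can differ slightly from sample_size; a real procedure would need a rule for redistributing the remainder.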

Up-translation

With the dictionaries and headword lists identified, the work at the heart of the project could proceed: up-translation to a well-structured database. This was to be phased, with the first phase of 500 entries providing input to the definition of the target formalism.

For each language, there was a journey to be made, from input formalism to well-structured SGML. In the first phase, the work concentrated on identifying and unpicking the problems associated with the input formalism. These were different for each language, both at the macro and at the micro level.

At the macro level, the input formalism for the Estonian dictionary was a database; for the Romanian it was text keyboarded in Word to look exactly like the printed book. The Hungarian dictionary was in principle available either in SGML or in TeX format, with utilities to transfer between the two; in practice, the dictionary was in progress, and there were logistical difficulties associated with continuing to work on the latest version. The Czech dictionary was available in a proprietary PC format and the English-Slovene dictionary in SGML.

The differences at the micro level are less easily described but more significant. For example, the simple question, "what is an entry?", has quite different answers in different dictionaries, with varying policies on homographs and morphology: some English dictionaries give two entries for bank, with the verb distinguished from the noun. Others make distinct entries for river banks and money banks. Most will give a further entry for banking (the profession or activity) and sometimes for bank account, bank balance, blood bank etc.

Whereas different dictionaries select different structural and typographic devices for representing word-class homographs, semantic homographs, separately lexicalised inflectional variants (e.g. banking), compounds, phrases etc., a standardised lexical database formalism should provide a unique way of representing each.

Not only do the devices for representing the information-types vary between dictionaries; the taxonomy of information-types also varies, in ways which may be specific to a particular dictionary, or which may relate to intrinsic features of a particular language or language family, or to the linguistic or lexicographic tradition of a particular country. The process of up-translation has been one of identifying the structures, presentational devices and taxonomies of linguistic phenomena implicit in the dictionary, so that the taxonomy may be mapped onto a standard one (which will be augmented with insights from, e.g., new language families, as appropriate) and the material may be presented in a standard way.

Details of the process for each language are described in deliverable 2.2.

The CONCEDE workplan lists an "encoding specifications" work package and a "lexicography" work package (where "lexicography" is of the computational variety and is largely a matter of up-translation). The "encoding" work is then divided between "generic" and "language-specific". While it might appear that these are concerned with quite different products -- "encoding" produces a formalism, "lexicography", a lexicon -- in practice, the up-translation and the language-specific encoding were inseparable aspects of the same process. The process of up-translation is just the process of migration from a less useful to a more useful encoding, and proceeds step-by-step. The consortium agreed that the target formalism would be the same for all languages, so all language-specific encodings are provisional: with more work, the language-specific elements will be replaced by generic ones.

Towards a CONCEDE DTD

In parallel with the up-translation (and informed by its progress), a formal grammar for CONCEDE and other lexical databases has been prepared, in the form of an SGML DTD (hereafter 'the CONCEDE DTD'). This has taken forward some of the ideas which were discussed in the TEI Dictionaries Working Group but were not implemented in the TEI guidelines, owing to the demand for those guidelines to be highly permissive.

The CONCEDE DTD, in its current form, is 'small and clean'; it aims to be a language-neutral, dictionary-neutral framework for presenting lexical information which has not been compromised in its generality by the characteristics of any of the CONCEDE dictionaries. This means that it may not prove possible for all, or any, of the CONCEDE dictionaries to be up-translated into it, given the resource limitations of the project: the journey from input dictionary to this DTD may simply be too long to be completed before the project ends. In that case, the output formalism of the dictionaries will not be the CONCEDE DTD, but one which has additional elements for information-types which have not yet been resolved into the information-types permitted by the CONCEDE DTD.

The CONCEDE DTD makes a clean distinction between structural elements and content elements. There are just three structural elements and each has a well-defined inheritance semantics. They are called entry, struc and alt. entry corresponds to the basic concept of a dictionary entry; struc to all other hierarchical structures in an entry (most notably sense and subsense divisions), and alt to alternations, or variants, as when colo(u)r has two alternative spellings associated with all of its meanings, however many there are. There are no other means for building entries.
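
As an illustration of how the three structural elements might combine, the following Python sketch parses a small hand-written entry and flattens it into one record per sense. It is a sketch under stated assumptions, not the inheritance semantics defined in deliverable 2.1: the sample entry, the choice of the TEI content elements orth, pos and def, the type attribute on struc, and the rule that content at one level is inherited by every struc nested below it are all assumptions. For simplicity the sketch only handles content elements inside alt.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: element names entry/struc/alt come from the text above;
# orth, pos and def are TEI dictionary elements chosen for illustration.
SAMPLE = """
<entry>
  <alt><orth>colour</orth><orth>color</orth></alt>
  <pos>n</pos>
  <struc type="sense"><def>a property of light as seen by the eye</def></struc>
  <struc type="sense"><def>a pigment or paint</def></struc>
</entry>
"""


def flatten(node, inherited=None):
    """Return one dict per innermost struc, including inherited content.

    Assumed semantics: content elements apply to all nested struc elements,
    and every member of an alt is an equally valid variant.
    """
    # Copy the inherited lists so sibling branches do not share state.
    local = {tag: list(values) for tag, values in (inherited or {}).items()}
    nested = []
    for child in node:
        if child.tag == "struc":
            nested.append(child)
        elif child.tag == "alt":
            for variant in child:
                local.setdefault(variant.tag, []).append((variant.text or "").strip())
        else:
            local.setdefault(child.tag, []).append((child.text or "").strip())
    if not nested:
        return [local]
    records = []
    for struc in nested:
        records.extend(flatten(struc, local))
    return records


senses = flatten(ET.fromstring(SAMPLE))
```

Here each of the two records in senses carries both spellings and the word class from the entry level, alongside its own definition, which mirrors the colo(u)r case: the alternation is stated once and applies to every meaning.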

Content elements take text as their content models (with minor exceptions), and the semantics for each of them is as described in the TEI guidelines. CONCEDE partners identified the list of content elements that was needed by working through, as a group, a list of all the elements which had been used in the first phase of lexicography; for each element, we determined whether it was the appropriate tag for the information-type it was being used for, and if so, agreed that all partners would use it for that purpose.

The session exposed some discrepancies in partners' choices of elements, though these were the exception rather than the rule. No partners had used more than a small subset of the elements that TEI made available, and, by and large, the same elements had been used in the same way across the project, which leaves us optimistic that the set of content elements is sufficient for a language-engineering oriented dictionary encoding in general.

The CONCEDE DTD and associated documentation are available as deliverable 2.1.

Looking ahead

The April 1999 project meeting in Budapest looked ahead to the validation of the Phase 1 lexicons. The strategy will include looking closely at ten entries for each language, studying both the dictionary as it appeared on the printed page and the encoding in the Phase 1 lexicon, with each partner determining (so far as their knowledge of the language permits) whether they concur with the encoding choices that have been made. We shall also explore up-translating the encoded entries into the CONCEDE DTD.

Phase 2 lexicography will start imminently. The original workplan envisaged that phase 2 would do to the next 2,500 words what phase 1 had done to the first 500. In fact the work does not follow that pattern: for each language, suites of programs have been developed in the course of phase 1 lexicography, and these programs can be run over the next 2,500 entries (or indeed the whole dictionary). The phase 2 work will involve substantial checking of the 2,500 phase 2 entries, but will also leave space for developing the up-translation programs so that they take the dictionary further on the journey towards a standardised lexical database.

Additional Papers, and Dissemination

Four CONCEDE papers were presented at the 1999 COMPLEX conference in Pécs, Hungary (references below). CONCEDE work on up-translation and the CONCEDE DTD were described in the tutorial Lexicography for Computationalists presented at both ACL 99 (Maryland) and European ACL 99 (Bergen).

SGML/XML and DATR

As it stands, the CONCEDE DTD only permits the description of simple inheritance relations. Many linguistic phenomena, notably morphology, can only be well described in a more sophisticated inheritance formalism. One such formalism is DATR (Evans and Gazdar, 1997). Evans has described a new syntax for DATR which would permit the integration of sophisticated DATR inheritance specifications for morphology within XML lexicons. A paper on this work has been presented by Evans at ITRI and at the CONCEDE meeting in Budapest, and by Kilgarriff at an XML workshop at Edinburgh University.

Paper References

All in the volume:

Papers Page maintained by Adam Kilgarriff
Last updated 9 Sept 1999