The format for the lexical databases will be TEI-conformant SGML. The TEI proposals for dictionaries are designed to support accurate description of existing dictionaries, rather than the production of new resources in an optimal form for language engineering use. It is a secondary goal of the project to develop a highly constrained variant of the TEI dictionaries DTD which is suitable for language engineering use.
For the Bulgarian, Hungarian and Estonian partners, there is overlap between the CONCEDE team and the dictionary producers. In Estonia, the dictionary is a very large one, with 18 volumes out of an envisaged 24 produced to date. The lexicographers are at the Institute of the Estonian Language, which is a sub-contractor in CONCEDE. The CONCEDE partner thus has full access to the database representing the dictionary-in-progress.
For Hungarian, both the CONCEDE team and the dictionary team for the medium-sized Hungarian Explanatory Hand Dictionary are at the Hungarian Academy of Sciences, and the original plan was that the CONCEDE work would be integral to the production of a new edition of the dictionary.
The Romanian and Czech partners are working from medium-sized school or college dictionaries. For both of these, as for Hungarian, the CONCEDE partners fully intend to convert the whole dictionary for language engineering use. (The undertaking within CONCEDE is to produce a 3,000 entry database for all languages except Slovene, where the database size is 500 entries.) For Czech, relations with the copyright-holders are good, but for Romanian, problems over copyright were encountered, with the outcome that the dictionary was re-keyed for CONCEDE.
All of these dictionaries were monolingual. The Slovene dictionary is by contrast an English-Slovene bilingual dictionary, currently in preparation. The method for the production of the dictionary was that the Slovenian company, DZS, licensed the "English framework" from Oxford University Press. The "English framework" is a suitable starting point for the English-source-language part of any "English-to-X" bilingual dictionary, giving all the specifications of English words and phrases, their grammar and their meanings, which require translation into the target language. The framework used for the English-Slovene dictionary was also the one used for the extensively-researched Oxford-Hachette English-French dictionary.
A strategy based on finding translation-equivalents across the six languages was considered, and rejected. Identifying translation equivalents across languages raises many theoretical and practical difficulties, and equivalences were not essential to the project's work. However, it was desirable to have comparable coverage from one database to another, and for the headwords to be selected in a principled way.
A strategy was developed that used the "Orwell" corpus, which had been developed in the EU project MULTEXT-EAST. MULTEXT-EAST (which had most of the same partners as CONCEDE) had identified a text of George Orwell's novel, 1984, in each of the six CEE languages (and English), and each of these texts had been produced as a well-structured, lemmatised, TEI-compliant corpus. (The project had also produced model "Corpus Encoding standards", playing a similar role in relation to corpus encoding that CONCEDE aims to play in relation to dictionary encoding.) From Orwell, a lemmatised frequency list could be produced for each language. The simple strategy would then have been to take, first, the 500, and later, the 3,000 highest-frequency words from this list as the CONCEDE headword list. However, high-frequency words are atypical of the lexicon as a whole: they are overwhelmingly closed-class grammatical words, whereas the bulk of the lexicon comprises open-class content words. Even where open-class words are of very high frequency, they tend to be highly polysemous and to occur in many fixed phrases. The simple strategy would have made the first phase of the work unlike subsequent phases (and disproportionately difficult).
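The simple strategy can be sketched as follows. This is a minimal illustration, assuming the lemmatised corpus is available as (word form, lemma, word class) triples; the token representation, the toy tokens and the function names are assumptions made for the example, not the MULTEXT-EAST format or tooling.

```python
from collections import Counter

def lemma_frequencies(tokens):
    """Count occurrences of each (lemma, word class) pair."""
    return Counter((lemma, pos) for _form, lemma, pos in tokens)

def top_headwords(freq, n):
    """Return the n most frequent (lemma, word class) pairs."""
    return [key for key, _count in freq.most_common(n)]

# Invented toy tokens: (word form, lemma, word class).
tokens = [
    ("The", "the", "DET"), ("clocks", "clock", "NOUN"),
    ("were", "be", "VERB"), ("striking", "strike", "VERB"),
    ("and", "and", "CONJ"), ("the", "the", "DET"),
    ("clock", "clock", "NOUN"), ("was", "be", "VERB"),
    ("on", "on", "PREP"), ("the", "the", "DET"),
    ("wall", "wall", "NOUN"),
]

freq = lemma_frequencies(tokens)
first = top_headwords(freq, 1)
```

Even in this toy sample the head of the list is a determiner, which is the effect that made the simple strategy unattractive.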
A strategy was developed in which the sampling was by word class, with the proportion of the sample allocated to each word class determined by modified frequency (which discounted very-high-frequency items). For each word class, words were automatically classified as high, medium or low frequency (according to their count in Orwell) and samples were taken from each word class and each frequency band. The procedure is fully specified in deliverable 1.3.
Thus there was an automatic procedure which was, with minor modifications, applied to each language.
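The sampling idea can be sketched in a few lines. This is only an illustration under stated assumptions: the "modified frequency" is modelled here by capping each item's count, the band thresholds are invented, and each word class's quota is split equally across the three bands. The actual procedure is the one specified in deliverable 1.3, which may differ on all of these points.

```python
import random
from collections import defaultdict

def frequency_band(count, low_max=2, high_min=20):
    """Classify a count as high, medium or low (thresholds are assumptions)."""
    if count >= high_min:
        return "high"
    if count > low_max:
        return "medium"
    return "low"

def sample_headwords(freq, target, cap=100, seed=0):
    """Stratified sample of (lemma, word class) pairs.

    freq maps (lemma, word class) to a corpus count. Capping each count at
    `cap` is an assumed stand-in for the modified frequency that discounts
    very-high-frequency items.
    """
    rng = random.Random(seed)
    class_mass = defaultdict(int)   # capped frequency mass per word class
    strata = defaultdict(list)      # lemmas per (word class, band)
    for (lemma, pos), count in freq.items():
        class_mass[pos] += min(count, cap)
        strata[(pos, frequency_band(count))].append(lemma)
    total_mass = sum(class_mass.values())

    sample = []
    for pos, mass in class_mass.items():
        quota = round(target * mass / total_mass)
        per_band = quota // 3       # equal split across bands (assumption)
        for band in ("high", "medium", "low"):
            pool = sorted(strata[(pos, band)])
            k = min(per_band, len(pool))
            sample.extend((lemma, pos) for lemma in rng.sample(pool, k))
    return sample

# Invented toy frequency list: one very frequent determiner, plus nouns
# and verbs spread over the three bands.
demo_freq = {("the", "DET"): 5000}
for pos in ("NOUN", "VERB"):
    for i in range(10):
        demo_freq[(f"{pos.lower()}_low{i}", pos)] = 1
        demo_freq[(f"{pos.lower()}_mid{i}", pos)] = 5
        demo_freq[(f"{pos.lower()}_high{i}", pos)] = 30

picked = sample_headwords(demo_freq, target=30)
```

Because quotas are rounded and some strata may be too small to fill, the sample can fall slightly short of the target; a production version would redistribute the shortfall.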
For each language, there was a journey to be made, from input formalism to well-structured SGML. In the first phase, the work concentrated on identifying and unpicking the problems associated with the input formalism. These were different for each language, both at the macro and at the micro level.
At the macro level, the input formalism for the Estonian dictionary was a database; for the Romanian it was text keyed in Word to look exactly like the printed book. The Hungarian dictionary was in principle available in either SGML or TeX format, with utilities to transfer between the two; in practice, the dictionary was in progress, and there were logistical difficulties associated with continuing to work on the latest version. The Czech dictionary was available in a proprietary PC format, and the English-Slovene dictionary in SGML.
The differences at the micro level are less easily described but more significant. For example, the simple question, "what is an entry?", has quite different answers in different dictionaries, with varying policies on homographs and morphology: some English dictionaries give two entries for bank, with the verb distinguished from the noun. Others make distinct entries for river banks and money banks. Most will give a further entry for banking (the profession or activity) and sometimes for bank account, bank balance, blood bank etc.
Whereas different dictionaries select different structural and typographic devices for representing word-class homographs, semantic homographs, separately lexicalised inflectional variants (e.g. banking), compounds, phrases etc., a standardised lexical database formalism should provide a unique way of representing each.
The devices for representing the information-types are not all that varies between dictionaries: the taxonomy of information-types varies too. Some of this variation is specific to a particular dictionary; some relates to intrinsic features of a particular language or language family, or to the linguistic or lexicographic tradition of a particular country. The process of up-translation has been one of identifying the structures, presentational devices and taxonomies of linguistic phenomena implicit in the dictionary, so that the taxonomy can be mapped onto a standard one (which will be augmented, as appropriate, with insights from, e.g., new language families) and the material can be presented in a standard way.
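To make the "what is an entry?" question concrete, the sketch below groups the same set of senses into entries under three different policies. The sense records, field names and glosses are invented for illustration and are not drawn from any of the CONCEDE dictionaries.

```python
from collections import defaultdict

# Invented toy senses of "bank"; "homograph" marks the semantic homograph.
SENSES = [
    {"form": "bank", "pos": "noun", "homograph": "money",
     "gloss": "financial institution"},
    {"form": "bank", "pos": "verb", "homograph": "money",
     "gloss": "deposit money"},
    {"form": "bank", "pos": "noun", "homograph": "river",
     "gloss": "side of a river"},
]

def group_entries(senses, key):
    """Group sense glosses into entries according to an entry policy."""
    entries = defaultdict(list)
    for sense in senses:
        entries[key(sense)].append(sense["gloss"])
    return dict(entries)

# Policy 1: one entry per word class (noun "bank" vs verb "bank").
by_word_class = group_entries(SENSES, lambda s: (s["form"], s["pos"]))
# Policy 2: one entry per semantic homograph (money bank vs river bank).
by_homograph = group_entries(SENSES, lambda s: (s["form"], s["homograph"]))
# Policy 3: a single entry per spelling.
by_form = group_entries(SENSES, lambda s: s["form"])
```

The first two policies both yield two entries, but the entries contain different senses, so a conversion program cannot treat "entry" as meaning the same thing in every source dictionary.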
Details of the process for each language are described in deliverable 2.2.
The CONCEDE workplan lists an "encoding specifications" work package and a "lexicography" work package (where "lexicography" is of the computational variety and is largely a matter of up-translation). The "encoding" work is then divided between "generic" and "language-specific". While it might appear that these are concerned with quite different products -- "encoding" produces a formalism, "lexicography", a lexicon -- in practice, the up-translation and the language-specific encoding were inseparable aspects of the same process. The process of up-translation is just the process of migration from a less useful to a more useful encoding, and proceeds step-by-step. The consortium agreed that the target formalism would be the same for all languages, so all language-specific encodings are provisional: with more work, the language-specific elements will be replaced by generic ones.
The CONCEDE DTD, in its current form, is 'small and clean'; it aims to be a language-neutral, dictionary-neutral framework for presenting lexical information which has not been compromised in its generality by the characteristics of any of the CONCEDE dictionaries. This means that it may not prove possible for all, or any, of the CONCEDE dictionaries to be up-translated into it, given the resource limitations of the project: the journey from input dictionary to this DTD may simply be too long to be completed before the project ends. In that case, the output formalism of the dictionaries will not be the CONCEDE DTD, but one which has additional elements for information-types which have not yet been resolved into the information-types permitted by the CONCEDE DTD.
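One step of this journey might look like the following sketch, which renames provisional elements to generic ones and reports whatever remains unresolved. Everything specific here is an assumption for illustration: the provisional tag names (headword, wordclass, meaning, regional) are hypothetical, the set of generic elements is a small subset chosen for the example, and XML parsed with Python's standard library stands in for the project's SGML.

```python
import xml.etree.ElementTree as ET

# Assumed subset of generic elements for the purposes of this example.
GENERIC = {"entry", "struc", "alt", "orth", "pos", "def"}

# Hypothetical provisional tags and the generic elements they map to.
MAPPING = {"headword": "orth", "wordclass": "pos", "meaning": "def"}

def up_translate(root, mapping, generic):
    """Rename mapped tags in place; return the tags still unresolved."""
    unresolved = set()
    for element in root.iter():
        element.tag = mapping.get(element.tag, element.tag)
        if element.tag not in generic:
            unresolved.add(element.tag)
    return unresolved

doc = ET.fromstring(
    "<entry><headword>bank</headword><wordclass>n</wordclass>"
    "<struc><meaning>side of a river</meaning>"
    "<regional>dial.</regional></struc></entry>"
)
left_over = up_translate(doc, MAPPING, GENERIC)
```

After the step, the hypothetical regional element is still present and is reported as unresolved, which corresponds to an output formalism that carries additional elements alongside the generic ones.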
The CONCEDE DTD makes a clean distinction between structural elements and content elements. There are just three structural elements and each has a well-defined inheritance semantics. They are called entry, struc and alt. entry corresponds to the basic concept of a dictionary entry; struc to all other hierarchical structures in an entry (most notably sense and subsense divisions), and alt to alternations, or variants, as when colo(u)r has two alternative spellings associated with all of its meanings, however many there are. There are no other means for building entries.
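The interplay of the three structural elements can be illustrated with a small model. This is a sketch under assumptions, not the inheritance semantics defined in the CONCEDE documentation: it assumes that content stated at a higher level is inherited by every nested struc, and that an alt contributes one of its alternatives to each reading. The Node class, the flatten function and the toy colo(u)r entry are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import product

@dataclass
class Node:
    kind: str                                      # "entry", "struc" or "alt"
    content: dict = field(default_factory=dict)    # content elements
    children: list = field(default_factory=list)   # nested structure

def flatten(node, inherited=None):
    """Return one dict per reading, with inherited content applied."""
    merged = {**(inherited or {}), **node.content}
    alts = [c for c in node.children if c.kind == "alt"]
    strucs = [c for c in node.children if c.kind == "struc"]
    readings = []
    # Take every combination of one alternative per alt; with no alts,
    # product() yields a single empty combination, so the loop runs once.
    for combo in product(*[alt.children for alt in alts]):
        scope = dict(merged)
        for alternative in combo:
            scope.update(alternative.content)
        if strucs:
            for struc in strucs:
                readings.extend(flatten(struc, scope))
        else:
            readings.append(scope)
    return readings

# Toy entry: two spellings shared by both senses.
entry = Node("entry", {"pos": "n"}, [
    Node("alt", children=[
        Node("struc", {"orth": "color"}),
        Node("struc", {"orth": "colour"}),
    ]),
    Node("struc", {"def": "hue"}),
    Node("struc", {"def": "paint or dye"}),
])

readings = flatten(entry)
```

Under these assumptions the entry expands to four readings (two spellings times two senses), each carrying the word class stated once at entry level.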
Content elements take text as their content models (with minor exceptions), and the semantics for each of them is as described in the TEI guidelines. CONCEDE partners identified the list of content elements that was needed by working through, as a group, a list of all the elements which had been used in the first phase of lexicography; for each element, we determined whether it was the appropriate tag for the information-type it was being used for, and if so, agreed that all partners would use it for that purpose.
The session exposed some discrepancies in partners' choices of elements, though these were the exception rather than the rule. No partners had used more than a small subset of the elements that TEI made available, and, by and large, the same elements had been used in the same way across the project, which leaves us optimistic that the set of content elements is sufficient for a language-engineering oriented dictionary encoding in general.
The CONCEDE DTD and associated documentation are available as deliverable 2.1.
Phase 2 lexicography will start imminently. The original workplan envisaged that phase 2 would treat the next 2,500 words in the same way as phase 1 had treated the first 500. In fact the work does not follow that pattern: for each language, suites of programs have been developed in the course of phase 1 lexicography, and these programs can be run over the next 2,500 entries (or indeed the whole dictionary). The phase 2 work will involve substantial checking of the 2,500 phase 2 entries, but will also leave space for developing the up-translation programs so that they take the dictionary further on the journey towards a standardised lexical database.