Estonian Dialect Corpus

The territory where Estonian is spoken is quite small, but the differences between the traditional dialects are large. Researchers of Estonian dialects have distinguished at least eight main dialects and over a hundred sub-dialects (parish dialects). So far there have been only a few comparative studies of Estonian dialect phonology and grammar, because unified data sources for this kind of analysis have been lacking. The Corpus of Estonian Dialects is meant to simplify every kind of research on Estonian dialects.

The Corpus of Estonian Dialects is a joint project of the University of Tartu and the Institute of the Estonian Language, started in 1998. The corpus includes the greater part of the dialect data sources of the Institute of the Estonian Language and of the University of Tartu. We have started with the oldest recordings from each dialect.

Aim of the corpus

The main aim of the corpus is to help researchers find authentic data from all Estonian dialects, collected and processed according to the same principles. Above all, the corpus makes it possible to study the phonological and grammatical structure of Estonian dialects comparatively. All the data in the corpus is available in digital form.

Description of the corpus

The dialect corpus consists of:

1)      Dialect recordings. The corpus is based on dialect recordings, most of which were made in the 1960s and 1970s. The earliest recordings date from 1938. They are traditional dialect recordings in which the interview is conducted at the home of the informant.

2)      Phonetically transcribed texts. The traditional Finno-Ugric phonetic transcription is used. The texts are available as Word and PDF files (as of 1 May 2011, the corpus contained about 1,284,000 text words).

3)      Dialect texts in simplified transcription. All of the phonetically transcribed texts have been converted one-to-one into the simplified transcription (.txt), which makes it possible to use the texts with any program and to carry out preliminary analyses.

4)      Morphologically tagged texts, which have been loaded into a MySQL database. All word classes and morphological forms are tagged.

5)      A database containing information about informants and recordings.

6)      Syntactically parsed texts (about 40,000 text words).

In the corpus, every phonetically transcribed text is accompanied by a recording, a file in simplified transcription and a description; more than half of the texts are also accompanied by a morphologically tagged file.
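
Because the simplified transcription is plain text, a preliminary analysis such as counting text words needs only a general-purpose scripting language. The following is a minimal sketch, not the project's own tooling. It assumes that tokens are separated by whitespace and that apostrophes and backticks belong to the transcription and must be kept. The function name and the sample string are hypothetical; the sample merely reuses word forms quoted on this page and is not a corpus passage.

```python
from collections import Counter

# Punctuation stripped from token edges. Apostrophes and backticks are kept
# because they are assumed to be part of the transcription itself.
EDGE_PUNCTUATION = ".,;:!?()\""


def count_text_words(text: str) -> Counter:
    """Count whitespace-separated tokens, ignoring edge punctuation."""
    tokens = (raw.strip(EDGE_PUNCTUATION) for raw in text.split())
    return Counter(token for token in tokens if token)


# Illustrative string only, built from forms cited on this page.
sample = "`vaesõq t's'ibõrdõl'l'i, `vaesõq."
counts = count_text_words(sample)
```

Summing `counts.values()` gives the number of text words, and `counts.most_common()` lists the forms by frequency.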

Some data from other Finnic languages spoken around Estonia have also been added. The aim is to incorporate at least Votic, Ingrian and Livonian data into the corpus.

Morphological database

Texts in the simplified transcription are morphologically tagged and loaded into the database. The texts were tagged with the help of the program Mark, and the tagged texts are in XML format. Texts that have already been tagged have been loaded into a MySQL database, which can be used via the Internet.

For every word the following fields have been tagged:

SNE: the original form of the token as it occurs in the text, e.g. t's'ibõrdõl'l'i ‘fidget’ (past sg 3), `vaesõq ‘poor’ (pl nominative).

MSN: the keyword (lemma) in the literary language form, e.g. tsiberdelema, vaene.

TAH: the meaning, if it differs from the meaning in the literary language.

FRA: phrase, tagged for phrasal units, e.g. phrasal verbs (e.g. ära ostma ‘buy’; ära ‘away’ is here the perfective marker).

SLK: word class. Words have been divided into 26 word classes according to their morphological inflection, syntactic characteristics and semantics. This classification is based on the system of word classes presented in the Estonian academic grammar (EKG I: 14–41); however, we distinguish more subclasses. For more details on word classes in the dialect corpus, see Lindström et al. (2006).

MRF: morphological information. Morphological information has been added to inflected words (nouns, verbs, pronouns, adjectives, etc.).

KHK: the parish where the word comes from. Abbreviations in capital letters are used (e.g. VAS=Vastseliina). In addition to the data on Estonian dialects, Votic texts have also been tagged to a certain extent (the corresponding abbreviations are IVA=Eastern Votic and LVA=Western Votic).
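
The fields listed above can be pictured as one record per word. The sketch below is an illustration only and does not reproduce the actual MySQL schema or XML format, neither of which is described here. The field abbreviations come from this page; the Python names, the word-class and morphology values, and the assignment of both example forms to VAS are assumptions made for the sake of the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedWord:
    sne: str                    # SNE: token as it occurs in the text
    msn: str                    # MSN: lemma in literary-language form
    slk: str                    # SLK: word class
    khk: str                    # KHK: parish abbreviation
    tah: Optional[str] = None   # TAH: meaning, only if it differs
    fra: Optional[str] = None   # FRA: phrasal unit, if any
    mrf: Optional[str] = None   # MRF: morphological information


def forms_by_parish(words, lemma):
    """Group the attested forms of one lemma by parish abbreviation."""
    grouped = defaultdict(list)
    for word in words:
        if word.msn == lemma:
            grouped[word.khk].append(word.sne)
    return dict(grouped)


# Illustrative records: tag values and parish assignments are placeholders.
words = [
    TaggedWord(sne="`vaesõq", msn="vaene", slk="adjective",
               khk="VAS", mrf="pl nominative"),
    TaggedWord(sne="t's'ibõrdõl'l'i", msn="tsiberdelema", slk="verb",
               khk="VAS", mrf="past sg 3"),
]
result = forms_by_parish(words, "vaene")
```

Grouping by the literary-language lemma (MSN) and the parish (KHK) is what makes comparison across dialects possible: the same lemma can be looked up in every parish even though its dialect forms (SNE) differ.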

Using the corpus

To use the Corpus of Estonian Dialects, please write to the corpus administrator Liina Lindström: liina.lindstrom [ät]

or via ordinary mail:

Liina Lindström
Department of Estonian and Finno-Ugric Linguistics
University of Tartu
Ülikooli 18
Tartu 50090

Last edited: 2014-07-10 17:40:58