All fields: Dimitrova, E - OPAC collection catalog

Title: The Bulgarian National Corpus : Theory and Practice in Corpus Design
Authors: Koeva, S.
Stoyanova, I.
Leseva, S.
Dimitrova, T.
Dekova, R.
Tarpomanova, E.
Links: https://bibliotekanauki.pl/articles/103907.pdf
Publication date: 2012
Publisher: Polska Akademia Nauk. Instytut Podstaw Informatyki PAN
Subjects: corpus design
Bulgarian National Corpus
computational linguistics
Description: The paper discusses several key concepts related to the development of corpora and reconsiders them in light of recent developments in NLP. On the basis of an overview of present-day corpora, we conclude that the dominant practices of corpus design do not utilise the technologies adequately and, as a result, fail to meet the demands of corpus linguistics, computational lexicology and computational linguistics alike. We proceed to lay out a data-driven approach to corpus design, which integrates the best practices of traditional corpus linguistics with the potential of the latest technologies allowing fast collection, automatic metadata description and annotation of large amounts of data. Thus, the gist of the approach we propose is that corpus design should be centred on amassing large amounts of mono- and multilingual texts and on providing them with a detailed metadata description and high-quality multi-level annotation. We go on to illustrate this concept with a description of the compilation, structuring, documentation, and annotation of the Bulgarian National Corpus (BulNC). At present it consists of a Bulgarian part of 979.6 million words, constituting the corpus kernel, and 33 Bulgarian-X language corpora, totalling 972.3 million words, 1.95 billion words (1.95×10⁹) altogether. The BulNC is supplied with a comprehensive metadata description, which allows us to organise the texts according to different principles. The Bulgarian part of the BulNC is automatically processed (tokenised and sentence split) and annotated at several levels: morphosyntactic tagging, lemmatisation, word-sense annotation, annotation of noun phrases and named entities. Some levels of annotation are also applied to the Bulgarian-English parallel corpus with the prospect of expanding multilingual annotation both in terms of linguistic levels and the number of languages for which it is available.
We conclude with a brief evaluation of the quality of the corpus and an outline of its applications in NLP and linguistic research.
Source: Journal of Language Modelling; 2012, 0, 1; 65-110
2299-856X
2299-8470
Appears in: Journal of Language Modelling
Content provider: Biblioteka Nauki

Article
