Automated parsing of interlinear glossed text from page images of grammatical descriptions

Round, Erich; Macklin-Cordes, Jayden L.; Ellison, T. Mark; Beniamine, Sacha

doi:10.5281/zenodo.3550760

Conference Paper

MPG Authors

Round, Erich
Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Max Planck Society;

Beniamine, Sacha
Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Max Planck Society;

Full texts (freely accessible)

shh2601.pdf
(Publisher version), 1004 KB

Citation

Round, E., Macklin-Cordes, J. L., Ellison, T. M., & Beniamine, S. (2020). Automated parsing of interlinear glossed text from page images of grammatical descriptions. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, et al. (Eds.), Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 2878-2883). Paris: European Language Resources Association (ELRA). doi:10.5281/zenodo.3550760.

Citation link: https://hdl.handle.net/21.11116/0000-0006-6E7F-2

Abstract

Linguists seek insight from all human languages; however, accessing information from most of the full store of extant global linguistic descriptions is not easy. One of the most common kinds of information that linguists have documented is vernacular sentences, as recorded in descriptive grammars. Typically, these sentences are formatted as interlinear glossed text (IGT). Most descriptive grammars, however, exist only as hardcopy or scanned PDF documents. Consequently, parsing IGT in scanned grammars is a priority for significantly increasing the volume of documented linguistic information that is readily accessible. Here we demonstrate the fundamental viability of a technology that can assist in making a large number of linguistic data sources machine-readable: the automated identification and parsing of interlinear glossed text from scanned page images. For example, we attain high median precision and recall (>0.95) in the identification of example sentences in IGT format. Our results will be of interest to those who are keen to see more of the existing documentation of human language, especially for less-resourced and endangered languages, become more readily accessible.