Automated parsing of interlinear glossed text from page images of grammatical descriptions

Round, Erich; Macklin-Cordes, Jayden L.; Ellison, T. Mark; Beniamine, Sacha

doi:10.5281/zenodo.3550760


Round, E., Macklin-Cordes, J. L., Ellison, T. M., & Beniamine, S. (2020). Automated parsing of interlinear glossed text from page images of grammatical descriptions. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, et al. (Eds.), Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 2878-2883). Paris: European Language Resources Association (ELRA). doi:10.5281/zenodo.3550760.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/21.11116/0000-0006-6E7F-2
Version Permalink: https://hdl.handle.net/21.11116/0000-0006-6E80-E

Genre: Conference Paper

Files

shh2601.pdf (Publisher version), 1004KB

File Permalink:
https://hdl.handle.net/21.11116/0000-0006-6E81-D

Name:
shh2601.pdf

Description:
OA

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
-

Copyright Info:
-

License:
http://creativecommons.org/licenses/by/4.0/

Creators

Creators:
Round, Erich¹, Author
Macklin-Cordes, Jayden L., Author
Ellison, T. Mark, Author
Beniamine, Sacha¹, Author

Affiliations:
¹ Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Max Planck Society, ou_2074311

Content

Free keywords: Information Extraction, Information Retrieval, Less-Resourced/Endangered Languages, Morphology, Typological Databases

Abstract: Linguists seek insight from all human languages; however, accessing information from most of the full store of extant global linguistic descriptions is not easy. One of the most common kinds of information that linguists have documented is vernacular sentences, as recorded in descriptive grammars. Typically, these sentences are formatted as interlinear glossed text (IGT). Most descriptive grammars, however, exist only as hardcopy or scanned PDF documents. Consequently, parsing IGT in scanned grammars is a priority, in order to significantly increase the volume of documented linguistic information that is readily accessible. Here we demonstrate the fundamental viability of a technology that can assist in making a large number of linguistic data sources machine-readable: the automated identification and parsing of interlinear glossed text from scanned page images. For example, we attain high median precision and recall (>0.95) in the identification of example sentences in IGT format. Our results will be of interest to those who are keen to see more of the existing documentation of human language, especially for less-resourced and endangered languages, become more readily accessible.

Details

Language(s): eng - English

Dates: Published Online: 2020-05-19
Date issued: 2020

Publication Status: Issued

Pages: 6

Publishing info: -

Table of Contents: -

Rev. Type: -

Identifiers: DOI: 10.5281/zenodo.3550760
Other: shh2601

Degree: -

Event

Title: 12th Conference on Language Resources and Evaluation [postponed due to Corona]

Place of Event: Marseille

Start-/End Date: 2020-05-11 - 2020-05-16

Source 1

Title: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)

Source Genre: Proceedings

Creator(s):
Calzolari, Nicoletta, Editor
Béchet, Frédéric, Editor
Blache, Philippe, Editor
Choukri, Khalid, Editor
Cieri, Christopher, Editor
Declerck, Thierry, Editor
Goggi, Sara, Editor
Isahara, Hitoshi, Editor
Maegaard, Bente, Editor
Mariani, Joseph, Editor
Mazo, Hélène, Editor
Moreno, Asuncion, Editor
Odijk, Jan, Editor
Piperidis, Stelios, Editor

Affiliations:
-

Publ. Info: Paris : European Language Resources Association (ELRA)

Pages: 7251
Volume / Issue: -
Sequence Number: -
Start / End Page: 2878 - 2883
Identifier: ISBN: 979-10-95546-34-4