Using lexical language models to detect borrowings in monolingual wordlists

Miller, John E.; Tresoldi, Tiago; Zariquiey, Roberto; Castañón, César A. Beltrán; Morozova, Natalia; List, Johann-Mattis

doi:10.17613/m051-e049

Journal Article

MPS-Authors

Tresoldi, Tiago
CALC, Max Planck Institute for the Science of Human History, Max Planck Society;
Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Max Planck Society;

Morozova, Natalia
Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Max Planck Society;

List, Johann-Mattis
CALC, Max Planck Institute for the Science of Human History, Max Planck Society;
Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Max Planck Society;

External Resource

S1 Table
(Supplementary material)

S2 Table
(Supplementary material)

Fulltext (public)

shh2701.pdf
(Publisher version), 3MB

shh2701pre.pdf
(Preprint), 3MB

Supplementary Material (public)

There is no public supplementary material available

Citation

Miller, J. E., Tresoldi, T., Zariquiey, R., Castañón, C. A. B., Morozova, N., & List, J.-M. (2020). Using lexical language models to detect borrowings in monolingual wordlists. PLoS One, 15(12): e0242709. doi:10.17613/m051-e049.

Cite as: https://hdl.handle.net/21.11116/0000-0007-214E-D

Abstract

Native speakers are often assumed to be efficient in identifying whether a word in their language has been borrowed, even when they do not have direct knowledge of the donor language from which it was taken. To detect borrowings, speakers make use of various strategies, often in combination, relying on clues such as the semantics of the words in question, phonology, and phonotactics. Computationally, phonology and phonotactics can be modeled with Markov n-gram models or, as a more recent technique, recurrent neural network models. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 languages of large typological diversity, we use these models to conduct a series of experiments to investigate their performance in borrowing detection using only information from monolingual wordlists. Their performance is in many cases unsatisfactory, but becomes more promising for strata where there is a significant ratio of borrowings and when most borrowings originate from a dominant donor language. The recurrent neural network performs marginally better overall in both realistic studies and artificial experiments, and holds out the most promise for continued improvement and innovation in lexical borrowing detection. Phonology and phonotactics, as operationalized in our lexical language models, are only a part of the multiple clues speakers use to detect borrowings. While improving our current methods will result in better borrowing detection, what is needed are more integrated approaches that also take into account multilingual and cross-linguistic information for proper automated borrowing detection.
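
To make the idea of a Markov n-gram lexical language model concrete, the following is a minimal sketch and not the authors' implementation. It assumes a bigram model over sound segments with add-one smoothing, and a hypothetical decision rule that flags a word as borrowed when a model trained on known loans assigns it lower per-segment cross-entropy than a model trained on native words. The function names, the decision rule, and the toy wordlists are invented for illustration only.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"


def train_bigram(words):
    """Count segment bigrams over words given as lists of segments."""
    bigrams, contexts = Counter(), Counter()
    for segments in words:
        seq = [START] + list(segments) + [END]
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def cross_entropy(segments, model, vocab_size):
    """Mean negative log2 probability per transition (add-one smoothing)."""
    bigrams, contexts = model
    seq = [START] + list(segments) + [END]
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total -= math.log2(p)
    return total / (len(seq) - 1)


def looks_borrowed(segments, native_model, loan_model, vocab_size):
    """Hypothetical rule: borrowed if the loan model fits the word better."""
    return (cross_entropy(segments, loan_model, vocab_size)
            < cross_entropy(segments, native_model, vocab_size))


# Invented toy wordlists (one character per segment), not data from the study.
native = [list(w) for w in ["pata", "tapa", "kata", "paka"]]
loans = [list(w) for w in ["strim", "blog", "sprint"]]

# Shared smoothing vocabulary: all observed segments, plus END and one
# slot for unseen segments, so both models are smoothed comparably.
vocab_size = len({s for w in native + loans for s in w}) + 2

native_model = train_bigram(native)
loan_model = train_bigram(loans)
```

Both models share one smoothing vocabulary in this sketch; otherwise the model with the smaller segment inventory would assign systematically higher probabilities to unseen transitions and bias the comparison.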