Using lexical language models to detect borrowings in monolingual wordlists

Miller, John E.; Tresoldi, Tiago; Zariquiey, Roberto; Castañón, César A. Beltrán; Morozova, Natalia; List, Johann-Mattis

doi:10.17613/m051-e049


Miller, J. E., Tresoldi, T., Zariquiey, R., Castañón, C. A. B., Morozova, N., & List, J.-M. (2020). Using lexical language models to detect borrowings in monolingual wordlists. PLoS One, 15(12): e0242709. doi:10.17613/m051-e049.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/21.11116/0000-0007-214E-D
Version Permalink: https://hdl.handle.net/21.11116/0000-0007-A598-3

Genre: Journal Article

Files

shh2701.pdf (Publisher version), 3MB

File Permalink:
https://hdl.handle.net/21.11116/0000-0007-A599-2

Name:
shh2701.pdf

Description:
OA

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
-

Copyright Info:
-

License:
http://creativecommons.org/licenses/by/4.0/

shh2701pre.pdf (Preprint), 3MB

File Permalink:
https://hdl.handle.net/21.11116/0000-0007-A59A-1

Name:
shh2701pre.pdf

Description:
OA (HC)

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
-

Copyright Info:
-

License:
http://creativecommons.org/licenses/by/4.0/

Locators

Locator:
S1 Table (Supplementary material) Open Access status unknown

Description:
Detection results by language for seeded borrowings. - (last seen Jan. 2021)

OA-Status:

Locator:
S2 Table (Supplementary material) Open Access status unknown

Description:
Ten-fold cross validation of detection results by language for WOLD wordlists. - (last seen Jan. 2021)

OA-Status:

Creators

Creators:
Miller, John E., Author
Tresoldi, Tiago^{1, 2}, Author
Zariquiey, Roberto, Author
Castañón, César A. Beltrán, Author
Morozova, Natalia^{2}, Author
List, Johann-Mattis^{1, 2}, Author

Affiliations:
1 CALC, Max Planck Institute for the Science of Human History, Max Planck Society, ou_2385703
2 Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Max Planck Society, ou_2074311

Content

Free keywords: Language, Markov models, Semantics, Neural networks, Phonology, Memory recall, Evolutionary linguistics, Recurrent neural networks

Abstract: Native speakers are often assumed to be efficient in identifying whether a word in their language has been borrowed, even when they do not have direct knowledge of the donor language from which it was taken. To detect borrowings, speakers make use of various strategies, often in combination, relying on clues such as semantics of the words in question, phonology and phonotactics. Computationally, phonology and phonotactics can be modeled with support of Markov n-gram models or -- as a more recent technique -- recurrent neural network models. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 different languages of a large typological diversity, we use these models to conduct a series of experiments to investigate their performance in borrowing detection using only information from monolingual wordlists. Their performance is in many cases unsatisfying, but becomes more promising for strata where there is a significant ratio of borrowings and when most borrowings originate from a dominant donor language. The recurrent neural network performs marginally better overall in both realistic studies and artificial experiments, and holds out the most promise for continued improvement and innovation in lexical borrowing detection. Phonology and phonotactics, as operationalized in our lexical language models, are only a part of the multiple clues speakers use to detect borrowings. While improving our current methods will result in better borrowing detection, what is needed are more integrated approaches that also take into account multilingual and cross-linguistic information for a proper automated borrowing detection.
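As a rough illustration of the Markov n-gram idea mentioned in the abstract, the following minimal sketch trains one bigram model on known native words and one on known borrowings, then labels a new word by whichever model assigns it the lower average per-segment entropy. It is not the authors' implementation: the function names, the `#` boundary symbol, the add-one smoothing, and the toy wordlists are all assumptions made for the example.

```python
import math
from collections import Counter

BOUNDARY = "#"  # hypothetical word-boundary symbol, assumed for this sketch


def train_bigram(words):
    """Count segment bigrams and their left contexts over boundary-padded words."""
    bigrams, contexts, vocab = Counter(), Counter(), {BOUNDARY}
    for word in words:
        segs = [BOUNDARY, *word, BOUNDARY]
        vocab.update(segs)
        for prev, cur in zip(segs, segs[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def word_entropy(word, model):
    """Mean negative log2 probability per transition, using add-one smoothing."""
    bigrams, contexts, vocab = model
    segs = [BOUNDARY, *word, BOUNDARY]
    size = len(vocab) + 1  # one extra slot reserved for unseen segments
    total = 0.0
    for prev, cur in zip(segs, segs[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + size)
        total -= math.log2(prob)
    return total / (len(segs) - 1)


def is_borrowed(word, native_model, loan_model):
    """Label a word as borrowed if the loan model finds it less surprising."""
    return word_entropy(word, loan_model) < word_entropy(word, native_model)


# Toy data, invented for illustration only (not taken from WOLD).
native_model = train_bigram(["pata", "tapa", "kata", "paka", "taka"])
loan_model = train_bigram(["strim", "sprint", "trist"])
print(is_borrowed("strit", native_model, loan_model))
print(is_borrowed("kapa", native_model, loan_model))
```

Words are treated here as sequences of single characters; with real wordlists each word would more plausibly be a list of segmented sound symbols, which the same functions accept unchanged.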

Details

Language(s): eng - English

Dates: Published Online: 2020-12-09

Publication Status: Published online

Pages: 23

Publishing info: -

Table of Contents: Introduction
- Problem and motivation
- State of the art

Materials and methods
- Materials
- Lexical language models
-- Bag of sounds
-- Markov Model
-- Recurrent neural network
- Decision procedures
- Assessing detection performance
- Implementation
- Experiments and studies
- Detection of artificially seeded borrowings
- Borrowing detection on real language data
- Factors that influence borrowing detection
- Detecting borrowings from a single donor language
- Comparing entropy distributions

Discussion
- Artificially seeded borrowings
- Borrowing detection on real language data
- Factors influencing borrowing detection
- Detecting borrowings from a single donor language
- Comparing entropy distributions

Conclusion

Rev. Type: Peer

Identifiers: DOI: 10.17613/m051-e049
Other: shh2701

Degree: -

Project information

Project name : CALC

Grant ID : 715618

Funding program : Horizon 2020 (H2020)

Funding organization : European Commission (EC)

Source 1

Title: PLoS One

Source Genre: Journal

Creator(s):

Affiliations:

Publ. Info: San Francisco, CA : Public Library of Science

Pages: -
Volume / Issue: 15 (12)
Sequence Number: e0242709
Start / End Page: -
Identifier: ISSN: 1932-6203
CoNE: https://pure.mpg.de/cone/journals/resource/1000000000277850

Source 2

Title: Humanities Commons

Abbreviation : HC

Source Genre: Journal

Creator(s):

Affiliations:

Publ. Info: New York : Modern Language Association

Pages: -
Volume / Issue: -
Sequence Number: m051-e049
Start / End Page: -
Identifier: URN: https://hcommons.org/