Making genealogical language classifications available for phylogenetic analysis

One of the best-known types of non-independence between languages is caused by genealogical relationships due to descent froma commonancestor.These canbe represented by (more or less resolved and controversial) language family trees. In theory, one can argue that language families should be built through the strict application of the comparativemethod of historical linguistics, but in practice this is not always the case, and there are several proposed classifications of languages into language families, each with its own advantages and disadvantages. A major stumbling block shared by most of them is that they are relatively difficult to use with computational methods, and in particular with phylogenetics. This is due to their lack of standardization, coupled with the general non-availability of branch length information, which encapsulates the amount of evolution taking place on the family tree. In this paper I introduce a method (and its implementation in R) that converts the language classifications provided by four widely-used databases (Ethnologue, WALS, AUTOTYP and Glottolog) into * Many thanks to the authors of the databases used here for making their data freely available, to Balthasar Bickel for computing the AUTOTYP distance and agreeing to making it freely available with this paper, to Luke Maurits for clarifying their “genetic method,” to Michael Cysouw formakingme aware of an alternative specification of cross-database language identifiers, toHaraldHammarströmandSeánRoberts for discussions, toMichaelDunn for various bug reports, to Christian Bentz, Gerhard Jäger, and two anonymous reviewers for very thorough and helpful feedback on the manuscript. Thanks to the organizers of the NIAS-Lorentz workshop “Capturing Phylogenetic Algorithms for Linguistics,” 26–30 October 2015, Leiden, the Netherlands, and to the editors of the proceedings in Language Dynamics and Change. This work is part of a project funded by the NWO (Netherlands Organisation for Scientific Research) VIDI grant number 016.124.315.


Introduction
Languages are not independent.This is due to historical processes such as language contact and descent from a common ancestor, and it is crucial to take into account the various types of non-independence between languages (e.g., Ladd et al., 2015;Roberts and Winters, 2013).Probably the most important type of non-independence is due to shared ancestry (Campbell and Poser, 2008): the daughter languages descended from a mother (or proto-)language share characteristics that they inherited from this common ancestor.This type of similarity tends to decrease with the passage of time since separation and is known as "Galton's problem" (this applies more generally to cultural phenomena; Mace and Pagel, 1994).Related languages descending from a shared ancestor form a language family, usually represented as a tree where the attested, present-day or recent languages form the leaves (or terminal nodes) and the extinct, mostly unattested, languages are internal nodes.
The identification of these genetic relationships is a complex problem (Campbell and Poser, 2008;Bowern and Evans, 2014) where controversies abound, including on the status of the so-called "macro-families" and on the composition and internal structure of language families such as Indo-European; disagreements concern the languages that belong to the same family, the internal relationships between them-the tree topology-and the amount of change that separates nodes in the tree-the branch length.
Using such language classifications with modern quantitative methods raises a number of major issues, including (i) the fact that there are several such classifications available, (ii) the fact that these are often presented in a Language Dynamics and Change 8 (2018) 1-21 figure 1 Three language families composed of the same four languages (A, B, C and D) but with different structures (left vs. center) and branch length (center vs. right).Time flows downwards from the proto-language at the top (P3, P5 and P5 respectively) towards the attested languages at the bottom.For example, in the leftmost tree, languages A and B are more closely related than either of them is to language C. In the rightmost tree, language B has changed the least from its most recent common ancestor (P4) with languages A and B.
non-standard format, and (iii) the problem that methods requiring not only the topology but also the amount of change (such as most modern phylogenetic approaches) cannot be directly applied because, in general, branch length information is lacking (and for good reason, as we generally do not know how to estimate it).The work presented here attempts to offer a solution to these issues by giving the de facto standard Newick tree format (http://evolution .genetics.washington.edu/phylip/newicktree.html) representation of language family trees from several classifications, and by adding branch length information estimated using several methods.For a few large families (such as Indo-European, Austronesian, Bantu and Uto-Aztecan), the application of Bayesian phylogenetic methods to basic vocabulary cognacy data and the use of calibration points with known dates resulted in the availability of posterior samples of trees with branch length (e.g.Bouckaert et al., 2012;Dunn et al., 2011), but there are still debates concerning these methods and their results, and the vast majority of language families did not yet receive this treatment, making an approach such as the one introduced here necessary.In this paper I describe the data, the methods and the format in which these trees are available, everything being freely accessible on the GitHub repository https://github.com/ddediu/lgfam-newick, including the actual primary data (wherever allowed by their respective licenses), the R code (R Core Team, 2014), and the language family trees with branch length in Newick format.

Primary data
The main primary data is represented by the four most widely used language classifications.For each one I acquired the classification data in a format dependent on their export capabilities and converted it into Newick trees without branch length information, resulting in a set of language tree topologies.More precisely, each database needs a uniquely tailored approach because they use particular representations of the hierarchical relationships between languages, and my solution was to write a set of R (R Core Team, 2014) types and functions which extend R's own representation of phylogenetic trees (using the class phylo from package ape; Paradis et al., 2004) and allow the representation and manipulation of language family trees.Specifically, for each of the four language classifications, the data format and procedure were as follows: -Ethnologue (Lewis et al., 2014), denoted in the following as E: the language classification data is not directly available for download, but instead the website provides1 a webpage (http://www.ethnologue.com/browse/families) with the list of all the language families and hyperlinks to their own webpages, which were downloaded and automatically parsed to extract the tree structure of the family, the internal group names, the language names, and their ISO 639-3 codes.-World Atlas of Language Structures Online (Dryer and Haspelmath, 2013) or WALS, denoted as W: the entire database (containing the language names, codes, geographic coordinates and the values for more than 130 structural features) is freely available for download at http://wals.info/static/download/wals-language.csv.zipunder a Creative Commons license (CC BY-NC-ND 2.0 DE; http://creativecommons.org/licenses/by-nc-nd/2.0/de/deed .en),and I used only the fields containing the WALS, ISO 639-3 and Glottolog codes, the languages' names, their "genus" and "family," as this classification is flattened into a mostly three-(but sometimes four) level structure.-AUTOTYP (Nichols et al., 2013), denoted A: the family trees are freely available for download at http://www.autotyp.uzh.ch/available.html, and can be used and distributed provided that their source is clearly mentioned; the language families are in a format similar to WALS, with each language being listed with its names, the AUTOTYP LID, the Glottolog and the ISO 639-3 codes, as well as the tree given as the "stock," "mbranch," "sbranch," "ssbranch" and "lsbranch" names, each being a hierarchical level (with the "stock" being the highest, the language family), sometimes with missing intermediate levels.
Please note that, while this paper concerns particular versions of these resources, I will try to keep the GitHub repository updated and compatible with newer versions and releases.
The methods for inferring branch length implemented here (see below) can use either the tree topology directly, a numeric constant, or a distance matrix.While the framework and my actual implementation in R can handle any distance matrix, for this paper I have used the following 11 distances (which fall into five types), respectively based on: 1. vocabulary: (1) ASJP16 distance, 2. geography: (2) great-circle distance, 3. WALS: (3) Gower distance and (4) Euclidean distance, without (3a and 4a) and with (3b and 4b) missing data imputation, 4. AUTOTYP: (5) Gower distance with missing data, using only the variables with a single datapoint per language (this distance was computed by Balthasar Bickel), and 5. the tree topology: the "genetic method" of Maurits and Griffiths (2014) applied to the WALS (6), Ethnologue (7), Glottolog (8) and AUTOTYP (9) classifications.
(1) represents the distances between languages as given by  and Change 8 (2018) 1-21 tances between standardized short wordlists transcribed with a reduced set of symbols (Bakker et al., 2009).After processing and conversion (manual replacement of some non-ASCII characters in the language descriptors and the 26-character language identifiers exported by ASJP v2.1),I exported these as a 3932 × 3932 distance matrix (with no missing data) between languages identified by their ISO 639-3 codes.
(3) and ( 4) represent distances between languages computed on the feature values in the WALS typological database, using R's function daisy() (package cluster; Maechler et al., 2015), either method gower (3; each feature is standardized between 0 and 1 by subtracting the feature's minimum and dividing by its range; Gower, 1971) or euclidean (4; standard Euclidean distance on the feature space).However, there is a lot of missing data in the WALS database (85.1% of the cells), so I have computed these distances using per variable mode data imputation, resulting in the following four 2679 × 2679 distance matrices: Gower (48.9% missing data; 3a), Gower with imputation (no missing data; 3b), Euclidean (48.9% missing data; 4a), and Euclidean with imputation (no missing data; 4b).
( 5) is similar to (3) without missing data imputation but using the AUTOTYP typological database, resulting in a 2928 × 2928 distance matrix with 57.6% missing data.
Finally, (6) to ( 9) are distances between languages belonging to the same family, computed using the family tree topology as described in the "genetic method" of Maurits and Griffiths (2014):2 languages with n shared intermediate nodes on their path to the root have a distance d = M − ∑ n i=1 α i (where M is the maximum possible distance, and α is fixed at 0.69); I implemented it in R, and its application to each of the four classifications resulted in four distance matrices: MG2015 using WALS (2607 × 2607), Ethnologue (7492 × 7492), Glottolog (15772 × 15772), and AUTOTYP (2926 × 2926).

2.2
Unique identifiers across classifications The question of allocating unique persistent identifiers to linguistic entities is essential, and several schemes are currently in wider use.Relevant here are: ISO 639-3 codes (three letters, denoted in the following as i; http://www-01.sil.org/iso639-3),WALS codes (three letters, w; http://wals.info),AUTOTYP LIDs (numeric, a; http://www.autotyp.uzh.ch), and Glottocodes (alphanumeric, four letters followed by four digits, g; http://glottolog.org/glottolog/glottologinformation).The mapping between these schemes is not yet standardized.3Here I devise a flexible scheme for uniquely mapping linguistic entities between these four systems.Some databases provide a mapping between their primary identifier and some others: Ethnologue (primary: i, secondary: none), WALS (primary: w, secondary: i and g), AUTOTYP (primary: a, secondary: i and g), and Glottolog (primary: g, secondary: i), allowing the reciprocal mapping (not always unique) between these four systems.With this, the coding scheme (or "UULID," the Universally Unique Language IDentifier) is standardized as 'NAME ,' where the optional NAME represents a human-readable language name,4 followed by a SPACE and the four unique codes I (ISO 639-3), W (WALS), A (AUTOTYP) and G (Glottocode), where any or all of these can be missing (the empty string "") or have multiple values separated by "-".A few examples (taken from the WALS classification of the Indo-European family) are: 'German Zurich ' UULIDs and the mapping between the four coding schemes and linguistic entity names are freely available in the GitHub repository.

2.3
Representing language classifications as Newick trees The de facto standard Newick tree format5 is widely used in evolutionary biology, is almost universally imported and exported by software applications and libraries, and is very flexible, being able to accommodate rooted or unrooted trees, with our without leaf and internal node names, and with or without branch length information.In this format, subtrees are enclosed within parentheses "()" with the branch length optionally given as a number preceded by ":" dediu Language Dynamics and Change 8 (2018) 1-21 immediately after the branch, and the description is terminated by a semicolon ";".For example, the leftmost tree in Fig. 2 (where languages are the leaves or terminal nodes, the proto-languages or groups are the internal nodes, and for simplicity all branches are taken to have the same length of 1) can be represented as: -just the topology (structure): ((,),); -also showing leaf (terminal nodes) names: ((A,B),C); -also showing internal node (proto-languages, group) names (e.g., P1 is the name of the last common ancestor of A and B and immediately follows the "()" enclosing its descendants): ((A,B)P1,C)P2; -leaf nodes and branch length (a number separated by ":" from the node name; e.g. the branch from the last common ancestor of A and B to A has length 1): ((A:1,B:1),C:1); -finally, everything (leaf and internal node names, and branch length): Here, the Newick trees representing language classifications use UULIDs (see Section 2.2) as the leaf and internal node names.

2.4
Extracting the topologies of the language classifications The collection of the language classification from each of the four databases is met by specific challenges due to the particular representation of the genetic relationships between languages, and the actual format(s) in which these are available.I provide a set of R (R Core Team, 2014) classes and functions that extend the standard representation of phylogenetic trees as objects of class phylo (package ape; Paradis et al., 2004) and standardize the extraction of language family tree topologies, their conversion to the common Newick format, and its export to and import from file.
The extraction of standardized tree topologies from these diverse formats is based on the maintenance of a forest of partially built language family trees, which are updated by adding new full paths from the proto-languages to their daughter languages.More precisely, given a path between an internal node and a leaf (e.g., "Indo-European → Germanic → North-West Germanic → English"), the method attempts to identify an already existing partial tree that contains the origin of the path (here it would match an existing partial Indo-European tree) and to add the path to the tree, building the whole forest of all language family trees simultaneously from the ground up (see Fig. 2 for an example).
As a side note, a frequent issue that arises when using language classification data with R's ape package is the mishandling of so-called "single nodes" (i.e., internal nodes that have a single descendant in the tree, such as node P1 in the mid (degenerated) tree in Fig. 2).To address it, I have re-implemented ape's collapse.nodes()function in such a way that it can correctly handle these cases.

2.5
Adding branch length to language classifications In general, branch length represents the amount of evolutionary change that took place on a particular branch in the tree and must be interpreted in relative terms, unless the tree has been dated using absolute calibration points derived from other sources of data (Felsenstein, 2004;Bouckaert et al., 2012).I have added branch length information to the given tree topology T of a language classification using six methods falling into three broad classes: 1. methods that use the tree topology (and possibly a constant k > 0) to generate branch lengths: (1) constant, (2) proportional and (3) grafen, 2. a method that uses a distance matrix to generate the tree topology with branch lengths: (4) nj, and 3. methods that map a distance matrix onto the topology: (5) nnls and ( 6) ga.Method (1) computes branch lengths such that for every path from the root to the leaves, the branch lengths add up to a given constant k (i.e., the same amount of evolution k has happened on all paths from the root to the terminal nodes) by defining the minimum branch length brlen min = k/(number of levels in the tree) and allocating to each branch, starting from the root, a length of brlen min , making sure that the terminal nodes have a total path length of k (thus, "telescoping" them if necessary to "accommodate" any remaining path length longer than brlen min ).For example, the leftmost tree in Fig. 2 with k = 1.0 becomes ((A:0.667,B:0.667)P1:0.333,C:1)P2;.6 Method (2) simply forces each branch to the same length k (i.e., the amount of evolution is proportional to the number of splits on the path), resulting in ((A:1,B:1)P1:1,C:1)P2;.
(3) reimplements Grafen's (1989) method, where each internal node is first given a "height" defined as the number of leaves in its subtree minus 1 (leaves get a "height" of 0), and the difference between the heights of the lower and the upper nodes defines the branch length: ((A:1,B:1)P1:1,C:2)P2;.
Method (4) is the classic "Neighbor-Joining" (or NJ) method (Saitou and Nei, 1987), which iteratively joins taxa into higer groupings.Given a family tree T and a distance matrix D between (not necessarily all of) the languages in T, NJ (implemented by R's function njs() in package ape; Paradis et al., 2004) constructs the corresponding phylogenetic tree with branch lengths (thus the actual topology in T is discarded, as only its set of languages is used).For example, for the languages in Fig. 2, consider the distance matrix: which approximates the distances between the three languages in the rightmost tree (assuming method (1) with k = 2. 0), we obtain the NJ tree (C:3, B:1.2, A:0.9);-it is clear that NJ does not care about the structure (topology) of the original tree and might very well produce a very different topology.
Language Dynamics and Change 8 (2018) 1-21 Methods ( 5) and ( 6) make use of both the given tree topology T and the distance matrix D by estimating branch lengths on T that approximate as closely as possible the original distances in D, in the sense that the distance matrix between the languages obtained by recording the minimum path lengths separating any two languages in the tree, D ′ , is very similar to D. While these two methods have similar goals and produce very similar results, method (5) may be less robust than method (6) especially when the tree topology is complex, but method ( 6) is much slower, especially for very large trees, and might produce non-unique (but similar) solutions.
Specifically, method (5) estimates the branch lengths using the non-negative least squares (NNLS) approach implemented by R's function nnls.tree()(package phangorn; Schliep, 2011).The fundamental idea is that, for a given distance matrix D and a tree topology T with n branches, the method estimates the set of n branch lengths for T, b 1 , b 2 , … b n , such that the resulting patristic distance matrix7 D ′ best approximates the original distances D in the sense of minimizing the sum of squared errors (SSE), using least squares.8Here it produces the tree ((A:1.05,B:1.05)P1:0.975,C:2.02)P2;.
Method ( 6) is an original proposal that uses a standard genetic algorithm (R's function ga() in package GA; Scrucca, 2013) to estimate the branch length.With the notations above, I defined the "genome" as being composed of n real-valued "genes" G = (g 1 , g 2 , …, g n ) representing branch lengths, and the "fitness function" given by the SSE (sum of squared errors) between the original distances D and the current distances D ′ , computed on topology T with the branch lengths g 1 , g 2 , …, . The genetic algorithm searches for the best solution G * = (g * 1 , g * 2 , …, g * n ) that minimizes the fitness function f(T, D, G), using a population of 100 individuals for a maximum of 10,000 iterations (the search can stop earlier if the fitness didn't change for 100 consecutive iterations).A set of possible "best" trees could be:9 ((A:0.9,B:1.2)P1:1.41,C:1.59)P2;, ((A:0.9,B:1.2)P1: 1.75, C:1.25)P2;, or ((A:0.9,B:1.2)P1:1.73,C:1.27)P2;.dediu Language Dynamics and Change 8 (2018) 1-21 An important question concerns the robustness of these branch-length inference methods against violations of the conditions on D for being a true distance matrix: a the diagonal is zero: the matrix is symmetric: and  d the triangle inequality is satisfied: In principle, NJ and NNLS require a true distance matrix, but a closer examination of their implementation suggests that they might be robust against violations; by contrast, GA has no such requirements.To test this, I generated a set of four matrices (based on the test distance matrix used here, D) that violate each of the conditions in turn, as well as one matrix that violates all of them simultaneously.I then ran the methods NJ, NNLS and GA on them, observing that none of them crashed, nor did they fail to produce an output tree with branch length, and these trees do make sense given the violations.
Another question concerns the values of GA's parameters (population size N, maximum number of generations N g , and number of generations without fitness improvement for premature stopping N const g ), given that they affect the probability of finding the optimal solution(s), especially for complex trees, but also the computational costs of this search.Therefore, I compared the "standard" values N = 10000, N g = 100, and N const g = 100 with a "thorough" set N = 50000, N g = 150, and N const g = 200.When run on a compute cluster using a dedicated CPU per classification, the first required about 10 "wall clock" days to complete, while the second had to be stopped after 52 days (when all trees except one had converged).Despite this difference in computational costs, the computed branch lengths are very similar (across classifications and families: median Pearson's r = 1.00, median Euclidean distance d = 0. 02), as are the optimal fitness values (r = 0. 93, p < 2. 2 ⋅ 10 −16 , paired t-test t(3182) = 1.52, p = 0. 13).The number of generations required to stop is highly correlated (r = 0. 89, p < 2. 2 ⋅ 10 −16 ) but significantly higher for the "thorough" condition (mean difference 1174.6,paired t-test t(3182) = 20.14, p = 5. 2 ⋅ 10 −85 ).Thus, I can conclude that the "normal" GA settings used here strike a good balance between computational efficiency and probability of converging to the optimal solution(s).Change 8 (2018) 1-21   table 1 Summaries of the successfully harvested and exported language family tree topologies (i.e., no branch length information) per database; while the first two rows refer to the number of trees and leaf nodes in the whole database, the last five rows refer to the leaf nodes and levels per family tree.

Results
Table 1 gives some summaries concerning the successfully harvested and exported language family trees available in the GitHub repository.
As detailed in the methods, for each of the four databases (Ethnologue, WALS, AUTOTYP and Glottolog) I applied each of the six methods of branch length estimation (constant, proportional, grafen, nj, nnls, ga).For the last three, there was a choice of 11 distance matrices (asjp16, great circle, wals (gower), wals (gower with imputation), wals (euclidean), wals (euclidean with imputation), autotyp (gower), Maurits and Griffiths (2014)'s "genetic method" mg2015 (on wals), mg2015 (on ethnologue), mg2015 (on glottolog), and mg2015 (autotyp); the last four being applied only to the corresponding database, i.e., there is no mg2015 (wals) applied to the Ethnologue trees), and each of these trees with branch length information was saved in the Newick format as part of Nexus (Maddison et al., 1997) files.Table 2 gives various summaries about these trees with branch length; please note that the number of trees differs between databases and that, within a database, the number of languages might differ by method due to the inherent missing data in the method's parameters and/or the incomplete overlap between the data in the method's parameters and the languages in the classification.sis to test whether the results remain similar enough across language classifications (tree topologies) and estimates of the amount of evolution (branch lengths).For many languages, data is simply not available or very restricted, often resulting in distance matrices with a very high proportion of missing data.While in some cases one may arguably use some form of data imputation, in others this is not warranted, as there are no good models available and/or the missing data patterns are non-random.However, even in such cases, a subset of families containing only a subset of languages may allow better inferences than the use of only a few large families of unknown representativeness.
The work presented in this paper is intended as an answer to these desiderata.It provides a collection of language families as phylogenetic trees in the de facto standard Newick format, free of charge and directly importable into the majority of modern phylogenetic software.Additional optional features are branch length information derived from a multitude of sources including the tree's own topology and inter-language distance matrices derived from various typological databases.Moreover, I also provide, under a liberal open source license, the actual R code (R Core Team, 2014) that loads, adds branch length information and exports the trees.This allows thus the user to consider new sources of information on the amount of evolution (e.g., from human genetic data, actual road distance, or linguistic typological databases) or new ways to map such external sources of information onto language family trees.
I hope that the method described here, the associated computer code, and the resulting language family trees will help promote quantitative approaches to problems in linguistic typology, language history and evolution, and even in the wider field of cultural evolution.

Supplementary material
The complete R code for extracting the language family tree topologies from these four databases, converting them to the Newick/Nexus format using the cross-database Universally Unique Language Identifiers (UULIDs), and for exporting and importing this format from file, as well as for computing the distance matrices described here, is freely available under a GPLv2 license (http:// www.gnu.org/licenses/old-licenses/gpl-2.0.en.html) in the GitHub repository https://github.com/ddediu/lgfam-newick,also containing the resulting language family trees with the various branch length estimates.This repository contains more information about the data, the various license terms, as well as details about the results.Their use is encouraged, and bug reports and suggestions are welcome using the GitHub repository's ticketing mechanism or by directly e-mailing the author.

Language
figure 2 Building the family trees by adding new paths, here adding the path P2 → P1 → D to the left-hand tree, results in the right-hand tree because the algorithm recognizes that the path P1 → P2 is already present in the partial left-hand tree.

table 2
Summaries of the language family tree topologies (as listed in Table1) with branch length information successfully added; an asterisk (*) indicates that missing data mode imputation was used.