February 3, 2008
Building A Semitic Tree
I'm thinking of attempting to construct a phylogenetic tree for a set of Semitic languages using Markov chain Monte Carlo methods. This is the same statistical approach that Atkinson et al. used in the paper I discussed yesterday. It is also a method commonly used by biologists to build phylogenetic trees. While I am very open to suggestions, I think the basic dataset should include cognate words from the Swadesh lists for Hebrew (Biblical), Aramaic (Achaemenid), Phoenician (excluding Punic), Akkadian (Neo-Assyrian), Arabic (Classical) and Ugaritic. Other languages and dialects would be added as the project developed.
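For readers unfamiliar with Markov chain Monte Carlo, here is a toy sketch of the core idea. It is not the method Atkinson et al. used and it builds no tree. It only samples the "distance" between two languages, assuming the vocabulary has already been coded as 0/1 characters and that each character flips under a simple symmetric two-state model. The counts (200 characters, 30 of which differ), the exponential prior, and the function names are all made-up assumptions for illustration.

```python
import math
import random


def log_posterior(t, n_chars, n_diff, prior_rate=1.0):
    """Unnormalized log posterior of the branch length t between two languages."""
    if t <= 0:
        return float("-inf")
    # Symmetric two-state model: probability that a 0/1 character differs
    # between the two languages after total distance t.
    p_diff = -0.5 * math.expm1(-2.0 * t)
    log_lik = n_diff * math.log(p_diff) + (n_chars - n_diff) * math.log(1.0 - p_diff)
    log_prior = -prior_rate * t  # exponential prior: an arbitrary assumption
    return log_lik + log_prior


def metropolis(n_chars, n_diff, steps=20000, step_size=0.05, seed=1):
    """Random-walk Metropolis sampler for t; returns samples after burn-in."""
    rng = random.Random(seed)
    t = 0.5  # arbitrary starting value
    current = log_posterior(t, n_chars, n_diff)
    samples = []
    for _ in range(steps):
        proposal = t + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, n_chars, n_diff)
        # Accept the proposal with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            t, current = proposal, candidate
        samples.append(t)
    return samples[steps // 4:]  # drop the first quarter as burn-in


# Hypothetical counts, not real data.
samples = metropolis(n_chars=200, n_diff=30)
posterior_mean = sum(samples) / len(samples)
print(f"posterior mean distance: {posterior_mean:.3f}")
```

A real analysis samples whole trees (topology plus all branch lengths) rather than a single number, but the accept-or-reject step is the same in spirit.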
The Swadesh list for any language is a very basic vocabulary of about 200 words that do not tend to be borrowed between languages. Swadesh lists exist for Hebrew and Arabic, but I'm not sure that they have been constructed for the other languages. Even if the lists already exist, a lot of work is necessary to prepare the database for processing. Every cognate set must be reduced to a binary code that represents the presence or absence of that specific cognate in each language. Exactly how one does this is not yet clear to me. Then the data is processed a few hundred thousand times. I think that is the easy part. The resulting trees must then be analyzed in a number of ways, including branch length calculations. Finally, the results need to be published.
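One plausible way to do the binary coding, offered as a sketch rather than an established procedure: treat every cognate class within a meaning as its own character, and mark each language 1 if its word belongs to that class, 0 if it does not, and "?" if the word is unknown. The language names (L1, L2, L3), the class labels, and the function name below are hypothetical placeholders, not real cognacy judgments.

```python
def encode_binary(judgments, languages):
    """Turn cognate-class judgments into one 0/1 character per cognate class.

    judgments maps meaning -> {language: cognate-class label}.
    Returns (character_names, matrix), where matrix[language] is a list of
    1 (has this cognate), 0 (lacks it) or None (no data for that meaning).
    """
    characters = []
    matrix = {lang: [] for lang in languages}
    for meaning in sorted(judgments):
        by_lang = judgments[meaning]
        for cls in sorted(set(by_lang.values())):
            characters.append(f"{meaning}:{cls}")
            for lang in languages:
                if lang not in by_lang:
                    matrix[lang].append(None)  # missing word: unknown state
                else:
                    matrix[lang].append(1 if by_lang[lang] == cls else 0)
    return characters, matrix


# Invented placeholder judgments: the labels "a" and "b" mean nothing beyond
# "same class" versus "different class" within one meaning.
judgments = {
    "water": {"L1": "a", "L2": "a", "L3": "a"},
    "stone": {"L1": "a", "L2": "b", "L3": "a"},
    "big": {"L1": "a", "L2": "b"},  # no word recorded for L3
}
languages = ["L1", "L2", "L3"]

characters, matrix = encode_binary(judgments, languages)
print(" ".join(characters))
for lang in languages:
    row = "".join("?" if v is None else str(v) for v in matrix[lang])
    print(lang, row)
```

This sketch allows only one cognate class per language per meaning. Synonyms, loanwords, and doubtful cognates would each need an explicit coding decision, and that is where most of the judgment calls lie.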
This is a lot of work. I'm looking for maybe three or four colleagues who would like to share both the work and the glory. It would also be good to have an evolutionary biologist who had actually worked with this kind of data set on the team. I'm thinking of contacting one of the biologists that worked on the Atkinson paper. If you are interested in working on this project, please leave a comment or send me an email.
Posted by Duane Smith at February 3, 2008 8:42 AM
Comments
You will find that adding an outlier will improve your result.
More work.
Posted by: Gary Hurd at February 3, 2008 10:39 AM
Gary,
Thanks for the suggestion. I am sure you are correct. I'm thinking that Ethiopic might serve as an outlier but Egyptian might be even better. In the case of Egyptian, a few of the pronouns and a few, but very few, other words in the Swadesh list are "cognate."
Posted by: Duane at February 3, 2008 11:25 AM
Duane,
This sounds a lot like glottochronology (a.k.a. lexicostatistics). Before you put the time into this, you should check out the many, many criticisms of Swadesh's assumptions. This approach for establishing genetic relationships for languages is rarely used nowadays, and for good reason. A good basic introduction and critique is available in Lyle Campbell's textbook, Historical Linguistics: An Introduction (2nd ed), chapter 6 ("Linguistic Classification").
Robert
Posted by: Robert Holmstedt at February 3, 2008 1:06 PM
Robert,
Thank you for your remarks and the reference. While I am aware that lexicostatistics is currently out of favor (and for good reasons) and of some criticisms of Swadesh's assumptions, I am not aware of Campbell's specific concerns. I will read his work with care. On the other hand, I cannot find any place where Markov chain Monte Carlo methods have been used to build relational trees for Semitic languages. That doesn't mean that such a project would produce fruit. It might turn out as unproductive as other such attempts. I have an even greater worry. I worry that the nature of Semitic cognates and their ubiquity may make such an attempt unsuccessful even as a theoretical experiment with Monte Carlo methods, much less as a practical method for teasing out relationships between languages.
Please notice that I started my post with the words, "I'm thinking of" and while I did ask if there were those who might want to work on such a project, those words ("I'm thinking of") were the most important words in the whole post.
I also need to look at
Q. D. Atkinson, R. D. Gray, in Phylogenetic methods and the prehistory of languages, J. Clackson, P. Forster, C. Renfrew, Eds. (MacDonald Institute for Archaeological Research, Cambridge, 2006), pp. 91-109.
and
S. Embleton, Statistics in Historical Linguistics (Brockmeyer, Bochum, 1986).
Posted by: Duane at February 3, 2008 1:47 PM
Duane, it will be a long project. Good luck.
Posted by: Aydin at February 3, 2008 3:57 PM
I echo Robert Holmstedt's piece of wisdom. This is why I made my rant, "Language waves and the satem innovation in PIE." Looking at language change as a series of "isogloss waves" is much more natural and better explains the processes of areal influence, multilingual interference and convergence. Indo-Europeanists have long known that the so-called "Centum-Satem" split can no longer be seen as a split because there are certain satem dialects that share features with certain centum dialects but not others. This suggests that centum and satem dialects developed side by side and there simply was no clear "split". We can only approximate the date of the fragmentation of Proto-Indo-European based on what we subjectively feel is the point at which dialects became sufficiently "mutually unintelligible". So some might say it was 4000 BCE, some 4500, and in a sick way, they're both correct answers. It's inevitably vague for those with a classification obsession ;) And the same concepts hold for Proto-Semitic and other proto-languages, of course.
Phylogenetics is in fact a false analogy to linguistic change when you think about it. Distinct species are just sets of animals with mutually incompatible genomes. A combination of their genetics would result in something that's simply non-viable and therefore there is no species-crossover. This however never happens in linguistics. There is no such thing as "mutually incompatible languages" in the sense that every language can converge with every other language given the right conditions. So, it's really a horrible way of getting a handle on diachronic linguistics, I'm afraid.
Posted by: Glen Gordon at February 3, 2008 5:59 PM
And just in case my own blog explorations on language waves seem too hokey for more cultured, erudite tastes, perhaps this link might be of some use to anyone intoxicated by the misapplications of computer programmers in comparative linguistics: McMahon/McMahon, Language Classification by Numbers (2005), p.199: "The key problem with approaches to dating linguistic 'events' is that even careful and judicious analyses of the sort seen in Gray and Atkinson (2003) are prone to essential and underlying difficulties, such that the results obtained there for Indo-European, for instance, cannot appropriately or confidently be generalized to other families. [...] In contact languages, like pidgins, creoles, mixed languages, perhaps dying languages, and languages in convergence areas, these assumptions are arguably invalid all the way along the line." Oh oh. Danger, Will Robinson, danger! Hehe :)
Posted by: Glen Gordon at February 3, 2008 6:32 PM
It looks like I've generated more controversy than I intended. First, I agree completely that even approximate absolute dating is impossible using glottochronology by any other name. This is a long-dead pipe dream. Further, I agree that it is impossible to say much, if anything, about the "proto-" languages using these techniques. However, I do believe that the results of Atkinson et al. are consistent with some aspects of modern linguistic theory; more on this later. The best one can hope for is some insight into the relative relationship between the languages. It is clear that some of these languages enjoyed a long history of interaction: witness the definite article in Hebrew and Phoenician. By the way, I think the article was borrowed by Phoenician, but any such hypothesis would be hard to support.
Finally, I do not intend to start on this project without a dedicated team and, more importantly, a rather complete literature review, including McMahon, which I have not even begun.
Posted by: Duane at February 3, 2008 7:36 PM
I used MDS and clustering over 30 years ago to sort the relationships between professions and familiarity and use of various words. The data came from two Yucatec Maya villages, and the words were related to pottery and pottery production. The math methods currently called "genomic" are far from new- they are old news in linguistics.
There are serious objections to using these methods to assign dates to divergent usage, agreed. But there are still some very interesting results to be had.
Posted by: Gary Hurd at February 3, 2008 7:36 PM
Hi Duane - I wondered if you know of this blog - http://www.balashon.com/ - the Hebrew detective, just a curious place on language relationships. I am in computers and data design, by the way, but I don't think I could help much with any of these big problems. Besides my infancy in languages, the problems look intractable to me until we invent time travel.
Posted by: Bob MacDonald at February 4, 2008 8:22 AM
Duane,
Patrick Bennett's Comparative Semitic Linguistics: A Manual (Eisenbrauns, 1998) includes a somewhat nuanced introduction to Lexicostatistics with several wordlists at the end. It may be a good starting point if you are interested in playing with the data.
Pete
Posted by: Peter Bekins at February 5, 2008 6:01 AM
Bob,
Yes, I do know Balashon. There is a lot of good stuff there.
Pete,
Bennett's book is about two feet out of my easy reach. I consult it often. The words in his word lists are in general those that one would find in a typical Swadesh list. However, without work they are not machine readable and have a few things missing. You are correct; his work is a very good place to start.
But the real work is in coding the lists and making a series of judgments that are required to control for some of the problems in using such lists.
Duane
Posted by: Duane at February 5, 2008 8:40 AM
Just a note that your link "Markov chain Monte Carlo methods" which I presume should be pointing to http://www.evolution.rdg.ac.uk is pointing incorrectly to "http://www.telecomtally.com/blog/www.evolution.rdg.ac.uk". (Perhaps you forgot "http://" in your url and so then it references back to your site?)
Posted by: Glen Gordon at February 5, 2008 9:11 PM
Glen,
You are correct. Thanks. I'm not sure how it got that way but I fixed it.
Posted by: Duane at February 5, 2008 9:30 PM