A study on Indo-European tree priors has appeared online. I applied three tree priors (FBD, Coalescent, and Uniform) to multiple Indo-European datasets and determined which tree prior is the best. Author copy is available here.
Presented "Towards identifying the optimal datasize for lexically-based Bayesian inference of linguistic phylogenies" at COLING 2018 (acceptance rate 37.4%) and a winning shared task paper at VarDial 2018.
Paper accepted at CoNLL 2018 (acceptance rate of 20.5%) on "Similarity dependent Chinese Restaurant Process for cognate detection"
Papers accepted at LOUHI 2018 and the Universal Dependencies workshop.
Participated in shared tasks at SMM4H and CoNLL-SIGMORPHON.
CFP for a special issue of the journal Computational Linguistics on "Computational approaches in historical linguistics after the quantitative turn", guest-edited by Taraka Rama, Simon J. Greenhill, Harald Hammarström, Gerhard Jäger, and Johann-Mattis List.
The Telugu treebank has been released and is available. The paper was presented at TLT 2018 and is available here.
My Google Scholar page is updated regularly. My CV is here. My list of publications is available here.
I defended my PhD thesis in Computational Historical Linguistics at the University of Gothenburg under the supervision of Lars Borin and Søren Wichmann. In my thesis, I investigated the hypothesis that the phoneme inventory sizes of languages decrease as one moves away from Africa, the time depth of language families, cognate identification, and the creation of a lexical database for more than 300 South Asian languages.
During my PhD, I worked as a research assistant at the Digital Areal Linguistics project headed by Anju Saxena and hosted at University of Uppsala. I was involved in investigating the branching structure of Dravidian languages using Maximum Parsimony.
I was a postdoctoral fellow in EVOLAEMP, headed by Gerhard Jäger, from November 2015 until August 2017 at the University of Tübingen. During my stay at Tübingen, I worked on automatic cognate identification (neural networks, Online PMI), dialect classification, discriminating between similar languages, native language identification, and dating of the Indo-European language family. I applied Siamese ConvNets for cognate identification using articulatory embeddings.
I have served as a reviewer for EACL, ACL, COLING, EMNLP, Information Processing and Management, Language Dynamics and Change, and PeerJ Computer Science.
Academics
Doctor of Philosophy in Natural Language Processing (University of Gothenburg)
Master of Technology in Computational Linguistics (IIIT-Hyderabad)
Bachelor of Technology in Information and Communication Technology (DAIICT)
Publications
Taraka Rama (2018) Three tree priors and five datasets: A study of Indo-European phylogenetics. Language Dynamics and Change, 8(2), pp.182-218.
Taraka Rama. Similarity dependent Chinese Restaurant Process for Cognate Identification in Multilingual Wordlists. Accepted at CoNLL 2018.
Aleksandrs Berdicevskis, Çağrı Çöltekin, Katharina Ehret, Kilu von Prince, Daniel Ross, Bill Thompson, Chunxiao Yan, Vera Demberg, Gary Lupyan, Taraka Rama and Christian Bentz. Using Universal Dependencies in cross-linguistic complexity research. Accepted at Universal Dependencies Workshop 2018.
Taraka Rama and Pål H. Brekke and Øystein Nytrø and Lilja Øvrelid. Iterative development of family history annotation guidelines using a synthetic corpus of clinical text. Accepted at LOUHI workshop at EMNLP 2018
Çağrı Çöltekin and Taraka Rama and Verena Blaschke (2018). Tübingen-Oslo team at the VarDial 2018 evaluation campaign: An analysis of n-gram features in language variety identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 55-65. 2018.
Taraka Rama and Søren Wichmann (2018) Towards identifying the optimal datasize for lexically-based Bayesian inference of linguistic phylogenies. Proceedings of COLING 2018, Santa Fe.
Çağrı Çöltekin and Taraka Rama (2018) Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs at Emoji Prediction at SemEval-2018 Shared Task on Multilingual Emoji prediction. Proceedings of The 12th International Workshop on Semantic Evaluation, NAACL 2018.
Taraka Rama, Johann-Mattis List, Johannes Wahle, and Gerhard Jäger (2018) Are Automatic Methods for Cognate Detection Good Enough for Phylogenetic Reconstruction in Historical Linguistics? Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)
Sowmya Vajjala and Taraka Rama (2018) Experiments with Universal CEFR Classification. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications.
Çağrı Çöltekin and Taraka Rama (2018) Exploiting Universal Dependencies Treebanks for Measuring Morphosyntactic Complexity. In Proceedings of the First Workshop on Measuring Language Complexity, pages 1–7, Torun, Poland.
Taraka Rama and Sowmya Vajjala (2017) A Dependency Treebank for Telugu. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, Prague, Jan. 23-24, 2018.
Søren Wichmann and Taraka Rama (2017) Jackknifing the black sheep: ASJP classification performance and Austronesian. For the proceedings of the symposium “Let’s talk about trees”, National Museum of Ethnology, Osaka, Febr. 9-10, 2013.
Roland Mühlenbernd and Taraka Rama (2017) What phoneme networks tell us about the age of language families. Journal of Language Evolution (2017): lzx007.
Çağrı Çöltekin and Taraka Rama (2017) Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing. Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 146–155.
Taraka Rama, Çağrı Çöltekin and Pavel Sofroniev (2017) Computational analysis of Gondi dialects. Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 26–35.
Taraka Rama and Çağrı Çöltekin (2017) Fewer features perform well at Native Language Identification task. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 255-260.
Taraka Rama and Çağrı Çöltekin (2016) LSTM Autoencoders for Dialect Analysis. Proceedings of Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), COLING 2016, Osaka, Japan, 2016.
Çağrı Çöltekin and Taraka Rama (2016) Discriminating Similar Languages with Linear SVMs and Neural Networks. Proceedings of Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), COLING 2016, Osaka, Japan, 2016.
Taraka Rama (2016) Siamese Convolutional Networks for Cognate Identification. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016). Osaka, Japan, 2016.
Taraka Rama (2016) Ancestry sampling for Indo-European phylogeny and dates. Capturing phylogenetic algorithms for linguistics, Leiden
Taraka Rama and Lars Borin (2015) Comparative evaluation of string similarity measures for automatic language classification. Ján Mačutek and George K. Mikros, editors, Sequences in Language and Text, pages 203-231. Walter de Gruyter.
Taraka Rama (2015) Automatic cognate identification with gap-weighted string subsequences. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Taraka Rama (2015) Gap-weighted subsequences for automatic cognate identification and phylogenetic inference. Unpublished.
Lars Borin, Anju Saxena, Taraka Rama, Bernard Comrie (2014) Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics. Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 3137-3144.
Taraka Rama (2013) Vocabulary lists in computational historical linguistics. Licentiate thesis. Opponent: Roman Yangarber
Taraka Rama and Lars Borin (2013) N-Gram Approaches to the Historical Dynamics of Basic Vocabulary. Journal of Quantitative Linguistics, 21 (1):50-64
Taraka Rama, Prasanth Kolachina, Sudheer Kolachina (2013) Two methods for automatic identification of cognates. Quantitative Investigations in Theoretical Linguistics, volume 5, pages 76-80.
Taraka Rama (2013) Phonotactic Diversity Predicts the Time Depth of the World's Language Families. PloS one, 8(5), p.e63238
Taraka Rama and Sudheer Kolachina (2013) Distance-based Phylogenetic Inference Algorithms in the Subgrouping of Dravidian Languages. Lars Borin and Anju Saxena, editors, Approaches to Measuring Linguistic Differences, pages 141--174. De Gruyter Mouton, Berlin. ISBN 978-3-11-030525-8.
Taraka Rama and Prasanth Kolachina (2012) How good are Typological Distances for determining Genealogical Relationships among Languages? Proceedings of the 24th International Conference on Computational Linguistics, pages 975--984.
Soeren Wichmann, Robert Walker, Taraka Rama and Eric W Holman (2011) Correlates of reticulation in linguistic phylogenies. Language Dynamics and Change, 1(2):205--240
Taraka Rama and Lars Borin (2011) Estimating language relationships from a parallel corpus. A study of the Europarl corpus. NEALT Proceedings Series (NODALIDA 2011 Conference Proceedings), volume 11, pages 161--167. http://hdl.handle.net/10062/17303
Sudheer Kolachina, Taraka Rama, Lakshmi Bai B. (2011) Maximum parsimony method in the subgrouping of Dravidian languages. Quantitative Investigations in Theoretical Linguistics, volume 4, pages 52--56, 2011.
Soeren Wichmann, Taraka Rama, Eric W Holman (2011) Phonological diversity, word length, and population sizes across languages: The ASJP evidence. Linguistic Typology, 15(2):177--197, 2011.
Anil Kumar Singh, Sethu Subramaniam, and Taraka Rama (2010) Transliteration as alignment vs. transliteration as generation for crosslingual information retrieval. Traitement Automatique des Langues, 51(2), 2010.
Taraka Rama and Anil Kumar Singh (2009) From Bag of Languages to Family Trees From Noisy Corpus. Proceedings of the Conference on Recent Advances in Natural Language Processing, pages 355--359, Borovets, Bulgaria, 2009
Taraka Rama, Anil Kumar Singh, Sudheer Kolachina (2009) Modeling letter-to-phoneme conversion as a phrase based statistical machine translation problem with minimum error rate training. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium, pages 90--95. Association for Computational Linguistics, 2009.
- Analytics Consultant at 24/7 Innovation labs
- Research assistant at Digital Areal Linguistics Project
- Visiting PhD student at the Max Planck Institute (EVA), Leipzig in the spring of 2012
- Guest Research Scholar, University of Groningen, 2013