Taraka Rama

tarakark [at]


Since September 2017, I have been working as a post-doctoral fellow in the Language Technology Group at the University of Oslo. I am currently working on the BIGMED project, which analyzes Norwegian patient records.

Call for papers for a special issue of the journal Computational Linguistics on "Computational approaches in historical linguistics after the quantitative turn", guest-edited by Taraka Rama, Simon J. Greenhill, Harald Hammarström, Gerhard Jäger, and Johann-Mattis List.

I am involved in developing a manually annotated Universal Dependencies treebank for Telugu together with Sowmya Vajjala. The material for annotation comes from A Grammar of Modern Telugu. The treebank has been released and is available.

About me

My Google Scholar page is updated regularly. My CV is here. My list of publications is available here.

I defended my PhD thesis in Computational Historical Linguistics at the University of Gothenburg under the supervision of Lars Borin and Søren Wichmann. In my thesis, I investigated the hypothesis that the phoneme inventory sizes of languages decrease as one moves away from Africa, the time depth of language families, cognate identification, and the creation of a lexical database for more than 300 South Asian languages.

During my PhD, I worked as a research assistant on the Digital Areal Linguistics project, headed by Anju Saxena and hosted at the University of Uppsala. I was involved in investigating the branching structure of the Dravidian languages using Maximum Parsimony.

I was a post-doctoral fellow in EVOLAEMP, headed by Gerhard Jäger, at the University of Tübingen from November 2015 until August 2017. During my stay in Tübingen, I worked on automatic cognate identification (neural networks, Online PMI), dialect classification, discriminating between similar languages, native language identification, and dating of the Indo-European language family. I applied Siamese ConvNets to cognate identification using articulatory embeddings.

I have served as a reviewer for EACL, ACL, COLING, Language Dynamics and Change, and PeerJ Computer Science.


  • Doctor of Philosophy in Natural Language Processing (University of Gothenburg)
  • Master of Technology in Computational Linguistics (IIIT-Hyderabad)
  • Bachelor of Technology in Information and Communication Technology (DAIICT)


  • Çağrı Çöltekin and Taraka Rama (2018) Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs at Emoji Prediction. SemEval-2018 Shared Task on Multilingual Emoji Prediction. Proceedings of the 12th International Workshop on Semantic Evaluation, NAACL 2018.
  • Taraka Rama, Johann-Mattis List, Johannes Wahle, and Gerhard Jäger (2018) Are Automatic Methods for Cognate Detection Good Enough for Phylogenetic Reconstruction in Historical Linguistics? Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)
  • Sowmya Vajjala and Taraka Rama (2018) Experiments with Universal CEFR Classification. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications.
  • Çağrı Çöltekin and Taraka Rama (2018) Exploiting Universal Dependencies Treebanks for Measuring Morphosyntactic Complexity. In Proceedings of the First Workshop on Measuring Language Complexity, pages 1–7, Torun, Poland.
  • Taraka Rama and Sowmya Vajjala (2017) A Dependency Treebank for Telugu. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, Prague, Jan. 23-24, 2018.
  • Søren Wichmann and Taraka Rama (2017) Jackknifing the black sheep: ASJP classification performance and Austronesian. For the proceedings of the symposium “Let’s talk about trees”, National Museum of Ethnology, Osaka, Febr. 9-10, 2013.
  • Roland Mühlenbernd and Taraka Rama (2017) What phoneme networks tell us about the age of language families. Journal of Language Evolution (2017): lzx007.
  • Çağrı Çöltekin and Taraka Rama (2017) Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing. Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 146–155.
  • Taraka Rama, Çağrı Çöltekin and Pavel Sofroniev (2017) Computational analysis of Gondi dialects. Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 26–35.
  • Taraka Rama and Çağrı Çöltekin (2017) Fewer features perform well at Native Language Identification task. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 255-260.
  • Taraka Rama and Çağrı Çöltekin (2016) LSTM Autoencoders for Dialect Analysis. Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), COLING 2016, Osaka, Japan, 2016.
  • Çağrı Çöltekin and Taraka Rama (2016) Discriminating Similar Languages with Linear SVMs and Neural Networks. Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), COLING 2016, Osaka, Japan, 2016.
  • Taraka Rama (2016) Siamese Convolutional Networks for Cognate Identification. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016). Osaka, Japan, 2016.
  • Taraka Rama (2016) Ancestry sampling for Indo-European phylogeny and dates. Capturing phylogenetic algorithms for linguistics, Leiden
  • Taraka Rama and Lars Borin (2015) Comparative evaluation of string similarity measures for automatic language classification. Ján Mačutek and George K. Mikros, editors, Sequences in Language and Text, pages 203-231. Walter de Gruyter.
  • Taraka Rama (2015) Automatic cognate identification with gap-weighted string subsequences. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Taraka Rama (2015) Gap-weighted subsequences for automatic cognate identification and phylogenetic inference. Unpublished.
  • Lars Borin, Anju Saxena, Taraka Rama, Bernard Comrie (2014) Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics. Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 3137-3144.
  • Taraka Rama (2013) Vocabulary lists in computational historical linguistics. Licentiate thesis. Opponent: Roman Yangarber
  • Taraka Rama and Lars Borin (2013) N-Gram Approaches to the Historical Dynamics of Basic Vocabulary. Journal of Quantitative Linguistics, 21 (1):50-64
  • Taraka Rama, Prasanth Kolachina, Sudheer Kolachina (2013) Two methods for automatic identification of cognates. Quantitative Investigations in Theoretical Linguistics, volume 5, pages 76-80.
  • Taraka Rama (2013) Phonotactic Diversity Predicts the Time Depth of the World's Language Families. PloS one, 8(5), p.e63238
  • Taraka Rama and Sudheer Kolachina (2013) Distance-based Phylogenetic Inference Algorithms in the Subgrouping of Dravidian Languages. Lars Borin and Anju Saxena, editors, Approaches to Measuring Linguistic Differences, pages 141-174. De Gruyter Mouton, Berlin. ISBN 978-3-11-030525-8.
  • Taraka Rama and Prasanth Kolachina (2012) How good are Typological Distances for determining Genealogical Relationships among Languages? Proceedings of the 24th International Conference on Computational Linguistics, pages 975-984.
  • Søren Wichmann, Robert Walker, Taraka Rama and Eric W Holman (2011) Correlates of reticulation in linguistic phylogenies. Language Dynamics and Change, 1(2):205-240.
  • Taraka Rama and Lars Borin (2011) Estimating language relationships from a parallel corpus. A study of the Europarl corpus. NEALT Proceedings Series (NODALIDA 2011 Conference Proceedings), volume 11, pages 161-167.
  • Sudheer Kolachina, Taraka Rama, Lakshmi Bai B. (2011) Maximum parsimony method in the subgrouping of Dravidian languages. Quantitative Investigations in Theoretical Linguistics, volume 4, pages 52-56.
  • Søren Wichmann, Taraka Rama, Eric W Holman (2011) Phonological diversity, word length, and population sizes across languages: The ASJP evidence. Linguistic Typology, 15(2):177-197.
  • Anil Kumar Singh, Sethu Subramaniam, and Taraka Rama (2010) Transliteration as alignment vs. transliteration as generation for crosslingual information retrieval. Traitement Automatique des Langues, 51(2), 2010.
  • Taraka Rama and Anil Kumar Singh (2009) From Bag of Languages to Family Trees From Noisy Corpus. Proceedings of the Conference on Recent Advances in Natural Language Processing, pages 355-359, Borovets, Bulgaria, 2009.
  • Taraka Rama, Anil Kumar Singh, Sudheer Kolachina (2009) Modeling letter-to-phoneme conversion as a phrase based statistical machine translation problem with minimum error rate training. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium, pages 90-95. Association for Computational Linguistics, 2009.
  • Work experience