Machine-assisted translation of cuneiform texts

The CDLI is delighted to announce that the international research collaboration Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC) has been funded through the Trans-Atlantic Platform Digging into Data Challenge by the American National Endowment for the Humanities, the German Research Foundation, and the Canadian Social Sciences and Humanities Research Council.

Methods of computer-assisted annotation of cuneiform texts are currently rule- and dictionary-based. To be reliable, these methods depend on human intervention. In the case of large corpora with non-homogeneous texts, the time required to verify each line appears to make this approach impracticable. Additionally, current methods are not context-aware, or are implemented in a fashion that is not specialized enough to enable unsupervised text analysis. Advanced methods of Natural Language Processing (NLP) are already available for research in modern languages, and computational linguists are working to adapt these tools to the processing of extinct languages. Our project will see the first application of these methods to languages written in cuneiform.
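To illustrate why a purely dictionary-based approach depends on human verification, here is a minimal sketch in Python. It is not the MTAAC or CDLI tooling: the glossary, the function names and the sample line are all assumptions made for illustration. Each token is looked up in isolation, so an ambiguous word such as "mu" (which can mean "year" or "name") cannot be resolved without context and has to be flagged for a human reviewer.

```python
# Hypothetical, tiny glossary for illustration only; a real lexicon would be
# far larger and would also handle morphology and sign readings.
GLOSSARY = {
    "udu": ["sheep"],
    "lugal": ["king"],
    "mu": ["year", "name"],  # ambiguous without surrounding context
}


def gloss_line(line):
    """Look up each whitespace-separated token on its own (no context)."""
    return [(token, GLOSSARY.get(token, [])) for token in line.split()]


def needs_review(glossed):
    """Return tokens with zero or several candidate glosses."""
    return [token for token, candidates in glossed if len(candidates) != 1]


if __name__ == "__main__":
    glossed = gloss_line("mu lugal")
    print(glossed)
    print("flag for human review:", needs_review(glossed))
```

Every flagged token costs a specialist's time, which is the bottleneck described above once a corpus grows large.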

The MTAAC project’s broad goal is to address the gap in the NLP of cuneiform languages. More specifically, our objectives are to:

  • formulate, test and evaluate methodologies for the automated analysis and machine translation (MT) of transliterated cuneiform documents, and to make the technology thus developed available to specialists in the field;
  • make available the translation of a specific and representative set of cuneiform documents to scholars in related disciplines and to a networked public (see below);
  • provide new data for the study of the language, culture, history, economy and politics of the ancient Near East by harvesting the linguistic byproducts of the translation and information extraction processes;
  • formalize these new data utilizing Linked Open Data (LOD) vocabularies, and foster standardization, open data and LOD as practices integral to projects in digital humanities and computational philology.

As a representative and robust test set of cuneiform documents to be used in the initial phase of MTAAC, we have chosen the corpus of Ur III legal and administrative texts. We believe that these 21st-century BC documents represent the best candidates for machine learning experiments due to their simple syntax, homogeneity and imposing numbers: nearly 68,000 texts with 1.5 million lines in Canonical ASCII Transliteration Format, 20,000 of which are available in translation, are maintained by the CDLI, a project that, moreover, has substantial expertise in the interpretation of this and related cuneiform corpora.
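As a rough illustration of how such transliterations could be paired with their translations for machine learning experiments, the following Python sketch reads a simplified ATF-style record. The sample text is invented (the identifier P000000 is a placeholder, not a real catalogue entry), the parser covers only numbered text lines and "#tr.en:" translation lines, and the function names are assumptions rather than part of any CDLI software.

```python
import re

# Invented sample in a simplified ATF-like layout; not a real CDLI text.
SAMPLE = """\
&P000000 = hypothetical example
#atf: lang sux
@tablet
@obverse
1. 1(disz) udu
#tr.en: 1 sheep
2. mu lugal
"""

# A numbered text line: a line label, a period, then the transliteration.
LINE_RE = re.compile(r"^(\d+'?)\.\s+(.*)$")


def parse_atf(text):
    """Collect text lines and attach any '#tr.en:' line that follows them."""
    lines = []
    for raw in text.splitlines():
        raw = raw.strip()
        match = LINE_RE.match(raw)
        if match:
            lines.append({"label": match.group(1),
                          "text": match.group(2),
                          "translation": None})
        elif raw.startswith("#tr.en:") and lines:
            lines[-1]["translation"] = raw[len("#tr.en:"):].strip()
    return lines


def translation_coverage(lines):
    """Fraction of text lines that carry a translation."""
    if not lines:
        return 0.0
    translated = sum(1 for line in lines if line["translation"] is not None)
    return translated / len(lines)


if __name__ == "__main__":
    parsed = parse_atf(SAMPLE)
    print(parsed)
    print("coverage:", translation_coverage(parsed))
```

Lines that already have a translation could serve as training pairs, while the remainder would be candidates for machine translation.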

The principal investigator of the MTAAC research team is Heather D. Baker of the University of Toronto; co-PIs are Christian Chiarcos of the University of Frankfurt and CDLI Director Robert K. Englund of UCLA. Émilie Pagé-Perron, CDLI co-PI, assumes the role of project coordinator. The CDLI will work closely with the MTAAC project, among other things by providing and annotating the data needed for the research, by placing the new data generated into persistent storage, and by adapting its web services to facilitate the dissemination of the new data and applications produced in the course of the international collaboration.

Visit the MTAAC project website