Universal Dependency Treebank for Santali Language

Published Oct 8, 2021
Satya Dash, Sunil Sahoo, Brojo Kishore Mishra, Shantipriya Parida, Jatindra Nath Besra, Atul Kr. Ojha

Abstract

Santali is a low-resource Indian language belonging to the Austroasiatic language family. The Santals are the largest adivasi (indigenous) community in the Indian subcontinent, with a population of more than 10 million; they live mostly in the Indian states of Jharkhand, Orissa, West Bengal, Assam and Bihar, and sparsely in Bangladesh and Nepal. Language technology has advanced significantly for the major Indian languages, but research contributions for lesser-known, low-resource languages remain minimal. Most parsers and treebanks have been developed for the scheduled (official) languages; the non-scheduled and lesser-known languages still have a long way to go. To help fill this gap, this paper describes the creation and development of a Santali Universal Dependencies (UD) treebank and parser. UD is an emerging framework for cross-linguistically consistent grammatical annotation. The primary aim of the project is to facilitate multilingual parser development; it also supports cross-lingual learning and parsing research from the perspective of language typology. A major effort is currently under way to develop large-scale treebanks for Indian low-resource languages (ILRLs). The lack of such resources has been a major limiting factor in the development of good natural language tools and applications for ILRLs, and a rich, large-scale treebank is also an essential resource for linguistic investigation. This paper presents the first publicly available treebank for Santali. The treebank contains approximately 500 tokens (50 sentences). All selected sentences were manually annotated following the Universal Dependencies guidelines. The morphological analysis of the treebank was performed using machine learning techniques.
The annotated Santali treebank will enrich the language resources available for Santali and help in building language technology tools for cross-lingual learning and typological research. We also build a preliminary Santali parser using a machine learning approach. Finally, the paper briefly discusses the linguistic analysis of the Santali UD treebank.
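
As an illustration of how the size of a UD treebank (sentences and tokens) can be counted, the sketch below reads CoNLL-U text, the tab-separated, ten-column format in which UD treebanks are normally distributed. The abstract does not state the file format of this treebank, so that is an assumption; the sample sentences are English placeholders invented for the example, not data from the treebank, and the function names `parse_conllu` and `treebank_stats` are hypothetical.

```python
from collections import Counter

# Hypothetical two-sentence sample in CoNLL-U format.
# English placeholder text, NOT taken from the Santali treebank.
SAMPLE = """\
# sent_id = demo-1
# text = Dogs bark.
1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_

# sent_id = demo-2
# text = She sleeps.
1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment (sent_id, text, ...)
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        token = dict(zip(FIELDS, columns))
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1"),
        # so that only syntactic words are counted.
        if "-" in token["id"] or "." in token["id"]:
            continue
        current.append(token)
    if current:
        sentences.append(current)  # file may not end with a blank line
    return sentences


def treebank_stats(sentences):
    """Count sentences, tokens, UPOS tags and dependency relations."""
    tokens = [tok for sent in sentences for tok in sent]
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "upos": Counter(tok["upos"] for tok in tokens),
        "deprel": Counter(tok["deprel"] for tok in tokens),
    }


if __name__ == "__main__":
    stats = treebank_stats(parse_conllu(SAMPLE))
    print(stats["sentences"], "sentences,", stats["tokens"], "tokens")
    print(stats["upos"].most_common())
```

Run on the actual treebank file, the same two functions would report its sentence and token totals together with the distribution of part-of-speech tags and dependency relations.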

How to Cite

Dash, S., Sunil Sahoo, Brojo Kishore Mishra, Shantipriya Parida, Jatindra Nath Besra, & Atul Kr. Ojha. (2021). Universal Dependency Treebank for Santali Language. SPAST Abstracts, 1(01). Retrieved from https://spast.org/techrep/article/view/2111

Keywords

Santali, Universal Dependency, treebank, low resource

Section
GE3- Computers & Information Technology
