COIL-D

Fostering innovation in Human Language Technology by creating a unified repository of Indian language data.

Under BHASHINI, funded by the Ministry of Electronics and Information Technology (MeitY), Govt. of India

About Us

COIL-D (Centre for Indian Language Data) is a project funded by the Ministry of Electronics and Information Technology (MeitY), Govt. of India. The project is to be executed in consortium mode, led by IIT Patna. The other partner institutions are IIT Delhi, IIIT Guwahati, IIIT Delhi, IGDTUW, Digital India Bhashini Division DIC, and MIT Manipal. The project seeks to develop language resources for Human Language Technology (HLT) and to establish applications, standards, guidelines, and best practices for building and benchmarking Machine Translation (MT) systems and other NLP tools. MIT Manipal is responsible for developing parallel corpora for the Dravidian languages Kannada, Tamil, Malayalam, and Telugu.

Aims

    Develop a suite of language resources, including parallel corpora for machine translation between Indian languages and benchmark datasets for key NLP tasks like Part-of-Speech (PoS) tagging, Named Entity Recognition (NER), Automatic Speech Recognition (ASR), and Text-to-Speech (TTS).

Primary Focus

    • 50% domain-specific content from a mix of sectors: science, healthcare, agriculture, climate, tourism, and judiciary.
    • 30% educational content, focusing on academic materials, textbooks, and learning resources.
    • 20% conversational and governance content, including social media, dialogues, and official documents.
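As a rough illustration of the 50/30/20 content mix above, the following Python sketch splits a corpus size across the three categories. The category keys, the `allocate` function, and the example total of 100,000 sentences are hypothetical and chosen only for illustration; the project does not state a corpus size here.

```python
# Content mix stated in the text, as percentages of the whole corpus.
# The dictionary keys are illustrative labels, not official category names.
CONTENT_MIX = {
    "domain_specific": 50,
    "educational": 30,
    "conversational_governance": 20,
}


def allocate(total_sentences: int) -> dict[str, int]:
    """Split a corpus size across categories according to CONTENT_MIX.

    Uses largest-remainder rounding so the shares always sum to the total.
    """
    if total_sentences < 0:
        raise ValueError("total_sentences must be non-negative")
    if sum(CONTENT_MIX.values()) != 100:
        raise ValueError("CONTENT_MIX percentages must sum to 100")

    # Work in hundredths of a sentence to stay in integer arithmetic.
    raw = {name: total_sentences * pct for name, pct in CONTENT_MIX.items()}
    shares = {name: value // 100 for name, value in raw.items()}

    # Hand any leftover sentences to the categories with the largest remainders.
    leftover = total_sentences - sum(shares.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] % 100, reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Hypothetical corpus size, for illustration only.
print(allocate(100_000))
```

For a total that does not divide evenly, the rounding step keeps the three shares consistent with the overall target rather than losing or gaining a sentence.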

Key Focus

    Standardize, preserve, and create language resources to support NLP research and applications.

Applications

    This dataset will support translation between Tamil and Kannada, Malayalam, and Telugu, and will enable the development of multilingual chatbots, facilitating more effective cross-sector communication.

Initiative

    Establish a centralized repository for Indian language data and provide a platform for developing and benchmarking Human Language Technology (HLT) applications.

Results

    The final output is intended to require minimal post-editing, so that translation is rapid and efficient and the result is immediately suitable for publication on print and digital platforms.

Development

    Create comprehensive leaderboards to systematically evaluate the performance of models in key Natural Language Processing (NLP) tasks. These leaderboards will serve as benchmarks for assessing machine translation (MT), Part-of-Speech (PoS) tagging, Named Entity Recognition (NER), Natural Language Generation (NLG), sentiment analysis, Automatic Speech Recognition (ASR), and Text-to-Speech (TTS).

Delivery

    These advanced, no-cost tools offer a superior alternative to existing solutions like Google Translate, while simultaneously contributing to the preservation and growth of our linguistic heritage.

Our Primary Objectives

The COIL-D (Centre for Indian Language Data) project is building a single, comprehensive hub for Indian language data. Its main goals are to set up a standardized platform to evaluate machine translation and other natural language processing systems, encourage the development and preservation of language resources for human language technology applications, and define benchmarks for linguistic performance.

Step 1: Identification of Tamil language resources

    Identify and list existing Tamil datasets and tools to understand the current resource availability.

Step 2: Acquisition of resources across target domains

    To ensure comprehensive coverage of real-world language usage, data will be collected from key domains, including Education, Governance & Policy, Judiciary, Science & Technology, Healthcare, Agriculture, Climate, and Tourism.

Step 3: Creation of language resources and benchmarks

    Develop language resources and datasets for Machine Translation and NLP, and establish benchmarks to systematically evaluate tool performance.

Step 4: Development of MT evaluation leaderboards

    Develop MT evaluation leaderboards to assess translation systems, enabling performance tracking and fostering continuous improvement.

Step 5: Leaderboards for ASR and TTS technologies

    Develop evaluation leaderboards for speech recognition and synthesis systems to measure accuracy, clarity, and overall performance.

Step 6: Benchmarks for PoS and NER taggers

    Define evaluation protocols for tools such as PoS and NER taggers to ensure fair and consistent assessment.

Leadership

Commander (Dr) Anil Rana

Director

Manipal Institute of Technology

MAHE, Manipal, India

Dr. Chandrakala C B

Joint Director

Additional Professor

School of Computer Engineering

MIT, MAHE Manipal, India

Dr. Radhika M Pai

Dean and Professor

School of Computer Engineering

MIT, MAHE Manipal, India

Dr. P C Siddalingaswamy

Professor and Associate Dean

School of Computer Engineering

MIT, MAHE Manipal, India

Dr. Smitha N Pai

Professor and Associate Dean

School of Computer Engineering

MIT, MAHE Manipal, India

Project Investigators

Dr. Muralikrishna SN

Principal Investigator

Associate Professor

School of Computer Engineering

MIT, MAHE Manipal, India

Dr. Ashalatha Nayak

Co-Investigator

Professor

School of Computer Engineering

MIT, MAHE Manipal, India

Dr. Ashwath Rao B

Co-Investigator

Assistant Professor - Selection Grade

School of Computer Engineering

MIT, MAHE Manipal, India

Dr. Raghavendra Ganiga

Co-Investigator

Associate Professor

School of Computer Engineering

MIT, MAHE Manipal, India

Mr. Ganesh Babu C

Co-Investigator

Assistant Professor Senior Scale

School of Computer Engineering

MIT, MAHE Manipal, India

Dr. Raghurama Holla

Co-Investigator

Assistant Professor

School of Computer Engineering

MIT, MAHE Manipal, India

Project Staff

Ms. Niveditha

Liaison Officer

Mr. PVSS Harshavardhan

Junior Research Associate (Tech)

Ms. Deeksha

Junior Research Associate

Ms. Shruthi

Junior Research Associate

Ms. Sona T

Junior Research Associate

Ms. Shama Bhat

Junior Research Associate

Ms. Shrilatha Kulal

Junior Research Associate

Ms. Raksha

Junior Research Associate

Dr. Umalatha Kannoth

Junior Research Associate

Ms. Kavya

Former Junior Research Associate

Interns

Current Interns

Name: Sakshi

Project: Machine Translation models (IndicTrans2)

Name: Rajdeep

Project: Machine Translation models (IndicTrans2)

Name: Shrikanth Nayak

Project: Web-based data acquisition system

Name: Shreesha

Project: Web-based data acquisition system

Name: Prathiksha

Project: Speaker Diarization

Name: Sameeksha

Project: Speaker Diarization

Previous Interns

Name: Ranjan Shettigar

Project: Speech acquisition and recognition

Name: Bhavin kumar

Project: Speech acquisition and recognition

Name: Sathwik

Project: PoS taggers for Dravidian languages

Name: Prajwal

Project: PoS taggers for Dravidian languages

Name: Athreya

Project: Speech annotation

Name: Adithya Chawhan

Project: Speaker Diarization

For internship inquiries and applications, please contact our team at coild.mit@manipal.edu

Open Positions

Freelance Translators and Reviewers

We are looking for freelance translators and reviewers proficient in Dravidian languages. The language pairs for translation are:

  • Tamil - Kannada
  • Tamil - Malayalam
  • Tamil - Telugu

Note that Tamil is the source language.

Eligibility Criteria
  • Proficiency in Tamil
  • Proficiency in Kannada, Malayalam, or Telugu
  • Basic computer knowledge

Translation price: ₹ 1.5/source word
Review price: ₹ 0.75/source word
Apply now
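
As a rough illustration of the per-word rates listed above, the following Python sketch computes the fee for a given number of source words. The function name `fee_rupees` and the 1,200-word example are hypothetical, and how source words are counted is an assumption not specified in the posting.

```python
# Rates from the posting, held in paise per source word so that all
# arithmetic stays in integers (₹1 = 100 paise).
TRANSLATION_PAISE_PER_WORD = 150  # ₹1.5 per source word
REVIEW_PAISE_PER_WORD = 75        # ₹0.75 per source word


def fee_rupees(source_words: int, paise_per_word: int) -> str:
    """Return the fee for a job as a formatted rupee amount."""
    if source_words < 0:
        raise ValueError("source_words must be non-negative")
    total_paise = source_words * paise_per_word
    return f"₹{total_paise // 100}.{total_paise % 100:02d}"


# Hypothetical 1,200-word Tamil source document.
print(fee_rupees(1200, TRANSLATION_PAISE_PER_WORD))  # ₹1800.00
print(fee_rupees(1200, REVIEW_PAISE_PER_WORD))       # ₹900.00
```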

Get In Touch

Contact details

Email: coild.mit@manipal.edu

LinkedIn: COIL-D MIT

Address: COIL-D (MIT), Manipal Institute of Technology, Manipal
Udupi District, Karnataka, India 576104