Sinmin

Introduction

Language plays the central role when people communicate with each other. Many nations have carried out extensive research on their native languages and have produced interesting results and tools. Those results are used in building translators, grammar and spell checkers, language pattern detectors, dictionaries, and similar applications.

A language corpus is the base resource for the applications above. The idea of a corpus is to gather as much electronic text of a language as possible and store it in a way that makes details of the language easy to find.

Background and context

Corpus linguistics approaches the study of language in use through corpora. In short, corpus linguistics serves to answer two fundamental research questions:

  • What particular patterns are associated with lexical or grammatical features?
  • How do these patterns differ within varieties and registers?

A word in and of itself does not carry meaning; meaning is often made through several words in a sequence. This idea forms the backbone of corpus linguistics. Broadly, corpus linguistics looks for the patterns associated with lexical and grammatical features. Searching corpora provides answers to questions like these:

  • What are the most frequent words and phrases in a language?
  • What are the differences between spoken and written languages?
  • What tenses do people use most frequently?
  • What prepositions follow particular verbs?
  • How do people use words like can, may, and might?
  • Which words are used in more formal situations and which are used in more informal ones?
  • How often do people use idiomatic expressions?
  • How many words must a learner know to participate in everyday conversation?
  • How many different words do native speakers generally use in conversation?
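As a rough illustration of the first question in the list above, the most frequent words and phrases can be approximated by counting single tokens and adjacent token pairs. This is only a sketch: the `tokenize` and `top_ngrams` helpers are hypothetical names, the sample text is made up, and splitting on whitespace is an assumption that a real Sinhala tokenizer would need to refine.

```python
from collections import Counter
import string


def tokenize(text):
    """Split on whitespace and strip ASCII punctuation from token edges.

    This is a deliberately naive assumption; it does not handle
    script-specific punctuation or sentence boundaries.
    """
    stripped = (tok.strip(string.punctuation) for tok in text.split())
    return [tok.lower() for tok in stripped if tok]


def top_ngrams(tokens, n, k):
    """Return the k most frequent runs of n adjacent tokens."""
    # zip over n shifted copies of the token list yields every n-gram
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)


sample = "the cat sat on the mat. the cat ran."
tokens = tokenize(sample)

most_common_word = top_ngrams(tokens, 1, 1)
most_common_pair = top_ngrams(tokens, 2, 1)
print(most_common_word, most_common_pair)
```

Note that this simple version counts pairs across sentence boundaries, which a fuller analysis would probably want to avoid.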

Aims

There have been only a few projects to develop a corpus for the Sinhala language. One of the challenges in designing a corpus for Sinhala is the lack of resources. Most existing corpora focus mainly on extracting data from Sinhala newspapers, and they are stored in raw text files, which provide little support for analysing the data in the corpus.

We have identified the following drawbacks in existing corpora for the Sinhala language:

  • Few data sources
  • Insufficient metadata
  • Only old data, with no updates from new resources
  • No proper interface (API or web) through which outsiders can use them

In this project we are trying to design and implement a corpus for the Sinhala language that is:

  1. Continuously updated
  2. Dynamic (scalable)
  3. Broad in its coverage of the language (structured and unstructured)

We plan to build the corpus from all available kinds of Sinhala digital data sources, such as newspapers, blogs, ebooks, and wiki pages. Instead of keeping the data in raw text files, we expect to store it in a structured manner in a database in order to support searching and updating. We will run performance tests on several database solutions with a variety of queries to find the most suitable one for our scenario.
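To make the idea of structured storage with metadata concrete, here is a minimal sketch. It assumes SQLite purely as a stand-in, since the database for the project has not been chosen, and the table layout, column names, and helper functions are all hypothetical.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open a database and create a hypothetical documents table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id          INTEGER PRIMARY KEY,
               url         TEXT UNIQUE,      -- where the text came from
               source_type TEXT NOT NULL,    -- e.g. newspaper, blog, wiki
               title       TEXT,
               published   TEXT,             -- ISO 8601 date string
               body        TEXT NOT NULL
           )"""
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_source ON documents(source_type)"
    )
    return conn


def add_document(conn, url, source_type, title, published, body):
    """Insert one document; return False if the URL is already stored."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO documents "
            "(url, source_type, title, published, body) VALUES (?, ?, ?, ?, ?)",
            (url, source_type, title, published, body),
        )
    return cur.rowcount == 1


def count_by_source(conn):
    """Return a mapping of source_type to number of stored documents."""
    rows = conn.execute(
        "SELECT source_type, COUNT(*) FROM documents GROUP BY source_type"
    )
    return dict(rows)


conn = create_store()
add_document(conn, "http://example.com/a", "newspaper", "A", "2014-01-01", "text a")
add_document(conn, "http://example.com/b", "blog", "B", "2014-01-02", "text b")
print(count_by_source(conn))
```

Keeping the source type and publication date next to each text is what makes queries such as "frequency by source" or "usage over time" possible, which raw text files do not support directly.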

The corpus will be designed so that it is updated automatically as new material appears on the internet, such as the latest newspaper articles or blog feeds. This keeps the corpus up to date with little manual effort.
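One way such automatic updating might work is to fingerprint each fetched text and add only what has not been seen before. The sketch below assumes that a crawler (not shown, and not specified here) hands over items as dictionaries with a `body` field; the function names and the choice of SHA-256 over whitespace-normalized text are illustrative assumptions.

```python
import hashlib


def fingerprint(text):
    """Hash the text after collapsing whitespace, so trivial
    formatting differences do not produce a new fingerprint."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def ingest_new(items, seen):
    """Return only items whose body has not been seen before.

    `seen` is a set of fingerprints and is updated in place, so the
    same set can be reused across repeated crawler runs.
    """
    added = []
    for item in items:
        key = fingerprint(item["body"])
        if key in seen:
            continue  # already in the corpus, skip it
        seen.add(key)
        added.append(item)
    return added


seen = set()
first_run = ingest_new(
    [{"body": "article one"}, {"body": "article two"}], seen
)
second_run = ingest_new(
    [{"body": "article  one"}, {"body": "article three"}], seen
)
print(len(first_run), len(second_run))
```

Exact hashing only catches identical texts; near-duplicates, such as the same article republished with small edits, would need a different technique.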

A web-based interface with visualization tools will be created so that anyone can use the corpus to find details of the language. The corpus will also be made available through an API so that anyone can use it in their own applications.