SMC/SoC/2008
Participation of SMC in GSoC 2008 is not yet confirmed. Use this page to collect project ideas.
Ideas for Google Summer of Code 2008
Tokenizer/Lemmatiser for Malayalam for GATE
Write a tokenizer and lemmatiser for Malayalam, and investigate whether they can be packaged as a GATE plugin, which would help NLP researchers considerably. To get started, search for GATE, then download and install it. Its plugins directory includes a Hindi tokenizer and lemmatiser.
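As a starting point for the tokenizer part of this idea, here is a minimal sketch in Python. It is not a GATE plugin and does not use any GATE API. It only illustrates splitting text into Malayalam words, Latin words, numbers and punctuation, using the Malayalam Unicode block (U+0D00 to U+0D7F). The function name tokenize and the token classes are assumptions made for illustration.

```python
import re

# Assumption for illustration: a token is one of
#   - a run of characters from the Malayalam Unicode block (U+0D00-U+0D7F),
#     including ZWNJ/ZWJ (U+200C/U+200D), which can occur inside Malayalam words
#   - a run of ASCII letters
#   - a run of digits
#   - a single punctuation or symbol character
# Letters from other scripts and underscores are skipped by this sketch.
_TOKEN_RE = re.compile(
    r"[\u0D00-\u0D7F\u200C\u200D]+"  # Malayalam word
    r"|[A-Za-z]+"                    # Latin word
    r"|\d+"                          # number
    r"|[^\s\w]"                      # single punctuation mark
)


def tokenize(text):
    """Return the list of tokens found in text, in order of appearance."""
    return _TOKEN_RE.findall(text)


if __name__ == "__main__":
    for token in tokenize("മലയാളം ഭാഷ, GATE 2008."):
        print(token)
```

A real lemmatiser would need morphological rules or a dictionary on top of this. The sketch covers only the tokenization step.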
Functional Optical Character Recognition System
Add Malayalam support to Tesseract OCR. The stages and objectives are yet to be defined clearly.
Write a GNOME Speech Driver for Dhvani and Integrate It with Orca
Orca, the screen reader for visually impaired users, uses GNOME Speech to talk to speech engines. Festival, eSpeak, FreeTTS and others currently have GNOME Speech drivers. We need to write a driver for Dhvani.
Swathantra Malayalam Corpus Phase 1
The Swathantra Malayalam Corpus project aims to build a free and open source annotated corpus, along with related APIs and programs for building different types of corpora.
Details:
- An annotated image and speech corpus is needed to support FOSS-driven research and development in speech and image processing.
- It should be usable as standard training and test data for R&D activities.
- The first phase needs to produce a specification document and a clearly written manual for building the corpus, plus the tools needed to build and use the corpus.
- Anybody who would like to contribute to the project must be able to do so. The specification should be thorough, covering classification of data, annotation of data, storage structure and all related details.
- By the end of the summer, we must have a complete specification document and programs to build and access the corpus. (The full build-out must be a collaborative effort and is not part of this phase.)
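To make the points about annotation and storage structure more concrete, here is a minimal Python sketch of what one annotated corpus entry might look like, stored as one JSON object per line. Every field name here (entry_id, media_type, file_path, transcription, category, annotator, license, tags) is a hypothetical placeholder. The actual fields would be decided in the specification document.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class CorpusEntry:
    """One annotated item in the corpus (hypothetical layout, not a spec)."""
    entry_id: str
    media_type: str      # assumed to be either "image" or "speech"
    file_path: str       # path of the image or audio file within the corpus
    transcription: str   # Malayalam text shown in the image or spoken in the clip
    category: str        # classification of the data, e.g. "news"
    annotator: str       # who contributed the annotation
    license: str = "unspecified"
    tags: list = field(default_factory=list)

    def validate(self):
        """Raise ValueError if the entry breaks the assumed rules."""
        if self.media_type not in ("image", "speech"):
            raise ValueError("media_type must be 'image' or 'speech'")
        if not self.transcription.strip():
            raise ValueError("transcription must not be empty")


def to_json_line(entry):
    """Serialize a validated entry as a single JSON line, keeping Malayalam text readable."""
    entry.validate()
    return json.dumps(asdict(entry), ensure_ascii=False, sort_keys=True)


def from_json_line(line):
    """Parse one JSON line back into a validated CorpusEntry."""
    entry = CorpusEntry(**json.loads(line))
    entry.validate()
    return entry


if __name__ == "__main__":
    sample = CorpusEntry(
        entry_id="img-0001",
        media_type="image",
        file_path="images/img-0001.png",
        transcription="മലയാളം",
        category="printed",
        annotator="example-contributor",
    )
    print(to_json_line(sample))
```

A line-oriented text format like this is easy to merge and review when many contributors add entries, but that is only one possible design.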
Please add any further details that are relevant to a corpus project.