COCOHub: A crowdsourced dataset builder and community for NLP in underrepresented languages

Appropriately structured datasets are hard to find when learning or researching NLP (specifically machine translation and image captioning) for underrepresented languages. To address this, I decided to create a crowdsourced site that translates MS-COCO 2015 and produces two kinds of dataset:

1. Machine translation between any two languages in COCOHub. The language list is open and append-only: a new language project can be added whenever someone asks for it.

2. Image captioning. This is very useful from an education and accessibility point of view. Whether or not the datasets alone will be sufficient to solve the task, they are a necessary starting point.

Goals: to create a collection of open datasets offering novel language pairs for machine translation and captions for images, and to use existing open source infrastructure to support the evaluation of competitions that advance the state of the art (SoTA) in both translation and image captioning for underrepresented languages.

The first beneficiaries of this project are students and researchers, followed by end users, who will eventually get translation and image captioning systems that work in their local languages.

1. Image captioning

The MS-COCO 2015 Image Captioning challenge published nearly one million sentences, with five sentences attached to each of around 330,000 images. Each image ID has 5 sentences independently written by human annotators (see paper). Because it is an image captioning dataset, translating its sentences automatically gives us captions in the underrepresented languages that we select. Each language is its own project, managed by a team of verifiers and champions, feeding a data pipeline that publishes the same data structure as the original, except in each of the target languages.

At the end of it all, there will be a simple web page that lets people search for a language and download a compressed JSON file, split into training, validation and test sets, for their language of choice. This opens up image captioning as a competition-level research opportunity for many languages where it previously wasn't available. Further, following on the linked paper, an analysis framework can be built, and all data preprocessing tasks required for each language will now have sample data to test with.
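As a rough illustration of "the same data structure as the original", the sketch below rebuilds a COCO-style captions file for one target language and splits it by image. The "images" and "annotations" keys with "id", "image_id" and "caption" fields follow the published COCO captions layout; the function names, the `translations` mapping, the "language" field, the split ratios and the sample Swahili caption are all assumptions made for this example, not part of COCOHub.

```python
import json
import random


def build_translated_captions(coco, translations, language):
    """Return a COCO-style captions dict with captions swapped for translations.

    coco: dict with "images" and "annotations" lists (COCO captions layout).
    translations: hypothetical mapping {annotation id: translated caption}.
    Annotations that have no translation yet are skipped.
    """
    annotations = [
        {"id": ann["id"], "image_id": ann["image_id"],
         "caption": translations[ann["id"]]}
        for ann in coco["annotations"]
        if ann["id"] in translations
    ]
    kept = {ann["image_id"] for ann in annotations}
    images = [img for img in coco["images"] if img["id"] in kept]
    # "language" is an assumed extra field, not part of the COCO format.
    return {"info": {"language": language}, "images": images,
            "annotations": annotations}


def split_by_image(dataset, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split by image ID so all captions of one image stay in one split."""
    image_ids = sorted(img["id"] for img in dataset["images"])
    random.Random(seed).shuffle(image_ids)
    n_train = int(len(image_ids) * ratios[0])
    n_val = int(len(image_ids) * ratios[1])
    groups = {
        "train": set(image_ids[:n_train]),
        "val": set(image_ids[n_train:n_train + n_val]),
        "test": set(image_ids[n_train + n_val:]),
    }
    return {
        name: {
            "info": dataset["info"],
            "images": [i for i in dataset["images"] if i["id"] in ids],
            "annotations": [a for a in dataset["annotations"]
                            if a["image_id"] in ids],
        }
        for name, ids in groups.items()
    }


sample = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "A dog runs on the beach."},
        {"id": 11, "image_id": 2, "caption": "Two children play football."},
    ],
}
# Only annotation 10 has been translated so far (illustrative Swahili).
swahili = build_translated_captions(sample, {10: "Mbwa anakimbia ufukweni."}, "sw")
splits = split_by_image(swahili)
print(json.dumps(swahili, ensure_ascii=False, indent=2))
```

Splitting by image ID rather than by sentence keeps all captions of an image in the same split, so no image leaks from the training set into the test set.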

2. Machine translation

There will probably be more than 50 target languages in COCOHub. Each of the five independently sourced descriptions attached to an image ID will receive 5-10 contributed translations, both to mitigate the negative effects of spam and to surface interesting translation variants that the community can vote on. The dataset also has a unique integer ID for each sentence, which means that a single source (English) sentence ID maps to a translation in as many languages as there are volunteers to help complete the project. So the translations aren't tied to English alone: any two languages that share a sentence ID can be paired with each other. This has nice linguistic properties, as it reflects how a lot of Africans think, by code-switching between languages.
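The pivot through shared sentence IDs can be sketched as follows. The nested `{sentence_id: {language_code: sentence}}` layout, the function name and the language codes are assumptions for illustration, and the non-English sentences are placeholders rather than real translations.

```python
def parallel_pairs(translations, source_lang, target_lang):
    """Pair two languages through the sentence IDs they share.

    translations: hypothetical layout {sentence_id: {lang_code: sentence}}.
    Returns (sentence_id, source sentence, target sentence) tuples for every
    ID that has been translated into both languages.
    """
    pairs = []
    for sentence_id in sorted(translations):
        by_lang = translations[sentence_id]
        if source_lang in by_lang and target_lang in by_lang:
            pairs.append(
                (sentence_id, by_lang[source_lang], by_lang[target_lang])
            )
    return pairs


sample = {
    101: {
        "en": "A dog runs on the beach.",
        "luy": "<Luhya translation of sentence 101>",   # placeholder text
        "yo": "<Yoruba translation of sentence 101>",   # placeholder text
    },
    102: {
        "en": "Two children play football.",
        "yo": "<Yoruba translation of sentence 102>",   # Luhya not done yet
    },
}

# English never appears in the resulting Luhya-Yoruba pairs.
luhya_yoruba = parallel_pairs(sample, "luy", "yo")
print(luhya_yoruba)
```

Only sentence 101 yields a Luhya-Yoruba pair here, because sentence 102 has no Luhya translation yet; the number of usable pairs for any two languages grows as both projects fill in.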

COCOHub's crowdsourcing tools will support voting and (eventually) statistical verification, letting people vote on the best of many contributed translations for a single sentence. This will ensure that when a project is completed, i.e. when all the English sentences have been translated to a language and people have spent time voting and verifying, the highest-quality sentences get published and used by students and researchers who want to work in those languages.
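A minimal sketch of the voting step, assuming each contributed translation is stored as a record with a sentence ID, its text and a vote count. The record fields, the `min_votes` threshold and the tie-breaking rule are hypothetical; the statistical verification mentioned above is not modeled here.

```python
def select_published(candidates, min_votes=3):
    """Pick the top-voted translation for each sentence ID.

    candidates: list of dicts with "sentence_id", "text" and "votes"
    (a hypothetical record layout). Candidates below min_votes are ignored;
    on a tie, the translation contributed first is kept.
    """
    best = {}
    for cand in candidates:
        if cand["votes"] < min_votes:
            continue
        current = best.get(cand["sentence_id"])
        if current is None or cand["votes"] > current["votes"]:
            best[cand["sentence_id"]] = cand
    return {sid: cand["text"] for sid, cand in best.items()}


def is_complete(source_ids, published):
    """A language project is complete once every source sentence is covered."""
    return set(source_ids) <= set(published)


candidates = [
    {"sentence_id": 101, "text": "variant A", "votes": 7},
    {"sentence_id": 101, "text": "variant B", "votes": 12},
    {"sentence_id": 102, "text": "variant C", "votes": 1},  # too few votes
]
published = select_published(candidates)
print(published, is_complete([101, 102], published))
```

In this example sentence 102 stays unpublished until one of its translations gathers enough votes, so the project is not yet reported as complete.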

A bonus of this approach is that each completed language project can be paired with languages that would otherwise never have been translation targets for it, so people can mix and match languages in very interesting ways. We'll be fundamentally creating the same data structure, just in 50+ different languages. Imagine someone deciding to translate Luhya to Yoruba as a research or study project, just because the option is there to play around with.

Various preprocessing tasks are needed for the resulting technology to be practically achievable, from SentencePiece tokenization and lemmatization to web scraping of unstructured documents. Another way to find monolingual corpora is to mine the Common Crawl dataset, using fastText language identification models to extract the sentences in the relevant language. This is very computationally expensive, as Common Crawl has terabytes of data, so it is important to publish the extracted data as well, in order to help people down the road. One further step is using the scraped data to train fastText word vectors, which provide very useful word-level context. An example of using fastText to bootstrap an English-French translation model - a form of transfer learning for machine translation - is here.
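The language filtering step might look like the sketch below. It takes any predictor that behaves like fastText's `model.predict(text)`, which returns a tuple of labels such as `__label__sw` and a matching sequence of probabilities. The function name, the 0.8 confidence threshold and the stub predictor (used so the example runs without downloading a model) are assumptions for illustration.

```python
def filter_language(lines, predict, target_lang, min_confidence=0.8):
    """Keep lines whose top predicted language matches target_lang.

    predict: callable taking one line of text and returning (labels, probs),
    in the shape returned by fastText's model.predict with k=1.
    """
    wanted = "__label__" + target_lang
    kept = []
    for line in lines:
        # fastText's predict rejects embedded newlines, so collapse whitespace.
        text = " ".join(line.split())
        if not text:
            continue
        labels, probs = predict(text)
        if len(labels) > 0 and labels[0] == wanted and float(probs[0]) >= min_confidence:
            kept.append(text)
    return kept


def stub_predict(text):
    """Stand-in for a real model: treats lines containing 'habari' as Swahili."""
    if "habari" in text.lower():
        return ("__label__sw",), (0.95,)
    return ("__label__en",), (0.99,)


crawl_lines = ["Habari za asubuhi\n", "Good morning everyone", "   "]
swahili_lines = filter_language(crawl_lines, stub_predict, "sw")
print(swahili_lines)

# With the fasttext package and its published language ID model installed:
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   swahili_lines = filter_language(crawl_lines, model.predict, "sw")
```

Passing the predictor in as an argument keeps the filter testable and lets the same loop run over a small scraped file or a Common Crawl shard. The confidence threshold is worth tuning per language, since identification tends to be less reliable for languages with little training data.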

The original MS-COCO captions dataset is released under a Creative Commons Attribution 4.0 license, which gives leeway to make the resulting derivative datasets open by default. Each language pair will still need a team of verifiers, probably professional linguists. University faculty can plug in nicely to this part of the project.