Msingi Data Challenge Guidelines

Preserving Kenya's Indigenous Voices Through Collaborative Data Collection

Our Mission

We aim to collect high-quality text and audio data in Kenya's indigenous languages, annotate it accurately, and build a comprehensive, open-source dataset that empowers local NLP research.

Target Languages

We are focusing on the following indigenous Kenyan languages:

  • Kikuyu
  • Luo
  • Kalenjin
  • Luhya
  • Meru
  • All other Kenyan languages

Data Submission Guidelines

Text Data

  • Sources: Folklore, traditional literature, community blogs
  • Formats: .txt, .docx, .pdf
  • Encoding: UTF-8
  • Naming Convention: [LanguageCode]_[Source]_[Date]_[ContributorInitials].txt
  • Metadata: Include in the file header
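
A file name can be checked against the naming convention above before upload. The sketch below is illustrative only: the guidelines do not fix the exact shape of each field, so it assumes a three-letter lowercase language code (for example "kik" for Kikuyu), a source label without underscores, a date written as YYYYMMDD, and two to four uppercase initials. The names SubmissionName and parse_submission_name are hypothetical helpers, not part of any project tooling.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class SubmissionName(NamedTuple):
    language_code: str
    source: str
    date: str
    initials: str
    extension: str


# Assumed field shapes (not specified by the guidelines):
#   language code: 3 lowercase letters, source: letters/digits/hyphens,
#   date: YYYYMMDD, initials: 2-4 uppercase letters.
_NAME_RE = re.compile(
    r"^(?P<lang>[a-z]{3})_(?P<source>[A-Za-z0-9-]+)_(?P<date>\d{8})"
    r"_(?P<initials>[A-Z]{2,4})\.(?P<ext>txt|docx|pdf)$"
)


def parse_submission_name(filename: str) -> Optional[SubmissionName]:
    """Return the parsed fields, or None if the name does not fit the pattern."""
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    try:
        # Reject dates that have eight digits but are not real calendar days.
        datetime.strptime(match["date"], "%Y%m%d")
    except ValueError:
        return None
    return SubmissionName(
        match["lang"], match["source"], match["date"],
        match["initials"], match["ext"],
    )
```

Under these assumptions, a name such as kik_folklore_20240315_JM.txt parses into its four fields, while a name with spaces or an impossible date is rejected. If the project settles on a different date format or code length, only the pattern needs to change.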

Audio Data

  • Sources: Oral storytelling, interviews, radio recordings
  • Formats: .wav or .mp3 (.wav preferred)
  • Sample rate: 16 kHz or higher
  • Naming Convention: [LanguageCode]_[Type]_[Date]_[ContributorInitials].wav
  • Metadata: Include detailed metadata
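
The 16 kHz minimum can be verified locally before submitting a recording. The following is a minimal sketch using Python's standard wave module, which reads uncompressed PCM .wav files only; checking an .mp3 would need a third-party decoder and is not shown. The function names and the MIN_SAMPLE_RATE_HZ constant are illustrative, not part of the challenge tooling.

```python
import wave

# Minimum sample rate stated in the guidelines: 16 kHz.
MIN_SAMPLE_RATE_HZ = 16_000


def wav_sample_rate(path: str) -> int:
    """Read the sample rate (in Hz) from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def meets_minimum_sample_rate(path: str,
                              minimum_hz: int = MIN_SAMPLE_RATE_HZ) -> bool:
    """Return True if the recording is sampled at minimum_hz or higher."""
    return wav_sample_rate(path) >= minimum_hz
```

Note that resampling a low-rate recording up to 16 kHz would pass this check without adding any real audio detail, so the check is best applied to the original recording.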

Submission Process

Submission Portal

Submit Your Data

Support

Contact our support team with any questions.

Data Annotation Guidelines

Text Data Annotation

  • Language Tagging using ISO codes
  • Named Entity Recognition
  • Translation Pairs (if applicable)
  • Consistent annotation format
  • Verify accuracy with native speakers
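
One way to keep text annotations consistent is to store each annotated passage as a single JSON record. The guidelines do not prescribe a schema, so everything in the sketch below is an assumption made for illustration: the field names, the use of character offsets for named entities, and the one-record-per-line JSON output.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    start: int   # character offset where the entity begins (inclusive)
    end: int     # character offset where the entity ends (exclusive)
    label: str   # e.g. "PERSON" or "LOCATION" (label set is an assumption)


@dataclass
class TextAnnotation:
    text: str
    language: str                          # ISO language code, e.g. "kik"
    entities: List[Entity] = field(default_factory=list)
    translation: Optional[str] = None      # only when a translation pair exists
    verified_by_native_speaker: bool = False

    def validate(self) -> None:
        """Raise ValueError if any entity span falls outside the text."""
        for entity in self.entities:
            if not (0 <= entity.start < entity.end <= len(self.text)):
                raise ValueError(
                    f"entity span {entity.start}-{entity.end} is outside the text"
                )

    def to_json_line(self) -> str:
        """Serialize to one JSON line, keeping non-ASCII characters readable."""
        self.validate()
        return json.dumps(asdict(self), ensure_ascii=False)
```

Writing one record per line makes it easy to append new annotations and to review them in a diff. The ensure_ascii=False setting keeps diacritics and other non-ASCII characters as written rather than escaping them.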

Audio Data Annotation

  • Verbatim transcription
  • Include timestamps
  • Annotate speaker details
  • Add contextual notes
  • Use recommended annotation tools
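
The audio requirements above (transcript, timestamps, speaker details, contextual notes) can be pictured as a list of timed segments. The sketch below is a hypothetical structure, not the output format of any particular annotation tool: the Segment fields, the HH:MM:SS.mmm timestamp style, and the rule that segments are listed in order of start time are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start_s: float      # segment start, in seconds from the start of the file
    end_s: float        # segment end, in seconds
    speaker: str        # speaker identifier or description
    transcript: str     # verbatim transcription of the segment
    note: str = ""      # optional contextual note


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def check_segments(segments: List[Segment]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    previous_start = 0.0
    for index, seg in enumerate(segments):
        if seg.start_s < 0 or seg.end_s <= seg.start_s:
            problems.append(f"segment {index}: end must come after start")
        if seg.start_s < previous_start:
            problems.append(f"segment {index}: starts before the previous segment")
        if not seg.transcript.strip():
            problems.append(f"segment {index}: transcript is empty")
        if not seg.speaker.strip():
            problems.append(f"segment {index}: speaker is missing")
        previous_start = max(previous_start, seg.start_s)
    return problems
```

Segments are allowed to overlap here, since speakers in interviews often talk over one another; only the ordering by start time is checked.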

Final Steps

  • Expert review of submitted data
  • Quality assurance process
  • Publication on GitHub and Hugging Face
  • Credit for high-quality contributions

Ready to Contribute?

Help us preserve and empower Kenya's indigenous languages through AI.
