An inspiring open science journey to remember 💙

Aya (Model & Dataset) will be released 6 hours from now. In the meantime, I wanted to write down my thoughts on this journey and the things I have learned while collaborating with so many people across the world who share a common set of values and goals. I've been part of this effort since June 2023, and time flew so fast that I forgot how I even joined it in the first place. So this is a reflection for me to look back on, sometime in the future.

We all know that data is a cornerstone of AI advancements. But apart from English, most languages in the world have negligible representation on the Internet. Can we change this with the power of community collaboration? This is what Aya was all about. It was a massive global effort to uplift many under-resourced languages amid the current advances in natural language processing.

Before elaborating on how I joined this project and its specifics, I want to think out loud about open science efforts. Over the course of this project, I saw:

  • people from various age groups and levels of expertise gathering to build datasets for under-resourced languages
  • like-minded individuals coming together for a common goal and inspiring more people to work on these research questions
  • people who are new to research leading this effort, carefully paying attention to very subtle details and capturing the effort in scientific articles [1,2].

Watching all this unfold on Discord was so inspiring. Finally seeing a multilingual large language model (LLM) proficient in 101 languages was such a joy, and more importantly, so was seeing Aya-101 [1] proficient in my native language, Sinhala (සිංහල) 🎉❤️.

I asked Aya these two questions in Sinhala: 1) වෙහෙසකර අභියෝගයක් සාර්ථකව ජය ගත් පසු කුමක්ද කරන්න ඕනේ? (What should one do after successfully overcoming a strenuous challenge?) 2) ලංකාවෙ නිදහස් දිනය කවද්ද? (When is Sri Lanka's Independence Day?)
While there is a long way to go for Sinhala LLMs, I'm honestly surprised that it follows Sinhala instructions quite nicely!

The Start

If I recall correctly, I joined the C4AI (Cohere For AI$^\diamond$) Discord server during my final year of undergrad. I saw so many wonderful initiatives by this community to help students and researchers who wanted to learn about AI and ultimately contribute to research. I didn't follow these initiatives closely due to other time commitments, but every once in a while I would check out what the community was up to.

Filling out a Google Form (May 2023)

On May 8th, I was getting ready for a morning run and saw this message on Discord.

Before filling something out, I always consider whether it is something I can do given my other commitments$^\ddagger$ 😅

I knew I didn't have much time to dedicate, but the message above seemed reasonable, so I filled it out. It seemed like knowing the language was all that was required. Words like multilingual and underrepresented languages made it so tempting to fill out the Google Form.

An Email ✉️

June 2023 was a great month. I was finishing up some experiments for my ICML 2023 workshop paper and writing the paper. And then I received this email...

It was sort of unexpected... It felt serious, but then again it felt like something I could do in my free time (a weekly meeting and spreading info about this project sounded good and doable 🤷🏻‍♂️).

So I replied that I'd like to help out by representing my native language!$^\dagger$

Journey as a Language Ambassador

I attended the first meeting where I got introduced to Project Aya. The presentation was incredible... I thought to myself, "a great research group with a clear vision of what they want to achieve in 2023..". There was so much positivity and hope within this community, and I loved that energy!

Somewhat of a rough start for Sinhala 🤕

Now it was my turn to contribute to my language. I remember going to the Aya annotation platform$^\star$ and filling out my details (such as the languages I speak).

I was surprised to see Sinhala prompts and completions 😃, but I soon realized that they were machine translations that often did not make much sense 🤕 (see the left side of the picture below for some context).

So, we gradually started to refine those. Soon with the help of many Sinhala contributors, we got a massive pool of prompts and completions that we kept on refining.

Realizing that we need to speed up

From June to the end of September, we gained a lot of contributions, but we soon realized our pace was too slow 😬. So we visualized$^\star$ our goal and worked towards it. We also spread the message through many contributors (kudos to Jalina, Chamod, Nawoda, and Chanuka).

(a Discord message from the Aya server)

So we accelerated our pace! 🇱🇰

Over the remaining 3 months, all the Sinhala contributors helped immensely to reach our goals, and we even surpassed our original targets!

The pictures below capture some of the initiatives we took to gather more contributors..