Bridging AI and Local Languages: The Moroccan Darija Dataset

Nov 29, 2024

Smartly.AI is happy to announce a new initiative to tackle the challenges of regional languages.

Moroccan Darija, spoken by over 33.5 million people, is at the heart of our mission to create chatbots that communicate naturally and effectively. Our goal is simple: break down language barriers and make technology accessible to everyone.

Standard languages like English or French often hinder effective communication for
many users. Smartly’s fine-tuned AI models enable chatbots to understand and respond
accurately in Darija, whether written in Arabic script or Latin transcription (Arabizi).
This helps businesses build trust by offering clear and precise responses in their customers’
native language.


Our solution also simplifies global interactions. A Moroccan client can ask a question
in Darija and receive a reply in the same language, while the English-speaking company
processes the query directly in English through seamless, real-time translation. This
eliminates communication barriers and ensures smooth interactions.


However, existing AI models such as OpenAI’s ChatGPT and Meta’s LLaMA struggle with Darija, as the examples below show.


Figure 1: A Question in Darija Misinterpreted by ChatGPT
This example shows how ChatGPT fails to interpret a question in Darija and responds with an irrelevant answer.


Figure 2: Example of a Question in Darija Processed by LLaMA 3 8B
Similarly, this example shows LLaMA confusing Darija with Dutch.


To overcome these limitations, Smartly.ai fine-tunes existing AI models using a dedicated Moroccan Darija dataset.
This approach adapts the models to understand the mix of Arabic script, Arabizi, and localized expressions unique to Darija.
Our Moroccan Darija dataset includes real-world examples in Arabic script and Latin transcription, with aligned translations in English and Modern Standard Arabic. This ensures accurate and context-aware responses.
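To make the structure described above concrete, here is a minimal Python sketch of what one aligned record could look like, along with a simple check for whether a message is written in Arabic script or Arabizi. The field names, the `DarijaExample` class, and the `detect_script` helper are assumptions made for illustration, not the dataset's actual schema or Smartly's pipeline, and the sample phrase is an illustrative one rather than an entry taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DarijaExample:
    """One aligned record. Field names are hypothetical, not the real schema."""
    darija_arabic: str  # Darija written in Arabic script
    darija_latin: str   # the same sentence in Latin transcription (Arabizi)
    english: str        # aligned English translation
    msa: str            # aligned Modern Standard Arabic translation


def detect_script(text: str) -> str:
    """Return 'arabic', 'latin', 'mixed', or 'unknown' for a piece of text.

    Counts characters in the basic Arabic Unicode block (U+0600 to U+06FF)
    and ASCII letters. Digits used as letters in Arabizi (3, 7, 9) are
    deliberately ignored to keep the sketch simple.
    """
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06ff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if arabic and latin:
        return "mixed"
    if arabic:
        return "arabic"
    if latin:
        return "latin"
    return "unknown"


# Illustrative record (not taken from the dataset).
example = DarijaExample(
    darija_arabic="شكرا بزاف",
    darija_latin="chokran bzaf",
    english="thank you very much",
    msa="شكرا جزيلا",
)

print(detect_script(example.darija_arabic))
print(detect_script(example.darija_latin))
```

A check like this could route an incoming message to the matching column when building fine-tuning pairs. A production system would likely need something more robust for messages that mix both scripts.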

Discover the dataset and contribute here: 👉 SmartlyAI Moroccan Darija Dataset 👈



Figure 3: Example of the Open-Sourced Data
This shows how transliteration from Arabic to Latin script enhances the dataset’s diversity and usability.


Starting with Moroccan Darija, Smartly.ai aims to expand to other local dialects, fostering digital inclusion across North Africa and beyond. Our tailored solutions will support industries like banking, telecommunications, and education, ensuring effective multilingual communication.
AI should empower people, not exclude them. With Smartly.ai, we’re building a future where language is no longer a barrier to opportunity or connection.



Join us in refining and expanding this dataset to make AI more inclusive for local languages
like Moroccan Darija. Together, we can shape the future of accessible AI.