Empowering Bangla Conversational AI: Building a Multi-Turn Conversation Dataset

Data is one of the most important resources for any artificial intelligence (AI) system. At the same time, the scarcity of data is one of the most significant barriers to developing state-of-the-art AI tools for low-resource languages such as Bangla, and it makes designing and developing AI-based applications for Bangla challenging. Natural language processing (NLP) is one of the fields of AI in which Bangla data resources are especially limited. Conversational agents are leading recent trends in NLP applications. Trained on enormous amounts of data, conversational agents are achieving near-human performance in English. The development of conversational agents in Bangla, by contrast, has yet to be adequately addressed. To the best of our knowledge, no multi-turn Bangla conversation dataset is available. A few initiatives have produced question-answering datasets and models, but these do not support multi-turn interactions, a crucial feature of advanced conversational agents.

To address this gap, our Data and Design Lab is designing and developing a multi-turn conversation dataset in Bangla. Our primary objective is to create a comprehensive corpus of multi-turn conversations. This dataset will be instrumental in developing a conversational agent capable of holding coherent and contextually relevant multi-turn conversations in Bangla. By leveraging this dataset, we aim to advance the state of conversational AI for Bangla, enabling more natural and effective interactions between users and AI systems. Our efforts are focused not only on filling the existing data gap but also on laying a foundation for future research and development in Bangla-language AI applications.