State of the Field
Recent work in multilingual natural language processing focuses on efficiency and adaptability across languages and domains. Convolutional architectures are proving competitive with large transformer models on specific tasks, such as classifying proper names by language and entity type, while reducing processing time and energy consumption. New encoder families such as MrBERT, built on the ModernBERT architecture at 150M-300M parameters, are being adapted to localized linguistic tasks and specialized domains, which points to cost-effective deployment in high-stakes applications. Datasets like BIRDTurk, a Turkish adaptation of the BIRD text-to-SQL benchmark, expose the challenges that morphologically rich, low-resource languages pose for text-to-SQL systems and provide a basis for evaluating cross-lingual performance. Research on cross-lingual classification of social media data emphasizes content filtering strategies that manage the noise inherent in multilingual discourse. Together, these efforts address commercial demand for scalable, efficient multilingual systems in global communication and data analysis.
Papers
Efficient Multilingual Name Type Classification Using Convolutional Networks
We present a convolutional neural network approach for classifying proper names by language and entity type. Our model, Onomas-CNN X, combines parallel convolution branches with depthwise-separable op...
MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves sta...
BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish
Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the fi...
Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investiga...
When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English
Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, w...