Teuken-7B and Beyond: Shaping the Multilingual Future of LLMs

Save the Date!

June 21st, 2025, 15:00–16:00 – Munich🥨NLP Discord Server.

About this Event

The OpenGPT-X project’s Teuken-7B is a multilingual large language model supporting all 24 official EU languages, powered by a custom tokenizer optimized for efficiency and coverage. Licensed under Apache-2.0, it invites global collaboration and performs strongly on benchmarks such as ARC, HellaSwag, and TruthfulQA across 21 languages, with an energy-efficient design that reduces computational costs, particularly for non-English tasks. The European LLM Leaderboard offers transparent, granular insights into its cross-linguistic performance.

Looking ahead, the next generation of models will leverage JQL (Judging Quality Across Languages), combining human curation and automated filtering to create diverse, high-quality datasets for 35 languages. Built on the Modalities framework for distributed training across thousands of GPUs, these models will advance mathematical reasoning, domain-specific knowledge, and equitable language performance, redefining the future of inclusive, scalable AI.
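To make the "human curation plus automated filtering" idea concrete, here is a minimal sketch of threshold-based corpus filtering. Everything in it (the `quality_score` heuristic, function names, and threshold) is illustrative and assumed for this example; the actual JQL pipeline uses learned multilingual quality judges, not this toy scorer.

```python
def quality_score(doc: str) -> float:
    """Toy stand-in for a learned quality judge: rewards longer documents
    that end in terminal punctuation. NOT the real JQL scoring model."""
    if not doc.strip():
        return 0.0
    words = doc.split()
    length_signal = min(len(words) / 50.0, 1.0)          # saturate at 50 words
    ends_cleanly = 1.0 if doc.rstrip().endswith((".", "!", "?")) else 0.5
    return length_signal * ends_cleanly

def filter_corpus(docs: list[str], threshold: float = 0.3) -> list[str]:
    """Keep only documents whose quality score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "Short fragment",
    "A complete, well-formed sentence that would plausibly pass a quality "
    "judge because it is long enough and ends with terminal punctuation.",
]
kept = filter_corpus(docs)  # only the second document survives
```

In a real multilingual pipeline, the scorer would be a model distilled from human quality annotations and applied per language, so that filtering thresholds stay comparable across all 35 target languages.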

Speakers

Dr. Michael Fromm

Dr. Michael Fromm is a Senior Research Scientist at Fraunhofer IAIS, a leading institute in Artificial Intelligence, Machine Learning, and Big Data. He specializes in the full pipeline for Large Language Models (LLMs), from raw text processing to pretraining and evaluation. As a key contributor to the OpenGPT-X project, he plays a significant role in the development of Teuken-7B, a multilingual LLM supporting all 24 official EU languages.