Llama-Thunder-LLM: Building a Korean-Specialized Large Language Model
Overview
I’m excited to share our research on building a Korean-specialized large language model. Our work, conducted at the Thunder Research Group at Seoul National University under Prof. Jaejin Lee, was recently featured in DigitalToday.
We developed three key components:
- Llama-Thunder-LLM - A Korean-specialized LLM based on Llama
- Thunder-Tok - An efficient Korean tokenizer
- Thunder-LLM Korean Benchmark - Evaluation datasets for Korean LLMs
Motivation
While interest in Korean-specialized language models has been growing domestically, the limited availability of data and the enormous costs involved make research and development challenging for small research institutions and universities. Our goal was to demonstrate that academic institutions can independently develop high-quality Korean LLMs.
Llama-Thunder-LLM
We collected and preprocessed 3TB of Korean web data and applied continual learning and post-training techniques to the base Llama model. Key highlights:
- Training data: 102B tokens with a 1:1 Korean-English ratio
- Approach: Maintaining English performance while enhancing Korean capabilities
- Result: The instruction-tuned 8B model achieved an average score of 65.0 on Korean benchmarks, outperforming models of comparable size
Thunder-Tok: Efficient Korean Tokenizer
Korean is an agglutinative language whose grammatical characteristics standard tokenizers don’t handle efficiently. We developed Thunder-Tok with:
- 44% token reduction compared to the original Llama tokenizer
- Morpheme-based preprocessing
- Language-specific techniques for Korean text
This means the same Korean document can be represented with significantly fewer tokens, improving both inference speed and training efficiency.
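To make the efficiency claim concrete, here is a minimal sketch of how tokenizer compactness is typically quantified: fractional token reduction against a baseline, and "fertility" (tokens per character). The counts below are hypothetical placeholders chosen only to illustrate a ~44% reduction, not measured values from our experiments.

```python
# Sketch: quantifying tokenizer efficiency on a Korean corpus.
# All counts below are hypothetical placeholders, not measured values.

def token_reduction(baseline_tokens: int, new_tokens: int) -> float:
    """Fractional reduction in token count relative to a baseline tokenizer."""
    return 1.0 - new_tokens / baseline_tokens

def fertility(token_count: int, char_count: int) -> float:
    """Average tokens per character: lower means a more compact encoding."""
    return token_count / char_count

# Hypothetical example: a 10,000-character Korean document.
chars = 10_000
llama_tokens = 9_000      # placeholder count from the baseline Llama tokenizer
thunder_tokens = 5_040    # placeholder count from Thunder-Tok (~44% fewer)

print(f"reduction: {token_reduction(llama_tokens, thunder_tokens):.0%}")
print(f"baseline fertility:   {fertility(llama_tokens, chars):.2f} tokens/char")
print(f"Thunder-Tok fertility: {fertility(thunder_tokens, chars):.2f} tokens/char")
```

Lower fertility translates directly into shorter sequences, which is why the same reduction improves both inference speed and the effective amount of text seen per training step.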
Thunder-LLM Korean Benchmark
To properly evaluate Korean LLM performance, we created a comprehensive benchmark:
- Translation + Expert Review: We machine-translated representative English benchmarks into Korean, then had domain experts manually correct and localize them
- Ko-LAMBADA: A newly designed dataset for evaluating literary context understanding in Korean, focusing on predicting important nouns within Korean sentences
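To illustrate the evaluation style, here is a minimal sketch of a LAMBADA-style cloze evaluation loop: the model must produce the missing noun given its surrounding context, scored by exact match. The item format, the example sentences, and the `dummy_predict` stand-in are all hypothetical; the actual Ko-LAMBADA data format may differ.

```python
# Sketch of a LAMBADA-style cloze evaluation: predict the missing noun
# from context. Data format and predictor are hypothetical illustrations.
from typing import Callable

# Each item: a context with a blank and the gold noun to predict.
items = [
    {"context": "그는 매일 아침 ___을 마셨다.", "answer": "커피"},
    {"context": "아이들은 공원에서 ___을 가지고 놀았다.", "answer": "공"},
]

def evaluate(predict: Callable[[str], str], dataset: list[dict]) -> float:
    """Exact-match accuracy of a predictor over cloze items."""
    correct = sum(predict(item["context"]) == item["answer"] for item in dataset)
    return correct / len(dataset)

# Trivial baseline predictor standing in for a real LLM call.
def dummy_predict(context: str) -> str:
    return "커피"

print(f"accuracy: {evaluate(dummy_predict, items):.2f}")  # 1 of 2 correct → 0.50
```

The key design point is that scoring targets nouns carrying the passage's meaning, so a model cannot succeed on local n-gram statistics alone; it has to track the broader context.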
Significance
As Prof. Jaejin Lee noted, this research demonstrates that academic institutions can independently develop LLMs, contributing to Korea’s “Sovereign AI” capabilities. We’ve made the model, tokenizer, and benchmark publicly available, along with detailed documentation of our development process, enabling follow-up and reproduction research.
Resources
All resources are available at the Supercomputing AI Model and Platform Optimization Center website.
This research was supported by the National Research Foundation of Korea.