Llama-Thunder-LLM: Building a Korean-Specialized Large Language Model

Overview

I’m excited to share our research on building a Korean-specialized large language model. Our work, conducted at the Thunder Research Group at Seoul National University under Prof. Jaejin Lee, was recently featured in DigitalToday.

We developed three key components:

  1. Llama-Thunder-LLM - A Korean-specialized LLM based on Llama
  2. Thunder-Tok - An efficient Korean tokenizer
  3. Thunder-LLM Korean Benchmark - Evaluation datasets for Korean LLMs

Motivation

While interest in Korean-specialized language models has been growing domestically, the limited availability of data and the enormous costs involved make research and development challenging for small research institutions and universities. Our goal was to demonstrate that academic institutions can independently develop high-quality Korean LLMs.

Llama-Thunder-LLM

We collected and preprocessed 3TB of Korean web data and applied continual learning and post-training techniques to the base Llama model. Key highlights:

  • Training data: 102B tokens with a 1:1 Korean-English ratio (a data-mixing sketch follows this list)
  • Approach: Maintaining English performance while enhancing Korean capabilities
  • Result: The instruction-tuned 8B model achieved an average score of 65.0 on Korean benchmarks, outperforming comparable models
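
To make the 1:1 Korean-English ratio concrete, here is a minimal data-mixing sketch. It is an illustration only, not our actual training pipeline; the document iterables and the token-counting callable are hypothetical placeholders.

```python
import random

def mix_corpora(korean_docs, english_docs, count_tokens, target_tokens):
    """Interleave documents so the mix stays roughly 1:1 Korean:English by token count.

    korean_docs / english_docs: iterables of raw text documents (hypothetical inputs).
    count_tokens: callable returning a document's token count (e.g. a tokenizer).
    target_tokens: total token budget for the mixed stream.
    """
    ko_iter, en_iter = iter(korean_docs), iter(english_docs)
    ko_tokens = en_tokens = 0
    mixed = []
    while ko_tokens + en_tokens < target_tokens:
        # Always draw from the language that is currently behind,
        # which keeps the running ratio close to 1:1.
        if ko_tokens <= en_tokens:
            doc = next(ko_iter, None)
            if doc is None:
                break
            ko_tokens += count_tokens(doc)
        else:
            doc = next(en_iter, None)
            if doc is None:
                break
            en_tokens += count_tokens(doc)
        mixed.append(doc)
    random.shuffle(mixed)  # shuffle so training batches are not language-blocked
    return mixed, ko_tokens, en_tokens
```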

Thunder-Tok: Efficient Korean Tokenizer

Korean is an agglutinative language, and standard tokenizers handle its particles and verb endings inefficiently. We developed Thunder-Tok with:

  • 44% token reduction compared to the original Llama tokenizer
  • Morpheme-based preprocessing
  • Language-specific techniques for Korean text

This means the same Korean document can be represented with significantly fewer tokens, improving both inference speed and training efficiency.
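
A token-reduction figure like the 44% above can be checked by tokenizing the same Korean text with both tokenizers and comparing counts. The sketch below uses Hugging Face's AutoTokenizer; both model identifiers are placeholders to substitute with the tokenizers you want to compare.

```python
from transformers import AutoTokenizer

def token_reduction(text: str, base_id: str, new_id: str) -> float:
    """Return the fractional token reduction of `new_id` relative to `base_id`."""
    base_tok = AutoTokenizer.from_pretrained(base_id)
    new_tok = AutoTokenizer.from_pretrained(new_id)
    n_base = len(base_tok.encode(text, add_special_tokens=False))
    n_new = len(new_tok.encode(text, add_special_tokens=False))
    return 1.0 - n_new / n_base

sample = "서울대학교 천둥 연구 그룹은 한국어에 특화된 거대 언어 모델을 개발했다."
# Both identifiers below are placeholders; point them at the actual tokenizers.
reduction = token_reduction(sample, "path/to/llama-tokenizer", "path/to/thunder-tok")
print(f"Token reduction: {reduction:.1%}")
```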

Thunder-LLM Korean Benchmark

To properly evaluate Korean LLM performance, we created a comprehensive benchmark:

  • Translation + Expert Review: We machine-translated representative English benchmarks, then had domain experts manually correct and localize them
  • Ko-LAMBADA: A newly designed dataset for evaluating literary-context understanding in Korean, focused on predicting important nouns within Korean sentences (a minimal evaluation sketch follows this list)
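
To make the Ko-LAMBADA task concrete, the sketch below scores a causal LM on a tiny, made-up example: the model sees a sentence truncated before the target noun and must produce that noun. The item format, the example, and the model identifier are all assumptions for illustration; the actual dataset format may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def noun_prediction_accuracy(items, model_id):
    """Fraction of items whose greedy continuation starts with the target noun.

    `items` is a list of dicts with a `context` (sentence truncated before the
    target noun) and a `target` (the noun to predict) -- a hypothetical format.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    correct = 0
    for item in items:
        inputs = tokenizer(item["context"], return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
        continuation = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        correct += continuation.strip().startswith(item["target"])
    return correct / len(items)

# A made-up item: the model should fill in the missing noun "우산" ("umbrella").
items = [{"context": "비가 오는 것을 보고 그는 현관에서", "target": "우산"}]
print(noun_prediction_accuracy(items, "path/to/korean-llm"))  # model id is a placeholder
```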

Significance

As Prof. Jaejin Lee noted, this research demonstrates that academic institutions can independently develop LLMs, contributing to Korea’s “Sovereign AI” capabilities. We’ve made the model, tokenizer, and benchmark publicly available, along with detailed documentation of our development process, enabling follow-up research and reproduction of our results.

Resources

All resources are available at the Supercomputing AI Model and Platform Optimization Center website.


This research was supported by the National Research Foundation of Korea.