A.X K1: SK Telecom's 519B Parameter Reasoning Model
Overview
SK Telecom has officially released A.X K1, a 519-billion-parameter Mixture-of-Experts (MoE) large language model. As an intern at the Omnimodal Foundation Model Office, I had the opportunity to contribute to this project, working specifically on enhancing the model's reasoning capabilities.
The model is now available on HuggingFace and GitHub.
Model Architecture
A.X K1 employs a Mixture-of-Experts architecture that balances computational efficiency with model capacity:
- Total Parameters: 519 billion
- Active Parameters: 33 billion per token
- Architecture: Decoder-only Transformer with MoE layers
- Layers: 61 (1 dense + 60 MoE)
- Experts: 192 experts + 1 shared expert (8+1 active per token)
- Context Length: 128K tokens (~100,000 Korean words)
- Vocabulary Size: 163,840 tokens
The MoE design lets the model retain the knowledge capacity of all 519B parameters while activating only 33B per token, making inference significantly more efficient than a comparably sized dense model (see the routing sketch below).
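To make the "only 33B of 519B parameters active per token" point concrete, here is a minimal, illustrative sketch of top-k expert routing with a shared expert in PyTorch. The layer sizes, gating scheme, and routing details are assumptions chosen for readability; they are not A.X K1's actual hyperparameters or implementation.

```python
# Toy MoE layer loosely mirroring "192 routed experts + 1 shared expert,
# top-8 active per token". Dimensions are tiny and purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=192, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Routed experts: each is a small feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Shared expert: applied to every token regardless of routing.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        logits = self.router(x)                         # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # choose top-8 experts per token
        weights = F.softmax(weights, dim=-1)            # normalize over the selected experts

        out = self.shared_expert(x)                     # shared expert always runs
        for slot in range(self.top_k):
            for expert_id in idx[:, slot].unique():
                mask = idx[:, slot] == expert_id        # tokens sent to this expert in this slot
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[expert_id](x[mask])
        return out


tokens = torch.randn(4, 64)          # 4 tokens, d_model=64
print(ToyMoELayer()(tokens).shape)   # torch.Size([4, 64]); only 8+1 experts touched per token
```

The key property is that each token's compute cost scales with the 8 selected experts plus the shared one, not with all 192, which is how the full parameter count stays large while per-token FLOPs stay modest.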
Benchmark Performance
A.X K1 demonstrates strong performance across reasoning benchmarks, particularly in mathematics and coding:
| Benchmark | A.X K1 | DeepSeek-V3.1 | Relative (A.X K1 / DeepSeek-V3.1) |
|---|---|---|---|
| AIME25 (Math) | 89.8 | 88.4 | 102% |
| LiveCodeBench (EN) | 75.8 | 69.5 | 109% |
| LiveCodeBench (KO) | 73.1 | 66.2 | 110% |
The model excels in Korean coding tasks, achieving 110% of DeepSeek-V3.1’s performance on the Korean LiveCodeBench.
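For clarity, the relative column is simply the ratio of the two scores expressed as a rounded percentage; a quick sanity check in Python reproduces the figures above:

```python
# Derive the "Relative" column: A.X K1 score / DeepSeek-V3.1 score, as a percentage.
scores = {
    "AIME25 (Math)":      (89.8, 88.4),
    "LiveCodeBench (EN)": (75.8, 69.5),
    "LiveCodeBench (KO)": (73.1, 66.2),
}
for name, (ax_k1, deepseek) in scores.items():
    print(f"{name}: {round(100 * ax_k1 / deepseek)}%")
# AIME25 (Math): 102%
# LiveCodeBench (EN): 109%
# LiveCodeBench (KO): 110%
```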
Hybrid Reasoning: Think/Non-Think Modes
One of A.X K1’s key features is its hybrid reasoning control system. The model supports two reasoning modes:
- Think Mode: Enables extended chain-of-thought reasoning for complex problems
- Non-Think Mode: Provides direct responses for straightforward queries
This flexibility allows users to balance reasoning depth with response latency based on their specific use case.
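As a concrete, heavily hedged illustration of how such a switch is typically exposed, here is a sketch using the HuggingFace transformers chat-template interface. The repository id and the `enable_thinking` flag are assumptions modeled on other hybrid-reasoning releases, not confirmed details of A.X K1's API; check the model card on HuggingFace for the actual identifiers and toggle.

```python
# Hypothetical usage sketch: toggling think/non-think mode at inference time.
# The model id and `enable_thinking` flag below are placeholders, not verified.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "skt/A.X-K1"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Prove that the sum of two even numbers is even."}]

# Think mode: the chat template inserts the reasoning scaffold, so the model
# emits an extended chain of thought before its final answer.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

# Non-think mode: pass enable_thinking=False for a direct, lower-latency reply.
```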
Training Efficiency
The A.X K1 team achieved remarkable efficiency in developing this model:
- Development Time: Approximately 4 months
- Training Data: ~10 trillion tokens
- Data Sources: Web content, code, STEM fields, reasoning datasets
This suggests that, with the right architecture choices and training strategies, competitive large-scale models can be developed in a matter of months and with relatively limited GPU resources.
Acknowledgments
This work was conducted during my internship at SK Telecom’s Omnimodal Foundation Model Office.