A.X K1: SK Telecom's 519B Parameter Reasoning Model
Overview
SK Telecom has officially released A.X K1, a 519-billion-parameter Mixture-of-Experts (MoE) large language model. As an intern at the Omnimodal Foundation Model Office, I had the opportunity to contribute to this project, working specifically on enhancing the model's reasoning capabilities.
The model is now available on HuggingFace and GitHub.
Model Architecture
A.X K1 employs a Mixture-of-Experts architecture that balances computational efficiency with model capacity:
- Total Parameters: 519 billion
- Active Parameters: 33 billion per token
- Architecture: Decoder-only Transformer with MoE layers
- Layers: 61 (1 dense + 60 MoE)
- Experts: 192 experts + 1 shared expert (8+1 active per token)
- Context Length: 128K tokens (~100,000 Korean words)
- Vocabulary Size: 163,840 tokens
The MoE design lets the model retain the knowledge capacity of all 519B parameters while activating only 33B per token, making inference significantly more efficient than a comparably sized dense model (see the routing sketch below).
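To make the "only 33B of 519B parameters active per token" point concrete, here is a minimal, illustrative sketch of top-k expert routing with a shared expert in PyTorch. The layer sizes, gating scheme, and routing details are assumptions chosen for readability; they are not A.X K1's actual hyperparameters or implementation.

```python
# Toy MoE layer loosely mirroring "192 routed experts + 1 shared expert,
# top-8 active per token". Dimensions are tiny and purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=192, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Routed experts: each is a small feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Shared expert: applied to every token regardless of routing.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        logits = self.router(x)                         # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # choose top-8 experts per token
        weights = F.softmax(weights, dim=-1)            # normalize over the selected experts

        out = self.shared_expert(x)                     # shared expert always runs
        for slot in range(self.top_k):
            for expert_id in idx[:, slot].unique():
                mask = idx[:, slot] == expert_id        # tokens sent to this expert in this slot
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[expert_id](x[mask])
        return out


tokens = torch.randn(4, 64)          # 4 tokens, d_model=64
print(ToyMoELayer()(tokens).shape)   # torch.Size([4, 64]); only 8+1 experts touched per token
```

The key property is that each token's compute cost scales with the 8 selected experts plus the shared one, not with all 192, which is how the full parameter count stays large while per-token FLOPs stay modest.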
Benchmark Performance
A.X K1 demonstrates strong performance across reasoning benchmarks, particularly in mathematics and coding:
| Benchmark | A.X K1 | DeepSeek-V3.1 | Relative (A.X K1 / DeepSeek-V3.1) |
|---|---|---|---|
| AIME25 (Math) | 89.8 | 88.4 | 102% |
| LiveCodeBench (EN) | 75.8 | 69.5 | 109% |
| LiveCodeBench (KO) | 73.1 | 66.2 | 110% |
The model excels in Korean coding tasks, achieving 110% of DeepSeek-V3.1’s performance on the Korean LiveCodeBench.
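For clarity, the relative column is simply the ratio of the two scores expressed as a rounded percentage; a quick sanity check in Python reproduces the figures above:

```python
# Derive the "Relative" column: A.X K1 score / DeepSeek-V3.1 score, as a percentage.
scores = {
    "AIME25 (Math)":      (89.8, 88.4),
    "LiveCodeBench (EN)": (75.8, 69.5),
    "LiveCodeBench (KO)": (73.1, 66.2),
}
for name, (ax_k1, deepseek) in scores.items():
    print(f"{name}: {round(100 * ax_k1 / deepseek)}%")
# AIME25 (Math): 102%
# LiveCodeBench (EN): 109%
# LiveCodeBench (KO): 110%
```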
Hybrid Reasoning: Think/Non-Think Modes
One of A.X K1’s key features is its hybrid reasoning control system. The model supports two reasoning modes:
- Think Mode: Enables extended chain-of-thought reasoning for complex problems
- Non-Think Mode: Provides direct responses for straightforward queries
This flexibility allows users to balance reasoning depth with response latency based on their specific use case.
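As a concrete, heavily hedged illustration of how such a switch is typically exposed, here is a sketch using the HuggingFace transformers chat-template interface. The repository id and the `enable_thinking` flag are assumptions modeled on other hybrid-reasoning releases, not confirmed details of A.X K1's API; check the model card on HuggingFace for the actual identifiers and toggle.

```python
# Hypothetical usage sketch: toggling think/non-think mode at inference time.
# The model id and `enable_thinking` flag below are placeholders, not verified.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "skt/A.X-K1"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Prove that the sum of two even numbers is even."}]

# Think mode: the chat template inserts the reasoning scaffold, so the model
# emits an extended chain of thought before its final answer.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

# Non-think mode: pass enable_thinking=False for a direct, lower-latency reply.
```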
Training Efficiency
The A.X K1 team achieved remarkable efficiency in developing this model:
- Development Time: Approximately 4 months
- Training Data: ~10 trillion tokens
- Data Sources: Web content, code, STEM fields, reasoning datasets
This suggests that, with the right architecture choices and training strategies, competitive large-scale models can be developed in a matter of months and with relatively limited GPU resources.
Acknowledgments
This work was conducted during my internship at SK Telecom’s Omnimodal Foundation Model Office.