SKT Unveils Multimodal and General-Purpose Document Understanding Technology Built on Its In-House LLM
- Published: 2025-07-29 10:22:19
- Updated: 2025-07-29 10:22:19
SK Telecom has released two models to the open-source community on Hugging Face: the 'A.X Encoder' and the 'A.X 4.0 Vision Language Light (VL Light)'. Both models are freely available for academic research and commercial use.
In July, SKT introduced two A.X 4.0 models (standard and lightweight) built on large-scale continual pre-training (CPT), followed by two A.X 3.1 models (standard and lightweight) developed from scratch. With the two new models, which are intended to broaden the use of LLMs in industry, the company has announced six models in total.
SKT plans to keep enhancing the usability and performance of its LLMs, including an A.X 4.0 reasoning model to be announced later, while steadily continuing to develop LLMs from scratch (building the entire model from the initial stage).
In natural language processing, an encoder is a core component that converts an input sentence into a contextual representation: it models the interrelations among all the words in the sentence to grasp its overall meaning and context, which underpins a wide range of NLP tasks.
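As an illustration of how such an encoder is used, the sketch below passes a sentence through a generic Hugging Face encoder and reads out one contextual vector per token. The model ID is a hypothetical placeholder, not a confirmed SKT repository name; any BERT-style encoder exposes the same interface.

```python
# Sketch: an encoder maps a sentence to contextual vectors, one per token.
# The model ID below is an assumption, not a confirmed SKT repo name.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "skt/A.X-Encoder-base"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("SKT released new open-source models.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per input token: shape (1, seq_len, hidden_size).
print(outputs.last_hidden_state.shape)
```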
SK Telecom developed the A.X Encoder to apply across the entire data-processing pipeline required for its A.X models. Because it can process long documents quickly and efficiently, the A.X Encoder is well suited to large-scale LLM training.
The A.X Encoder runs on 149 million (149M) parameters. It achieved an average score of 85.47 on natural language understanding benchmarks, confirming performance at the global state-of-the-art (SOTA) level and surpassing the 80.19 points of 'RoBERTa-base', which the KLUE team released based on existing global open-source models.
The A.X Encoder can process up to 16,384 tokens, delivering up to 3 times the inference speed and 2 times the training speed of existing models. Conventional encoders typically handle only 512 tokens, roughly a sentence or a paragraph, whereas the A.X Encoder processes much larger contexts quickly and efficiently.
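To make the context-window difference concrete, the minimal sketch below tokenizes the same long document under a conventional 512-token limit (which forces chunking) and under a 16,384-token window (a single pass). The model ID is again an assumption, and the chunk counts depend on the document.

```python
# Sketch: a 512-token encoder must chunk a long document,
# while a 16,384-token encoder processes it in one pass.
# The model ID is an assumption, not a confirmed SKT repo name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("skt/A.X-Encoder-base")

long_document = "긴 문서의 예시 문장입니다. " * 5000  # stand-in for a long document

# Conventional limit: split into overlapping 512-token chunks.
chunks = tokenizer(long_document, truncation=True, max_length=512,
                   return_overflowing_tokens=True, stride=64)
print("512-token chunks needed:", len(chunks["input_ids"]))

# Long-context window: the same document fits in a single pass.
single = tokenizer(long_document, truncation=True, max_length=16384)
print("tokens in single pass:", len(single["input_ids"]))
```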
This large-scale, high-speed document processing technology is expected to apply efficiently not only to LLM training but also to a wide range of AI-based document processing.
The A.X 4.0 VL Light is a vision-language model (VLM) trained on a large-scale multimodal Korean dataset. It delivers strong performance not only in understanding Korean visual and linguistic information but also in enterprise applications such as interpreting tables, graphs, and manufacturing drawings.
Built on the A.X 4.0 Light model with 7 billion (7B) parameters, it is easy to deploy in user systems while delivering performance on par with mid-sized models.
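For a sense of how a VLM of this kind is typically invoked, here is a minimal sketch in the standard Hugging Face vision-to-sequence style. The repo ID, chat-template format, and image file name are all assumptions for illustration, not SKT's confirmed interface.

```python
# Sketch: asking a vision-language model about a chart image.
# Repo ID and prompt format are assumptions based on common
# Hugging Face VLM conventions, not SKT's confirmed interface.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "skt/A.X-4.0-VL-Light"  # hypothetical placeholder
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("sales_chart.png")  # hypothetical input image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Summarize this chart in Korean."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```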
The A.X 4.0 VL Light recorded an average of 79.4 points on the Korean visual benchmark, outperforming Qwen2.5-VL-32B (73.4 points) despite being far smaller in model size. It also recorded an average of 60.2 points on the Korean text benchmark, ranking at the top among domestic models despite being a lightweight model.
It scored 80.2 points on K-Viscuit, a multimodal benchmark designed to evaluate understanding of Korean culture and context, and 89.8 points on the KoBizDoc benchmark, which focuses on understanding complex document structures, charts, and tables. Both scores match or exceed those of the Qwen2.5-VL-32B model (72.3 and 88.8 points, respectively).
When given the same Korean input, the A.X 4.0 VL Light uses about 41% fewer text tokens than Qwen2.5-VL-32B, which can help companies reduce costs.
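A token-efficiency claim of this kind can be checked in a few lines by tokenizing the same Korean text with both tokenizers. In the sketch below, the A.X repo ID is a placeholder and the Qwen repo name follows its public Hugging Face listing; the actual ratio varies with the input text.

```python
# Sketch: comparing Korean token counts between two tokenizers to
# illustrate the token-efficiency claim. The A.X repo ID is an
# assumption; results will vary with the input text.
from transformers import AutoTokenizer

korean_text = "오늘 회의에서 결정된 사항을 정리해 주세요."

ax_tok = AutoTokenizer.from_pretrained("skt/A.X-4.0-VL-Light")  # hypothetical
qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")

ax_count = len(ax_tok(korean_text)["input_ids"])
qwen_count = len(qwen_tok(korean_text)["input_ids"])
print(f"A.X tokens: {ax_count}, Qwen tokens: {qwen_count}")
print(f"token reduction: {1 - ax_count / qwen_count:.0%}")
```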
Tae-Yoon Kim, head of SK Telecom's Foundation Model division, said, “As securing independent technology is the core of sovereign AI, we will enhance our capabilities and accelerate collaboration with consortium companies to secure global top-level AI competitiveness.”
Min-Kwon Jang, Reporter (mkchang@fnnews.com)