- πOverview
- π§Installation
- πQuick Start
- πAcknowledgement
- π©Citation
SkillX is a fully automated framework that constructs a reusable, plug-and-play skill knowledge base for LLM agents from experience.
Instead of storing raw trajectories, workflows, or loosely structured reflections, SkillX distills agent experience into a three-level skill hierarchy:
- Planning Skills for high-level task organization
- Functional Skills for reusable tool-based subroutines
- Atomic Skills for execution-oriented tool usage patterns
Built with a strong backbone agent, SkillX produces a transferable skill library that can be directly plugged into weaker base agents and new environments. Across challenging long-horizon, user-interactive benchmarks such as AppWorld, BFCL-v3, and Ο2-Bench, SkillX consistently improves both task success and execution efficiency.
SkillX transforms raw trajectories into a structured three-tier skill space:
- Planning Skills capture high-level decomposition and ordering
- Functional Skills represent reusable multi-step tool subroutines
- Atomic Skills encode practical tool usage constraints and patterns
SkillX provides an end-to-end automated pipeline that:
- rolls out agents on training tasks,
- extracts reusable skills from successful trajectories,
- consolidates and filters low-quality skills,
- and builds a reusable plug-and-play skill knowledge base.
SkillX continuously improves the skill library through:
- skill merging for consolidating redundant behaviors,
- quality filtering for removing brittle or hallucinated skills,
- and iterative updates that add, modify, or keep skills based on execution feedback.
Beyond seed demonstrations, SkillX proactively discovers new skills by:
- identifying under-used and failure-prone tools,
- guiding environment exploration,
- synthesizing new tasks from exploratory trajectories,
- and expanding skill coverage beyond the original training distribution.
The resulting skill library can be directly injected into different base agents, enabling strong-to-weak transfer without retraining the underlying model.
SkillX consistently improves:
- task success rate on challenging benchmarks,
- execution efficiency by reducing unnecessary exploration and tool misuse,
- and generalization through structured, reusable experience abstraction.
- ~10% absolute improvement for weaker base agents on multiple benchmarks
- Consistent gains on AppWorld, BFCL-v3, and Ο2-Bench
- Stronger transferability than trajectory-based, workflow-based, and memory-based baselines
- Improved execution efficiency with fewer redundant steps
- Effective even when the skill library is built by a stronger model and used by weaker ones
Existing experience-learning methods often suffer from:
- Isolated learning: agents repeatedly rediscover similar behaviors
- Weak transferability: raw trajectories and reflections often do not generalize well
- Capability bottlenecks: self-extracted experience is limited by the agentβs own strength
SkillX addresses these issues by building a structured skill knowledge base that is:
- reusable across tasks
- transferable across agents
- lightweight to retrieve
- easy to inject into prompts
- more robust than long-context progressive skill formats
SkillX consists of three core components:
From successful trajectories, SkillX automatically extracts:
- Planning skills: concise, reusable task plans
- Functional skills: reusable tool-composition procedures
- Atomic skills: tool-specific usage guidance, constraints, and failure notes
SkillX improves library quality through:
- Skills Merge: cluster and consolidate similar skills
- Skills Filter: remove non-portable, hallucinated, or invalid skills
- Skills Update: add, modify, or keep skills across iterations
SkillX expands beyond observed demonstrations by:
- guiding exploration toward under-covered tools and failure modes,
- synthesizing new tasks from exploration,
- and rerunning extraction + refinement to grow the skill library.
SkillX improves agentic performance across multiple LLM backbones and benchmarks.
- On Qwen3-32B, SkillX brings around 10-point improvements on several benchmarks
- On Kimi-K2-Instruct-0905, SkillX yields clear gains especially on AppWorld
- On GLM-4.6, SkillX still improves performance and execution efficiency despite the model already being strong
- AppWorld
- BFCL-v3
- Ο2-Bench
SkillX outperforms strong experience-learning baselines such as:
- A-Mem
- AWM
- ExpeL
- No-memory
This shows that how experience is represented matters as much as, or more than, where it comes from.
Compared with prior experience formats:
- Raw trajectories are verbose and difficult to transfer
- Insights/reflections are often too abstract
- Workflows may miss low-level tool constraints
- Claude-style skills rely on long-context progressive disclosure and complex environment support
In contrast, SkillX offers:
- hierarchical, itemized, reusable skills
- one-time prompt injection
- lightweight retrieval
- strong transfer across agents and environments
SkillX is especially useful for:
- tool-using LLM agents
- long-horizon task execution
- interactive application environments
- cross-agent knowledge transfer
- building reusable agent skill libraries from experience
A realistic ecosystem of apps and APIs for long-horizon agent execution.
A challenging benchmark for multi-turn function calling and tool use.
A user-interactive benchmark focused on conversational tool-using agents.
We will publicly release:
- the SkillX codebase
- the automatically constructed skill knowledge base
- and supporting resources for skill extraction, refinement, and retrieval
If you find this work helpful, please consider citing:
@article{wang2026skillx,
author = {Chenxi Wang and
Zhuoyun Yu and
Xin Xie and
Wuguannan Yao and
Runnan Fang and
Shuofei Qiao and
Kexin Cao and
Guozhou Zheng and
Xiang Qi and
Peng Zhang and
Shumin Deng},
title = {SkillX: Automatically Constructing Skill Knowledge Bases for Agents},
year = {2026},
eprint = {2604.04804},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2604.04804}
}This repository builds upon code from ReMe and AgentEvolver. The baseline implementations are adapted from AMEM, AWM, and Expel. We sincerely thank all contributors for their outstanding work!
