🌟 Nemotron-Terminal: a compact suite of models for terminal tasks.
NVIDIA has developed the Nemotron-Terminal family of models designed for autonomous operation within Linux terminals. These models can automatically install dependencies, write and execute code, configure environments, and perform complex engineering tasks—all without human intervention.
The foundation of this project is based on Qwen3 and a specially prepared dataset called Terminal-Corpus. The key focus here is not architecture but data quality.
🟡 NVIDIA has created a bidirectional process called Terminal-Task-Gen, combining two streams.
The first stream adapts existing datasets in mathematics, programming, and software engineering to the terminal format; this adaptation pipeline does not use large language models (LLMs) at all.
The second stream generates synthetic data in two ways. The first is seed-based: an LLM creates new tasks modeled on existing ones from related fields. The second is skill-based: the model combines up to five fundamental skills drawn from a taxonomy covering nine areas, including security, data science, and system administration.
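The skill-based stream can be pictured as random sampling over a skill taxonomy. Below is a minimal sketch of that idea; the area names, skill strings, and prompt wording are all illustrative placeholders, not the actual Terminal-Task-Gen taxonomy or prompts.

```python
import random

# Hypothetical taxonomy: nine areas with a few example skills each.
# These names are assumptions for illustration only.
TAXONOMY = {
    "security": ["audit file permissions", "scan open ports"],
    "data_science": ["clean a CSV with awk", "summarize a dataset"],
    "sysadmin": ["configure a cron job", "rotate logs"],
    "networking": ["debug DNS resolution", "test connectivity"],
    "devops": ["write a Dockerfile", "set up a CI step"],
    "scripting": ["parse logs with grep", "batch-rename files"],
    "databases": ["back up a SQLite DB", "write a SQL query"],
    "version_control": ["resolve a merge conflict", "bisect a bug"],
    "packaging": ["build a wheel", "pin dependencies"],
}

def sample_skill_combo(rng: random.Random, max_skills: int = 5) -> list[str]:
    """Pick 1..max_skills skills, each from a distinct area."""
    k = rng.randint(1, max_skills)
    areas = rng.sample(list(TAXONOMY), k)  # distinct areas, no repeats
    return [rng.choice(TAXONOMY[a]) for a in areas]

def make_task_prompt(skills: list[str]) -> str:
    """Format a combo into a prompt an LLM would expand into a full task."""
    return "Write a terminal task that requires: " + "; ".join(skills)
```

Sampling across distinct areas is one simple way to guarantee that a generated task mixes skills rather than repeating one domain.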
🟡 The public release includes three models with 8B, 14B, and 32B parameters, as well as two datasets:
Terminal-Corpus contains approximately 366,000 task execution trajectories. They are divided into two streams: about 226,000 adapted examples from Math/Code/SWE and roughly 140,000 synthetic tasks based on the skill taxonomy.
Synthetic-Tasks includes standardized tasks with instructions, Docker environments built from nine preconfigured images, and a test set using pytest.
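A task packaged this way can be verified mechanically: run its pytest suite inside the task's Docker image and treat exit code 0 as success. The sketch below assumes a hypothetical layout where the task directory is mounted at /task; the mount path and image names are assumptions, not the dataset's actual conventions.

```python
import subprocess

def docker_pytest_cmd(image: str, task_dir: str) -> list[str]:
    """Build the command that runs a task's pytest suite inside its
    Docker image (hypothetical layout: task mounted at /task)."""
    return [
        "docker", "run", "--rm",
        "-v", f"{task_dir}:/task",
        image,
        "pytest", "-q", "/task/tests",
    ]

def task_passed(image: str, task_dir: str) -> bool:
    """Return True if every test passed (pytest exits with code 0)."""
    result = subprocess.run(docker_pytest_cmd(image, task_dir))
    return result.returncode == 0
```

Using the container's exit code as the pass/fail signal keeps the checker independent of what the agent did inside the environment.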
🟡 Test results on benchmarks.
On Terminal-Bench 2.0, all three models significantly outperform the base Qwen3: the 8B parameter model shows an increase from 2.5% to 13%, the 14B version from 4% to 20.2%, and the largest from 3.4% to 27.4%.
For comparison: Qwen3-Coder with 480B parameters achieves around 24%, with GPT-5-Mini and Grok 4 landing in the same range. Nemotron-Terminal at 32 billion parameters thus matches or surpasses these models despite its much smaller size.
🟡 Several unexpected findings from ablation experiments.
Removing unsuccessful trajectories hurts performance: models trained only on successful trajectories score around 5%, compared with about 12.4% for models trained on all trajectories, failures included.
Training with a curriculum system (from simple to complex) did not outperform the standard mixed approach.
Additionally, increasing the context window from 32K to 65K tokens yielded no significant gains: the longer trajectories were saturated with noise.
📌 The models are released under the NVIDIA Open Model License.
📌 The datasets are available under the CC-BY-4.0 license.
Source: arXiv
The models and datasets are available for study and use in research work.
