SEAL for LLM Training | Self-Training & Reinforcement Learning

🌟 SEAL (Self-Adapting Language Models) is a new attempt to build an automated mechanism for training large language models (LLMs).

This approach, developed at the Massachusetts Institute of Technology, lets the model generate its own training materials and adjust its own training settings to better solve new tasks. Instead of traditional fine-tuning on external data, the model learns to analyze context, produce synthetic data ("self-edits") from it, and update its own weights by fine-tuning on that data, with reinforcement learning used to teach the model to generate more useful edits.

The SEAL process can be pictured as a combination of two nested loops:

🟢 An outer loop based on reinforcement learning (RL), in which the model produces natural-language instructions specifying what data to generate and how to modify training settings.

🟢 An inner loop, in which the model applies those instructions: it fine-tunes on the generated data and measures how much the changes help on the target task. The ReST^EM algorithm then assesses the results and adjusts the generation strategy, reinforcing only the edits that improved performance. To keep costs down, lightweight adapters such as LoRA are used; they update only a small fraction of the parameters rather than the entire model.

This cycle repeats multiple times, gradually training the model to convert raw data into useful training signals.
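The two loops above can be sketched as a toy simulation. Everything here is illustrative: the function names, the scalar "edits", and the reward are stand-ins rather than SEAL's actual implementation; the faithful part is the ReST^EM-style rule of keeping only the self-edits whose fine-tuned model beats the baseline.

```python
def restem_round(baseline_score, propose_edit, finetune, evaluate, n_samples=4):
    """One outer-loop round (ReST^EM-style): sample candidate self-edits,
    run the inner loop (fine-tune + evaluate) for each, and keep only the
    edits that improved task performance. The kept edits would then serve
    as supervised targets when updating the edit-generation policy."""
    kept = []
    for _ in range(n_samples):
        edit = propose_edit()          # model writes synthetic data / settings
        adapted = finetune(edit)       # inner loop: e.g. a cheap LoRA update
        score = evaluate(adapted)      # downstream accuracy acts as the reward
        if score > baseline_score:     # binary reward: keep only improvements
            kept.append((edit, score))
    return kept


# Toy stand-ins: an "edit" is just a learning rate, and evaluation rewards
# values near an optimum of 0.5 that the policy does not know in advance.
candidate_edits = iter([0.9, 0.4, 0.55, 0.1])
kept = restem_round(
    baseline_score=0.6,
    propose_edit=lambda: next(candidate_edits),
    finetune=lambda lr: lr,                  # no real training in this toy
    evaluate=lambda lr: 1.0 - abs(lr - 0.5),
)
print([edit for edit, _ in kept])  # only the edits that beat the baseline survive
```

In the real system the inner loop is a genuine fine-tuning run, which is exactly why LoRA adapters matter: they make each reward evaluation cheap enough to repeat many times per round.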

In practice, SEAL has been tested on two types of tasks: incorporating new knowledge and few-shot learning. In the first case, the model derives logical implications from a text passage, fine-tunes on them, and answers questions about the passage more accurately even without the original source in context. In the second, the system selects data-augmentation methods and training hyperparameters for examples from the ARC-AGI benchmark.
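For the knowledge-incorporation setting, the data-building step can be sketched as follows. This is a minimal illustration assuming a self-edit that consists of a list of implications; the helper name and prompt wording are hypothetical, not taken from the SEAL codebase.

```python
def implications_to_examples(passage, implications):
    """Turn model-generated implications of a passage into standalone
    fine-tuning documents, so the facts can later be recalled without
    the passage present in context (knowledge incorporation)."""
    # Hypothetical prompt; SEAL's actual prompt wording differs.
    prompt = (
        "List the implications that follow from this passage.\n\n"
        f"Passage:\n{passage}\n\nImplications:"
    )
    # Each non-empty implication becomes its own training example.
    examples = [imp.strip() for imp in implications if imp.strip()]
    return prompt, examples


passage = "The Curiosity rover landed on Mars in August 2012."
prompt, examples = implications_to_examples(
    passage,
    ["Curiosity is a Mars rover.", "  The landing happened in 2012.  ", ""],
)
print(len(examples))  # empty strings are dropped, whitespace trimmed
```

Fine-tuning on the restated facts, rather than on the raw passage, is what lets the model later answer questions with the source removed from context.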

In both scenarios, SEAL outperformed fixed-pattern baselines (such as in-context learning, and test-time training on self-edits without RL) and even fine-tuning on synthetic data generated by GPT-4.1.

Nevertheless, the approach remains experimental and primarily academic, with certain limitations:

🟠 Repeated weight updates risk erasing previously learned information, a phenomenon known as "catastrophic forgetting."

🟠 Each iteration requires significant computational resources: the model must be fine-tuned and then evaluated all over again.

The project repository includes source code, datasets, and instructions for two main purposes:

🟢 Integrating new factual knowledge into the model;

🟢 Training on new tasks using example-based methods.

The project is licensed under the MIT License.

For those interested in a deeper dive, links to the project page, the arXiv preprint, and the GitHub repository are available.
