Self-Flow Diffusion Model Training | High-Quality Image Generation

🌟 Self-Flow from Black Forest Labs: diffusion model training without external encoders

Black Forest Labs, together with MIT, has found a solution to one of the main challenges of diffusion and flow models: to generate high-quality images, they need powerful semantic representations. Typically, such features are supplied externally by aligning the model's internal representations with those of an encoder like DINOv2 (the REPA approach). This works, but it comes with caveats.

Counterintuitively, a stronger encoder can make the final results worse: in experiments, replacing DINOv2-B with the more advanced DINOv3-H+ degraded FID. The model also becomes tied to fixed external features and stops scaling effectively. For video and audio, encoders such as V-JEPA 2 and MERT produced results worse than plain flow matching.

🟡 Self-Flow introduces a Dual-Timestep Scheduling mechanism

In classic flow matching, all tokens are noised equally, so the model solves the task locally and never learns to build global connections. Self-Flow instead samples two different noise levels and randomly assigns them across tokens, so some parts of the input are noisier and others cleaner. This asymmetry forces the model to rely on the cleaner parts to restore heavily noised tokens, forming a global context.
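
To make the mechanism concrete, here is a minimal PyTorch sketch of dual-timestep noising under a linear flow-matching interpolation. The function name, the uniform timestep sampling, and the 50/50 token assignment are illustrative assumptions, not the paper's exact scheme.

```python
import torch

def dual_timestep_noising(x0: torch.Tensor, p: float = 0.5):
    """Noise each token at one of two sampled levels (illustrative sketch).

    x0: clean latent tokens, shape (batch, num_tokens, dim).
    Returns noised tokens, per-token timesteps, and the flow-matching
    velocity target (noise - x0) for a linear interpolation path.
    """
    b, n, _ = x0.shape
    # Two independent noise levels per sample (uniform sampling is an
    # assumption; the paper may use a different distribution).
    t1 = torch.rand(b, 1, 1, device=x0.device)
    t2 = torch.rand(b, 1, 1, device=x0.device)
    # Randomly assign each token to one of the two levels.
    mask = (torch.rand(b, n, 1, device=x0.device) < p).float()
    t = mask * t1 + (1.0 - mask) * t2            # (b, n, 1) per-token timestep
    noise = torch.randn_like(x0)
    # Linear interpolation between clean data and noise.
    xt = (1.0 - t) * x0 + t * noise
    target = noise - x0                          # constant-velocity target
    return xt, t.squeeze(-1), target
```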

This is complemented by self-supervised learning based on a distillation principle: two versions of the model are trained simultaneously, a student and a teacher, where the teacher is an exponential moving average (EMA) copy of the student. The student learns to predict the teacher's features from the noisy input, which builds strong semantic representations without any external encoder.
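
A hedged sketch of that setup is below: the EMA update is standard, while the combined loss assumes each model returns a (velocity, features) pair and that the teacher sees a cleaner view of the sample; the real Self-Flow interfaces and loss weighting may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               decay: float = 0.999) -> None:
    """Fold student weights into the teacher as an exponential moving average."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)

def self_flow_loss(student, teacher, xt, x0, t, target, lam: float = 0.5):
    """Flow-matching loss plus a feature-distillation term (hypothetical).

    xt, t, target come from the dual-timestep noising step above;
    lam is an assumed weighting between the two terms.
    """
    v_pred, feat_s = student(xt, t)              # assumed (velocity, features) API
    with torch.no_grad():
        # Feeding the teacher the clean sample at t=0 is an assumption.
        _, feat_t = teacher(x0, torch.zeros_like(t))
    flow = F.mse_loss(v_pred, target)
    distill = F.mse_loss(feat_s, feat_t)
    return flow + lam * distill
```

After each optimizer step, `ema_update(teacher, student)` refreshes the teacher, so gradients never flow through it.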

🟡 Testing results

🟢 On ImageNet at 256×256 resolution, Self-Flow achieves an FID of about 5.70 versus roughly 5.89 for REPA; this is the first time a self-supervised method has outperformed external-alignment methods on this benchmark.

🟢 For text-to-image generation, FID improves to 3.61 versus 3.92 for REPA.

🟢 For video, Self-Flow reaches an FVD of about 47.81 compared to 49.75 for REPA.

🟢 In the audio domain, Self-Flow achieves the best FAD scores among all variants.

Interestingly, when the model is scaled from 290 million to one billion parameters, the gap with REPA widens: a 625-million-parameter Self-Flow model surpasses a billion-parameter REPA model.

The approach is also versatile across data types, performing equally well on images, video, and audio, which suggests potential applications in multimodal training.

The project repository includes inference code based on SiT-XL/2 with per-token timestep conditioning, checkpoints for ImageNet 256×256, and scripts for generating samples for FID evaluation via the ADM evaluation suite. Both SDE and ODE sampling modes are supported, along with multi-GPU execution via torchrun.
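
Since each token carries its own timestep, the network's conditioning must also be per token. Below is an illustrative per-token adaptive LayerNorm in the spirit of DiT/SiT blocks; it shows one way such conditioning could be wired and is not the repository's actual code.

```python
import torch
import torch.nn as nn

class PerTokenAdaLN(nn.Module):
    """Adaptive LayerNorm modulated by a per-token timestep embedding.

    Standard DiT/SiT blocks modulate all tokens with one timestep embedding
    per sample; with dual-timestep scheduling each token has its own
    timestep, so scale and shift are computed token-wise.
    """
    def __init__(self, dim: int, t_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(t_dim, 2 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); t_emb: (batch, tokens, t_dim)
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift
```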
