Paper Project Page

Drifting Preference Optimization for One-Step Generative Models

Online preference finetuning for deterministic one-step text-to-image generators using reward-ranked samples, dipole preference fields, and reference drift.

Zhou Jiang1   Yandong Wen1   Zhen Liu2,*

1Westlake University   2The Chinese University of Hong Kong, Shenzhen

Matched SDXL-Turbo to DrPO examples for text-to-image prompts.
Matched SDXL-Turbo prompts before and after DrPO preference optimization, while preserving one-step generation.
1-step inference is preserved after finetuning
No reward gradients the target reward is only used for ranking
3.51x HPSv3 per-update speedup over reward-gradient updates
Black-box ready works with large or non-differentiable rewards

Abstract

Preference optimization without likelihoods, denoising trajectories, or reward backpropagation.

One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult. DrPO is an online preference-finetuning method for deterministic one-step generators. For each prompt, DrPO samples candidates from the current generator, ranks them with a target reward, and uses high- and low-scoring samples to synthesize a feature-space update direction. The update combines a non-parametric dipole preference field with a reference drift estimated from the frozen base generator, and is optimized through a detached feature-space regression target. The target reward is used only for ranking, so DrPO can optimize with large, black-box, or non-differentiable rewards while inference remains a single generator call.

Method

Dipole preference drift plus reference correction.

DrPO method overview with a feature-space drift field and fixed-prompt samples across finetuning steps.
Left: DrPO constructs a detached feature-space target for each on-policy sample. High-reward samples attract, low-reward samples repel, and frozen-reference samples provide a stabilizing drift toward the base generator; the combined field yields the target update. Right: fixed prompts sampled along online finetuning show the generator moving toward more preference-aligned outputs.
01

Generate on-policy candidates

For each prompt, DrPO samples a candidate set from the current one-step generator under different latent seeds.

02

Rank with the target reward

The reward model selects positive and negative supports. It does not need to be differentiable through the generated image.

03

Fit toward a detached target

Preference and reference fields define a feature-space target, and LoRA parameters are optimized with a regression loss.

Results

DrPO gives the largest reward-gradient-free gains among one-step preference baselines.

Results on SDXL-Turbo. PS denotes PickScore x 100 and IR denotes ImageReward x 10.
Method Steps Reward grad. Pick-a-Pic v2 Test Parti-Prompts
PS AES IR PS AES IR
SDXL base 50 - 22.15 6.104 6.85 22.64 5.761 7.24
SDXL-DPO 50 - 22.57 6.076 9.38 22.95 5.811 10.66
SDXL-Turbo base 1 - 22.45 6.059 9.36 22.77 5.693 9.13
DRaFT 1 Yes 24.45 6.712 12.70 24.34 6.485 12.66
VGGFlow 1 Yes 24.27 6.490 12.19 23.98 6.200 11.40
DPO1step 1 No 22.77 6.227 10.44 22.85 6.019 11.25
PSO 1 No 22.56 6.092 8.97 22.87 5.744 9.17
GRPO1step 1 No 22.50 6.077 9.57 22.80 5.710 9.27
DrPO 1 No 23.66 6.717 12.46 23.71 6.665 12.60

SDXL-Turbo main result

DrPO improves SDXL-Turbo from 22.45 to 23.66 PickScore, 6.059 to 6.717 AES, and 9.36 to 12.46 ImageReward on Pick-a-Pic v2.

SD-Turbo generalization

On SD-Turbo, DrPO improves PickScore from 21.88 to 23.49 and ImageReward from 5.75 to 9.54 on Pick-a-Pic v2.

Large reward models

With HPSv3, DrPO removes reward-model backpropagation and reduces matched-batch update time from 21.62s to 6.17s.

Qwen3-VL pairwise preference chart on Pick-a-Pic v2.
Qwen3-VL pairwise preference evaluation on Pick-a-Pic v2.
Qwen3-VL pairwise preference chart on Parti-Prompts.
Qwen3-VL pairwise preference evaluation on Parti-Prompts.
Reward convergence curves for different optimization methods.
Reward convergence curves under matched one-step finetuning.
HPSv3 wall-clock evaluation curve comparing DrPO and DRaFT.
HPSv3 optimization progresses faster without reward backpropagation.
GenEval-style online DrPO examples for cups, clocks, sheep, and apples.
Each pair shows SD-Turbo on the left and the DrPO-optimized result on the right; GenEval-style checks provide ranking signals.

Qualitative

Matched-prompt comparisons and reward-model ablations.

A tourist is looking at a whale using binoculars

A dolphin in an astronaut suit on Saturn

A wine bottle with a lit candle in its spout

Subway train with lobsters reading newspaper

Burger with wheels speeding on the race track, supercharged, detailed, hyperrealistic, 4K

A toy poodle

Mystical forest with glowing mushrooms and a babbling brook

A spaceship pointing up on a plain background

Qualitative comparison across different target reward models.
Swapping the target reward changes the optimized behavior while the same drifting update is reused.

Extended quantitative results.

SD-Turbo transfer

Results on SD-Turbo. PS denotes PickScore x 100 and IR denotes ImageReward x 10.
Method Steps Reward grad. Pick-a-Pic v2 Test Parti-Prompts
PS AES IR PS AES IR
SD1.5 50 - 20.79 5.455 1.22 21.49 5.358 2.25
LCM-SD1.5 4 - 20.50 5.497 -3.08 21.15 5.396 -1.94
SD2.1 50 - 21.09 5.645 2.49 21.77 5.547 3.97
SD-Turbo base 1 - 21.88 6.054 5.75 22.29 5.758 5.37
DRaFT 1 Yes 24.69 6.820 9.63 23.07 6.516 7.72
VGGFlow 1 Yes 23.73 6.378 7.74 22.99 6.027 6.50
DPO1step 1 No 22.02 6.080 5.95 22.39 5.793 5.21
PSO 1 No 21.88 6.059 5.80 22.29 5.763 5.42
GRPO1step 1 No 21.98 6.077 6.08 22.35 5.779 5.65
DrPO 1 No 23.49 6.485 9.54 22.99 6.284 7.46

Efficiency and robustness

Efficiency comparison on HPSv3.
Metric DRaFT DrPO
Reward grad. Yes No
Effective batch 192 192
Update time (s) 21.62 6.17
Speedup 1.00x 3.51x
Backward (s) 9.99 0.34
GenEval task performance.
Metric SD-Turbo +DrPO
Single 98.8 100.0
Two 46.5 55.6
Count 33.8 42.5
Colors 83.8 87.2
Position 8.0 13.0
Color Attr. 9.0 13.0
Reference-objective comparison on Pick-a-Pic v2.
Objective PickScore CLIP AES HPSv2
No reference 23.55 25.88 6.603 34.85
Perceptual loss 23.42 25.98 6.455 34.90
Ref. drift loss 23.49 26.22 6.485 35.07

Ablations

Batch generation.
Candidates PS AES
SD-Turbo 21.88 6.054
K=16 23.24 6.409
K=24 23.53 6.552
K=32 23.57 6.599
Feature map.
Feature PS AES
SD-Turbo 21.88 6.054
latent-MAE1 23.55 6.513
latent-MAE2 23.50 6.526
latent-MAE3 23.48 6.506
Latent 20.52 4.543
VAE-dec. + DINOv2 22.28 6.252
Kernel.
Kernel PS AES
SD-Turbo 21.88 6.054
Cosine 23.63 6.509
RBF 23.50 6.590
Exponential 23.53 6.594
Laplacian 23.51 6.515
Velocity scale eta.
Weight PS AES
SD-Turbo 21.88 6.054
eta=1000 23.51 6.510
eta=3000 23.53 6.542
eta=5000 23.51 6.485
eta=10000 23.46 6.444