Hi authors,
Thanks for your impressive work on EditScore! I am currently reproducing the RL training using omnigen2_edit_rl_single_machine_editscore7b. However, I've encountered some stability issues where the reward_mean fluctuates significantly without a clear upward trend.
After analyzing the training logs and specific tasks, I have two major observations and would love to hear your insights:
- Reward Sparsity in motion_change Tasks
I observed that in many motion_change samples, the SC_score is frequently 0. Even when the model attempts an edit, EditScore often assigns a zero Semantic Conformity score, leading to sparse rewards.
Do you think this is caused by EditScore being too strict on complex motion, or is it a limitation of the base model's initial exploration? How did you handle these zero-reward samples during your training to avoid gradient instability?
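For context on what I mean by instability from zero-reward samples: under a group-relative scheme, a prompt whose sampled edits all score 0 has zero reward variance, so dividing by the (tiny) group std blows up the advantages. A minimal sketch of one workaround I've been considering — dropping such degenerate groups before computing advantages. The function name and the `(num_prompts, group_size)` reward layout are my own assumptions, not the repo's actual code:

```python
import numpy as np

def filter_degenerate_groups(rewards, eps=1e-6):
    """Drop prompt groups whose rewards are (near-)constant, e.g. all zero.

    rewards: array of shape (num_prompts, group_size) -- one row per prompt,
    one column per sampled edit. Hypothetical layout, assumed for illustration.
    Near-zero-variance groups carry no learning signal under group-relative
    advantages, and normalizing them by a tiny std destabilizes the gradient.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    keep = rewards.std(axis=1) > eps  # boolean mask over prompt groups
    return rewards[keep], keep

rewards = np.array([
    [0.0, 0.0, 0.0, 0.0],  # all-zero motion_change group -> dropped
    [0.2, 0.8, 0.5, 0.1],  # informative group -> kept
])
kept, mask = filter_degenerate_groups(rewards)
print(mask)  # [False  True]
```

Is something along these lines what you did, or did you keep the zero-reward samples and handle them differently?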
- Reward Inconsistency in background_change Tasks
I noticed cases where two images have very similar background fidelity/similarity, yet their rewards differ significantly.
This high variance in rewards for similar visual outputs seems to introduce a lot of noise into the policy gradient.
Is this inconsistency a known behavior of the 7B reward model? Or are there other normalization techniques you found effective?
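To make the normalization question concrete, here is a sketch of the kind of group-relative normalization I have in mind (GRPO-style: center each prompt's rewards on the group mean, scale by the group std). This is my own illustration, not the repo's implementation; the function name and reward layout are assumed:

```python
import numpy as np

def normalize_rewards(rewards, eps=1e-8):
    """Group-relative reward normalization.

    rewards: assumed shape (num_prompts, group_size). Centering on the
    per-prompt mean removes per-prompt offsets (e.g. a strict vs. lenient
    scoring of visually similar backgrounds), and scaling by the per-prompt
    std keeps advantages on a comparable scale across tasks.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

r = np.array([[0.9, 0.3, 0.6, 0.6]])  # one prompt, four sampled edits
adv = normalize_rewards(r)            # zero-mean, unit-scale advantages
```

This only removes relative noise within a group, though; if EditScore's variance on near-identical outputs is large compared to the true quality gap, the normalized advantages are still noisy. Did you find per-group normalization sufficient, or did you also use something like reward clipping or multi-sample reward averaging?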
Environment & Hyperparameters:
Base Model: OmniGen2
Reward Model: EditScore-7B
Tasks: rl_abs_9tasks.jsonl
Training setup: Single machine, default parameters from the repo.
I've attached my training curve and some example cases for reference. Looking forward to your guidance!
Best regards,
Spike