OpenEnv Round 2 Demo

The case starts with uncertainty and ends with a decision

InvoiceGuard is a story-driven AP investigation demo. Follow a case from raw documents, through policy checks, to final resolution, and see how training changes the agent's behavior.

Open GitHub Repo | Open Environment Space | Open Code Repo | Best SFT Adapter (v5c) | Best Checkpoint (v5d)

Local baseline score

Best SFT score

Peak success rate

Improvement

Case narrative demo

Switch between cases. Each case shows the evidence packet, policy signal, and how the baseline agent differs from the trained policy.

Policy focus:

Baseline action sequence (untrained)

Tends to investigate repeatedly and time out at 12 steps.

Trained action sequence (SFT)

Investigates quickly, then reaches `submit_final_resolution` in 3-5 steps.

Why this demo matters

The baseline often investigates without closure. The trained policy learns to collect sufficient evidence and submit a grounded decision in fewer steps.

Read invoice, PO, GRN, and policy context

Investigate only what changes the decision

Call `submit_final_resolution` with supporting evidence

Receive a deterministic grader score across 6 criteria
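The contrast between the two action sequences can be sketched as a step-budgeted episode loop. This is a minimal illustration, not the environment's actual API: only `submit_final_resolution` and the 12-step limit come from the demo text, while `run_episode`, both policy functions, and the other action names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_STEPS = 12  # step budget stated in the demo
SUBMIT = "submit_final_resolution"  # terminal action named in the demo


@dataclass
class EpisodeResult:
    actions: List[str] = field(default_factory=list)
    submitted: bool = False
    timed_out: bool = False


def run_episode(policy: Callable[[List[str]], str],
                max_steps: int = MAX_STEPS) -> EpisodeResult:
    """Run one case until the policy submits or the step budget runs out."""
    result = EpisodeResult()
    for _ in range(max_steps):
        action = policy(result.actions)
        result.actions.append(action)
        if action == SUBMIT:
            result.submitted = True
            return result
    result.timed_out = True  # budget exhausted without a final decision
    return result


def baseline_policy(history: List[str]) -> str:
    """Hypothetical untrained behavior: keeps investigating, never submits."""
    return "investigate"


def trained_policy(history: List[str]) -> str:
    """Hypothetical trained behavior: short fixed plan that ends in a submit."""
    plan = ["read_documents", "check_policy", "investigate", SUBMIT]
    return plan[min(len(history), len(plan) - 1)]


if __name__ == "__main__":
    for name, policy in [("baseline", baseline_policy), ("trained", trained_policy)]:
        r = run_episode(policy)
        print(name, len(r.actions), "steps, submitted:", r.submitted)
```

Under these assumptions the baseline exhausts all 12 steps without submitting, while the trained plan submits on step 4, inside the 3-5 step range quoted above.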

Open tasks and scoring breakdown | Read full training story

Agent walkthrough simulator (easy, medium, hard)

Step through one representative case per difficulty and see exactly how the trained agent reads documents, uses tools, and lands on the final decision.


Goal:

Current step

Reads:

Tool:

Action trace

Training progression dashboard

This dashboard traces the full journey from the local baseline, through submit-focused SFT, to warm-started GRPO. The best GRPO checkpoint appears at iteration 2.

Score progression: baseline → SFT → GRPO.

Stage snapshot for score, success rate, and steps.

GRPO signal trends across task updates.

GRPO policy and KL loss components (log scale).
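For readers unfamiliar with GRPO, the policy term is usually driven by group-relative advantages: each sampled rollout's reward is normalized against the other rollouts for the same task. The sketch below shows that normalization in its commonly described form. The demo does not say which GRPO variant or reward scale it uses, so the function name, the `eps` constant, and the example rewards are assumptions for illustration only.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group's mean and spread.

    Assumption: population standard deviation with a small eps for stability;
    actual implementations may differ in these details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical grader scores for four rollouts of the same case.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Rollouts that score above their group's mean get a positive advantage and are reinforced, and those below get a negative one. When every rollout in a group scores the same, all advantages are zero and that group contributes no policy signal.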