OpenEnv Round 2 Demo

The case starts with uncertainty and ends with a decision

InvoiceGuard is a story-driven AP investigation demo. Follow a case from raw documents, through policy checks, to final resolution, and see how training changes the agent's behavior.

Local baseline score: -
Best SFT score: -
Peak success rate: -
Improvement: -

Case narrative demo

Switch between cases. Each case shows the evidence packet, policy signal, and how the baseline agent differs from the trained policy.

Policy focus:

Baseline action sequence (untrained)

The untrained baseline tends to investigate repeatedly and time out at the 12-step limit.

Trained action sequence (SFT)

Investigates quickly, then reaches `submit_final_resolution` in 3-5 steps.
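
The contrast between the two sequences can be sketched as a simple episode loop. This is a minimal illustration, not the demo's actual environment code: the 12-step budget and the `submit_final_resolution` action come from the text above, while `run_episode`, the `investigate` action name, and both stand-in policies are hypothetical.

```python
from typing import Callable, List

MAX_STEPS = 12  # step budget stated in the demo text
FINAL_ACTION = "submit_final_resolution"  # action name stated in the demo text


def run_episode(policy: Callable[[List[str]], str], max_steps: int = MAX_STEPS) -> dict:
    """Run one case until the policy submits or the step budget runs out."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(history)
        history.append(action)
        if action == FINAL_ACTION:
            return {"steps": len(history), "submitted": True, "actions": history}
    # Budget exhausted without a final resolution: this is the timeout case.
    return {"steps": len(history), "submitted": False, "actions": history}


# Hypothetical stand-in policies, for illustration only.
def baseline_policy(history: List[str]) -> str:
    # Keeps investigating and never closes the case.
    return "investigate"


def trained_policy(history: List[str]) -> str:
    # Investigates three times, then submits on the fourth step.
    return "investigate" if len(history) < 3 else FINAL_ACTION


baseline_result = run_episode(baseline_policy)
trained_result = run_episode(trained_policy)
print(baseline_result["steps"], baseline_result["submitted"])
print(trained_result["steps"], trained_result["submitted"])
```

Under these assumptions the baseline stand-in uses the whole budget without submitting, while the trained stand-in closes the case inside the 3-5 step range.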

Why this demo matters

The baseline often investigates without reaching closure. The trained policy learns to collect sufficient evidence and submit a grounded decision in fewer steps.

1. Read invoice, PO, GRN, and policy context
2. Investigate only what changes the decision
3. Call `submit_final_resolution` with evidence
4. Receive a deterministic grader score across 6 criteria
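
The final grading step can be pictured with a small sketch. The demo states only that the grader is deterministic and covers 6 criteria; the criterion names, the equal weighting, and the `grade` function below are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical criterion names; the demo says only that there are six.
CRITERIA = (
    "decision_correct",
    "evidence_cited",
    "policy_referenced",
    "amounts_reconciled",
    "steps_within_budget",
    "resolution_submitted",
)


def grade(checks: Dict[str, bool]) -> float:
    """Return the fraction of criteria met; missing criteria count as failed.

    Equal weighting is an assumption. The function has no randomness,
    so the same submission always receives the same score.
    """
    passed = sum(1 for name in CRITERIA if checks.get(name, False))
    return passed / len(CRITERIA)


# Hypothetical submission that meets five of the six criteria.
submission = {
    "decision_correct": True,
    "evidence_cited": True,
    "policy_referenced": True,
    "amounts_reconciled": False,
    "steps_within_budget": True,
    "resolution_submitted": True,
}
score = grade(submission)
print(round(score, 3))
```

A rule-based grader of this shape gives a repeatable score for a given submission, which makes scores comparable across training stages.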

Agent walkthrough simulator (easy, medium, hard)

Step through one representative case per difficulty and see exactly how the trained agent reads documents, uses tools, and lands on the final decision.

Simulator panels: Goal, Current step, Reads, Tool, Action trace.

Training progression dashboard

This dashboard traces the full journey from the local baseline, through submit-focused SFT, to warm-started GRPO. The best GRPO checkpoint appears at iteration 2.

Chart: Score progression: baseline → SFT → GRPO.
Chart: Stage snapshot for score, success rate, and steps.
Chart: GRPO signal trends across task updates.
Chart: GRPO policy and KL loss components (log scale).
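
For readers unfamiliar with the policy and KL components tracked in the GRPO stage, the general idea can be sketched as follows. This follows the usual GRPO formulation (rewards normalized within a group of sampled completions, plus a KL penalty toward a reference policy) and is not taken from this demo's training code; the function names, the `beta` value, the sample rewards, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group.

    Uses the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def total_loss(policy_loss: float, kl: float, beta: float = 0.04) -> float:
    """Combine the policy term with a KL penalty; beta here is a hypothetical value."""
    return policy_loss + beta * kl


# Hypothetical grader scores for four sampled completions of one case.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Completions that score above their group's mean get a positive advantage and are reinforced, while the KL term discourages the policy from drifting far from its starting point, which in this demo is the SFT warm start.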