Environment Explainability

Tasks, actions, and scoring in one view

This page explains what the agent sees, what it can do, and how the grader evaluates each episode.

Task structure

InvoiceGuard contains 12 canonical tasks and 10 hard tasks (22 total). Tasks cover clean matches, duplicates, price mismatches, policy violations, and false-positive traps.

| Slice | Count | Purpose |
| --- | --- | --- |
| Canonical | 12 | Core AP exception patterns and expected policy decisions |
| Hard | 10 | Ambiguous edge cases, traps, and deeper multi-document reasoning |
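The slice split above can be captured in a small registry. This is an illustrative sketch only: the slice names, counts, and purposes come from the table, but the `TaskSlice` dataclass and the `SLICES`/`TOTAL_TASKS` names are assumptions, not the environment's actual API.

```python
# Hypothetical task-slice registry; counts and purposes mirror the
# table above, the data layout itself is an illustrative assumption.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSlice:
    name: str
    count: int
    purpose: str

SLICES = [
    TaskSlice("canonical", 12,
              "Core AP exception patterns and expected policy decisions"),
    TaskSlice("hard", 10,
              "Ambiguous edge cases, traps, and deeper multi-document reasoning"),
]

# Matches the stated total of 22 tasks.
TOTAL_TASKS = sum(s.count for s in SLICES)
```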

Action space

| Category | Actions |
| --- | --- |
| Investigation | inspect_invoice_line_items, inspect_purchase_order, inspect_goods_receipt_note, inspect_vendor_profile, inspect_policy_rules, check_for_duplicate_invoice, compare_quantity, compare_price, compare_totals, summarize_findings |
| Proposal | propose_exception_type |
| Terminal | submit_final_resolution |
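The action catalogue splits naturally into investigation, proposal, and terminal groups. The sketch below encodes that grouping and a simple well-formedness check: the action names are taken from the table, but the set names and the `is_valid_episode` helper are hypothetical, not part of the environment's real interface.

```python
# Hypothetical action catalogue; action names come from the table above,
# the grouping/validation logic is an illustrative assumption.
INVESTIGATION_ACTIONS = {
    "inspect_invoice_line_items", "inspect_purchase_order",
    "inspect_goods_receipt_note", "inspect_vendor_profile",
    "inspect_policy_rules", "check_for_duplicate_invoice",
    "compare_quantity", "compare_price", "compare_totals",
    "summarize_findings",
}
PROPOSAL_ACTIONS = {"propose_exception_type"}
TERMINAL_ACTIONS = {"submit_final_resolution"}
ALL_ACTIONS = INVESTIGATION_ACTIONS | PROPOSAL_ACTIONS | TERMINAL_ACTIONS

def is_valid_episode(actions: list[str]) -> bool:
    """An episode is well-formed if every action is known and the
    terminal action appears exactly once, at the end."""
    if not actions or actions[-1] not in TERMINAL_ACTIONS:
        return False
    if any(a in TERMINAL_ACTIONS for a in actions[:-1]):
        return False
    return all(a in ALL_ACTIONS for a in actions)
```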

Deterministic grader components

| Component | Weight | Interpretation |
| --- | --- | --- |
| Decision correctness | 0.35 | Final decision aligns with task ground truth |
| Exception type | 0.20 | Correct classification of issue type |
| Evidence sufficiency | 0.15 | Appropriate documents/actions were used |
| Investigation quality | 0.10 | Depth and quality of exploration |
| Explanation quality | 0.10 | Clear, policy-aware reasoning in resolution |
| Efficiency | 0.10 | Avoids wasted steps and timeouts |
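Since the weights above sum to 1.0, the overall episode score can be read as a weighted average of per-component scores. The sketch below shows one way to combine them: the weights are from the table, but the component key names, the clamping to [0, 1], and the `episode_score` function are illustrative assumptions about how a deterministic grader might work.

```python
# Sketch of a deterministic grader combining component scores.
# Weights are from the table above; each component score is assumed
# to lie in [0, 1], and missing components default to 0.
GRADER_WEIGHTS = {
    "decision_correctness": 0.35,
    "exception_type": 0.20,
    "evidence_sufficiency": 0.15,
    "investigation_quality": 0.10,
    "explanation_quality": 0.10,
    "efficiency": 0.10,
}

def episode_score(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each clamped to [0, 1]."""
    return sum(
        w * min(max(components.get(name, 0.0), 0.0), 1.0)
        for name, w in GRADER_WEIGHTS.items()
    )
```

A perfect episode (all components at 1.0) scores 1.0; an episode that nails the decision and exception type but nothing else scores 0.55.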

Baseline vs trained snapshot

Figure: comparison chart, baseline vs. trained.

The key transition is from endless investigation loops (baseline) to timely, evidence-backed submissions (trained SFT checkpoints).

Round 2 progression curves

These curves show how warm-starting GRPO from SFT improves holdout quality, with the best checkpoint at iteration 2.

Figure: success and steps progression — success rate and steps across baseline, SFT, and GRPO.

Figure: SFT training loss (log scale) — SFT training convergence curves.

Artifact links