Why different

Peeld is built around the standard you need to prove, not a dashboard you need to interpret.

Compare
Typical eval stack
Peeld
Question
Is the model observable?
Can the team defend this behavior?
Unit of work
Trace, span, dataset, dashboard
Published module, run, failure, report
Customization
Usually prompt sets or library code
Domain, compliance, code, workflow, and private standards
Output
Scores and debugging signals
Evidence tied to prompts, targets, versions, and review status

The difference is the artifact.

Peeld treats every assessment as something a product, risk, compliance, or platform team may need to explain later. The report is not a screenshot of a score. It is the traceable record of what was tested and why it matters.