>_ The Manifest

There is no file in your repo that proves your agent got better after the last commit. That is the uncomfortable truth: without a scored test run sitting next to your code, every instruction tweak is a trust exercise. The Agents Evaluations CLI changes that by sending prompts to your declarative agent, scoring responses with Azure AI Evaluation metrics, and producing reports you can attach to a pull request.

Why Agent Quality Is Harder Than It Looks

A declarative agent is deceptively easy to change. You tweak an instruction, add an API plugin, update a grounding source, tighten a description_for_model. Each change feels small. Each change can shift behavior.

Maybe the agent stops calling the right tool. Maybe it answers confidently but forgets to cite sources. Maybe your onboarding flow still works, but the support escalation path regressed. You would not know unless you tested both.

Manual testing catches some of this. I still chat with my agents after every change, and you should too. But manual testing does not scale across multiple scenarios, multiple personas, edge cases, instruction rewrites, plugin changes, and pull request reviews. At some point, “I tried it and it looked good” becomes the weakest link in the development process.

That is exactly the gap the Agents Evaluations CLI fills. It turns manual smoke tests into repeatable assets that travel with your code.

Installing and Running the Agents Evaluations CLI

The package is on npm. No install step required. Run it from your agent project directory with a single npx command:

cd path/to/your-agent-project
npx -y --package @microsoft/m365-copilot-eval@latest runevals

For Agents Toolkit projects, the CLI auto-discovers your .env.local and reads M365_TITLE_ID to identify the agent. Keep non-secret config in .env.local, put secrets like AZURE_AI_API_KEY in .env.local.user, and never commit the user file. For non-Agents Toolkit projects, point to an environment file:

npx -y --package @microsoft/m365-copilot-eval@latest runevals --env dev
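
For reference, here is a minimal sketch of the .env.local / .env.local.user split described above. The variable names come from this setup; the values are placeholders for your own tenant and endpoint:

# .env.local (non-secret config, safe to commit)
M365_TITLE_ID=<your-agent-title-id>

# .env.local.user (secrets, never committed)
AZURE_AI_API_KEY=<your-azure-openai-api-key>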

The tool sends your prompts to the deployed agent, collects responses, and scores them locally using the Azure AI Evaluation SDK. It supports inline prompts, JSON dataset files, and interactive mode. Output formats include HTML (for local reporting), JSON (for diffing and CI), and CSV (for teams that live in spreadsheets).

💡 Tip

Use --output .evals/latest.json to save results to a predictable path for diffing across runs. Great for tracking improvements over time.
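
A minimal sketch of that habit, using only the flags shown in this post (the baseline file name is illustrative):

# Keep the previous report as a baseline before you change anything
cp .evals/latest.json .evals/baseline.json

# ...tweak instructions, grounding, or plugin descriptions, then re-run...
npx -y --package @microsoft/m365-copilot-eval@latest runevals --output .evals/latest.json

# Compare the two reports to see which scores moved
diff .evals/baseline.json .evals/latest.json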

Seven Evaluators, One Quality Vocabulary

The real power of the Agents Evaluations CLI is not the scores themselves. It is the shared language they create for talking about agent quality. Here are the seven evaluator types available today:

  • Relevance (LLM-based, 1-5): Did the answer address the actual user ask?
  • Coherence (LLM-based, 1-5): Was the answer logically structured and clear?
  • Groundedness (LLM-based, 1-5): Did the answer stay anchored to provided context?
  • Similarity (LLM-based, 1-5): Did the answer match the expected meaning?
  • Citations (count-based, >= 0): Did the agent reference its sources?
  • ExactMatch (string match, Boolean): Did the response match a known expected string?
  • PartialMatch (string match, 0.0-1.0): Did the response include required text without being identical?

Relevance and Coherence run on every item by default. The others are opt-in: you enable them per item when a scenario calls for them.

This is not about machines perfectly judging your agent. It is about giving your team a vocabulary. Instead of saying “the agent feels worse,” you can say “Relevance dropped on the escalation prompts” or “Groundedness fails when SharePoint returns partial results.” That is a reviewable, debuggable, repeatable conversation.

Writing Your First Eval Document

The CLI auto-discovers dataset files named prompts.json, evals.json, or tests.json in your project root or in an evals/ folder. If it finds nothing, it offers to create a starter file for you.

Here is a realistic eval document for a support escalation agent. It uses the recommended versioned schema and includes both a single-turn test and a multi-turn conversation:

{
  "schemaVersion": "1.2.0",
  "default_evaluators": {
    "Relevance": {},
    "Coherence": {}
  },
  "items": [
    {
      "name": "Escalation policy lookup",
      "prompt": "A customer says they will cancel unless we respond today. What should I do?",
      "expected_response": "Identify this as an urgent escalation, recommend involving a manager, and summarize the next steps without inventing policy details.",
      "evaluators": {
        "Groundedness": {},
        "Citations": {}
      },
      "evaluators_mode": "extend"
    },
    {
      "name": "Multi-turn context retention",
      "turns": [
        {
          "prompt": "I need help escalating a Sev A customer issue.",
          "expected_response": "Ask for the required escalation details and explain the escalation path."
        },
        {
          "prompt": "The customer is Contoso and the blocker is a failed SSO rollout.",
          "expected_response": "Keep the Sev A escalation context, incorporate Contoso and the SSO blocker, and produce a concise escalation summary."
        }
      ]
    }
  ]
}

The default_evaluators block applies Relevance and Coherence to every item, and the first item extends those defaults with Groundedness and Citations via evaluators_mode: "extend". The second item uses turns for multi-turn evaluation, because interesting failures rarely surface on the first message.

📝 Note

Set evaluators_mode to "replace" when you want only the per-item evaluators to run, ignoring the defaults entirely.
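
For example, an item that should be judged only on an exact string might look like this. This is a sketch built on the schema above, and it assumes ExactMatch compares the response against expected_response:

{
  "name": "Severity label lookup",
  "prompt": "What severity label do we use for a customer-down incident?",
  "expected_response": "Sev A",
  "evaluators": {
    "ExactMatch": {}
  },
  "evaluators_mode": "replace"
}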

Evals Close the Gap in Agentic Coding Workflows

If you are building agents with coding agents, evals become even more important.

A coding agent can scaffold a project, write the manifest, tune instructions, and wire up plugins in minutes. That speed is the whole point. It is also the risk. When development accelerates, quality assurance has to accelerate with it. Otherwise every generated improvement becomes a trust exercise. The instruction rewrite might improve behavior, or it might just sound more polished. The new plugin description might help tool selection, or it might steal traffic from another function.

Evals give the coding agent something concrete to optimize against. The workflow looks like this:

  1. Ask the coding agent to change the agent.
  2. Run evals.
  3. Read the failing prompts.
  4. Fix instructions, grounding, or plugin descriptions.
  5. Run evals again.
  6. Attach the JSON report to the PR.

The pull request stops being “trust me, I tested it.” It becomes “here is the behavior contract, here is what changed, here is the report.” That is a fundamentally healthier CI/CD loop.
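
In command form, steps 2 through 6 of that loop might look like this. The gh invocation is just one illustrative way to attach the evidence; adapt it to your own review flow:

# Run the evals and write a machine-readable report (steps 2 and 5)
npx -y --package @microsoft/m365-copilot-eval@latest runevals --prompts-file evals/evals.json --output .evals/latest.json

# Surface the report on the pull request (step 6), here via the GitHub CLI
gh pr comment --body-file .evals/latest.json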

Running Evals with the Work IQ Evaluator Skill

Remembering the right npx flags is fine for CI scripts, but during development you want a faster loop. The m365-agent-evaluator skill in GitHub Copilot CLI handles eval creation, execution, failure analysis, and instruction fixes in a single conversation.

It ships in the microsoft-365-agents-toolkit plugin from the Work IQ marketplace. Install it once:

/plugin marketplace add microsoft/work-iq
/plugin install microsoft-365-agents-toolkit@work-iq
# Restart GitHub Copilot CLI

Then invoke /m365-agent-evaluator from any agent project:

GitHub Copilot

/m365-agent-evaluator Create an eval document for my support escalation agent, run the evals, analyze failures, and recommend instruction changes.

I’ll set up evals, run them against your deployed agent, and analyze the results.

Created evals/evals.json with 5 prompts covering escalation lookup, multi-turn context, and boundary cases.

Running evals:

npx -y --package @microsoft/m365-copilot-eval@latest runevals --prompts-file evals/evals.json --concurrency 1 --output .evals/latest.json

Results: 3 passed, 2 failed.

Failures:

  • Escalation boundary: Relevance 2/5. The agent escalated a low-severity issue that should have been handled directly.
  • Context retention: Groundedness 1/5. The agent lost the customer name after the second turn.

Recommended instruction changes:

  1. Add a severity classification step before escalation routing.
  2. Explicitly instruct the agent to retain customer name and issue details across turns.

Want me to apply these changes and re-run?

The skill uses the package-scoped npx -y --package @microsoft/m365-copilot-eval@latest runevals command under the hood, so you always get the latest version without managing installs.

Putting Evals in Your Pull Request Flow

You do not need to block production on eval scores today. The tool is in public preview, prompt sets evolve, and some evaluators fit certain scenarios better than others. But every serious agent PR should include eval evidence.

Start with a local run:

npx -y --package @microsoft/m365-copilot-eval@latest runevals --prompts-file evals/evals.json --concurrency 1 --output .evals/latest.json

Then bake the habit into your PR template:

## Agent evals

- [ ] Added or updated eval prompts for the behavior change
- [ ] Ran evals
- [ ] Attached the JSON report or pasted the score summary
- [ ] Explained any known failures or intentionally changed expectations

When you are ready to wire this into CI, one formula matters:

Peak concurrent Azure OpenAI calls = concurrency × LLM-based evaluators per prompt

Five prompts running concurrently with three LLM-based evaluators means fifteen concurrent Azure OpenAI judge calls. That is the kind of capacity planning detail that tells you this tooling is built for real pipelines, not just local demos.
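
When you do wire it in, a minimal CI step might look like the following GitHub Actions sketch. The secret name, the artifact upload, and the concurrency value are assumptions about your pipeline, not prescribed settings:

- name: Run agent evals
  env:
    AZURE_AI_API_KEY: ${{ secrets.AZURE_AI_API_KEY }}
  run: |
    npx -y --package @microsoft/m365-copilot-eval@latest runevals --prompts-file evals/evals.json --concurrency 1 --output .evals/latest.json

- name: Upload eval report
  uses: actions/upload-artifact@v4
  with:
    name: eval-report
    path: .evals/latest.json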

Where This Still Has Edges

No tool replaces human judgment. Evals do not magically tell you which prompts matter, and a bad eval set can give you false confidence. Test only happy paths and your agent will still be fragile everywhere else.

There are practical constraints too:

  • The tool is in public preview.
  • The CLI runs on Windows today with macOS and Linux support in preview.
  • You need a deployed M365 Copilot agent in your tenant.
  • You need an Azure OpenAI endpoint and API key for LLM-based scoring.
  • You need to keep secrets out of version control.
  • You need thoughtful prompt design. Picking the right evaluator for each scenario is a skill, not a checkbox.

That is fine. Preview tooling is allowed to have rough edges. The shape is right: versioned eval documents, structured reports, and a fast-moving changelog that already covers multi-turn evaluation, per-prompt evaluator overrides, parallelization, and agent auto-discovery.

The Value You Just Unlocked

Here is what changes when you add evals to your declarative agent workflow:

  • Repeatable quality loops: Send prompts, score responses, and track improvements over time instead of relying on manual spot checks.
  • Shared quality vocabulary: Talk about “relevance dropped” or “groundedness fails” instead of “the agent feels worse.”
  • PR-ready evidence: Attach JSON reports so reviewers see behavior changes, not just code diffs.
  • Safer agentic coding: Give coding agents a concrete target to optimize against, closing the gap between development speed and quality assurance.
  • Multi-turn regression detection: Catch context failures that only surface in follow-up turns, not in isolated prompts.

Without evals, a declarative agent is a demo. With evals, it starts becoming software.

Resources

Have questions or want to share what you're building? Connect with me on LinkedIn or check out more on The Manifest.