The evaluation engine for every agent you deploy.
AgentFit is the open-source framework that gives teams a structured, reproducible way to evaluate any AI agent against their specific business requirements — across 7 behavioral dimensions, with explanations you can actually act on.
Six capabilities. One evaluation layer.
Built to make every deployment defensible.
Business Need Profiles
Define your organization's agent requirements in a structured markdown file — which capabilities matter, how they should be weighted, what compliance standards apply. Every evaluation is anchored to your context, not an abstract benchmark.
Seven Evaluation Dimensions
Task Competence, Tool Use, Autonomy & Escalation, Safety & Alignment, Compliance & Auditability, Operational Performance, and Deployment Compatibility — each producing a 0–1 score with sub-metrics and weighted feedback.
LLM-Powered Interpretability
After scoring, AgentFit packages the full evaluation — scores, sub-metric breakdowns, exact arithmetic, and your BNP context — into a structured prompt sent to your LLM. The result: explanations grounded in your requirements, not post-hoc commentary.
Framework-Agnostic Protocol
Bring any OpenAI, Anthropic, Google, or fully custom agent. AgentFit evaluates through a universal protocol without you changing a single line of agent code. Pre-built adapters cover the major providers, and a custom adapter takes just 3 async methods.
Reproducible & Auditable
Every evaluation locked, timestamped, and exportable. Compare agent versions side by side. Share results across teams. Full calculation trail for enterprise governance — no black box, no silent weighting.
Scalable Architecture
From a single laptop run to a multi-tenant evaluation platform. All seven dimensions run concurrently via asyncio. REST API with background tasks supports CI/CD pipelines, webhooks, and batch evaluation on every agent commit.
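As a rough illustration of running all seven dimensions concurrently, here is a minimal asyncio sketch. The dimension names come from this page; `evaluate_dimension`, `evaluate_all`, and the placeholder score are hypothetical and are not AgentFit's actual API.

```python
import asyncio

# Dimension names as listed on this page.
DIMENSIONS = [
    "Task Competence",
    "Tool Use",
    "Autonomy & Escalation",
    "Safety & Alignment",
    "Compliance & Auditability",
    "Operational Performance",
    "Deployment Compatibility",
]


async def evaluate_dimension(name: str) -> tuple[str, float]:
    """Hypothetical evaluator: stands in for the real scoring work."""
    await asyncio.sleep(0.01)  # simulate I/O, such as calling the agent under test
    return name, 0.5  # placeholder score in the 0-1 range


async def evaluate_all() -> dict[str, float]:
    # gather() runs the seven coroutines concurrently and keeps input order.
    results = await asyncio.gather(*(evaluate_dimension(d) for d in DIMENSIONS))
    return dict(results)


if __name__ == "__main__":
    for dimension, score in asyncio.run(evaluate_all()).items():
        print(f"{dimension}: {score:.2f}")
```

Because each evaluator mostly waits on I/O, the total wall time stays close to that of the slowest dimension rather than the sum of all seven.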
Four steps to a defensible deployment.
Every score traceable to its source.
Define Your BNP
Write a Business Need Profile — a lightweight markdown file expressing your organization's agent requirements: which capabilities matter, how they should be weighted, what compliance standards apply, and at what task complexity you're operating.
Connect Your Agent
Wrap any agent in the universal protocol using pre-built adapters for OpenAI, Anthropic, and Google — or write a custom adapter in 3 async methods. Zero changes to your existing agent code required.
Run the Evaluation
Seven behavioral dimensions evaluated concurrently. Each produces a 0–1 score with sub-metrics, weighted feedback, and pass/fail thresholds — all anchored to your BNP, not an abstract benchmark.
Get the Interpretation
The LLM receives your complete evaluation — scores, sub-metric breakdowns, exact weighted arithmetic, and BNP context — and returns business-grounded explanations with prioritized, actionable recommendations.
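To make step one more concrete, the sketch below reads dimension weights out of a markdown profile. The file layout (a `## Weights` section with `- Name: value` bullets) and the `parse_weights` helper are assumptions made for illustration; the real BNP format may look quite different.

```python
# A hypothetical BNP layout; the actual AgentFit file format may differ.
BNP_TEXT = """\
# Business Need Profile

## Weights
- Task Competence: 0.30
- Compliance & Auditability: 0.40
- Safety & Alignment: 0.30
"""


def parse_weights(markdown: str) -> dict[str, float]:
    """Read '- Name: value' bullets under a '## Weights' heading."""
    weights: dict[str, float] = {}
    in_section = False
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            # Only bullets under the Weights heading are treated as weights.
            in_section = line[3:].strip().lower() == "weights"
            continue
        if in_section and line.startswith("- ") and ":" in line:
            name, _, value = line[2:].rpartition(":")
            weights[name.strip()] = float(value)
    if not weights:
        raise ValueError("no weights found in profile")
    total = sum(weights.values())
    # Normalize so the weights always sum to 1.
    return {name: value / total for name, value in weights.items()}


if __name__ == "__main__":
    for name, weight in parse_weights(BNP_TEXT).items():
        print(f"{name}: {weight:.2f}")
```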
Seven dimensions. One agent score.
AgentFit evaluates agents across seven behavioral dimensions, each weighted according to your Business Need Profile. A fintech running compliance workflows weights Compliance differently than a DevOps agent — and AgentFit adapts accordingly.
- →Task Competence: 82
- →Tool Use: 74
- →Safety & Alignment: 88
- →Compliance: 68
- →Autonomy: 55
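To show how weighting could turn these per-dimension scores into one agent score, here is a small worked sketch. It reads the displayed values as 0-1 scores (82 becomes 0.82), which is an assumption, and the weights are invented for illustration; in practice they would come from your BNP.

```python
# Scores from the list above, read as 0-1 values (an assumption about the display).
scores = {
    "Task Competence": 0.82,
    "Tool Use": 0.74,
    "Safety & Alignment": 0.88,
    "Compliance": 0.68,
    "Autonomy": 0.55,
}

# Hypothetical weights; a real profile would supply its own.
weights = {
    "Task Competence": 0.30,
    "Tool Use": 0.20,
    "Safety & Alignment": 0.20,
    "Compliance": 0.20,
    "Autonomy": 0.10,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of the dimension scores."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in weights) / total_weight


if __name__ == "__main__":
    # Print each contribution so the arithmetic can be checked by hand.
    for dimension, weight in weights.items():
        print(f"{dimension}: {scores[dimension]:.2f} x {weight:.2f} = {scores[dimension] * weight:.3f}")
    print(f"Overall: {weighted_score(scores, weights):.3f}")  # Overall: 0.761
```

With these example weights the contributions are 0.246, 0.148, 0.176, 0.136, and 0.055, which sum to 0.761. A profile that weights Compliance more heavily would pull the overall score down, since Compliance is one of the weaker dimensions here.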
Every evaluation. Logged, timestamped, reproducible.
The full evaluation record — from BNP definition to interpretation — in a reproducible, immutable log. Compare agent versions over time. Share results across teams. Export for enterprise governance without losing the calculation trail.
- →Every score, sub-metric, and weight contribution attributed to its source
- →Full calculation trail — no black-box aggregation, ever
- →Version your evaluations alongside your agent code
- →JSON and PDF export for compliance and governance teams
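One way to picture a timestamped, tamper-evident evaluation record is sketched below using only the Python standard library. The field names and the `build_record` and `verify_record` helpers are hypothetical; this is not AgentFit's export schema.

```python
import hashlib
import json
from datetime import datetime, timezone

# Fields covered by the content hash (hypothetical record layout).
HASHED_FIELDS = ("bnp_sha256", "scores", "weights")


def _digest(payload: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal inputs hash identically.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_record(bnp_text: str, scores: dict[str, float], weights: dict[str, float]) -> dict:
    """Bundle the evaluation inputs and outputs with a timestamp and a content hash."""
    payload = {
        "bnp_sha256": hashlib.sha256(bnp_text.encode("utf-8")).hexdigest(),
        "scores": scores,
        "weights": weights,
    }
    return {
        **payload,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "content_sha256": _digest(payload),
    }


def verify_record(record: dict) -> bool:
    """Recompute the hash to detect edits made after the record was written."""
    payload = {field: record[field] for field in HASHED_FIELDS}
    return _digest(payload) == record["content_sha256"]


if __name__ == "__main__":
    record = build_record("# example profile", {"Compliance": 0.68}, {"Compliance": 1.0})
    print(json.dumps(record, indent=2))
    print("verified:", verify_record(record))
```

Storing the record as JSON next to the agent code makes it straightforward to diff evaluations between versions.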
The gap between your BNP target and agent score is the signal.
After scoring, AgentFit packages the complete evaluation — every sub-metric, weight, and arithmetic step — into a structured prompt for your chosen LLM. What comes back are business-grounded explanations of exactly why the agent scored as it did, not a generic summary.
- →Explanations grounded in your exact calculation trail — arithmetically verifiable
- →Per-dimension summaries with identified strengths and weaknesses
- →Prioritized recommendations tied to your lowest-scoring dimensions
- →Supports 10+ LLM providers — OpenAI, Anthropic, Groq, Ollama, and more
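The packaging step described above can be sketched as a function that lists every dimension with its score, target, gap, and weight contribution, ordered by the largest shortfall. The prompt wording, the `build_prompt` helper, and all numbers below are illustrative assumptions, not the prompt AgentFit actually sends.

```python
def build_prompt(
    scores: dict[str, float],
    targets: dict[str, float],
    weights: dict[str, float],
) -> str:
    """Assemble a structured prompt listing dimensions by largest shortfall first."""
    lines = ["Explain the evaluation below using only the numbers given.", ""]
    # A positive gap means the agent scored below the target.
    ranked = sorted(scores, key=lambda d: targets[d] - scores[d], reverse=True)
    for dimension in ranked:
        gap = targets[dimension] - scores[dimension]
        contribution = scores[dimension] * weights[dimension]
        lines.append(
            f"- {dimension}: score={scores[dimension]:.2f} "
            f"target={targets[dimension]:.2f} gap={gap:+.2f} "
            f"weight={weights[dimension]:.2f} contribution={contribution:.3f}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values only; targets and weights would come from the profile.
    example_scores = {"Task Competence": 0.82, "Compliance": 0.68, "Autonomy": 0.55}
    example_targets = {"Task Competence": 0.80, "Compliance": 0.90, "Autonomy": 0.60}
    example_weights = {"Task Competence": 0.40, "Compliance": 0.40, "Autonomy": 0.20}
    print(build_prompt(example_scores, example_targets, example_weights))
```

Including the raw arithmetic in the prompt lets a reader check any explanation the LLM returns against the numbers it was given.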
Bring any agent.
Evaluate it the same way.
AgentFit works with any AI provider or custom agent through a universal protocol. No vendor lock-in — compare OpenAI, Anthropic, Google, or your own agent implementation side by side.
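A custom adapter with three async methods might look something like the sketch below. The method names (`setup`, `run_task`, `teardown`) and the `AgentAdapter` protocol are hypothetical, chosen only to illustrate the shape; check the project documentation for the real interface.

```python
import asyncio
from typing import Protocol


class AgentAdapter(Protocol):
    """Hypothetical three-method adapter shape; not AgentFit's real interface."""

    async def setup(self) -> None: ...
    async def run_task(self, prompt: str) -> str: ...
    async def teardown(self) -> None: ...


class EchoAgent:
    """Stand-in for an existing synchronous agent that stays unmodified."""

    def respond(self, prompt: str) -> str:
        return f"echo: {prompt}"


class EchoAdapter:
    """Wraps EchoAgent without touching its code."""

    def __init__(self, agent: EchoAgent) -> None:
        self._agent = agent
        self.ready = False

    async def setup(self) -> None:
        self.ready = True

    async def run_task(self, prompt: str) -> str:
        # Run the synchronous agent in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self._agent.respond, prompt)

    async def teardown(self) -> None:
        self.ready = False


async def run_once(adapter: AgentAdapter, prompt: str) -> str:
    await adapter.setup()
    try:
        return await adapter.run_task(prompt)
    finally:
        await adapter.teardown()  # always release resources, even on failure


if __name__ == "__main__":
    print(asyncio.run(run_once(EchoAdapter(EchoAgent()), "hello")))
```

Because the evaluator only depends on the adapter shape, the same evaluation code can run against any provider or in-house agent that is wrapped this way.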
Open-source.
Enterprise-ready.
AgentFit is free and open-source for every team. Enterprise support is available for organizations running agents at scale.
Open Source
Self-hosted · Apache 2.0 · No usage caps
Free forever. No credit card required.
Enterprise
Managed service · Dedicated support · Custom SLA
Pricing based on your usage and requirements.
Not sure which fits your team? Book a 30-minute call and we will walk you through the right setup.
Start evaluating your agents.
Free and open-source · Apache 2.0