Test before you deploy
Simulation sandbox with multi-metric evaluation — BLEU, ROUGE-L, embedding similarity, and LLM-as-judge combined into a weighted ensemble measured at 0.562 Spearman ρ against human judgment (N=50).
Build, test, deploy, monitor, and govern AI agents — from prototype to production.
Measured over a 50-example hand-labeled set. Reproduce with benchmarks/run_benchmarks.py.
Budget limits, permission controls, kill switch, and immutable audit trails. Enterprise-grade safety with 0.02 ms median overhead for the full governance stack.
Live dashboard tracks every LLM call, tool use, and dollar spent. Embedding drift detection alerts you before your RAG pipeline degrades.
Every capability is benchmarked with quantitative results.
| Metric | Spearman ρ |
|---|---|
| BLEU | 0.406 |
| ROUGE-L | 0.477 |
| Embedding Similarity | 0.600 |
| Combined (overall_score) | 0.562 |
Hand-labeled agent responses (N=50). Reproduce with benchmarks/run_benchmarks.py.
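To make the table above concrete, here is a minimal sketch of how a weighted ensemble score could be correlated with human labels using Spearman's ρ. The weights, metric names, and toy data are assumptions for illustration only; they are not the values used by `overall_score` or by benchmarks/run_benchmarks.py.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def weighted_ensemble(scores, weights):
    """Weighted average of per-metric scores for one response."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total


# Hypothetical weights and toy data, for illustration only.
WEIGHTS = {"bleu": 0.1, "rouge_l": 0.2, "embedding": 0.4, "judge": 0.3}
metric_scores = [
    {"bleu": 0.10, "rouge_l": 0.20, "embedding": 0.30, "judge": 0.20},
    {"bleu": 0.40, "rouge_l": 0.50, "embedding": 0.60, "judge": 0.70},
    {"bleu": 0.30, "rouge_l": 0.35, "embedding": 0.80, "judge": 0.90},
    {"bleu": 0.70, "rouge_l": 0.75, "embedding": 0.90, "judge": 0.95},
]
human_labels = [1, 3, 4, 5]

combined = [weighted_ensemble(s, WEIGHTS) for s in metric_scores]
rho = spearman_rho(combined, human_labels)
print(f"Spearman rho = {rho:.3f}")
```

Because Spearman's ρ only compares rank order, it rewards a metric for sorting responses the way human raters do, regardless of the scale the metric reports on.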
Three interchangeable backends via get_embeddings():
OpenAI — text-embedding-3-small via API for hosted, highest-quality retrieval when data can leave your network.
all-MiniLM-L6-v2 — local sentence-transformers inference; private, no API key, runs on CPU.
TF-IDF + SVD — pure-Python fallback with zero external dependencies for air-gapped or minimal installs.
| Feature | Median (ms) | P95 (ms) |
|---|---|---|
| Budget guard | 0.000167 | 0.000209 |
| Permission guard | 0.000084 | 0.000125 |
| Audit trail logging | 0.000791 | 0.000875 |
| Full governance | 0.02 | 0.03 |
Measured with time.perf_counter over 10000 iterations per feature.
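A measurement of this kind can be sketched as follows. This is a minimal harness under stated assumptions, not the project's benchmark script: `benchmark` and `budget_guard` are hypothetical names, and the guard body is a stand-in rather than the AgentOS implementation.

```python
import math
import statistics
import time


def benchmark(fn, iterations=10_000):
    """Time fn() once per iteration and return (median_ms, p95_ms)."""
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    median_ms = statistics.median(samples_ms)
    # Nearest-rank P95: smallest sample with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(samples_ms)) - 1)
    return median_ms, samples_ms[p95_index]


def budget_guard(spent=0.25, limit=1.00):
    """Hypothetical stand-in for a guard check (assumption, not AgentOS code)."""
    return spent <= limit


median_ms, p95_ms = benchmark(budget_guard)
print(f"median={median_ms:.6f} ms  p95={p95_ms:.6f} ms")
```

Timing each call individually includes the cost of the two `time.perf_counter()` calls themselves, which can be comparable to a sub-microsecond guard. Timing a batch of calls and dividing by the batch size is a common alternative when that matters.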
A multi-metric pipeline that actually correlates with human judgment.
```python
from agentos.sandbox.metrics import evaluate_response

report = evaluate_response(
    response="The tip on $85 at 15% is $12.75",
    expected="Uses calculator, returns correct tip amount",
    tools_called=["calculator"],
    expected_tools=["calculator"],
)
# See benchmarks/run_benchmarks.py for measured correlations (N=50)
```
From zero-cost baseline to best-in-class API
```python
from agentos.rag.embeddings import get_embeddings

# Recommended: local embeddings (no API key needed)
embedder = get_embeddings("local")
vectors = embedder.embed(["Your document text here"])
```
Proper hypothesis testing, not just which-number-is-bigger
We report both Welch's t-test (parametric) and Mann-Whitney U (non-parametric) because LLM quality scores are often bimodal, which violates the normality assumption behind the t-test. Cohen's d measures practical significance: a statistically significant difference that is too small to matter gets flagged.
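The three statistics named above can be sketched with the standard library alone. This is an illustrative implementation that assumes two independent samples of quality scores; it computes the test statistics but not p-values, and the sample data is made up. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` and `scipy.stats.mannwhitneyu(a, b)` return p-values as well.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a > b count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Toy quality scores for two agent variants (illustrative, not measured data).
variant_a = [8.0, 9.0, 7.0, 9.0, 8.0]
variant_b = [6.0, 7.0, 5.0, 7.0, 6.0]

t_stat, dof = welch_t(variant_a, variant_b)
u_stat = mann_whitney_u(variant_a, variant_b)
effect = cohens_d(variant_a, variant_b)
print(f"t={t_stat:.3f} (df={dof:.1f})  U={u_stat}  d={effect:.3f}")
```

Welch's test does not assume equal variances, Mann-Whitney U relies only on rank order, and Cohen's d expresses the gap in units of standard deviation, so the three together separate "is there a difference" from "is it large enough to care about".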
MMD-based embedding drift detection catches quality drops before users do
| Comparison | MMD | Status |
|---|---|---|
| Reference ↔ Current | 0.03 | ✅ No drift detected |
| Reference ↔ Shifted | 0.18 | ⚠️ Re-indexing recommended |
```python
from agentos.rag.drift import EmbeddingDriftDetector

detector = EmbeddingDriftDetector(threshold=0.1)
detector.set_reference(initial_embeddings)

report = detector.check(current_embeddings)
if report.is_drifted:
    trigger_reindex()
```
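For intuition about what an MMD score measures, here is a minimal sketch of a biased squared-MMD estimate with an RBF kernel, written in pure Python. The kernel choice, the `gamma` value, and the synthetic data are assumptions; the source does not say which kernel or estimator `EmbeddingDriftDetector` uses, so the numbers below are not comparable to its threshold.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples of vectors."""
    def mean_kernel(p, q):
        return sum(rbf_kernel(a, b, gamma) for a in p for b in q) / (len(p) * len(q))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)


def sample(center, n=40, dim=4):
    """Draw n synthetic 'embeddings' from a Gaussian centered at `center`."""
    return [[rng.gauss(center, 0.5) for _ in range(dim)] for _ in range(n)]


reference = sample(0.0)
same_dist = sample(0.0)   # new batch from the same distribution
shifted = sample(1.0)     # batch whose mean has moved

mmd_same = mmd_squared(reference, same_dist)
mmd_shifted = mmd_squared(reference, shifted)
print(f"same distribution: {mmd_same:.3f}  shifted: {mmd_shifted:.3f}")
```

The score stays near zero when both batches come from the same distribution and grows as the distributions separate, which is why a fixed threshold can act as a re-indexing trigger.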
10-line agents with @tool decorator
Token-by-token WebSocket streaming
Ingest, chunk, embed, search, detect drift
Vision (GPT-4o) and PDF extraction
100+ scenario testing with multi-metric eval
Welch's t-test, Cohen's d, bootstrap CI
Fork conversations, explore what-if paths
Real-time cost, tool usage, quality tracking
Budget, permissions, kill switch, audit trail
Cron and interval scheduling, history
Pub/sub with webhook, timer, file triggers
Multi-step pipelines with retry and fallback
Publish, discover, install agent templates
White-label chat widget, one script tag
Expose tools to Claude Desktop and Cursor
We don't claim to beat everyone at everything.
| Feature | AgentOS | LangChain | CrewAI | AutoGen |
|---|---|---|---|---|
| Testing Sandbox | ✓ Built-in | ✗ | ✗ | ✗ |
| Multi-Metric Eval | ✓ Built-in | ✗ | ✗ | ✗ |
| Statistical A/B Testing | ✓ Built-in | ✗ | ✗ | ✗ |
| Embedding Drift Detection | ✓ Built-in | ✗ | ✗ | ✗ |
| Governance & Kill Switch | ✓ Built-in | ✗ | ✗ | ✗ |
| Live Dashboard | ✓ Built-in | ⚡ LangSmith | ✗ | ✗ |
| RAG Pipeline | ✓ 3 backends | ✓ | ✗ | ✗ |
| Embeddable Widget | ✓ Built-in | ✗ | ✗ | ✗ |
| MCP Server | ✓ Built-in | ✗ | ✗ | ✗ |
| Workflow Engine | ✓ Built-in | ✓ LangGraph | ✓ | ✗ |
| Multi-Agent | 🔜 Roadmap | ✓ | ✓ | ✓ |
| Community | 🌱 Growing | ✓ Massive | ✓ Large | ✓ Large |
AgentOS focuses on what others don't — testing, evaluation, and governance from day one. For multi-agent orchestration, LangGraph and CrewAI are excellent complements.
```python
from agentos.governed_agent import GovernedAgent
from agentos.core.tool import tool


@tool(description="Calculate a math expression")
def calculator(expression: str) -> str:
    # Demo only: eval() executes arbitrary code, so use a safe
    # expression parser before exposing this tool to untrusted input.
    return str(eval(expression))


agent = GovernedAgent(
    name="my-agent",
    model="gpt-4o-mini",
    tools=[calculator],
)

result = agent.run("What's 15% tip on $85?")
```
```python
from agentos.sandbox.scenario import Scenario

report = agent.test([
    Scenario(
        name="Math test",
        user_message="What's 25% of 400?",
        expected_behavior="Returns 100",
    ),
])
# Passed: 1/1 | Quality: 9.1/10 | Cost: $0.0001
```
<script src="https://cdn.agentos.dev/embed.js" data-agent="support-bot" data-theme="light"></script>
```console
$ pip install agentos-platform
$ agentos init my-agent
✅ Created agent project: my-agent/
$ cd my-agent
$ agentos serve --demo
INFO: Uvicorn running on http://0.0.0.0:8000
```
One command. 13 sections. Zero configuration.
Test before you deploy. Govern in production. Measure everything.