v0.3.1 · Open Source · Apache 2.0

Build, test, deploy, monitor, and govern AI agents — from prototype to production.

$ pip install agentos-platform

Combined ρ (N=50): 0.562
Embedding backends: 3
Governance overhead (median): 0.02 ms
Embedding ρ (N=50): 0.600

Measured over a 50-example hand-labeled set. Reproduce with benchmarks/run_benchmarks.py.

Other frameworks help you build agents. We help you ship them.

Testing

Test before you deploy

Simulation sandbox with multi-metric evaluation — BLEU, ROUGE-L, embedding similarity, and LLM-as-judge combined into a weighted ensemble measured at 0.562 Spearman ρ against human judgment (N=50).

0.562 Combined ρ, N=50
Governance

Govern what agents can do

Budget limits, permission controls, kill switch, and immutable audit trails. Enterprise-grade safety with 0.02 ms median overhead for the full governance stack.

0.02 ms median · 0.03 ms P95
Monitoring

See everything in real-time

Live dashboard tracks every LLM call, tool use, and dollar spent. Embedding drift detection alerts you before your RAG pipeline degrades.

MMD drift detection

Measured, not claimed

Every capability is benchmarked with quantitative results.

Evaluation Accuracy

Metric                      Spearman ρ
BLEU                        0.406
ROUGE-L                     0.477
Embedding Similarity        0.600
Combined (overall_score)    0.562

Hand-labeled agent responses (N=50). Reproduce with benchmarks/run_benchmarks.py.
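
For readers who want to see what a Spearman ρ figure involves, here is a minimal standard-library sketch. It is not the code in benchmarks/run_benchmarks.py; the function names and the five data points are hypothetical and chosen only for illustration.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: one metric's scores and human ratings for five responses.
metric_scores = [0.91, 0.40, 0.75, 0.10, 0.66]
human_ratings = [5, 2, 4, 1, 3]
rho = spearman_rho(metric_scores, human_ratings)
```

Because ρ depends only on rank order, a metric can score well here even when its raw values sit on a different scale from the human ratings.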

RAG Embeddings

Three interchangeable backends via get_embeddings():

OpenAI text-embedding-3-small: hosted inference via API, for highest-quality retrieval when data can leave your network.

all-MiniLM-L6-v2 — local sentence-transformers inference; private, no API key, runs on CPU.

TF-IDF + SVD — pure-Python fallback with zero external dependencies for air-gapped or minimal installs.

Governance Overhead

Feature                Median (ms)    P95 (ms)
Budget guard           0.000167       0.000209
Permission guard       0.000084       0.000125
Audit trail logging    0.000791       0.000875
Full governance        0.02           0.03

Measured with time.perf_counter over 10000 iterations per feature.
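
A measurement of this kind can be sketched as follows. This is an illustrative harness under stated assumptions, not the project's benchmark script: BudgetGuard here is a hypothetical toy class, not the AgentOS implementation, and the P95 uses a simple nearest-rank index.

```python
import time
from statistics import median

class BudgetGuard:
    """Toy guard (hypothetical): allows a call only while spend stays within the limit."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def check(self, cost_usd):
        return self.spent_usd + cost_usd <= self.limit_usd

def measure_overhead_ms(fn, iterations=10_000):
    """Time each call to fn() and return (median_ms, p95_ms)."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style index for the 95th percentile of the sorted samples.
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return median(samples), p95

guard = BudgetGuard(limit_usd=10.0)
median_ms, p95_ms = measure_overhead_ms(lambda: guard.check(0.01))
print(f"median={median_ms:.6f} ms  p95={p95_ms:.6f} ms")
```

Absolute numbers from a harness like this vary with hardware and interpreter version, so compare runs on the same machine.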

Not just LLM-as-judge

A multi-metric pipeline that actually correlates with human judgment.

BLEU Score · Weight: 10% · N-gram precision
ROUGE-L · Weight: 10% · Subsequence overlap
Semantic Similarity · Weight: 25% · Embedding cosine similarity
Toxicity Check · Weight: 20% · Safety baseline
Tool Accuracy · Weight: 15% · Correct tool usage
LLM Judge · Weight: 15% · Qualitative assessment
Combined: 0.562 ρ (N=50)

from agentos.sandbox.metrics import evaluate_response

report = evaluate_response(
    response="The tip on $85 at 15% is $12.75",
    expected="Uses calculator, returns correct tip amount",
    tools_called=["calculator"],
    expected_tools=["calculator"],
)
# See benchmarks/run_benchmarks.py for measured correlations (N=50)
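
To make the weighting concrete, here is a minimal sketch of a weighted ensemble using the six weights listed above. It is an assumption about how such a score could be combined, not the internals of evaluate_response: the listed weights add up to 95%, so this sketch divides by the weight total, and the metric keys, function name, and example scores are hypothetical.

```python
# Weights as listed on this page; they sum to 0.95, so the sketch normalizes.
WEIGHTS = {
    "bleu": 0.10,
    "rouge_l": 0.10,
    "semantic_similarity": 0.25,
    "toxicity": 0.20,
    "tool_accuracy": 0.15,
    "llm_judge": 0.15,
}

def combine_scores(scores, weights=WEIGHTS):
    """Weighted mean over the metrics present in `scores` (each assumed in [0, 1])."""
    used = {name: w for name, w in weights.items() if name in scores}
    if not used:
        raise ValueError("no known metrics in scores")
    total_weight = sum(used.values())
    return sum(scores[name] * w for name, w in used.items()) / total_weight

# Hypothetical per-metric scores for a single response.
example = {
    "bleu": 0.4,
    "rouge_l": 0.5,
    "semantic_similarity": 0.8,
    "toxicity": 1.0,
    "tool_accuracy": 1.0,
    "llm_judge": 0.7,
}
overall = combine_scores(example)
```

Normalizing by the weights actually used also lets the sketch cope with a missing metric, for example when no LLM judge is configured.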

Three ways to embed

From zero-cost baseline to best-in-class API

Baseline

TF-IDF + SVD

Zero dependencies. Pure Python. Lightweight fallback for keyword-heavy queries and air-gapped installs.

Zero Cost · Air-Gapped · No GPU

Recommended

Sentence Transformers

Local inference with all-MiniLM-L6-v2. Private, no API key required — no data leaves your machine.

Local · Private · No API Key

Premium

OpenAI Embeddings

text-embedding-3-small via API. Hosted embeddings with batching and caching built in.

Hosted · Simple Setup

from agentos.rag.embeddings import get_embeddings

# Recommended: local embeddings (no API key needed)
embedder = get_embeddings("local")
vectors = embedder.embed(["Your document text here"])

A/B testing with real statistics

Proper hypothesis testing, not just which-number-is-bigger

┌─────────────────────────────────────────────────────┐
│  A/B Test: GPT-4o-mini vs GPT-4o                    │
│                                                     │
│  Variant A (4o-mini)   7.2 ± 0.3  (95% CI)          │
│  Variant B (4o)        8.1 ± 0.2  (95% CI)          │
│                                                     │
│  Welch's t-test        t = 4.21   p = 0.0003        │
│  Mann-Whitney U        U = 892    p = 0.0005        │
│  Cohen's d             0.73 (medium effect)         │
│  Bootstrap CI          [0.62, 0.84]                 │
│                                                     │
│  ✅ Winner: GPT-4o (99.97% confidence)              │
│  📊 Effect: Meaningful in practice                  │
└─────────────────────────────────────────────────────┘

We report both Welch's t-test (parametric) and Mann-Whitney U (non-parametric) because LLM quality scores are often bimodal, which violates the normality assumption behind the t-test. Cohen's d measures practical significance: a statistically significant difference that's too small to matter gets flagged.
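
The statistics named here can be sketched with the standard library alone. This is an illustration under stated assumptions, not the AgentOS A/B module: the function names and the two score lists are hypothetical, the bootstrap is a plain percentile bootstrap on the difference in means, and p-values are left out because the standard library has no t-distribution.

```python
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(b) - mean(a)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = (
        (len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)
    ) / (len(a) + len(b) - 2)
    return (mean(b) - mean(a)) / pooled_var ** 0.5

def bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(b, k=len(b))) - mean(rng.choices(a, k=len(a)))
        for _ in range(n_resamples)
    )
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical quality scores for two variants.
variant_a = [7.0, 7.4, 6.9, 7.3, 7.1, 7.5, 7.2, 6.8]
variant_b = [8.0, 8.3, 7.9, 8.2, 8.1, 8.4, 7.8, 8.1]

t, df = welch_t(variant_a, variant_b)
d = cohens_d(variant_a, variant_b)
ci_low, ci_high = bootstrap_ci(variant_a, variant_b)
```

If SciPy is available, scipy.stats.ttest_ind(a, b, equal_var=False) and scipy.stats.mannwhitneyu(a, b) return the corresponding p-values.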

Know when your RAG degrades

MMD-based embedding drift detection catches quality drops before users do

Healthy

Reference ↔ Current

MMD: 0.03

✅ No drift detected

Drifted

Reference ↔ Shifted

MMD: 0.18

⚠️ Re-indexing recommended

from agentos.rag.drift import EmbeddingDriftDetector

detector = EmbeddingDriftDetector(threshold=0.1)
detector.set_reference(initial_embeddings)

report = detector.check(current_embeddings)
if report.is_drifted:
    trigger_reindex()
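
For intuition about what an MMD score measures, here is a minimal standard-library sketch of squared Maximum Mean Discrepancy with a Gaussian kernel. It is not EmbeddingDriftDetector's implementation: the kernel choice, the gamma value, the biased estimator, and the synthetic 2-D "embeddings" are all assumptions made for illustration, so its numbers are not comparable to the MMD values shown above.

```python
import math
import random

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples of vectors."""
    def mean_kernel(ps, qs):
        total = sum(rbf_kernel(p, q, gamma) for p in ps for q in qs)
        return total / (len(ps) * len(qs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Synthetic stand-ins for embeddings: a reference sample, a fresh sample from
# the same distribution, and a sample whose mean has shifted.
rng = random.Random(0)

def sample(center, n=100):
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(n)]

reference = sample(0.0)
same = sample(0.0)
shifted = sample(2.0)

mmd_same = mmd_squared(reference, same, gamma=0.5)
mmd_shifted = mmd_squared(reference, shifted, gamma=0.5)
```

The shifted sample yields a much larger value than the fresh sample from the same distribution, which is the gap a drift threshold is meant to separate.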

How it all fits together

15 modules. One pip install.

Core
Agent SDK: 10-line agents with @tool decorator
Streaming: Token-by-token WebSocket streaming
RAG Pipeline: Ingest, chunk, embed, search, detect drift
Multi-modal: Vision (GPT-4o) and PDF extraction

Testing
Sandbox: 100+ scenario testing with multi-metric eval
A/B Testing: Welch's t-test, Cohen's d, bootstrap CI
Branching: Fork conversations, explore what-if paths

Production
Dashboard: Real-time cost, tool usage, quality tracking
Governance: Budget, permissions, kill switch, audit trail
Scheduler: Cron and interval scheduling, history
Events: Pub/sub with webhook, timer, file triggers
Workflows: Multi-step pipelines with retry and fallback

Ecosystem
Marketplace: Publish, discover, install agent templates
Embed SDK: White-label chat widget, one script tag
MCP Server: Expose tools to Claude Desktop and Cursor

Honest comparison

We don't claim to beat everyone at everything.

Feature                      AgentOS         LangChain       CrewAI     AutoGen
Testing Sandbox              ✓ Built-in
Multi-Metric Eval            ✓ Built-in
Statistical A/B Testing      ✓ Built-in
Embedding Drift Detection    ✓ Built-in
Governance & Kill Switch     ✓ Built-in
Live Dashboard               ✓ Built-in      ⚡ LangSmith
RAG Pipeline                 ✓ 3 backends
Embeddable Widget            ✓ Built-in
MCP Server                   ✓ Built-in
Workflow Engine              ✓ Built-in      ✓ LangGraph
Multi-Agent                  🔜 Roadmap
Community                    🌱 Growing      ✓ Massive       ✓ Large    ✓ Large

AgentOS focuses on what others don't — testing, evaluation, and governance from day one. For multi-agent orchestration, LangGraph and CrewAI are excellent complements.

10 lines. Production-ready.

from agentos.governed_agent import GovernedAgent
from agentos.core.tool import tool

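@tool(description="Calculate a math expression")
def calculator(expression: str) -> str:
    # Demo only: eval() executes arbitrary Python, so swap in a restricted
    # arithmetic parser before exposing this tool to untrusted input.
    return str(eval(expression))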

agent = GovernedAgent(
    name="my-agent",
    model="gpt-4o-mini",
    tools=[calculator],
)

result = agent.run("What's 15% tip on $85?")

# Test the same agent against a scenario
from agentos.sandbox.scenario import Scenario

report = agent.test([
    Scenario(
        name="Math test",
        user_message="What's 25% of 400?",
        expected_behavior="Returns 100",
    ),
])
# Passed: 1/1 | Quality: 9.1/10 | Cost: $0.0001
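
The calculator tool above calls eval() on whatever expression the model produces. For anything beyond a demo, a restricted evaluator is a safer body for that tool. The sketch below is one possible approach using Python's ast module, not part of AgentOS: safe_calculate is a hypothetical name, and it accepts only numeric literals and basic arithmetic operators.

```python
import ast
import operator

# Arithmetic operators the sketch allows; every other node type is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_calculate(expression: str) -> str:
    """Evaluate a numeric expression by walking its AST instead of calling eval()."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](visit(node.operand))
        # Names, calls, attribute access, and so on all end up here.
        raise ValueError(f"unsupported expression: {expression!r}")

    return str(visit(ast.parse(expression, mode="eval")))

result = safe_calculate("400 * 25 / 100")
```

Exponentiation is left out on purpose, since an expression such as 9**9**9 can consume large amounts of CPU and memory.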

Add to any website in one script tag

<script src="https://cdn.agentos.dev/embed.js"
  data-agent="support-bot"
  data-theme="light"></script>

$ pip install agentos-platform
$ agentos init my-agent
✅ Created agent project: my-agent/
$ cd my-agent
$ agentos serve --demo
INFO: Uvicorn running on http://0.0.0.0:8000

Full web platform included

One command. 13 sections. Zero configuration.

$ agentos serve
Agent Builder
Templates
Chat
Branching
Monitor
Analytics
Scheduler
Events
A/B Testing
Multi-modal
Marketplace
Embed SDK
Auth & Usage

Build agents the right way.

Test before you deploy. Govern in production. Measure everything.

$ pip install agentos-platform