AgentOS — The Operating System for AI Agents

v0.3.1 · Open Source · Apache 2.0

Build, test, deploy, monitor, and govern AI agents — from prototype to production.

Get Started → Star on GitHub

$ pip install agentos-platform

0.562 Combined ρ, N=50

3 Embedding Backends

0.02 ms Gov. Overhead (median)

0.600 Embedding ρ, N=50

Commits

Measured over a 50-example hand-labeled set. Reproduce with benchmarks/run_benchmarks.py.

Why AgentOS

Other frameworks help you build agents. We help you ship them.

Testing

Test before you deploy

Simulation sandbox with multi-metric evaluation: BLEU, ROUGE-L, embedding similarity, and LLM-as-judge are combined into a weighted ensemble with a measured Spearman ρ of 0.562 against human judgment (N=50).

0.562 Combined ρ, N=50

Governance

Govern what agents can do

Budget limits, permission controls, kill switch, and immutable audit trails. Enterprise-grade safety with 0.02 ms median overhead for the full governance stack.

0.02 ms median · 0.03 ms P95
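
As a rough illustration of how a budget limit and a kill switch can sit in front of every paid call, here is a minimal sketch. The class and method names (`BudgetGuard`, `charge`, `kill`) are hypothetical and are not taken from the AgentOS API; the real guards may be structured differently.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the configured limit."""


class AgentKilled(RuntimeError):
    """Raised when the kill switch has been engaged."""


class BudgetGuard:
    """Hypothetical guard: tracks spend and refuses calls past a limit."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.killed = False

    def kill(self):
        # Once engaged, every later charge() is rejected.
        self.killed = True

    def charge(self, cost_usd):
        if self.killed:
            raise AgentKilled("kill switch engaged")
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(f"would exceed ${self.limit_usd:.2f} budget")
        self.spent_usd += cost_usd


guard = BudgetGuard(limit_usd=1.00)
guard.charge(0.60)  # accepted, 0.60 of 1.00 spent
```

The check runs before the spend is recorded, so a rejected call leaves the running total unchanged.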

Monitoring

See everything in real-time

Live dashboard tracks every LLM call, tool use, and dollar spent. Embedding drift detection alerts you before your RAG pipeline degrades.

MMD drift detection

Benchmarks

Measured, not claimed

Every capability is benchmarked with quantitative results.

Evaluation Accuracy

Metric	Spearman ρ
BLEU	0.406
ROUGE-L	0.477
Embedding Similarity	0.600
Combined (overall_score)	0.562

Hand-labeled agent responses (N=50). Reproduce with benchmarks/run_benchmarks.py.
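
For readers who want to see what the Spearman ρ column measures, the sketch below computes it from scratch as the Pearson correlation of ranks, with tied values sharing their average rank. The sample scores are made up for illustration and are not the benchmark data; `benchmarks/run_benchmarks.py` may compute the figure differently.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up illustrative data: automatic metric scores vs. human ratings.
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
human_labels = [1, 3, 2, 5, 4]
rho = spearman_rho(metric_scores, human_labels)
```

Because only ranks matter, ρ rewards a metric for ordering responses the way humans do, regardless of the metric's scale.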

RAG Embeddings

Three interchangeable backends via get_embeddings():

OpenAI — text-embedding-3-small via API for hosted, highest-quality retrieval when data can leave your network.

all-MiniLM-L6-v2 — local sentence-transformers inference; private, no API key, runs on CPU.

TF-IDF + SVD — pure-Python fallback with zero external dependencies for air-gapped or minimal installs.

Governance Overhead

Feature	Median (ms)	P95 (ms)
Budget guard	0.000167	0.000209
Permission guard	0.000084	0.000125
Audit trail logging	0.000791	0.000875
Full governance	0.02	0.03

Measured with time.perf_counter over 10,000 iterations per feature.
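
A measurement loop of the kind described above might look like the following sketch. The function name `measure_overhead_ms` and the stub being timed are assumptions for illustration; the project's actual benchmark script may differ.

```python
import time
from statistics import median


def measure_overhead_ms(fn, iterations=10_000):
    """Time fn() repeatedly and return (median_ms, p95_ms)."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style P95: the sample 95% of the way through the sorted list.
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return median(samples), p95


def budget_guard_stub():
    # Hypothetical stand-in for a guard check (spend so far vs. limit).
    return 0.42 <= 1.0


med, p95 = measure_overhead_ms(budget_guard_stub, iterations=1000)
```

Reporting the median alongside P95 keeps a few slow outliers (garbage collection, scheduler noise) from dominating the headline number.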

Evaluation

Not just LLM-as-judge

A multi-metric pipeline that actually correlates with human judgment.

BLEU Score · Weight: 10% · N-gram precision

ROUGE-L · Weight: 10% · Subsequence overlap

Semantic Similarity · Weight: 25% · Embedding cosine distance

Toxicity Check · Weight: 20% · Safety baseline

Tool Accuracy · Weight: 15% · Correct tool usage

LLM Judge · Weight: 15% · Qualitative assessment

Combined: 0.562 ρ (N=50)

from agentos.sandbox.metrics import evaluate_response

report = evaluate_response(
    response="The tip on $85 at 15% is $12.75",
    expected="Uses calculator, returns correct tip amount",
    tools_called=["calculator"],
    expected_tools=["calculator"],
)
# See benchmarks/run_benchmarks.py for measured correlations (N=50)
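
One plausible way to turn the per-metric weights listed above into a single score is a normalized weighted mean, sketched below. The metric keys and the `combine_scores` function are hypothetical. The listed weights add up to 95%, so this sketch divides by the total weight actually used rather than assuming it is 1.

```python
# Weights as listed on this page; keys are illustrative names only.
WEIGHTS = {
    "bleu": 0.10,
    "rouge_l": 0.10,
    "semantic_similarity": 0.25,
    "toxicity": 0.20,
    "tool_accuracy": 0.15,
    "llm_judge": 0.15,
}


def combine_scores(scores, weights=WEIGHTS):
    """Weighted mean over the metrics present in `scores` (each in [0, 1])."""
    used = {name: w for name, w in weights.items() if name in scores}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no known metrics in scores")
    return sum(scores[name] * w for name, w in used.items()) / total


overall = combine_scores({
    "bleu": 0.4,
    "rouge_l": 0.5,
    "semantic_similarity": 0.8,
    "toxicity": 1.0,
    "tool_accuracy": 1.0,
    "llm_judge": 0.7,
})
```

Normalizing by the weights in use also lets the same function run when a metric is unavailable, for example when no LLM judge is configured.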

RAG Pipeline

Three ways to embed

From zero-cost baseline to best-in-class API

Baseline

TF-IDF + SVD

Zero dependencies. Pure Python. Lightweight fallback for keyword-heavy queries and air-gapped installs.

Zero Cost · Air-Gapped · No GPU

Recommended

Sentence Transformers

Local inference with all-MiniLM-L6-v2. Private, no API key required — no data leaves your machine.

Local · Private · No API Key

Premium

OpenAI Embeddings

text-embedding-3-small via API. Hosted embeddings with batching and caching built in.

Hosted · Simple Setup

from agentos.rag.embeddings import get_embeddings

# Recommended: local embeddings (no API key needed)
embedder = get_embeddings("local")
vectors = embedder.embed(["Your document text here"])
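
To give a feel for the dependency-free baseline, the sketch below builds TF-IDF vectors and compares documents by cosine similarity using only the standard library. It is an illustration, not the AgentOS implementation: the SVD dimensionality-reduction step is omitted for brevity, tokenization is a plain whitespace split, and the function names are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one L2-normalized TF-IDF vector per doc."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF so that no term gets a zero or infinite weight.
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def cosine(a, b):
    # Inputs are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


docs = ["reset my password", "password reset link", "quarterly revenue report"]
_, vecs = tfidf_vectors(docs)
related = cosine(vecs[0], vecs[1])
unrelated = cosine(vecs[0], vecs[2])
```

Documents that share no terms score exactly zero here, which is why this kind of baseline suits keyword-heavy queries better than paraphrases.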

Experimentation

A/B testing with real statistics

Proper hypothesis testing, not just which-number-is-bigger

┌─────────────────────────────────────────────────────┐
│ A/B Test: GPT-4o-mini vs GPT-4o                     │
│                                                     │
│ Variant A (4o-mini)   7.2 ± 0.3 (95% CI)            │
│ Variant B (4o)        8.1 ± 0.2 (95% CI)            │
│                                                     │
│ Welch's t-test        t = 4.21   p = 0.0003         │
│ Mann-Whitney U        U = 892    p = 0.0005         │
│ Cohen's d             0.73 (medium effect)          │
│ Bootstrap CI          [0.62, 0.84]                  │
│                                                     │
│ ✅ Winner: GPT-4o (99.97% confidence)               │
│ 📊 Effect: Meaningful in practice                   │
└─────────────────────────────────────────────────────┘

We report both Welch's t-test (parametric, with no equal-variance assumption) and the Mann-Whitney U test (non-parametric) because LLM quality scores are often non-normal, for example bimodal. Cohen's d measures practical significance, so a statistically significant difference that is too small to matter gets flagged.
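
The statistics named above can be sketched with the standard library alone. The code below computes Welch's t statistic with its Welch-Satterthwaite degrees of freedom, Cohen's d with a pooled standard deviation, and the Mann-Whitney U statistic. The sample scores are invented for illustration, the function names are assumptions, and p-values are left out because they need a t-distribution CDF (for instance from SciPy).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a beats b, ties count half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Invented quality scores for two variants.
variant_a = [6.8, 7.1, 7.4, 7.0, 7.6, 7.3]
variant_b = [7.9, 8.2, 8.0, 8.4, 7.8, 8.3]

t, df = welch_t(variant_b, variant_a)
d = cohens_d(variant_b, variant_a)
u = mann_whitney_u(variant_b, variant_a)
```

Cohen's conventional reading treats d near 0.2 as small, 0.5 as medium, and 0.8 as large, which is how an effect can be labeled separately from its p-value.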

Reliability

Know when your RAG degrades

MMD-based embedding drift detection catches quality drops before users do

Healthy

Reference ↔ Current

MMD: 0.03

✅ No drift detected

Drifted

Reference ↔ Shifted

MMD: 0.18

⚠️ Re-indexing recommended

from agentos.rag.drift import EmbeddingDriftDetector

detector = EmbeddingDriftDetector(threshold=0.1)
detector.set_reference(initial_embeddings)

report = detector.check(current_embeddings)
if report.is_drifted:
    trigger_reindex()
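
For intuition about what an MMD score measures, here is a minimal sketch of the biased squared Maximum Mean Discrepancy estimate with an RBF kernel. The kernel choice, the `gamma` value, and the function names are assumptions; the MMD figures shown on this page may be computed with a different kernel or estimator.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(reference, current, gamma=1.0):
    """Biased estimate of squared MMD between two sets of vectors."""

    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(reference, reference)
        + mean_kernel(current, current)
        - 2.0 * mean_kernel(reference, current)
    )


# Toy 2-D "embeddings": the shifted set is the reference moved by (1, 1).
reference = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
shifted = [[x + 1.0, y + 1.0] for x, y in reference]

same = mmd_squared(reference, reference)
moved = mmd_squared(reference, shifted)
```

The score is zero when both sets are identical and grows as the two distributions separate, so a fixed threshold gives a simple drift alarm.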

Architecture

How it all fits together

Interface

CLI: serve · init · mcp

Web Platform: Builder · Chat · Dashboard

Application

Agent SDK

Sandbox

Monitor

Governance

ML Engine

Eval: BLEU · ROUGE · Semantic · LLM

RAG: Embed · Search · Drift

A/B: t-test · Bootstrap · Cohen's d

Providers

OpenAI

Claude

Ollama

Local Models

Mock

Protocol

MCP Server (JSON-RPC / stdio)

Platform

Expose tools to Claude Desktop and Cursor
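
To make the JSON-RPC layer concrete, the sketch below handles the two tool-related MCP methods, `tools/list` and `tools/call`, for one newline-delimited message at a time. It is not the AgentOS server: the `echo` tool and the `handle_message` function are hypothetical, and a real MCP server also has to implement the `initialize` handshake and the stdio read/write loop.

```python
import json

# Hypothetical tool registry with a single example tool.
TOOLS = {
    "echo": {
        "description": "Return the input text unchanged",
        "inputSchema": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
        "fn": lambda args: args["text"],
    },
}


def _error(req_id, code, message):
    return json.dumps(
        {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}
    )


def handle_message(line):
    """Handle one JSON-RPC 2.0 request line and return the response line."""
    req = json.loads(line)
    req_id, method = req.get("id"), req.get("method")
    params = req.get("params", {})
    if method == "tools/list":
        result = {
            "tools": [
                {"name": n, "description": t["description"], "inputSchema": t["inputSchema"]}
                for n, t in TOOLS.items()
            ]
        }
    elif method == "tools/call":
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return _error(req_id, -32602, "Unknown tool")
        text = tool["fn"](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}]}
    else:
        return _error(req_id, -32601, "Method not found")
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "result": result})


request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "echo", "arguments": {"text": "hi"}},
}
response = json.loads(handle_message(json.dumps(request)))
```

Keeping the handler a pure function of one request line makes it easy to test without spawning a subprocess.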

Compare

Honest comparison

We don't claim to beat everyone at everything.

Feature	AgentOS	LangChain	CrewAI	AutoGen
Testing Sandbox	✓ Built-in	✗	✗	✗
Multi-Metric Eval	✓ Built-in	✗	✗	✗
Statistical A/B Testing	✓ Built-in	✗	✗	✗
Embedding Drift Detection	✓ Built-in	✗	✗	✗
Governance & Kill Switch	✓ Built-in	✗	✗	✗
Live Dashboard	✓ Built-in	⚡ LangSmith	✗	✗
RAG Pipeline	✓ 3 backends	✓	✗	✗
Embeddable Widget	✓ Built-in	✗	✗	✗
MCP Server	✓ Built-in	✗	✗	✗
Workflow Engine	✓ Built-in	✓ LangGraph	✓	✗
Multi-Agent	🔜 Roadmap	✓	✓	✓
Community	🌱 Growing	✓ Massive	✓ Large	✓ Large

AgentOS focuses on what others don't — testing, evaluation, and governance from day one. For multi-agent orchestration, LangGraph and CrewAI are excellent complements.

Code

10 lines. Production-ready.

from agentos.governed_agent import GovernedAgent
from agentos.core.tool import tool

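@tool(description="Calculate a math expression")
def calculator(expression: str) -> str:
    # eval() executes arbitrary Python; restrict or replace it before
    # exposing this tool to untrusted or model-generated input.
    return str(eval(expression))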

agent = GovernedAgent(
    name="my-agent",
    model="gpt-4o-mini",
    tools=[calculator],
)

result = agent.run("What's 15% tip on $85?")
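
Since a calculator tool receives model-generated text, a restricted evaluator is a safer body for it than `eval()`. The sketch below walks the parsed expression with Python's `ast` module and allows only numbers and basic arithmetic. `safe_calculate` is a hypothetical helper, not part of AgentOS.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_calculate(expression: str) -> float:
    """Evaluate +, -, *, / on numeric literals; raise ValueError otherwise."""

    def evaluate(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


tip = safe_calculate("85 * 0.15")
```

Function calls, attribute access, and names never match an allowed node type, so inputs such as `__import__('os')` are rejected instead of executed.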

from agentos.sandbox.scenario import Scenario

report = agent.test([
    Scenario(
        name="Math test",
        user_message="What's 25% of 400?",
        expected_behavior="Returns 100",
    ),
])
# Passed: 1/1 | Quality: 9.1/10 | Cost: $0.0001

Embed

Add to any website in one script tag

<script src="https://cdn.agentos.dev/embed.js"
  data-agent="support-bot"
  data-theme="light"></script>

$ pip install agentos-platform
$ agentos init my-agent
✅ Created agent project: my-agent/
$ cd my-agent
$ agentos serve --demo
INFO: Uvicorn running on http://0.0.0.0:8000

Platform

Full web platform included

One command. 13 sections. Zero configuration.

$ agentos serve

Agent Builder

Templates

Chat

Branching

Monitor

Analytics

Scheduler

Events

A/B Testing

Multi-modal

Marketplace

Embed SDK

Auth & Usage

Build agents the right way.

Test before you deploy. Govern in production. Measure everything.
