An autonomous AI scientist for scientific discovery, implementing the architecture described in Lu et al. (2024).
Kosmos is an open-source implementation of an autonomous AI scientist that can:
- Generate hypotheses from literature and data analysis
- Design experiments to test those hypotheses
- Execute code in sandboxed Docker containers
- Validate discoveries using an 8-dimension quality framework
- Build knowledge graphs to track relationships between concepts
The system runs autonomous research cycles, generating tasks, executing analyses, and synthesizing findings into validated discoveries.
- Python 3.11+
- Anthropic API key or OpenAI API key
- Docker (recommended for code execution)
Without Docker, code runs via exec() with static validation. See "Code Execution Security" below.
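Whether the sandbox will be used can be checked up front. The following standalone sketch (not a Kosmos API; names are illustrative) tests whether a Docker daemon is reachable before enabling sandboxed execution:

```python
import shutil
import subprocess

def docker_available() -> bool:
    """Return True if a Docker daemon is reachable on this machine."""
    if shutil.which("docker") is None:
        return False
    try:
        # `docker info` exits non-zero when the daemon is not running
        result = subprocess.run(
            ["docker", "info"],
            capture_output=True,
            timeout=10,
        )
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False
```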
```bash
git clone https://github.com/jimmc414/Kosmos.git
cd Kosmos
pip install -e .
cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY or OPENAI_API_KEY
```

```bash
# Run smoke tests
python scripts/smoke_test.py

# Run unit tests
pytest tests/unit/ -v --tb=short
```

```python
import asyncio

from kosmos.workflow.research_loop import ResearchWorkflow


async def run():
    workflow = ResearchWorkflow(
        research_objective="Your research question here",
        artifacts_dir="./artifacts",
    )
    result = await workflow.run(num_cycles=5, tasks_per_cycle=10)
    report = await workflow.generate_report()
    print(report)


asyncio.run(run())
```
```bash
# Run research with default settings
kosmos run "What metabolic pathways differ between cancer and normal cells?" --domain biology

# With budget limit
kosmos run "How do perovskites optimize efficiency?" --domain materials --budget 50

# Interactive mode (recommended for first time)
kosmos run --interactive

# Maximum verbosity
kosmos run "Your question" --domain biology --trace

# Real-time streaming display
kosmos run "Your question" --stream

# Streaming with token display disabled
kosmos run "Your question" --stream --no-stream-tokens

# Show system information
kosmos info

# Run diagnostics
kosmos doctor
```

| Feature | Description | Status |
|---|---|---|
| Research Loop | Multi-cycle autonomous research with hypothesis generation | Complete |
| Literature Search | ArXiv, PubMed, Semantic Scholar integration | Complete |
| Code Execution | Docker-sandboxed Jupyter notebooks | Complete |
| Knowledge Graph | Neo4j-based relationship storage (optional) | Complete |
| Context Compression | Query-based hierarchical compression (20:1 ratio) | Complete |
| Discovery Validation | 8-dimension ScholarEval quality framework | Complete |
| Multi-Provider LLM | Anthropic, OpenAI, LiteLLM (100+ providers) | Complete |
| Budget Enforcement | Cost tracking with configurable limits and enforcement | Complete |
| Error Recovery | Exponential backoff with circuit breaker | Complete |
| Debug Mode | 4-level verbosity with stage tracking | Complete |
| Real-time Streaming | SSE/WebSocket events, CLI --stream flag | Complete |
AI-generated code runs in isolated Docker containers:
| Layer | Implementation |
|---|---|
| Container Isolation | --cap-drop=ALL, no privileged access |
| Network | Disabled (--network=none) |
| Filesystem | Read-only root, tmpfs for scratch |
| Resources | CPU: 2 cores, Memory: 2GB, Timeout: 300s |
| Pooling | Pre-warmed containers reduce cold start |
See: kosmos/execution/sandbox.py, docker_manager.py
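The layers in the table above correspond roughly to docker run flags. As an illustration (the image name, entry command, and tmpfs size below are made up, not taken from the repository), the argument list might be assembled like this:

```python
# Hypothetical helper assembling docker run flags that mirror the
# hardening table above. Image and command names are illustrative.
def sandbox_run_args(image: str = "kosmos-sandbox", timeout_s: int = 300):
    return [
        "docker", "run", "--rm",
        "--cap-drop=ALL",                # drop all Linux capabilities
        "--network=none",                # no network access
        "--read-only",                   # read-only root filesystem
        "--tmpfs", "/tmp:rw,size=256m",  # writable scratch space only
        "--cpus=2",                      # 2 CPU cores
        "--memory=2g",                   # 2 GB memory cap
        image,
        "timeout", str(timeout_s), "python", "/work/cell.py",
    ]
```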
Without Docker, falls back to CodeValidator static analysis + exec(). Not recommended for untrusted inputs.
| Agent | Role |
|---|---|
| Research Director | Master orchestrator coordinating all agents |
| Hypothesis Generator | Generates testable hypotheses from literature |
| Experiment Designer | Creates experimental protocols |
| Data Analyst | Analyzes results and interprets findings |
| Literature Analyzer | Searches and synthesizes papers |
| Plan Creator/Reviewer | Strategic task generation with 70/30 exploration/exploitation |
The system processes literature in batches rather than in bulk:
- Relevance Sorting: Papers are ranked by query relevance before processing
- Batch Size: Top 10 papers per batch
- Statistics Extraction: Regex-based extraction of p-values, sample sizes, and effect sizes
- Tiered Summarization:
  - Task: 42K lines of code reduced to a 2-line summary plus extracted stats
  - Cycle: 10 task summaries reduced to a cycle overview
  - Synthesis: 20 cycles reduced to a final narrative
  - Detail: Full content lazy-loaded when needed
Effective ratio: ~20:1. See kosmos/compression/compressor.py.
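The regex-based statistics extraction mentioned above can be sketched as follows; the patterns here are illustrative stand-ins, not Kosmos's actual expressions:

```python
import re

# Illustrative patterns for p-values, sample sizes, and effect sizes
P_VALUE = re.compile(r"p\s*[<=>]\s*0?\.\d+", re.IGNORECASE)
SAMPLE_SIZE = re.compile(r"\bn\s*=\s*([\d,]+)", re.IGNORECASE)
EFFECT_SIZE = re.compile(r"\b(?:d|r|OR|HR)\s*=\s*-?\d+\.\d+")

def extract_stats(text: str) -> dict:
    """Pull reported statistics out of a paper's text."""
    return {
        "p_values": P_VALUE.findall(text),
        "sample_sizes": [m.replace(",", "") for m in SAMPLE_SIZE.findall(text)],
        "effect_sizes": EFFECT_SIZE.findall(text),
    }
```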
All configuration via environment variables. See .env.example for the full list.
```bash
# Anthropic (default)
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...

# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...

# LiteLLM (supports 100+ providers including local models)
LLM_PROVIDER=litellm
LITELLM_MODEL=ollama/llama3.1:8b
LITELLM_API_BASE=http://localhost:11434
```

```bash
BUDGET_ENABLED=true
BUDGET_LIMIT_USD=10.00
```

Budget enforcement raises BudgetExceededError when the limit is reached, gracefully transitioning the research run to completion.
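A minimal sketch of how such enforcement can work (BudgetExceededError is the exception named here, but this tracker class is illustrative, not Kosmos's actual implementation):

```python
class BudgetExceededError(Exception):
    """Raised when cumulative spend reaches the configured limit."""

class BudgetTracker:
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        """Add a cost; raise once the limit is reached."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.limit_usd:
            raise BudgetExceededError(
                f"Spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f} limit"
            )

tracker = BudgetTracker(limit_usd=10.00)
tracker.record(4.00)      # under the limit: fine
try:
    tracker.record(7.00)  # crosses the limit
except BudgetExceededError:
    pass                  # the research loop would wind down gracefully here
```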
Three independent limits in kosmos/config.py:
| Setting | Default | Range |
|---|---|---|
| max_parallel_hypotheses | 3 | 1-10 |
| max_concurrent_experiments | 10 | 1-16 |
| max_concurrent_llm_calls | 5 | 1-20 |

The paper describes 10 parallel tasks; the default for max_concurrent_experiments now matches that specification.
```bash
# Neo4j (optional, for knowledge graph features)
NEO4J_URI=bolt://localhost:7687
NEO4J_PASSWORD=your-password

# Redis (optional, for distributed caching)
REDIS_URL=redis://localhost:6379
```

Start Neo4j, Redis, and PostgreSQL with Docker Compose:

```bash
# Start all optional services (Neo4j, Redis, PostgreSQL)
docker compose --profile dev up -d

# Or start individual services
docker compose up -d neo4j
docker compose up -d redis
docker compose up -d postgres

# Stop services
docker compose --profile dev down
```

Service URLs when running via Docker:
- Neo4j Browser: http://localhost:7474 (user: neo4j, password: kosmos-password)
- PostgreSQL: localhost:5432 (user: kosmos, password: kosmos-dev-password)
- Redis: localhost:6379
Literature search via Semantic Scholar works without authentication. An API key is optional but increases rate limits:
```bash
# Optional: Get API key from https://www.semanticscholar.org/product/api
SEMANTIC_SCHOLAR_API_KEY=your-key-here
```

```bash
# Enable debug mode with level 1-3
DEBUG_MODE=true
DEBUG_LEVEL=2

# Or use CLI flag for maximum verbosity
kosmos run "Your research question" --trace
```

See docs/DEBUG_MODE.md for comprehensive debug documentation.
```
kosmos/
├── agents/          # Research agents (director, hypothesis, experiment, etc.)
├── compression/     # Context compression (20:1 ratio)
├── core/            # LLM providers, metrics, configuration
│   └── providers/   # Anthropic, OpenAI, LiteLLM with async support
├── execution/       # Docker-based sandboxed code execution
├── knowledge/       # Neo4j knowledge graph (1,025 lines)
├── literature/      # ArXiv, PubMed, Semantic Scholar clients
├── orchestration/   # Plan creation/review, task delegation
├── validation/      # ScholarEval 8-dimension quality framework
├── workflow/        # Main research loop integration
└── world_model/     # State management, JSON artifacts
```
| Category | Percentage | Description |
|---|---|---|
| Paper gaps | 100% | All 17 paper implementation gaps complete |
| Ready for user testing | 95% | Core research loop, agents, LLM providers, validation |
| Deferred | 5% | Phase 4 production mode (polyglot persistence) |
| Issue | Description | Status |
|---|---|---|
| #66 | CLI deadlock - async refactor | ✅ Fixed |
| #67 | SkillLoader domain mapping | ✅ Fixed |
| #68 | Pydantic V2 migration | ✅ Fixed |
| #54-#58 | Critical paper gaps | ✅ Fixed |
| #59 | h5ad/Parquet data formats | ✅ Fixed |
| #69 | R language execution | ✅ Fixed |
| #60 | Figure generation | ✅ Fixed |
| #61 | Jupyter notebook generation | ✅ Fixed |
| #70 | Null model statistical validation | ✅ Fixed |
| #63 | Failure mode detection | ✅ Fixed |
| #62 | Code line provenance | ✅ Fixed |
| #64 | Multi-run convergence framework | ✅ Fixed |
| #65 | Paper accuracy validation | ✅ Fixed |
| #72 | Real-time streaming API | ✅ Fixed |
All 17 paper implementation gaps have been addressed. Full tracking: PAPER_IMPLEMENTATION_GAPS.md
| Category | Count | Status |
|---|---|---|
| Unit tests | 2251 | Passing |
| Integration tests | 415 | Passing |
| E2E tests | 121 | Most pass, some skip (environment-dependent) |
| Requirements tests | 815 | Passing |
E2E tests skip based on environment:
- Neo4j not configured (@pytest.mark.requires_neo4j)
- Docker not running (sandbox execution tests)
- API keys not set (tests requiring live LLM calls)
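One conventional way to implement such environment-gated skips is a skipif marker keyed on configuration; a sketch (the marker below mirrors requires_neo4j but is not the repository's actual definition):

```python
import os

import pytest

# Skip when no Neo4j connection is configured; the env-var name
# matches the NEO4J_URI setting documented above.
requires_neo4j = pytest.mark.skipif(
    not os.environ.get("NEO4J_URI"),
    reason="Neo4j not configured",
)

@requires_neo4j
def test_graph_roundtrip():
    ...  # would exercise the knowledge graph here
```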
This project implements the architecture from the Kosmos paper but has not yet reproduced the paper's claimed results:
| Paper Claim | Implementation Status |
|---|---|
| 79.4% accuracy on scientific statements | Architecture implemented, not validated |
| 7 validated discoveries | Not reproduced |
| 1,500 papers per run | Architecture supports this |
| 42,000 lines of code per run | Architecture supports this |
| 200 agent rollouts | Configurable via max_iterations |
The system is suitable for experimentation and further development. Before production research use, validation studies should be conducted.
- Docker recommended: Without Docker, code execution falls back to direct exec(), which is unsafe for untrusted code.
- Neo4j optional: Knowledge graph features require Neo4j. Set NEO4J_URI, NEO4J_USER, and NEO4J_PASSWORD to enable.
- R support via Docker: R language execution requires the R-enabled Docker image (docker/sandbox/Dockerfile.r) with the TwoSampleMR, susieR, and MendelianRandomization packages.
- Single-user: No multi-tenancy or user isolation.
- Not a reproduction study: We have not yet reproduced the paper's 79.4% accuracy or 7 validated discoveries.
- archive/PAPER_IMPLEMENTATION_GAPS.md - Paper implementation gaps (17/17 complete)
- docs/DEBUG_MODE.md - Debug mode guide
- archive/120525_implementation_gaps_v2.md - Original implementation gaps analysis
- archive/120625_code_review.md - Code review (Dec 2025)
- archive/GETTING_STARTED.md - Detailed usage examples
- CONTRIBUTING.md - Development guidelines (archived)
- CHANGELOG.md - Version history
The original paper omitted implementation details for 6 critical components. This repository provides those implementations:
| Gap | Problem | Solution |
|---|---|---|
| 0 | Context compression for 1,500 papers | Hierarchical 3-tier compression (20:1 ratio) |
| 1 | State Manager schema unspecified | 4-layer hybrid architecture (JSON + Neo4j + Vector + Citations) |
| 2 | Task generation algorithm unstated | Plan Creator + Plan Reviewer pattern |
| 3 | Agent integration mechanism unclear | Skill loader with 116 domain-specific skills (see #67) |
| 4 | Execution environment not described | Docker sandbox with Python + R support (see #69) |
| 5 | Discovery validation criteria missing | ScholarEval 8-dimension quality framework |
For detailed analysis, see archive/120525_implementation_gaps_v2.md.
- Paper: Kosmos: An AI Scientist for Autonomous Discovery (Lu et al., 2024)
- K-Dense ecosystem: Pattern repositories for AI agent systems
- kosmos-figures: Analysis patterns
See CONTRIBUTING.md.
Areas where contributions would be useful:
- Docker sandbox testing and hardening
- Additional scientific domain skills
- Performance benchmarking with production LLMs
- Validation studies to measure actual accuracy
- Multi-tenancy and user isolation
MIT License
Version: 0.2.0-alpha | Tests: 3704 passing | Last Updated: 2025-12-09