test: benchmark IntentClassifier with FunctionGemma vs gpt-4o-mini #535

@iAmGiG

Summary

Benchmark FunctionGemma vs gpt-4o-mini for tool calling in IntentClassifier.

Scope

This benchmark covers structured function dispatch only, NOT reasoning. The target is the resolve_ticker_with_llm() function, which outputs a fixed schema:

{"company_name": "Apple Inc.", "ticker": "AAPL", "found": true}
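Since schema compliance is one of the metrics, a small validator for this payload may be useful. The following is a minimal sketch, not the project's actual code: the helper name parse_resolution and the REQUIRED_FIELDS table are hypothetical, and the field types are inferred from the single example above. If a not-found response can carry a null ticker, the type check would need loosening.

```python
import json
from typing import Optional

# Assumed field types, inferred from the one example payload in this issue.
REQUIRED_FIELDS = {"company_name": str, "ticker": str, "found": bool}


def parse_resolution(raw: str) -> Optional[dict]:
    """Return the parsed payload if it matches the fixed schema, else None."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        # A missing key yields None, which fails every isinstance check.
        if not isinstance(payload.get(field), expected_type):
            return None
    return payload


if __name__ == "__main__":
    sample = '{"company_name": "Apple Inc.", "ticker": "AAPL", "found": true}'
    print(parse_resolution(sample))
    print(parse_resolution("not json"))
```

Extra keys are tolerated in this sketch; a stricter check could compare the key set exactly.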

Current code: src/cli/utils/intent_classifier.py:162-249

Test Cases (Tool Dispatch Only)

# Simple ticker extraction - structured output
test_inputs = [
    ("buy AAPL", "AAPL"),           # Direct ticker
    ("sell SPY at 600", "SPY"),     # Ticker with price
    ("check NVDA", "NVDA"),         # Direct ticker
    ("buy 10 TSLA", "TSLA"),        # Ticker with quantity
]

# Company resolution - requires some knowledge
company_inputs = [
    ("buy apple", "AAPL"),          # Common
    ("sell microsoft", "MSFT"),     # Common
    ("check nvidia", "NVDA"),       # Common
]
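One way to score pairs like these is a small harness that takes any callable mapping input text to a ticker, so the same loop can wrap either model. This is a sketch under assumptions: accuracy and uppercase_token_resolver are hypothetical names, and the resolver is only a stand-in for a real model call, which this issue does not specify.

```python
from typing import Callable, Iterable, Optional, Tuple

Resolver = Callable[[str], Optional[str]]


def accuracy(cases: Iterable[Tuple[str, str]], resolve: Resolver) -> float:
    """Fraction of (text, expected_ticker) pairs the resolver gets right."""
    cases = list(cases)
    if not cases:
        return 0.0
    correct = 0
    for text, expected in cases:
        try:
            predicted = resolve(text)
        except Exception:
            # A crash counts as a miss so one bad call cannot abort the run.
            predicted = None
        if predicted == expected:
            correct += 1
    return correct / len(cases)


def uppercase_token_resolver(text: str) -> Optional[str]:
    """Stand-in resolver: returns the first all-uppercase alphabetic token."""
    for token in text.split():
        if token.isalpha() and token.isupper():
            return token
    return None


if __name__ == "__main__":
    sample = [("buy AAPL", "AAPL"), ("buy apple", "AAPL")]
    print(accuracy(sample, uppercase_token_resolver))
```

The stand-in handles direct tickers but not company names, which is the gap the company resolution cases are meant to expose.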

Metrics to Capture

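| Metric                         | gpt-4o-mini | FunctionGemma |
|--------------------------------|-------------|---------------|
| Direct ticker accuracy         | ?           | ?             |
| Company resolution accuracy    | ?           | ?             |
| Avg latency (ms)               | ?           | ?             |
| Schema compliance (valid JSON) | ?           | ?             |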
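The latency row could be filled with a wall-clock measurement around each call. The sketch below assumes the model is wrapped in a plain callable; average_latency_ms and its warmup parameter are hypothetical names, and the warm-up pass is a design choice to keep one-time costs such as model loading out of the average.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def average_latency_ms(call: Callable[[str], object],
                       inputs: Iterable[str],
                       warmup: int = 1) -> float:
    """Mean wall-clock latency per call, in milliseconds."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("at least one input is required")
    # Untimed warm-up calls, so first-call overhead does not skew the mean.
    for text in inputs[:warmup]:
        call(text)
    samples = []
    for text in inputs:
        start = time.perf_counter()
        call(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


if __name__ == "__main__":
    # Simulated 10 ms model call in place of a real client.
    print(average_latency_ms(lambda text: time.sleep(0.01), ["buy AAPL", "check NVDA"]))
```

For a hosted model, this figure includes network round-trip time, so the two columns are only comparable if both are measured from the same machine.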

Success Criteria

  • ≥98% accuracy on direct ticker extraction
  • ≥90% accuracy on common company names (top 20 stocks)
  • <100ms average latency
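These thresholds can be encoded directly so that the go/no-go call is mechanical. A minimal sketch, assuming accuracies are stored as fractions between 0 and 1; BenchmarkResult and failed_criteria are hypothetical names, not part of the existing codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BenchmarkResult:
    direct_ticker_accuracy: float  # fraction in [0, 1]
    company_accuracy: float        # fraction in [0, 1]
    avg_latency_ms: float


def failed_criteria(result: BenchmarkResult) -> List[str]:
    """Return the success criteria the result misses; empty means 'go'."""
    failures = []
    if result.direct_ticker_accuracy < 0.98:
        failures.append("direct ticker accuracy below 98%")
    if result.company_accuracy < 0.90:
        failures.append("company name accuracy below 90%")
    if result.avg_latency_ms >= 100.0:
        failures.append("average latency not under 100 ms")
    return failures


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    example = BenchmarkResult(1.0, 0.85, 42.0)
    print(failed_criteria(example))
```

Returning the list of misses rather than a single boolean keeps the reason for a no-go visible in the results write-up.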

Deliverables

  • Benchmark script tests/benchmarks/intent_classifier_benchmark.py
  • Results documented in docs/08_research/
  • Go/no-go recommendation

Dependencies

Metadata

Labels

enhancement (New feature or request)
