test: benchmark IntentClassifier with FunctionGemma vs gpt-4o-mini #535

## Summary
Benchmark FunctionGemma vs gpt-4o-mini for **tool calling** in IntentClassifier.

## Scope
This benchmark tests structured function dispatch only, not reasoning. The target is the `resolve_ticker_with_llm()` function, which outputs a fixed schema:
```json
{"company_name": "Apple Inc.", "ticker": "AAPL", "found": true}
```
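The "Schema compliance (valid JSON)" metric could be scored with a small validator like the sketch below. The field names and types come from the example payload above; the helper name `is_schema_compliant` and the strict "exactly these three fields" rule are assumptions, not the project's actual contract.

```python
import json

# Expected fields and types, inferred from the example payload above.
# Assumption: no extra fields are allowed and no field may be null.
EXPECTED_FIELDS = {"company_name": str, "ticker": str, "found": bool}


def is_schema_compliant(raw: str) -> bool:
    """Return True if `raw` is a JSON object with exactly the expected fields and types."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # Reject non-objects and objects with missing or extra keys.
    if not isinstance(payload, dict) or set(payload) != set(EXPECTED_FIELDS):
        return False
    return all(isinstance(payload[key], kind) for key, kind in EXPECTED_FIELDS.items())
```

If the real function is allowed to return `null` for `ticker` when `found` is `false`, the type check would need to be relaxed accordingly.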

Current code: `src/cli/utils/intent_classifier.py:162-249`

## Test Cases (Tool Dispatch Only)
```python
# Simple ticker extraction - structured output
test_inputs = [
    ("buy AAPL", "AAPL"),           # Direct ticker
    ("sell SPY at 600", "SPY"),     # Ticker with price
    ("check NVDA", "NVDA"),         # Direct ticker
    ("buy 10 TSLA", "TSLA"),        # Ticker with quantity
]

# Company resolution - requires some knowledge
company_inputs = [
    ("buy apple", "AAPL"),          # Common
    ("sell microsoft", "MSFT"),     # Common
    ("check nvidia", "NVDA"),       # Common
]
```
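One way these `(input, expected_ticker)` pairs might be scored is sketched below. It assumes the resolver can be wrapped as a callable that takes the input text and returns a dict in the schema shown under Scope; the actual signature of `resolve_ticker_with_llm()` is not given here. `run_cases` and `fake_resolve` are hypothetical names, and `fake_resolve` is a rule-based stand-in so the harness can run without a model backend.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_cases(
    resolve: Callable[[str], Dict],
    cases: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Run each case through `resolve` and report accuracy and mean latency in ms."""
    correct = 0
    latencies_ms = []
    for text, expected in cases:
        start = time.perf_counter()
        result = resolve(text)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        # A case counts as correct only if the resolver reports a hit on the right ticker.
        if result.get("found") and result.get("ticker") == expected:
            correct += 1
    n = len(cases)
    return {
        "accuracy": correct / n if n else 0.0,
        "avg_latency_ms": sum(latencies_ms) / n if n else 0.0,
    }


def fake_resolve(text: str) -> Dict:
    """Hypothetical stand-in for the real resolver; not an LLM call."""
    known = {"apple": "AAPL", "microsoft": "MSFT", "nvidia": "NVDA"}
    for token in text.split():
        if token.lower() in known:
            return {"company_name": token, "ticker": known[token.lower()], "found": True}
        # Treat an all-uppercase alphabetic token as a direct ticker.
        if token.isalpha() and token.isupper():
            return {"company_name": "", "ticker": token, "found": True}
    return {"company_name": "", "ticker": "", "found": False}


if __name__ == "__main__":
    sample = [("buy AAPL", "AAPL"), ("sell microsoft", "MSFT")]
    print(run_cases(fake_resolve, sample))
```

In the real benchmark, `fake_resolve` would be replaced by a thin wrapper around each backend, and the same case lists would be run once per model so the two columns of the metrics table are directly comparable.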

## Metrics to Capture
| Metric | gpt-4o-mini | FunctionGemma |
|--------|-------------|---------------|
| Direct ticker accuracy | ? | ? |
| Company resolution accuracy | ? | ? |
| Avg latency (ms) | ? | ? |
| Schema compliance (valid JSON) | ? | ? |

## Success Criteria
- ≥98% accuracy on direct ticker extraction
- ≥90% accuracy on common company names (top 20 stocks)
- <100ms average latency
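The thresholds above could be encoded as a simple check that feeds the go/no-go recommendation. This is a sketch only: the function name `meets_success_criteria` is hypothetical, and accuracies are assumed to be expressed as fractions in the range 0 to 1.

```python
from typing import Dict


def meets_success_criteria(
    direct_accuracy: float,
    company_accuracy: float,
    avg_latency_ms: float,
) -> Dict[str, bool]:
    """Compare measured results against the thresholds listed in this issue."""
    return {
        "direct_ticker": direct_accuracy >= 0.98,       # >=98% on direct ticker extraction
        "company_resolution": company_accuracy >= 0.90,  # >=90% on common company names
        "latency": avg_latency_ms < 100.0,               # strictly under 100 ms on average
    }


if __name__ == "__main__":
    # Illustrative inputs only, not measured results.
    checks = meets_success_criteria(1.0, 0.95, 80.0)
    print(checks, "GO" if all(checks.values()) else "NO-GO")
```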

## Deliverables
- [ ] Benchmark script `tests/benchmarks/intent_classifier_benchmark.py`
- [ ] Results documented in `docs/08_research/`
- [ ] Go/no-go recommendation

## Dependencies
- #533 (Ollama infrastructure)
- #534 (LLM backend abstraction)
