Title Abstractor Enterprise

Enterprise-ready title abstraction system built with FastAPI, Next.js, MongoDB, and Celery. This is a complete rebuild of the Streamlit prototype with production-grade architecture.

🆕 Recent Updates

v3.0.36 - January 14, 2026 (Public Records Search Tab)

NY State Warrant Search:

✅ Tax Warrant Search - Search NY Open Data API for state tax warrants
- Searches all names extracted from abstract documents
- Real-time results with warrant ID, debtor info, amount, filed date
- Support for multiple debtor names (joint filers)
- PDF links to original warrant documents
✅ Child Support Warrant Search - Search for child support warrants
- Same functionality as tax warrants with separate results tab

Selection & Persistence:

✅ Selection Persistence - Warrant selections saved to database
- Persist across page refreshes with debounced auto-save
- "Selected for Abstract" panel shows pending items
✅ Send to Abstract - Add warrants as documents to abstract
- Individual send buttons per warrant type
- "Send All to Abstract" for batch operations
- Warrants rendered with WARRANT_DOS template

Results Management:

✅ Merge Toggle - Combine all results into deduplicated list
✅ Sort Dropdown - Sort by Date (Oldest/Newest), Amount, or Name
✅ Sent to Abstract List - Shows all public records already in abstract
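
The merge toggle and sort dropdown can be pictured as two small pure functions. This is a minimal sketch, not the application's code: the field names (`warrant_id`, `debtor_name`, `amount`, `filed_date`), the use of `warrant_id` as the deduplication key, and the largest-first amount order are all assumptions made for illustration.

```python
from datetime import date

def merge_results(*result_lists):
    """Combine result lists, keeping the first record seen per warrant_id (assumed key)."""
    seen = {}
    for results in result_lists:
        for record in results:
            seen.setdefault(record["warrant_id"], record)
    return list(seen.values())

# Sort modes mirroring the dropdown: (key function, reverse flag).
SORT_MODES = {
    "date_oldest": (lambda r: r["filed_date"], False),
    "date_newest": (lambda r: r["filed_date"], True),
    "amount": (lambda r: r["amount"], True),  # assumption: largest first
    "name": (lambda r: r["debtor_name"].lower(), False),
}

def sort_results(results, mode):
    """Return a new list sorted according to one of the SORT_MODES."""
    key, reverse = SORT_MODES[mode]
    return sorted(results, key=key, reverse=reverse)
```

Keeping both steps free of side effects means the same functions could serve the merged view and the per-type tabs.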

v3.0.35 - January 10, 2026 (Bankruptcy Court Improvements & Admin Controls)

Bankruptcy Court Auto-Selection:

✅ Auto-Select Court from Property County - Automatically selects appropriate bankruptcy court based on property county extracted from legal descriptions
- Regex patterns extract county from legal descriptions ("County of X", "in X County", etc.)
- NY county-to-court mapping for all 62 counties across 4 federal districts
- Priority: saved preference → search history → auto-detect → defaults
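
The county extraction and priority order described above can be sketched as follows. The regex patterns, function names, and the four-county mapping are illustrative assumptions; the real mapping covers all 62 counties, and the real patterns may differ (for example, this sketch does not handle ALL-CAPS descriptions).

```python
import re

# Hypothetical patterns for "County of X" and "in X County".
COUNTY_PATTERNS = [
    re.compile(r"\bCounty of ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    re.compile(r"\bin ([A-Z][a-z]+(?: [A-Z][a-z]+)*) County\b"),
]

# Illustrative subset only: one county per federal district.
COUNTY_TO_COURT = {
    "New York": "Southern District of New York",
    "Kings": "Eastern District of New York",
    "Albany": "Northern District of New York",
    "Erie": "Western District of New York",
}

def extract_county(legal_description):
    """Return the first county name found, or None."""
    for pattern in COUNTY_PATTERNS:
        match = pattern.search(legal_description)
        if match:
            return match.group(1)
    return None

def choose_courts(saved, history, legal_description, defaults):
    """Apply the priority: saved preference, search history, auto-detect, defaults."""
    if saved:
        return saved
    if history:
        return history
    court = COUNTY_TO_COURT.get(extract_county(legal_description or ""))
    if court:
        return [court]
    return defaults
```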

Court Selection Persistence:

✅ Remember Court Selection - User's court selection saved and restored on return visits
- New bankruptcy_courts_selected field on Abstract model
- New API endpoint to save selection with debounced updates

County Info Tooltips:

✅ Hover to See Counties - Info icon next to each court shows covered counties
- "i" icon on each court checkbox
- Tooltip lists all counties for that court

Tab & UI Improvements:

✅ Tab Reordering - Bankruptcy tab moved to second position (after Documents)
✅ JSON Tab Admin-Only - Only visible to authenticated admins, moved to last position
✅ Cost Metrics Admin-Only - Dollar amounts (API cost, cost saved) hidden for non-admins

v3.0.34 - January 8, 2026 (PDF-Markdown Bidirectional Text)

✅ Markdown → PDF Highlighting - Select text in editor to highlight on PDF with match navigation
✅ PDF → Clipboard Extraction - Shift+drag on PDF to OCR and copy text
✅ Backend OCR caching for performance

v3.0.33 - January 5, 2026 (Pop-Out PDF Viewer)

✅ Dual Monitor Support - Pop-out PDF in separate window with bi-directional sync
✅ Keyboard Shortcut - Ctrl+Shift+P to toggle pop-out

v3.0.32 - January 5, 2026 (Edit Drawer Navigation)

✅ Navigate Between Setouts - Prev/Next arrows and jump-to dropdown in edit drawer
✅ Save and Next Button - Saves and auto-opens next setout
✅ Reset to Original - Restore complete AI-extracted state from snapshot

v3.0.30-31 - January 5, 2026 (Legal Description Formatting)

✅ Smart Sentence Case - Proper noun detection for cities, counties, streets, person names
✅ Preserves ALL CAPS - Legal opening phrases kept uppercase
✅ Enhanced Patterns - Company names, multi-word last names, county abbreviations
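
A rough sketch of the sentence-casing idea, under stated assumptions: the opening-phrase list and the proper-noun set below are tiny stand-ins for whatever detection the application actually uses, and the function name is hypothetical.

```python
# Assumed examples of legal opening phrases that stay uppercase.
OPENING_PHRASES = ("ALL THAT CERTAIN", "ALL THAT TRACT")
# Stand-in for real proper-noun detection (cities, counties, streets, names).
PROPER_NOUNS = {"amherst", "erie"}

def smart_sentence_case(text):
    """Lowercase a legal description, keeping the opening phrase and proper nouns."""
    prefix = next((p for p in OPENING_PHRASES if text.upper().startswith(p)), "")
    words = []
    for token in text[len(prefix):].split(" "):
        core = token.strip(".,;:()")
        if core.lower() in PROPER_NOUNS:
            words.append(token.capitalize())  # "AMHERST," becomes "Amherst,"
        else:
            words.append(token.lower())
    return prefix + " ".join(words)
```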

v3.0.29 - January 4, 2026 (Sort Order Toggle)

✅ Recording Date vs Scanned Order - Per-abstract toggle for document ordering
✅ Auto-regenerate References - Cross-references update when toggled

v3.0 - November 30, 2025 (AI-Powered Analytics & Admin Security Release)

Admin Dashboard Password Protection:

✅ Password-Protected Admin Access - Secure authentication for admin dashboard and settings
- Simple password protection using localStorage-based authentication
- Password: GoBills! (configurable in AdminAuthContext)
- Password prompt modal with Lock icon on first access
- Persistent authentication with logout functionality
- Protects: Admin Dashboard, AI Insights, AI Improvements, and Settings pages
- AdminAuthContext provides authentication state management
- AdminPasswordPrompt component wraps protected routes

AI Insights - Analytics Chat Assistant:

✅ Conversational Analytics - Chat with AI about edit patterns and system analytics
- Natural language queries about document processing patterns
- Ask questions like "Why do dates get edited so often?" or "Which document types have highest error rates?"
- AI-powered responses using edit tracking data and analytics
- Example prompts: Compare LLM performance, analyze extraction errors, identify improvement opportunities
- Chat history maintained during session
- Access via /admin/ai-insights with Sparkles icon

AI Improvements - Automated Enhancement Workflow:

✅ AI-Generated Improvements - System learns from edit patterns and suggests enhancements
- Automatic detection of recurring edit patterns
- Confidence scoring for each improvement suggestion
- Evidence-based suggestions with frequency counts
- Multi-stage approval workflow:
  1. Pending Review: Initial AI suggestions awaiting admin review
  2. A/B Testing: Run controlled experiments before full deployment
  3. Test Complete: Review test results with statistical significance
  4. Approved: Deploy improvements to production
  5. Rejected: Archive suggestions not suitable for implementation
  6. Rolled Back: Revert deployed changes if issues arise
- A/B test configuration with variant split and minimum sample size
- Statistical analysis with p-values and confidence levels
- Rollback capability for deployed improvements
- Access via /admin/improvements with TrendingUp icon
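
The approval workflow reads naturally as a small state machine. The six stage names come from the list above; the allowed transitions are an assumption inferred from the approve-for-test, approve, reject, and rollback actions, and may not match the backend exactly.

```python
from enum import Enum

class Stage(str, Enum):
    PENDING_REVIEW = "pending_review"
    AB_TESTING = "ab_testing"
    TEST_COMPLETE = "test_complete"
    APPROVED = "approved"
    REJECTED = "rejected"
    ROLLED_BACK = "rolled_back"

# Assumed transition table; terminal stages map to an empty set.
ALLOWED = {
    Stage.PENDING_REVIEW: {Stage.AB_TESTING, Stage.APPROVED, Stage.REJECTED},
    Stage.AB_TESTING: {Stage.TEST_COMPLETE},
    Stage.TEST_COMPLETE: {Stage.APPROVED, Stage.REJECTED},
    Stage.APPROVED: {Stage.ROLLED_BACK},
    Stage.REJECTED: set(),
    Stage.ROLLED_BACK: set(),
}

def transition(current, target):
    """Return the new stage, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Centralizing the table in one place makes it easy to reject invalid moves, such as rolling back an improvement that was never deployed.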

Edit Analytics System:

✅ Comprehensive Edit Tracking - Track all user edits to improve AI extraction
- Monitor edit patterns across document types and fields
- Field-level edit frequency analysis
- Edit type categorization (corrections, additions, formatting)
- Integration with AI Improvements for automated suggestions
- Analytics API endpoints for reporting
- Access via /admin/edit-analytics with BarChart3 icon

Admin Authentication Components:

AdminAuthContext (/frontend/src/contexts/AdminAuthContext.tsx)
- React Context for authentication state management
- isAuthenticated state with localStorage persistence
- login(password) - Validates password and sets auth state
- logout() - Clears auth state and localStorage
- useAdminAuth() hook for consuming auth context
AdminPasswordPrompt (/frontend/src/components/admin/AdminPasswordPrompt.tsx)
- Modal password prompt with Lock icon
- Wraps protected components/pages
- Shows children only when authenticated
- Loading state during hydration
- Error handling for incorrect passwords

New API Endpoints:

POST   /api/v1/chat/analytics         # AI chat for analytics queries
GET    /api/v1/improvements/list      # List all AI improvements
GET    /api/v1/improvements/{id}      # Get improvement details
POST   /api/v1/improvements/{id}/approve-for-test    # Start A/B test
POST   /api/v1/improvements/{id}/approve             # Deploy improvement
POST   /api/v1/improvements/{id}/reject              # Reject improvement
POST   /api/v1/improvements/{id}/rollback            # Rollback deployment
GET    /api/v1/admin/edit-analytics   # Edit tracking analytics

Protected Routes:

All admin routes now require password authentication:

/admin - Admin Dashboard
/admin/ai-insights - AI Analytics Chat
/admin/improvements - AI Improvements Management
/admin/edit-analytics - Edit Analytics Dashboard
/settings - Application Settings

Documentation:

See ADMIN_DASHBOARD_GUIDE.md for admin dashboard usage
Default password: GoBills! (change in AdminAuthContext for production)

v2.6 - November 23, 2025 (Admin Dashboard Release)

Comprehensive Admin Dashboard:

✅ Real-time Metrics Dashboard - Complete administrator interface with live analytics
- Overview Tab - Hero metrics cards showing total documents, success rate, processing time, costs, and system health
- Performance Tab - Processing volume charts, document type breakdowns with interactive bar/pie chart views
- Costs Tab - LLM cost analytics by provider (Gemini, Claude, Azure), cost trends, and ROI calculations
- Quality Tab - Uncertain fields tracking, severity breakdown, and abstracts needing review
- System Tab - MongoDB, Redis, Celery worker health monitoring, and error logs
✅ Time Series Data & Charts - Historical analytics with Recharts visualizations
- Processing volume line/bar charts with time series data
- Cost breakdown pie charts by LLM provider
- Document types distribution (33+ document types tracked)
- Interactive tooltips with white background for readability
✅ Backend Analytics Tasks - Automated metrics collection via Celery Beat
- Hourly processing metrics aggregation (aggregate_hourly_metrics)
- 5-minute system health checks (collect_system_metrics)
- Historical data backfill script for populating metrics
- Metrics retention and cleanup tasks
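
As a hedged illustration of the two intervals above, a Celery beat schedule can be written as a plain dictionary whose `schedule` values are seconds. The task names `aggregate_hourly_metrics` and `collect_system_metrics` come from this README; the module path `app.workers.analytics` and the entry names are assumptions.

```python
# Plain-dict form of a Celery beat schedule; numeric schedules are in seconds.
BEAT_SCHEDULE = {
    "aggregate-hourly-metrics": {
        "task": "app.workers.analytics.aggregate_hourly_metrics",  # hypothetical path
        "schedule": 60 * 60,  # hourly
    },
    "collect-system-metrics": {
        "task": "app.workers.analytics.collect_system_metrics",  # hypothetical path
        "schedule": 5 * 60,  # every 5 minutes
    },
}

# In a worker this dict would typically be assigned to
# celery_app.conf.beat_schedule before starting `celery beat`.
```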

✅ Admin API Endpoints:

GET /api/v1/admin/overview              # Hero metrics and trends
GET /api/v1/admin/metrics/processing    # Processing performance metrics
GET /api/v1/admin/metrics/cost          # Cost analytics by provider
GET /api/v1/admin/system/health         # System health monitoring
GET /api/v1/admin/documents             # Paginated document list
GET /api/v1/admin/errors                # Error log viewer
GET /api/v1/analytics/quality           # Quality metrics and uncertain fields
GET /api/v1/analytics/document-types    # Document type analytics

✅ Dashboard Features:
- Period selector (24h, 7d, 30d, 90d, all time)
- Manual refresh button for on-demand updates
- Export dashboard data to JSON/CSV
- Home button navigation to main app
- Responsive design with tabbed interface

Documentation:

See ADMIN_DASHBOARD_GUIDE.md for complete usage guide
WEBSOCKET_KNOWN_ISSUE.md - Known development mode issues

v2.3 - November 18, 2025 (Snippets & Template Enhancements)

Text Snippet Auto-Expansion System:

Snippet Management - Create reusable text shortcuts that expand to full phrases
- Define shortcuts like mtg, sam, cov that expand to standard legal phrases
- Variable placeholders: {grantee}, {grantor}, {amount}, {dated}, {recording}, {date}, {___}
- Category organization (Legal Description, Document Type, Cross-Reference, Recording)
- Settings page for managing custom snippets
- Default snippets seeded on first run
Built-in Default Snippets:
- sam → "Being the same premises conveyed to {grantee}"
- mtg → "MORTGAGE dated {dated}, made by {grantor} to {grantee}, in the principal sum of {amount}"
- cov → "Covers same premises shown at No. {___} above"
- rec → "Recorded {recording}"
- lisp → "NOTICE OF PENDENCY filed {dated} by {plaintiff} against {defendants}"
- nop → "Object of action: to foreclose Mortgage No. {___}. For further proceedings please see the docket maintained in the County Clerk's Office or shown on the New York State Unified Court System."
- And more for deeds, assignments, satisfactions, agreements
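
Snippet expansion with placeholders can be sketched in a few lines. The snippet texts are taken from the defaults listed above; the `expand` function, its behavior of leaving unfilled placeholders untouched, and its handling of unknown shortcuts are assumptions for illustration.

```python
import re

SNIPPETS = {
    "sam": "Being the same premises conveyed to {grantee}",
    "cov": "Covers same premises shown at No. {___} above",
    "rec": "Recorded {recording}",
}

# Matches {grantee}, {amount}, {___}, and similar placeholders.
PLACEHOLDER = re.compile(r"\{(\w+)\}")

def expand(shortcut, **values):
    """Expand a shortcut; placeholders without a supplied value are left as-is."""
    template = SNIPPETS.get(shortcut)
    if template is None:
        return shortcut  # unknown shortcut: return the typed text unchanged
    return PLACEHOLDER.sub(
        lambda m: str(values.get(m.group(1), m.group(0))), template
    )
```

For example, `expand("sam", grantee="Jane Doe")` fills the grantee, while `expand("cov")` keeps `{___}` in place so the user can complete it by hand.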

Template System Enhancements:

Tax Warrant Template - New dedicated template for state/county tax warrants
- Proper creditor/debtor formatting with VS separator
- Warrant ID and amount fields
- Debtor address display
Document Type Normalization - Automatic display name corrections
- "TRANSCRIPT OF JUDGMENT" → "JUDGMENT"
- "CERTIFICATE OF DEATH" → "DEATH CERTIFICATE"
Template Mapping Improvements - Better routing for document types
- Support for underscore formats (TAX_WARRANT, STATE_TAX_WARRANT)
- LIEN → judgment.j2, TAX WARRANT → tax_warrant.j2
Mortgage Template Fixes - Smart legal description handling
- Detects "same premises" variations to avoid duplicate content
- Patterns: "covers same premises", "being the same premises", "same premises as"
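
The normalization, template routing, and "same premises" detection above could look roughly like this. The display-name corrections, the two template routes, and the three phrases come from the list; routing STATE TAX WARRANT to `tax_warrant.j2`, the fallback name `default.j2`, and all function names are assumptions.

```python
import re

DISPLAY_NAMES = {
    "TRANSCRIPT OF JUDGMENT": "JUDGMENT",
    "CERTIFICATE OF DEATH": "DEATH CERTIFICATE",
}

TEMPLATES = {
    "LIEN": "judgment.j2",
    "TAX WARRANT": "tax_warrant.j2",
    "STATE TAX WARRANT": "tax_warrant.j2",  # assumed routing
}

SAME_PREMISES = re.compile(
    r"covers same premises|being the same premises|same premises as",
    re.IGNORECASE,
)

def normalize_type(raw):
    """Uppercase, turn underscores into spaces, then apply display-name corrections."""
    cleaned = " ".join(raw.replace("_", " ").upper().split())
    return DISPLAY_NAMES.get(cleaned, cleaned)

def template_for(raw, default="default.j2"):
    """Pick a template file for a document type (default name is hypothetical)."""
    return TEMPLATES.get(normalize_type(raw), default)

def mentions_same_premises(text):
    """True when the text already carries a 'same premises' phrase."""
    return bool(SAME_PREMISES.search(text))
```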

API Endpoints:

GET    /api/v1/snippets              # List all snippets
POST   /api/v1/snippets              # Create snippet
GET    /api/v1/snippets/{id}         # Get snippet
PUT    /api/v1/snippets/{id}         # Update snippet
DELETE /api/v1/snippets/{id}         # Delete snippet
POST   /api/v1/snippets/seed         # Seed default snippets

v2.2 - November 13, 2025 (Real-Time Updates Release)

WebSocket Real-Time Status Updates:

✅ WebSocket Integration - Real-time job status updates via WebSocket connections
✅ Redis Pub/Sub - Cross-process messaging between Celery workers and FastAPI server
✅ Eliminated HTTP Polling Spam - No more hundreds of GET requests during processing
✅ Live Progress Updates - Real-time display of current step and progress percentage
✅ Graceful Fallback - Automatic fallback to HTTP polling if WebSocket fails

Technical Implementation:

WebSocket endpoint: ws://localhost:8000/api/v1/ws/jobs/{job_id}
Redis channels for job-specific updates: job_updates:{job_id}
ConnectionManager subscribes to Redis and broadcasts to connected clients
Celery workers publish status updates to Redis instead of direct WebSocket calls
Scalable architecture supporting multiple worker processes

v2.0 - November 12, 2025 (Optimization Release)

Performance & Code Quality Improvements:

Completed comprehensive 3-4 week optimization plan. See docs/backend/OPTIMIZATION_PROGRESS.md for full details.

Key Achievements:

✅ 50-70% Memory Reduction - Fixed PIL image memory leaks in OCR processing
✅ 10x Template Rendering Speed - Implemented LRU cache with TTL for compiled templates
✅ 60-80% Response Size Reduction - Added GZip compression middleware
✅ Service Layer Architecture - Clean separation of business logic from API routes
✅ DRY Code Improvements - Extracted shared utilities for date and recording reference parsing
✅ 75-Test Suite - Comprehensive pytest coverage (93-100% on utilities, 60-83% on services)
✅ Frontend Performance - React.memo(), useMemo(), useCallback() to eliminate unnecessary re-renders
✅ Code Organization - Moved utility scripts to /backend/scripts directory
✅ Caching System - Template cache management API with clear/stats endpoints
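
For readers unfamiliar with the technique, an LRU cache with TTL can be sketched as below. This is a generic illustration, not the project's cache: the class name, defaults, and `stats` fields are assumptions, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Least-recently-used cache whose entries also expire after ttl_seconds."""

    def __init__(self, max_size=128, ttl_seconds=300.0, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (value, expires_at)
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            self.misses += 1
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop and treat as a miss
            self.misses += 1
            return None
        self._data.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return value

    def put(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)
        self._data.move_to_end(key)
        while len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict least recently used

    def clear(self):
        self._data.clear()

    def stats(self):
        return {"size": len(self._data), "hits": self.hits, "misses": self.misses}
```

The `clear` and `stats` methods mirror the kind of operations a cache management API would expose.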

Test Coverage:

cd backend
pytest                              # 75 tests, <3 seconds
pytest --cov=app --cov-report=html  # With coverage report

v1.8 - November 9, 2025

PDF Storage System & Chain Analysis Enhancements:

✅ Individual Document PDF Storage - Automatically extract and save each document as separate PDF
- Single PDFs: Save original + individual document PDFs
- Merged PDFs: Create master merged PDF + individual document PDFs
- New utility functions: merge_pdf_files() and extract_pages_to_file()
- Document schema extended with document_pdf_path field
✅ Comprehensive PDF Deletion - Delete ALL associated files when abstract is deleted
- Removes main PDF, source PDFs, and all individual document PDFs
- Updated cleanup script to match delete endpoint behavior
- No orphaned files on server
✅ Document PDF Endpoint - Serve individual document PDFs with fallback extraction
- GET /api/v1/abstracts/{id}/documents/{index}/pdf
- Backward compatible with old abstracts
✅ Chain Analysis Positioning Fixes - Fixed overlapping document cards
- Increased child chain offset to 350px with 300px minimum clearance
- Collision detection with occupied position tracking
- Cards are 288px wide (w-72) - spacing now accommodates dimensions
✅ Chain Issues UI Improvements - Better layout and full-width text display
- Moved issues section below visualization
- Issue messages span full width using Alert component's flex structure

v1.7 - November 6, 2025

Abstract Metadata Fields & Markdown Formatting:

✅ Three New Abstract Metadata Fields - Control abstract-level settings
- date_from - Date-only field (displayed as M/D/YYYY, e.g., "4/8/2020")
- effective_date - Date and time field (displayed with 12-hour format, e.g., "4/8/2020, 3:30 PM")
- starting_setout_value - Starting number for document numbering/setouts (default: 1)
✅ Document Numbering with Offset - Support for continuing numbering from previous abstracts
- Display number = starting_setout_value + document_index
- Cross-references automatically use display numbers
- Example: starting_setout_value=20 → documents numbered 20, 21, 22...
✅ Markdown Format Simplification - Removed document numbers and headers from markdown output
- Before: ## 1. MORTGAGE → After: MORTGAGE
- Document numbers only appear on document cards in UI
- All 13 backend templates and 8 frontend render functions updated
✅ Enhanced Metadata UI - Upload form and detail screen now include editable metadata fields
✅ Backward Compatibility - Automatic conversion of old date formats with Pydantic field validator
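
The numbering offset described above is simple arithmetic. A minimal sketch, assuming `document_index` is zero-based (which is what the 20, 21, 22 example implies); the function name is hypothetical.

```python
def display_number(starting_setout_value, document_index):
    """Display number = starting_setout_value + zero-based document_index."""
    return starting_setout_value + document_index

# Three documents in an abstract that continues from setout 20.
numbers = [display_number(20, index) for index in range(3)]
print(numbers)  # [20, 21, 22]
```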

v1.6 - November 4, 2025

Document Editing Interface Overhaul:

✅ Type-Specific Field Display - Show only relevant fields for each document type (deed, mortgage, estate, UCC, etc.)
✅ Template-Driven Field Ordering - Fields arranged to match template output (top to bottom)
✅ Cross-Reference Display System - Read-only display of legal description comparisons and mortgage cross-references
✅ Enhanced TypeScript Types - Complete type definitions for all document types and cross-references
✅ Schema Cleanup - Removed unused fields, added all missing fields for 8 document types
✅ Object Handling - Robust handling of complex fields (e.g., principal_amount as object)
✅ UI/UX Improvements - Removed section headers, improved labels, eliminated redundant fields

v1.5 - November 3, 2025

✅ Multi-Chain Support - System now handles multiple independent property chains with separate visualization
✅ Enhanced Legal Description Matching - 4-pass lot number extraction strategy with tuple-based comparison
- Pass 1: Standard patterns ("Lot 37", "Lot No. 37", "Lot #37")
- Pass 2: Standalone "#37" patterns
- Pass 3: Multiple lot patterns ("and 37", ", 37")
- Pass 4: Parenthesized numbers ("Lot Number Thirty-seven (37)")
✅ Synonym Normalization - tract/plot/piece → parcel for better matching accuracy
✅ Tuple-Based Lot Comparison - Prevents false positives (e.g., "Lots 3,4" vs "Lot 4" = different)
✅ Compact Chain Layout - Reduced horizontal spacing to 400px, vertical spacing to 250px for unconnected docs
✅ Branch Point Detection - Child chains positioned at same Y level as parent branch point
✅ Unconnected Document Positioning - Documents without connections positioned at x=-400 to prevent overlap
✅ Party Matching Improvements - Enhanced survivorship handling and name normalization
✅ Book/Page Sorting - Same-date documents sorted by book and page numbers
✅ Document Count Storage - Added document_count field to abstract model
✅ Pass 3 Timing Display - Fixed timing display issues in processing logs
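
The lot extraction passes and tuple comparison can be approximated as follows. The regexes are guesses at the four passes named above, and running them cumulatively into one set is an assumption; the real implementation may order or gate the passes differently.

```python
import re

SYNONYMS = re.compile(r"\b(tract|plot|piece)\b", re.IGNORECASE)

LOT_PASSES = [
    re.compile(r"\blots?\s*(?:no\.?|number|#)?\s*(\d+)", re.IGNORECASE),  # Pass 1
    re.compile(r"(?<!\w)#(\d+)"),                                         # Pass 2
    re.compile(r"(?:\band\b|,)\s*#?(\d+)\b", re.IGNORECASE),              # Pass 3
    re.compile(r"\((\d+)\)"),                                             # Pass 4
]

def normalize_synonyms(text):
    """Map tract/plot/piece to 'parcel' before comparing descriptions."""
    return SYNONYMS.sub("parcel", text)

def extract_lots(text):
    """Collect lot numbers from every pass into a sorted tuple."""
    found = set()
    for pattern in LOT_PASSES:
        found.update(int(number) for number in pattern.findall(text))
    return tuple(sorted(found))

def same_lots(description_a, description_b):
    """Compare whole tuples so 'Lots 3,4' does not match 'Lot 4'."""
    return extract_lots(description_a) == extract_lots(description_b)
```

Comparing tuples rather than checking for any shared number is what prevents the "Lots 3,4" versus "Lot 4" false positive.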

v1.4 - October 29, 2025

✅ Database Performance Optimization - 300-1000x speedup on analytics (10-30s → 0.03s), 10-15x on feedback analysis
✅ Per-Document Markdown Editor - Edit markdown for individual documents with template regeneration
✅ Jinja2 Template System - 10 professional templates for different document types (deed, mortgage, lis pendens, affidavit, UCC, etc.)
✅ Quality Metrics Dashboard - Comprehensive uncertain fields tracking with 17 reason codes across 5 categories
✅ Analytics Quality Endpoint - Real-time quality analytics with severity breakdown and 30-day trends
✅ Enhanced Feedback Analysis - AI analysis now includes relevant extraction prompt excerpts for better suggestions
✅ Export Individual Documents - Export single documents or all documents as separate files in ZIP archive
✅ Database Indexing - Compound indexes for optimal query performance

v1.2 - October 27, 2025

✅ Dual LLM Support - Choose between Gemini 2.5 Pro and Claude Sonnet 4.5 for document processing
✅ LLM Tracking & Display - Track and display which AI model processed each abstract
✅ Delete Functionality - Delete abstracts with confirmation dialog (removes abstract, documents, and PDF file)
✅ Eastern Time Timestamps - Processing logs now display in 12-hour Eastern Time format with seconds
✅ Performance Timing System - Detailed step-by-step timing logs in Celery worker terminal

v1.1 - October 26, 2025

✅ OCR Quality Scoring System - 6-metric quality assessment with automatic Vision API fallback
✅ Google Cloud Vision API Integration - Handwriting support and quality-based model selection
✅ Feedback System - Complete user feedback loop with AI-powered analysis and dashboard
✅ Gemini Model Update - Upgraded to gemini-2.5-pro (Gemini 2.5 Pro)
✅ Home Page Redesign - Improved table layout with better information density
✅ Search Functionality - Global search across all abstracts

See docs/RECENT_UPDATES.md for complete details on all changes.

Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Next.js   │────▶│   FastAPI    │────▶│   MongoDB   │
│  Frontend   │◀────│   Backend    │     │  Database   │
└─────────────┘ WS  └──────────────┘     └─────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │    Redis     │
                    │ Job Queue &  │
                    │   Pub/Sub    │
                    └──────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │    Celery    │
                    │   Workers    │
                    └──────────────┘

Real-Time Updates Flow:

Client connects to WebSocket: ws://localhost:8000/api/v1/ws/jobs/{job_id}
FastAPI subscribes to Redis channel: job_updates:{job_id}
Celery worker publishes status updates to Redis
FastAPI receives from Redis and broadcasts to WebSocket clients
Frontend receives live updates without polling
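
The fan-out in the flow above can be illustrated without Redis. The sketch below uses an in-memory stand-in for pub/sub so it runs on the standard library alone; the class and function names are hypothetical, and only the channel format `job_updates:{job_id}` comes from this README.

```python
import asyncio
import json
from collections import defaultdict

class InMemoryPubSub:
    """Stand-in for Redis pub/sub: each subscriber gets its own queue."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel):
        queue = asyncio.Queue()
        self._subscribers[channel].append(queue)
        return queue

    async def publish(self, channel, message):
        for queue in self._subscribers[channel]:
            await queue.put(message)
        return len(self._subscribers[channel])  # number of receivers

def job_channel(job_id):
    """Channel name for one job's status updates."""
    return f"job_updates:{job_id}"

async def demo():
    broker = InMemoryPubSub()
    # Two browser clients watch the same job; a third watches another job.
    client_a = broker.subscribe(job_channel("abc123"))
    client_b = broker.subscribe(job_channel("abc123"))
    other = broker.subscribe(job_channel("zzz999"))
    # The worker publishes once; the server side fans it out.
    update = json.dumps({"step": "Pass 2: Details", "progress": 50})
    delivered = await broker.publish(job_channel("abc123"), update)
    return delivered, client_a.get_nowait(), client_b.get_nowait(), other.empty()
```

Because workers only publish to a channel, they never need to know which server process holds a given client's WebSocket.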

Tech Stack

Backend

FastAPI - Modern async Python web framework
MongoDB - Document database (Beanie ODM)
Redis - Message broker for the job queue and pub/sub channel for real-time updates
Celery - Distributed task queue for async processing
Dual LLM Support:
- Google Gemini 2.5 Pro - AI extraction (gemini-2.5-pro)
- Anthropic Claude Sonnet 4.5 - AI extraction (claude-sonnet-4-5-20250929)
Tesseract OCR - Primary OCR engine with quality scoring
Google Cloud Vision API - Handwriting OCR and fallback processing

Frontend

Next.js 16 - React framework with App Router
TypeScript - Type-safe development
Tailwind CSS - Utility-first styling
shadcn/ui - React component library
react-pdf - PDF viewing

Infrastructure

Docker Compose - Local development
Kubernetes/AKS - Azure cloud deployment with auto-scaling
Nginx - Reverse proxy (production)

Project Structure

title-abstractor-enterprise/
├── backend/
│   ├── app/
│   │   ├── api/v1/          # API routes (to be built)
│   │   ├── core/            # Business logic (copied from current app)
│   │   │   ├── abstractor.py
│   │   │   ├── gemini_client.py
│   │   │   ├── chain_analyzer.py
│   │   │   └── prompts/
│   │   ├── models/          # MongoDB models ✅
│   │   │   ├── abstract.py
│   │   │   └── job.py
│   │   ├── schemas/         # Pydantic schemas (to be built)
│   │   ├── workers/         # Celery tasks (to be built)
│   │   └── main.py          # FastAPI app ✅
│   └── requirements.txt     # Dependencies ✅
├── frontend/                # Next.js app (to be built)
├── docker-compose.yml       # Local dev setup (to be built)
└── .env.example             # Environment variables ✅

What's Been Built

✅ Phase 1: Backend API - COMPLETED

Directory structure - Full project scaffolding
Core business logic - Abstractor, Gemini client, Chain analyzer, OCR system
Backend configuration - Pydantic settings with env management
MongoDB models - Abstract, Job, Settings, and Feedback models with Beanie ODM
FastAPI app - App with health check, MongoDB connection, CORS middleware
API routes - Complete REST API for abstracts, jobs, settings, prompts, and feedback
Celery workers - Background PDF processing with real-time job tracking
Pydantic schemas - Request/response validation for all endpoints
Docker Compose - Full local development environment
OCR System - Tesseract + Google Vision API with quality scoring
Feedback System - CRUD + AI analysis endpoints

✅ Phase 2: Frontend - COMPLETED

Next.js 16 app - Complete React frontend with Turbopack
Upload UI - Single file and bulk upload modes
Real-time job polling - Status updates during processing
Document viewer - PDF viewer with citations and highlighting
Document editing - Full inline editing with markdown export
Settings UI - Time estimation and prompt management
Feedback UI - Per-document feedback with AI analysis dashboard
Search - Global search across all abstracts
Responsive design - Mobile-friendly interface

🚧 Future Enhancements

✅ WebSocket/SSE - Real-time updates via WebSocket + Redis pub/sub (COMPLETED v2.2)
✅ Tests - Unit and integration test suites (COMPLETED v2.0 - 75 tests)
Production deployment - Nginx reverse proxy, production Docker images
✅ Batch operations - Bulk delete, bulk export (COMPLETED v2.0)

Quick Start

See QUICK_START.md for detailed setup instructions.

Prerequisites

Docker & Docker Compose (recommended) OR
Python 3.11+, MongoDB, Redis (for manual setup)
API Keys (at least one required):
- Google Gemini API key (for Gemini 2.5 Pro)
- Anthropic API key (for Claude Sonnet 4.5)

Option 1: Docker Compose (Recommended)

# 1. Copy environment file
cp .env.example .env

# 2. Edit .env and add your API keys
# Required: GOOGLE_API_KEY (for Gemini) and/or ANTHROPIC_API_KEY (for Claude)
nano .env

# 3. Start all services (MongoDB, Redis, Backend, Celery)
docker-compose up -d

# 4. View logs
docker-compose logs -f backend

# 5. Access API docs (`open` is macOS-only; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:8000/docs

Option 2: Manual Setup

# 1. Start MongoDB and Redis
docker run -d -p 27017:27017 --name mongodb mongo:7
docker run -d -p 6379:6379 --name redis redis:7-alpine

# 2. Backend setup
cd backend
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp ../.env.example ../.env
# Edit .env and add GOOGLE_API_KEY and/or ANTHROPIC_API_KEY
uvicorn app.main:app --reload --port 8000

# 3. Start Celery worker (in new terminal)
cd backend
source venv/bin/activate
celery -A app.workers.celery_app worker --loglevel=info

Visit http://localhost:8000/docs for interactive API documentation.

Development Roadmap

Phase 1: Backend API ✅ (COMPLETED)

Phase 2: Frontend ✅ (COMPLETED)

Initialize Next.js project
Create upload page with drag & drop
Create abstracts list page
Create document viewer with PDF side-by-side
Implement job progress polling
Add chain visualization components

Phase 3: Integration ✅ (COMPLETED)

Docker Compose for full stack
File upload to storage (local/S3)
Error handling and validation
Export endpoints (markdown/JSON)
WebSocket for real-time updates (v2.2)

Phase 4: Polish & Deploy 🚀

Unit & integration tests (75+ tests, see backend/tests/)
Production Docker images (Docker Compose + Kubernetes/AKS ready)
Nginx reverse proxy (optional - currently using direct FastAPI)
Environment-based configuration (.env files + Pydantic Settings)
Documentation & API specs (13 docs files, /docs endpoint)

API Endpoints

POST   /api/v1/abstracts/upload          # Upload PDF, return job_id
GET    /api/v1/jobs/{job_id}             # Check job status
GET    /api/v1/abstracts                 # List all abstracts
GET    /api/v1/abstracts/{id}            # Get abstract details
PUT    /api/v1/abstracts/{id}/documents/{doc_id}  # Edit document
GET    /api/v1/abstracts/{id}/chain      # Get chain analysis
POST   /api/v1/abstracts/{id}/chat       # Chatbot Q&A
GET    /api/v1/abstracts/{id}/export     # Download markdown/JSON
DELETE /api/v1/abstracts/{id}            # Delete abstract

Features Ported from Current App

All core features from the Streamlit prototype are now available:

✅ Dual LLM Support - Choose between Gemini 2.5 Pro and Claude Sonnet 4.5
✅ 3-pass PDF extraction - Pass 1: Inventory, Pass 2: Details, Pass 3: Chain
✅ OCR System - Tesseract + Google Vision API with quality scoring
✅ Document editing interface - Full inline editing with markdown export
✅ Document validation - Deduplication and data validation
✅ Chain of title analysis - Automated relationship detection
✅ Chain visualization - Interactive React Flow diagrams
✅ Legal description comparison - Automated matching and analysis
✅ Feedback system - User feedback with AI analysis
✅ Delete functionality - Remove abstracts with confirmation dialog
✅ Export to markdown/JSON - Multiple export formats
✅ Search - Global search across all abstracts

Notes

Current Streamlit app remains untouched - This is a completely separate project
Reuses battle-tested logic - All extraction and analysis code from current app
Production-ready architecture - Async processing, proper database, scalable workers
Modern frontend - React with TypeScript for maintainability
No user authentication yet - Only the admin routes are password-protected (v3.0); JWT-based login can be added later (backend already has passlib/jose)

Next Steps

All core functionality is complete! Future enhancements:

Testing - Unit and integration test suites ✅ DONE (75 tests)
WebSocket/SSE - Replace polling with real-time updates ✅ DONE (v2.2)
Production deployment - ✅ Kubernetes/AKS complete, optional: Add nginx reverse proxy
Batch operations - ✅ DONE (delete script + export ZIP endpoint)

Documentation

Most documentation lives in the /docs directory; the quick-start and deployment guides sit at the repository root, and backend guides under /backend:

Getting Started:

QUICK_START.md - Quick start guide
docs/DOCUMENTATION.md - Complete user & technical documentation

Deployment:

DOCKER_USAGE.md - Docker Compose setup and local development
KUBERNETES_DEPLOYMENT.md - Azure AKS deployment guide

Technical Documentation:

docs/TECHNICAL_DOCUMENTATION.md - Architecture deep dive
docs/TECHNICAL_DOCUMENTATION_PART2.md - Algorithms & OCR
docs/SCHEMA_REFERENCE.md - Complete data schema

Backend:

backend/README.md - Backend quick start
docs/backend/OPTIMIZATION_PROGRESS.md - v2.0 optimizations
backend/tests/README.md - Testing guide
backend/scripts/README.md - Utility scripts

Updates & Reference:

docs/RECENT_UPDATES.md - Detailed changelog
docs/DOCUMENTATION_INDEX.md - Complete documentation index

License

Same as original Title Abstractor project.
