Enterprise-ready title abstraction system built with FastAPI, Next.js, MongoDB, and Celery. This is a complete rebuild of the Streamlit prototype with production-grade architecture.
NY State Warrant Search:
- ✅ Tax Warrant Search - Search NY Open Data API for state tax warrants
- Searches all names extracted from abstract documents
- Real-time results with warrant ID, debtor info, amount, filed date
- Support for multiple debtor names (joint filers)
- PDF links to original warrant documents
- ✅ Child Support Warrant Search - Search for child support warrants
- Same functionality as tax warrants with separate results tab
Selection & Persistence:
- ✅ Selection Persistence - Warrant selections saved to database
- Persist across page refreshes with debounced auto-save
- "Selected for Abstract" panel shows pending items
- ✅ Send to Abstract - Add warrants as documents to abstract
- Individual send buttons per warrant type
- "Send All to Abstract" for batch operations
- Warrants rendered with WARRANT_DOS template
Results Management:
- ✅ Merge Toggle - Combine all results into deduplicated list
- ✅ Sort Dropdown - Sort by Date (Oldest/Newest), Amount, or Name
- ✅ Sent to Abstract List - Shows all public records already in abstract
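The merge toggle and sort dropdown described above could be sketched roughly as follows. The record fields (`warrant_id`, `debtor_name`, `amount`, `filed_date`), the choice of warrant ID as the dedup key, and the "largest amount first" ordering are assumptions for illustration; the actual result schema is not shown here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Warrant:
    # Hypothetical record shape; the real search result schema may differ.
    warrant_id: str
    debtor_name: str
    amount: float
    filed_date: date


def merge_results(*result_lists: list[Warrant]) -> list[Warrant]:
    """Combine several result lists, keeping the first record per warrant_id."""
    seen: set[str] = set()
    merged: list[Warrant] = []
    for results in result_lists:
        for warrant in results:
            if warrant.warrant_id not in seen:
                seen.add(warrant.warrant_id)
                merged.append(warrant)
    return merged


# Sort options mirroring the dropdown: Date (Oldest/Newest), Amount, Name.
SORT_KEYS = {
    "date_oldest": (lambda w: w.filed_date, False),
    "date_newest": (lambda w: w.filed_date, True),
    "amount": (lambda w: w.amount, True),  # assumed: largest amount first
    "name": (lambda w: w.debtor_name.lower(), False),
}


def sort_results(results: list[Warrant], option: str) -> list[Warrant]:
    """Return a new list sorted according to one of the SORT_KEYS options."""
    key, reverse = SORT_KEYS[option]
    return sorted(results, key=key, reverse=reverse)
```

Keeping the sort options in a small table makes it easy to add another dropdown entry without touching the sorting code.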
Bankruptcy Court Auto-Selection:
- ✅ Auto-Select Court from Property County - Automatically selects appropriate bankruptcy court based on property county extracted from legal descriptions
- Regex patterns extract county from legal descriptions ("County of X", "in X County", etc.)
- NY county-to-court mapping for all 62 counties across 4 federal districts
- Priority: saved preference → search history → auto-detect → defaults
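A minimal sketch of the county extraction and the priority order listed above. The four-entry mapping is only an excerpt (the real mapping covers all 62 counties), and the function names `extract_county` and `select_courts` are hypothetical, not taken from the codebase.

```python
import re
from typing import Optional

# Illustrative excerpt only; the real mapping covers all 62 NY counties.
COUNTY_TO_COURT = {
    "erie": "Western District of New York",
    "kings": "Eastern District of New York",
    "albany": "Northern District of New York",
    "new york": "Southern District of New York",
}

# Build the patterns from known county names so that trailing words such as
# "and State of New York" are not captured as part of the county.
_NAMES = "|".join(sorted(map(re.escape, COUNTY_TO_COURT), key=len, reverse=True))
COUNTY_PATTERNS = [
    re.compile(rf"\bcounty of ({_NAMES})\b", re.IGNORECASE),
    re.compile(rf"\bin ({_NAMES}) county\b", re.IGNORECASE),
]


def extract_county(legal_description: str) -> Optional[str]:
    """Return the lowercase county name found in the text, or None."""
    for pattern in COUNTY_PATTERNS:
        match = pattern.search(legal_description)
        if match:
            return match.group(1).lower()
    return None


def select_courts(saved, history, legal_description, defaults):
    """Priority: saved preference, then search history, then auto-detect, then defaults."""
    if saved:
        return saved
    if history:
        return history
    county = extract_county(legal_description)
    if county:
        return [COUNTY_TO_COURT[county]]
    return defaults
```

For example, `select_courts([], [], "County of Erie and State of New York", defaults)` falls through to auto-detection because there is no saved preference or search history.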
Court Selection Persistence:
- ✅ Remember Court Selection - User's court selection saved and restored on return visits
- New bankruptcy_courts_selected field on Abstract model
- New API endpoint to save selection with debounced updates
County Info Tooltips:
- ✅ Hover to See Counties - Info icon next to each court shows covered counties
- "i" icon on each court checkbox
- Tooltip lists all counties for that court
Tab & UI Improvements:
- ✅ Tab Reordering - Bankruptcy tab moved to second position (after Documents)
- ✅ JSON Tab Admin-Only - Only visible to authenticated admins, moved to last position
- ✅ Cost Metrics Admin-Only - Dollar amounts (API cost, cost saved) hidden for non-admins
- ✅ Markdown → PDF Highlighting - Select text in editor to highlight on PDF with match navigation
- ✅ PDF → Clipboard Extraction - Shift+drag on PDF to OCR and copy text
- ✅ Backend OCR caching for performance
- ✅ Dual Monitor Support - Pop-out PDF in separate window with bi-directional sync
- ✅ Keyboard Shortcut - Ctrl+Shift+P to toggle pop-out
- ✅ Navigate Between Setouts - Prev/Next arrows and jump-to dropdown in edit drawer
- ✅ Save and Next Button - Saves and auto-opens next setout
- ✅ Reset to Original - Restore complete AI-extracted state from snapshot
- ✅ Smart Sentence Case - Proper noun detection for cities, counties, streets, person names
- ✅ Preserves ALL CAPS - Legal opening phrases kept uppercase
- ✅ Enhanced Patterns - Company names, multi-word last names, county abbreviations
- ✅ Recording Date vs Scanned Order - Per-abstract toggle for document ordering
- ✅ Auto-regenerate References - Cross-references update when toggled
Admin Dashboard Password Protection:
- ✅ Password-Protected Admin Access - Secure authentication for admin dashboard and settings
- Simple password protection using localStorage-based authentication
- Password: GoBills! (configurable in AdminAuthContext)
- Password prompt modal with Lock icon on first access
- Persistent authentication with logout functionality
- Protects: Admin Dashboard, AI Insights, AI Improvements, and Settings pages
- AdminAuthContext provides authentication state management
- AdminPasswordPrompt component wraps protected routes
AI Insights - Analytics Chat Assistant:
- ✅ Conversational Analytics - Chat with AI about edit patterns and system analytics
- Natural language queries about document processing patterns
- Ask questions like "Why do dates get edited so often?" or "Which document types have highest error rates?"
- AI-powered responses using edit tracking data and analytics
- Example prompts: Compare LLM performance, analyze extraction errors, identify improvement opportunities
- Chat history maintained during session
- Access via /admin/ai-insights with Sparkles icon
AI Improvements - Automated Enhancement Workflow:
- ✅ AI-Generated Improvements - System learns from edit patterns and suggests enhancements
- Automatic detection of recurring edit patterns
- Confidence scoring for each improvement suggestion
- Evidence-based suggestions with frequency counts
- Multi-stage approval workflow:
- Pending Review: Initial AI suggestions awaiting admin review
- A/B Testing: Run controlled experiments before full deployment
- Test Complete: Review test results with statistical significance
- Approved: Deploy improvements to production
- Rejected: Archive suggestions not suitable for implementation
- Rolled Back: Revert deployed changes if issues arise
- A/B test configuration with variant split and minimum sample size
- Statistical analysis with p-values and confidence levels
- Rollback capability for deployed improvements
- Access via /admin/improvements with TrendingUp icon
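The workflow above mentions p-values and a minimum sample size but does not say which statistical test is used. As one common possibility, a two-proportion z-test on "accepted without edits" rates could look like the sketch below; the function names, the 0.05 threshold, and the choice of metric are all assumptions.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both variants share one success rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical all-success or all-failure samples
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def evaluate_ab_test(success_a, n_a, success_b, n_b,
                     min_sample_size=100, alpha=0.05):
    """Summarize a test; 'significant' is only set once both arms are large enough."""
    enough_data = min(n_a, n_b) >= min_sample_size
    z, p_value = two_proportion_z_test(success_a, n_a, success_b, n_b)
    return {
        "enough_data": enough_data,
        "z": z,
        "p_value": p_value,
        "significant": enough_data and p_value < alpha,
    }
```

Gating the verdict on the minimum sample size avoids declaring a winner from a handful of early documents.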
Edit Analytics System:
- ✅ Comprehensive Edit Tracking - Track all user edits to improve AI extraction
- Monitor edit patterns across document types and fields
- Field-level edit frequency analysis
- Edit type categorization (corrections, additions, formatting)
- Integration with AI Improvements for automated suggestions
- Analytics API endpoints for reporting
- Access via /admin/edit-analytics with BarChart3 icon
Admin Authentication Components:
- AdminAuthContext (/frontend/src/contexts/AdminAuthContext.tsx)
- React Context for authentication state management
- isAuthenticated state with localStorage persistence
- login(password) - Validates password and sets auth state
- logout() - Clears auth state and localStorage
- useAdminAuth() hook for consuming auth context
- AdminPasswordPrompt (/frontend/src/components/admin/AdminPasswordPrompt.tsx)
- Modal password prompt with Lock icon
- Wraps protected components/pages
- Shows children only when authenticated
- Loading state during hydration
- Error handling for incorrect passwords
New API Endpoints:
POST /api/v1/chat/analytics # AI chat for analytics queries
GET /api/v1/improvements/list # List all AI improvements
GET /api/v1/improvements/{id} # Get improvement details
POST /api/v1/improvements/{id}/approve-for-test # Start A/B test
POST /api/v1/improvements/{id}/approve # Deploy improvement
POST /api/v1/improvements/{id}/reject # Reject improvement
POST /api/v1/improvements/{id}/rollback # Rollback deployment
GET /api/v1/admin/edit-analytics # Edit tracking analytics
Protected Routes:
All admin routes now require password authentication:
- /admin - Admin Dashboard
- /admin/ai-insights - AI Analytics Chat
- /admin/improvements - AI Improvements Management
- /admin/edit-analytics - Edit Analytics Dashboard
- /settings - Application Settings
Documentation:
- See ADMIN_DASHBOARD_GUIDE.md for admin dashboard usage
- Default password: GoBills! (change in AdminAuthContext for production)
Comprehensive Admin Dashboard:
- ✅ Real-time Metrics Dashboard - Complete administrator interface with live analytics
- Overview Tab - Hero metrics cards showing total documents, success rate, processing time, costs, and system health
- Performance Tab - Processing volume charts, document type breakdowns with interactive bar/pie chart views
- Costs Tab - LLM cost analytics by provider (Gemini, Claude, Azure), cost trends, and ROI calculations
- Quality Tab - Uncertain fields tracking, severity breakdown, and abstracts needing review
- System Tab - MongoDB, Redis, Celery worker health monitoring, and error logs
- ✅ Time Series Data & Charts - Historical analytics with Recharts visualizations
- Processing volume line/bar charts with time series data
- Cost breakdown pie charts by LLM provider
- Document types distribution (33+ document types tracked)
- Interactive tooltips with white background for readability
- ✅ Backend Analytics Tasks - Automated metrics collection via Celery Beat
- Hourly processing metrics aggregation (aggregate_hourly_metrics)
- 5-minute system health checks (collect_system_metrics)
- Historical data backfill script for populating metrics
- Metrics retention and cleanup tasks
- ✅ Admin API Endpoints:
GET /api/v1/admin/overview # Hero metrics and trends
GET /api/v1/admin/metrics/processing # Processing performance metrics
GET /api/v1/admin/metrics/cost # Cost analytics by provider
GET /api/v1/admin/system/health # System health monitoring
GET /api/v1/admin/documents # Paginated document list
GET /api/v1/admin/errors # Error log viewer
GET /api/v1/analytics/quality # Quality metrics and uncertain fields
GET /api/v1/analytics/document-types # Document type analytics
- ✅ Dashboard Features:
- Period selector (24h, 7d, 30d, 90d, all time)
- Manual refresh button for on-demand updates
- Export dashboard data to JSON/CSV
- Home button navigation to main app
- Responsive design with tabbed interface
Documentation:
- See ADMIN_DASHBOARD_GUIDE.md for complete usage guide
- WEBSOCKET_KNOWN_ISSUE.md - Known development mode issues
Text Snippet Auto-Expansion System:
- Snippet Management - Create reusable text shortcuts that expand to full phrases
- Define shortcuts like mtg, sam, cov that expand to standard legal phrases
- Variable placeholders: {grantee}, {grantor}, {amount}, {dated}, {recording}, {date}, {___}
- Category organization (Legal Description, Document Type, Cross-Reference, Recording)
- Settings page for managing custom snippets
- Default snippets seeded on first run
- Built-in Default Snippets:
- sam → "Being the same premises conveyed to {grantee}"
- mtg → "MORTGAGE dated {dated}, made by {grantor} to {grantee}, in the principal sum of {amount}"
- cov → "Covers same premises shown at No. {___} above"
- rec → "Recorded {recording}"
- lisp → "NOTICE OF PENDENCY filed {dated} by {plaintiff} against {defendants}"
- nop → "Object of action: to foreclose Mortgage No. {___}. For further proceedings please see the docket maintained in the County Clerk's Office or shown on the New York State Unified Court System."
- And more for deeds, assignments, satisfactions, agreements
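As a rough illustration of how shortcut expansion with placeholders might work, the sketch below uses three of the default snippets listed above. The function name `expand_snippet` and the decision to leave unfilled placeholders untouched are assumptions, not the project's confirmed behavior.

```python
import re

# A few of the built-in defaults listed above.
DEFAULT_SNIPPETS = {
    "sam": "Being the same premises conveyed to {grantee}",
    "cov": "Covers same premises shown at No. {___} above",
    "rec": "Recorded {recording}",
}

# Matches {grantee}, {amount}, and also the blank placeholder {___}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def expand_snippet(shortcut, values=None, snippets=DEFAULT_SNIPPETS):
    """Expand a shortcut to its phrase, filling any placeholders we have values for."""
    template = snippets.get(shortcut)
    if template is None:
        return shortcut  # unknown shortcuts pass through unchanged
    values = values or {}
    # Placeholders without a supplied value are kept so the user can fill them in.
    return PLACEHOLDER.sub(
        lambda m: str(values.get(m.group(1), m.group(0))), template
    )
```

A regex substitution is used instead of `str.format` because `{___}` should survive expansion when no value is given.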
Template System Enhancements:
- Tax Warrant Template - New dedicated template for state/county tax warrants
- Proper creditor/debtor formatting with VS separator
- Warrant ID and amount fields
- Debtor address display
- Document Type Normalization - Automatic display name corrections
- "TRANSCRIPT OF JUDGMENT" → "JUDGMENT"
- "CERTIFICATE OF DEATH" → "DEATH CERTIFICATE"
- Template Mapping Improvements - Better routing for document types
- Support for underscore formats (TAX_WARRANT, STATE_TAX_WARRANT)
- LIEN → judgment.j2, TAX WARRANT → tax_warrant.j2
- Mortgage Template Fixes - Smart legal description handling
- Detects "same premises" variations to avoid duplicate content
- Patterns: "covers same premises", "being the same premises", "same premises as"
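The normalization, template routing, and "same premises" detection above can be sketched as below. The lookup tables contain only the cases named in this section; routing STATE_TAX_WARRANT to tax_warrant.j2 and the `generic.j2` fallback are assumptions, and the function names are hypothetical.

```python
import re

# Display-name corrections named above.
DISPLAY_NAME_FIXES = {
    "TRANSCRIPT OF JUDGMENT": "JUDGMENT",
    "CERTIFICATE OF DEATH": "DEATH CERTIFICATE",
}

# Template routing; the STATE TAX WARRANT entry is an assumption.
TEMPLATE_FOR_TYPE = {
    "LIEN": "judgment.j2",
    "TAX WARRANT": "tax_warrant.j2",
    "STATE TAX WARRANT": "tax_warrant.j2",
}

SAME_PREMISES = re.compile(
    r"\b(covers same premises|being the same premises|same premises as)\b",
    re.IGNORECASE,
)


def normalize_type(raw: str) -> str:
    """Uppercase, convert underscore formats to spaces, then apply display fixes."""
    key = raw.strip().upper().replace("_", " ")
    return DISPLAY_NAME_FIXES.get(key, key)


def template_for(raw: str, default: str = "generic.j2") -> str:
    """Pick a template file; 'generic.j2' is a hypothetical fallback name."""
    return TEMPLATE_FOR_TYPE.get(normalize_type(raw), default)


def refers_to_same_premises(text: str) -> bool:
    """True when the text already points back to an earlier legal description."""
    return bool(SAME_PREMISES.search(text))
```

A mortgage template could call `refers_to_same_premises` before rendering the full legal description, which is one way to avoid the duplicate content mentioned above.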
API Endpoints:
GET /api/v1/snippets # List all snippets
POST /api/v1/snippets # Create snippet
GET /api/v1/snippets/{id} # Get snippet
PUT /api/v1/snippets/{id} # Update snippet
DELETE /api/v1/snippets/{id} # Delete snippet
POST /api/v1/snippets/seed # Seed default snippets
WebSocket Real-Time Status Updates:
- ✅ WebSocket Integration - Real-time job status updates via WebSocket connections
- ✅ Redis Pub/Sub - Cross-process messaging between Celery workers and FastAPI server
- ✅ Eliminated HTTP Polling Spam - No more hundreds of GET requests during processing
- ✅ Live Progress Updates - Real-time display of current step and progress percentage
- ✅ Graceful Fallback - Automatic fallback to HTTP polling if WebSocket fails
Technical Implementation:
- WebSocket endpoint: ws://localhost:8000/api/v1/ws/jobs/{job_id}
- Redis channels for job-specific updates: job_updates:{job_id}
- ConnectionManager subscribes to Redis and broadcasts to connected clients
- Celery workers publish status updates to Redis instead of direct WebSocket calls
- Scalable architecture supporting multiple worker processes
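The broadcast side of this design might look roughly like the sketch below. It is not the project's actual ConnectionManager: the Redis subscription itself is left out, and the method names `connect`, `disconnect`, and `handle_redis_message` are assumptions. The only thing required of a client object is an async `send_text` method, which FastAPI's WebSocket provides.

```python
import asyncio
import json
from collections import defaultdict
from typing import Any, Protocol


class SupportsSendText(Protocol):
    async def send_text(self, data: str) -> None: ...


def channel_for(job_id: str) -> str:
    """Redis channel name for one job, following the job_updates:{job_id} scheme."""
    return f"job_updates:{job_id}"


class ConnectionManager:
    """Tracks WebSocket clients per job and fans out messages received from Redis."""

    def __init__(self) -> None:
        self._clients = defaultdict(set)  # job_id -> set of client sockets

    def connect(self, job_id: str, ws: SupportsSendText) -> None:
        self._clients[job_id].add(ws)

    def disconnect(self, job_id: str, ws: SupportsSendText) -> None:
        self._clients[job_id].discard(ws)
        if not self._clients[job_id]:
            del self._clients[job_id]

    async def handle_redis_message(self, channel: str, payload: dict[str, Any]) -> int:
        """Broadcast one pub/sub message to every client watching that job."""
        job_id = channel.split(":", 1)[1]
        text = json.dumps(payload)
        clients = list(self._clients.get(job_id, ()))
        if clients:
            await asyncio.gather(*(ws.send_text(text) for ws in clients))
        return len(clients)
```

Because workers only publish to Redis, any number of worker processes can report progress without holding a reference to the WebSocket connections, which live in the FastAPI process.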
Performance & Code Quality Improvements:
Completed a comprehensive 3-4 week optimization plan. See docs/backend/OPTIMIZATION_PROGRESS.md for full details.
Key Achievements:
- ✅ 50-70% Memory Reduction - Fixed PIL image memory leaks in OCR processing
- ✅ 10x Template Rendering Speed - Implemented LRU cache with TTL for compiled templates
- ✅ 60-80% Response Size Reduction - Added GZip compression middleware
- ✅ Service Layer Architecture - Clean separation of business logic from API routes
- ✅ DRY Code Improvements - Extracted shared utilities for date and recording reference parsing
- ✅ 75-Test Suite - Comprehensive pytest coverage (93-100% on utilities, 60-83% on services)
- ✅ Frontend Performance - React.memo(), useMemo(), useCallback() to eliminate unnecessary re-renders
- ✅ Code Organization - Moved utility scripts to /backend/scripts directory
- ✅ Caching System - Template cache management API with clear/stats endpoints
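For readers unfamiliar with the technique, an LRU cache with TTL for compiled templates can be sketched as follows. This is a generic illustration, not the project's implementation; the class name `TTLLRUCache`, the default sizes, and the `stats` fields are assumptions.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Least-recently-used cache whose entries also expire after ttl_seconds."""

    def __init__(self, max_size=128, ttl_seconds=300.0, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None:
            expires_at, value = entry
            if self._clock() < expires_at:
                self._entries.move_to_end(key)  # mark as most recently used
                self.hits += 1
                return value
            del self._entries[key]  # expired
        self.misses += 1
        return None

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used

    def clear(self):
        self._entries.clear()

    def stats(self):
        return {"size": len(self._entries), "hits": self.hits, "misses": self.misses}
```

A renderer would call `get(template_name)` first and compile and `put` the template only on a miss; the `clear` and `stats` methods correspond to the kind of operations a cache management API would expose.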
Test Coverage:
cd backend
pytest # 75 tests, <3 seconds
pytest --cov=app --cov-report=html # With coverage report
PDF Storage System & Chain Analysis Enhancements:
- ✅ Individual Document PDF Storage - Automatically extract and save each document as separate PDF
- Single PDFs: Save original + individual document PDFs
- Merged PDFs: Create master merged PDF + individual document PDFs
- New utility functions: merge_pdf_files() and extract_pages_to_file()
- Document schema extended with document_pdf_path field
- ✅ Comprehensive PDF Deletion - Delete ALL associated files when abstract is deleted
- Removes main PDF, source PDFs, and all individual document PDFs
- Updated cleanup script to match delete endpoint behavior
- No orphaned files on server
- ✅ Document PDF Endpoint - Serve individual document PDFs with fallback extraction
- GET /api/v1/abstracts/{id}/documents/{index}/pdf
- Backward compatible with old abstracts
- ✅ Chain Analysis Positioning Fixes - Fixed overlapping document cards
- Increased child chain offset to 350px with 300px minimum clearance
- Collision detection with occupied position tracking
- Cards are 288px wide (w-72) - spacing now accommodates dimensions
- ✅ Chain Issues UI Improvements - Better layout and full-width text display
- Moved issues section below visualization
- Issue messages span full width using Alert component's flex structure
v1.7 Updates (November 6, 2025):
Abstract Metadata Fields & Markdown Formatting:
- ✅ Three New Abstract Metadata Fields - Control abstract-level settings
- date_from - Date-only field (displayed as M/D/YYYY, e.g., "4/8/2020")
- effective_date - Date and time field (displayed in 12-hour format, e.g., "4/8/2020, 3:30 PM")
- starting_setout_value - Starting number for document numbering/setouts (default: 1)
- ✅ Document Numbering with Offset - Support for continuing numbering from previous abstracts
- Display number = starting_setout_value + document_index
- Cross-references automatically use display numbers
- Example: starting_setout_value=20 → documents numbered 20, 21, 22...
- ✅ Markdown Format Simplification - Removed document numbers and headers from markdown output
- Before: ## 1. MORTGAGE → After: MORTGAGE
- Document numbers only appear on document cards in UI
- All 13 backend templates and 8 frontend render functions updated
- ✅ Enhanced Metadata UI - Upload form and detail screen now include editable metadata fields
- ✅ Backward Compatibility - Automatic conversion of old date formats with Pydantic field validator
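The numbering offset and the two display formats from this release can be illustrated with a few small helpers. The function names are hypothetical, and `document_index` is taken to be zero-based, which is what the "20, 21, 22" example implies.

```python
from datetime import date, datetime


def display_number(starting_setout_value: int, document_index: int) -> int:
    """Display number = starting_setout_value + document_index (zero-based index)."""
    return starting_setout_value + document_index


def format_date_from(d: date) -> str:
    """Date-only display without zero padding, e.g. 4/8/2020."""
    return f"{d.month}/{d.day}/{d.year}"


def format_effective_date(dt: datetime) -> str:
    """Date plus 12-hour time, e.g. 4/8/2020, 3:30 PM."""
    hour12 = dt.hour % 12 or 12  # 0 and 12 both display as 12
    suffix = "AM" if dt.hour < 12 else "PM"
    return f"{dt.month}/{dt.day}/{dt.year}, {hour12}:{dt.minute:02d} {suffix}"
```

With `starting_setout_value=20`, the first three documents get `display_number(20, 0)`, `display_number(20, 1)`, and `display_number(20, 2)`, and cross-references would use those same numbers.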
v1.6 Updates (November 4, 2025):
Document Editing Interface Overhaul:
- ✅ Type-Specific Field Display - Show only relevant fields for each document type (deed, mortgage, estate, UCC, etc.)
- ✅ Template-Driven Field Ordering - Fields arranged to match template output (top to bottom)
- ✅ Cross-Reference Display System - Read-only display of legal description comparisons and mortgage cross-references
- ✅ Enhanced TypeScript Types - Complete type definitions for all document types and cross-references
- ✅ Schema Cleanup - Removed unused fields, added all missing fields for 8 document types
- ✅ Object Handling - Robust handling of complex fields (e.g., principal_amount as object)
- ✅ UI/UX Improvements - Removed section headers, improved labels, eliminated redundant fields
v1.5 Updates (November 3, 2025):
- ✅ Multi-Chain Support - System now handles multiple independent property chains with separate visualization
- ✅ Enhanced Legal Description Matching - 4-pass lot number extraction strategy with tuple-based comparison
- Pass 1: Standard patterns ("Lot 37", "Lot No. 37", "Lot #37")
- Pass 2: Standalone "#37" patterns
- Pass 3: Multiple lot patterns ("and 37", ", 37")
- Pass 4: Parenthesized numbers ("Lot Number Thirty-seven (37)")
- ✅ Synonym Normalization - tract/plot/piece → parcel for better matching accuracy
- ✅ Tuple-Based Lot Comparison - Prevents false positives (e.g., "Lots 3,4" vs "Lot 4" = different)
- ✅ Compact Chain Layout - Reduced horizontal spacing to 400px, vertical spacing to 250px for unconnected docs
- ✅ Branch Point Detection - Child chains positioned at same Y level as parent branch point
- ✅ Unconnected Document Positioning - Documents without connections positioned at x=-400 to prevent overlap
- ✅ Party Matching Improvements - Enhanced survivorship handling and name normalization
- ✅ Book/Page Sorting - Same-date documents sorted by book and page numbers
- ✅ Document Count Storage - Added document_count field to abstract model
- ✅ Pass 3 Timing Display - Fixed timing display issues in processing logs
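The multi-pass lot extraction and tuple comparison listed above could be approximated as follows. The regular expressions are illustrative guesses at the four passes rather than the project's real patterns, and this sketch would also pick up unrelated numbers in text such as "Lot 37, 100 feet wide", so treat it as a starting point only.

```python
import re

# Pass 1: "Lot 37", "Lot No. 37", "Lot #37"
PASS_1 = re.compile(r"\blots?\s*(?:no\.?|number|#)?\s*(\d+)", re.IGNORECASE)
# Pass 2: standalone "#37"
PASS_2 = re.compile(r"(?<!\w)#\s*(\d+)")
# Pass 3: continuation numbers right after a Pass 1 match (", 37", "and 37")
PASS_3 = re.compile(r"\s*(?:,\s*and|,|and|&)\s*(\d+)", re.IGNORECASE)
# Pass 4: spelled-out lot with the number in parentheses
PASS_4 = re.compile(r"\blots?\b[^()\d]{0,40}\((\d+)\)", re.IGNORECASE)

SYNONYMS = re.compile(r"\b(tract|plot|piece)\b", re.IGNORECASE)


def normalize_synonyms(text: str) -> str:
    """Map tract/plot/piece to 'parcel' before comparing descriptions."""
    return SYNONYMS.sub("parcel", text)


def extract_lots(text: str) -> tuple[int, ...]:
    """Collect lot numbers from all passes as a sorted tuple."""
    found: set[int] = set()
    for match in PASS_1.finditer(text):
        found.add(int(match.group(1)))
        pos = match.end()
        # Follow ", 4" / "and 5" continuations directly after the first number.
        while (more := PASS_3.match(text, pos)):
            found.add(int(more.group(1)))
            pos = more.end()
    found.update(int(n) for n in PASS_2.findall(text))
    found.update(int(n) for n in PASS_4.findall(text))
    return tuple(sorted(found))


def same_lots(a: str, b: str) -> bool:
    """Descriptions match only when their full lot tuples are equal and non-empty."""
    lots_a, lots_b = extract_lots(a), extract_lots(b)
    return bool(lots_a) and lots_a == lots_b
```

Comparing whole tuples rather than checking for any shared number is what keeps "Lots 3,4" and "Lot 4" from being treated as the same parcel.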
v1.4 Updates (October 29, 2025):
- ✅ Database Performance Optimization - 300-1000x speedup on analytics (10-30s → 0.03s), 10-15x on feedback analysis
- ✅ Per-Document Markdown Editor - Edit markdown for individual documents with template regeneration
- ✅ Jinja2 Template System - 10 professional templates for different document types (deed, mortgage, lis pendens, affidavit, UCC, etc.)
- ✅ Quality Metrics Dashboard - Comprehensive uncertain fields tracking with 17 reason codes across 5 categories
- ✅ Analytics Quality Endpoint - Real-time quality analytics with severity breakdown and 30-day trends
- ✅ Enhanced Feedback Analysis - AI analysis now includes relevant extraction prompt excerpts for better suggestions
- ✅ Export Individual Documents - Export single documents or all documents as separate files in ZIP archive
- ✅ Database Indexing - Compound indexes for optimal query performance
v1.2 Updates (October 27, 2025):
- ✅ Dual LLM Support - Choose between Gemini 2.5 Pro and Claude Sonnet 4.5 for document processing
- ✅ LLM Tracking & Display - Track and display which AI model processed each abstract
- ✅ Delete Functionality - Delete abstracts with confirmation dialog (removes abstract, documents, and PDF file)
- ✅ Eastern Time Timestamps - Processing logs now display in 12-hour Eastern Time format with seconds
- ✅ Performance Timing System - Detailed step-by-step timing logs in Celery worker terminal
v1.1 Updates (October 26, 2025):
- ✅ OCR Quality Scoring System - 6-metric quality assessment with automatic Vision API fallback
- ✅ Google Cloud Vision API Integration - Handwriting support and quality-based model selection
- ✅ Feedback System - Complete user feedback loop with AI-powered analysis and dashboard
- ✅ Gemini Model Update - Upgraded to gemini-2.5-pro (Gemini 2.5.0)
- ✅ Home Page Redesign - Improved table layout with better information density
- ✅ Search Functionality - Global search across all abstracts
See docs/RECENT_UPDATES.md for complete details on all changes.
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Next.js │────▶│ FastAPI │────▶│ MongoDB │
│ Frontend │◀────│ Backend │ │ Database │
└─────────────┘ WS └──────────────┘ └─────────────┘
│
▼
┌──────────────┐
│ Redis │
│ Job Queue & │
│ Pub/Sub │
└──────────────┘
│
▼
┌──────────────┐
│ Celery │
│ Workers │
└──────────────┘
Real-Time Updates Flow:
- Client connects to WebSocket: ws://localhost:8000/api/v1/ws/jobs/{job_id}
- FastAPI subscribes to Redis channel: job_updates:{job_id}
- Celery worker publishes status updates to Redis
- FastAPI receives from Redis and broadcasts to WebSocket clients
- Frontend receives live updates without polling
- FastAPI - Modern async Python web framework
- MongoDB - Document database (Beanie ODM)
- Redis - Message broker for the job queue and pub/sub channel for real-time status updates
- Celery - Distributed task queue for async processing
- Dual LLM Support:
- Google Gemini 2.5 Pro - AI extraction (gemini-2.5-pro)
- Anthropic Claude Sonnet 4.5 - AI extraction (claude-sonnet-4-5-20250929)
- Tesseract OCR - Primary OCR engine with quality scoring
- Google Cloud Vision API - Handwriting OCR and fallback processing
- Next.js 16 - React framework with App Router
- TypeScript - Type-safe development
- Tailwind CSS - Utility-first styling
- shadcn/ui - React component library
- react-pdf - PDF viewing
- Docker Compose - Local development
- Kubernetes/AKS - Azure cloud deployment with auto-scaling
- Nginx - Reverse proxy (production)
title-abstractor-enterprise/
├── backend/
│ ├── app/
│ │ ├── api/v1/ # API routes (to be built)
│ │ ├── core/ # Business logic (copied from current app)
│ │ │ ├── abstractor.py
│ │ │ ├── gemini_client.py
│ │ │ ├── chain_analyzer.py
│ │ │ └── prompts/
│ │ ├── models/ # MongoDB models ✅
│ │ │ ├── abstract.py
│ │ │ └── job.py
│ │ ├── schemas/ # Pydantic schemas (to be built)
│ │ ├── workers/ # Celery tasks (to be built)
│ │ └── main.py # FastAPI app ✅
│ └── requirements.txt # Dependencies ✅
├── frontend/ # Next.js app (to be built)
├── docker-compose.yml # Local dev setup (to be built)
└── .env.example # Environment variables ✅
- Directory structure - Full project scaffolding
- Core business logic - Abstractor, Gemini client, Chain analyzer, OCR system
- Backend configuration - Pydantic settings with env management
- MongoDB models - Abstract, Job, Settings, and Feedback models with Beanie ODM
- FastAPI app - App with health check, MongoDB connection, CORS middleware
- API routes - Complete REST API for abstracts, jobs, settings, prompts, and feedback
- Celery workers - Background PDF processing with real-time job tracking
- Pydantic schemas - Request/response validation for all endpoints
- Docker Compose - Full local development environment
- OCR System - Tesseract + Google Vision API with quality scoring
- Feedback System - CRUD + AI analysis endpoints
- Next.js 16 app - Complete React frontend with Turbopack
- Upload UI - Single file and bulk upload modes
- Real-time job polling - Status updates during processing
- Document viewer - PDF viewer with citations and highlighting
- Document editing - Full inline editing with markdown export
- Settings UI - Time estimation and prompt management
- Feedback UI - Per-document feedback with AI analysis dashboard
- Search - Global search across all abstracts
- Responsive design - Mobile-friendly interface
- ✅ WebSocket/SSE - Real-time updates via WebSocket + Redis pub/sub (COMPLETED v2.2)
- ✅ Tests - Unit and integration test suites (COMPLETED v2.0 - 75 tests)
- Production deployment - Nginx reverse proxy, production Docker images
- ✅ Batch operations - Bulk delete, bulk export (COMPLETED v2.0)
See QUICK_START.md for detailed setup instructions.
- Docker & Docker Compose (recommended) OR
- Python 3.11+, MongoDB, Redis (for manual setup)
- API Keys (at least one required):
- Google Gemini API key (for Gemini 2.5 Pro)
- Anthropic API key (for Claude Sonnet 4.5)
# 1. Copy environment file
cp .env.example .env
# 2. Edit .env and add your API keys
# Required: GOOGLE_API_KEY (for Gemini) and/or ANTHROPIC_API_KEY (for Claude)
nano .env
# 3. Start all services (MongoDB, Redis, Backend, Celery)
docker-compose up -d
# 4. View logs
docker-compose logs -f backend
# 5. Access API docs
open http://localhost:8000/docs
# 1. Start MongoDB and Redis
docker run -d -p 27017:27017 --name mongodb mongo:7
docker run -d -p 6379:6379 --name redis redis:7-alpine
# 2. Backend setup
cd backend
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp ../.env.example ../.env
# Edit .env and add GOOGLE_API_KEY and/or ANTHROPIC_API_KEY
uvicorn app.main:app --reload --port 8000
# 3. Start Celery worker (in new terminal)
cd backend
source venv/bin/activate
celery -A app.workers.celery_app worker --loglevel=info
Visit http://localhost:8000/docs for interactive API documentation.
- Create API route: POST /api/v1/abstracts/upload
- Create API route: GET /api/v1/jobs/{job_id}
- Create API route: GET /api/v1/abstracts
- Create API route: GET /api/v1/abstracts/{id}
- Create API route: GET /api/v1/abstracts/{id}/pdf (download)
- Create API route: GET /api/v1/abstracts/{id}/export (markdown)
- Create API route: DELETE /api/v1/abstracts/{id}
- Create Pydantic schemas for requests/responses
- Set up Celery worker configuration
- Implement PDF processing Celery task
- Docker Compose for local development
- Initialize Next.js project
- Create upload page with drag & drop
- Create abstracts list page
- Create document viewer with PDF side-by-side
- Implement job progress polling
- Add chain visualization components
- Docker Compose for full stack
- File upload to storage (local/S3)
- Error handling and validation
- Export endpoints (markdown/JSON)
- WebSocket for real-time updates (v2.2)
- Unit & integration tests (75+ tests, see backend/tests/)
- Production Docker images (Docker Compose + Kubernetes/AKS ready)
- Nginx reverse proxy (optional - currently using direct FastAPI)
- Environment-based configuration (.env files + Pydantic Settings)
- Documentation & API specs (13 docs files, /docs endpoint)
POST /api/v1/abstracts/upload # Upload PDF, return job_id
GET /api/v1/jobs/{job_id} # Check job status
GET /api/v1/abstracts # List all abstracts
GET /api/v1/abstracts/{id} # Get abstract details
PUT /api/v1/abstracts/{id}/documents/{doc_id} # Edit document
GET /api/v1/abstracts/{id}/chain # Get chain analysis
POST /api/v1/abstracts/{id}/chat # Chatbot Q&A
GET /api/v1/abstracts/{id}/export # Download markdown/JSON
DELETE /api/v1/abstracts/{id} # Delete abstract
All core features from the Streamlit prototype are now available:
- ✅ Dual LLM Support - Choose between Gemini 2.5 Pro and Claude Sonnet 4.5
- ✅ 3-pass PDF extraction - Pass 1: Inventory, Pass 2: Details, Pass 3: Chain
- ✅ OCR System - Tesseract + Google Vision API with quality scoring
- ✅ Document editing interface - Full inline editing with markdown export
- ✅ Document validation - Deduplication and data validation
- ✅ Chain of title analysis - Automated relationship detection
- ✅ Chain visualization - Interactive React Flow diagrams
- ✅ Legal description comparison - Automated matching and analysis
- ✅ Feedback system - User feedback with AI analysis
- ✅ Delete functionality - Remove abstracts with confirmation dialog
- ✅ Export to markdown/JSON - Multiple export formats
- ✅ Search - Global search across all abstracts
- Current Streamlit app remains untouched - This is a completely separate project
- Reuses battle-tested logic - All extraction and analysis code from current app
- Production-ready architecture - Async processing, proper database, scalable workers
- Modern frontend - React with TypeScript for maintainability
- No user authentication yet (only the admin pages are password-protected) - Can be added later with JWT tokens (backend already has passlib/jose)
All core functionality is complete! Future enhancements:
- Testing - Unit and integration test suites ✅ DONE (75 tests)
- WebSocket/SSE - Replace polling with real-time updates ✅ DONE (v2.2)
- Production deployment - ✅ Kubernetes/AKS complete, optional: Add nginx reverse proxy
- Batch operations - ✅ DONE (delete script + export ZIP endpoint)
All documentation is centralized in the /docs directory:
Getting Started:
- QUICK_START.md - Quick start guide
- docs/DOCUMENTATION.md - Complete user & technical documentation
Deployment:
- DOCKER_USAGE.md - Docker Compose setup and local development
- KUBERNETES_DEPLOYMENT.md - Azure AKS deployment guide
Technical Documentation:
- docs/TECHNICAL_DOCUMENTATION.md - Architecture deep dive
- docs/TECHNICAL_DOCUMENTATION_PART2.md - Algorithms & OCR
- docs/SCHEMA_REFERENCE.md - Complete data schema
Backend:
- backend/README.md - Backend quick start
- docs/backend/OPTIMIZATION_PROGRESS.md - v2.0 optimizations
- backend/tests/README.md - Testing guide
- backend/scripts/README.md - Utility scripts
Updates & Reference:
- docs/RECENT_UPDATES.md - Detailed changelog
- docs/DOCUMENTATION_INDEX.md - Complete documentation index
Same as original Title Abstractor project.