snowsecure/Automated-Stewart-Search

Title Abstractor Enterprise

Enterprise-ready title abstraction system built with FastAPI, Next.js, MongoDB, and Celery. This is a complete rebuild of the Streamlit prototype with production-grade architecture.

🆕 Recent Updates

v3.0.36 - January 14, 2026 (Public Records Search Tab)

NY State Warrant Search:

  • Tax Warrant Search - Search NY Open Data API for state tax warrants
    • Searches all names extracted from abstract documents
    • Real-time results with warrant ID, debtor info, amount, filed date
    • Support for multiple debtor names (joint filers)
    • PDF links to original warrant documents
  • Child Support Warrant Search - Search for child support warrants
    • Same functionality as tax warrants with separate results tab

Selection & Persistence:

  • Selection Persistence - Warrant selections saved to database
    • Persist across page refreshes with debounced auto-save
    • "Selected for Abstract" panel shows pending items
  • Send to Abstract - Add warrants as documents to abstract
    • Individual send buttons per warrant type
    • "Send All to Abstract" for batch operations
    • Warrants rendered with WARRANT_DOS template

Results Management:

  • Merge Toggle - Combine all results into deduplicated list
  • Sort Dropdown - Sort by Date (Oldest/Newest), Amount, or Name
  • Sent to Abstract List - Shows all public records already in abstract
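
The merge toggle and sort dropdown can be pictured as two small pure functions over the tax and child support result lists. This is only a sketch: the `Warrant` fields, the use of `warrant_id` as the deduplication key, the sort-option names, and the descending order for Amount are all assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Warrant:
    # Hypothetical record shape, chosen for illustration only.
    warrant_id: str
    debtor_name: str
    amount: float
    filed_date: date


def merge_results(*result_lists):
    """Combine result lists, keeping the first record seen per warrant_id."""
    seen = {}
    for results in result_lists:
        for warrant in results:
            seen.setdefault(warrant.warrant_id, warrant)
    return list(seen.values())


# Sort option -> (key function, reverse flag). Option names are assumptions.
SORT_KEYS = {
    "date_oldest": (lambda w: w.filed_date, False),
    "date_newest": (lambda w: w.filed_date, True),
    "amount": (lambda w: w.amount, True),
    "name": (lambda w: w.debtor_name.lower(), False),
}


def sort_results(results, order="date_newest"):
    key, reverse = SORT_KEYS[order]
    return sorted(results, key=key, reverse=reverse)


tax = [
    Warrant("E-1", "Smith, John", 1200.0, date(2021, 3, 1)),
    Warrant("E-2", "Adams, Mary", 300.0, date(2019, 6, 15)),
]
child_support = [
    Warrant("E-2", "Adams, Mary", 300.0, date(2019, 6, 15)),  # duplicate
    Warrant("C-9", "Smith, John", 5000.0, date(2023, 1, 10)),
]
merged = merge_results(tax, child_support)
oldest_first = [w.warrant_id for w in sort_results(merged, "date_oldest")]
```

Keeping merge and sort separate means the same sort options apply whether or not the merge toggle is on.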

v3.0.35 - January 10, 2026 (Bankruptcy Court Improvements & Admin Controls)

Bankruptcy Court Auto-Selection:

  • Auto-Select Court from Property County - Automatically selects appropriate bankruptcy court based on property county extracted from legal descriptions
    • Regex patterns extract county from legal descriptions ("County of X", "in X County", etc.)
    • NY county-to-court mapping for all 62 counties across 4 federal districts
    • Priority: saved preference → search history → auto-detect → defaults
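
A minimal sketch of the auto-selection logic described above, assuming two simplified regex patterns and a four-county excerpt of the county-to-court mapping. The function names, the patterns, and the district labels used as court identifiers are illustrative, not the project's actual code.

```python
import re
from typing import List, Optional

# Case-sensitive on purpose: with IGNORECASE the name group would also
# swallow trailing words such as "and State of New York".
COUNTY_PATTERNS = [
    re.compile(r"\bCounty of ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    re.compile(r"\bin ([A-Z][a-z]+(?: [A-Z][a-z]+)*) County\b"),
]

# Illustrative excerpt; the full mapping would cover all 62 NY counties.
COUNTY_TO_COURT = {
    "Albany": "Northern District of New York",
    "New York": "Southern District of New York",
    "Kings": "Eastern District of New York",
    "Erie": "Western District of New York",
}


def extract_county(legal_description: str) -> Optional[str]:
    """Return the first county name found, or None."""
    for pattern in COUNTY_PATTERNS:
        match = pattern.search(legal_description)
        if match:
            return match.group(1)
    return None


def choose_courts(saved: List[str], history: List[str],
                  legal_description: str, defaults: List[str]) -> List[str]:
    """Priority: saved preference, then search history, then auto-detect, then defaults."""
    if saved:
        return saved
    if history:
        return history
    county = extract_county(legal_description)
    court = COUNTY_TO_COURT.get(county) if county else None
    return [court] if court else defaults
```

For example, `choose_courts([], [], "situate in the Town of Amherst, County of Erie and State of New York", defaults)` falls through to auto-detection and selects the Western District.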

Court Selection Persistence:

  • Remember Court Selection - User's court selection saved and restored on return visits
    • New bankruptcy_courts_selected field on Abstract model
    • New API endpoint to save selection with debounced updates

County Info Tooltips:

  • Hover to See Counties - Info icon next to each court shows covered counties
    • "i" icon on each court checkbox
    • Tooltip lists all counties for that court

Tab & UI Improvements:

  • Tab Reordering - Bankruptcy tab moved to second position (after Documents)
  • JSON Tab Admin-Only - Only visible to authenticated admins, moved to last position
  • Cost Metrics Admin-Only - Dollar amounts (API cost, cost saved) hidden for non-admins

v3.0.34 - January 8, 2026 (PDF-Markdown Bidirectional Text)

  • Markdown → PDF Highlighting - Select text in editor to highlight on PDF with match navigation
  • PDF → Clipboard Extraction - Shift+drag on PDF to OCR and copy text
  • Backend OCR caching for performance

v3.0.33 - January 5, 2026 (Pop-Out PDF Viewer)

  • Dual Monitor Support - Pop-out PDF in separate window with bi-directional sync
  • Keyboard Shortcut - Ctrl+Shift+P to toggle pop-out

v3.0.32 - January 5, 2026 (Edit Drawer Navigation)

  • Navigate Between Setouts - Prev/Next arrows and jump-to dropdown in edit drawer
  • Save and Next Button - Saves and auto-opens next setout
  • Reset to Original - Restore complete AI-extracted state from snapshot

v3.0.30-31 - January 5, 2026 (Legal Description Formatting)

  • Smart Sentence Case - Proper noun detection for cities, counties, streets, person names
  • Preserves ALL CAPS - Legal opening phrases kept uppercase
  • Enhanced Patterns - Company names, multi-word last names, county abbreviations

v3.0.29 - January 4, 2026 (Sort Order Toggle)

  • Recording Date vs Scanned Order - Per-abstract toggle for document ordering
  • Auto-regenerate References - Cross-references update when toggled

v3.0 - November 30, 2025 (AI-Powered Analytics & Admin Security Release)

Admin Dashboard Password Protection:

  • Password-Protected Admin Access - Secure authentication for admin dashboard and settings
    • Simple password protection using localStorage-based authentication
    • Password: GoBills! (configurable in AdminAuthContext)
    • Password prompt modal with Lock icon on first access
    • Persistent authentication with logout functionality
    • Protects: Admin Dashboard, AI Insights, AI Improvements, and Settings pages
    • AdminAuthContext provides authentication state management
    • AdminPasswordPrompt component wraps protected routes

AI Insights - Analytics Chat Assistant:

  • Conversational Analytics - Chat with AI about edit patterns and system analytics
    • Natural language queries about document processing patterns
    • Ask questions like "Why do dates get edited so often?" or "Which document types have highest error rates?"
    • AI-powered responses using edit tracking data and analytics
    • Example prompts: Compare LLM performance, analyze extraction errors, identify improvement opportunities
    • Chat history maintained during session
    • Access via /admin/ai-insights with Sparkles icon

AI Improvements - Automated Enhancement Workflow:

  • AI-Generated Improvements - System learns from edit patterns and suggests enhancements
    • Automatic detection of recurring edit patterns
    • Confidence scoring for each improvement suggestion
    • Evidence-based suggestions with frequency counts
    • Multi-stage approval workflow:
      1. Pending Review: Initial AI suggestions awaiting admin review
      2. A/B Testing: Run controlled experiments before full deployment
      3. Test Complete: Review test results with statistical significance
      4. Approved: Deploy improvements to production
      5. Rejected: Archive suggestions not suitable for implementation
      6. Rolled Back: Revert deployed changes if issues arise
    • A/B test configuration with variant split and minimum sample size
    • Statistical analysis with p-values and confidence levels
    • Rollback capability for deployed improvements
    • Access via /admin/improvements with TrendingUp icon
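
The six workflow stages above behave like a small state machine. The sketch below encodes one plausible set of transitions. The stage names come from the list, but the transition table itself is an assumption inferred from the stage descriptions and is not taken from the project's code.

```python
from enum import Enum


class Status(str, Enum):
    PENDING_REVIEW = "pending_review"
    AB_TESTING = "ab_testing"
    TEST_COMPLETE = "test_complete"
    APPROVED = "approved"
    REJECTED = "rejected"
    ROLLED_BACK = "rolled_back"


# Assumed transitions; the real workflow may allow others
# (for example, approving directly from review).
ALLOWED = {
    Status.PENDING_REVIEW: {Status.AB_TESTING, Status.REJECTED},
    Status.AB_TESTING: {Status.TEST_COMPLETE},
    Status.TEST_COMPLETE: {Status.APPROVED, Status.REJECTED},
    Status.APPROVED: {Status.ROLLED_BACK},
    Status.REJECTED: set(),
    Status.ROLLED_BACK: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Centralizing the table in one place lets each endpoint handler validate a requested status change with a single call.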

Edit Analytics System:

  • Comprehensive Edit Tracking - Track all user edits to improve AI extraction
    • Monitor edit patterns across document types and fields
    • Field-level edit frequency analysis
    • Edit type categorization (corrections, additions, formatting)
    • Integration with AI Improvements for automated suggestions
    • Analytics API endpoints for reporting
    • Access via /admin/edit-analytics with BarChart3 icon

Admin Authentication Components:

  • AdminAuthContext (/frontend/src/contexts/AdminAuthContext.tsx)

    • React Context for authentication state management
    • isAuthenticated state with localStorage persistence
    • login(password) - Validates password and sets auth state
    • logout() - Clears auth state and localStorage
    • useAdminAuth() hook for consuming auth context
  • AdminPasswordPrompt (/frontend/src/components/admin/AdminPasswordPrompt.tsx)

    • Modal password prompt with Lock icon
    • Wraps protected components/pages
    • Shows children only when authenticated
    • Loading state during hydration
    • Error handling for incorrect passwords

New API Endpoints:

POST   /api/v1/chat/analytics         # AI chat for analytics queries
GET    /api/v1/improvements/list      # List all AI improvements
GET    /api/v1/improvements/{id}      # Get improvement details
POST   /api/v1/improvements/{id}/approve-for-test    # Start A/B test
POST   /api/v1/improvements/{id}/approve             # Deploy improvement
POST   /api/v1/improvements/{id}/reject              # Reject improvement
POST   /api/v1/improvements/{id}/rollback            # Rollback deployment
GET    /api/v1/admin/edit-analytics   # Edit tracking analytics

Protected Routes:

All admin routes now require password authentication:

  • /admin - Admin Dashboard
  • /admin/ai-insights - AI Analytics Chat
  • /admin/improvements - AI Improvements Management
  • /admin/edit-analytics - Edit Analytics Dashboard
  • /settings - Application Settings

Documentation:

  • See ADMIN_DASHBOARD_GUIDE.md for admin dashboard usage
  • Default password: GoBills! (change in AdminAuthContext for production)

v2.6 - November 23, 2025 (Admin Dashboard Release)

Comprehensive Admin Dashboard:

  • Real-time Metrics Dashboard - Complete administrator interface with live analytics

    • Overview Tab - Hero metrics cards showing total documents, success rate, processing time, costs, and system health
    • Performance Tab - Processing volume charts, document type breakdowns with interactive bar/pie chart views
    • Costs Tab - LLM cost analytics by provider (Gemini, Claude, Azure), cost trends, and ROI calculations
    • Quality Tab - Uncertain fields tracking, severity breakdown, and abstracts needing review
    • System Tab - MongoDB, Redis, Celery worker health monitoring, and error logs
  • Time Series Data & Charts - Historical analytics with Recharts visualizations

    • Processing volume line/bar charts with time series data
    • Cost breakdown pie charts by LLM provider
    • Document types distribution (33+ document types tracked)
    • Interactive tooltips with white background for readability
  • Backend Analytics Tasks - Automated metrics collection via Celery Beat

    • Hourly processing metrics aggregation (aggregate_hourly_metrics)
    • 5-minute system health checks (collect_system_metrics)
    • Historical data backfill script for populating metrics
    • Metrics retention and cleanup tasks
  • Admin API Endpoints:

    GET /api/v1/admin/overview              # Hero metrics and trends
    GET /api/v1/admin/metrics/processing    # Processing performance metrics
    GET /api/v1/admin/metrics/cost          # Cost analytics by provider
    GET /api/v1/admin/system/health         # System health monitoring
    GET /api/v1/admin/documents             # Paginated document list
    GET /api/v1/admin/errors                # Error log viewer
    GET /api/v1/analytics/quality           # Quality metrics and uncertain fields
    GET /api/v1/analytics/document-types    # Document type analytics
    
  • Dashboard Features:

    • Period selector (24h, 7d, 30d, 90d, all time)
    • Manual refresh button for on-demand updates
    • Export dashboard data to JSON/CSV
    • Home button navigation to main app
    • Responsive design with tabbed interface
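
For the Celery Beat tasks listed under Backend Analytics Tasks, a schedule might look roughly like the dictionary below. The task function names come from this README; the dotted module path `app.workers.analytics` and the entry keys are hypothetical.

```python
# A plain dict in the shape Celery's `beat_schedule` setting accepts:
# each entry names a task and a schedule (a number is read as seconds).
BEAT_SCHEDULE = {
    "aggregate-hourly-metrics": {
        "task": "app.workers.analytics.aggregate_hourly_metrics",  # hypothetical path
        "schedule": 3600.0,  # once per hour
    },
    "collect-system-metrics": {
        "task": "app.workers.analytics.collect_system_metrics",  # hypothetical path
        "schedule": 300.0,  # every 5 minutes
    },
}

# With a Celery app instance in scope this would be applied as:
#   celery_app.conf.beat_schedule = BEAT_SCHEDULE
```

A numeric interval runs relative to when Beat starts. If the hourly aggregation should fire on the hour, `celery.schedules.crontab(minute=0)` is the usual alternative.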

v2.3 - November 18, 2025 (Snippets & Template Enhancements)

Text Snippet Auto-Expansion System:

  • Snippet Management - Create reusable text shortcuts that expand to full phrases

    • Define shortcuts like mtg, sam, cov that expand to standard legal phrases
    • Variable placeholders: {grantee}, {grantor}, {amount}, {dated}, {recording}, {date}, {___}
    • Category organization (Legal Description, Document Type, Cross-Reference, Recording)
    • Settings page for managing custom snippets
    • Default snippets seeded on first run
  • Built-in Default Snippets:

    • sam → "Being the same premises conveyed to {grantee}"
    • mtg → "MORTGAGE dated {dated}, made by {grantor} to {grantee}, in the principal sum of {amount}"
    • cov → "Covers same premises shown at No. {___} above"
    • rec → "Recorded {recording}"
    • lisp → "NOTICE OF PENDENCY filed {dated} by {plaintiff} against {defendants}"
    • nop → "Object of action: to foreclose Mortgage No. {___}. For further proceedings please see the docket maintained in the County Clerk's Office or shown on the New York State Unified Court System."
    • And more for deeds, assignments, satisfactions, agreements
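
Snippet expansion amounts to a lookup followed by placeholder substitution. The sketch below uses four of the default snippets listed above. The `expand` function and its behavior of leaving unfilled placeholders in place are assumptions for illustration.

```python
import re

# Subset of the default snippets listed above.
SNIPPETS = {
    "sam": "Being the same premises conveyed to {grantee}",
    "mtg": ("MORTGAGE dated {dated}, made by {grantor} to {grantee}, "
            "in the principal sum of {amount}"),
    "cov": "Covers same premises shown at No. {___} above",
    "rec": "Recorded {recording}",
}

# \w matches letters, digits, and underscores, so {___} is also a placeholder.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def expand(shortcut: str, **values) -> str:
    """Expand a shortcut; unknown shortcuts and unfilled placeholders pass through."""
    template = SNIPPETS.get(shortcut)
    if template is None:
        return shortcut
    return PLACEHOLDER.sub(
        lambda m: str(values.get(m.group(1), m.group(0))), template
    )


recorded = expand("rec", recording="in Liber 100 of Deeds, page 25")
blank_reference = expand("cov")
```

Leaving `{___}` untouched when no value is supplied keeps the blank visible so the user can fill it in by hand.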

Template System Enhancements:

  • Tax Warrant Template - New dedicated template for state/county tax warrants
    • Proper creditor/debtor formatting with VS separator
    • Warrant ID and amount fields
    • Debtor address display
  • Document Type Normalization - Automatic display name corrections
    • "TRANSCRIPT OF JUDGMENT" → "JUDGMENT"
    • "CERTIFICATE OF DEATH" → "DEATH CERTIFICATE"
  • Template Mapping Improvements - Better routing for document types
    • Support for underscore formats (TAX_WARRANT, STATE_TAX_WARRANT)
    • LIEN → judgment.j2, TAX WARRANT → tax_warrant.j2
  • Mortgage Template Fixes - Smart legal description handling
    • Detects "same premises" variations to avoid duplicate content
    • Patterns: "covers same premises", "being the same premises", "same premises as"

API Endpoints:

GET    /api/v1/snippets              # List all snippets
POST   /api/v1/snippets              # Create snippet
GET    /api/v1/snippets/{id}         # Get snippet
PUT    /api/v1/snippets/{id}         # Update snippet
DELETE /api/v1/snippets/{id}         # Delete snippet
POST   /api/v1/snippets/seed         # Seed default snippets

v2.2 - November 13, 2025 (Real-Time Updates Release)

WebSocket Real-Time Status Updates:

  • WebSocket Integration - Real-time job status updates via WebSocket connections
  • Redis Pub/Sub - Cross-process messaging between Celery workers and FastAPI server
  • Eliminated HTTP Polling Spam - No more hundreds of GET requests during processing
  • Live Progress Updates - Real-time display of current step and progress percentage
  • Graceful Fallback - Automatic fallback to HTTP polling if WebSocket fails

Technical Implementation:

  • WebSocket endpoint: ws://localhost:8000/api/v1/ws/jobs/{job_id}
  • Redis channels for job-specific updates: job_updates:{job_id}
  • ConnectionManager subscribes to Redis and broadcasts to connected clients
  • Celery workers publish status updates to Redis instead of direct WebSocket calls
  • Scalable architecture supporting multiple worker processes

v2.0 - November 12, 2025 (Optimization Release)

Performance & Code Quality Improvements:

Completed a comprehensive 3-4 week optimization plan. See docs/backend/OPTIMIZATION_PROGRESS.md for full details.

Key Achievements:

  • 50-70% Memory Reduction - Fixed PIL image memory leaks in OCR processing
  • 10x Template Rendering Speed - Implemented LRU cache with TTL for compiled templates
  • 60-80% Response Size Reduction - Added GZip compression middleware
  • Service Layer Architecture - Clean separation of business logic from API routes
  • DRY Code Improvements - Extracted shared utilities for date and recording reference parsing
  • 75-Test Suite - Comprehensive pytest coverage (93-100% on utilities, 60-83% on services)
  • Frontend Performance - React.memo(), useMemo(), useCallback() to eliminate unnecessary re-renders
  • Code Organization - Moved utility scripts to /backend/scripts directory
  • Caching System - Template cache management API with clear/stats endpoints
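
To illustrate the "LRU cache with TTL" idea behind the template speedup, here is a minimal standard-library sketch. The class name, defaults, and injectable clock are assumptions. It shows the technique in general and is not the project's cache implementation.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Least-recently-used cache whose entries also expire after a TTL."""

    def __init__(self, max_size=128, ttl_seconds=300.0, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (value, expires_at)
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            self.misses += 1
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop and report a miss
            self.misses += 1
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return value

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used

    def clear(self):
        self._entries.clear()

    def stats(self):
        return {"size": len(self._entries), "hits": self.hits, "misses": self.misses}
```

The `clear` and `stats` methods mirror the clear/stats operations mentioned for the cache management API. Passing a fake `clock` makes expiry easy to test without sleeping.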

Test Coverage:

cd backend
pytest                              # 75 tests, <3 seconds
pytest --cov=app --cov-report=html  # With coverage report

v1.8 - November 9, 2025

PDF Storage System & Chain Analysis Enhancements:

  • Individual Document PDF Storage - Automatically extract and save each document as separate PDF
    • Single PDFs: Save original + individual document PDFs
    • Merged PDFs: Create master merged PDF + individual document PDFs
    • New utility functions: merge_pdf_files() and extract_pages_to_file()
    • Document schema extended with document_pdf_path field
  • Comprehensive PDF Deletion - Delete ALL associated files when abstract is deleted
    • Removes main PDF, source PDFs, and all individual document PDFs
    • Updated cleanup script to match delete endpoint behavior
    • No orphaned files on server
  • Document PDF Endpoint - Serve individual document PDFs with fallback extraction
    • GET /api/v1/abstracts/{id}/documents/{index}/pdf
    • Backward compatible with old abstracts
  • Chain Analysis Positioning Fixes - Fixed overlapping document cards
    • Increased child chain offset to 350px with 300px minimum clearance
    • Collision detection with occupied position tracking
    • Cards are 288px wide (w-72) - spacing now accommodates dimensions
  • Chain Issues UI Improvements - Better layout and full-width text display
    • Moved issues section below visualization
    • Issue messages span full width using Alert component's flex structure

v1.7 Updates (November 6, 2025):

Abstract Metadata Fields & Markdown Formatting:

  • Three New Abstract Metadata Fields - Control abstract-level settings
    • date_from - Date-only field (displayed as M/D/YYYY, e.g., "4/8/2020")
    • effective_date - Date and time field (displayed with 12-hour format, e.g., "4/8/2020, 3:30 PM")
    • starting_setout_value - Starting number for document numbering/setouts (default: 1)
  • Document Numbering with Offset - Support for continuing numbering from previous abstracts
    • Display number = starting_setout_value + document_index
    • Cross-references automatically use display numbers
    • Example: starting_setout_value=20 → documents numbered 20, 21, 22...
  • Markdown Format Simplification - Removed document numbers and headers from markdown output
    • Before: ## 1. MORTGAGE → After: MORTGAGE
    • Document numbers only appear on document cards in UI
    • All 13 backend templates and 8 frontend render functions updated
  • Enhanced Metadata UI - Upload form and detail screen now include editable metadata fields
  • Backward Compatibility - Automatic conversion of old date formats with Pydantic field validator
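
The numbering rule and the two display formats above can be checked with a few lines of Python. The function names are hypothetical, and a zero-based `document_index` is inferred from the example in which a starting value of 20 yields 20, 21, 22.

```python
from datetime import date, datetime


def display_number(starting_setout_value: int, document_index: int) -> int:
    """Display number = starting_setout_value + zero-based document index."""
    return starting_setout_value + document_index


def format_date_from(value: date) -> str:
    """M/D/YYYY without leading zeros, e.g. 4/8/2020."""
    return f"{value.month}/{value.day}/{value.year}"


def format_effective_date(value: datetime) -> str:
    """Date plus 12-hour time, e.g. 4/8/2020, 3:30 PM."""
    hour12 = value.hour % 12 or 12
    suffix = "AM" if value.hour < 12 else "PM"
    return f"{format_date_from(value)}, {hour12}:{value.minute:02d} {suffix}"


numbers = [display_number(20, i) for i in range(3)]
```

The formatting is built with f-strings rather than `strftime` because the no-leading-zero directives are platform dependent.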

v1.6 Updates (November 4, 2025):

Document Editing Interface Overhaul:

  • Type-Specific Field Display - Show only relevant fields for each document type (deed, mortgage, estate, UCC, etc.)
  • Template-Driven Field Ordering - Fields arranged to match template output (top to bottom)
  • Cross-Reference Display System - Read-only display of legal description comparisons and mortgage cross-references
  • Enhanced TypeScript Types - Complete type definitions for all document types and cross-references
  • Schema Cleanup - Removed unused fields, added all missing fields for 8 document types
  • Object Handling - Robust handling of complex fields (e.g., principal_amount as object)
  • UI/UX Improvements - Removed section headers, improved labels, eliminated redundant fields

v1.5 Updates (November 3, 2025):

  • Multi-Chain Support - System now handles multiple independent property chains with separate visualization
  • Enhanced Legal Description Matching - 4-pass lot number extraction strategy with tuple-based comparison
    • Pass 1: Standard patterns ("Lot 37", "Lot No. 37", "Lot #37")
    • Pass 2: Standalone "#37" patterns
    • Pass 3: Multiple lot patterns ("and 37", ", 37")
    • Pass 4: Parenthesized numbers ("Lot Number Thirty-seven (37)")
  • Synonym Normalization - tract/plot/piece → parcel for better matching accuracy
  • Tuple-Based Lot Comparison - Prevents false positives (e.g., "Lots 3,4" vs "Lot 4" = different)
  • Compact Chain Layout - Reduced horizontal spacing to 400px, vertical spacing to 250px for unconnected docs
  • Branch Point Detection - Child chains positioned at same Y level as parent branch point
  • Unconnected Document Positioning - Documents without connections positioned at x=-400 to prevent overlap
  • Party Matching Improvements - Enhanced survivorship handling and name normalization
  • Book/Page Sorting - Same-date documents sorted by book and page numbers
  • Document Count Storage - Added document_count field to abstract model
  • Pass 3 Timing Display - Fixed timing display issues in processing logs
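
The tuple-based lot comparison and synonym normalization can be sketched as follows. This is a simplified illustration that covers only some of the patterns named in the four passes (it skips standalone "#37", for instance). The regexes and function names are assumptions, not the project's implementation.

```python
import re

SYNONYMS = re.compile(r"\b(tract|plot|piece)\b", re.IGNORECASE)

# "Lot 37", "Lot No. 37", "Lot #37", "Lots 3,4", "Lots 3 and 4"
LOT_LIST = re.compile(
    r"\blots?\s*(?:no\.?\s*|#\s*)?(\d+(?:\s*(?:,|and|&)\s*\d+)*)",
    re.IGNORECASE,
)
# "Lot Number Thirty-seven (37)"
LOT_PAREN = re.compile(r"\blot\s+number\s+[a-z-]+\s*\((\d+)\)", re.IGNORECASE)


def normalize(text: str) -> str:
    """Map tract/plot/piece to 'parcel' so descriptions compare consistently."""
    return SYNONYMS.sub("parcel", text)


def lot_tuple(text: str) -> tuple:
    """Return the sorted tuple of distinct lot numbers mentioned in text."""
    numbers = set()
    for group in LOT_LIST.findall(text):
        numbers.update(int(n) for n in re.findall(r"\d+", group))
    numbers.update(int(n) for n in LOT_PAREN.findall(text))
    return tuple(sorted(numbers))


def same_lots(a: str, b: str) -> bool:
    """True only when both texts name exactly the same non-empty set of lots."""
    lots_a, lots_b = lot_tuple(a), lot_tuple(b)
    return bool(lots_a) and lots_a == lots_b
```

Comparing whole tuples rather than checking for any shared number is what keeps "Lots 3,4" from matching "Lot 4".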

v1.4 Updates (October 29, 2025):

  • Database Performance Optimization - 300-1000x speedup on analytics (10-30s → 0.03s), 10-15x on feedback analysis
  • Per-Document Markdown Editor - Edit markdown for individual documents with template regeneration
  • Jinja2 Template System - 10 professional templates for different document types (deed, mortgage, lis pendens, affidavit, UCC, etc.)
  • Quality Metrics Dashboard - Comprehensive uncertain fields tracking with 17 reason codes across 5 categories
  • Analytics Quality Endpoint - Real-time quality analytics with severity breakdown and 30-day trends
  • Enhanced Feedback Analysis - AI analysis now includes relevant extraction prompt excerpts for better suggestions
  • Export Individual Documents - Export single documents or all documents as separate files in ZIP archive
  • Database Indexing - Compound indexes for optimal query performance

v1.2 Updates (October 27, 2025):

  • Dual LLM Support - Choose between Gemini 2.5 Pro and Claude Sonnet 4.5 for document processing
  • LLM Tracking & Display - Track and display which AI model processed each abstract
  • Delete Functionality - Delete abstracts with confirmation dialog (removes abstract, documents, and PDF file)
  • Eastern Time Timestamps - Processing logs now display in 12-hour Eastern Time format with seconds
  • Performance Timing System - Detailed step-by-step timing logs in Celery worker terminal

v1.1 Updates (October 26, 2025):

  • OCR Quality Scoring System - 6-metric quality assessment with automatic Vision API fallback
  • Google Cloud Vision API Integration - Handwriting support and quality-based model selection
  • Feedback System - Complete user feedback loop with AI-powered analysis and dashboard
  • Gemini Model Update - Upgraded to gemini-2.5-pro (Gemini 2.5 Pro)
  • Home Page Redesign - Improved table layout with better information density
  • Search Functionality - Global search across all abstracts

See docs/RECENT_UPDATES.md for complete details on all changes.

Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Next.js   │────▶│   FastAPI    │────▶│   MongoDB   │
│  Frontend   │◀────│   Backend    │     │  Database   │
└─────────────┘ WS  └──────────────┘     └─────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │    Redis     │
                    │ Job Queue &  │
                    │   Pub/Sub    │
                    └──────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │    Celery    │
                    │   Workers    │
                    └──────────────┘

Real-Time Updates Flow:

  1. Client connects to WebSocket: ws://localhost:8000/api/v1/ws/jobs/{job_id}
  2. FastAPI subscribes to Redis channel: job_updates:{job_id}
  3. Celery worker publishes status updates to Redis
  4. FastAPI receives from Redis and broadcasts to WebSocket clients
  5. Frontend receives live updates without polling
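
The flow above can be simulated end to end with the standard library alone. In this sketch an in-memory broker stands in for Redis Pub/Sub and a plain list stands in for the WebSocket client. The class and function names, the payload fields (`step`, `progress`), and the step names are assumptions. Only the `job_updates:{job_id}` channel naming comes from this README.

```python
import asyncio
import json
from collections import defaultdict


class InMemoryBroker:
    """Stand-in for Redis Pub/Sub: channel name -> subscriber queues."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel):
        queue = asyncio.Queue()
        self._subscribers[channel].append(queue)
        return queue

    async def publish(self, channel, message):
        for queue in self._subscribers[channel]:
            await queue.put(message)


def channel_for(job_id):
    return f"job_updates:{job_id}"


async def worker(broker, job_id):
    """Plays the Celery worker: publishes JSON status updates."""
    for step, progress in [("ocr", 30), ("extract", 70), ("done", 100)]:
        payload = json.dumps({"step": step, "progress": progress})
        await broker.publish(channel_for(job_id), payload)


async def relay(queue, client_inbox):
    """Plays the FastAPI side: forwards each update to the connected client."""
    while True:
        update = json.loads(await queue.get())
        client_inbox.append(update)
        if update["progress"] >= 100:
            break


async def demo():
    broker = InMemoryBroker()
    inbox = []
    queue = broker.subscribe(channel_for("abc123"))
    await asyncio.gather(relay(queue, inbox), worker(broker, "abc123"))
    return inbox


UPDATES = asyncio.run(demo())
```

Because the worker only talks to the broker, it never needs a handle on the WebSocket connection, which is what lets the design span multiple worker processes.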

Tech Stack

Backend

  • FastAPI - Modern async Python web framework
  • MongoDB - Document database (Beanie ODM)
  • Redis - Message broker for the Celery job queue and Pub/Sub channel for real-time status updates
  • Celery - Distributed task queue for async processing
  • Dual LLM Support:
    • Google Gemini 2.5 Pro - AI extraction (gemini-2.5-pro)
    • Anthropic Claude Sonnet 4.5 - AI extraction (claude-sonnet-4-5-20250929)
  • Tesseract OCR - Primary OCR engine with quality scoring
  • Google Cloud Vision API - Handwriting OCR and fallback processing

Frontend

  • Next.js 16 - React framework with App Router
  • TypeScript - Type-safe development
  • Tailwind CSS - Utility-first styling
  • shadcn/ui - React component library
  • react-pdf - PDF viewing

Infrastructure

  • Docker Compose - Local development
  • Kubernetes/AKS - Azure cloud deployment with auto-scaling
  • Nginx - Reverse proxy (production)

Project Structure

title-abstractor-enterprise/
├── backend/
│   ├── app/
│   │   ├── api/v1/          # API routes ✅
│   │   ├── core/            # Business logic (ported from the Streamlit prototype)
│   │   │   ├── abstractor.py
│   │   │   ├── gemini_client.py
│   │   │   ├── chain_analyzer.py
│   │   │   └── prompts/
│   │   ├── models/          # MongoDB models ✅
│   │   │   ├── abstract.py
│   │   │   └── job.py
│   │   ├── schemas/         # Pydantic schemas ✅
│   │   ├── workers/         # Celery tasks ✅
│   │   └── main.py          # FastAPI app ✅
│   └── requirements.txt     # Dependencies ✅
├── frontend/                # Next.js app ✅
├── docker-compose.yml       # Local dev setup ✅
└── .env.example             # Environment variables ✅

What's Been Built

✅ Phase 1: Backend API - COMPLETED

  1. Directory structure - Full project scaffolding
  2. Core business logic - Abstractor, Gemini client, Chain analyzer, OCR system
  3. Backend configuration - Pydantic settings with env management
  4. MongoDB models - Abstract, Job, Settings, and Feedback models with Beanie ODM
  5. FastAPI app - App with health check, MongoDB connection, CORS middleware
  6. API routes - Complete REST API for abstracts, jobs, settings, prompts, and feedback
  7. Celery workers - Background PDF processing with real-time job tracking
  8. Pydantic schemas - Request/response validation for all endpoints
  9. Docker Compose - Full local development environment
  10. OCR System - Tesseract + Google Vision API with quality scoring
  11. Feedback System - CRUD + AI analysis endpoints

✅ Phase 2: Frontend - COMPLETED

  1. Next.js 16 app - Complete React frontend with Turbopack
  2. Upload UI - Single file and bulk upload modes
  3. Real-time job status - Status updates during processing (WebSocket since v2.2, with HTTP polling as fallback)
  4. Document viewer - PDF viewer with citations and highlighting
  5. Document editing - Full inline editing with markdown export
  6. Settings UI - Time estimation and prompt management
  7. Feedback UI - Per-document feedback with AI analysis dashboard
  8. Search - Global search across all abstracts
  9. Responsive design - Mobile-friendly interface

🚧 Future Enhancements

  1. WebSocket/SSE - Real-time updates via WebSocket + Redis pub/sub (COMPLETED v2.2)
  2. Tests - Unit and integration test suites (COMPLETED v2.0 - 75 tests)
  3. Production deployment - Nginx reverse proxy, production Docker images
  4. Batch operations - Bulk delete, bulk export (COMPLETED v2.0)

Quick Start

See QUICK_START.md for detailed setup instructions.

Prerequisites

  • Docker & Docker Compose (recommended) OR
  • Python 3.11+, MongoDB, Redis (for manual setup)
  • API Keys (at least one required):
    • Google Gemini API key (for Gemini 2.5 Pro)
    • Anthropic API key (for Claude Sonnet 4.5)

Option 1: Docker Compose (Recommended)

# 1. Copy environment file
cp .env.example .env

# 2. Edit .env and add your API keys
# Required: GOOGLE_API_KEY (for Gemini) and/or ANTHROPIC_API_KEY (for Claude)
nano .env

# 3. Start all services (MongoDB, Redis, Backend, Celery)
docker-compose up -d

# 4. View logs
docker-compose logs -f backend

# 5. Access API docs
open http://localhost:8000/docs

Option 2: Manual Setup

# 1. Start MongoDB and Redis
docker run -d -p 27017:27017 --name mongodb mongo:7
docker run -d -p 6379:6379 --name redis redis:7-alpine

# 2. Backend setup
cd backend
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp ../.env.example ../.env
# Edit .env and add GOOGLE_API_KEY and/or ANTHROPIC_API_KEY
uvicorn app.main:app --reload --port 8000

# 3. Start Celery worker (in new terminal)
cd backend
source venv/bin/activate
celery -A app.workers.celery_app worker --loglevel=info

Visit http://localhost:8000/docs for interactive API documentation.

Development Roadmap

Phase 1: Backend API ✅ (COMPLETED)

  • Create API route: POST /api/v1/abstracts/upload
  • Create API route: GET /api/v1/jobs/{job_id}
  • Create API route: GET /api/v1/abstracts
  • Create API route: GET /api/v1/abstracts/{id}
  • Create API route: GET /api/v1/abstracts/{id}/pdf (download)
  • Create API route: GET /api/v1/abstracts/{id}/export (markdown)
  • Create API route: DELETE /api/v1/abstracts/{id}
  • Create Pydantic schemas for requests/responses
  • Set up Celery worker configuration
  • Implement PDF processing Celery task
  • Docker Compose for local development

Phase 2: Frontend ✅ (COMPLETED)

  • Initialize Next.js project
  • Create upload page with drag & drop
  • Create abstracts list page
  • Create document viewer with PDF side-by-side
  • Implement job progress polling
  • Add chain visualization components

Phase 3: Integration ✅ (COMPLETED)

  • Docker Compose for full stack
  • File upload to storage (local/S3)
  • Error handling and validation
  • Export endpoints (markdown/JSON)
  • WebSocket for real-time updates (v2.2)

Phase 4: Polish & Deploy 🚀

  • Unit & integration tests (75+ tests, see backend/tests/)
  • Production Docker images (Docker Compose + Kubernetes/AKS ready)
  • Nginx reverse proxy (optional - currently using direct FastAPI)
  • Environment-based configuration (.env files + Pydantic Settings)
  • Documentation & API specs (13 docs files, /docs endpoint)

API Endpoints

POST   /api/v1/abstracts/upload          # Upload PDF, return job_id
GET    /api/v1/jobs/{job_id}             # Check job status
GET    /api/v1/abstracts                 # List all abstracts
GET    /api/v1/abstracts/{id}            # Get abstract details
PUT    /api/v1/abstracts/{id}/documents/{doc_id}  # Edit document
GET    /api/v1/abstracts/{id}/chain      # Get chain analysis
POST   /api/v1/abstracts/{id}/chat       # Chatbot Q&A
GET    /api/v1/abstracts/{id}/export     # Download markdown/JSON
DELETE /api/v1/abstracts/{id}            # Delete abstract

Features Ported from Current App

All core features from the Streamlit prototype are now available:

  • Dual LLM Support - Choose between Gemini 2.5 Pro and Claude Sonnet 4.5
  • 3-pass PDF extraction - Pass 1: Inventory, Pass 2: Details, Pass 3: Chain
  • OCR System - Tesseract + Google Vision API with quality scoring
  • Document editing interface - Full inline editing with markdown export
  • Document validation - Deduplication and data validation
  • Chain of title analysis - Automated relationship detection
  • Chain visualization - Interactive React Flow diagrams
  • Legal description comparison - Automated matching and analysis
  • Feedback system - User feedback with AI analysis
  • Delete functionality - Remove abstracts with confirmation dialog
  • Export to markdown/JSON - Multiple export formats
  • Search - Global search across all abstracts

Notes

  • Current Streamlit app remains untouched - This is a completely separate project
  • Reuses battle-tested logic - All extraction and analysis code from current app
  • Production-ready architecture - Async processing, proper database, scalable workers
  • Modern frontend - React with TypeScript for maintainability
  • No user authentication yet - Admin pages are gated by a simple client-side password (v3.0); per-user login can be added later with JWT tokens (backend already has passlib/jose)

Next Steps

All core functionality is complete! Future enhancements:

  1. Testing - Unit and integration test suites ✅ DONE (75 tests)
  2. WebSocket/SSE - Replace polling with real-time updates ✅ DONE (v2.2)
  3. Production deployment - ✅ Kubernetes/AKS complete, optional: Add nginx reverse proxy
  4. Batch operations - ✅ DONE (delete script + export ZIP endpoint)

Documentation

All documentation is centralized in the /docs directory, grouped under Getting Started, Deployment, Technical Documentation, Backend, and Updates & Reference.

License

Same as original Title Abstractor project.
