A modular Python pipeline for converting any web content into structured training datasets for LLM fine-tuning. Orchestrates scraping → cleaning → chunking → LLM generation to produce high-quality JSONL dialogue data.
Perfect for: Documentation, tutorials, guides, wikis, knowledge bases, or any content-rich websites you want to transform into conversational training data.
Example Use Case: Currently optimized for Black Desert Online (BDO) game guides, but easily adaptable to any domain by adding custom cleaning and chunking strategies.
- Multi-Source Scraping: Pluggable scrapers with fallback strategies (supports custom APIs and generic web scraping)
- Intelligent Cleaning: Domain-specific content cleaning to remove boilerplate, navigation, and clutter
- Smart Chunking: Heading-aware text splitting with token-level precision
- Table/List Splitting: Context-preserving segmentation for long structured data
- LLM Generation: Google Gemini API integration for dialogue creation (easily swappable)
- JSONL Output: Training-ready conversational format compatible with major LLM frameworks
- Docker Support: Containerized deployment with docker-compose
- Fully Configurable: Environment-based configuration system
- Extensible Architecture: Easy to add new domains, cleaning strategies, or chunking methods
- Game Guides & Wikis: Convert gaming documentation into interactive Q&A datasets
- Technical Documentation: Transform API docs, tutorials, or manuals into conversational training data
- Knowledge Bases: Extract structured information from FAQs, help centers, or support sites
- Educational Content: Convert courses, lessons, or learning materials into dialogue format
- Product Documentation: Turn product guides into customer support training data
Current Implementation: Optimized for Black Desert Online guides (3 domains: Black Desert Foundry, Garmoth.com, Official Wiki)
```
├── config.py                  # Configuration & environment variables
├── data_extraction.py         # Main pipeline orchestration
├── scraper.py                 # Web scraping (multi-API strategy)
├── cleaner.py                 # Domain-specific content cleaning
├── chunker.py                 # Token-aware text chunking
├── table_list_splitter.py     # Long table/list splitting
├── generator.py               # LLM interaction & JSONL output
├── requirements.txt           # Python dependencies
├── requirements-minimal.txt   # Minimal Python dependencies for Docker
├── Dockerfile                 # Container image definition
├── docker-compose.yml         # Docker orchestration
├── .env.example               # Environment variable template
├── ARCHITECTURE.md            # System architecture & data flow
├── CONTRIBUTING.md            # Contribution guidelines
├── LICENSE                    # MIT License
├── data/                      # Input data directory (volume-mounted in Docker)
│   ├── LINKS.md               # Markdown file with URLs to process (batch mode)
│   └── processed_links.txt    # Tracks processed URLs to avoid reprocessing
├── output/                    # Generated JSONL files (volume-mounted in Docker)
│   └── *.jsonl                # Training data output files
└── logs/                      # Application logs (volume-mounted in Docker)
    ├── *_errors.log           # Failed generation attempts with details
    └── short_chunks.log       # Chunks below quality thresholds
```
- `data/`: Input files and processing state
  - Place your markdown files with URLs here
  - `processed_links.txt` automatically tracks completed URLs
  - Prevents reprocessing on pipeline restarts
- `output/`: Generated training data
  - JSONL files named according to `PATHS_OUTPUT_FILENAME` in `.env`
  - Each line is a complete conversation with metadata
  - Ready for immediate use in LLM fine-tuning
- `logs/`: Debugging and quality monitoring
  - Error logs capture failed LLM generations with full context
  - Short chunks log helps tune quality thresholds
  - Useful for troubleshooting and pipeline optimization
- Python 3.13+
- Google Generative AI API Key (Get one here)
- Hugging Face Token (Get one here)
1. Clone the repository

   ```bash
   git clone https://github.com/nidea1/content-to-training-data.git
   cd content-to-training-data
   ```

2. Create virtual environment

   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # macOS/Linux
   source venv/bin/activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment

   ```bash
   copy .env.example .env   # Windows
   # cp .env.example .env   # macOS/Linux
   ```

   Edit `.env` and add your API keys:

   ```env
   GOOGLE_API_KEY=your_google_api_key_here
   HF_TOKEN=your_huggingface_token_here
   ```

5. Run the pipeline

   ```bash
   # Single URL mode
   python data_extraction.py

   # Batch mode (processes links from markdown file)
   # Edit .env: APP_BATCH_MODE=true
   python data_extraction.py
   ```
1. Setup environment

   ```bash
   cp .env.example .env
   # Edit .env with your API keys and configuration
   ```

2. Create data directories (if not already present)

   ```bash
   mkdir -p data output logs
   ```

3. Place input markdown file

   ```bash
   # Create a markdown file with your target URLs
   # Default location: data/LINKS.md (configurable via PATHS_MARKDOWN_FILENAME)

   # Example format:
   cat > data/LINKS.md << 'EOF'
   # My Content Links

   ## Category 1
   - [Article Title 1](https://example.com/article1)
   - [Article Title 2](https://example.com/article2)

   ## Category 2
   - [Guide Title](https://example.com/guide)
   EOF
   ```

4. Run with Docker Compose

   ```bash
   # Build and run in foreground (see logs)
   docker-compose up

   # Or run in detached mode (background)
   docker-compose up -d

   # View logs
   docker-compose logs -f

   # Stop and remove containers
   docker-compose down

   # Stop and remove volumes (cleans everything)
   docker-compose down -v
   ```
The pipeline uses volume mounting to persist data between container runs:
```yaml
# docker-compose.yml mounts three directories:
volumes:
  - ./data:/app/data      # Input files & processing state
  - ./output:/app/output  # Generated JSONL training data
  - ./logs:/app/logs      # Error logs & debug info
```

What this means:
- `./data` → `/app/data`:
  - Your local `data/` folder is accessible inside the container
  - Put markdown files with URLs here
  - `processed_links.txt` persists between runs (no reprocessing)
  - Required for batch mode: place your links file here
- `./output` → `/app/output`:
  - Generated JSONL files appear in your local `output/` folder
  - Survives container restarts and rebuilds
  - Easy access to training data without entering the container
- `./logs` → `/app/logs`:
  - Error logs and quality reports are written to your local `logs/` folder
  - Debug failed generations without `docker exec`
  - Monitor pipeline health in real time
Single URL mode:

```bash
# 1. Configure for single URL mode
cat > .env << 'EOF'
GOOGLE_API_KEY=your_key_here
HF_TOKEN=your_token_here
APP_BATCH_MODE=false
APP_TARGET_URL=https://example.com/article
PATHS_OUTPUT_FILENAME=output/single_article.jsonl
EOF
# 2. Run container
docker-compose up
# 3. Check output
cat output/single_article.jsonl | jq .
```

Batch mode:

```bash
# 1. Create input file with URLs
mkdir -p data
cat > data/my_guides.md << 'EOF'
# Gaming Guides
- [Guide 1](https://example.com/guide1)
- [Guide 2](https://example.com/guide2)
- [Guide 3](https://example.com/guide3)
EOF
# 2. Configure for batch mode
cat > .env << 'EOF'
GOOGLE_API_KEY=your_key_here
HF_TOKEN=your_token_here
APP_BATCH_MODE=true
PATHS_MARKDOWN_FILENAME=data/my_guides.md
PATHS_PROCESSED_LINKS_FILE=data/processed_links.txt
PATHS_OUTPUT_FILENAME=output/my_guides_dataset.jsonl
EOF
# 3. Run pipeline
docker-compose up
# 4. Check results
ls -lh output/my_guides_dataset.jsonl
wc -l output/my_guides_dataset.jsonl
# 5. View processing state
cat data/processed_links.txt
```

Resuming an interrupted run:

```bash
# If pipeline failed or was interrupted:
# 1. Check what was already processed
cat data/processed_links.txt
# 2. Check error logs
cat logs/*_errors.log
# 3. Resume (automatically skips processed URLs)
docker-compose up
# Pipeline reads processed_links.txt and continues from where it stopped!
```

Monitoring a running pipeline:

```bash
# Terminal 1: Run pipeline
docker-compose up
# Terminal 2: Watch output file grow
watch -n 2 'wc -l output/*.jsonl'
# Terminal 3: Monitor errors
tail -f logs/*_errors.log
# Terminal 4: Check short chunks (quality issues)
tail -f logs/short_chunks.log
```

Mount a custom prompt file:

```yaml
# docker-compose.yml
volumes:
  - ./data:/app/data
  - ./output:/app/output
  - ./logs:/app/logs
  - ./my_custom_prompt.txt:/app/meta_prompt.txt:ro  # Read-only mount
```

Before running the pipeline, ensure these directories exist:

```bash
# Create required directories
mkdir -p data output logs
# Verify structure
ls -la data/ output/ logs/
```

Directory Roles:

- `data/`: Input markdown files and processing state tracker
- `output/`: Generated JSONL training datasets
- `logs/`: Error logs and quality reports
Extract data from a single webpage:
```bash
# Configure in .env
APP_BATCH_MODE=false
APP_TARGET_URL=https://example.com/your-guide-or-article
PATHS_OUTPUT_FILENAME=output/single_article.jsonl
# Run pipeline
python data_extraction.py
# Check output
cat output/single_article.jsonl
```

Output location: File specified by `PATHS_OUTPUT_FILENAME` (default: `output/bdo_guides.jsonl`)
Process multiple URLs from a markdown file:
```bash
# 1. Create input file with URLs
cat > data/my_links.md << 'EOF'
# My Content Links
## Technical Docs
- [Python Tutorial](https://example.com/python-tutorial)
- [API Guide](https://example.com/api-guide)
## Tutorials
- [Getting Started](https://example.com/getting-started)
EOF
# 2. Configure in .env
APP_BATCH_MODE=true
PATHS_MARKDOWN_FILENAME=data/my_links.md
PATHS_PROCESSED_LINKS_FILE=data/processed_links.txt
PATHS_OUTPUT_FILENAME=output/my_dataset.jsonl
# 3. Run pipeline
python data_extraction.py
# 4. Check results
wc -l output/my_dataset.jsonl
cat data/processed_links.txt
```

Markdown file format (standard markdown links):

```markdown
# Your Content Links
## Category 1
- [Article Title 1](https://example.com/article1)
- [Article Title 2](https://example.com/article2)
## Category 2
- [Guide Title](https://example.com/guide)
```

Processing State:
- `processed_links.txt` tracks completed URLs
- Re-running skips already processed links
- Delete the file to reprocess everything
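For reference, a minimal sketch of how this kind of resume tracking can be implemented (illustrative only; the actual logic lives in data_extraction.py, and the helper names below are hypothetical):

```python
from pathlib import Path

PROCESSED_LINKS_FILE = Path("data/processed_links.txt")  # assumed default path

def load_processed_links() -> set[str]:
    """Return the set of URLs completed in earlier runs."""
    if not PROCESSED_LINKS_FILE.exists():
        return set()
    lines = PROCESSED_LINKS_FILE.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}

def mark_processed(url: str) -> None:
    """Append a URL to the tracker so a restart skips it."""
    PROCESSED_LINKS_FILE.parent.mkdir(parents=True, exist_ok=True)
    with PROCESSED_LINKS_FILE.open("a", encoding="utf-8") as f:
        f.write(url + "\n")

# Skip anything already done, process only new URLs
processed = load_processed_links()
for url in ["https://example.com/article1", "https://example.com/article2"]:
    if url in processed:
        continue
    # ... run scraping/cleaning/chunking/generation for this URL ...
    mark_processed(url)
```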
Monitor progress and quality:

```bash
# Watch output file grow
tail -f output/my_dataset.jsonl
# Count generated conversations
wc -l output/my_dataset.jsonl
# Check for errors
cat logs/*_errors.log
# View quality issues (short chunks)
cat logs/short_chunks.log
# See processed URLs
cat data/processed_links.txt
```

All output paths are configurable via `.env`:

```env
# Input
PATHS_MARKDOWN_FILENAME=data/LINKS.md # Batch mode input
PATHS_PROCESSED_LINKS_FILE=data/processed_links.txt # Progress tracker
# Output
PATHS_OUTPUT_FILENAME=output/training_data.jsonl # Generated dataset
# Logs
PATHS_SHORT_CHUNKS_LOG=logs/short_chunks.log # Quality warnings
# Error logs auto-generated: logs/<output_name>_errors.log
```

All settings can be configured via environment variables. See `.env.example` for full options:
Key Settings:
- `GENERATION_MODEL_NAME`: LLM model (`gemma-3-27b-it`, `gemini-2.0-flash`, etc.)
- `GENERATION_TEMPERATURE`: Creativity (0-2, default 0.6)
- `CHUNKING_MAX_TOKENS`: Max tokens per chunk (default 3500)
- `QUALITY_MIN_PAIRS_PER_CHUNK`: Minimum QA pairs (default 10)
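As a rough illustration of how such environment-based settings are typically consumed with python-dotenv (the defaults below mirror the documented values, but this is a sketch, not the project's actual config.py):

```python
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # read variables from .env into the process environment

# Defaults mirror the documented values; config.py may structure this differently.
MODEL_NAME = os.getenv("GENERATION_MODEL_NAME", "gemma-3-27b-it")
TEMPERATURE = float(os.getenv("GENERATION_TEMPERATURE", "0.6"))
MAX_TOKENS_PER_CHUNK = int(os.getenv("CHUNKING_MAX_TOKENS", "3500"))
MIN_PAIRS_PER_CHUNK = int(os.getenv("QUALITY_MIN_PAIRS_PER_CHUNK", "10"))

print(MODEL_NAME, TEMPERATURE, MAX_TOKENS_PER_CHUNK, MIN_PAIRS_PER_CHUNK)
```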
Generated JSONL files contain conversational dialogues ready for LLM fine-tuning:
```json
{
  "conversations": [
    {"role": "system", "content": "You are a helpful assistant with expertise in..."},
    {"role": "user", "content": "What are the key features of X?"},
    {"role": "assistant", "content": "X has several key features including..."}
  ],
  "url": "https://example.com/article",
  "date": "2024-01-15"
}
```

Compatible with: OpenAI fine-tuning format, Axolotl, Hugging Face Transformers, and most LLM training frameworks.
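As a quick sanity check, every line parses independently; a minimal sketch for loading the dataset (the file name below is just an example):

```python
import json
from pathlib import Path

conversations = []
with Path("output/training_data.jsonl").open(encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)                      # one complete conversation per line
        conversations.append(record["conversations"])  # list of {"role", "content"} turns

print(f"Loaded {len(conversations)} conversations")
print(conversations[0][1]["content"])  # first user turn of the first conversation
```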
The pipeline follows a modular design:
URL → Scraper → Cleaner → Chunker → TableListSplitter → Generator → JSONL
Each component is independently configurable and replaceable. See ARCHITECTURE.md for detailed diagrams and data flow.
- Scraper: Fetches content using domain-specific APIs or generic fallbacks
- Cleaner: Removes navigation, boilerplate, and site-specific clutter
- Chunker: Splits text by headings while respecting token limits
- TableListSplitter: Segments long tables/lists with context preservation
- Generator: Creates Q&A dialogues using LLM prompts
- Output: Writes validated JSONL with metadata
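To make the composition concrete, here is a minimal sketch of how the stages chain together; the stub functions stand in for the real modules (scraper.py, cleaner.py, chunker.py, table_list_splitter.py, generator.py) and are not the project's actual interfaces:

```python
import json

# Placeholder stages -- the real implementations live in the pipeline modules.
def scrape(url: str) -> str: return f"# Heading\nContent fetched from {url}"
def clean(text: str) -> str: return text.strip()
def chunk(text: str) -> list[str]: return [text]
def split_long_tables(chunks: list[str]) -> list[str]: return chunks
def generate_dialogue(chunk_text: str) -> list[dict]:
    return [{"role": "user", "content": "What does this section cover?"},
            {"role": "assistant", "content": chunk_text}]

def process_url(url: str, output_path: str) -> None:
    """Scrape → clean → chunk → split → generate → append one JSON object per line."""
    pieces = split_long_tables(chunk(clean(scrape(url))))
    with open(output_path, "a", encoding="utf-8") as out:
        for piece in pieces:
            record = {"conversations": generate_dialogue(piece), "url": url}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

process_url("https://example.com/article", "example.jsonl")
```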
This pipeline is designed to be easily adapted to any content source. Here's how:
No code changes needed! The pipeline includes fallback scrapers that work with most websites:
```bash
# Just set your target URL
APP_TARGET_URL=https://your-website.com/article
python data_extraction.py
```

For better results, add domain-specific logic:
Edit scraper.py to add your domain's API or custom scraping:
```python
# In scraper.py, add to scrape_content():
if "yourdomain.com" in url:
    return self._scrape_your_custom_api(url)
```
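If you want to prototype a generic fallback outside the pipeline first, a reader-style proxy such as the Jina AI Reader (credited in the acknowledgements) converts most pages to markdown; a minimal sketch, assuming the requests library is available:

```python
import requests

def fetch_markdown(url: str, timeout: int = 30) -> str:
    """Fetch a page as markdown via the Jina AI Reader proxy (prefix the URL with r.jina.ai)."""
    response = requests.get(f"https://r.jina.ai/{url}", timeout=timeout)
    response.raise_for_status()
    return response.text

print(fetch_markdown("https://example.com")[:500])
```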
Edit cleaner.py to remove your site's boilerplate:

```python
def clean_yourdomain(content: str) -> str:
    """Remove navigation, footers, etc. specific to yourdomain.com"""
    # Your cleaning logic here
    return cleaned_content

# Register in CLEANING_STRATEGIES dict
CLEANING_STRATEGIES = {
    "yourdomain.com": clean_yourdomain,
    # ... existing strategies
}
```

If your content has unique heading patterns, add to chunker.py:
```python
def chunk_by_headings_yourdomain(self, text: str) -> List[str]:
    """Split by your domain's heading structure"""
    # Your chunking logic
    return chunks
```
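For reference, heading-aware splitting under a token budget can be sketched as follows; the GPT-2 tokenizer is only a stand-in for whatever tokenizer the pipeline loads with `HF_TOKEN`, and the regex and budget are illustrative:

```python
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in; the pipeline configures its own

def chunk_by_headings(text: str, max_tokens: int = 3500) -> list[str]:
    """Split on markdown headings, then merge sections until the token budget is hit."""
    sections = re.split(r"(?m)^(?=#{1,3} )", text)  # keep each heading with its section
    chunks: list[str] = []
    current = ""
    for section in sections:
        candidate = current + section
        if current and len(tokenizer.encode(candidate)) > max_tokens:
            chunks.append(current)  # flush the accumulated sections
            current = section
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```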
Edit meta_prompt.txt or set META_PROMPT_TEMPLATE in config.py to match your domain expertise.

```python
# From: Black Desert Online game guides (current)
# To: Programming tutorial site

# In cleaner.py:
import re

def clean_programming_tutorials(content: str) -> str:
    # Remove code playground widgets
    content = re.sub(r'<CodeSandbox.*?/>', '', content)
    # Remove "Try it yourself" buttons
    content = re.sub(r'\[Try it\]\(.*?\)', '', content)
    return content

CLEANING_STRATEGIES["programming-tutorials.com"] = clean_programming_tutorials
```

Contributions are welcome! Please see CONTRIBUTING.md for:
- Code of conduct
- Development setup
- Code style guidelines
- Pull request process
Priority areas:
- Unit tests and test coverage
- New domain support (add your website/documentation source!)
- New scraping/cleaning strategies
- Performance improvements
- Documentation enhancements
- Scraper: Add method in `scraper.py` for domain-specific API/scraping
- Cleaner: Add cleaning function in `cleaner.py` to remove boilerplate
- Chunker: Add heading pattern detection in `chunker.py` (if needed)
- Test: Process a sample URL and verify output quality
See ARCHITECTURE.md for detailed component descriptions.
- ARCHITECTURE.md - System design & data flow
- CONTRIBUTING.md - Contribution guidelines
- .env.example - Configuration reference
Docker permission issues (Linux/macOS)
```bash
# Fix directory permissions
chmod -R 777 data output logs
# Or set proper ownership
sudo chown -R $(id -u):$(id -g) data output logs
# Verify directories exist
ls -la data/ output/ logs/
```

Output files not appearing
- Check that `PATHS_OUTPUT_FILENAME` points to the `output/` directory
- Verify Docker volume mounts: `docker-compose config`
- Local run: ensure the `output/` directory exists: `mkdir -p output`
- Check for write permissions
Low quality output (too few QA pairs)
- Lower quality thresholds in `.env`:

  ```env
  QUALITY_MIN_PAIRS_PER_CHUNK=5   # Default: 10
  QUALITY_MAX_PAIRS_PER_CHUNK=50  # Default: 30
  ```

- Adjust generation settings:

  ```env
  GENERATION_TEMPERATURE=0.8          # More creative (default: 0.6)
  GENERATION_MAX_OUTPUT_TOKENS=16000  # More content (default: 10240)
  ```

- Check `logs/short_chunks.log` for rejected chunks
Docker container not starting
```bash
# Rebuild without cache
docker-compose down
docker-compose build --no-cache
docker-compose up
# Check for port conflicts
docker-compose ps
# Verify .env file format (no spaces around =)
cat .env
```

Changes to .env not taking effect
```bash
# Docker: Restart containers
docker-compose down
docker-compose up
# Local: Ensure .env is in current directory
ls -la .env
# Verify environment variables loaded
python -c "from dotenv import load_dotenv; import os; load_dotenv(); print(os.getenv('GOOGLE_API_KEY'))"
```

Enable comprehensive logging:
```bash
# In .env
APP_DEBUG_MODE=true
# Run pipeline
python data_extraction.py
# Or with Docker
docker-compose up
```

Debug output includes:
- Detailed scraping responses
- Chunk token counts
- LLM prompts and responses
- JSON validation details
- Processing step timing
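The pipeline's own logger setup is internal, but gating verbosity on the documented `APP_DEBUG_MODE` flag generally looks something like this (the logging configuration below is an assumption, not the project's code):

```python
import logging
import os

debug = os.getenv("APP_DEBUG_MODE", "false").lower() == "true"

logging.basicConfig(
    level=logging.DEBUG if debug else logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger("pipeline")
logger.debug("Chunk token count: %d", 1234)  # emitted only when APP_DEBUG_MODE=true
```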
| File | Contents | When to Check |
|---|---|---|
| `logs/*_errors.log` | Failed LLM generations with full context | Empty/invalid output |
| `logs/short_chunks.log` | Chunks below quality thresholds | Low data yield |
| Console output | Pipeline progress and status | Real-time monitoring |
If issues persist:
- Enable debug mode: `APP_DEBUG_MODE=true`
- Run pipeline and save full output: `python data_extraction.py > debug.log 2>&1`
- Collect error logs: `cat logs/*_errors.log`
- Check your configuration: `cat .env`
- Open an issue with:
  - Debug output
  - Error logs
  - Configuration (remove API keys!)
  - Target URL (if not sensitive)
This project is licensed under the MIT License - see LICENSE for details.
- Google Generative AI for LLM capabilities
- Jina AI Reader for web-to-markdown conversion
- urltomarkdown & tomarkdown APIs for content extraction
- HuggingFace for tokenizer support
- Repository: https://github.com/nidea1/content-to-training-data
- Issues: https://github.com/nidea1/content-to-training-data/issues
- Discussions: https://github.com/nidea1/content-to-training-data/discussions
Made with ❤️ for the open source and LLM community