-
Notifications
You must be signed in to change notification settings - Fork 25
Feat/combined test #495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feat/combined test #495
Conversation
- Add mlflow_artifacts.py with functions to collect and upload trace/log files - Add upload_mlflow_artifacts() wrapper in global_vars.py - Integrate artifact upload in trainer.py before MLflow run ends - Add mlflow_upload_traces and mlflow_upload_logs config options - Add unique timestamp-based output directories for multi-node consistency - Pass MLflow environment variables through Docker container
- Add TraceLens trace analysis report generation (XLSX, CSV formats) - Add mlflow_upload_tracelens_report config option (default: false) - Add mlflow_tracelens_ranks, mlflow_tracelens_max_reports options - Add mlflow_tracelens_output_format option (all, xlsx, csv) - Auto-install TraceLens from GitHub if not present - Upload analysis reports to MLflow artifacts/trace_analysis/
…config parser Addresses Copilot review comment: if mlflow_tracelens_ranks is configured as a string in YAML (e.g., '[0,8]' instead of [0, 8]), the code would receive a string instead of a list, causing _filter_traces_by_rank to silently filter out all trace files. Added ast.literal_eval() conversion in: - generate_tracelens_reports() - upload_tracelens_reports_to_mlflow() Falls back to None (process all ranks) with a warning if parsing fails.
When output_format='all', previously the trace file was parsed twice: - Once for XLSX generation - Once for CSV generation Now when format is 'all', we call generate_perf_report_pytorch once with both output_xlsx_path and output_csvs_dir parameters, parsing the trace file only once and generating both formats from the same data. This improves performance significantly for the common use case of generating both report formats.
After TraceLens reports are successfully uploaded to MLflow, the local tracelens_reports directory is automatically cleaned up to save disk space. This addresses the issue of temporary directories not being cleaned up after artifact upload. The reports remain accessible in MLflow while freeing up local storage. Other directories checked: - tensorboard_dir: Contains original trace files, NOT temporary - exp_root_path/logs: Contains original log files, NOT temporary - tracelens_reports: Processed reports uploaded to MLflow, safe to cleanup
Added mlflow_tracelens_cleanup_after_upload parameter to control whether local TraceLens reports are removed after upload to MLflow. Default: True (cleanup to save disk space) Set to False to keep reports locally for inspection/debugging Changes: - Added cleanup_after_upload parameter to upload_tracelens_reports_to_mlflow() - Added tracelens_cleanup_after_upload to upload_artifacts_to_mlflow() - Added mlflow_tracelens_cleanup_after_upload config in YAML (default: true) - Updated trainer to pass through the parameter Use cases: - True (default): Production runs, save disk space - False: Development/debugging, keep local copies for inspection
Changed mlflow_tracelens_cleanup_after_upload default from True to False. New behavior: - Default (False): Keep reports locally for easy inspection - Opt-in (True): Cleanup to save disk space Rationale: - Reports are valuable for local analysis and debugging - Users can inspect without downloading from MLflow - Disk space is less critical than convenience for most users - Those who need cleanup can explicitly set it to True Changes: - cleanup_after_upload parameter: True → False - tracelens_cleanup_after_upload parameter: True → False - YAML config default: true → false - Updated docstrings to reflect new default behavior
Added 'generate_tracelens_report' parameter to enable local report generation without requiring MLflow upload. New functionality: - generate_tracelens_report: Generate reports locally only - mlflow_upload_tracelens_report: Upload to MLflow (auto-enables generation) Usage modes: 1. generate=F, upload=F → No reports 2. generate=T, upload=F → Generate locally only (NEW) 3. generate=F, upload=T → Generate AND upload (auto-enabled) 4. generate=T, upload=T → Generate AND upload (explicit) Key benefits: - Local development without MLflow dependency - Quick profiling analysis without upload overhead - Flexible workflow for different use cases - Backward compatible (upload=T still works) Changes: - Added generate_tracelens_reports_locally() function - Added generate_tracelens_report parameter to all entry points - Updated logic to handle both parameters with auto-enable - Updated YAML config with clear documentation - Added comprehensive logging for each mode Documentation: scripts/TRACELENS_GENERATION_MODES.md
Removed mlflow_tracelens_max_reports parameter to simplify API and avoid confusion. Problem: - Using both 'ranks' and 'max_reports' was confusing - Example: ranks=[0,4,8,12] + max_reports=2 → only [0,4] analyzed - Users explicitly specify ranks but get silently truncated Solution: - Use tracelens_ranks to control which ranks AND how many - Want 2 reports? Specify 2 ranks: ranks=[0,8] - Want all ranks? Use ranks=null - Clear and explicit behavior Changes: - Removed max_reports parameter from all functions - Removed mlflow_tracelens_max_reports from config - Updated docstrings to clarify rank-based control - Simplified logic - no truncation step needed Migration: Before: ranks=[0,4,8,12], max_reports=2 After: ranks=[0,4] # Just specify the ranks you want!
The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0] which are interpreted as glob pattern character classes, causing glob.glob to return empty results even though files exist. Fixed by using glob.escape() on directory paths before using them with glob.glob().
…r all format TraceLens uses either/or logic - if output_csvs_dir is set, it only generates CSVs and ignores output_xlsx_path. To get both formats, we now call generate_perf_report_pytorch twice: once for XLSX and once for CSVs. Also added _ensure_openpyxl_installed() to automatically install openpyxl which is required for XLSX file generation.
- Change from uploading individual CSV files to uploading directories - Fixes issue where rank 8 CSVs overwrite rank 0 CSVs in MLflow - Preserves local structure: rank[0].pt.trace/ and rank[8].pt.trace/ Before: CSVs mixed together, only last rank visible After: CSVs grouped by rank in separate directories Changes: - Line 412: generated_files.append(csv_subdir) instead of extend(csv_files) - Line 434: generated_files.append(csv_subdir) instead of extend(csv_files)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces unique timestamped output directories for each training run and adds MLflow environment variable support to Docker/Podman containers. The changes ensure organized output management, prevent directory conflicts in multi-node setups, and enable MLflow tracking in containerized environments.
Changes:
- Automatic generation of timestamped output directories per run using
MODEL_NAME_TIMESTAMPformat - Synchronization of timestamp across multi-node SLURM jobs to ensure all nodes write to the same directory
- Addition of MLflow environment variables to Docker/Podman container configuration
- Path simplification by clearing PRIMUS_TEAM and PRIMUS_USER environment variables
Reviewed changes
Copilot reviewed 299 out of 670 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| examples/run_slurm_pretrain.sh | Generates unique timestamped directories and exports TIMESTAMP for multi-node consistency |
| examples/run_pretrain.sh | Creates timestamped output paths when not set by SLURM wrapper, with conditional logic for single vs multi-node |
| examples/run_local_pretrain.sh | Adds MLflow and output directory environment variables to Docker/Podman containers |
| mlruns/* | MLflow experiment tracking artifacts (auto-generated metadata and parameters) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
No description provided.