Conversation

@gphuang gphuang commented Jan 15, 2026

No description provided.

- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container
- Add TraceLens trace analysis report generation (XLSX, CSV formats)
- Add mlflow_upload_tracelens_report config option (default: false)
- Add mlflow_tracelens_ranks, mlflow_tracelens_max_reports options
- Add mlflow_tracelens_output_format option (all, xlsx, csv)
- Auto-install TraceLens from GitHub if not present
- Upload analysis reports to MLflow artifacts/trace_analysis/

…config parser

Addresses Copilot review comment: if mlflow_tracelens_ranks is configured
as a string in YAML (e.g., '[0,8]' instead of [0, 8]), the code would
receive a string instead of a list, causing _filter_traces_by_rank to
silently filter out all trace files.

Added ast.literal_eval() conversion in:
- generate_tracelens_reports()
- upload_tracelens_reports_to_mlflow()

Falls back to None (process all ranks) with a warning if parsing fails.
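A minimal sketch of the intended coercion, assuming an illustrative helper name (_coerce_ranks) rather than the exact code in mlflow_artifacts.py:

```python
import ast
import logging

logger = logging.getLogger(__name__)


def _coerce_ranks(ranks):
    """Accept a list of ints or a YAML string such as '[0,8]'; return a list or None."""
    if ranks is None:
        return None
    if isinstance(ranks, (list, tuple)):
        return [int(r) for r in ranks]
    try:
        parsed = ast.literal_eval(str(ranks))
        if isinstance(parsed, (list, tuple)):
            return [int(r) for r in parsed]
    except (ValueError, SyntaxError):
        pass
    logger.warning("Could not parse mlflow_tracelens_ranks=%r; processing all ranks", ranks)
    return None
```
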
When output_format='all', previously the trace file was parsed twice:
- Once for XLSX generation
- Once for CSV generation

Now when format is 'all', we call generate_perf_report_pytorch once
with both output_xlsx_path and output_csvs_dir parameters, parsing
the trace file only once and generating both formats from the same data.

This improves performance significantly for the common use case of
generating both report formats.

After TraceLens reports are successfully uploaded to MLflow, the local
tracelens_reports directory is automatically cleaned up to save disk space.

This addresses the issue of temporary directories not being cleaned up
after artifact upload. The reports remain accessible in MLflow while
freeing up local storage.

Other directories checked:
- tensorboard_dir: Contains original trace files, NOT temporary
- exp_root_path/logs: Contains original log files, NOT temporary
- tracelens_reports: Processed reports uploaded to MLflow, safe to clean up

Added mlflow_tracelens_cleanup_after_upload parameter to control whether
local TraceLens reports are removed after upload to MLflow.

Default: True (cleanup to save disk space)
Set to False to keep reports locally for inspection/debugging

Changes:
- Added cleanup_after_upload parameter to upload_tracelens_reports_to_mlflow()
- Added tracelens_cleanup_after_upload to upload_artifacts_to_mlflow()
- Added mlflow_tracelens_cleanup_after_upload config in YAML (default: true)
- Updated trainer to pass through the parameter

Use cases:
- True (default): Production runs, save disk space
- False: Development/debugging, keep local copies for inspection
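
A simplified sketch of the upload-then-cleanup flow (the function name and signature are illustrative, not the actual upload_tracelens_reports_to_mlflow code):

```python
import shutil

import mlflow


def upload_reports(report_dir: str, cleanup_after_upload: bool = True) -> None:
    """Log the local TraceLens reports to MLflow, then optionally delete them."""
    # Reports land under artifacts/trace_analysis/ in the active MLflow run.
    mlflow.log_artifacts(report_dir, artifact_path="trace_analysis")
    if cleanup_after_upload:
        # Default at this point in the PR: remove local copies to save disk space.
        shutil.rmtree(report_dir, ignore_errors=True)
```
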
Changed mlflow_tracelens_cleanup_after_upload default from True to False.

New behavior:
- Default (False): Keep reports locally for easy inspection
- Opt-in (True): Cleanup to save disk space

Rationale:
- Reports are valuable for local analysis and debugging
- Users can inspect without downloading from MLflow
- Disk space is less critical than convenience for most users
- Those who need cleanup can explicitly set it to True

Changes:
- cleanup_after_upload parameter: True → False
- tracelens_cleanup_after_upload parameter: True → False
- YAML config default: true → false
- Updated docstrings to reflect new default behavior

Added 'generate_tracelens_report' parameter to enable local report
generation without requiring MLflow upload.

New functionality:
- generate_tracelens_report: Generate reports locally only
- mlflow_upload_tracelens_report: Upload to MLflow (auto-enables generation)

Usage modes:
1. generate=F, upload=F  →  No reports
2. generate=T, upload=F  →  Generate locally only (NEW)
3. generate=F, upload=T  →  Generate AND upload (auto-enabled)
4. generate=T, upload=T  →  Generate AND upload (explicit)

Key benefits:
- Local development without MLflow dependency
- Quick profiling analysis without upload overhead
- Flexible workflow for different use cases
- Backward compatible (upload=T still works)

Changes:
- Added generate_tracelens_reports_locally() function
- Added generate_tracelens_report parameter to all entry points
- Updated logic to handle both parameters with auto-enable
- Updated YAML config with clear documentation
- Added comprehensive logging for each mode

Documentation: scripts/TRACELENS_GENERATION_MODES.md
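
A self-contained sketch of the flag resolution only (the helper name resolve_tracelens_modes is illustrative; the real entry points take additional arguments):

```python
def resolve_tracelens_modes(generate_report: bool, upload_report: bool) -> tuple[bool, bool]:
    """Return (should_generate, should_upload); upload auto-enables generation."""
    should_upload = upload_report
    should_generate = generate_report or upload_report
    return should_generate, should_upload


# The four modes from the commit message:
assert resolve_tracelens_modes(False, False) == (False, False)  # 1. no reports
assert resolve_tracelens_modes(True, False) == (True, False)    # 2. generate locally only
assert resolve_tracelens_modes(False, True) == (True, True)     # 3. generate and upload (auto)
assert resolve_tracelens_modes(True, True) == (True, True)      # 4. generate and upload
```
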
Removed mlflow_tracelens_max_reports parameter to simplify API and avoid confusion.

Problem:
- Using both 'ranks' and 'max_reports' was confusing
- Example: ranks=[0,4,8,12] + max_reports=2 → only [0,4] analyzed
- Users explicitly specify ranks, but the list is silently truncated

Solution:
- Use tracelens_ranks to control which ranks AND how many
- Want 2 reports? Specify 2 ranks: ranks=[0,8]
- Want all ranks? Use ranks=null
- Clear and explicit behavior

Changes:
- Removed max_reports parameter from all functions
- Removed mlflow_tracelens_max_reports from config
- Updated docstrings to clarify rank-based control
- Simplified logic - no truncation step needed

Migration:
  Before: ranks=[0,4,8,12], max_reports=2
  After:  ranks=[0,4]  # Just specify the ranks you want!
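
An illustrative version of the rank filter (the real _filter_traces_by_rank may differ; the rank[N] filename pattern follows the trace names mentioned elsewhere in this PR, e.g. rank[0].pt.trace):

```python
import os
import re


def filter_traces_by_rank(trace_files, ranks):
    """Keep only trace files whose name contains rank[N] for a requested rank N."""
    if ranks is None:
        return list(trace_files)  # ranks=null: analyze every rank
    wanted = {int(r) for r in ranks}
    kept = []
    for path in trace_files:
        match = re.search(r"rank\[(\d+)\]", os.path.basename(path))
        if match and int(match.group(1)) in wanted:
            kept.append(path)
    return kept
```
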
The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0]
which are interpreted as glob pattern character classes, causing glob.glob to
return empty results even though files exist.

Fixed by using glob.escape() on directory paths before using them with glob.glob().
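
A minimal example of the fix (the trace filename pattern here is an assumption; the key point is applying glob.escape to the directory part):

```python
import glob
import os


def find_trace_files(trace_dir: str):
    """List trace files in a directory whose path may contain [ and ]."""
    # glob.escape neutralizes [..] character classes in the directory path,
    # so rank[0] in an experiment name matches literally instead of nothing.
    pattern = os.path.join(glob.escape(trace_dir), "*.pt.trace.json*")
    return glob.glob(pattern)
```
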
…r all format

TraceLens uses either/or logic: if output_csvs_dir is set, it only generates
CSVs and ignores output_xlsx_path. To get both formats, we now call
generate_perf_report_pytorch twice: once for XLSX and once for CSVs.

Also added _ensure_openpyxl_installed() to automatically install openpyxl,
which is required for XLSX file generation.
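
A sketch of the two-pass generation (the TraceLens import path and the positional trace argument are assumptions; the keyword parameters are the ones named above):

```python
import importlib
import subprocess
import sys


def _ensure_openpyxl_installed() -> None:
    """openpyxl is required for XLSX output; install it on demand if missing."""
    try:
        importlib.import_module("openpyxl")
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "openpyxl"])


def generate_both_formats(trace_path: str, xlsx_path: str, csv_dir: str) -> None:
    """Generate XLSX and CSV reports from one trace by calling TraceLens twice."""
    from tracelens import generate_perf_report_pytorch  # import path is an assumption
    _ensure_openpyxl_installed()
    generate_perf_report_pytorch(trace_path, output_xlsx_path=xlsx_path)
    generate_perf_report_pytorch(trace_path, output_csvs_dir=csv_dir)
```
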
- Change from uploading individual CSV files to uploading directories
- Fixes issue where rank 8 CSVs overwrite rank 0 CSVs in MLflow
- Preserves local structure: rank[0].pt.trace/ and rank[8].pt.trace/

Before: CSVs mixed together, only last rank visible
After: CSVs grouped by rank in separate directories

Changes:
- Line 412: generated_files.append(csv_subdir) instead of extend(csv_files)
- Line 434: generated_files.append(csv_subdir) instead of extend(csv_files)
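
A sketch of the per-rank directory upload (the artifact path layout follows the artifacts/trace_analysis/ convention above; the helper name is illustrative):

```python
import os

import mlflow


def upload_rank_csv_dirs(csv_dirs) -> None:
    """Log each per-rank CSV directory under its own artifact path so that
    rank[8] output no longer overwrites rank[0] output in MLflow."""
    for csv_dir in csv_dirs:  # e.g. .../rank[0].pt.trace/, .../rank[8].pt.trace/
        rank_name = os.path.basename(os.path.normpath(csv_dir))
        mlflow.log_artifacts(csv_dir, artifact_path=f"trace_analysis/{rank_name}")
```
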
Copilot AI review requested due to automatic review settings January 15, 2026 09:01
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces unique timestamped output directories for each training run and adds MLflow environment variable support to Docker/Podman containers. The changes ensure organized output management, prevent directory conflicts in multi-node setups, and enable MLflow tracking in containerized environments.

Changes:

  • Automatic generation of timestamped output directories per run using MODEL_NAME_TIMESTAMP format
  • Synchronization of timestamp across multi-node SLURM jobs to ensure all nodes write to the same directory
  • Addition of MLflow environment variables to Docker/Podman container configuration
  • Path simplification by clearing PRIMUS_TEAM and PRIMUS_USER environment variables

Reviewed changes

Copilot reviewed 299 out of 670 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| examples/run_slurm_pretrain.sh | Generates unique timestamped directories and exports TIMESTAMP for multi-node consistency |
| examples/run_pretrain.sh | Creates timestamped output paths when not set by the SLURM wrapper, with conditional logic for single vs multi-node |
| examples/run_local_pretrain.sh | Adds MLflow and output directory environment variables to Docker/Podman containers |
| mlruns/* | MLflow experiment tracking artifacts (auto-generated metadata and parameters) |
