Feat/combined test #495

gphuang · 2026-01-15T09:01:49Z

No description provided.

- Add mlflow_artifacts.py with functions to collect and upload trace/log files - Add upload_mlflow_artifacts() wrapper in global_vars.py - Integrate artifact upload in trainer.py before MLflow run ends - Add mlflow_upload_traces and mlflow_upload_logs config options - Add unique timestamp-based output directories for multi-node consistency - Pass MLflow environment variables through Docker container

- Add TraceLens trace analysis report generation (XLSX, CSV formats) - Add mlflow_upload_tracelens_report config option (default: false) - Add mlflow_tracelens_ranks, mlflow_tracelens_max_reports options - Add mlflow_tracelens_output_format option (all, xlsx, csv) - Auto-install TraceLens from GitHub if not present - Upload analysis reports to MLflow artifacts/trace_analysis/

…N files

…config parser Addresses Copilot review comment: if mlflow_tracelens_ranks is configured as a string in YAML (e.g., '[0,8]' instead of [0, 8]), the code would receive a string instead of a list, causing _filter_traces_by_rank to silently filter out all trace files. Added ast.literal_eval() conversion in: - generate_tracelens_reports() - upload_tracelens_reports_to_mlflow() Falls back to None (process all ranks) with a warning if parsing fails.

When output_format='all', previously the trace file was parsed twice: - Once for XLSX generation - Once for CSV generation Now when format is 'all', we call generate_perf_report_pytorch once with both output_xlsx_path and output_csvs_dir parameters, parsing the trace file only once and generating both formats from the same data. This improves performance significantly for the common use case of generating both report formats.

After TraceLens reports are successfully uploaded to MLflow, the local tracelens_reports directory is automatically cleaned up to save disk space. This addresses the issue of temporary directories not being cleaned up after artifact upload. The reports remain accessible in MLflow while freeing up local storage. Other directories checked: - tensorboard_dir: Contains original trace files, NOT temporary - exp_root_path/logs: Contains original log files, NOT temporary - tracelens_reports: Processed reports uploaded to MLflow, safe to cleanup

Added mlflow_tracelens_cleanup_after_upload parameter to control whether local TraceLens reports are removed after upload to MLflow. Default: True (cleanup to save disk space) Set to False to keep reports locally for inspection/debugging Changes: - Added cleanup_after_upload parameter to upload_tracelens_reports_to_mlflow() - Added tracelens_cleanup_after_upload to upload_artifacts_to_mlflow() - Added mlflow_tracelens_cleanup_after_upload config in YAML (default: true) - Updated trainer to pass through the parameter Use cases: - True (default): Production runs, save disk space - False: Development/debugging, keep local copies for inspection

Changed mlflow_tracelens_cleanup_after_upload default from True to False. New behavior: - Default (False): Keep reports locally for easy inspection - Opt-in (True): Cleanup to save disk space Rationale: - Reports are valuable for local analysis and debugging - Users can inspect without downloading from MLflow - Disk space is less critical than convenience for most users - Those who need cleanup can explicitly set it to True Changes: - cleanup_after_upload parameter: True → False - tracelens_cleanup_after_upload parameter: True → False - YAML config default: true → false - Updated docstrings to reflect new default behavior

Added 'generate_tracelens_report' parameter to enable local report generation without requiring MLflow upload. New functionality: - generate_tracelens_report: Generate reports locally only - mlflow_upload_tracelens_report: Upload to MLflow (auto-enables generation) Usage modes: 1. generate=F, upload=F → No reports 2. generate=T, upload=F → Generate locally only (NEW) 3. generate=F, upload=T → Generate AND upload (auto-enabled) 4. generate=T, upload=T → Generate AND upload (explicit) Key benefits: - Local development without MLflow dependency - Quick profiling analysis without upload overhead - Flexible workflow for different use cases - Backward compatible (upload=T still works) Changes: - Added generate_tracelens_reports_locally() function - Added generate_tracelens_report parameter to all entry points - Updated logic to handle both parameters with auto-enable - Updated YAML config with clear documentation - Added comprehensive logging for each mode Documentation: scripts/TRACELENS_GENERATION_MODES.md

Removed mlflow_tracelens_max_reports parameter to simplify API and avoid confusion. Problem: - Using both 'ranks' and 'max_reports' was confusing - Example: ranks=[0,4,8,12] + max_reports=2 → only [0,4] analyzed - Users explicitly specify ranks but get silently truncated Solution: - Use tracelens_ranks to control which ranks AND how many - Want 2 reports? Specify 2 ranks: ranks=[0,8] - Want all ranks? Use ranks=null - Clear and explicit behavior Changes: - Removed max_reports parameter from all functions - Removed mlflow_tracelens_max_reports from config - Updated docstrings to clarify rank-based control - Simplified logic - no truncation step needed Migration: Before: ranks=[0,4,8,12], max_reports=2 After: ranks=[0,4] # Just specify the ranks you want!

The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0] which are interpreted as glob pattern character classes, causing glob.glob to return empty results even though files exist. Fixed by using glob.escape() on directory paths before using them with glob.glob().

…r all format TraceLens uses either/or logic - if output_csvs_dir is set, it only generates CSVs and ignores output_xlsx_path. To get both formats, we now call generate_perf_report_pytorch twice: once for XLSX and once for CSVs. Also added _ensure_openpyxl_installed() to automatically install openpyxl which is required for XLSX file generation.

- Change from uploading individual CSV files to uploading directories - Fixes issue where rank 8 CSVs overwrite rank 0 CSVs in MLflow - Preserves local structure: rank[0].pt.trace/ and rank[8].pt.trace/ Before: CSVs mixed together, only last rank visible After: CSVs grouped by rank in separate directories Changes: - Line 412: generated_files.append(csv_subdir) instead of extend(csv_files) - Line 434: generated_files.append(csv_subdir) instead of extend(csv_files)

Copilot

Pull request overview

This PR introduces unique timestamped output directories for each training run and adds MLflow environment variable support to Docker/Podman containers. The changes ensure organized output management, prevent directory conflicts in multi-node setups, and enable MLflow tracking in containerized environments.

Changes:

Automatic generation of timestamped output directories per run using MODEL_NAME_TIMESTAMP format
Synchronization of timestamp across multi-node SLURM jobs to ensure all nodes write to the same directory
Addition of MLflow environment variables to Docker/Podman container configuration
Path simplification by clearing PRIMUS_TEAM and PRIMUS_USER environment variables

Reviewed changes

Copilot reviewed 299 out of 670 changed files in this pull request and generated no comments.

File	Description
examples/run_slurm_pretrain.sh	Generates unique timestamped directories and exports TIMESTAMP for multi-node consistency
examples/run_pretrain.sh	Creates timestamped output paths when not set by SLURM wrapper, with conditional logic for single vs multi-node
examples/run_local_pretrain.sh	Adds MLflow and output directory environment variables to Docker/Podman containers
mlruns/*	MLflow experiment tracking artifacts (auto-generated metadata and parameters)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

gphuang added 19 commits December 18, 2025 08:52

docs: Fix TraceLens CSV format description (multiple files per rank)

bbfa9d3

fix: Remove unsupported HTML format option from TraceLens

0759122

fix: Use specific trace file patterns to avoid matching unrelated JSO…

4c908e5

…N files

Merge feat/6 + feat/12: Combined MLflow upload + TraceLens analysis

a2eb61c

docs: Clarify MLflow upload defaults are opt-out when MLflow enabled

677f908

feat: Enable TraceLens by default with one report per node

cd7933c

minor fix

5ba64ab

Copilot AI review requested due to automatic review settings January 15, 2026 09:01

Copilot AI reviewed Jan 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat/combined test #495

Feat/combined test #495

Uh oh!

gphuang commented Jan 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Feat/combined test #495

Are you sure you want to change the base?

Feat/combined test #495

Uh oh!

Conversation

gphuang commented Jan 15, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants