Support torchrun-style InfiniTrain multi-process launch by chen2021673 · Pull Request #184 · InfiniTensor/InfiniTrain

chen2021673 · 2026-07-03T09:10:56Z

Summary

This PR adds torchrun-style multi-process launch support for InfiniTrain while preserving the existing single-process multi-thread workflow.

Background

InfiniTrain previously relied mainly on --nthread_per_process for multi-GPU runs, which creates multiple C++ threads inside a single process. PyTorch's torchrun --nproc_per_node=N instead launches N separate processes and injects rank-related environment variables such as RANK, LOCAL_RANK, WORLD_SIZE, and LOCAL_WORLD_SIZE into each one.

Profiling scripts need the InfiniTrain path to match the PyTorch process model.
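
To make the torchrun process model concrete, here is a minimal Python sketch of how a launched process might read the four variables named above. The RankInfo type and read_torchrun_env helper are hypothetical names used only for illustration, and the single-process defaults are an assumption. InfiniTrain itself is C++, and this is not its implementation.

```python
import os
from typing import Mapping, NamedTuple


class RankInfo(NamedTuple):
    """Rank information as exposed by torchrun-style environment variables."""
    rank: int              # global rank across all nodes
    local_rank: int        # rank within the current node
    world_size: int        # total number of processes
    local_world_size: int  # number of processes on the current node


def read_torchrun_env(env: Mapping[str, str] = os.environ) -> RankInfo:
    """Read torchrun-style variables; missing ones fall back to a
    single-process layout (an assumption made for this sketch)."""
    return RankInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        local_world_size=int(env.get("LOCAL_WORLD_SIZE", "1")),
    )
```

A process started directly, without a launcher, sees none of these variables and would therefore behave as rank 0 of a world of size 1 in this sketch.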

Changes

Update infini_run to:
- support -- as the launcher/training-args separator
- launch nproc_per_node child processes
- inject both InfiniTrain and torchrun-compatible rank env vars
- propagate child process failures via exit code
Update parallel runtime to:
- read torchrun-compatible env vars as fallback
- validate process topology and rank bounds
- map local process/thread rank to CUDA device index
Update GPT-2/Llama3 examples and parallel helpers to use local-device mapping.
Update scripts/run_models_and_profile.bash to:
- always launch model commands through infini_run
- treat nproc_per_node as launcher-only config
- keep nthread_per_process as the per-process thread count
Update scripts/test_config.json to use multi-process configs:
- 8-thread cases become nproc_per_node=8, nthread_per_process=1
- original 4-rank VPP cases become nproc_per_node=4, nthread_per_process=1
Add documentation describing behavior, compatibility, and example usage.
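
The launcher behavior listed above (the -- separator, one child per local rank, rank env injection, exit-code propagation) can be sketched roughly as follows. This is an illustrative Python sketch for a single node, not the actual infini_run code. The function names are hypothetical, only the torchrun-compatible variables are set because the InfiniTrain-specific names are not given in this description, and returning the first non-zero exit code is an assumed policy.

```python
import os
import subprocess
import sys
from typing import Dict, List, Tuple


def split_launcher_args(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first "--" into (launcher args, training command)."""
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    return argv, []


def child_env(local_rank: int, nproc_per_node: int,
              base: Dict[str, str]) -> Dict[str, str]:
    """Build the environment for one child (single node, so RANK == LOCAL_RANK)."""
    env = dict(base)
    env.update({
        "RANK": str(local_rank),
        "LOCAL_RANK": str(local_rank),
        "WORLD_SIZE": str(nproc_per_node),
        "LOCAL_WORLD_SIZE": str(nproc_per_node),
    })
    return env


def launch(nproc_per_node: int, command: List[str]) -> int:
    """Start nproc_per_node children and return the first non-zero exit code."""
    procs = [
        subprocess.Popen(command, env=child_env(r, nproc_per_node, dict(os.environ)))
        for r in range(nproc_per_node)
    ]
    exit_code = 0
    for rank, proc in enumerate(procs):
        code = proc.wait()
        if code != 0:
            print(f"rank {rank} exited with code {code}", file=sys.stderr)
            if exit_code == 0:
                exit_code = code  # assumed policy: report the first failure
    return exit_code
```

A real launcher would likely also terminate the remaining children once one of them fails; that is left out here to keep the sketch short.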

Compatibility

Existing direct runs remain supported:

./llama3 ... --nthread_per_process 8

The launcher can also preserve the old single-process multi-thread behavior:

./infini_run --nproc_per_node=1 ./llama3 ... --nthread_per_process 8

The recommended single-node 8-GPU multi-process usage is:

./infini_run --nproc_per_node=8 ./llama3 ... --nthread_per_process 1
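
One way to see how the configurations above could land on GPUs is the following Python sketch of topology validation and device mapping. The formula local_rank * nthread_per_process + thread_rank is an assumption chosen because it covers both the multi-process and the multi-thread cases; the description does not state the exact mapping InfiniTrain uses, and cuda_device_index is a hypothetical name.

```python
def cuda_device_index(local_rank: int, thread_rank: int,
                      nproc_per_node: int, nthread_per_process: int,
                      device_count: int) -> int:
    """Map a (local process rank, thread rank) pair to a CUDA device index.

    Assumed layout: each process owns a contiguous block of
    nthread_per_process devices, one per thread.
    """
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError(f"local_rank {local_rank} out of range [0, {nproc_per_node})")
    if not 0 <= thread_rank < nthread_per_process:
        raise ValueError(f"thread_rank {thread_rank} out of range [0, {nthread_per_process})")
    if nproc_per_node * nthread_per_process > device_count:
        raise ValueError("topology needs more devices than are available")
    return local_rank * nthread_per_process + thread_rank
```

Under this assumed layout, --nproc_per_node=8 with --nthread_per_process 1 gives each process the device equal to its local rank, while --nproc_per_node=1 with --nthread_per_process 8 gives each thread the device equal to its thread index.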

Test

Commit 6fae6b7: Support torchrun-style InfiniTrain multi-process launch

chen2021673 wants to merge 1 commit into master from 8_proc
