flashinfer.comm.trtllm_mnnvl_ar.trtllm_mnnvl_allreduce
- flashinfer.comm.trtllm_mnnvl_ar.trtllm_mnnvl_allreduce(input: Tensor, workspace: MNNVLAllReduceFusionWorkspace, launch_with_pdl: bool, output: Tensor | None = None, strategy: MNNVLAllreduceFusionStrategy = MNNVLAllreduceFusionStrategy.AUTO) → Tensor
Perform a multi-node NVLink all-reduce operation across multiple GPUs.
This function performs an all-reduce (sum) operation using NVIDIA’s multi-node NVLink (MNNVL) technology to efficiently combine tensors across multiple GPUs and nodes.
- There are two variants, one-shot and two-shot:
- One-shot: Each rank stores its local shard to all other ranks. At the end of the communication round, each rank has received all shards and performs the reduction locally. Suitable for small data sizes; optimized for low latency.
- Two-shot: There are three steps:
1. Scatter each GPU’s input shard to the other ranks. Each rank receives all shards for a slice of tokens.
2. Each rank performs the reduction on its local tokens.
3. Each rank broadcasts its result to all ranks.
Suitable for large data sizes; optimized for balancing throughput and latency.
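The two variants can be compared with a small single-process simulation. This is only an illustrative sketch in pure Python, not the library's kernel: the helper names (`one_shot_allreduce`, `two_shot_allreduce`) are hypothetical, and the round-robin assignment of tokens to ranks in the two-shot path is an assumption made for the example. Both functions return the per-rank results, which should be identical.

```python
from typing import Dict, List

Matrix = List[List[float]]  # one shard, shaped [num_tokens][hidden_dim]


def _sum_rows(rows: List[List[float]]) -> List[float]:
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*rows)]


def one_shot_allreduce(shards: List[Matrix]) -> List[Matrix]:
    """Every rank receives every shard, then reduces all tokens locally."""
    world_size = len(shards)
    num_tokens = len(shards[0])
    results = []
    for _rank in range(world_size):
        # After the communication round, this rank holds all shards.
        reduced = [
            _sum_rows([shards[src][t] for src in range(world_size)])
            for t in range(num_tokens)
        ]
        results.append(reduced)
    return results


def two_shot_allreduce(shards: List[Matrix]) -> List[Matrix]:
    """Scatter token slices, reduce per owner, then broadcast the results."""
    world_size = len(shards)
    num_tokens = len(shards[0])
    # Step 1 (scatter): token t goes to rank t % world_size.
    # This round-robin partitioning is an assumption for illustration.
    owned = {
        rank: [t for t in range(num_tokens) if t % world_size == rank]
        for rank in range(world_size)
    }
    # Step 2 (reduce): each rank sums only the tokens it owns.
    reduced_by_token: Dict[int, List[float]] = {}
    for rank, tokens in owned.items():
        for t in tokens:
            reduced_by_token[t] = _sum_rows(
                [shards[src][t] for src in range(world_size)]
            )
    # Step 3 (broadcast): every rank ends up with all reduced tokens.
    full = [reduced_by_token[t] for t in range(num_tokens)]
    return [[row[:] for row in full] for _ in range(world_size)]


# Two ranks, three tokens, hidden_dim = 2.
shards = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]],
]
assert one_shot_allreduce(shards) == two_shot_allreduce(shards)
```

In the simulation, each rank in the one-shot path reduces every token, while each rank in the two-shot path reduces only its own slice and pays for an extra broadcast step. This mirrors the latency versus throughput trade-off described above.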
- Parameters:
input – Local input shard, shape [num_tokens, hidden_dim].
workspace – The MNNVLAllReduceFusionWorkspace to use for the operation.
launch_with_pdl – Whether to launch the kernel with PDL (Programmatic Dependent Launch).
output – Output tensor to store the result. If not provided, an empty tensor is created.
strategy – The MNNVLAllreduceFusionStrategy to use. Defaults to AUTO, which selects a variant using internal heuristics.
- Returns:
Reduced tensor, shape [num_tokens, hidden_dim]
- Return type:
Tensor