flashinfer.testing.attention_tb_per_sec_with_actual_seq_lens

flashinfer.testing.attention_tb_per_sec_with_actual_seq_lens(actual_seq_lens_q, actual_seq_lens_kv, head_dim_qk, head_dim_vo, num_qo_heads, num_kv_heads, time, q_dtype=torch.bfloat16, kv_dtype=torch.bfloat16, o_dtype=torch.bfloat16)

Calculate the throughput, in TB per second, achieved by an attention layer, using the actual sequence lengths. Sequence lengths may differ within the batch; they are not assumed to be uniform.

Parameters:
  • actual_seq_lens_q (torch.Tensor) – Array of actual sequence lengths of the query.

  • actual_seq_lens_kv (torch.Tensor) – Array of actual sequence lengths of the key and value.

  • head_dim_qk (int) – Head dimension of the query and key.

  • head_dim_vo (int) – Head dimension of the value and output.

  • num_qo_heads (int) – Number of query heads.

  • num_kv_heads (int) – Number of key and value heads.

  • time (float) – Execution time in milliseconds.

  • q_dtype (torch.dtype) – Data type of the query.

  • kv_dtype (torch.dtype) – Data type of the key and value.

  • o_dtype (torch.dtype) – Data type of the output.

Returns:

tb_per_sec – TB per second for the layer.

Return type:

float
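Example:

The following is a minimal sketch of how such a figure could be derived from the parameters listed above. It is not the library's implementation. It assumes the metric counts the bytes read for the query, key, and value tensors plus the bytes written for the output, summed over the actual sequence lengths. The function name estimate_attention_tb_per_sec and the *_bytes_per_elem arguments are hypothetical stand-ins (plain integers are used in place of torch.dtype values so that the sketch runs without PyTorch; 2 bytes corresponds to bfloat16).

```python
def estimate_attention_tb_per_sec(
    seq_lens_q,
    seq_lens_kv,
    head_dim_qk,
    head_dim_vo,
    num_qo_heads,
    num_kv_heads,
    time_ms,
    q_bytes_per_elem=2,   # assumption: 2 bytes per element, as for bfloat16
    kv_bytes_per_elem=2,
    o_bytes_per_elem=2,
):
    """Hypothetical estimate of attention memory traffic in TB per second."""
    # Total token counts across the batch; lengths may differ per sequence.
    total_q = sum(seq_lens_q)
    total_kv = sum(seq_lens_kv)

    # Assumed traffic model: Q, K, and V are each read once, O is written once.
    q_bytes = total_q * num_qo_heads * head_dim_qk * q_bytes_per_elem
    k_bytes = total_kv * num_kv_heads * head_dim_qk * kv_bytes_per_elem
    v_bytes = total_kv * num_kv_heads * head_dim_vo * kv_bytes_per_elem
    o_bytes = total_q * num_qo_heads * head_dim_vo * o_bytes_per_elem
    total_bytes = q_bytes + k_bytes + v_bytes + o_bytes

    # Convert bytes to TB (1e12 bytes) and milliseconds to seconds.
    return (total_bytes / 1e12) / (time_ms / 1e3)


# A decode-style batch: one query token per sequence, uneven KV lengths.
tb_per_sec = estimate_attention_tb_per_sec(
    seq_lens_q=[1, 1],
    seq_lens_kv=[1000, 3000],
    head_dim_qk=128,
    head_dim_vo=128,
    num_qo_heads=32,
    num_kv_heads=8,
    time_ms=0.1,
)
print(f"{tb_per_sec:.4f} TB/s")
```

Under this traffic model the key and value terms dominate whenever the KV sequences are much longer than the query sequences, which is why the per-sequence lengths are summed rather than taken from a single padded maximum.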