API Reference

This page is generated directly from Python docstrings.

Triton Interface

flash_sparse_attn.ops.triton.interface

flash_dense_attn_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, is_causal: bool = False, softmax_scale: Optional[float] = None, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, is_split_kv: bool = False, pack_gqa: bool = False, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash dense attention function that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [batch_size, seqlen_q, num_heads, head_dim].

required
key Tensor

Key tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
value Tensor

Value tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
is_causal bool

Whether to apply a causal mask.

False
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether to quantize inputs to FP8 for attention computation. If True, query_scale, key_scale, and value_scale must be provided or will be computed from the input tensors.

False
is_split_kv bool

Whether to enable split-KV scheduling to improve GPU occupancy.

False
pack_gqa bool

Whether to pack grouped-query attention (GQA), processing query heads that share a KV head together.

False
out Optional[Tensor]

Optional preallocated output tensor with shape [batch_size, seqlen_q, num_heads, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [batch_size, num_heads, seqlen_q].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [batch_size, seqlen_q, num_heads, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [batch_size, num_heads, seqlen_q].
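As a usage illustration, here is a minimal sketch of a batched call. The import path and keyword names come from the signature above; the tensor sizes and float16 dtype are illustrative assumptions, and the attention call itself requires a CUDA device.

```python
# Minimal sketch of flash_dense_attn_func; sizes and dtype are illustrative.
try:
    import torch
    _run_gpu = torch.cuda.is_available()
except ImportError:  # keep the sketch importable without torch
    _run_gpu = False

batch_size, seqlen, num_heads, num_kv_heads, head_dim = 2, 1024, 8, 2, 64

# Documented default: softmax_scale=None falls back to 1/sqrt(head_dim).
default_softmax_scale = head_dim ** -0.5

if _run_gpu:
    from flash_sparse_attn.ops.triton.interface import flash_dense_attn_func

    q = torch.randn(batch_size, seqlen, num_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    k = torch.randn(batch_size, seqlen, num_kv_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    v = torch.randn_like(k)

    # Causal attention, returning both the output and the logsumexp.
    out, lse = flash_dense_attn_func(q, k, v, is_causal=True, return_lse=True)
    assert out.shape == (batch_size, seqlen, num_heads, head_dim)
    assert lse.shape == (batch_size, num_heads, seqlen)
```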

flash_dense_attn_with_kvcache_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, softmax_scale: Optional[float] = None, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash dense attention function for decoding with KV cache that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [batch_size, num_heads, head_dim].

required
key Tensor

Key tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
value Tensor

Value tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether the inputs are quantized in FP8. If True, query_scale, key_scale, and value_scale must be provided for dequantization.

False
out Optional[Tensor]

Optional preallocated output tensor with shape [batch_size, num_heads, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [batch_size, num_heads].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [batch_size, num_heads, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [batch_size, num_heads].
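The decode path differs from the batched path in one important way: the query carries no seqlen_q dimension. A hedged sketch, with illustrative sizes and dtype (a CUDA device is required for the actual call):

```python
# Minimal sketch of a single-token decode step against a KV cache.
try:
    import torch
    _run_gpu = torch.cuda.is_available()
except ImportError:  # keep the sketch importable without torch
    _run_gpu = False

batch_size, cache_len, num_heads, num_kv_heads, head_dim = 4, 2048, 16, 4, 128

# With GQA, each KV head serves num_heads // num_kv_heads query heads.
gqa_group_size = num_heads // num_kv_heads

if _run_gpu:
    from flash_sparse_attn.ops.triton.interface import (
        flash_dense_attn_with_kvcache_func,
    )

    # Note: the query is [batch_size, num_heads, head_dim] -- no seqlen_q axis.
    q = torch.randn(batch_size, num_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    k_cache = torch.randn(batch_size, cache_len, num_kv_heads, head_dim,
                          device="cuda", dtype=torch.float16)
    v_cache = torch.randn_like(k_cache)

    out = flash_dense_attn_with_kvcache_func(q, k_cache, v_cache)
    assert out.shape == (batch_size, num_heads, head_dim)
```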

flash_dense_attn_varlen_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, cu_seqlens_q: torch.Tensor, cu_seqlens_k: torch.Tensor, max_seqlen_q: int, max_seqlen_k: int, is_causal: bool = False, softmax_scale: Optional[float] = None, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, seqused_q: Optional[torch.Tensor] = None, seqused_k: Optional[torch.Tensor] = None, is_split_kv: bool = False, pack_gqa: bool = False, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash dense attention function for variable-length sequences that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [total_seqlen_q, num_heads_q, head_dim].

required
key Tensor

Key tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
value Tensor

Value tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
cu_seqlens_q Tensor

Cumulative sequence lengths for queries, shape [batch_size + 1].

required
cu_seqlens_k Tensor

Cumulative sequence lengths for keys/values, shape [batch_size + 1].

required
max_seqlen_q int

Maximum sequence length for queries.

required
max_seqlen_k int

Maximum sequence length for keys/values.

required
is_causal bool

Whether to apply a causal mask.

False
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether to quantize inputs to FP8 for attention computation. If True, query_scale, key_scale, and value_scale must be provided or will be computed from the input tensors.

False
seqused_q Optional[Tensor]

Optional tensor of shape [batch_size] giving the actual used sequence length of each query segment. If provided, overrides cu_seqlens_q for masking.

None
seqused_k Optional[Tensor]

Optional tensor of shape [batch_size] giving the actual used sequence length of each key/value segment. If provided, overrides cu_seqlens_k for masking.

None
is_split_kv bool

Whether to enable split-KV scheduling to improve GPU occupancy.

False
pack_gqa bool

Whether to pack grouped-query attention (GQA), processing query heads that share a KV head together.

False
out Optional[Tensor]

Optional preallocated output tensor with shape [total_seqlen_q, num_heads_q, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [total_seqlen_q, num_heads_q].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [total_seqlen_q, num_heads_q, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [total_seqlen_q, num_heads_q].
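For the varlen path, sequences of different lengths are packed back to back along the first axis, and cu_seqlens_* hold the exclusive prefix sums of the per-sequence lengths with a leading 0 (hence shape [batch_size + 1]). A hedged sketch, with illustrative lengths and dtype:

```python
# Packing two sequences of different lengths for flash_dense_attn_varlen_func.
try:
    import torch
    _run_gpu = torch.cuda.is_available()
except ImportError:  # keep the sketch importable without torch
    _run_gpu = False

seqlens = [512, 384]                       # per-sequence lengths
cu_seqlens = [0]
for n in seqlens:
    cu_seqlens.append(cu_seqlens[-1] + n)  # prefix sums with leading 0
total_seqlen = cu_seqlens[-1]
max_seqlen = max(seqlens)

num_heads, num_kv_heads, head_dim = 8, 2, 64

if _run_gpu:
    from flash_sparse_attn.ops.triton.interface import flash_dense_attn_varlen_func

    cu = torch.tensor(cu_seqlens, device="cuda", dtype=torch.int32)
    q = torch.randn(total_seqlen, num_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    k = torch.randn(total_seqlen, num_kv_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    v = torch.randn_like(k)

    out = flash_dense_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu, cu_seqlens_k=cu,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        is_causal=True,
    )
    assert out.shape == (total_seqlen, num_heads, head_dim)
```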

flash_dense_attn_varlen_with_kvcache_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, cu_seqlens_k: torch.Tensor, max_seqlen_k: int, softmax_scale: Optional[float] = None, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, seqused_k: Optional[torch.Tensor] = None, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash dense attention function for variable-length decoding with KV cache that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [batch_size, num_heads_q, head_dim].

required
key Tensor

Key tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
value Tensor

Value tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
cu_seqlens_k Tensor

Cumulative sequence lengths for keys/values, shape [batch_size + 1].

required
max_seqlen_k int

Maximum sequence length for keys/values.

required
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether the inputs are quantized in FP8. If True, query_scale, key_scale, and value_scale must be provided for dequantization.

False
seqused_k Optional[Tensor]

Optional tensor indicating the actual sequence lengths for keys/values.

None
out Optional[Tensor]

Optional preallocated output tensor with shape [batch_size, num_heads_q, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [batch_size, num_heads_q].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [batch_size, num_heads_q, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [batch_size, num_heads_q].

flash_sparse_attn_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, is_causal: bool = False, softmax_scale: Optional[float] = None, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, softmax_threshold: Optional[float] = None, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, is_split_kv: bool = False, pack_gqa: bool = False, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash sparse attention function that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [batch_size, seqlen_q, num_heads, head_dim].

required
key Tensor

Key tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
value Tensor

Value tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
is_causal bool

Whether to apply a causal mask.

False
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
softmax_threshold Optional[float]

Optional threshold for the sparse softmax. If None, defaults to head_dim / seqlen_k.

None
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether to quantize inputs to FP8 for attention computation. If True, query_scale, key_scale, and value_scale must be provided or will be computed from the input tensors.

False
is_split_kv bool

Whether to enable split-KV scheduling to improve GPU occupancy.

False
pack_gqa bool

Whether to pack grouped-query attention (GQA), processing query heads that share a KV head together.

False
out Optional[Tensor]

Optional preallocated output tensor with shape [batch_size, seqlen_q, num_heads, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [batch_size, num_heads, seqlen_q].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [batch_size, seqlen_q, num_heads, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [batch_size, num_heads, seqlen_q].
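The sparse variant adds one knob over the dense path: softmax_threshold, which per the parameter table defaults to head_dim / seqlen_k when left as None. A hedged sketch showing that default made explicit (sizes and dtype are illustrative; the call requires a CUDA device):

```python
# Minimal sketch of flash_sparse_attn_func with an explicit softmax_threshold.
try:
    import torch
    _run_gpu = torch.cuda.is_available()
except ImportError:  # keep the sketch importable without torch
    _run_gpu = False

batch_size, seqlen_q, seqlen_k, num_heads, head_dim = 2, 1024, 1024, 8, 64

# Documented default when softmax_threshold is None.
softmax_threshold = head_dim / seqlen_k

if _run_gpu:
    from flash_sparse_attn.ops.triton.interface import flash_sparse_attn_func

    q = torch.randn(batch_size, seqlen_q, num_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    k = torch.randn(batch_size, seqlen_k, num_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    v = torch.randn_like(k)

    out = flash_sparse_attn_func(
        q, k, v,
        is_causal=True,
        softmax_threshold=softmax_threshold,  # or None to use the default
    )
    assert out.shape == (batch_size, seqlen_q, num_heads, head_dim)
```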

flash_sparse_attn_with_kvcache_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, softmax_scale: Optional[float] = None, softmax_threshold: Optional[float] = None, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash sparse attention function for decoding with KV cache that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [batch_size, num_heads, head_dim].

required
key Tensor

Key tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
value Tensor

Value tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
softmax_threshold Optional[float]

Optional threshold for the sparse softmax. If None, defaults to head_dim / seqlen_k.

None
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether the inputs are quantized in FP8. If True, query_scale, key_scale, and value_scale must be provided for dequantization.

False
out Optional[Tensor]

Optional preallocated output tensor with shape [batch_size, num_heads, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [batch_size, num_heads].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [batch_size, num_heads, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [batch_size, num_heads].

flash_sparse_attn_varlen_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, cu_seqlens_q: torch.Tensor, cu_seqlens_k: torch.Tensor, max_seqlen_q: int, max_seqlen_k: int, is_causal: bool = False, softmax_scale: Optional[float] = None, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, softmax_threshold: Optional[float] = None, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, seqused_q: Optional[torch.Tensor] = None, seqused_k: Optional[torch.Tensor] = None, is_split_kv: bool = False, pack_gqa: bool = False, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash sparse attention function for variable-length sequences that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [total_seqlen_q, num_heads_q, head_dim].

required
key Tensor

Key tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
value Tensor

Value tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
cu_seqlens_q Tensor

Cumulative sequence lengths for queries, shape [batch_size + 1].

required
cu_seqlens_k Tensor

Cumulative sequence lengths for keys/values, shape [batch_size + 1].

required
max_seqlen_q int

Maximum sequence length for queries.

required
max_seqlen_k int

Maximum sequence length for keys/values.

required
is_causal bool

Whether to apply a causal mask.

False
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
softmax_threshold Optional[float]

Optional threshold for the sparse softmax. If None, defaults to head_dim / max_seqlen_k.

None
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether to quantize inputs to FP8 for attention computation. If True, query_scale, key_scale, and value_scale must be provided or will be computed from the input tensors.

False
seqused_q Optional[Tensor]

Optional tensor of shape [batch_size] giving the actual used sequence length of each query segment. If provided, overrides cu_seqlens_q for masking.

None
seqused_k Optional[Tensor]

Optional tensor of shape [batch_size] giving the actual used sequence length of each key/value segment. If provided, overrides cu_seqlens_k for masking.

None
is_split_kv bool

Whether to enable split-KV scheduling to improve GPU occupancy.

False
pack_gqa bool

Whether to pack grouped-query attention (GQA), processing query heads that share a KV head together.

False
out Optional[Tensor]

Optional preallocated output tensor with shape [total_seqlen_q, num_heads_q, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [total_seqlen_q, num_heads_q].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [total_seqlen_q, num_heads_q, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [total_seqlen_q, num_heads_q].

flash_sparse_attn_varlen_with_kvcache_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, cu_seqlens_k: torch.Tensor, max_seqlen_k: int, softmax_scale: Optional[float] = None, softmax_threshold: Optional[float] = None, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, seqused_k: Optional[torch.Tensor] = None, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash sparse attention function for variable-length decoding with KV cache that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [batch_size, num_heads_q, head_dim].

required
key Tensor

Key tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
value Tensor

Value tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
cu_seqlens_k Tensor

Cumulative sequence lengths for keys/values, shape [batch_size + 1].

required
max_seqlen_k int

Maximum sequence length for keys/values.

required
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
softmax_threshold Optional[float]

Optional threshold for the sparse softmax. If None, defaults to head_dim / max_seqlen_k.

None
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether the inputs are quantized in FP8. If True, query_scale, key_scale, and value_scale must be provided for dequantization.

False
seqused_k Optional[Tensor]

Optional tensor indicating the actual sequence lengths for keys/values.

None
out Optional[Tensor]

Optional preallocated output tensor with shape [batch_size, num_heads_q, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [batch_size, num_heads_q].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [batch_size, num_heads_q, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [batch_size, num_heads_q].

flash_gated_attn_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, alpha: torch.Tensor, delta: torch.Tensor, is_causal: bool = False, softmax_scale: Optional[float] = None, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, softmax_threshold: Optional[float] = None, gate_threshold: Optional[float] = None, is_logsigmoid_gate: bool = True, is_adapt_gate: bool = True, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, is_split_kv: bool = False, pack_gqa: bool = False, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash gated attention function that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [batch_size, seqlen_q, num_heads, head_dim].

required
key Tensor

Key tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
value Tensor

Value tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
alpha Tensor

Tensor of shape [batch_size, seqlen_q, num_heads] representing the sparsity pattern for queries.

required
delta Tensor

Tensor of shape [batch_size, seqlen_k, num_kv_heads] representing the sparsity pattern for keys/values.

required
is_causal bool

Whether to apply a causal mask.

False
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
softmax_threshold Optional[float]

Optional threshold for the sparse softmax.

None
gate_threshold Optional[float]

Optional threshold for the sparsity gate.

None
is_logsigmoid_gate bool

Whether to use a log-sigmoid function for the sparsity gate. If False, uses a linear function.

True
is_adapt_gate bool

Whether to adapt the gate threshold based on sequence length.

True
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether to quantize inputs to FP8 for attention computation. If True, query_scale, key_scale, and value_scale must be provided or will be computed from the input tensors.

False
is_split_kv bool

Whether to enable split-KV scheduling to improve GPU occupancy.

False
pack_gqa bool

Whether to pack grouped-query attention (GQA), processing query heads that share a KV head together.

False
out Optional[Tensor]

Optional preallocated output tensor with shape [batch_size, seqlen_q, num_heads, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [batch_size, num_heads, seqlen_q].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [batch_size, seqlen_q, num_heads, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [batch_size, num_heads, seqlen_q].
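The gated variant takes two extra tensors, alpha and delta, whose shapes mirror the query and key/value head layouts respectively (shapes taken from the parameter table above). A hedged sketch with illustrative sizes and dtype (the call requires a CUDA device):

```python
# Minimal sketch of flash_gated_attn_func with its alpha/delta gate inputs.
try:
    import torch
    _run_gpu = torch.cuda.is_available()
except ImportError:  # keep the sketch importable without torch
    _run_gpu = False

batch_size, seqlen, num_heads, num_kv_heads, head_dim = 2, 512, 8, 2, 64

# Shape relationship documented above, independent of the device:
alpha_shape = (batch_size, seqlen, num_heads)     # per-query-head gate scores
delta_shape = (batch_size, seqlen, num_kv_heads)  # per-kv-head gate scores

if _run_gpu:
    from flash_sparse_attn.ops.triton.interface import flash_gated_attn_func

    q = torch.randn(batch_size, seqlen, num_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    k = torch.randn(batch_size, seqlen, num_kv_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    v = torch.randn_like(k)

    # With is_logsigmoid_gate=True (the default), gate scores pass through
    # a log-sigmoid before thresholding.
    alpha = torch.randn(*alpha_shape, device="cuda", dtype=torch.float16)
    delta = torch.randn(*delta_shape, device="cuda", dtype=torch.float16)

    out = flash_gated_attn_func(q, k, v, alpha, delta, is_causal=True)
    assert out.shape == (batch_size, seqlen, num_heads, head_dim)
```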

flash_gated_attn_with_kvcache_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, alpha: torch.Tensor, delta: torch.Tensor, softmax_scale: Optional[float] = None, softmax_threshold: Optional[float] = None, gate_threshold: Optional[float] = None, is_logsigmoid_gate: bool = True, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash gated attention function for decoding with KV cache that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [batch_size, num_heads, head_dim].

required
key Tensor

Key tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
value Tensor

Value tensor of shape [batch_size, seqlen_k, num_kv_heads, head_dim].

required
alpha Tensor

Tensor of shape [batch_size, num_heads] representing the sparsity pattern for queries.

required
delta Tensor

Tensor of shape [batch_size, seqlen_k, num_kv_heads] representing the sparsity pattern for keys/values.

required
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
softmax_threshold Optional[float]

Optional threshold for the sparse softmax.

None
gate_threshold Optional[float]

Optional threshold for the sparsity gate.

None
is_logsigmoid_gate bool

Whether to use a log-sigmoid function for the sparsity gate. If False, uses a linear function.

True
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether the inputs are quantized in FP8. If True, query_scale, key_scale, and value_scale must be provided for dequantization.

False
out Optional[Tensor]

Optional preallocated output tensor with shape [batch_size, num_heads, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [batch_size, num_heads].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [batch_size, num_heads, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [batch_size, num_heads].

flash_gated_attn_varlen_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, alpha: torch.Tensor, delta: torch.Tensor, cu_seqlens_q: torch.Tensor, cu_seqlens_k: torch.Tensor, max_seqlen_q: int, max_seqlen_k: int, is_causal: bool = False, softmax_scale: Optional[float] = None, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, softmax_threshold: Optional[float] = None, gate_threshold: Optional[float] = None, is_logsigmoid_gate: bool = True, is_adapt_gate: bool = True, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, seqused_q: Optional[torch.Tensor] = None, seqused_k: Optional[torch.Tensor] = None, is_split_kv: bool = False, pack_gqa: bool = False, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash gated attention function for variable-length sequences that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [total_seqlen_q, num_heads_q, head_dim].

required
key Tensor

Key tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
value Tensor

Value tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
alpha Tensor

Tensor of shape [total_seqlen_q, num_heads_q] representing the sparsity pattern for queries.

required
delta Tensor

Tensor of shape [total_seqlen_k, num_heads_kv] representing the sparsity pattern for keys/values.

required
cu_seqlens_q Tensor

Cumulative sequence lengths for queries, shape [batch_size + 1].

required
cu_seqlens_k Tensor

Cumulative sequence lengths for keys/values, shape [batch_size + 1].

required
max_seqlen_q int

Maximum sequence length for queries.

required
max_seqlen_k int

Maximum sequence length for keys/values.

required
is_causal bool

Whether to apply a causal mask.

False
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
softmax_threshold Optional[float]

Optional threshold for the sparse softmax.

None
gate_threshold Optional[float]

Optional threshold for the sparsity gate.

None
is_logsigmoid_gate bool

Whether to use a log-sigmoid function for the sparsity gate. If False, uses a linear function.

True
is_adapt_gate bool

Whether to adapt the gate threshold based on sequence length.

True
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether to quantize inputs to FP8 for attention computation. If True, query_scale, key_scale, and value_scale must be provided or will be computed from the input tensors.

False
seqused_q Optional[Tensor]

Optional tensor of shape [batch_size] indicating the actual number of query tokens used per sequence. If provided, overrides the lengths implied by cu_seqlens_q for masking.

None
seqused_k Optional[Tensor]

Optional tensor of shape [batch_size] indicating the actual number of key/value tokens used per sequence. If provided, overrides the lengths implied by cu_seqlens_k for masking.

None
is_split_kv bool

Whether to split the KV sequence across program instances to improve GPU occupancy.

False
pack_gqa bool

Whether to pack grouped-query attention (GQA) query heads that share a KV head into a single block for better efficiency.

False
out Optional[Tensor]

Optional preallocated output tensor with shape [total_seqlen_q, num_heads_q, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [total_seqlen_q, num_heads_q].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [total_seqlen_q, num_heads_q, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [total_seqlen_q, num_heads_q].
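The varlen API takes packed (unpadded) tensors plus cumulative sequence lengths. A minimal sketch, using made-up per-sequence lengths, of how cu_seqlens_q / cu_seqlens_k and max_seqlen_* relate to the packed layout:

```python
import numpy as np

# Hypothetical per-sequence token counts for a batch of 3 variable-length sequences.
seqlens_q = np.array([3, 5, 2])
seqlens_k = np.array([4, 5, 6])

# cu_seqlens_* is an exclusive prefix sum with a leading 0, shape [batch_size + 1].
cu_seqlens_q = np.concatenate([[0], np.cumsum(seqlens_q)])   # [0, 3, 8, 10]
cu_seqlens_k = np.concatenate([[0], np.cumsum(seqlens_k)])   # [0, 4, 9, 15]
max_seqlen_q = int(seqlens_q.max())                          # 5
max_seqlen_k = int(seqlens_k.max())                          # 6

# Packed query tensor: all sequences concatenated along dim 0 (no padding).
num_heads_q, head_dim = 4, 16
query = np.zeros((int(seqlens_q.sum()), num_heads_q, head_dim))

# Sequence i occupies rows cu_seqlens_q[i]:cu_seqlens_q[i+1] of the packed tensor.
q_seq1 = query[cu_seqlens_q[1]:cu_seqlens_q[2]]              # second sequence, 5 rows
```

The key and value tensors follow the same convention with cu_seqlens_k indexing into their first dimension.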

flash_gated_attn_varlen_with_kvcache_func(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, alpha: torch.Tensor, delta: torch.Tensor, cu_seqlens_k: torch.Tensor, max_seqlen_k: int, softmax_scale: Optional[float] = None, softmax_threshold: Optional[float] = None, gate_threshold: Optional[float] = None, is_logsigmoid_gate: bool = True, query_scale: Optional[torch.Tensor] = None, key_scale: Optional[torch.Tensor] = None, value_scale: Optional[torch.Tensor] = None, window_size: Tuple[Optional[int], Optional[int]] = (None, None), is_quant: bool = False, seqused_k: Optional[torch.Tensor] = None, out: Optional[torch.Tensor] = None, lse: Optional[torch.Tensor] = None, is_autotune: bool = False, skip_checks: bool = False, return_lse: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Flash gated attention function for variable-length decoding with KV cache that computes the attention output and optionally the logsumexp.

Parameters:

Name Type Description Default
query Tensor

Query tensor of shape [batch_size, num_heads_q, head_dim].

required
key Tensor

Key tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
value Tensor

Value tensor of shape [total_seqlen_k, num_heads_kv, head_dim].

required
alpha Tensor

Tensor of shape [batch_size, num_heads_q] representing the sparsity pattern for queries.

required
delta Tensor

Tensor of shape [total_seqlen_k, num_heads_kv] representing the sparsity pattern for keys/values.

required
cu_seqlens_k Tensor

Cumulative sequence lengths for keys/values, shape [batch_size + 1].

required
max_seqlen_k int

Maximum sequence length for keys/values.

required
softmax_scale Optional[float]

Optional scaling factor for the softmax. If None, defaults to 1/sqrt(head_dim).

None
softmax_threshold Optional[float]

Optional threshold for the sparse softmax.

None
gate_threshold Optional[float]

Optional threshold for the sparsity gate.

None
is_logsigmoid_gate bool

Whether to use a log-sigmoid function for the sparsity gate. If False, uses a linear function.

True
query_scale Optional[Tensor]

Optional per-tensor scale for FP8 query dequantization.

None
key_scale Optional[Tensor]

Optional per-tensor scale for FP8 key dequantization.

None
value_scale Optional[Tensor]

Optional per-tensor scale for FP8 value dequantization.

None
window_size Tuple[Optional[int], Optional[int]]

Optional tuple (window_size_q, window_size_k) for local attention. If None, no local masking is applied.

(None, None)
is_quant bool

Whether the inputs are quantized in FP8. If True, query_scale, key_scale, and value_scale must be provided for dequantization.

False
seqused_k Optional[Tensor]

Optional tensor of shape [batch_size] indicating the actual sequence lengths for keys/values.

None
out Optional[Tensor]

Optional preallocated output tensor with shape [batch_size, num_heads_q, head_dim].

None
lse Optional[Tensor]

Optional preallocated logsumexp tensor with shape [batch_size, num_heads_q].

None
is_autotune bool

Whether to use Triton autotuner for kernel launch configuration.

False
skip_checks bool

Whether to skip input validation checks to reduce launch overhead.

False
return_lse bool

Whether to return the logsumexp tensor for numerical stability analysis. If True, returns a tuple (out, lse). If False, returns only out.

False

Returns:

Type Description
Union[Tensor, Tuple[Tensor, Tensor]]

If return_lse is False, returns out with shape [batch_size, num_heads_q, head_dim]. If return_lse is True, returns a tuple (out, lse), where lse has shape [batch_size, num_heads_q].
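For orientation, here is a plain NumPy sketch of the decode-step math these shapes imply: each single-token query attends to its own slice of the packed KV cache, selected by cu_seqlens_k. This is reference math only, under the simplifying assumption num_heads_q == num_heads_kv; it ignores the gating (alpha/delta), windowing, and quantization options, and is not the Triton kernel.

```python
import numpy as np

def decode_attention_reference(q, k, v, cu_seqlens_k, softmax_scale=None):
    """Reference decode attention over a packed KV cache.

    q: [batch_size, num_heads, head_dim]
    k, v: [total_seqlen_k, num_heads, head_dim]
    Returns out [batch_size, num_heads, head_dim], lse [batch_size, num_heads].
    """
    if softmax_scale is None:
        softmax_scale = 1.0 / np.sqrt(q.shape[-1])
    batch_size, num_heads, _ = q.shape
    out = np.empty_like(q)
    lse = np.empty((batch_size, num_heads))
    for b in range(batch_size):
        # This sequence's KV slice in the packed cache.
        kb = k[cu_seqlens_k[b]:cu_seqlens_k[b + 1]]          # [len_b, heads, dim]
        vb = v[cu_seqlens_k[b]:cu_seqlens_k[b + 1]]
        scores = np.einsum("hd,khd->hk", q[b], kb) * softmax_scale
        m = scores.max(axis=-1, keepdims=True)
        p = np.exp(scores - m)
        denom = p.sum(axis=-1, keepdims=True)
        lse[b] = (m + np.log(denom)).squeeze(-1)
        out[b] = np.einsum("hk,khd->hd", p / denom, vb)
    return out, lse
```

Because the softmax weights for each query sum to one, feeding in a constant value tensor returns that same constant, which makes for a quick sanity check.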