
[Inductor] Generate triton block pointers for discontiguous strided tensors #125077

Open · blaine-rister opened this issue Apr 26, 2024 · 7 comments
Labels: module: inductor, oncall: pt2, triaged

blaine-rister (Contributor) commented Apr 26, 2024

🚀 The feature, motivation and pitch

I ran the following program to test what triton code is generated from a discontiguous tensor:

import sys
import os
import logging
import torch
from torch._inductor import config as inductor_config

# Enable debug logging
os.environ["TORCH_COMPILE_DEBUG"] = "1"
torch._logging.set_logs(inductor=logging.DEBUG)

# Log to stdout
handler = logging.StreamHandler(sys.stdout)
for logger in torch._dynamo.logging.get_loggers():
    logger.addHandler(handler)

inductor_config.triton.use_block_ptr = True

def foo(x, y):
    return x + y

device = torch.device('cuda')
orig_size = (32, 32)
view_size = (32, 8)
orig = torch.randn(orig_size).to(device)
view = torch.as_strided(orig, view_size, orig.stride())

compiled_foo = torch.compile(foo, backend="inductor")
compiled_foo(view, view)

The generated kernel was:

@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 256
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
    tmp0 = tl.load(in_ptr0 + (x0 + (32*x1)), xmask)
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[256], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32), boundary_check=[0])

It seems like Inductor generates a block pointer for the output but falls back to standard pointers for the input, whereas if I don't call torch.as_strided on the input, I see block pointers for both.

I am wondering if it's possible for inductor to generate something like this instead:

@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 8], strides=[32, 1], block_shape=[32, XBLOCK], order=[1, 0], offsets=[0, xoffset]), boundary_check=[1]).to(tl.float32)
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[32, 8], strides=[32, 1], block_shape=[32, XBLOCK], order=[1, 0], offsets=[0, xoffset]), tl.broadcast_to(tmp1, [32, XBLOCK]).to(tl.float32), boundary_check=[1])

This would use the strides argument to tl.make_block_ptr to express that the input tensor is discontiguous. On GPUs, this could avoid the address calculation using division and modulo, which might yield some performance benefit. There is probably a much bigger win for accelerators like MTIA with simpler memory systems, where this code maps very naturally to DMA engines. Without this, simpler accelerators might have a tough time handling padding between the rows of a tensor.
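
As a concrete illustration of where the div/modulo comes from, here is a small standalone snippet (plain PyTorch, separate from the repro above; the index 100 is just an arbitrary example):

import torch

# The (32, 8) view keeps the parent's row stride of 32, so consecutive rows of the
# view are 32 elements apart in storage even though only 8 of them are used.
orig = torch.randn(32, 32)
view = torch.as_strided(orig, (32, 8), orig.stride())
print(view.stride())  # (32, 1)

# Linear element i of the view lives at storage offset (i // 8) * 32 + (i % 8),
# which is exactly the x0/x1 arithmetic the 1D kernel above has to emit.
i = 100
row, col = i // 8, i % 8
assert view[row, col] == orig.flatten()[row * 32 + col]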

Is this feature feasible? The main change I see is that here XBLOCK would refer to the columns of the input matrix, as opposed to the linear index. It would also be possible to block on rows.

Alternatives

In principle, it's possible for the triton compiler to recognize this pattern under the hood. But it seems like that would require reading a whole number of rows, i.e. XBLOCK must be a multiple of the row length. Also, the analysis could get complex when division and modulo are involved. I'm wondering if it makes more sense to handle this in Inductor.

Instead of block pointers, it's also possible to simplify the address calculation for standard pointers, such as

x0 = tl.broadcast_to(tl.expand_dims(tl.arange(xoffset, xoffset + XBLOCK), axis=0), [32,XBLOCK])
x1 = tl.broadcast_to(tl.expand_dims(tl.arange(0, 32), axis=1), [32,XBLOCK])
tl.load(in_ptr0 + x0 + x1 * 32)

which could more easily be converted to a block representation inside the triton compiler.

Additional context

No response

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire

cc @shunting314 based on offline conversations. We were hoping for input from @jansel.

blaine-rister changed the title [Inductor] generate triton block pointers for discontiguous strided tensors → [Inductor] Generate triton block pointers for discontiguous strided tensors Apr 26, 2024
shunting314 (Contributor) commented:
I think not every non-contiguous access will cause inductor to skip block_ptr.

E.g., for 'a + b.t()', here is the code inductor generates which uses block_ptr for all 3 memory accesses:

def triton_poi_fused_add_0(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 1024
    xnumel = 2048
    yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[2048, 1024], strides=[1, 2048], block_shape=[XBLOCK, YBLOCK], order=[0, 1], offsets=[xoffset, yoffset]), eviction_policy='evict_last')
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[2048, 1024], strides=[1024, 1], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), eviction_policy='evict_last')
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[2048, 1024], strides=[1, 2048], block_shape=[XBLOCK, YBLOCK], order=[0, 1], offsets=[xoffset, yoffset]), tl.broadcast_to(tmp2, [XBLOCK, YBLOCK]).to(tl.float32))

blaine-rister (Contributor, Author) commented:
Thanks, this is good context. So it seems like 2D block pointers are already possible; it's just that inductor might not take advantage of them in the case of padded rows coming from torch.as_strided.

jansel (Contributor) commented Apr 28, 2024

There is nothing special about as_strided. In that case, inductor decided to generate a 1D kernel (since both dimensions had the same contiguity) but required a 2D load. Similarly, if you have a 2D kernel but a 3D/4D load, then block_ptr won't be used.

Option 1

Change the tiling algorithm here:

def select_tiling(cls, node_schedule, numel, reduction_numel=sympy.Integer(1)):
    """
    Heuristics to decide how to tile kernels.
    Currently, we tile based on stride-1 dimensions.

If you trigger a 2D tiled kernel, then block_ptr should get used.

Option 2

Generate a 2D load, then call tl.reshape. Something like:

tl.reshape(tl.load(tl.block_ptr(block_shape=[XBLOCK//8, 8], ...)), [XBLOCK])

This would require some multiple_of guards to ensure correctness.

This would be a bit more flexible.
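
For concreteness, here is a rough sketch of what option 2 might produce for the (32, 8) view from the original repro. This is illustrative only, not actual inductor codegen; the names (in_ptr0, xoffset, XBLOCK) follow the kernels above, and it assumes XBLOCK is guarded to be a multiple of the row length (8):

# Load a [XBLOCK // 8, 8] tile through a 2D block pointer, then flatten it back to
# the kernel's 1D [XBLOCK] iteration shape. Only valid when XBLOCK % 8 == 0, hence
# the multiple_of guard mentioned above.
block = tl.load(
    tl.make_block_ptr(
        in_ptr0,
        shape=[32, 8],
        strides=[32, 1],
        block_shape=[XBLOCK // 8, 8],
        order=[1, 0],
        offsets=[xoffset // 8, 0],
    ),
    boundary_check=[0],
)
tmp0 = tl.reshape(block, [XBLOCK])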

jbschlosser added the triaged label Apr 29, 2024
blaine-rister (Contributor, Author) commented Apr 29, 2024

Thanks @jansel for the suggestions. I can take a shot at this. Would option 2 break the requirement that tiling dims == block pointer dims? That seems preferable, but I might attempt option 1 first just to get things working.

blaine-rister self-assigned this Apr 29, 2024
jansel (Contributor) commented Apr 30, 2024

Would option 2 break the requirement that tiling dims == block pointer dims?

Yes, that is what I meant by "This would be a bit more flexible."

blaine-rister (Contributor, Author) commented May 1, 2024

I think I have a reasonable draft of option 2. It pattern matches on the div/modulo indexing expression to extract the strides and offset. I'm struggling with the statically_known_multiple_of guards, though. To preserve the iteration order, it seems like we need to know that XBLOCK is a multiple of our slice size. But at least in the examples I can find, those guards seem to apply to TRITON_MAX_BLOCK["X"]. We could know that the maximum block is safe to use, but what about the minimum block?

Instead of shape guards, would it be possible to use cdiv? For example,

tl.load(tl.block_ptr(block_shape=[tl.cdiv(XBLOCK,8), 8], ...)).reshape([XBLOCK])

I think this could work if we check that the iteration ranges are all powers of 2. (Is this always true?) If dim = 2 ** n and we know XBLOCK = 2 ** m, then CeilDiv(XBLOCK, dim) == 2 ** (m - n) if m >= n, else 1.
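
A quick sanity check of that power-of-two claim in plain Python (illustrative only, not inductor code):

# Ceiling division via negative floor division.
def cdiv(a, b):
    return -(-a // b)

for m in range(8):          # XBLOCK = 2 ** m
    for n in range(8):      # dim = 2 ** n
        expected = 2 ** (m - n) if m >= n else 1
        assert cdiv(2 ** m, 2 ** n) == expected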

jansel (Contributor) commented May 2, 2024

I think there are some correctness issues with that, because the iteration order must match exactly between all loads/stores in the kernel.

The guards I was talking about would need to be on the shape of the tensor being loaded.
