UDF mode process_frame_stack(stack) #1506

Open
matbryan52 opened this issue Aug 25, 2023 · 8 comments

@matbryan52
Member

I seem to frequently encounter a situation where I am implementing a UDF which could quite easily be vectorized along a stack of whole frames. For example, applying an FFT or some kind of filter always needs the whole frame at once, so I can't use process_tile, but I could easily process a whole stack of frames in one operation. A lot of image processing functions can be applied to the last two axes of a stack (e.g. numpy.fft.fft2) for (in some cases) free speedups. This is particularly true on GPU for convolution (think of (nBatch, h, w, nFeatures) in deep learning).

I think this could essentially be forced even with the current API by playing with tileshape negotiation, i.e. implement process_tile(tile) but require tiles which span whole frames. It's a bit clunky, though. There might be space for a process_frame_stack method, given that any process_tile function should automatically work as a process_frame[stack](). It would be similar to process_partition, but with some mechanism to restrict the size of a stack to less than a whole partition.
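As a sketch of what I have in mind (hypothetical API - a process_frame_stack method does not exist today, and the buffer handling here just assumes the process_frame conventions carry over):

import numpy as np
from libertem.udf import UDF

class StackFFTUDF(UDF):
    def get_result_buffers(self):
        # one complex spectrum per frame, shaped like the signal dimensions
        return {
            'spectrum': self.buffer(
                kind='nav',
                extra_shape=tuple(self.meta.dataset_shape.sig),
                dtype=np.complex64,
            ),
        }

    def process_frame_stack(self, stack):
        # `stack` would be a batch of whole frames, shape (depth, sig_y, sig_x);
        # fft2 vectorizes over the last two axes, so one call handles the batch
        self.results.spectrum[:] = np.fft.fft2(stack)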

How this could be concretely provided to the user, I am not yet sure. I was wondering if this has ever been discussed or thought about?

@sk1p
Member

sk1p commented Aug 25, 2023

I was wondering if this has ever been discussed or thought about?

I have indeed thought about this before, most recently when working on the WDD UDF - there, compression is currently done frame-by-frame (so with two matrix-vector operations). It's clear that this can be improved upon by compressing multiple frames at once (which is then two matrix-matrix operations), but it was less clear to me how this would work for arbitrary tiles (it should be possible to "decompose" the compression such that it works on tiles, but it is of course much less complex to just work on full stacks of frames). /cc @bangunarya

Correct - this process_frame_stack method would sit between process_partition and process_frame, meaning we would still have a somewhat reasonable total size for the frame stack, rather than just pushing full partitions through it.

I think we may want to start with a benchmark, to understand the performance better - one reason why the interface is the way it is currently is to kind-of match the I/O size to fit into the L3 cache, which works out for (some) individual frames, or deep-but-slim tiles, but probably no longer for stacks of full frames. Did you benchmark the fft2 case? It would be interesting to see the gains over process_frame for different detector sizes, and if there's a cliff for >L3 which we have to overcome with more gains by means of a larger stack.
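As a back-of-envelope illustration of the cache argument (assuming float32 data and a 32 MiB L3 - both placeholder numbers, the real figures depend on the machine):

def stack_fits_l3(stack_size, sig_shape, dtype_bytes=4, l3_bytes=32 * 2**20):
    # size of one stack of whole frames, compared against the cache size
    h, w = sig_shape
    return stack_size * h * w * dtype_bytes <= l3_bytes

print(stack_fits_l3(16, (512, 512)))  # 16 MiB -> True, still fits
print(stack_fits_l3(64, (512, 512)))  # 64 MiB -> False, past the cliff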

@matbryan52
Member Author

Did you benchmark the fft2 case?

Here is a benchmark for np.fft, cp.fft and pyfftw.interfaces.scipy_fftpack:

[image: benchmark plot of fft2 on stacks vs. single frames for np.fft, cp.fft and pyfftw across frame sizes]

with code like this (where xp is the array module, numpy or cupy, and fft the corresponding FFT module):

for stack_size in stack_sizes:
    for sig_shape in sig_shapes.keys():
        warmup_data = xp.random.uniform(size=(stack_size, *sig_shape)).astype(xp.float32)
        fft.fft2(warmup_data)
        fft.fft2(warmup_data[0])
        stack_data = xp.random.uniform(size=(stack_size, *sig_shape)).astype(xp.float32)
        res_stack = %timeit -o fft.fft2(stack_data)
        frame_data = [f.copy() for f in stack_data]
        res_frame = %timeit -o [fft.fft2(f) for f in frame_data]

For useful frame sizes, it seems like the CPU implementations don't gain much, or are even worse, when processing a stack of frames, while the CuPy implementation gains a lot (not surprising!). This doesn't account for the potential additional bonus of reading a whole batch of frames in one I/O operation, and less run_udf overhead. I'd need to implement a special tileshape + process_tile to test the full system performance.

@sk1p
Member

sk1p commented Aug 27, 2023

Here is a benchmark for np.fft, cp.fft and pyfftw.interfaces.scipy_fftpack:

👍

for stack_size in stack_sizes:
    for sig_shape in sig_shapes.keys():
        warmup_data = xp.random.uniform(size=(stack_size, *sig_shape)).astype(xp.float32)
        fft.fft2(warmup_data)
        fft.fft2(warmup_data[0])
        stack_data = xp.random.uniform(size=(stack_size, *sig_shape)).astype(xp.float32)
        res_stack = %timeit -o fft.fft2(stack_data)
        frame_data = [f.copy() for f in stack_data]
        res_frame = %timeit -o [fft.fft2(f) for f in frame_data]

I don't think this benchmark will give you correct results for the CuPy case - operations like cp.fft.fft2 are asynchronous. For accurate results, I think you need to use the benchmark function instead (or write a synchronizing wrapper, but that looks tricky).
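For illustration, a minimal sketch of such a measurement - assuming the function in question is cupyx.profiler.benchmark, which synchronizes the device after each run and reports CPU and GPU times separately:

import cupy as cp
from cupyx.profiler import benchmark

stack_data = cp.random.uniform(size=(16, 512, 512)).astype(cp.float32)
# benchmark() takes care of warmup and device synchronization, so the
# reported GPU time covers the actual kernel execution, not just the launch
print(benchmark(cp.fft.fft2, (stack_data,), n_repeat=100))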

This doesn't account for the potential additional bonus of reading a whole batch of frames in one I/O operation, ...

... which, on CPU, has to be weighed against falling off the L3 cache cliff.

... and less run_udf overhead. I'd need to implement a special tileshape + process_tile to test the full system performance.

I think a full perf benchmark would indeed be interesting, as there are many components at play.

A quick way to hack this is to include a process_frame no-op UDF in the list of UDFs, which should switch the base sig shape to the sig shape of the data set. Then you can write a process_tile UDF, which can influence the tiling depth using UDF.get_tiling_preferences. That should at least allow influencing the minimum depth - the data source is still allowed to give a larger tiling depth (and smaller at the edges, of course), so I'd advise testing with a generic data source like RawFileDataSet.
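To make that concrete, a rough sketch (assuming the current UDF API; the depth of 32 and the buffer names are placeholders):

from libertem.udf import UDF

class NoOpFrameUDF(UDF):
    # included in the run only to force whole-frame base tile shapes
    def get_result_buffers(self):
        return {}

    def process_frame(self, frame):
        pass

class StackedTileUDF(UDF):
    def get_result_buffers(self):
        return {'total': self.buffer(kind='sig', dtype='float32')}

    def get_tiling_preferences(self):
        # ask for tiles at least 32 frames deep; the data source may still
        # deliver deeper tiles (and shallower ones at partition boundaries)
        return {'depth': 32, 'total_size': UDF.TILE_SIZE_MAX}

    def process_tile(self, tile):
        # combined with NoOpFrameUDF, each tile should span whole frames,
        # so `tile` has shape (depth, sig_y, sig_x)
        self.results.total += tile.sum(axis=0)

# ctx.run_udf(dataset=ds, udf=[NoOpFrameUDF(), StackedTileUDF()])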

@matbryan52
Member Author

I don't think this benchmark will give you correct results for the CuPy case - operations like cp.fft.fft2 are asynchronous. For accurate results, I think you need to use the benchmark function instead (or write a synchronizing wrapper, but that looks tricky).

Thanks for the advice - I'm new to GPU perf benchmarking! I will adapt the code once I'm back from vacation, and integrate it into a full UDF example while I'm at it 🤞

@bangunarya

bangunarya commented Aug 28, 2023

Hi @matbryan52, @sk1p - yes, it would be very interesting if we could compress full stacks of frames. I am wondering why we could not directly implement broadcasting in this case, @sk1p?

@matbryan52
Member Author

Here is the same graph but running fft2 through the UDF interface on frames / stacks of frames. Info:

  • inline context
  • 1 partition of 512 frames in each dataset
  • uses the standard RawFileDataSet for the process_frame case, and a hacked raw dataset to force a given tileshape using adjust_tileshape (the tileshape was verified inside the UDF)
  • For the cupy case I decided to just use a call to array.get() to force synchronisation - this adds the overhead of transferring the result, but is actually quite a realistic use within a UDF, whether working on a stack or individual frames!

[image: fft2 timing through the UDF interface, frames vs. stacks of frames]

As before, the results are only interesting for smaller frames, and then only really with cupy and pyfftw. That said, the code is not optimised for each backend (pinned memory, byte-aligned arrays, pre-planned FFTs etc.). It's clear that there would be no benefit to going beyond stack sizes of ~16 for (512, 512) frames, and then only in specific, data-size-dependent cases.

I think for this to be implementable there would have to be a clear way for a UDF to have multiple implementations (tile, frame, stack, partition) and choose which to use at runtime based on the sig_shape and proposed partition size. It's hard to see how this could be done and also maintain the current system of tileshape negotiation (which is influenced by the dataset and the UDF(s)).

Instead, a pragmatic option could be a mode where (for performance and/or algorithmic reasons) a UDF can fully specify the tileshape it wants to receive at runtime, ignoring things like ds.max_io_size and ds.needs_decode, with the constraint that the UDF cannot be run in parallel with other UDFs. That said, for this kind of algorithm where a stack is necessary, it might be easier to use process_partition and handle the stacking internally to the UDF.
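For comparison, the process_partition route with internal stacking could look roughly like this (a sketch; the stack size of 16 is just the sweet spot suggested by the benchmark above):

import numpy as np
from libertem.udf import UDF

class PartitionStackFFTUDF(UDF):
    STACK_SIZE = 16

    def get_result_buffers(self):
        return {'spectrum_sum': self.buffer(kind='sig', dtype=np.complex64)}

    def process_partition(self, partition):
        # partition has shape (num_frames, sig_y, sig_x); slice it into
        # fixed-size stacks of whole frames and vectorize fft2 over each
        for start in range(0, partition.shape[0], self.STACK_SIZE):
            stack = partition[start:start + self.STACK_SIZE]
            self.results.spectrum_sum += np.fft.fft2(stack).sum(axis=0)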

@sk1p
Member

sk1p commented Sep 13, 2023

Hi @matbryan52, @sk1p - yes, it would be very interesting if we could compress full stacks of frames. I am wondering why we could not directly implement broadcasting in this case, @sk1p?

I don't really remember the details here; I think the conclusion was that the current per-frame method was already quite fast. Maybe once Ptychography-4-0/ptychography#72 is merged, we can do a follow-up on performance, with both stack-of-frames and full-partition considerations, and maybe have another go at a GPU version, too.

@sk1p
Member

sk1p commented Sep 19, 2023

Here is the same graph but running fft2 through the UDF interface on frames / stacks of frames. [...]

Thanks for the updated graph! I haven't fully digested it yet, but one interesting outcome is the notable difference between np and pyfftw. I think part of it may be threading and plan caching: we disable plan caching for pyfftw for correctness because of LiberTEM/LiberTEM-blobfinder#35 - so if we do per-frame processing, we probably need to re-create the plan for each call. Also, using the inline executor by default allows threads in numeric libraries, so pyfftw will probably have more data to munch per thread with stacks? In any case, interesting!

I think for this to be implementable there would have to be a clear way for a UDF to have multiple implementations (tile, frame, stack, partition) and choose which to use at runtime based on the sig_shape and proposed partition size. It's hard to see how this could be done and also maintain the current system of tileshape negotiation (which is influenced by the dataset and the UDF(s)).

Instead a pragmatic option could be a mode where (for performance and/or algorithmic reasons) a UDF can fully specify the tileshape it wants to receive at runtime, ignoring things like ds.max_io_size and ds.needs_decode, with the constraint that the UDF cannot be run in parallel with other UDFs. That said, for this kind of algorithm where a stack is necessary it might be easier to use process_partition and handle this internal to the UDF.

This seems to have some intersection with #1353 and related plans. One thought about multiple UDFs and compatibility/performance: right now, if we combine a process_partition and a process_frame UDF, I/O will read the whole partition, then run process_frame for each frame in the partition, and process_partition once on the whole thing.

An improvement on this would be to read with a by-frame scheme, use that to run the process_frame UDF, assemble the frames into a full partition, and then run the process_partition UDF afterwards. This could even be extended to tiled processing, reading a whole "block" of tiles (meaning read tiles until the full sig shape has been read) before running the per-frame UDFs. Then all UDFs can be fed approximately the shapes they are happy with, other than disagreements between two per-tile UDFs.
