
numba's njit 350x *slower* than regular Python for random list access #9535

Closed
soerenwolfers opened this issue Apr 21, 2024 · 16 comments
Labels
question Notes an issue as a question stale Marker label for stale issues.

Comments


soerenwolfers commented Apr 21, 2024

Reporting a bug

Random list access is 350x faster with regular Python than with njit. This can be somewhat alleviated by using numba's List type, but first, this probably means that use of Python's built-in list should not be allowed in nopython mode if it makes numba this slow, and second, even with List, numba is still 5x slower than regular Python.

The MWE below reproduces the numbers claimed above. Note that in my real problem the list entries are not as trivial, so I can't just use arrays without lists in my real problem.

from numba import njit
from numba.typed import List
import numpy as np  # missing from the original snippet
import timeit

def foo(n):
    return [(np.zeros(0), np.zeros(0))] * n

@njit
def typed_numba_foo(n):
    return List([(np.zeros(0), np.zeros(0))] * n)

def bar(a, b):
    for idx in b:
        _ = a[idx]  # the slow bit
    return None

natural_numba_foo = njit(foo)
numba_bar = njit(bar)

def call(method):
    if method == 0:
        numba_bar(natural_numba_foo(n), indices)
    elif method == 1:
        numba_bar(typed_numba_foo(n), indices)
    else:
        bar(foo(n), indices)

n = 100_000
indices = np.random.randint(0, n, n)
repeats = 10

for method, arg in [('Natural njit', 0), ('Typed njit', 1), ('Regular Python', 2)]:
    call(arg) # compile
    start = timeit.default_timer()
    for i in range(repeats):
        call(arg)
    print(method, "took", timeit.default_timer() - start, "s") 
Natural njit took 66.35497103600028 s
Typed njit took 0.977979459999915 s
Regular Python took 0.1892970940007217 s
@soerenwolfers soerenwolfers changed the title numba's njit 350x *slower* than regular Python numba's njit 350x *slower* than regular Python for random list access Apr 21, 2024
@stuartarchibald
Contributor

Thanks for the report. I think there may be some issues with what is actually being measured, and that is what is leading to the reported timings. For instance...

numba_bar(natural_numba_foo(n), indices)

is doing something like:

tmp_lists = natural_numba_foo(n)
numba_bar(tmp_lists, indices)

and this matters because Numba stores lists in a different format to CPython. The "natural" list (as described) is what Numba internally calls a "reflected list"; going from reflected lists back to Python lists requires a "boxing" step in which the data in the list is translated from Numba's format to CPython's. This process is quite costly: in creating that temporary variable, the lists have to be translated from Numba's format into Python's format and then from Python's format back into Numba's format again so they can be used in the bar function.

It may be a more representative test if the lists were allocated "outside" of the timed region, OR built inside the compiled code. Typed lists don't suffer from this problem so much, as their Numba representation is carried into their CPython representation; however, this means that access from the interpreter pays the performance price of the data-format translation.

If the above were fixed up, I would anticipate that the list access speeds would be much closer, and potentially even faster than what can be obtained through the Python interpreter.

Also, as a general word of warning: benchmarking functions that do very little work and then return something that has no dependency on the "work" being benchmarked can produce surprising performance results when compiled. The compiler might elect to optimise the work away entirely, since it has no impact on the returned value!

Hope this helps!

@stuartarchibald stuartarchibald added the question Notes an issue as a question label Apr 22, 2024

soerenwolfers commented Apr 22, 2024

Hi @stuartarchibald

Thanks for your answer. Unfortunately I still don't know how to solve my problem.

  • the benchmarks I provided here are very representative of the real code I'm running. It's a MWE of a real problem.
  • the time of the natural solution is spent in the call of numba_bar (reading the list), not in the call of natural_numba_foo (allocating the list). See benchmarks below.
  • if passing an object created by one numba function straight back into another numba function is this slow, that's a problem, no? It means that numba can't be used beyond single independent functions. If I understand your message correctly, natural_numba_foo returns a built-in Python list, and reading that back into another numba function is slow; in that case I think nopython mode should just not return built-in Python lists, but rather a "pointer wrapper" pointing at a numba List object (or refuse to compile in nopython mode when built-in Python list syntax is used and the result is returned).
  • Even if we ignore the issues with built-in Python lists being passed around between numba functions, is there any fix for numba's List being this slow (5x slower than the already slow Python)? Is there an explanation for why it is so slow?


soerenwolfers commented Apr 22, 2024

Here are more benchmarks proving that

  • The "natural" numba solution spends most time in the call of numba_bar (which accesses the list). Whether that time is spent in the actual function or in an automated conversion of the inputs, I can't tell. (I'd be interested to know how to benchmark this!)
  • The "typed" numba solution is surprisingly slow even in the typed_numba_foo call (which creates the list). I don't know why; it outputs a numba.typed.typedlist.List and should therefore not have to do any conversions? Even the numba_bar call (which accesses the list) is still slower than the regular Python solution for small n.
  • Combining allocation and reading in a single numba function is still slower than regular Python when using numba.typed.List.
  • Combining allocation and reading in a single numba function is fastest when using regular Python list syntax within numba.

Results for n=100_000:

Natural njit -foo took 1.1867223369990825 s
Natural njit -bar took 18.94784771400009 s
Typed njit -foo took 0.15264834299887298 s
Typed njit -bar took 0.14661180000257445 s
Regular Python -foo took 0.0010774939983093645 s
Regular Python -bar took 0.07347720399775426 s
Natural combined -foo took 1.0312000085832551e-05 s
Natural combined -bar took 0.0423850199986191 s
Typed combined -foo took 1.2600998161360621e-05 s
Typed combined -bar took 0.2953695609976421 s

Results for n=10_000_000:

Natural njit -foo ---not tested---
Natural njit -bar ---not tested---
Typed njit -foo took 2.04352636099793 s
Typed njit -bar took 2.088136158999987 s
Regular Python -foo took 0.04432875400016201 s
Regular Python -bar took 2.4120894910010975 s
Natural combined -foo took 9.609000699128956e-06 s
Natural combined -bar took 0.8744130579980265 s
Typed combined -foo took 5.560999852605164e-06 s
Typed combined -bar took 4.209406034002313 s
from numba import njit
from numba.typed import List
import numpy as np  # missing from the original snippet
import timeit

def foo(n):
    return [(np.zeros(0), np.zeros(0))] * n

@njit
def typed_numba_foo(n):
    return List([(np.zeros(0), np.zeros(0))] * n)

def bar(a, b):
    for idx in b:
        _ = a[idx]  # the slow bit
    return None

@njit
def natural_foo_bar(n, b):  # originally took (a, b) but used the global n
    a = [(np.zeros(0), np.zeros(0))] * n
    for idx in b:
        _ = a[idx]  # the slow bit
    return None


@njit
def typed_foo_bar(n, b):
    a = List([(np.zeros(0), np.zeros(0))] * n)
    for idx in b:
        _ = a[idx]  # the slow bit
    return None

natural_numba_foo = njit(foo)
numba_bar = njit(bar)

def call_foo(method, n):
    if method == 0:
        return natural_numba_foo(n)
    elif method == 1:
        return typed_numba_foo(n)
    elif method == 2:
        return foo(n)
    elif method > 2:
        return n
        
def call_bar(method, foo_res_or_n, indices):
    if method == 0:
        numba_bar(foo_res_or_n, indices)
    elif method == 1:
        numba_bar(foo_res_or_n, indices)
    elif method == 2:
        bar(foo_res_or_n, indices)
    elif method == 3:
        natural_foo_bar(foo_res_or_n, indices)
    elif method == 4:
        typed_foo_bar(foo_res_or_n, indices)


n = 10_000_000
indices = np.random.randint(0, n, n)
repeats = 1

for method, arg in [
   # ('Natural njit', 0),
                    ('Typed njit', 1), ('Regular Python', 2), ('Natural combined', 3), ('Typed combined', 4)]:
    foo_res = call_foo(arg, n) # compile
    call_bar(arg, foo_res, indices) # compile
    start = timeit.default_timer()
    for i in range(repeats):
        foo_res = call_foo(arg, n)
    print(method, "-foo", "took", timeit.default_timer() - start, "s") 
    start = timeit.default_timer()
    for i in range(repeats):
        call_bar(arg, foo_res, indices)
    print(method, "-bar", "took", timeit.default_timer() - start, "s") 


soerenwolfers commented Apr 22, 2024

To summarize:

  • numba.typed.List solutions are slower than regular Python. Whether the List is passed between allocator and consumer functions or used in a combined function doesn't make a big difference; the total is typically 2x-5x as slow as regular Python.
  • passing regular Python lists between numba functions incurs a ~350x slowdown
  • regular Python lists (or at least: the syntax for regular Python lists) used entirely within numba and not passed around provide by far the fastest solution.

(Replacing the trivial body of bar by something non-trivial can of course affect the "regular Python" vs "numba" comparison, but doesn't affect the timings within numba nor does it change the fact that the "natural numba" solution is orders of magnitude slower than the "regular Python" solution)


esc commented Apr 23, 2024

@soerenwolfers thank you for asking about this. If you take a look at the issue tracker, you'll probably find issues reporting that both Numba list implementations (reflected list and typed list) may suffer performance inefficiencies depending on how they are used. I'm still in the process of working out what is going on in your examples, but wanted to share a quick example with a simpler use-case. Perhaps there is something to take away from this.

Numba's typed list is useful and can be very quick, but your mileage may vary depending on use-case. I tried the following benchmark using a simple accumulator function over integers and got some very promising results:

In [1]: from numba import njit

In [2]: from numba.typed import List

In [3]: py_list = [i for i in range(100000)]

In [4]: numba_list = List(py_list)

In [5]: @njit
   ...: def accumulate(l):
   ...:     c = 0
   ...:     for i in l:
   ...:         c += i
   ...:     return c
   ...:

In [6]: %timeit accumulate.py_func(py_list) # python function with python list

1.91 ms ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [7]: %timeit accumulate.py_func(numba_list) # python function with numba typed list
78.8 ms ± 522 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit accumulate(py_list) # numba function with numba reflected list
52.3 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit accumulate(numba_list) # numba function with numba typed list
16.6 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


esc commented Apr 23, 2024

@soerenwolfers I have now updated the example to use random access of the list, like:

In [9]: @njit
   ...: def accumulate(l):
   ...:     c = 0
   ...:     for i in range(len(l)):
   ...:         c += l[i]
   ...:     return c

In this case the benchmarks are:

In [13]: %timeit accumulate.py_func(py_list) # python function with python list
3.04 ms ± 84.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [14]: %timeit accumulate.py_func(numba_list) # python function with numba typed list
80.2 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit accumulate(py_list) # numba function with numba reflected list
52.7 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit accumulate(numba_list) # numba function with numba typed list
1.2 ms ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


esc commented Apr 23, 2024

@soerenwolfers I re-did my benchmark so that the access is truly randomized (by handing in the indices) rather than sequential, and I get:

In [1]: from numba import njit

In [2]: from numba.typed import List

In [3]: n = 100000

In [4]: import numpy as np

In [5]: indices = np.random.randint(0, n, n)

In [6]: @njit
   ...: def accumulate(l, n):
   ...:     c = 0
   ...:     for i in n:
   ...:         c += l[i]
   ...:     return c
   ...:

In [7]: py_list = [i for i in range(n)]

In [8]: numba_list = List(py_list)


In [10]: %timeit accumulate.py_func(py_list, indices) # python function with python list
6.98 ms ± 360 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit accumulate.py_func(numba_list, indices) # python function with numba typed list
83.3 ms ± 670 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit accumulate(py_list, indices) # numba function with numba reflected list
52.1 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit accumulate(numba_list, indices) # numba function with numba typed list
1.12 ms ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@soerenwolfers
Author

@esc I suspect your example might be different from mine because the list contains "value types" instead of "reference types", but that depends on implementation details of numba that I'm not familiar with. In any case, I agree that once the list access becomes trivial compared to the work being done, numba will look good again, but I intentionally highlighted something that causes real problems for me: in the opposite case, where the work is trivial and it's all about accessing non-trivial data, numba is slower than pure Python. (By a lot in some cases and by a little in others, depending on how you use numba and what you use it on, but still.)


esc commented Apr 23, 2024

> @esc I suspect your example might be different from mine because the list contains "value types" instead of "reference types", but that depends on implementation details of numba that I'm not familiar with. In any case, I agree that once the list access becomes trivial compared to the work being done, numba will look good again, but I intentionally highlighted something that causes real problems for me: in the opposite case, where the work is trivial and it's all about accessing non-trivial data, numba is slower than pure Python.

Indeed. Numba isn't a silver bullet, and the lists are known to have certain performance deficiencies; it seems like you've hit one of those. 😅 I doubt there will be a "workaround" for your use-case, unfortunately.


esc commented Apr 23, 2024

@soerenwolfers I thought about this some more, and the issue you are encountering is probably related to so-called "reference counted" types (e.g. a tuple of NumPy arrays), i.e. anything that isn't a primitive like an int or a float. I went and dug out some pointers for you, in case you want to have a crack at improving typed.List for your use-case:

High-level interface: https://github.com/numba/numba/blob/main/numba/typed/typedlist.py#L361-L369
Compiler code: https://github.com/numba/numba/blob/main/numba/typed/listobject.py#L689-L791
C-level backing: https://github.com/numba/numba/blob/main/numba/cext/listobject.c#L305-L323

Note that the reflected list is scheduled for deprecation, so it's probably not worth attempting to improve that.

I think there are quite a few people who would be happy if the typed list performance for more complex use-cases could be improved! 🙌

Hope this helps!


This issue is marked as stale as it has had no activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with any updates and confirm that this issue still needs to be addressed.

@github-actions github-actions bot added the stale Marker label for stale issues. label May 24, 2024
@soerenwolfers
Author

Closing because I stopped using numba, and this doesn't seem to be prioritized by the dev team.

@lesshaste

@soerenwolfers What are you using instead?


esc commented May 25, 2024

@soerenwolfers it seems like Numba isn't a good fit for your use-case at present. FWIW: I will continue to debug setitem/getitem; we are discussing a number of potential performance improvements with @DannyWeitekamp in this discourse thread: https://numba.discourse.group/t/performance-of-typed-list-outside-of-jit-functions/2560

@soerenwolfers
Author

I came to the same conclusion. I'm not using anything else either; I'm just losing a bit of time running regular Python instead of losing it bending numba to do what it's apparently not targeted at.


esc commented May 25, 2024

ref: #9374
