
C integer types: the missing manual

Lars Buitinck edited this page Mar 19, 2014 · 9 revisions

Here be dragons.

Throughout the scikit are bits and pieces written in Cython, and these commonly use C integers to index into arrays. The Python code also regularly creates arrays of C integers, via np.where and other means. Confusion arises every so often over what the correct type for integers is, especially when we use them as indices.

There are various types of C integers, the main ones being for our purposes:

  • int: the "native" integer type. This once meant the size of a register, but on x86-64 it no longer does: int is 32 bits wide, while the registers and pointers are 64 bits.
  • size_t: standard C89 type, defined in <stddef.h> and a variety of other standard C headers. Always unsigned. Large enough to hold the size of any object, i.e. 64 bits on a 64-bit machine, 32 bits otherwise. This is the type of a C sizeof expression and of the return value of strlen, and it's what functions like malloc, memcpy and strcpy expect. Use when dealing with these functions.
  • Py_ssize_t: type defined in <Python.h> and declared implicitly in Cython, that can hold the size (in bytes) of the largest object the Python interpreter ever creates. Index type for list. 63 bits + sign on x86-64; in general, the signed counterpart of size_t, with the sign used for negative indices so l[-1] works in C as well. Use when dealing with the CPython API.
  • np.npy_intp: type defined by the NumPy Cython module that is always large enough to hold the value of a pointer, like intptr_t in C99. 63 bits + sign on x86-64, and probably always the size of Py_ssize_t, although there's no guarantee. Use for indices into NumPy arrays; the NumPy C API expects this type.
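Since the widths above are platform-dependent, a quick sanity check from Python can be useful. np.intc and np.intp are NumPy's counterparts of a C int and npy_intp, and Py_ssize_t's maximum value is exposed as sys.maxsize (a generic illustration, nothing scikit-specific):

```python
import sys
import numpy as np

# np.intc corresponds to a C "int", np.intp to npy_intp.
print("int:      ", 8 * np.dtype(np.intc).itemsize, "bits")
print("npy_intp: ", 8 * np.dtype(np.intp).itemsize, "bits")

# Py_ssize_t's maximum is exposed in Python as sys.maxsize.
print("Py_ssize_t max:", sys.maxsize)
```

On x86-64 this reports a 32-bit int and a 64-bit npy_intp, which is the mismatch the rest of this page is about.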

Now to confound matters:

  • BLAS uses int for all its integers, except (in ATLAS) the return value from certain functions, which is size_t. It follows that when you call BLAS, you shouldn't expect to be able to handle array dimensions greater than INT_MAX, i.e. 2³¹-1 (from <limits.h>). If this doesn't sound like much of a problem, consider the fast ways to compute the Frobenius norm of a matrix:

    scipy.linalg.norm(X)
    # or
    sqrt(np.dot(X.ravel(), X.ravel()))
    

The ravel'd array, implicit in the call to norm, has only one dimension, which may be ≥2³¹. This is no problem for NumPy's array data structure, but norm may call cblas_nrm2, and that can't handle the array size correctly (dot has been fixed). Most likely, it'll process only part of the array, but this depends on the implementation.
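As a sketch of how one might work around this, here is a hypothetical safe_frobenius helper (not part of the scikit) that never hands BLAS more than INT_MAX elements at a time:

```python
import numpy as np

INT_MAX = 2**31 - 1  # the largest dimension a 32-bit BLAS int can express

def safe_frobenius(X):
    """Frobenius norm that passes at most INT_MAX elements to BLAS per call."""
    x = np.asarray(X, dtype=np.float64).ravel()
    if x.size <= INT_MAX:
        return np.sqrt(np.dot(x, x))  # the fast path from above
    # chunked fallback for huge arrays
    total = 0.0
    for start in range(0, x.size, INT_MAX):
        chunk = x[start:start + INT_MAX]
        total += np.dot(chunk, chunk)
    return np.sqrt(total)
```

Any array that fits in 32-bit address space takes the fast path; the chunked loop only triggers at ≥2³¹ elements.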

  • scipy.sparse uses index arrays of type int to represent matrices in COO, CSC and CSR formats, so it has much the same limitation as BLAS. n_samples, n_features and the number of non-zero entries are all three limited to 2³¹-1. SciPy 0.14 has 64-bit indices as well; we'll probably need to use fused types in all the sparse matrix-handling Cython code to properly support these.
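You can inspect which index type a given SciPy build actually uses on a given matrix; the indptr and indices attributes are ordinary NumPy arrays:

```python
import numpy as np
from scipy.sparse import csr_matrix

# indptr and indices are plain NumPy arrays; their dtype shows which
# index width SciPy chose for this matrix (int32 on common builds,
# int64 when 64-bit indices are needed/available).
A = csr_matrix(np.eye(3))
print(A.indptr.dtype, A.indices.dtype)
```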

  • Since npy_intp is an alias at the C level, NumPy has no way of showing that a variable is of this type in Python. Instead, it shows the actual type, so on x86-64 (but not on i386, and probably not on ARM), you'll get results like

    >>> type(np.intp(1))   # corresponds to npy_intp
    <type 'numpy.int64'>
    >>> type(np.intc(1))   # corresponds to a C "int"
    <type 'numpy.int32'>
    >>> np.where([True])[0].dtype
    dtype('int64')         # actually an npy_intp
    
  • np.random.randint returns a Python int (a variable-size integer) when asked for a single number. When asked for an array, it returns either 32-bit or 64-bit integers depending on sizeof(long); this is hardcoded in the C implementation. On most platforms this matches the size of npy_intp, but again there's no guarantee, so getting random indices of the right type can be tricky.
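One way to sidestep the sizeof(long) dependence (an illustrative pattern, not an official API) is to cast the result to np.intp explicitly before using it as an index array:

```python
import numpy as np

# Cast explicitly so the index array has the pointer-sized integer type
# regardless of what randint's C implementation happens to return.
n_samples = 10
idx = np.random.randint(0, n_samples, size=5).astype(np.intp)
print(idx.dtype)
```

The copy this makes is cheap next to the cost of whatever the indices are used for, and it keeps Cython code that declares npy_intp buffers from raising a dtype mismatch.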
