diff --git a/docs/glossary.md b/docs/glossary.md
index 538f06a..30ad889 100644
--- a/docs/glossary.md
+++ b/docs/glossary.md
@@ -6,6 +6,8 @@ |---|---|---|
 | ABI | Application Binary Interface | See [here](./background/binary_interface.md) |
 | API | Application Programming Interface | The sum total of available functions, classes, etc. of a given program |
+| AAB | Android Application Bundle | A distributable unit containing an Android application |
+| APK | Android Application Package | A "binary" unit for Android, installed on a device |
 | ARM | Advanced RISC Machines | Family of RISC architectures, second-most widely used processor family after x86 |
 | AVX | Advanced Vector eXtensions | Various extensions to the x86 instruction set (AVX, AVX2, AVX512), evolution after SSE |
 | BLAS | Basic Linear Algebra Subprograms | Specification resp. implementation for low-level linear algebra routines |
@@ -29,6 +31,7 @@
 | LAPACK | Linear Algebra PACKage | Standard software library for numerical linear algebra |
 | ISA | Instruction Set Architecture | Specification of an instruction set for a CPU; e.g. x86-64, arm64, ... |
 | JIT | Just-in-time Compilation | Compiling code just before execution; used in CUDA, PyTorch, PyPy, Numba etc. |
+| JNI | Java Native Interface | The bridge API allowing access to Java runtime objects from native code (and vice versa) |
 | LLVM | - | Cross-platform compiler framework, home of Clang, MLIR, BOLT etc. |
 | LTO | Link-Time Optimization | See [here](./background/compilation_concepts.md#link-time-optimization-lto)|
 | LTS | Long-Term Support | Version of a given software/library/distribution designated for long-term support |
@@ -36,6 +39,7 @@
 | MPI | Message Passing Interface | Standard for message-passing in parallel computing |
 | MLIR | Multi-Level IR | Higher-level IR within LLVM; used i.a. in machine learning frameworks |
 | MSVC | Microsoft Visual C++ | Main compiler on Windows |
+| NDK | Native Development Kit | The Android toolchain supporting compilation of binary modules |
 | NEP | Numpy Enhancement Proposal | See [here](https://numpy.org/neps/) |
 | OpenMP | Open Multi Processing | Multi-platform API for enabling multi-processing in C/C++/Fortran |
 | OS | Operating System | E.g. Linux, MacOS, Windows |
diff --git a/docs/index.md b/docs/index.md
index aa7f7a7..36e7b5b 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -68,6 +68,7 @@ workarounds for.
 5. [Distributing a package containing SIMD code](key-issues/simd_support.md)
 6. [Unsuspecting users getting failing from source builds](key-issues/unexpected_fromsource_builds.md)
 7. [Cross compilation](key-issues/cross_compilation.md)
+8. [Platforms with multiple CPU architectures](key-issues/multiple_architectures.md)

 ## Contributing
diff --git a/docs/key-issues/cross_compilation.md b/docs/key-issues/cross_compilation.md
index 27b16b6..88ed65e 100644
--- a/docs/key-issues/cross_compilation.md
+++ b/docs/key-issues/cross_compilation.md
@@ -23,7 +23,7 @@ compiled for the target platform.
 macOS also experiences this as a result of the Apple Silicon transition. Apple
 has provided the tools to make cross compilation from x86-64 to arm64 as easy
-as possible, as well as to compile fat binaries
+as possible, as well as to compile [fat binaries](multiple_architectures.md)
 (supporting x86-64 and arm64 at the same time) on both architectures. In the
 latter case, the host platform will still be one of the outputs of the
 compilation process, and the resulting binary will run on the CI/CD system.
diff --git a/docs/key-issues/multiple_architectures.md b/docs/key-issues/multiple_architectures.md
new file mode 100644
index 0000000..c0814c5
--- /dev/null
+++ b/docs/key-issues/multiple_architectures.md
@@ -0,0 +1,365 @@
+# Platforms with multiple CPU architectures
+
+One important subset of ABI concerns is the CPU architecture for which a binary
+artefact has been built. Attempting to run a binary on hardware that doesn't
+match the CPU architecture (or architecture variant[^1]) for which the binary
+was built will generally lead to crashes, even if the ABI being used is
+otherwise compatible.
+
+[^1]:
+    E.g., the x86-64 architecture has a range of well-known extensions, such as
+    SSE, SSE2, SSE3, AVX, AVX2, AVX512, etc.
+
+## Current state
+
+Most operating systems support multiple CPU architectures:
+
+* In the early days of Windows NT, x86, MIPS, DEC Alpha and PowerPC CPUs were
+  supported.
+* Windows 10 supports x86, x86-64, ARMv7 and ARM64; Windows 11 supports x86-64
+  and ARM64.
+* Due to its open source nature, Linux tends to support all CPU architectures for
+  which someone is interested enough to author & provide support in the kernel,
+  see [here](https://en.wikipedia.org/wiki/List_of_Linux-supported_computer_architectures).
+* Apple transitioned Mac hardware from PowerPC to Intel (x86-64) CPUs, providing
+  a forward compatibility path for binaries.
+* Apple is currently transitioning Mac hardware from Intel (x86-64) to
+  Apple Silicon (ARM64) CPUs, again providing a forward compatibility path.
+* Apple supports ARMv6, ARMv7, ARMv7s, ARM64 and ARM64e on iOS.
+* Android currently supports ARMv7, ARM64, x86, and x86-64; it has historically
+  also supported ARMv5 and MIPS.
+
+The general expectation is that an executable or library is compiled for a
+single CPU architecture.
+
+CPU architecture compatibility is a necessary, but not sufficient, criterion for
+determining binary compatibility. Even if two binaries are compiled for the same
+CPU architecture, that doesn't guarantee [ABI compatibility](abi.md).
+
+Three approaches have emerged on operating systems that have a need to manage
+multiple CPU architectures:
+
+### Multiple binaries
+
+The minimal solution is to distribute multiple binaries. This is the approach
+taken by Windows and Linux. At time of distribution, an installer or other
+downloadable artefact is provided for each supported platform, and it is up to
+the user to select and download the correct artefact.
+
+At present, the Python ecosystem almost exclusively uses the "multiple binary"
+solution. This serves the needs of Windows and Linux well, as it matches the
+way end-users interact with binaries on those platforms.
+
+### Archiving
+
+The approach taken by Android is very similar to the multiple binary approach,
+with some affordances and tooling to simplify distribution.
+
+By default, Android projects use Java/Kotlin, which produces platform-independent
+code. However, it is possible to use non-Java/Kotlin libraries by using JNI and
+the Android NDK (Native Development Kit). If a project contains native code, a
+separate compilation pass is performed for each architecture.
+
+If a native binary library is required to compile the Android application, a
+version must be provided for each supported CPU architecture. A directory layout
+convention exists for providing a binary for each platform, with the same
+library name.
+
+The final binary artefact produced for Android distribution uses this same
+directory convention. A "binary" on Android is an APK (Android Application
+Package) bundle; this is effectively a ZIP file with known metadata and
+structure; internally, there are subfolders for each supported CPU architecture.
+This APK is bundled into AAB (Android Application Bundle) format for upload to
+an app store; at time of installation, a CPU-specific APK is generated and
+delivered to the end-user's device.
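As a sketch of that directory convention (using Android's standard ABI directory names; `myapp` and `libfoo.so` are placeholder names), the native-code layout inside an APK looks roughly like this:

```text
myapp.apk
├── classes.dex                  # compiled Java/Kotlin code (architecture-independent)
└── lib/
    ├── armeabi-v7a/libfoo.so    # 32-bit ARM (ARMv7)
    ├── arm64-v8a/libfoo.so      # 64-bit ARM (ARM64)
    ├── x86/libfoo.so            # 32-bit x86
    └── x86_64/libfoo.so         # 64-bit x86
```

Each `lib/<abi>/` directory contains a build of the same library, and only the directory matching the device's ABI is used at install time.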
+
+### Fat binaries
+
+Apple has taken the approach of "fat" binaries. A fat binary is a single
+executable or library artefact that contains code for multiple CPU
+architectures.
+
+Fat binaries can be compiled in two ways:
+
+1. **Single pass** Apple has modified their compiler tooling with flags that
+   allow the user to issue a single compilation command, instructing the
+   compiler to include multiple architectures in the output binary.
+2. **Multiple pass** After compiling a binary for each architecture, Apple
+   provides a tool named `lipo` to combine multiple single-architecture
+   binaries into a single fat binary that contains all architectures. The
+   `delocate-fuse` command provided by the
+   [delocate](https://pypi.org/project/delocate/) Python package can be used to
+   perform this merging on Python wheels (along with other functionality).
+
+At runtime, the operating system loads the binary slice for the current CPU
+architecture, and the linker loads the appropriate slice from the fat binary of
+any dynamic libraries.
+
+On macOS ARM hardware, Apple also provides Rosetta as a support mechanism; if a
+user tries to run a binary that doesn't contain an ARM64 slice, but *does*
+contain an x86-64 slice, the x86-64 slice will be translated at runtime into
+ARM64 code. Complications can occur when only *some* of the binary is being
+converted (e.g., if the binary being executed is fat, but a dynamic library
+isn't).
+
+To support the transition to Apple Silicon/M1 (ARM64), Python has introduced a
+`universal2` architecture target. This is effectively a "fat wheel" format; the
+`.dylib` files contained in the wheel are fat binaries containing both x86-64
+and ARM64 slices.
+
+
+??? question "What's the deal with `universal2` wheels?"
+
+    The `universal2` fat wheel format has generated quite a bit of discussion,
+    and isn't well-supported by either packaging tools (e.g., there is no way
+    to install a `universal2` wheel from PyPI if thin wheels are also present)
+    or package authors (most numerical, scientific and ML/AI package authors do
+    not provide them). There are some arguments for and against supporting the
+    format or even defaulting to it.
+
+    Arguments for:
+
+    - While users with a technical background are usually aware of the CPU
+      architecture of their machine, less technical users are often unaware of
+      this detail (or of its significance). The `universal2` wheel format
+      allows users to largely ignore CPU architecture, as all CPU
+      architectures for the macOS platform are accommodated in a single binary
+      artefact.
+    - macOS has developed an ecosystem where end users expect that any macOS
+      binary will run on any macOS machine, regardless of CPU architecture.
+      As a result, when building macOS apps (`.dmg` downloadable installers
+      or similar formats, produced by tools such as
+      [py2app](https://py2app.readthedocs.io) or
+      [briefcase](https://beeware.org/project/projects/tools/briefcase/)),
+      the person building the project must accommodate all possible CPU
+      architectures where the code *could* be executed.
+    - If binary wheels are only available in "thin" format, any issues with
+      merging those wheels into fat equivalents for distribution purposes are
+      deferred to the person downloading the wheels (i.e., the app builder).
+      This can be problematic as it may require expert knowledge about the
+      package being merged (such as optional modules or header files that may
+      not be present in both thin artefacts). `universal2` artefacts capture
+      this knowledge by requiring the maintainers of the wheel-producing
+      project to resolve any merging issues.
+
+    Arguments against:
+
+    - `universal2` wheels are never necessary for users installing into a
+      locally installed Python environment exclusively for their own use,
+      which is the default experience most users have with Python.
+    - Using `universal2` wheels requires larger downloads and more disk
+      space - for a typical PyData stack it takes hundreds of MBs more per
+      Python environment than thin wheels, and users are likely to have quite
+      a few environments on their system at once, meaning that defaulting to
+      `universal2` would use several GBs more disk space.
+
+        - Disk space on older MacBook Air models is 128 GB, and up to half of
+          that can be taken up by the OS and system data itself. So a few GBs
+          can be significant.
+        - Internet plans in many countries are not unlimited; almost doubling
+          the download size of wheels is a serious cost, and not desirable for
+          any user - but especially unfriendly to users in countries where
+          network infrastructure is less developed.
+
+    - In addition, it takes extra space on PyPI (examples: `universal2` wheels
+      cost an extra 81.5 MB for NumPy 1.21.4 and 175.5 MB for SciPy 1.9.1), and
+      projects with large wheels often run into total size limits on PyPI.
+    - It imposes an extra maintenance burden for each project, because separate
+      CI jobs are needed to build and test `universal2` wheels. Typically
+      projects make tradeoffs there, because they cannot support every
+      platform. And `universal2` doesn't meet the bar for usage frequency /
+      user demand here - it is only asked for by macOS universal app authors,
+      and in practice that demand seems to be well below the demand for
+      wheels for other platforms with still-patchy support like `musllinux`,
+      `ppc64le`, and PyPy (see
+      [Expectations that projects provide ever more wheels](../../meta-topics/user_expectations_wheels.md)
+      for more on that).
+    - When a project provides thin wheels (which should be done when projects
+      have large compiled extensions, because of the better experience for
+      the main use case of wheels - users installing for use on their own
+      machine), you cannot even install a `universal2` wheel with pip from PyPI
+      at all. Why upload artifacts you cannot install? This is due to a [known
+      bug in pip](https://github.com/pypa/pip/issues/11573).
+    - It is straightforward to fuse two thin wheels with `delocate-fuse` (a
+      tool that comes with [delocate](https://pypi.org/project/delocate/));
+      it's a one-liner: `delocate-fuse $x86-64_wheel $arm64_wheel -w .`
+      Note though that robustness improvements in `delocate-fuse` for more
+      complex cases (e.g., generated header files with architecture-dependent
+      content) are needed (see
+      [delocate#180](https://github.com/matthew-brett/delocate/issues/180)).
+      Such cases are likely to be equally problematic for direct `universal2`
+      wheel builds (see, e.g.,
+      [numpy#22805](https://github.com/numpy/numpy/pull/22805)).
+    - Open source projects rely on freely available CI systems to support
+      particular hardware architectures. CI support for macOS `arm64` was a
+      problem at first, but is now available through Cirrus CI. And that
+      availability is expected to grow over time; GitHub Actions and other
+      providers [will roll out support at some
+      point](https://github.com/github/roadmap/issues/528). This allows
+      building thin wheels and running tests - which is nicer than building
+      `universal2` wheels on x86-64 and testing only the x86-64 part of those
+      wheels.
+
+iOS has an additional complication of requiring support for multiple *ABIs* in
+addition to multiple CPU architectures. The ABIs for the iOS simulator and for
+physical iOS devices are different; however, ARM64 is a supported CPU
+architecture for both. As a result, it is not possible to produce a single fat
+library that supports both the iOS simulator and iOS devices. Apple provides an
+additional structure - the `XCFramework` - as a wrapper format for packaging
+libraries that need to span multiple ABIs. When developing an application for
+iOS, a developer will need to install binaries for both the simulator and
+physical devices.
+
+## Problems
+
+The problems that exist with supporting multiple architectures are limited to
+those platforms that expect distributable artefacts to support multiple
+platforms simultaneously - macOS, iOS and Android.
+
+Although the `universal2` "fat wheel" format exists, there is resistance to
+using this format in some circles (in particular in the science/data
+ecosystem). If a package publishes independent thin wheels for x86-64 and
+ARM64, there's no ecosystem-level tooling for consuming those artefacts.
+However, ad-hoc approaches using `delocate` or `lipo` can be used.
+
+Supporting iOS requires supporting between 2 and 5 architectures (x86-64 and
+ARM64 at the minimum), and at least 2 ABIs - the iOS simulator and iOS device
+have different (and incompatible) binary ABIs. At runtime, iOS expects to find a
+single "fat" binary for the ABI that is in use. iOS effectively requires an
+analog of `universal2` covering the 2 ABIs and multiple architectures. However:
+
+1. The Python ecosystem does not provide an extension mechanism that would allow
+   platforms to define and utilize multi-architecture build artefacts.
+
+2. The rate of change of CPU architectures in the iOS ecosystem is more rapid
+   than that seen on desktop platforms; any potential "universal iOS" target
+   would need to be updated or versioned regularly. A single named target would
+   also force developers into supporting older devices that they may not want
+   to support.
+
+Supporting Android also requires supporting between 2 and 4 architectures
+(depending on the range of development and end-user configurations the app needs
+to support). Android's archiving-based approach can be mapped onto the "multiple
+binary" approach, as it is possible to build a single archive from multiple
+individual binaries. However, some coordination is required when installing
+multiple binaries. If an independent install pass (e.g., call to `pip`) is used
+for each architecture, the dependency resolution process for each platform will
+also be independent; if there are any discrepancies in the specific versions
+available for each architecture (or any ordering instabilities in the dependency
+resolution algorithm), it is possible to end up with different versions on each
+platform. Some coordination between per-architecture passes is therefore
+required.
+
+## History
+
+[The BeeWare Project](https://beeware.org) provides support for building both
+iOS and Android binaries. On both platforms, BeeWare provides a custom package
+index that contains pre-compiled binaries
+([Android](https://chaquo.com/pypi-7.0/);
+[iOS](https://anaconda.org/beeware/repo)). These binaries are produced using a
+set of tooling
+([Android](https://github.com/chaquo/chaquopy/tree/master/server/pypi);
+[iOS](https://github.com/freakboy3742/chaquopy/tree/iOS-support/server/pypi))
+that is analogous to the tools used by conda-forge to build binary artefacts.
+These tools patch the source and build configurations for the most common Python
+binary dependencies; on iOS, these tools also manage the process of merging
+single-architecture, single-ABI wheels into a fat wheel.
+
+On iOS, BeeWare-supplied iOS binary packages provide a single "iPhone" wheel.
+This wheel includes 2 binary libraries (one for the iPhone device ABI, and one
+for the iPhone Simulator ABI); the iPhone simulator binary includes x86-64 and
+ARM64 slices. This is effectively the "universal-iphone" approach, encoding a
+specific combination of ABIs and architectures.
+
+BeeWare's support for Android uses [Chaquopy](https://chaquo.com/chaquopy) as a
+base. Chaquopy's binary artefact repository stores a single binary wheel for
+each platform; it also contains a wrapper around `pip` to manage the
+installation of multiple binaries. When a Python project requests the
+installation of a package:
+
+* Pip is run normally for one binary architecture;
+* The `.dist-info` metadata is used to identify the native packages - both
+  those directly requested by the user, and those installed as indirect
+  requirements by pip;
+* The native packages are separated from the pure-Python packages, and pip is
+  then run again for each of the remaining architectures; this time, only those
+  specific native packages are installed, pinned to the same versions that pip
+  selected for the first architecture.
+
+[Kivy](https://kivy.org) also provides support for iOS and Android as deployment
+platforms. However, Kivy doesn't support the use of binary artefacts like wheels
+on those platforms; Kivy's support for binary modules is based on the broader
+Kivy platform, which includes build support for libraries that may be required.
+
+## Relevant resources
+
+To date, there haven't been extensive public discussions about the support of
+iOS or Android binary packages. However, there were discussions around the
+adoption of `universal2` for macOS:
+
+* [The CPython discussion about `universal2`
+  support](https://discuss.python.org/t/apple-silicon-and-packaging/4516)
+* [The addition of `universal2` to
+  CPython](https://github.com/python/cpython/pull/22855)
+* [Support in packaging for
+  `universal2`](https://github.com/pypa/packaging/pull/319), which defines the
+  logic around resolving `universal2` to specific platforms.
+
+## Potential solutions or mitigations
+
+For macOS universal app builders, first-class tooling in the Python ecosystem
+to fuse thin wheels is needed. This may be done by, for example, making
+`delocate-fuse` more robust (see
+[delocate#180](https://github.com/matthew-brett/delocate/issues/180)).
+Another step would then be making `delocate` itself more visible, or merging
+`delocate` into `auditwheel`. This tooling would then be available as shared
+infrastructure to support universal app builders like `py2app` and `briefcase`.
+
+For the general multiple architecture case, there are two approaches that could
+be used to provide a solution to this problem, depending on whether the support
+of multiple architectures is viewed as a distribution or integration problem.
+
+### Distribution-based solution
+
+The first approach is to treat the problem as a package distribution issue. In
+this approach, artefacts stored in package repositories include all the ABIs and
+CPU architectures needed to meaningfully support a given platform. This is the
+approach embodied by the `universal2` packaging solution on macOS, and the iOS
+solution used by BeeWare.
+
+This approach would require agreement on any new "known" multi-ABI/arch tags, as
+well as any resolution schemes that may be needed for those tags.
+
+A more general approach to this problem would be to allow for multi-architecture
+and multi-ABI binaries as part of the wheel naming scheme. A wheel can already
+declare compatibility with multiple CPython versions (e.g.,
+`cp34.cp35.cp36-abi3-manylinux1_x86_64`); it could be possible for a wheel to
+declare multiple ABI or architecture inclusions. In such a scheme,
+`cp310-abi3-macosx_10_9_universal2` would effectively be equivalent to
+`cp310-abi3-macosx_10_9_x86_64.macosx_10_9_arm64`; an iPhone wheel for the same
+package might be
+`cp310-abi3-iphoneos_12_0_arm64.iphonesimulator_12_0_x86_64.iphonesimulator_12_0_arm64`.
+
+This would allow for more generic logic based on matching name fragments, rather
+than specific "known name" targets.
+
+Regardless of whether "known tags" or a generic naming scheme is used, the
+distribution-based approach requires modifications to the process of building
+packages, and the process of installing packages.
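As a sketch of how such generic matching could work (illustrative only, not an existing installer API; it is similar in spirit to the tag parsing the `packaging` library already does for compressed tag sets): `.`-separated entries within each component of the wheel tag are treated as alternatives, and the name expands into a set of individual tags that an installer could match against the running platform.

```python
import itertools

def expand_tag_set(compressed: str) -> set[tuple[str, str, str]]:
    """Expand a compressed wheel tag set, e.g.
    'cp310-abi3-macosx_10_9_x86_64.macosx_10_9_arm64', into the
    individual (interpreter, abi, platform) tags it covers.
    '.'-separated entries within each component are alternatives.
    """
    interpreter, abi, platform = compressed.split("-")
    # The cartesian product of the alternatives gives every concrete tag.
    return set(itertools.product(
        interpreter.split("."), abi.split("."), platform.split(".")
    ))

# A hypothetical multi-architecture macOS wheel covers two concrete tags;
# an installer on Apple Silicon would accept it because
# ('cp310', 'abi3', 'macosx_10_9_arm64') is among them.
tags = expand_tag_set("cp310-abi3-macosx_10_9_x86_64.macosx_10_9_arm64")
```

The same expansion covers the existing multi-interpreter case: `cp34.cp35.cp36-abi3-manylinux1_x86_64` expands to three tags, one per CPython version.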
+
+### Integration-based solution
+
+Alternatively, this could be treated as an install-time problem. This is the
+approach taken by BeeWare/Chaquopy on Android.
+
+In this approach, package repositories would continue to store
+single-architecture, single-ABI artefacts. However, at time of installation, the
+installation tool allows for the specification of multiple architecture/ABI
+combinations. The installer then downloads a wheel for each architecture/ABI
+requested, and performs any post-processing required to merge binaries for
+multiple architectures into a single fat binary, or to archive those binary
+artefacts in an appropriate location.
+
+This approach is less invasive from the perspective of package repositories and
+package build tooling, but it would require significant modifications to
+installer tooling.
diff --git a/mkdocs.yml b/mkdocs.yml
index 33a0dff..ee2fb83 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -43,6 +43,7 @@ nav:
     - 'key-issues/simd_support.md'
     - 'key-issues/unexpected_fromsource_builds.md'
     - 'key-issues/cross_compilation.md'
+    - 'key-issues/multiple_architectures.md'
     - 'other_issues.md'
   - 'Background':
     - 'background/binary_interface.md'