Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Cython Support to Arrow #1143

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
Open

Add Cython Support to Arrow #1143

wants to merge 10 commits into from

Conversation

13MK3
Copy link

@13MK3 13MK3 commented Nov 21, 2022

Pull Request Checklist

Thank you for taking the time to improve Arrow! Before submitting your pull request, please check all appropriate boxes:

  • 🧪 Added tests for changed code.
  • 🛠️ All tests pass when run locally (run tox or make test to find out!).
  • 🧹 All linting checks pass when run locally (run tox -e lint or make lint to find out!).
  • 📚 Updated documentation for changed code.
  • ⏩ Code is up-to-date with the master branch.

If you have any questions about your code changes or any of the points above, please submit your questions along with the pull request and we will try our best to help!

Description of Changes

These changes were made in collaboration with @khanm3

  • Modified the setup file so that the entire Arrow library is compiled using Cython
  • Modified the tox.ini file to support code coverage with Cython
  • Added Cython 3.0.0a11 to the requirements.txt
  • Modified arrow.py to compile with Cython and pass the test suite
  • Modified an existing test case to work with Cython since the mocking was broken from Cython caching builtins
  • Added a Makefile target to compile with Cython

This pull request adds Cython support to Arrow as requested by #617.

Further Explanation

In arrow/arrow.py, we had to change the variable "locale" to "locale_cls" because the variable was being reused for 2 different types and this functionality was incompatible with Cython. Cython also had difficulties recognizing the type of _datetime.isoclanedar, so we had to specifically cast it to a tuple. Without this, it would not compile.

Even with the issues mentioned above, we chose to use Cython 3.0.0a11 (which is a pre-release version) instead of Cython 0.29.32 (the most recent stable version), since from our testing, Cython 3 is better at parsing modern Python features. In general, Cython tries to subsume the functionality of python, but as you can see there are still many quirks.

We were able to allow Cython to use the existing method of calculating code coverage by adding linetracing as a compiler directive. However, due to the quirks of Cython, the resulting line coverage results are slightly lower than 100%. We found that the lines that were no longer being covered were one of 2 cases:

  1. Inside an if statement that had 'pragma no cover' (so those also shouldn't count towards coverage)
  2. Parameters to functions that had a default value (which also shouldn't count towards coverage, but for some reason Cython thinks it should)

This could be ignored by adding more 'pragma no cover', but we're not sure if there is a better way.

Performance Analysis

To compare performance of Arrow with and without Cython, we ran just the Arrow part of this benchmark. This benchmark runs 1 million executions per benchmark and picked the min of 5 repeats. This was run in Python 3.8.10.

arrow_benchmark

From these results, you can see that compiling Arrow with Cython does cause a measureable improvement to performance, especially with parsing. Additionally, compiling Cython without linetracing is even more performant, but we currently compile with it so that code coverage can be calculated.

Future Work

Adding static typing (and necessarily converting those files to .pyx files) would bring an even bigger boost to performance. Parser.py should receive special attention, as this is where the largest amount of time is being spent. This is where adding Cython to Arrow could prove to be especially valuable, since the speed increase with static typing should be much more than that gained by simply compiling with Cython. Looking into different ways of calculating code coverage so linetracing is no longer needed, or providing an option to switch the compiler directive would also be a good way to improve performance in the future.

Co-Authored by: Lukas Lemke <90070416+13MK3@users.noreply.github.com>
Co-Authored by: Lukas Lemke <90070416+13MK3@users.noreply.github.com>
Co-Authored by: Lukas Lemke <90070416+13MK3@users.noreply.github.com>
- since cython doesn't support changing builtins at runtime
Co-Authored by: Lukas Lemke <90070416+13MK3@users.noreply.github.com>
Co-Authored by: Lukas Lemke <90070416+13MK3@users.noreply.github.com>
Co-Authored by: Lukas Lemke <90070416+13MK3@users.noreply.github.com>
@13MK3 13MK3 changed the title Cython Add Cython Support to Arrow Nov 21, 2022
@krisfremen
Copy link
Member

Very nice, i had a wip branch of this, but this looks like good progress.

Co-Authored by: Lukas Lemke <90070416+13MK3@users.noreply.github.com>
Co-Authored by: Lukas Lemke <90070416+13MK3@users.noreply.github.com>

- automatically compile cython when running make build3x
- add target to clean .c and binary files produced from compiling cython
@13MK3
Copy link
Author

13MK3 commented Nov 21, 2022

We realized the system tests are failing because the test systems aren't installing Cython before running setup.py. It seems that the continuous_integration.yml is using cached requirements and therefore not installing Cython even though we specified it in requirements.txt? We are not exactly sure and could use some help with this. Cython seems to install fine when we do "make build3x" locally.

@krisfremen
Copy link
Member

tox handles setup.py dependencies a bit differently, which we install during the GH Actions.

We should probably extract those out of the workflow and in a separate requirements file.

@codecov
Copy link

codecov bot commented Nov 23, 2022

Codecov Report

Base: 100.00% // Head: 99.59% // Decreases project coverage by -0.40% ⚠️

Coverage data is based on head (3d416be) compared to base (74a759b).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##            master    #1143      +/-   ##
===========================================
- Coverage   100.00%   99.59%   -0.41%     
===========================================
  Files            9        9              
  Lines         2319     2684     +365     
  Branches       492        0     -492     
===========================================
+ Hits          2319     2673     +354     
- Misses           0       11      +11     
Impacted Files Coverage Δ
arrow/arrow.py 99.54% <100.00%> (-0.46%) ⬇️
arrow/constants.py 88.23% <0.00%> (-11.77%) ⬇️
arrow/util.py 97.95% <0.00%> (-2.05%) ⬇️
arrow/formatter.py 99.03% <0.00%> (-0.97%) ⬇️
arrow/parser.py 99.43% <0.00%> (-0.57%) ⬇️
arrow/locales.py 99.85% <0.00%> (-0.15%) ⬇️
arrow/api.py 100.00% <0.00%> (ø)
arrow/factory.py 100.00% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@krisfremen
Copy link
Member

@13MK3 I see 3.8+ is passing, the rest are failing. Great work.

@anishnya @jadchaar @systemcatch how do we want to handle this going forward? do we want to go full cython or offer a pure python version as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants