[WIP]ENH: Convert loop unary fp rint into highway #26346

Open: luyahan wants to merge 10 commits into main

luyahan commented Apr 25, 2024

Refers to #24384 and #24385.

luyahan changed the title from "rewrite loop rint into HIGHWAY" to "ENH: Convert loop fp rint into highway" on Apr 25, 2024
luyahan changed the title from "ENH: Convert loop fp rint into highway" to "ENH: Convert loop unary fp rint into highway" on Apr 25, 2024
luyahan changed the title from "ENH: Convert loop unary fp rint into highway" to "WIP: ENH: Convert loop unary fp rint into highway" on Apr 25, 2024
luyahan changed the title from "WIP: ENH: Convert loop unary fp rint into highway" to "[WIP]ENH: Convert loop unary fp rint into highway" on Apr 25, 2024

Mousius (Member) commented Apr 30, 2024

Hi @luyahan,

Have you measured if this changes/increases performance? Would be good to see some benchmarks 😸

Just a process point, I assume google/highway#2116 needs to be released before this can be merged?

luyahan force-pushed the rint-hwy branch 2 times, most recently from c0e7346 to 86eddd1, on May 6, 2024 06:10
Mousius added the label "component: SIMD" (Issues in SIMD (fast instruction sets) code or machinery) on May 6, 2024

luyahan (Author) commented May 7, 2024

> Hi @luyahan,
>
> Have you measured if this changes/increases performance? Would be good to see some benchmarks 😸
>
> Just a process point, I assume google/highway#2116 needs to be released before this can be merged?

Change Before [a83d469] After [efd9879] Ratio Benchmark (Parameter)
+ 1.75±0.03μs 5.27±0μs 3.01 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 1, 'f')
+ 1.74±0.01μs 5.24±0μs 3.01 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 1, 'f')
+ 1.77±0μs 5.25±0.01μs 2.96 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 1, 1, 'f')
+ 1.77±0.01μs 5.23±0.01μs 2.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 1, 1, 'f')
+ 1.93±0μs 5.26±0.01μs 2.73 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rint'>, 1, 1, 'f')
+ 2.18±0.01μs 5.24±0μs 2.4 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 1, 1, 'f')
+ 2.87±0.01μs 5.24±0.01μs 1.83 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sqrt'>, 1, 1, 'f')
+ 2.88±0.01μs 5.25±0μs 1.82 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'reciprocal'>, 1, 1, 'f')
+ 3.66±0.01μs 6.10±0.01μs 1.67 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sqrt'>, 4, 1, 'f')
+ 4.14±0.02μs 6.86±0.01μs 1.66 bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 1, 'd')
+ 4.13±0.03μs 6.83±0.01μs 1.65 bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 1, 'd')
+ 3.66±0μs 5.99±0.01μs 1.64 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 4, 1, 'f')
+ 3.72±0μs 6.09±0.01μs 1.64 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'reciprocal'>, 4, 1, 'f')
+ 3.66±0μs 5.96±0μs 1.63 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 4, 1, 'f')
+ 3.66±0μs 5.97±0.01μs 1.63 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rint'>, 4, 1, 'f')
+ 3.66±0.01μs 5.97±0.01μs 1.63 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 4, 1, 'f')
+ 4.49±0.02μs 6.82±0.01μs 1.52 bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 1, 'e')
+ 4.49±0.02μs 6.83±0.03μs 1.52 bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 1, 'e')
+ 3.69±0.01μs 5.49±0μs 1.49 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 4, 1, 'f')
+ 4.61±0.07μs 6.84±0.01μs 1.48 bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 2, 'd')
+ 4.67±0.01μs 6.84±0.01μs 1.47 bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 2, 'e')
+ 4.66±0.01μs 6.84±0.01μs 1.47 bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 2, 'e')
+ 4.76±0.01μs 6.84±0.01μs 1.44 bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 2, 'd')
+ 3.70±0μs 5.34±0.01μs 1.44 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 4, 1, 'f')
+ 5.29±0.01μs 6.91±0.05μs 1.31 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 4, 1, 'e')
+ 5.26±0.01μs 6.85±0.01μs 1.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 1, 'd')
+ 5.26±0.01μs 6.84±0.01μs 1.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 1, 'e')
+ 5.27±0.01μs 6.84±0.02μs 1.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 2, 'e')
+ 5.27±0.01μs 6.85±0.02μs 1.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 1, 'd')
+ 5.27±0.03μs 6.84±0.01μs 1.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 1, 'e')
+ 5.26±0.01μs 6.83±0.01μs 1.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'e')
+ 5.28±0.01μs 6.84±0μs 1.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 1, 'd')
+ 5.27±0.01μs 6.86±0.02μs 1.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 1, 'e')
+ 5.26±0.01μs 6.85±0μs 1.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 2, 'e')
+ 5.29±0.01μs 6.85±0.01μs 1.29 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 1, 'e')
+ 5.29±0μs 6.85±0μs 1.29 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 2, 'e')
+ 5.29±0.02μs 6.83±0.01μs 1.29 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 1, 'e')
+ 5.29±0.02μs 6.85±0.02μs 1.29 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 2, 'e')
+ 5.30±0.01μs 6.84±0.01μs 1.29 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 4, 2, 'e')
+ 4.66±0.03μs 6.02±0.2μs 1.29 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 4, 1, 'd')
+ 4.61±0μs 5.85±0μs 1.27 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 4, 1, 'd')
+ 4.64±0.01μs 5.83±0.01μs 1.26 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 4, 1, 'd')
+ 4.63±0.02μs 5.86±0.01μs 1.26 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 4, 1, 'd')
+ 4.66±0.01μs 5.83±0.01μs 1.25 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 4, 1, 'd')
+ 4.67±0.02μs 5.83±0.01μs 1.25 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rint'>, 4, 1, 'd')
+ 4.35±0.02μs 5.29±0.01μs 1.22 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 1, 2, 'f')
+ 4.36±0μs 5.34±0.01μs 1.22 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 2, 'f')
+ 5.64±0.01μs 6.86±0μs 1.22 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 2, 'd')
+ 4.35±0μs 5.28±0.01μs 1.22 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sqrt'>, 1, 2, 'f')
+ 4.36±0μs 5.26±0μs 1.21 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 1, 2, 'f')
+ 5.66±0.01μs 6.85±0.01μs 1.21 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'd')
+ 4.36±0μs 5.27±0μs 1.21 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rint'>, 1, 2, 'f')
+ 4.35±0.01μs 5.25±0.01μs 1.21 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 1, 2, 'f')
+ 4.36±0.01μs 5.27±0μs 1.21 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 2, 'f')
+ 5.70±0.02μs 6.85±0.01μs 1.2 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 2, 'd')
+ 4.40±0.01μs 5.30±0μs 1.2 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'reciprocal'>, 1, 2, 'f')
+ 16.2±0.02μs 19.2±0μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'd')
+ 16.2±0.02μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'f')
+ 16.2±0.02μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 2, 'd')
+ 16.1±0.01μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 2, 'f')
+ 16.1±0.02μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 1, 'd')
+ 16.1±0.01μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 1, 'f')
+ 16.2±0.02μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 2, 'd')
+ 16.2±0.01μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 2, 'f')
+ 16.2±0.01μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'd')
+ 16.2±0.07μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'f')
+ 16.2±0.03μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 2, 'd')
+ 16.2±0.04μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 2, 'f')
+ 16.2±0.02μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 1, 'd')
+ 16.2±0.02μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 1, 'f')
+ 16.2±0.02μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 2, 'd')
+ 16.2±0.05μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 2, 'f')
+ 16.2±0.02μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 1, 'd')
+ 16.2±0.03μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 1, 'f')
+ 16.1±0.02μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 2, 'd')
+ 16.1±0.01μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 2, 'f')
+ 16.2±0.02μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 4, 1, 'd')
+ 16.2±0.02μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 4, 1, 'f')
+ 16.2±0.01μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 4, 2, 'd')
+ 16.2±0.01μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 4, 2, 'f')
+ 16.1±0.01μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'd')
+ 16.1±0.01μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'f')
+ 16.2±0.03μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 2, 'd')
+ 16.2±0.01μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 2, 'f')
+ 16.2±0.01μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 1, 'd')
+ 16.2±0.02μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 1, 'f')
+ 16.2±0.01μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 2, 'd')
+ 16.2±0.02μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 2, 'f')
+ 16.2±0.03μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'd')
+ 16.2±0.02μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'f')
+ 16.2±0.01μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 2, 'd')
+ 16.1±0μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 2, 'f')
+ 16.2±0.03μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 1, 'f')
+ 16.2±0.01μs 19.2±0.02μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 2, 'd')
+ 16.2±0.01μs 19.2±0.01μs 1.19 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 2, 'f')
+ 5.80±0.01μs 6.86±0.01μs 1.18 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 1, 'd')
+ 5.81±0.01μs 6.85±0.01μs 1.18 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 1, 'd')
+ 5.80±0.01μs 6.87±0.01μs 1.18 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 4, 1, 'd')
+ 16.4±0.1μs 19.3±0.03μs 1.17 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 1, 'd')
+ 4.39±0.02μs 4.88±0.4μs 1.11 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log10'>, 1, 1, 'e')
+ 59.6±0.3μs 65.3±0.08μs 1.1 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'reciprocal'>, 1, 1, 'e')
+ 59.5±0.03μs 65.3±0.06μs 1.1 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'reciprocal'>, 1, 2, 'e')
+ 59.6±0.02μs 65.5±0.1μs 1.1 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'reciprocal'>, 4, 2, 'e')
+ 59.6±0.09μs 65.3±0.04μs 1.09 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'reciprocal'>, 4, 1, 'e')
+ 7.00±0.03μs 7.53±0.5μs 1.08 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'e')
+ 5.38±0.02μs 5.69±0.02μs 1.06 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 1, 2, 'd')
+ 5.36±0μs 5.66±0.01μs 1.06 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 2, 'd')
+ 7.85±0.03μs 8.34±0.4μs 1.06 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log10'>, 1, 2, 'f')
+ 6.54±0.01μs 6.94±0.09μs 1.06 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 4, 2, 'd')
+ 5.36±0.01μs 5.68±0.02μs 1.06 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 2, 'd')
+ 6.51±0.03μs 6.87±0μs 1.05 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 2, 'd')
+ 3.75±0.02μs 3.94±0μs 1.05 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'exp2'>, 1, 1, 'e')
+ 5.38±0.01μs 5.66±0.01μs 1.05 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rint'>, 1, 2, 'd')
- 5.57±0.1μs 5.27±0μs 0.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 2, 'f')
- 67.9±0.1μs 64.5±0.08μs 0.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 1, 'e')
- 68.1±0.1μs 64.6±0.08μs 0.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 2, 'e')
- 67.9±0.08μs 64.6±0.04μs 0.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 4, 1, 'e')
- 68.1±0.09μs 64.5±0.05μs 0.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 4, 2, 'e')
- 67.9±0.1μs 64.6±0.1μs 0.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 1, 'e')
- 68.0±0.2μs 64.5±0.04μs 0.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 2, 'e')
- 68.1±0.6μs 64.5±0.03μs 0.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 4, 1, 'e')
- 68.0±0.07μs 64.5±0.05μs 0.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 4, 2, 'e')
- 5.61±0.2μs 5.25±0.01μs 0.94 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 1, 'f')
- 4.76±0.2μs 4.46±0.01μs 0.94 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log'>, 1, 1, 'e')
- 5.65±0.1μs 5.26±0.01μs 0.93 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 2, 'f')
- 72.0±0.3μs 65.9±0.07μs 0.92 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 1, 1, 'e')
- 71.9±0.2μs 65.8±0.09μs 0.92 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 1, 2, 'e')
- 71.1±0.04μs 65.4±0.09μs 0.92 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 4, 1, 'e')
- 71.2±0.02μs 65.2±0.05μs 0.92 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 4, 2, 'e')
- 3.27±0μs 2.96±0.02μs 0.91 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rint'>, 1, 1, 'd')
- 6.85±0.01μs 6.07±0.01μs 0.89 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 1, 2, 'e')
- 6.78±0.7μs 6.02±0.01μs 0.89 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cosh'>, 1, 1, 'e')
- 6.84±0.01μs 6.05±0.01μs 0.88 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 4, 1, 'e')
- 6.83±0.01μs 6.04±0.01μs 0.88 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 4, 2, 'e')
- 16.2±0.04μs 13.1±0.01μs 0.81 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 1, 1, 'e')
- 16.3±0.04μs 13.1±0μs 0.81 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 1, 2, 'e')
- 16.2±0.02μs 13.1±0.03μs 0.81 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 1, 'e')
- 16.2±0.06μs 13.1±0.01μs 0.81 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 2, 'e')
- 3.72±0.01μs 2.97±0μs 0.8 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 1, 1, 'd')
- 40.7±0.09μs 28.7±0.1μs 0.71 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 1, 2, 'e')
- 40.7±0.03μs 28.7±0.07μs 0.7 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 1, 1, 'e')
- 40.7±0.05μs 28.7±0.03μs 0.7 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 4, 1, 'e')
- 40.7±0.07μs 28.7±0.03μs 0.7 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 4, 2, 'e')

luyahan (Author) commented May 7, 2024

> Just a process point, I assume google/highway#2116 needs to be released before this can be merged?

Yes, google/highway#2116 has been merged. 😁

luyahan (Author) commented May 7, 2024

Change Before [a83d469] After [efd9879] Ratio Benchmark (Parameter)
+ 1.75±0.03μs 5.27±0μs 3.01 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 1, 'f')
+ 1.74±0.01μs 5.24±0μs 3.01 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 1, 'f')
+ 1.77±0μs 5.25±0.01μs 2.96 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 1, 1, 'f')
+ 1.77±0.01μs 5.23±0.01μs 2.95 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 1, 1, 'f')
+ 1.93±0μs 5.26±0.01μs 2.73 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rint'>, 1, 1, 'f')
+ 2.18±0.01μs 5.24±0μs 2.4 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 1, 1, 'f')

LoadU/StoreU may be the key reason for the performance degradation.

r-devulap (Member) commented May 7, 2024 on these lines:

namespace hn = hwy::HWY_NAMESPACE;

// Alternative to per-function HWY_ATTR: see HWY_BEFORE_NAMESPACE
#define SUPER(NAME, FUNC, IS_RECIP) \

Please refrain from using macros for functions this large. They are hard to read and will be a pain to debug. Could we make use of templates here?
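
To make the template suggestion concrete, here is a minimal sketch of how the three macro parameters (NAME, FUNC, IS_RECIP) could map onto a function template. It is plain scalar C++ rather than the PR's Highway code, and every name in it (`unary_loop`, `RintOp`, `ReciprocalOp`, `kIsRecip`, `kLanes`) is hypothetical. What IS_RECIP actually does in the PR is not visible in this excerpt; the sketch assumes it selects the value used to pad a partial block so that unused lanes never compute 1/0.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical op functors standing in for the macro's FUNC argument.
// kIsRecip plays the role of the macro's IS_RECIP flag (assumed meaning).
struct RintOp {
    static constexpr bool kIsRecip = false;
    template <typename T>
    T operator()(T x) const { return std::nearbyint(x); }
};

struct ReciprocalOp {
    static constexpr bool kIsRecip = true;
    template <typename T>
    T operator()(T x) const { return T(1) / x; }
};

// One function template replaces one macro expansion per operation.
// Strides are counted in elements. The fixed-size block is a scalar
// stand-in for a SIMD vector with kLanes lanes.
template <typename Op, typename T>
void unary_loop(const T* src, std::ptrdiff_t ssrc,
                T* dst, std::ptrdiff_t sdst,
                std::size_t len, Op op = Op{})
{
    constexpr std::size_t kLanes = 4;
    // Padding for the unused lanes of a partial block: 1 for reciprocal
    // (avoids dividing by zero), 0 otherwise.
    constexpr T kPad = Op::kIsRecip ? T(1) : T(0);

    for (std::size_t i = 0; i < len; i += kLanes) {
        const std::size_t n = (len - i < kLanes) ? (len - i) : kLanes;
        T block[kLanes];
        for (std::size_t l = 0; l < kLanes; ++l) {
            block[l] = (l < n)
                ? src[static_cast<std::ptrdiff_t>(i + l) * ssrc]
                : kPad;
        }
        for (std::size_t l = 0; l < kLanes; ++l) {
            block[l] = op(block[l]);
        }
        // Only the valid lanes are written back.
        for (std::size_t l = 0; l < n; ++l) {
            dst[static_cast<std::ptrdiff_t>(i + l) * sdst] = block[l];
        }
    }
}
```

A call such as `unary_loop<RintOp>(in, 1, out, 1, n)` then takes the place of a macro-generated function. Compiler errors and debugger stack frames point at real source lines, which is the readability concern raised above.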

r-devulap self-assigned this on May 8, 2024
Labels
component: SIMD Issues in SIMD (fast instruction sets) code or machinery
3 participants