
icelake-like backend for RVV (RISC-V vector extension) #362

Open
camel-cdr opened this issue Jan 10, 2024 · 12 comments

@camel-cdr
Contributor

camel-cdr commented Jan 10, 2024

Hi, I've been working on RVV-native Unicode conversion routines, and have working, optimized implementations of validating utf8->utf32 and utf8->utf16, plus a partial utf16->utf8 (see the last part for benchmarks).
I'd like to upstream this to simdutf in a custom backend, similar to how the icelake one works.

For testing, I generate random valid utf32 input, convert it to the input format, randomly perform n random bit flips on it, and validate the output against the simdutf scalar implementation. Ideally, I'd also like to use coverage-guided fuzzing, but I haven't been able to get fuzzing working on RISC-V yet.

The code will be published to my RVV benchmark soon (it still needs some cleanup), hopefully with an associated article/blog post.

Edit: Here is the code: utf8_to_utf32/utf8_to_utf16, utf16_to_utf8

There are some open questions, though.

  1. What does one need to do/which files to touch/what tools are there, to add a new architecture to simdutf?

  2. Should we use the explicit intrinsics or the overloaded intrinsics?

    The overloaded intrinsics are IMO more readable and easier to refactor.
    From what I can tell, both are "mandated" by the RVV intrinsics spec. clang has supported them ever since it first supported the RVV intrinsics (clang 16 and above), but gcc currently doesn't, although upstream is working on it. I expect gcc will have support by the time RVV 1.0 hardware becomes more widely available. There is currently only one such board, the Kendryte K230, which is slowly being rolled out in batches.

  3. Which extensions should we target?

    I think we should orient ourselves by the RVA profiles and support only the standard V extension, i.e. 8- to 64-bit wide elements with VLEN >= 128 bits, and not subsets like Zve64x.
    Supporting Zvbb is also quite useful, as it has an endianness-swap instruction, but I think we should make this optional and detect support from the compiler settings.

  4. Can/should we assume fast vrgather and vcompress?

    RVV has two permutation instructions that currently vary widely in performance between processors:

vcompress.vm:

core     VLEN   e8m1   e8m2   e8m4   e8m8
c906      128    4      10     32    136
c908      128    4      10     32    139.4
c920      128    0.5    2.4    5.4   20.0
bobcat*   256   32      64    132    260
x280*     512   65     129    257    513

vrgather.vv:

core     VLEN   e8m1   e8m2   e8m4   e8m8
c906      128    4      16     64    256
c908      128    4      16     64.9  261.1
c920      128    0.5    2.4    8.0   32.0
bobcat*   256   68     132    260    516
x280*     512   65     129    257    513

    *bobcat: note that this is an open-source proof-of-concept core, and the authors explicitly stated that they didn't optimize the permutation instructions

    *x280: the numbers are from llvm-mca, but I was told they match reality. There is also supposed to be a vrgather fast path for vl<=256. I think they didn't have much incentive to make these fast, as the x280 mostly targets AI workloads.

    My code currently uses e8m1 vrgather and e8m2 vcompress, which works great on the C9xx cores, but not so great on the others. I suspect, however, that we'll see future desktop cores implement fast vcompress and at least a fast LMUL=1 vrgather.

    For one, vcompress implementations can be scaled up almost linearly with vector length, which doesn't seem to be true for vrgather without exploding the gate count (although admittedly I don't know much about hardware design). Secondly, vrgather used for 4-bit LUTs and in-lane shuffles will be among the most common operations, so vendors will need to optimize for those.

    For now, I wouldn't add gather-free implementations and performance measurements, but that might become necessary in the future if I'm wrong about this.
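To make the intrinsics question above concrete, here is a minimal sketch of the two styles. The function names are mine, purely for illustration, and the code is guarded so it only compiles to vector code on toolchains that ship the v1.0 RVV intrinsics:

```c
#include <stddef.h>

/* Guarded so the file still compiles on toolchains without the v1.0 RVV
 * intrinsics; the function names are illustrative, not from any real code. */
#if defined(__riscv_v_intrinsic) && __riscv_v_intrinsic >= 1000000
#include <riscv_vector.h>

/* Explicit style: element type and LMUL are encoded in the intrinsic name. */
static vuint8m1_t add_explicit(vuint8m1_t a, vuint8m1_t b, size_t vl) {
    return __riscv_vadd_vv_u8m1(a, b, vl);
}

/* Overloaded style: same operation, with the variant deduced from the
 * argument types, which makes changing element width or LMUL much easier. */
static vuint8m1_t add_overloaded(vuint8m1_t a, vuint8m1_t b, size_t vl) {
    return __riscv_vadd(a, b, vl);
}
#endif
```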

Benchmarks

Processors:

Implementations:

  • utf8_to_utf32/utf8_to_utf16: fast path for 1 byte, 1/2 byte, 1/2/3 byte, average > 2 bytes, general case

    Emoji-Lipsum could probably be artificially sped up by an all-4-byte case, but I don't think that is a realistic case to optimize for, so I left it out.

  • utf16_to_utf8: fast path for 1-byte output; the 1/2-byte output path consumes everything until a 3/4-byte output appears, which is then converted with scalar code until a 1/2-byte output is reached again.

    I plan on adding a vectorized 1/2/3 path, and maybe a 1/2/3/4 one, if I can figure it out.

Metric:

  • b/c is input bytes processed per cycle.
c908 utf8_to_utf32
lipsum/Latin-Lipsum.utf8.txt     scalar: 0.1292010 b/c  rvv 0.7918574 b/c  speedup: 6.1288759x
wm/english.utf8.txt              scalar: 0.1107906 b/c  rvv 0.6070963 b/c  speedup: 5.4796718x
lipsum/Arabic-Lipsum.utf8.txt    scalar: 0.0328398 b/c  rvv 0.1568164 b/c  speedup: 4.7751939x
lipsum/Russian-Lipsum.utf8.txt   scalar: 0.0333284 b/c  rvv 0.1573165 b/c  speedup: 4.7201892x
lipsum/Hebrew-Lipsum.utf8.txt    scalar: 0.0332853 b/c  rvv 0.1568517 b/c  speedup: 4.7123306x
wm/arabic.utf8.txt               scalar: 0.0481720 b/c  rvv 0.2215483 b/c  speedup: 4.5991074x
wm/russian.utf8.txt              scalar: 0.0455210 b/c  rvv 0.2010354 b/c  speedup: 4.4163192x
wm/greek.utf8.txt                scalar: 0.0483209 b/c  rvv 0.2132376 b/c  speedup: 4.4129416x
wm/hebrew.utf8.txt               scalar: 0.0436073 b/c  rvv 0.1914899 b/c  speedup: 4.3912301x
wm/turkish.utf8.txt              scalar: 0.0549728 b/c  rvv 0.2392147 b/c  speedup: 4.3515041x
wm/czech.utf8.txt                scalar: 0.0496503 b/c  rvv 0.2124387 b/c  speedup: 4.2786960x
wm/persan.utf8.txt               scalar: 0.0480962 b/c  rvv 0.1994732 b/c  speedup: 4.1473762x
wm/vietnamese.utf8.txt           scalar: 0.0425005 b/c  rvv 0.1761435 b/c  speedup: 4.1445023x
wm/french.utf8.txt               scalar: 0.0676433 b/c  rvv 0.2610237 b/c  speedup: 3.8588245x
wm/german.utf8.txt               scalar: 0.0817605 b/c  rvv 0.3151995 b/c  speedup: 3.8551556x
wm/esperanto.utf8.txt            scalar: 0.0805715 b/c  rvv 0.3071995 b/c  speedup: 3.8127556x
wm/portuguese.utf8.txt           scalar: 0.0722839 b/c  rvv 0.2748710 b/c  speedup: 3.8026557x
wm/korean.utf8.txt               scalar: 0.0496705 b/c  rvv 0.1627952 b/c  speedup: 3.2774992x
wm/hindi.utf8.txt                scalar: 0.0539742 b/c  rvv 0.1739320 b/c  speedup: 3.2225014x
wm/chinese.utf8.txt              scalar: 0.0539208 b/c  rvv 0.1691128 b/c  speedup: 3.1363154x
wm/japanese.utf8.txt             scalar: 0.0536440 b/c  rvv 0.1681587 b/c  speedup: 3.1347166x
wm/thai.utf8.txt                 scalar: 0.0576478 b/c  rvv 0.1801362 b/c  speedup: 3.1247700x
lipsum/Korean-Lipsum.utf8.txt    scalar: 0.0394845 b/c  rvv 0.1222531 b/c  speedup: 3.0962288x
lipsum/Hindi-Lipsum.utf8.txt     scalar: 0.0421478 b/c  rvv 0.1226263 b/c  speedup: 2.9094319x
lipsum/Japanese-Lipsum.utf8.txt  scalar: 0.0448010 b/c  rvv 0.1226832 b/c  speedup: 2.7383986x
lipsum/Chinese-Lipsum.utf8.txt   scalar: 0.0456330 b/c  rvv 0.1225884 b/c  speedup: 2.6863942x
lipsum/Emoji-Lipsum.utf8.txt     scalar: 0.0558446 b/c  rvv 0.1189647 b/c  speedup: 2.1302799x
c908 utf8_to_utf16
lipsum/Latin-Lipsum.utf8.txt     scalar: 0.1462973 b/c  rvv 1.0275230 b/c  speedup: 7.0235252x
wm/english.utf8.txt              scalar: 0.1275831 b/c  rvv 0.7338758 b/c  speedup: 5.7521362x
lipsum/Hebrew-Lipsum.utf8.txt    scalar: 0.0330693 b/c  rvv 0.1675394 b/c  speedup: 5.0663088x
lipsum/Arabic-Lipsum.utf8.txt    scalar: 0.0331370 b/c  rvv 0.1676699 b/c  speedup: 5.0598918x
lipsum/Russian-Lipsum.utf8.txt   scalar: 0.0331387 b/c  rvv 0.1674591 b/c  speedup: 5.0532761x
wm/arabic.utf8.txt               scalar: 0.0497569 b/c  rvv 0.2353216 b/c  speedup: 4.7294242x
wm/greek.utf8.txt                scalar: 0.0497033 b/c  rvv 0.2285679 b/c  speedup: 4.5986446x
wm/russian.utf8.txt              scalar: 0.0466324 b/c  rvv 0.2121076 b/c  speedup: 4.5484982x
wm/hebrew.utf8.txt               scalar: 0.0448840 b/c  rvv 0.2028331 b/c  speedup: 4.5190476x
wm/turkish.utf8.txt              scalar: 0.0587339 b/c  rvv 0.2584671 b/c  speedup: 4.4006435x
wm/czech.utf8.txt                scalar: 0.0528302 b/c  rvv 0.2278120 b/c  speedup: 4.3121470x
wm/persan.utf8.txt               scalar: 0.0496008 b/c  rvv 0.2126346 b/c  speedup: 4.2869173x
wm/vietnamese.utf8.txt           scalar: 0.0447605 b/c  rvv 0.1853099 b/c  speedup: 4.1400298x
wm/esperanto.utf8.txt            scalar: 0.0881123 b/c  rvv 0.3412668 b/c  speedup: 3.8730881x
wm/german.utf8.txt               scalar: 0.0905761 b/c  rvv 0.3502627 b/c  speedup: 3.8670545x
wm/french.utf8.txt               scalar: 0.0737802 b/c  rvv 0.2843437 b/c  speedup: 3.8539292x
wm/portuguese.utf8.txt           scalar: 0.0791921 b/c  rvv 0.3004463 b/c  speedup: 3.7938890x
wm/korean.utf8.txt               scalar: 0.0522578 b/c  rvv 0.1727464 b/c  speedup: 3.3056579x
wm/hindi.utf8.txt                scalar: 0.0563662 b/c  rvv 0.1848405 b/c  speedup: 3.2792800x
lipsum/Korean-Lipsum.utf8.txt    scalar: 0.0399847 b/c  rvv 0.1300402 b/c  speedup: 3.2522438x
wm/thai.utf8.txt                 scalar: 0.0600823 b/c  rvv 0.1928356 b/c  speedup: 3.2095216x
wm/japanese.utf8.txt             scalar: 0.0560357 b/c  rvv 0.1775926 b/c  speedup: 3.1692714x
wm/chinese.utf8.txt              scalar: 0.0565430 b/c  rvv 0.1788198 b/c  speedup: 3.1625422x
lipsum/Hindi-Lipsum.utf8.txt     scalar: 0.0424079 b/c  rvv 0.1302720 b/c  speedup: 3.0718763x
lipsum/Japanese-Lipsum.utf8.txt  scalar: 0.0448987 b/c  rvv 0.1302059 b/c  speedup: 2.8999905x
lipsum/Chinese-Lipsum.utf8.txt   scalar: 0.0457254 b/c  rvv 0.1301323 b/c  speedup: 2.8459495x
lipsum/Emoji-Lipsum.utf8.txt     scalar: 0.0522199 b/c  rvv 0.0831130 b/c  speedup: 1.5915968x
c908 utf16_to_utf8
lipsum/Russian-Lipsum.utf16.txt  scalar: 0.0445853 b/c  rvv: 0.2163190 b/c  speedup: 4.8517938x
lipsum/Arabic-Lipsum.utf16.txt   scalar: 0.0449275 b/c  rvv: 0.2153480 b/c  speedup: 4.7932246x
lipsum/Hebrew-Lipsum.utf16.txt   scalar: 0.0448793 b/c  rvv: 0.2136721 b/c  speedup: 4.7610340x
lipsum/Latin-Lipsum.utf16.txt    scalar: 0.1028043 b/c  rvv: 0.4050746 b/c  speedup: 3.9402459x
wm/greek.utf16.txt               scalar: 0.0718830 b/c  rvv: 0.2716538 b/c  speedup: 3.7791077x
wm/russian.utf16.txt             scalar: 0.0688957 b/c  rvv: 0.2488844 b/c  speedup: 3.6124804x
wm/arabic.utf16.txt              scalar: 0.0721544 b/c  rvv: 0.2600413 b/c  speedup: 3.6039526x
wm/hebrew.utf16.txt              scalar: 0.0682180 b/c  rvv: 0.2447910 b/c  speedup: 3.5883632x
wm/esperanto.utf16.txt           scalar: 0.0963212 b/c  rvv: 0.3292119 b/c  speedup: 3.4178546x
wm/persan.utf16.txt              scalar: 0.0726062 b/c  rvv: 0.2366135 b/c  speedup: 3.2588582x
wm/english.utf16.txt             scalar: 0.1015669 b/c  rvv: 0.3270337 b/c  speedup: 3.2198835x
wm/german.utf16.txt              scalar: 0.0975311 b/c  rvv: 0.3023158 b/c  speedup: 3.0996865x
wm/portuguese.utf16.txt          scalar: 0.0962536 b/c  rvv: 0.2863991 b/c  speedup: 2.9754628x
wm/french.utf16.txt              scalar: 0.0952526 b/c  rvv: 0.2773457 b/c  speedup: 2.9116858x
wm/czech.utf16.txt               scalar: 0.0872352 b/c  rvv: 0.2453764 b/c  speedup: 2.8128122x
wm/turkish.utf16.txt             scalar: 0.0894998 b/c  rvv: 0.2483814 b/c  speedup: 2.7752177x
wm/thai.utf16.txt                scalar: 0.0742528 b/c  rvv: 0.1800184 b/c  speedup: 2.4243965x
wm/japanese.utf16.txt            scalar: 0.0750324 b/c  rvv: 0.1785757 b/c  speedup: 2.3799792x
lipsum/Chinese-Lipsum.utf16.txt  scalar: 0.0422231 b/c  rvv: 0.0993063 b/c  speedup: 2.3519384x
wm/vietnamese.utf16.txt          scalar: 0.0796325 b/c  rvv: 0.1822895 b/c  speedup: 2.2891332x
wm/chinese.utf16.txt             scalar: 0.0781047 b/c  rvv: 0.1772665 b/c  speedup: 2.2695999x
lipsum/Japanese-Lipsum.utf16.txt scalar: 0.0424322 b/c  rvv: 0.0920647 b/c  speedup: 2.1696894x
wm/hindi.utf16.txt               scalar: 0.0716071 b/c  rvv: 0.1415199 b/c  speedup: 1.9763381x
wm/korean.utf16.txt              scalar: 0.0742212 b/c  rvv: 0.1447335 b/c  speedup: 1.9500284x
lipsum/Emoji-Lipsum.utf16.txt    scalar: 0.0560671 b/c  rvv: 0.1017256 b/c  speedup: 1.8143532x
lipsum/Hindi-Lipsum.utf16.txt    scalar: 0.0423512 b/c  rvv: 0.0653709 b/c  speedup: 1.5435430x
lipsum/Korean-Lipsum.utf16.txt   scalar: 0.0431370 b/c  rvv: 0.0527462 b/c  speedup: 1.2227593x
c920 utf8_to_utf32
lipsum/Latin-Lipsum.utf8.txt     scalar: 0.1983016 b/c  rvv 1.6172459 b/c  speedup: 8.1554844x
wm/english.utf8.txt              scalar: 0.1787050 b/c  rvv 0.9249580 b/c  speedup: 5.1758932x
wm/greek.utf8.txt                scalar: 0.0720639 b/c  rvv 0.3620777 b/c  speedup: 5.0243981x
lipsum/Arabic-Lipsum.utf8.txt    scalar: 0.0489671 b/c  rvv 0.2433533 b/c  speedup: 4.9697240x
lipsum/Hebrew-Lipsum.utf8.txt    scalar: 0.0484946 b/c  rvv 0.2363269 b/c  speedup: 4.8732567x
lipsum/Russian-Lipsum.utf8.txt   scalar: 0.0501499 b/c  rvv 0.2380047 b/c  speedup: 4.7458662x
wm/czech.utf8.txt                scalar: 0.0720776 b/c  rvv 0.3390725 b/c  speedup: 4.7042668x
wm/hebrew.utf8.txt               scalar: 0.0636149 b/c  rvv 0.2747983 b/c  speedup: 4.3197121x
wm/turkish.utf8.txt              scalar: 0.0806716 b/c  rvv 0.3274432 b/c  speedup: 4.0589630x
wm/esperanto.utf8.txt            scalar: 0.1170809 b/c  rvv 0.4577151 b/c  speedup: 3.9093893x
wm/arabic.utf8.txt               scalar: 0.0714500 b/c  rvv 0.2772353 b/c  speedup: 3.8801297x
wm/persan.utf8.txt               scalar: 0.0704970 b/c  rvv 0.2690839 b/c  speedup: 3.8169557x
wm/russian.utf8.txt              scalar: 0.0683159 b/c  rvv 0.2570850 b/c  speedup: 3.7631801x
wm/german.utf8.txt               scalar: 0.1248275 b/c  rvv 0.4611884 b/c  speedup: 3.6946062x
wm/vietnamese.utf8.txt           scalar: 0.0612176 b/c  rvv 0.2055558 b/c  speedup: 3.3577844x
wm/korean.utf8.txt               scalar: 0.0727457 b/c  rvv 0.2360132 b/c  speedup: 3.2443591x
wm/portuguese.utf8.txt           scalar: 0.1077901 b/c  rvv 0.3433450 b/c  speedup: 3.1853110x
wm/japanese.utf8.txt             scalar: 0.0822007 b/c  rvv 0.2396279 b/c  speedup: 2.9151562x
wm/french.utf8.txt               scalar: 0.0996538 b/c  rvv 0.2892530 b/c  speedup: 2.9025785x
wm/hindi.utf8.txt                scalar: 0.0828941 b/c  rvv 0.2307050 b/c  speedup: 2.7831279x
lipsum/Korean-Lipsum.utf8.txt    scalar: 0.0581741 b/c  rvv 0.1554148 b/c  speedup: 2.6715442x
wm/chinese.utf8.txt              scalar: 0.0817867 b/c  rvv 0.2103001 b/c  speedup: 2.5713221x
lipsum/Hindi-Lipsum.utf8.txt     scalar: 0.0674511 b/c  rvv 0.1558572 b/c  speedup: 2.3106693x
wm/thai.utf8.txt                 scalar: 0.0933146 b/c  rvv 0.2127180 b/c  speedup: 2.2795790x
lipsum/Japanese-Lipsum.utf8.txt  scalar: 0.0739905 b/c  rvv 0.1558918 b/c  speedup: 2.1069166x
lipsum/Chinese-Lipsum.utf8.txt   scalar: 0.0762008 b/c  rvv 0.1563651 b/c  speedup: 2.0520142x
lipsum/Emoji-Lipsum.utf8.txt     scalar: 0.0956014 b/c  rvv 0.1901396 b/c  speedup: 1.9888773x
c920 utf8_to_utf16
lipsum/Latin-Lipsum.utf8.txt     scalar: 0.2109710 b/c  rvv 2.2189945 b/c  speedup: 10.518002x
wm/english.utf8.txt              scalar: 0.1827197 b/c  rvv 1.4220564 b/c  speedup: 7.7827185x
wm/greek.utf8.txt                scalar: 0.0755349 b/c  rvv 0.3727045 b/c  speedup: 4.9341973x
wm/czech.utf8.txt                scalar: 0.0755292 b/c  rvv 0.3633922 b/c  speedup: 4.8112804x
lipsum/Hebrew-Lipsum.utf8.txt    scalar: 0.0494750 b/c  rvv 0.2242518 b/c  speedup: 4.5326243x
lipsum/Russian-Lipsum.utf8.txt   scalar: 0.0509461 b/c  rvv 0.2299110 b/c  speedup: 4.5128233x
lipsum/Arabic-Lipsum.utf8.txt    scalar: 0.0497147 b/c  rvv 0.2216386 b/c  speedup: 4.4582093x
wm/arabic.utf8.txt               scalar: 0.0744813 b/c  rvv 0.3212577 b/c  speedup: 4.3132627x
wm/hebrew.utf8.txt               scalar: 0.0661844 b/c  rvv 0.2759717 b/c  speedup: 4.1697346x
wm/esperanto.utf8.txt            scalar: 0.1263762 b/c  rvv 0.5216554 b/c  speedup: 4.1277957x
wm/german.utf8.txt               scalar: 0.1296940 b/c  rvv 0.5333178 b/c  speedup: 4.1121239x
wm/turkish.utf8.txt              scalar: 0.0847053 b/c  rvv 0.3365346 b/c  speedup: 3.9730052x
wm/russian.utf8.txt              scalar: 0.0708612 b/c  rvv 0.2807201 b/c  speedup: 3.9615460x
wm/portuguese.utf8.txt           scalar: 0.1171257 b/c  rvv 0.4517186 b/c  speedup: 3.8566995x
wm/persan.utf8.txt               scalar: 0.0742834 b/c  rvv 0.2688770 b/c  speedup: 3.6196109x
wm/vietnamese.utf8.txt           scalar: 0.0642606 b/c  rvv 0.2283996 b/c  speedup: 3.5542678x
wm/french.utf8.txt               scalar: 0.1070867 b/c  rvv 0.3670641 b/c  speedup: 3.4277275x
wm/korean.utf8.txt               scalar: 0.0765637 b/c  rvv 0.2563889 b/c  speedup: 3.3486993x
wm/hindi.utf8.txt                scalar: 0.0857724 b/c  rvv 0.2719399 b/c  speedup: 3.1704799x
wm/japanese.utf8.txt             scalar: 0.0866018 b/c  rvv 0.2630760 b/c  speedup: 3.0377647x
lipsum/Korean-Lipsum.utf8.txt    scalar: 0.0596959 b/c  rvv 0.1592920 b/c  speedup: 2.6683889x
wm/chinese.utf8.txt              scalar: 0.0855446 b/c  rvv 0.2223617 b/c  speedup: 2.5993654x
wm/thai.utf8.txt                 scalar: 0.0963939 b/c  rvv 0.2377943 b/c  speedup: 2.4669006x
lipsum/Hindi-Lipsum.utf8.txt     scalar: 0.0700269 b/c  rvv 0.1601953 b/c  speedup: 2.2876225x
lipsum/Japanese-Lipsum.utf8.txt  scalar: 0.0772785 b/c  rvv 0.1603533 b/c  speedup: 2.0750034x
lipsum/Chinese-Lipsum.utf8.txt   scalar: 0.0797070 b/c  rvv 0.1608991 b/c  speedup: 2.0186326x
lipsum/Emoji-Lipsum.utf8.txt     scalar: 0.0923569 b/c  rvv 0.1242158 b/c  speedup: 1.3449541x
@clausecker
Collaborator

This is very cool! Looking forward to seeing your code.

How do you adapt the pdep / pext bits on the mask registers? Does RVV have something like that on masks? Otherwise I suppose you'd run into problems if the vector length exceeds 64 bytes.

I plan on adding a 1/2/3 vectorized path, and maybe an 1/2/3/4, if I can figure it out.

I suppose this is due to the vpmultishiftqb instruction used? You should be able to replace it with a bunch of fixed shifts / bitfield instructions. Not exactly sure which ones though.

@lemire
Member

lemire commented Jan 10, 2024

Fantastic. These are very impressive numbers.

What does one need to do/which files to touch/what tools are there, to add a new architecture to simdutf?

Firstly, we have an empty architecture, ppc64, that currently falls back entirely on scalar code. You can just copy that, so take src/ppc64 and copy it to, say, src/myarch.

Next, you need macros to recognize the target at compile time.

Start with simdutf.cpp, where we have:


#if SIMDUTF_IMPLEMENTATION_ARM64
#include "arm64/implementation.cpp"
#endif
#if SIMDUTF_IMPLEMENTATION_FALLBACK
#include "fallback/implementation.cpp"
#endif
#if SIMDUTF_IMPLEMENTATION_ICELAKE
#include "icelake/implementation.cpp"
#endif
#if SIMDUTF_IMPLEMENTATION_HASWELL
#include "haswell/implementation.cpp"
#endif
#if SIMDUTF_IMPLEMENTATION_PPC64
#include "ppc64/implementation.cpp"
#endif
#if SIMDUTF_IMPLEMENTATION_WESTMERE
#include "westmere/implementation.cpp"
#endif

Right?

So you need something like SIMDUTF_IMPLEMENTATION_MYARCH. That would be the first step. Note that SIMDUTF_IMPLEMENTATION_MYARCH should be 0 or 1. And it should be 0 if the compiler can't work with this architecture (for whatever reason).
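A rough sketch of what such a gate could look like for RVV; the macro name and version test are my assumptions, not simdutf's actual code:

```c
/* Hypothetical sketch of a backend gate for an RVV kernel. It evaluates to 1
 * only when the target is RISC-V and the toolchain ships the v1.0 RVV
 * intrinsics, and to 0 otherwise, so the backend compiles out cleanly on
 * toolchains that can't handle it. */
#ifndef SIMDUTF_IMPLEMENTATION_RVV
  #if defined(__riscv) && defined(__riscv_v_intrinsic) && \
      __riscv_v_intrinsic >= 1000000
    #define SIMDUTF_IMPLEMENTATION_RVV 1
  #else
    #define SIMDUTF_IMPLEMENTATION_RVV 0
  #endif
#endif
```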

Should we use the explicit intrinsics or the overloaded intrinsics?

It is acceptable to have constraints on the compiler. For example, not all compilers will allow us to build the icelake kernel.

But you need a compile-time way to check support for the overloaded intrinsics, evidently.

Which extensions should we target?

For practical reasons, you want a risc-v binary to run without crashing if the extension is missing. So anything that is not required by risc-v may need a runtime check. This runtime check can be moderately expensive because its result is cached (the check is typically done just once).
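The cached runtime check described here can be sketched roughly as follows; the probe function and names are hypothetical, and a real implementation would need thread-safe initialization (e.g. a C++ function-local static or an atomic):

```c
#include <stdbool.h>

/* Hypothetical probe; a real one might issue the hwprobe syscall or parse
 * /proc/cpuinfo. Hardcoded here so the sketch stays self-contained. */
static bool probe_vector_support(void) {
    return false;
}

/* Cached wrapper: the expensive probe runs at most once, and subsequent
 * calls return the cached result. (Not thread-safe as written; a C++
 * function-local static would give the one-time guarantee for free.) */
static bool has_vector_support(void) {
    static int cached = -1; /* -1 means "not probed yet" */
    if (cached < 0)
        cached = probe_vector_support() ? 1 : 0;
    return cached == 1;
}
```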

You must be concerned with software engineering: supporting multiple kernels is expensive, not only in coding time, but also in bug fixing and testing. With x64, we have millions of users (simdutf is part of Node.js, so it is everywhere), but it is not so with risc-v. So I recommend reducing the code surface.

Note that it is possible to do things in stages, and add support for extensions later.

Can/should we assume fast vrgather and vcompress?

Unfortunately, an ISA does not come with strict performance guarantees. For example, aarch64 processors can vary quite a bit. For icelake, the trick we use is to require VBMI2. Even where we do not need VBMI2 itself, we know that if this extension is present, the processor is recent enough that the AVX-512 instructions don't have too many gotchas. But that's easy because we know the market in detail.

In practice, there is often some dominance in the market: most processors in use are X, Y, Z, and you optimize for X, Y, Z while knowing that performance may be lower on other processors.

I'm afraid there is no fool-proof approach, but you can document your expectations.

@lemire
Member

lemire commented Jan 10, 2024

@camel-cdr I am getting the impression that we'll have at least one code reviewer (@clausecker). :-)

@camel-cdr
Contributor Author

@lemire

Firstly, we have an empty architecture, ppc64, that currently falls back entirely on scalar code. You can just copy that, so take src/ppc64 and copy it to, say, src/myarch.

👍

But you need a compile-time way to check support for overloaded intrinsic, evidently

From what I can tell, checking for version 1.0 support with __riscv_v_intrinsic >= 1000000 should imply that the overloaded intrinsics are supported.
There used to be a separate __riscv_v_intrinsic_overloading macro, which seems to have been removed from the spec, but clang and gcc currently set it in line with their actual support (that is, clang sets it and gcc doesn't).
I've opened an issue to confirm whether this is true: riscv-non-isa/rvv-intrinsic-doc#310

For practical reasons, you want a risc-v binary to run without crashing if the extension is missing. So anything that is not required by risc-v may need a runtime check.

I'm not sure what the correct way to check for it is, but hwprobe looks like it would work for checking V support. Zvbb, however, would currently need to be parsed from the ISA string in /proc/cpuinfo, or detected by catching an illegal-instruction signal; hwprobe will probably support this check in the future.
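Parsing the ISA string for an extension could be sketched like this. The helper name is mine, and the format assumptions follow conventional riscv,isa strings such as "rv64imafdcv_zicsr_zvbb" (single-letter extensions right after the prefix, multi-letter ones underscore-separated):

```c
#include <stdbool.h>
#include <string.h>

/* Sketch (my own helper, not simdutf code): check a RISC-V ISA string, as
 * found in /proc/cpuinfo, for an extension. Single-letter extensions follow
 * the "rv32"/"rv64" prefix; multi-letter ones (e.g. "zvbb") are appended as
 * underscore-separated suffixes. */
static bool isa_has_ext(const char *isa, const char *ext) {
    size_t n = strlen(ext);
    if (strncmp(isa, "rv32", 4) != 0 && strncmp(isa, "rv64", 4) != 0)
        return false;
    if (n == 1) {
        /* scan the single-letter extensions, which end at the first '_' */
        for (const char *p = isa + 4; *p != '\0' && *p != '_'; p++)
            if (*p == ext[0])
                return true;
        return false;
    }
    /* multi-letter: look for "_ext" delimited by '_' or the end of string */
    for (const char *p = strchr(isa, '_'); p != NULL; p = strchr(p + 1, '_')) {
        if (strncmp(p + 1, ext, n) == 0 && (p[n + 1] == '\0' || p[n + 1] == '_'))
            return true;
    }
    return false;
}
```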

@clausecker

How do you adapt the pdep / pext bits on the mask registers?

I'm not sure what pdep/pext is used for in the current implementations.
In the general case of utf8 to utf32, after validating with the "Validating UTF-8 In Less Than One Instruction Per Byte" approach, I extract every nth byte of each character into standalone registers with vcompress, remove the prefixes, then recombine into utf32.
For utf8 to utf16, I do the same as above, but select between that and something quite close to this, and vcompress afterward to obtain utf16.

BTW, here is a great resource to get an overview of the supported instructions: https://github.com/dzaima/intrinsics-viewer

@clausecker
Collaborator

I'm not sure what pdep/pext is used for in the current implementations.

It's used to translate the various masks between the UTF-8 and the UTF-16 space, mainly for validation, but also for other things. There may be workarounds, but I suppose you would have mentioned it if you had redesigned the algorithm to no longer need this.
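For readers unfamiliar with these instructions: pext gathers the bits of a word selected by a mask into contiguous low-order bits (pdep is the inverse scatter). A portable scalar fallback, as a rough sketch, looks like:

```c
#include <stdint.h>

/* Portable fallback for x86 PEXT: collect the bits of x at positions where
 * mask is set, packed into the low-order bits of the result. */
static uint64_t pext_u64_fallback(uint64_t x, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; mask &= mask - 1, bit <<= 1) {
        if (x & mask & -mask) /* test x at the lowest remaining mask bit */
            out |= bit;
    }
    return out;
}
```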

In the general case of utf8 to utf32, after validating with the "Validating UTF-8 In Less Than One Instruction Per Byte" approach, I extract all nth bytes of a character into standalone registers with vcompress, remove prefixes, then recombine to utf32.

This sounds like you're doing a different algorithm from the Icelake kernel? The algorithm used by the Icelake kernel is explained in detail in our paper. It does validation as a part of the transcoding process, but with a different approach from the previous Keiser et al. paper.

@lemire
Member

lemire commented Jan 11, 2024

I'm not sure what the correct way to check for it is, but hwprobe looks like it would work for checking V support.

So we are assuming that RISC-V runs on Linux. Is that true?

Nothing in simdutf assumes Linux thus far.

@lemire
Member

lemire commented Jan 11, 2024

@camel-cdr You referred to icelake in your issue, and the icelake kernel is different, in part because it benefits from compress instructions (VBMI2).

@camel-cdr
Contributor Author

camel-cdr commented Jan 11, 2024

@lemire

So we are assuming that RISC-V runs on Linux. Is that true?

It certainly doesn't run on Windows today. I think FreeBSD, for example, also has support, but I'd focus on Linux for now. Edit: and, come to think of it, Android as well.

You referred to icelake in your issue, and the icelake kernel is different, in part because it benefits from compress instructions (VBMI2).

I was referring to icelake because it uses a fully standalone implementation. The other backends partially use the simd8 API and the generic/ code, which doesn't work for scalable vectors.
Sorry for the confusion.

Anyways, I've uploaded the code with explicit intrinsics to my github, and will look into creating a public simdutf dev branch soon:

utf8_to_utf32/utf8_to_utf16

utf16_to_utf8

@davidlt

davidlt commented Jan 27, 2024

Hi! Regarding the hwprobe syscall on Linux: RISCV_HWPROBE_EXT_ZVBB is part of the v6.8 kernel (already in v6.8-rc1).

T-HEAD recently added a new vendor extension (XTheadVector) and explained the differences between v0.7.1 (the non-ratified vector extension version) and theirs. See:
https://github.com/T-head-Semi/thead-extension-spec/blob/master/xtheadvector.adoc

There was an agreement to support this upstream (unfortunately, there is too much hardware with it floating around to avoid it). IIRC this lands in GCC 14 (or at least they will try to make that release). I think it got merged 7+ days ago, and there is also a separate intrinsics header (riscv_th_vector.h) for it.

@camel-cdr
Contributor Author

@davidlt That's great.

I was aware of the XTheadVector patches, good to hear that they are now merged.

PS: the article I mentioned is now done: https://camel-cdr.github.io/rvv-bench-results/articles/vector-utf.html

@camel-cdr
Contributor Author

So I've run into a bit of a predicament.

If I understood it correctly, the current behavior for x86 is to compile all backends using SIMDUTF_TARGET_REGION, even if they aren't enabled in the compiler settings.

gcc and clang support __attribute__((target("arch=+v"))), but neither allows you to include <riscv_vector.h> under it, even if you pragma-push it globally. You need to enable V in the compiler flags to be able to include the header, which makes the hwprobe detection irrelevant, because other parts of the code could then get autovectorized.

For now, I'll only enable the rvv backend if it's explicitly compiled for rvv.

@lemire
Member

lemire commented Feb 18, 2024

Your understanding is correct. We proceed in this manner because we want to make the library available in single-header form, without making assumptions about the build system.
