RVV port for Base64 procedures #380

Open
WojciechMula opened this issue Mar 23, 2024 · 4 comments

@WojciechMula
Collaborator

I got inspired by @camel-cdr :) and started to work on the base64 procedures. The work in progress is in my repo https://github.com/WojciechMula/base64simd.

@camel-cdr
Contributor

camel-cdr commented Mar 23, 2024

Ah, you beat me to it :-) I was planning to start on that as well.

For encode I was planning to do something similar to the Haswell implementation, which is basically what you did.
I think it could be improved by using a higher LMUL and unrolling the vrgather into multiple LMUL=1 vrgathers.
That wouldn't be trivial for the expanding vrgather, but it should be possible with slightly overlapping LMUL=1 loads, or slides (e.g. load1: ptr, load2: ptr + 1*vlmax/4*3, load3: ptr + 2*vlmax/4*3, ...).
Your current code still assumes VLEN=128, right? (e.g. input += 3*4;)
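
To illustrate the overlapping-load idea, a rough, untested sketch (the helper name expand_4x_m1 and the index vector idx are placeholders, not code from the repo):

#include <riscv_vector.h>
#include <stdint.h>

// untested sketch: split the encoder's expanding gather into four LMUL=1 parts;
// idx stands for the expansion index pattern, which is the same for every part
static inline void expand_4x_m1(const uint8_t *ptr, vuint8m1_t idx, vuint8m1_t out[4])
{
    const size_t vlmax = __riscv_vsetvlmax_e8m1();  // bytes per LMUL=1 register
    const size_t step  = vlmax / 4 * 3;             // 3 input bytes expand to 4 output bytes

    for (size_t k = 0; k < 4; k++) {
        // slightly overlapping LMUL=1 loads: load k starts at ptr + k*vlmax/4*3,
        // and only its first 3*vlmax/4 bytes are referenced by idx
        vuint8m1_t part = __riscv_vle8_v_u8m1(ptr + k * step, vlmax);
        // each part gets its own LMUL=1 vrgather with the same index pattern,
        // instead of one cross-register LMUL=4 gather
        out[k] = __riscv_vrgather_vv_u8m1(part, idx, vlmax);
    }
}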

For decode, I'd try to use vcompress instead of vrgather as the last step of the pack; it's the less complex operation.

@WojciechMula
Collaborator Author

For decode, I'd try to use vcompress instead of vrgather as the last step of the pack; it's the less complex operation.

You have to take into account that we need to store data in big-endian order; this is why I used a gather. An alternative is to byte-swap the 4-byte inputs when the Zvbb extension is available.
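
A minimal, untested sketch of that Zvbb alternative (the helper name is made up; it assumes the vector bit-manipulation intrinsics are available and the -march string includes zvbb):

#include <riscv_vector.h>

// untested sketch: vrev8.v reverses the bytes within each SEW-sized element,
// so on 32-bit lanes it performs the per-word byte swap needed for the
// big-endian output order
static inline vuint32m4_t bswap32_zvbb(vuint32m4_t v, size_t vl)
{
    return __riscv_vrev8_v_u32m4(v, vl);
}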

@camel-cdr
Contributor

You have to take into account that we need to store data in big-endian order

I thought you could do the endianness swap with the initial vrgather, but now that I think about it a bit more, that doesn't line up the bits properly.

@camel-cdr
Contributor

@WojciechMula I tried rearranging the shifts to create the big-endian result, which can then be compressed:

// in32:   [00dddDDD|00cccCCC|00bbbBBB|00aaaAAA]
// d:      [00dddDDD|00000000|00000000|00000000] i&
// c1:     [0000dddD|DD00cccC|000000bb|bBBB00aa] i>>2,4 (i as 2x16)
// c:      [00000000|0000cccC|000000bb|00000000] c1&
// ca:     [CC000000|00000000|aaaAAA00|00000000] i<<14,10 (i as 2x16)
// b1:     [cCCC00bb|bBBB00aa|aAAA0000|00000000] i<<12
// b:      [00000000|bBBB0000|00000000|00000000] b1&
// dcba:   [CCdddDDD|bBBBcccC|aaaAAAbb|00000000] d|c|ca|b

This even requires one fewer shift, but reinterprets the input as 16-bit elements for two of the shifts.
Here is an untested mock implementation; I think you need to lower LMUL to 4 to avoid spills:

// outside of loop:
const size_t vl32m4 = __riscv_vsetvlmax_e32m4();
const size_t vl16m4 = __riscv_vsetvlmax_e16m4();
const size_t vl8m4  = __riscv_vsetvlmax_e8m4();
// per-element shift amounts for the 16-bit shifts, splatted as 32-bit constants
const vuint16m4_t v16_2_4   = __riscv_vreinterpret_v_u32m4_u16m4(__riscv_vmv_v_x_u32m4(0x00040002, vl32m4));
const vuint16m4_t v16_14_10 = __riscv_vreinterpret_v_u32m4_u16m4(__riscv_vmv_v_x_u32m4(0x000e000a, vl32m4));
// keep bytes 1..3 of every 32-bit word, drop the empty low byte
const vbool2_t mcompress = __riscv_vmsne_vx_u8m4_b2(__riscv_vand_vx_u8m4(__riscv_vid_v_u8m4(vl8m4), 3, vl8m4), 0, vl8m4);
...

const vuint8m4_t in8 = ...;
// in32:   [00dddDDD|00cccCCC|00bbbBBB|00aaaAAA]
const vuint32m4_t in32 = __riscv_vreinterpret_v_u8m4_u32m4(in8);
const vuint16m4_t in16 = __riscv_vreinterpret_v_u8m4_u16m4(in8);

const size_t vl = __riscv_vsetvlmax_e32m4();

// d:      [00dddDDD|00000000|00000000|00000000]
const vuint32m4_t d = __riscv_vand_vx_u32m4(in32, 0x3f000000, vl);

// c1:     [0000dddD|DD00cccC|000000bb|bBBB00aa]
const vuint32m4_t c1 = __riscv_vreinterpret_v_u16m4_u32m4(__riscv_vsrl_vv_u16m4(in16, v16_2_4, vl16m4));
// c:      [00000000|0000cccC|000000bb|00000000]
const vuint32m4_t c = __riscv_vand_vx_u32m4(c1, 0x000f0300, vl);

// ca:     [CC000000|00000000|aaaAAA00|00000000]
const vuint32m4_t ca = __riscv_vreinterpret_v_u16m4_u32m4(__riscv_vsll_vv_u16m4(in16, v16_14_10, vl16m4));

// b1:     [cCCC00bb|bBBB00aa|aAAA0000|00000000]
const vuint32m4_t b1 = __riscv_vsll_vx_u32m4(in32, 12, vl);
// b:      [00000000|bBBB0000|00000000|00000000]
const vuint32m4_t b = __riscv_vand_vx_u32m4(b1, 0x00f00000, vl);

// dcba:   [CCdddDDD|bBBBcccC|aaaAAAbb|00000000]
const vuint32m4_t dcba = __riscv_vor_vv_u32m4(__riscv_vor_vv_u32m4(d, c, vl), __riscv_vor_vv_u32m4(ca, b, vl), vl);

// pack the 3 payload bytes of each 32-bit word into a contiguous array
return __riscv_vcompress_vm_u8m4(__riscv_vreinterpret_v_u32m4_u8m4(dcba), mcompress, vl8m4);
