RVV port for Base64 procedures #380

Open
WojciechMula opened this issue Mar 23, 2024 · 4 comments

@WojciechMula
Collaborator

I got inspired by @camel-cdr :) and started to work on the base64 procedures. The work in progress is in my repo https://github.com/WojciechMula/base64simd.

@camel-cdr
Contributor

camel-cdr commented Mar 23, 2024

Ah, you beat me to it :-) I was planning to start on that as well.

For encode I was planning to do something similar to the Haswell implementation, which is basically what you did.
I think it could be improved by using a higher LMUL and unrolling the vrgather into multiple LMUL=1 vrgathers.
That wouldn't be trivial for the expanding vrgather, but it should be possible with slightly overlapping LMUL=1 loads, or slides (e.g. load1: ptr, load2: ptr + 1*vlmax/4*3, load3: ptr + 2*vlmax/4*3, ...).
Your current code still assumes VLEN=128, right? (e.g. input += 3*4;)
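
To illustrate the overlapping-load idea, a rough, untested sketch (the helper name expand_4x_m1 and the index vector idx are placeholders, not code from the repo):

#include <riscv_vector.h>
#include <stdint.h>

// untested sketch: split the encoder's expanding gather into four LMUL=1 parts;
// idx stands for the expansion index pattern, which is the same for every part
static inline void expand_4x_m1(const uint8_t *ptr, vuint8m1_t idx, vuint8m1_t out[4])
{
    const size_t vlmax = __riscv_vsetvlmax_e8m1();  // bytes per LMUL=1 register
    const size_t step  = vlmax / 4 * 3;             // 3 input bytes expand to 4 output bytes

    for (size_t k = 0; k < 4; k++) {
        // slightly overlapping LMUL=1 loads: load k starts at ptr + k*vlmax/4*3,
        // and only its first 3*vlmax/4 bytes are referenced by idx
        vuint8m1_t part = __riscv_vle8_v_u8m1(ptr + k * step, vlmax);
        // each part gets its own LMUL=1 vrgather with the same index pattern,
        // instead of one cross-register LMUL=4 gather
        out[k] = __riscv_vrgather_vv_u8m1(part, idx, vlmax);
    }
}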

For decode, I'd try to use vcompress instead of vrgather as the last step of the pack; it's the less complex operation.

@WojciechMula
Collaborator Author

For decode, I'd try to use vcompress instead of vrgather as the last step of the pack; it's the less complex operation.

You have to take into account that we need to store data in big-endian order; this is why I used a gather. An alternative is to byte-swap the 4-byte inputs when the Zvbb extension is available.
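
A minimal, untested sketch of that Zvbb alternative (the helper name is made up; it assumes the vector bit-manipulation intrinsics are available and the -march string includes zvbb):

#include <riscv_vector.h>

// untested sketch: vrev8.v reverses the bytes within each SEW-sized element,
// so on 32-bit lanes it performs the per-word byte swap needed for the
// big-endian output order
static inline vuint32m4_t bswap32_zvbb(vuint32m4_t v, size_t vl)
{
    return __riscv_vrev8_v_u32m4(v, vl);
}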

@camel-cdr
Contributor

You have to take into account that we need to store data in big-endian order

I thought you could do the endianness swap with the initial vrgather, but now that I think about it a bit more, that doesn't line up the bits properly.

@camel-cdr
Contributor

@WojciechMula I tried rearranging the shifts to create the big-endian result, which can then be compressed:

// in32:   [00dddDDD|00cccCCC|00bbbBBB|00aaaAAA]
// d:      [00dddDDD|00000000|00000000|00000000] i&
// c1:     [0000dddD|DD00cccC|000000bb|bBBB00aa] i>>2,4 (i as 2x16)
// c:      [00000000|0000cccC|000000bb|00000000] c1&
// ca:     [CC000000|00000000|aaaAAA00|00000000] i<<14,10 (i as 2x16)
// b1:     [cCCC00bb|bBBB00aa|aAAA0000|00000000] i<<12
// b:      [00000000|bBBB0000|00000000|00000000] b1&
// dcba:   [CCdddDDD|bBBBcccC|aaaAAAbb|00000000] d|c|ca|b

This even requires one fewer shift, but reinterprets the input as 16-bit elements for two of the shifts.
Here is an untested mock implementation; I think you need to lower LMUL to 4 to avoid spills:

// outside of loop:
const size_t vl32m4 = __riscv_vsetvlmax_e32m4();
const size_t vl16m4 = __riscv_vsetvlmax_e16m4();
const size_t vl8m4  = __riscv_vsetvlmax_e8m4();
// per-element shift amounts for the 16-bit shifts, splatted as 32-bit constants
const vuint16m4_t v16_2_4   = __riscv_vreinterpret_v_u32m4_u16m4(__riscv_vmv_v_x_u32m4(0x00040002, vl32m4));
const vuint16m4_t v16_14_10 = __riscv_vreinterpret_v_u32m4_u16m4(__riscv_vmv_v_x_u32m4(0x000e000a, vl32m4));
// keep bytes 1..3 of every 32-bit word, drop the empty low byte
const vbool2_t mcompress = __riscv_vmsne_vx_u8m4_b2(__riscv_vand_vx_u8m4(__riscv_vid_v_u8m4(vl8m4), 3, vl8m4), 0, vl8m4);
...

const vuint8m4_t in8 = ...;
// in32:   [00dddDDD|00cccCCC|00bbbBBB|00aaaAAA]
const vuint32m4_t in32 = __riscv_vreinterpret_v_u8m4_u32m4(in8);
const vuint16m4_t in16 = __riscv_vreinterpret_v_u8m4_u16m4(in8);

const size_t vl = __riscv_vsetvlmax_e32m4();

// d:      [00dddDDD|00000000|00000000|00000000]
const vuint32m4_t d = __riscv_vand_vx_u32m4(in32, 0x3f000000, vl);

// c1:     [0000dddD|DD00cccC|000000bb|bBBB00aa]
const vuint32m4_t c1 = __riscv_vreinterpret_v_u16m4_u32m4(__riscv_vsrl_vv_u16m4(in16, v16_2_4, vl16m4));
// c:      [00000000|0000cccC|000000bb|00000000]
const vuint32m4_t c = __riscv_vand_vx_u32m4(c1, 0x000f0300, vl);

// ca:     [CC000000|00000000|aaaAAA00|00000000]
const vuint32m4_t ca = __riscv_vreinterpret_v_u16m4_u32m4(__riscv_vsll_vv_u16m4(in16, v16_14_10, vl16m4));

// b1:     [cCCC00bb|bBBB00aa|aAAA0000|00000000]
const vuint32m4_t b1 = __riscv_vsll_vx_u32m4(in32, 12, vl);
// b:      [00000000|bBBB0000|00000000|00000000]
const vuint32m4_t b = __riscv_vand_vx_u32m4(b1, 0x00f00000, vl);

// dcba:   [CCdddDDD|bBBBcccC|aaaAAAbb|00000000]
const vuint32m4_t dcba = __riscv_vor_vv_u32m4(__riscv_vor_vv_u32m4(d, c, vl), __riscv_vor_vv_u32m4(ca, b, vl), vl);

// pack the 3 payload bytes of each 32-bit word into a contiguous array
return __riscv_vcompress_vm_u8m4(__riscv_vreinterpret_v_u32m4_u8m4(dcba), mcompress, vl8m4);
