utf8: AVX2 implementation of Valid #58

Merged
merged 67 commits into from Jan 11, 2022


@pelletier pelletier commented Oct 14, 2021

This branch is a Go implementation of the Keiser-Lemire "Validating UTF-8 In Less Than One Instruction Per Byte" paper. For inputs under 32 bytes or on machines without AVX2 support, a re-implementation of the stdlib algorithm is used.

For incomplete blocks of 32 bytes, this version still uses the vector registers.

This code exposes two functions: Valid([]byte) bool and Validate([]byte) (bool, bool). Valid is a drop-in replacement for the standard library's utf8.Valid. Validate is a more precise function that also returns whether the input was valid ASCII. For small inputs, ascii.Valid is used as a first pass, then stdlib's utf8.Valid is used. This is possibly responsible for the overhead we are seeing for inputs < 32 bytes.
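A minimal sketch of the small-input path described above, using only the standard library. The names validate and isASCII are hypothetical stand-ins (the PR uses ascii.Valid from this repository for the first pass), and the vectorized path for inputs of 32 bytes or more is not modeled here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII reports whether every byte of p is below 0x80.
// Hypothetical stand-in for the ascii.Valid first pass mentioned in the PR.
func isASCII(p []byte) bool {
	for _, b := range p {
		if b >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// validate returns whether p is valid UTF-8 and whether it is ASCII only.
// It models only the fallback path: ASCII check first, then the stdlib validator.
func validate(p []byte) (valid, ascii bool) {
	if isASCII(p) {
		// Every ASCII string is also valid UTF-8.
		return true, true
	}
	return utf8.Valid(p), false
}

func main() {
	for _, in := range [][]byte{
		[]byte("hello"),
		[]byte("日本語"),
		{0xc6, '0'}, // 2-byte lead followed by a non-continuation byte
	} {
		valid, ascii := validate(in)
		fmt.Printf("%q valid=%v ascii=%v\n", in, valid, ascii)
	}
}
```

The two-pass shape makes the cost visible: a non-ASCII input is scanned twice, once by the ASCII check and once by the UTF-8 validator, which would be consistent with the overhead seen on the 10-byte benchmarks.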

Current results:

goos: darwin
goarch: amd64
pkg: github.com/segmentio/asm/utf8
cpu: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz

name                    time/op
Valid/1kValid/AVX-8       80.0ns ± 2%
Valid/1kValid/Stdlib-8     733ns ± 2%
Valid/1MValid/AVX-8       76.8µs ± 2%
Valid/1MValid/Stdlib-8     751µs ± 1%
Valid/10ASCII/Stdlib-8    4.07ns ± 0%
Valid/10ASCII/AVX-8       7.70ns ± 2%
Valid/10Japan/AVX-8       28.6ns ± 1%
Valid/10Japan/Stdlib-8    27.0ns ± 1%

name                    speed
Valid/1kValid/AVX-8     12.8GB/s ± 2%
Valid/1kValid/Stdlib-8  1.40GB/s ± 2%
Valid/1MValid/AVX-8     13.7GB/s ± 2%
Valid/1MValid/Stdlib-8  1.40GB/s ± 1%
Valid/10ASCII/Stdlib-8  2.46GB/s ± 0%
Valid/10ASCII/AVX-8     1.30GB/s ± 2%
Valid/10Japan/AVX-8     1.05GB/s ± 1%
Valid/10Japan/Stdlib-8  1.11GB/s ± 1%

This is my first time writing Go assembly, so I'd appreciate any kind of feedback!


ns/op, for arrays up to 400 bytes (lower is better):

image

ns/op, for arrays up to 64MiB (lower is better):

data-large

Machine: specs
Code used to generate graphs: plot.py


Todo

  • Generate code with AVO.
  • Check AVX2 support.
  • Use lower overhead algorithm for < 32B.
  • Understand why the low overhead algorithm is slower than stdlib. Not understood, but after iterating on the code, the low overhead algorithm is as fast as the standard library one on an Intel CPU (not AMD, somehow).
  • Make the test suite faster.
  • Also return whether the input was ASCII only.
  • Fix table generation (see 3716cfd)
  • Reuse stdlib's utf8.first and acceptRanges tables.
  • Cover profile for generated asm code. I don't think that's possible.

Further work


@achille-roussel achille-roussel left a comment

The code looks really clean, nice work so far 🙌

@pelletier
Contributor Author

First bug found by the Go1.18 fuzzing system:

[tpelletier@thinkpad utf8]$ gotip test -run _ -fuzz ./
warning: starting with empty corpus
fuzz: elapsed: 0s, execs: 0 (0/sec), new interesting: 0 (total: 0)
fuzz: minimizing 1863-byte failing input file
fuzz: elapsed: 0s, minimizing
--- FAIL: FuzzValid (0.41s)
    --- FAIL: FuzzValid (0.00s)
        valid_fuzz_test.go:16: Valid("0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\xc60") = true; want false
    
    Failing input written to testdata/fuzz/FuzzValid/10d8eaee7858193ed8118cacee74232872e061aa7a7a768ba0792bf7bbb22b72
    To re-run:
    go test -run=FuzzValid/10d8eaee7858193ed8118cacee74232872e061aa7a7a768ba0792bf7bbb22b72
FAIL
exit status 1
FAIL	github.com/segmentio/asm/utf8	0.416s

@achille-roussel
Contributor

I don't feel bad if we don't reuse the stdlib symbols; taking dependencies on unexported APIs always has a higher maintenance cost.

@pelletier
Contributor Author

pelletier commented Jan 4, 2022

As an experiment, commit 4a7bb03 shows what it would look like to call the stdlib directly as opposed to re-implementing it. It's slightly slower on the current benchmarks, but the easier maintenance is probably worth it.

image

"github.com/segmentio/asm/ascii"
)

func FuzzValid(f *testing.F) {
🔥

Comment on lines +148 to +194
// Prepare intermediate vector for push operations
VPERM2I128 $0x03, Y8, Y11, Y8

// Check errors on the high nibble of the previous byte
VPALIGNR $0x0f, Y8, Y11, Y10
VPSRLW $0x04, Y10, Y12
VPAND Y12, Y6, Y12
VPSHUFB Y12, Y3, Y12

// Check errors on the low nibble of the previous byte
VPAND Y10, Y6, Y10
VPSHUFB Y10, Y4, Y10
VPAND Y10, Y12, Y12

// Check errors on the high nibble on the current byte
VPSRLW $0x04, Y11, Y10
VPAND Y10, Y6, Y10
VPSHUFB Y10, Y5, Y10
VPAND Y10, Y12, Y12

// Find 3 bytes continuations
VPALIGNR $0x0e, Y8, Y11, Y10
VPSUBUSB Y2, Y10, Y10

// Find 4 bytes continuations
VPALIGNR $0x0d, Y8, Y11, Y8
VPSUBUSB Y1, Y8, Y8

// Combine them to have all continuations
VPOR Y10, Y8, Y8

// Perform a byte-sized signed comparison with zero to turn any non-zero bytes into 0xFF.
VXORPS Y10, Y10, Y10
VPCMPGTB Y10, Y8, Y8

// Find bytes that are continuations by looking at their most significant bit.
VPAND Y7, Y8, Y8

// Find mismatches between expected and actual continuation bytes
VPXOR Y8, Y12, Y8

// Store result in sticky error
VPOR Y9, Y8, Y9

// Prepare for next iteration
VPSUBUSB Y0, Y11, Y10
VMOVDQU Y11, Y8
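The "Find 3 bytes continuations" and "Find 4 bytes continuations" steps above can be modeled one byte at a time. The sketch below is a scalar approximation under stated assumptions: the constants 0xDF and 0xEF are guesses at what Y2 and Y1 hold (they are loaded outside the quoted lines), chosen to match the saturating-subtraction trick from the Keiser-Lemire approach, and the names satSub and expectedContinuations are hypothetical.

```go
package main

import "fmt"

// satSub mimics one lane of VPSUBUSB: unsigned subtraction clamped at zero.
func satSub(a, b byte) byte {
	if a < b {
		return 0
	}
	return a - b
}

// expectedContinuations returns 0x80 at every position that must hold the
// third or fourth byte of a multi-byte sequence, and 0 elsewhere.
// Bytes before the start of the input are treated as zero.
func expectedContinuations(in []byte) []byte {
	out := make([]byte, len(in))
	for i := range in {
		var prev2, prev3 byte
		if i >= 2 {
			prev2 = in[i-2] // models VPALIGNR $0x0e (input shifted by 2 bytes)
		}
		if i >= 3 {
			prev3 = in[i-3] // models VPALIGNR $0x0d (input shifted by 3 bytes)
		}
		third := satSub(prev2, 0xE0-1)  // non-zero if prev2 is a 3- or 4-byte lead (assumed constant)
		fourth := satSub(prev3, 0xF0-1) // non-zero if prev3 is a 4-byte lead (assumed constant)
		// Signed compare against zero, then keep only the top bit,
		// as VPCMPGTB followed by VPAND with 0x80 would.
		if int8(third|fourth) > 0 {
			out[i] = 0x80
		}
	}
	return out
}

func main() {
	fmt.Printf("% x\n", expectedContinuations([]byte("€a")))
	fmt.Printf("% x\n", expectedContinuations([]byte("😀")))
}
```

In the vector code this mask is XORed with the result of the nibble lookups, so any position where the expected and actual continuation status differ leaves a non-zero byte in the sticky error register.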
@chriso chriso Jan 5, 2022

You may be able to improve performance here by allocating registers yourself. It looks like avo's register allocator has introduced false data dependencies, and allocating registers yourself (and using more of them) might let you eliminate the dependencies.

@pelletier
Contributor Author

Difference when validating inputs with leftover bytes using AVX, comparing the memory scratch approach (old) with handling the tail fully in vector registers (new):

benchstat out-old.txt out-new.txt
name                 old time/op    new time/op     delta
Valid/tail300/AVX-8    32.4ns ± 2%     28.0ns ± 2%  -13.74%  (p=0.008 n=5+5)
Valid/tail316/AVX-8    32.6ns ± 0%     28.1ns ± 0%  -13.74%  (p=0.008 n=5+5)

name                 old speed      new speed       delta
Valid/tail300/AVX-8  9.26GB/s ± 2%  10.73GB/s ± 2%  +15.93%  (p=0.008 n=5+5)
Valid/tail316/AVX-8  9.69GB/s ± 0%  11.24GB/s ± 0%  +15.93%  (p=0.008 n=5+5)

@pelletier
Contributor Author

pelletier commented Jan 6, 2022

Nice to see the duration variations between multiples of 32 being dampened:

data


@achille-roussel achille-roussel left a comment

🚢

@pelletier pelletier changed the title utf8: AVX2 implementation of valid utf8: AVX2 implementation of Valid Jan 9, 2022
@pelletier pelletier merged commit 0ec6ead into main Jan 11, 2022
@pelletier pelletier deleted the pelletier/utf8-valid branch January 11, 2022 02:00
3 participants