Optimize ord implementation and signed zero canonicalization #144

Merged
mbrubeck merged 4 commits into reem:master on Oct 10, 2023

Contributor

orlp commented Oct 10, 2023

These micro-optimizations significantly reduce the number of instructions that comparisons take, and often make them branchless as well. Similarly, we use a trick to canonicalize signed zero to positive zero in a single branchless instruction, which makes hashing faster.
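To illustrate the signed-zero part: the description does not show the code, so the following is only a minimal sketch of one well-known way to get this effect, namely adding positive zero. The function names `canonicalize_signed_zero` and `hash_bits` are hypothetical and chosen for illustration, not taken from the diff.

```rust
/// Hypothetical helper: maps -0.0 to +0.0 and leaves every other finite
/// value and the infinities unchanged. NaN stays NaN.
pub fn canonicalize_signed_zero(x: f64) -> f64 {
    // Under the IEEE 754 default rounding mode, (-0.0) + (+0.0) is +0.0,
    // and x + 0.0 == x for every other non-NaN x, so a single add is
    // enough and no branch is needed.
    x + 0.0
}

/// Hypothetical helper: the bit pattern one might feed to a hasher so that
/// -0.0 and +0.0 (which compare equal) also hash equally.
/// Note: NaN payloads are NOT canonicalized by this sketch.
pub fn hash_bits(x: f64) -> u64 {
    canonicalize_signed_zero(x).to_bits()
}
```

Without the canonicalization step, `(-0.0f64).to_bits()` and `0.0f64.to_bits()` differ in the sign bit, so two values that compare equal would hash differently.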

Contributor Author

orlp commented Oct 10, 2023

For example, a <= b went from this:

example::old_leq:
        vucomiss        xmm1, xmm0
        jae     .LBB0_1
        mov     al, 1
        vucomiss        xmm0, xmm1
        jae     .LBB0_5
        mov     al, -1
        vucomiss        xmm0, xmm0
        jp      .LBB0_4
.LBB0_5:
        inc     al
        cmp     al, 2
        setb    al
        ret
.LBB0_1:
        xor     eax, eax
        vucomiss        xmm0, xmm1
        sbb     eax, eax
        inc     al
        cmp     al, 2
        setb    al
        ret
.LBB0_4:
        vucomiss        xmm1, xmm1
        setnp   al
        inc     al
        cmp     al, 2
        setb    al
        ret

to this:

example::new_leq:
        vcmpleps        xmm0, xmm0, xmm1
        vxorps  xmm2, xmm2, xmm2
        vcmpunordps     xmm1, xmm1, xmm2
        vorps   xmm0, xmm1, xmm0
        vmovd   eax, xmm0
        and     al, 1
        ret
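For readers who prefer source to assembly, here is a rough Rust sketch of the two shapes of `a <= b`. It is a reconstruction from the assembly shown here, not the code from the diff, and it assumes the ordering where NaN is the largest value and all NaNs are equal. The names `reference_cmp`, `branchy_leq` and `branchless_leq` are hypothetical.

```rust
use std::cmp::Ordering;

/// Reference total order written with explicit branches:
/// NaN is treated as the largest value and all NaNs compare equal.
pub fn reference_cmp(a: f32, b: f32) -> Ordering {
    match a.partial_cmp(&b) {
        Some(ordering) => ordering,
        // partial_cmp returns None only when at least one side is NaN.
        None => match (a.is_nan(), b.is_nan()) {
            (true, true) => Ordering::Equal,
            (true, false) => Ordering::Greater,
            _ => Ordering::Less,
        },
    }
}

/// `a <= b` derived from the full three-way comparison.
pub fn branchy_leq(a: f32, b: f32) -> bool {
    reference_cmp(a, b) != Ordering::Greater
}

/// `a <= b` as two comparisons joined with a non-short-circuiting `|`:
/// either the ordinary float comparison holds, or `b` is NaN (and nothing
/// is greater than NaN). This mirrors the compare / unordered-compare / or
/// sequence in the shorter listing.
pub fn branchless_leq(a: f32, b: f32) -> bool {
    (a <= b) | b.is_nan()
}
```

Using `|` rather than `||` is deliberate in this sketch: both operands are cheap and side-effect free, so evaluating both lets the compiler avoid a jump.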

Collaborator

mbrubeck left a comment

Thank you!

mbrubeck merged commit 4e29b08 into reem:master on Oct 10, 2023
2 checks passed
Contributor Author

orlp commented Oct 10, 2023

@mbrubeck To also give some concrete numbers: on my Apple M1 machine, sorting a shuffled Vec of 1 million OrderedFloat<f64>s went from 110ms to 84ms, a 1.3x speedup. I'd expect the difference on x86-64 to be even greater.
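A measurement along these lines could be set up roughly as follows. This is not the benchmark that produced the numbers above, only a dependency-free sketch: `Total` is a hypothetical stand-in for `OrderedFloat<f64>`, the xorshift shuffle and its seed are arbitrary, and the use of `sort()` rather than another sort is an assumption.

```rust
use std::cmp::Ordering;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `OrderedFloat<f64>`:
/// NaN sorts last and all NaNs compare equal.
#[derive(Clone, Copy, Debug)]
pub struct Total(pub f64);

/// Branchless `a >= b` under that ordering.
fn total_ge(a: f64, b: f64) -> bool {
    a.is_nan() | (a >= b)
}

impl Ord for Total {
    fn cmp(&self, other: &Self) -> Ordering {
        match (total_ge(self.0, other.0), total_ge(other.0, self.0)) {
            (true, true) => Ordering::Equal,
            (true, false) => Ordering::Greater,
            _ => Ordering::Less,
        }
    }
}
impl PartialOrd for Total {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Total {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Total {}

/// Builds 0.0, 1.0, ..., (n-1) and shuffles it with Fisher-Yates driven by
/// a xorshift64 generator. `seed` must be nonzero.
pub fn shuffled(n: usize, mut seed: u64) -> Vec<Total> {
    let mut v: Vec<Total> = (0..n).map(|i| Total(i as f64)).collect();
    for i in (1..n).rev() {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        let j = (seed % (i as u64 + 1)) as usize;
        v.swap(i, j);
    }
    v
}

/// Sorts a shuffled vector of `n` elements and returns it together with
/// the wall-clock time spent in the sort alone.
pub fn time_sort(n: usize) -> (Vec<Total>, Duration) {
    let mut v = shuffled(n, 0x9E37_79B9_7F4A_7C15);
    let start = Instant::now();
    v.sort();
    (v, start.elapsed())
}
```

Calling `time_sort(1_000_000)` in a release build and printing the returned `Duration` gives a single sample; comparing two comparison implementations fairly would need repeated runs on the same input.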
