Clang and GCC add Smis showdown
Tags: low-level
Assumed: knowledge of x86 assembly.
In a previous post, I gave the gcc output for this code:
uint32_t add_smi_smi(uint32_t x, uint32_t y,
                     uint32_t (*bailout)()) {
  uint32_t s;
  if (__builtin_add_overflow(x, y, &s) || (s & 3)) {
    return bailout();
  }
  return s;
}
Here is gcc’s output again:
add_smi_smi(unsigned int, unsigned int, unsigned int (*)()):
        xor     ecx, ecx
        add     edi, esi
        mov     eax, edi
        setc    cl
        and     eax, 3
        or      eax, ecx
        jne     .L6
        mov     eax, edi
        ret
.L6:
        jmp     rdx
Here is clang’s output:
add_smi_smi(unsigned int, unsigned int, unsigned int (*)()):
        lea     eax, [rdi + rsi]
        add     edi, esi
        jc      .LBB0_3
        and     eax, 3
        jne     .LBB0_3
        mov     eax, edi
        ret
.LBB0_3:
        jmp     rdx
I was really surprised by clang's output here! Both lea eax, [rdi + rsi] and add edi, esi compute the sum! So clang decides to add the two numbers twice, rather than adding once and then copying the result. I found an answer on StackOverflow explaining why:
One significant difference between LEA and ADD on x86 CPUs is the execution unit which actually performs the instruction. Modern x86 CPUs are superscalar and have multiple execution units that operate in parallel, with the pipeline feeding them somewhat like round-robin (bar stalls). Thing is, LEA is processed by (one of) the unit(s) dealing with addressing (which happens at an early stage in the pipeline), while ADD goes to the ALU(s) (arithmetic / logical unit), and late in the pipeline. That means a superscalar x86 CPU can concurrently execute a LEA and an arithmetic/logical instruction.
Cool stuff. In my silly microbenchmark they were the same speed, and gcc's machine code was shorter, so that's the version I showed. But I'm not sure which would perform better in the context of surrounding code.
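If you want to reproduce the timing, a loop like the one below is roughly the shape of it (not my exact harness; assume add_smi_smi is compiled in a separate translation unit so the call isn't inlined and optimized away):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

uint32_t add_smi_smi(uint32_t x, uint32_t y, uint32_t (*bailout)());

static uint32_t never_bails(void) { return 0; }

int main(void) {
  enum { ITERS = 100000000 };
  struct timespec t0, t1;
  uint32_t acc = 0;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (uint32_t i = 0; i < ITERS; i++) {
    /* Keep operands Smi-tagged (low two bits clear) so we stay on the
       fast path; accumulate so the calls can't be optimized out. */
    acc += add_smi_smi(i << 2, 4u << 2, never_bails);
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf("acc=%u, %.2f ns per add\n", acc, ns / ITERS);
  return 0;
}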
Here’s a Godbolt link for this code.