Clang and GCC add Smis showdown
Tags: low-level
Assumed: knowledge of x86 assembly.
In a previous post, I gave the gcc output for this code:
uint32_t add_smi_smi(uint32_t x, uint32_t y,
                     uint32_t (*bailout)()) {
  uint32_t s;
  if (__builtin_add_overflow(x, y, &s) || (s & 3)) {
    return bailout();
  }
  return s;
}
Here is gcc’s output again:
add_smi_smi(unsigned int, unsigned int, unsigned int (*)()):
        xor     ecx, ecx
        add     edi, esi
        mov     eax, edi
        setc    cl
        and     eax, 3
        or      eax, ecx
        jne     .L6
        mov     eax, edi
        ret
.L6:
        jmp     rdx
Here is clang’s output:
add_smi_smi(unsigned int, unsigned int, unsigned int (*)()):
        lea     eax, [rdi + rsi]
        add     edi, esi
        jc      .LBB0_3
        and     eax, 3
        jne     .LBB0_3
        mov     eax, edi
        ret
.LBB0_3:
        jmp     rdx
I was really surprised by clang's output here! Both lea eax, [rdi + rsi] and add edi, esi compute the sum! So clang decides to add the two numbers twice, rather than adding once and then copying the result. I found an answer on StackOverflow explaining why:
One significant difference between LEA and ADD on x86 CPUs is the execution unit which actually performs the instruction. Modern x86 CPUs are superscalar and have multiple execution units that operate in parallel, with the pipeline feeding them somewhat like round-robin (bar stalls). Thing is, LEA is processed by (one of) the unit(s) dealing with addressing (which happens at an early stage in the pipeline), while ADD goes to the ALU(s) (arithmetic / logical unit), and late in the pipeline. That means a superscalar x86 CPU can concurrently execute a LEA and an arithmetic/logical instruction.
Cool stuff. In my silly microbenchmark they were the same speed, and gcc's machine code was shorter, so that's the version I showed. But I'm not sure which would perform better in the context of surrounding code.
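If you want to reproduce the timing, a loop like the one below is roughly the shape of it (not my exact harness; assume add_smi_smi is compiled in a separate translation unit so the call isn't inlined and optimized away):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

uint32_t add_smi_smi(uint32_t x, uint32_t y, uint32_t (*bailout)());

static uint32_t never_bails(void) { return 0; }

int main(void) {
  enum { ITERS = 100000000 };
  struct timespec t0, t1;
  uint32_t acc = 0;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (uint32_t i = 0; i < ITERS; i++) {
    /* Keep operands Smi-tagged (low two bits clear) so we stay on the
       fast path; accumulate so the calls can't be optimized out. */
    acc += add_smi_smi(i << 2, 4u << 2, never_bails);
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf("acc=%u, %.2f ns per add\n", acc, ns / ITERS);
  return 0;
}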
Here’s a Godbolt link for this code.