Speeding up JavaScript addition

Tags: low-level

Assumed: knowledge of JIT compilers, knowledge of x86 assembly.

All numbers in JavaScript are doubles. But NodeJS’s JavaScript engine, V8, doesn’t represent them that way.

V8 has an optimization for SMall Integers (aka Smis). An Smi is an integer between −2³¹ and 2³¹ − 1. Smis are stored in the high 32 bits of 64-bit words. V8 uses the last bit for “tagging” whether the object is a pointer or an Smi.

            |---high 32 bits ---|--- low 32 bits ---|
Pointer:    |________________address_____________001|
Smi:        |____int32_value____|0000000000000000000|
                                                   ^
                                                   |
                                   this is a tag bit

We will call this representation the tagged representation. Note that in order to add two Smis, we need to untag them first. We can do this by right-shifting them both by 32 bits, which recovers the actual 32-bit value stored in the upper half.

Note that the result of adding two Smis may not be representable as an Smi. For example, if x = 2³¹ − 1 and y = 1, then x and y are both representable as Smis. However, x + y = 2³¹, which is outside the range of an Smi.
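
To make the representation concrete, here is a small C sketch of the tagging scheme (the helper names are mine, not V8’s):

#include <assert.h>
#include <stdint.h>

// Tag: put the 32-bit value in the high half (an Smi's tag bit stays 0).
static inline int64_t smi_tag(int32_t value) {
    return (int64_t)((uint64_t)(uint32_t)value << 32);
}

// Untag: arithmetic shift brings the value back down to the low half.
static inline int32_t smi_untag(int64_t tagged) {
    return (int32_t)(tagged >> 32);
}

// The last bit distinguishes Smis (0) from pointers (1).
static inline int is_smi(int64_t tagged) {
    return (tagged & 1) == 0;
}

int main(void) {
    assert(is_smi(smi_tag(42)) && smi_untag(smi_tag(42)) == 42);
    assert(smi_untag(smi_tag(-7)) == -7);
    // 2^31 - 1 is the largest value an Smi can hold.
    assert(smi_untag(smi_tag(INT32_MAX)) == INT32_MAX);
    return 0;
}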

Let’s consider the code NodeJS generates for adding two Smis. We’ll call the add function many times with Smi arguments, to coax V8’s optimizer into specializing it for the case of adding two Smis. (I’ve done some cleaning to make the assembly look nicer.)

function add(x, y) { return x + y; }
for (let i = 0; i < 1e7; i++) add(i, 1);

which outputs:

$ node --print-opt-code t.js
...
movq rdx,[rbp+0x18]   ; load argument x
testb rdx,0x1         ; look at last bit of x
jnz <+0x97>           ; bailout if x is not an Smi
movq rcx,[rbp+0x10]   ; load argument y
testb rcx,0x1         ; look at last bit of y
jnz <+0xa3>           ; bailout if y is not an Smi
movq rdi,rcx          ; untag Smi y
shrq rdi, 32          ;   by shifting away the 0s
movq r8,rdx           ; untag Smi x
shrq r8, 32           ;   by shifting away the 0s
addl rdi,r8           ; add x and y
jo <+0xaf>            ; bailout if overflow
shlq rdi, 32          ; retag `rdi`
...

V8 needs to insert bailouts in case we later call add with something that is not an Smi, like add('1','2'). The checks it generates are straightforward: check whether x is an Smi, otherwise bail out; then check whether y is an Smi, otherwise bail out.
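
Written out in C, the fast path V8 generates corresponds to something roughly like this sketch (the bailout functions are placeholders for the deoptimization paths):

#include <stdint.h>

// Placeholders for V8's deoptimization paths.
extern int64_t bailout_x_not_smi(void);
extern int64_t bailout_y_not_smi(void);
extern int64_t bailout_overflow(void);

// Roughly the fast path above: check both tags, untag, do a 32-bit add
// with an overflow check (addl + jo), and retag the result.
int64_t add_checked(int64_t x, int64_t y) {
    if (x & 1) return bailout_x_not_smi();    // x is a pointer, not an Smi
    if (y & 1) return bailout_y_not_smi();    // y is a pointer, not an Smi
    int32_t a = (int32_t)(x >> 32);           // untag x
    int32_t b = (int32_t)(y >> 32);           // untag y
    int32_t sum;
    if (__builtin_add_overflow(a, b, &sum))   // signed 32-bit add
        return bailout_overflow();
    return (int64_t)((uint64_t)(uint32_t)sum << 32);  // retag the sum
}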

But can we do better? One idea is to optimistically add the two Smis! There is some magic here: if we just add the two 64-bit tagged values, that’s the same thing as untagging both, adding, and retagging the result. That is:

x + y == ((x >> 32) + (y >> 32)) << 32
(when the low 32 bits of x and y are all 0s)

Note that this only works when the low 32 bits of x and y are all 0s – i.e., we need to be sure that x and y are Smis when we add them.
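
As a quick sanity check, the identity can be verified directly on a pair of tagged Smi values (a throwaway C snippet, nothing V8-specific):

#include <assert.h>
#include <stdint.h>

int main(void) {
    // Tagged Smis for 3 and 4: the values sit in the high 32 bits,
    // the low 32 bits are all zero.
    uint64_t x = (uint64_t)3 << 32;
    uint64_t y = (uint64_t)4 << 32;
    // Adding the tagged words directly equals untag, add, retag.
    assert(x + y == ((x >> 32) + (y >> 32)) << 32);
    assert(((x + y) >> 32) == 7);
    return 0;
}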

Consider the last three bits in the Pointer & Smi diagram above. If we add two pointers, the last three bits will always be 010. If we add a pointer and an Smi, the last three bits will always be 001. If we add two Smis, the last three bits will be 000. In other words, we actually added two Smis if and only if the last two bits of the result are both 0.
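
To make that case analysis concrete, here is the same arithmetic on a couple of made-up tagged values (again just a throwaway check):

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t ptr = 0x55aa001;          // a made-up tagged pointer: low bits ...001
    uint64_t smi = (uint64_t)5 << 32;  // tagged Smi for 5: low 32 bits are all 0
    assert(((ptr + ptr) & 3) == 2);    // pointer + pointer: low bits 10
    assert(((ptr + smi) & 3) == 1);    // pointer + Smi:     low bits 01
    assert(((smi + smi) & 3) == 0);    // Smi + Smi is the only case ending in 00
    return 0;
}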

One lazy way to write this check is to ask a C compiler what it would generate. Here’s the C:

#include <stdint.h>

int64_t add_smi_smi(int64_t x, int64_t y,
                    int64_t (*bailout)(void)) {
    int64_t s;
    // Add the tagged 64-bit values directly; bail out on signed overflow
    // or if the low two bits show that the operands were not both Smis.
    if (__builtin_add_overflow(x, y, &s) || (s & 3)) {
        return bailout();
    }
    return s;
}

If we compile this with gcc -O3, and then translate it to use the same stack loads that the V8 version does, we get something like this. (I changed the assembly syntax to match V8’s.)

xorl rbx,rbx        ; zero rbx (it will hold the overflow bit)
movq rdx,[rbp+0x18] ; load x
movq rcx,[rbp+0x10] ; load y
addq rdx,rcx        ; add x + y
seto bl             ; set bl if addition overflowed
movq rax,rdx        ; copy rdx to rax
andl eax,0x3        ; compute eax = eax & 3
orq  rax,rbx        ; compute rax = rax | rbx
jne  <bailout>      ; if rax != 0, bailout

I benchmarked the “fast path” (i.e., no bailouts) using nanoBench. This ends up being around 25% faster on my server, and produces machine code which is about half the size! But it also has a disadvantage: we need to do some extra work in the bailout to check which condition failed (in particular, we don’t know whether x was not an Smi, y was not an Smi, or whether they were both Smis but the addition overflowed).
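
To give a sense of that extra work, here is a rough sketch of the disambiguation the shared bailout would need (the helper names are hypothetical placeholders; real V8 would deoptimize or call into the runtime rather than plain C helpers):

#include <stdint.h>

// Hypothetical placeholders: the generic add (strings, doubles, objects, ...)
// and allocation of a heap number for a result that no longer fits in an Smi.
extern int64_t generic_add(int64_t x, int64_t y);
extern int64_t allocate_heap_number(double value);

// The shared bailout has to re-discover which condition actually failed.
int64_t add_bailout(int64_t x, int64_t y) {
    if ((x & 1) || (y & 1)) {
        // At least one operand was not an Smi.
        return generic_add(x, y);
    }
    // Both operands were Smis, so the fast path must have hit int32 overflow;
    // the exact sum still fits in a double.
    return allocate_heap_number((double)(x >> 32) + (double)(y >> 32));
}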

Overall, I’m not sure if implementing this would be a net win, but I suspect it would be. If I have time, I’ll implement this (potentially a hacky incorrect version) in V8 and see how much better the computation-heavy benchmarks get.

Posted on 2022-02-01