Speeding up Javascript addition
Tags: low-level
Assumed: knowledge of JIT compilers, knowledge of x86 assembly.
All numbers in Javascript are doubles. But NodeJS’s Javascript engine, V8, doesn’t represent them that way.
V8 has an optimization for SMall Integers (aka Smis). An Smi is an integer between − 231 and 231 − 1. Smis are stored in the high 32 bits of 64 bit words. V8 uses the last bit for “tagging” whether the object is a pointer or an Smi.
|---high 32 bits ---|--- low 32 bits ---|
Pointer: |________________address_____________001|
Smi: |____int32_value____|0000000000000000000|
^
|
this is a tag bit
We will call this representation the tagged representation. Note that in order to add two Smis, we need to untag them first. We can do this by right shifting them both by 32 bits, to get the actual 32 bit value stored in the upper half.
Note that the result of adding two Smis may not be representable as an Smi. For example, if x = 231 − 1 and y = 1, then x and y are both representable as Smis. However x + y = 231, which is outside the range of an Smi.
Let’s consider the code NodeJS generates for adding two Smis. We’ll call the add
function many times with Smi arguments, to induce V8’s optimizer into specializing it for the case of adding two Smis. (I’ve done some cleaning to make the assembly look nicer.)
function add(x, y) { return x + y; }
for (let i = 0; i < 1e7; i++) add(i, 1);
which outputs:
$ node --print-opt-code t.js
...movq rdx,[rbp+0x18] ; load argument x
rdx,0x1 ; look at last bit of x
testb jnz <+0x97> ; bailout if x is not an Smi
movq rcx,[rbp+0x10] ; load argument y
rcx,0x1 ; look at last bit of y
testb jnz <+0xa3> ; bailout if y is not an Smi
movq rdi,rcx ; untag Smi x
rdi, 32 ; by shifting away the 0s
shrq movq r8,rdx ; untag Smi y
r8, 32 ; by shifting away the 0s
shrq rdi,r8 ; add x and y
addl jo <+0xaf> ; bailout if overflow
rdi, 32 ; retag `rdi`
shlq ...
V8 needs to insert bailouts, in case we later call add
with something which is not an Smi, like add('1','2')
. The bailouts generated are straightforward: check if x
is an Smi otherwise bailout, check if y
is an Smi otherwise bailout.
But can we do better? One idea is to optimistically add the two Smis! There is some magic here: if we just add the two 64-bit tagged values, that’s the same thing as shifting down and then adding. That is:
x + y == ((x >> 32) + (y >> 32)) << 32
(when the low 32 bits of x and y are all 0s)
Note that this only works when the low 32 bits of x
and y
are all 0s – i.e., we need to be sure that x
and y
are Smis when we add them.
Consider the last three bits of the Pointer & Smi diagram above. If we add two pointers, the last three bits will always be 010
. If we add a pointer and an Smi, the last three bits will always be 001
. If we add two Smis, the last three bits will be 000
. In other words, we actually added two Smis if the last two bits are both 0.
One lazy way to do this is to ask a C compiler what it would write. Here’s gcc
’s output:
uint32_t add_smi_smi(uint32_t x, uint32_t y,
uint32_t (*bailout)()) {
uint32_t s;
if (__builtin_add_overflow(x, y, &s) || (s & 3)) {
return bailout();
}return s;
}
If we compile this with gcc -O3
, and then translate it to use the same stack loads that the V8 version does, we get something like this. (I changed the assembly syntax to match V8’s.)
rbx,rbx ; set rbx to zero (for later)
xorl movq rdx,[rbp+0x18] ; load x
movq rcx,[rbp+0x10] ; load y
rdx,rcx ; add x + y
addq seto bl ; set bl if addition overflowed
movq rax,rdx ; copy rdx to rax
eax,0x3 ; compute eax = eax & 3
andl rax,rbx ; compute rax = rax | rbx
orq jne <bailout> ; if rax != 0, bailout
I benchmarked the “fast path” (i.e., no bailouts) using nanoBench. This ends up being around 25% faster on my server, and produces machine code which is about half the size! But it also has a disadvantage: we need to do some extra work in the bailout to check which condition failed (in particular, we don’t know if x
was not an Smi, y
was not an Smi, or if they were both Smis but an overflowed occurred).
Overall, I’m not sure if implementing this would be a net win, but I suspect it would be. If I have time, I’ll implement this (potentially a hacky incorrect version) in V8 and see how much better the computation heavy benchmarks get.