Remote builds with AWS Spot Instances
Tags: selfhosted
Occasionally I contribute to open-source projects. Some of these projects are really big and compiling them takes a long time. This is usually not a problem in industry: either you’ll get an expensive desktop machine or remote build access to a fleet of expensive “build bot” machines.
Unfortunately I don’t have either of those – my laptop is nine years old and was nowhere near state-of-the-art when I bought it. Compiling a large open-source project like Node.js from scratch takes hours.
Anyway, I used to get around this by having a beefy 32-core Amazon EC2 instance in the cloud. I would turn it on only when I was using it, and a crontab automatically shut the instance off every night in case I forgot. There's a monthly charge for the disk storage, but it's comparatively cheap. Even with careful stopping and starting of the instance, though, it ended up being more expensive than I would like. I spend a lot of my time reading and writing code, not compiling it, but regardless of whether the CPUs are completely idle or spinning at 100%, I'm paying Amazon multiple dollars per hour.
In addition to their on-demand instances, AWS provides “spot instances”, which can be up to 90% cheaper. In exchange, you have to deal with “interruptions”, where AWS can decide to terminate your instance in order to get extra capacity.
I decided to create a “remote runner” tool. The idea is that I would run remote_run make test, and it would:
- Start up a spot instance.
- Send it the make test command and show me the results locally.
- Have the spot machine linger around for a while (in case I have a quick follow-up command) before shutting down.
Boring nitty-gritty details
This ended up being much trickier than I imagined, mostly because I had a simplistic model of how AWS worked. Here’s what I ended up building in practice:

First, there are various build dependencies which need to be on the root disk in order to actually do the build. We could install them whenever we run a command but that would add significantly to the start time. Instead we build an “AMI”, which is Amazon speak for a virtual machine image. I used Hashicorp’s Packer to do this. This is a “one-time” setup, although you’ll have to do it “one-more-time” whenever you add new build dependencies.
Next, we need to store the build artifacts and repositories somewhere. This was easy for on-demand instances, because there was a single instance and it could just have an EBS volume attached. For spot instances this ended up being tricky: you need to manually attach the EBS volume and mount it on the machine when it starts.
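Concretely, the attach-and-mount step boils down to something like this (a simplified sketch with placeholder IDs; the full script below adds retries, since the volume may still be attached to a dying instance and the new instance may not be running yet):

import subprocess
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Placeholders for illustration; the real script discovers these at runtime.
volume_id = "vol-xxxxxxxxxxxxxxxxx"
instance_id = "i-xxxxxxxxxxxxxxxxx"
public_ip = "203.0.113.10"

# Ask AWS to attach the persistent EBS volume to the new spot instance...
ec2.attach_volume(Device="/dev/sdf", InstanceId=instance_id, VolumeId=volume_id)

# ...then mount it over SSH. On newer (Nitro) instance types the volume shows
# up as an NVMe device rather than /dev/sdf, so try both.
subprocess.check_call(
    f"ssh -i ~/.ssh/build-bot.pem -o StrictHostKeyChecking=no ubuntu@{public_ip} "
    "'sudo mount /dev/nvme1n1 /b || sudo mount /dev/sdf /b'",
    shell=True,
)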
At this point I was able to start actually running build commands. I added a git push over SSH to sync up my local code to the remote, and got my first build passing :)
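The sync itself is nothing fancy: commit whatever has changed locally, force-push that commit to a scratch branch in the repository that lives on the EBS volume, and check the same SHA out over SSH. A rough sketch (placeholder IP and key path; the full script reuses its multiplexed SSH command for this):

import os
import subprocess

public_ip = "203.0.113.10"  # placeholder; the real script looks this up
remote_repo = "/b/v8/v8"    # the checkout living on the persistent EBS volume
ssh = "ssh -i ~/.ssh/build-bot.pem -o StrictHostKeyChecking=no"

# Make git use the same key and options as everything else.
os.environ["GIT_SSH_COMMAND"] = ssh

# Commit local modifications, force-push that commit to a scratch branch on
# the remote checkout, and check the same SHA out there.
subprocess.check_call('git add -u && git commit --allow-empty -m "for build"', shell=True)
sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
subprocess.check_call(
    f"git push -f ssh://ubuntu@{public_ip}{remote_repo} HEAD:refs/heads/for-spot-build",
    shell=True,
)
subprocess.check_call(
    f"{ssh} ubuntu@{public_ip} 'cd {remote_repo} && git checkout -q {sha}'",
    shell=True,
)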
The runtime for a small command like remote_run echo 1 was unacceptably slow even when the spot instance was already warm: 2-3 seconds. One of those seconds came from looking up the spot instance's IP, which required slow AWS API calls, so I built a “local cache” that stores the last known IP. Establishing a new SSH connection for each command took an additional second, so I used OpenSSH's connection multiplexing feature, which brought it down to ~60ms.
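Both fixes are small: the bot's last known details go in a JSON file under ~/.cache, and OpenSSH's ControlMaster options keep one connection open and reuse it for later commands. Roughly (a sketch mirroring what the full script below does):

import json
import os

CACHE_FILE = os.path.expanduser("~/.cache/build-bot/build_bot.json")

def load_cached_bot():
    # Last known instance details (including the public IP), so a warm
    # command doesn't have to call the EC2 API at all.
    try:
        with open(CACHE_FILE) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return None

def store_cached_bot(bot):
    os.makedirs(os.path.dirname(CACHE_FILE), exist_ok=True)
    with open(CACHE_FILE, "w") as f:
        json.dump(bot, f)

# OpenSSH connection multiplexing: the first connection becomes a "master"
# that persists for 10 minutes; subsequent commands ride on it, cutting the
# per-command overhead from roughly a second to tens of milliseconds.
base_ssh_command = (
    "ssh -i ~/.ssh/build-bot.pem "
    "-o ControlMaster=auto -o ControlPath=/tmp/%C -o ControlPersist=10m"
)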
Then there were a bunch of annoying edge cases, like handling running a command when a spot instance was shutting down or starting up, or trying to make the remote runner robust even when I pressed Ctrl-C halfway through the pipeline.
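The most useful trick for the shutdown-related cases is that the instance is always scheduled to shut itself down: the launch UserData schedules a shutdown five minutes out, and every command pushes that deadline back, always scheduling the replacement before cancelling the old job. Both pieces, pulled out of the full script below with some extra commentary:

import base64

# Launch-time self-destruct: if nothing ever connects to the new spot
# instance, it shuts itself down after five minutes.
user_data = base64.b64encode(
    b'#!/bin/bash\necho "shutdown -h now" | at now + 5 min\n'
).decode("utf8")

def reschedule_shutdown_command(minutes):
    # Schedule the new shutdown *before* cancelling the existing `at` jobs,
    # so there is never a window without a pending shutdown, even if the
    # pipeline is interrupted halfway through.
    return f'sudo bash -c "to_cancel=$(atq | cut -f 1); echo shutdown -h | at now + {minutes} min; echo "$to_cancel" | xargs atrm; exit 0" >/dev/null 2>/dev/null'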
Annoyingly, while testing my code I started getting random errors. AWS informed me that they suspected that my account had been compromised. I suspect the average cryptomining workload looks pretty similar to this, although theirs is perhaps more economically productive.
Conclusion
The end result of all this toil:
- I can do builds really fast! My most complicated projects can do a full compile in ~minutes.
- “Cold” commands take around 20-30 seconds for the spot EC2 instance to become ready, and then suffer degraded performance while the EBS volume warms up.
- “Warm” commands take an additional ~100ms which is very reasonable.
Against my better judgement, here is the full code. It’s a series of unprincipled hacks (and using StrictHostKeyChecking=no is dangerous), but it works well for my workflow.
Python Code
import boto3
import subprocess
import sys
import time
import os
import base64
import json
from botocore.exceptions import ClientError
from shlex import quote
from functools import wraps
from time import time, sleep

base_ssh_command = None
ec2 = boto3.client("ec2", region_name="us-west-2")
fp = "/b/v8/v8"
volume_id = "vol-xxxxxxxxxxxxxxxxx"
verbose = False

class Color:
    GREEN = "\033[1;32;48m"
    RED = "\033[1;31;48m"
    BLUE = "\033[1;34;48m"
    END = "\033[1;37;0m"
def verbose_timing(f):
    @wraps(f)
    def wrap(*args, **kw):
        ts = time()
        print(f"{Color.BLUE}running {f.__name__}{Color.END}")
        error = None
        try:
            result = f(*args, **kw)
        except Exception as e:
            error = e
        te = time()
        diff = te - ts
        if diff > 2:
            diff_nice = f"{diff:2.4f} sec"
        else:
            diff_nice = f"{int(diff*1000)} ms"
        if error:
            print(f"{Color.RED}... failed in {diff_nice}{Color.END}")
            raise error
        else:
            print(f"{Color.GREEN}... done in {diff_nice}{Color.END}")
            return result
    return wrap

timing = verbose_timing if verbose else lambda f: f
@timing
def get_instance_information(instance_name):
    instances = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
            {"Name": "tag:Name", "Values": [instance_name]},
        ]
    )
    for reservation in instances["Reservations"]:
        for instance in reservation["Instances"]:
            return {
                "public_ip": instance["NetworkInterfaces"][0]["Association"][
                    "PublicIp"
                ],
                "instance_id": instance["InstanceId"],
                "has_volume": any(
                    bdm["Ebs"]["VolumeId"] == volume_id
                    for bdm in instance["BlockDeviceMappings"]
                ),
                "state": instance["State"]["Name"],
            }
@timing
def request_spot_instance(instance_name, instance_type):
    request = {
        "MaxCount": 1,
        "MinCount": 1,
        "ImageId": "ami-xxxxxxxxxxxxxxxxx",
        "InstanceType": instance_type,
        "InstanceInitiatedShutdownBehavior": "terminate",
        "KeyName": instance_name,
        "EbsOptimized": True,
        "UserData": base64.b64encode(
            b'#!/bin/bash\necho "shutdown -h now" | at now + 5 min\n'
        ).decode("utf8"),
        "NetworkInterfaces": [
            {
                "SubnetId": "subnet-xxxxxxxxxxxxxxxxx",
                "AssociatePublicIpAddress": True,
                "DeviceIndex": 0,
                "Groups": ["sg-xxxxxxxxxxxxxxxxx"],
            }
        ],
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [{"Key": "Name", "Value": "build-bot"}],
            }
        ],
        "InstanceMarketOptions": {"MarketType": "spot"},
        "MetadataOptions": {
            "HttpTokens": "required",
            "HttpEndpoint": "enabled",
            "HttpPutResponseHopLimit": 2,
        },
        "PrivateDnsNameOptions": {
            "HostnameType": "ip-name",
            "EnableResourceNameDnsARecord": False,
            "EnableResourceNameDnsAAAARecord": False,
        },
    }
    response = ec2.run_instances(**request)
    return response["Instances"][0]["SpotInstanceRequestId"]
@timing
def wait_for_spot_instance_request(spot_request_id):
    while True:
        response = ec2.describe_spot_instance_requests(
            SpotInstanceRequestIds=[spot_request_id]
        )
        state = response["SpotInstanceRequests"][0]["State"]
        if state == "active":
            instance_id = response["SpotInstanceRequests"][0]["InstanceId"]
            return instance_id
        elif state == "failed":
            raise Exception("Spot instance request failed.")
        sleep(1)

@timing
def attach_volume(instance_id):
    last_error = None
    for retries in range(60):
        try:
            response = ec2.attach_volume(
                Device="/dev/sdf", InstanceId=instance_id, VolumeId=volume_id
            )
            break
        except ClientError as e:
            str_e = str(e)
            if "(VolumeInUse)" in str_e or "is not 'running'." in str_e:
                last_error = e
            else:
                raise e
        sleep(1)
    else:
        raise last_error
    return response

@timing
def mount_when_ready(public_ip, key_file_path):
    ssh_command = f"{base_ssh_command} ubuntu@{public_ip} 'mountpoint /b >/dev/null || {{ sudo mount /dev/nvme1n1 /b || sudo mount /dev/sdf /b || sudo mount /dev/xvdf /b; }}'"
    while subprocess.call(ssh_command, shell=True) != 0:
        sleep(0.1)
@timing
def sync_changes_to_git():
    subprocess.check_call(
        'git add -u && git commit --allow-empty -m"for build"', shell=True
    )
    return subprocess.check_output(["git", "rev-parse", "HEAD"]).strip().decode("utf-8")

@timing
def push_git_sha(public_ip, key_file_path, git_sha):
    ssh_command = (
        f"git push -f ssh://ubuntu@{public_ip}{fp} HEAD:refs/heads/for-spot-build"
    )
    subprocess.check_call(ssh_command, shell=True)

@timing
def setup_build(public_ip, key_file_path, git_sha):
    ssh_command = (
        f"{base_ssh_command} ubuntu@{public_ip} 'cd {fp} && git checkout -q {git_sha}'"
    )
    subprocess.check_call(ssh_command, shell=True)

def run_command(public_ip, key_file_path, user_command):
    ssh_command = f"{base_ssh_command} ubuntu@{public_ip} {quote(user_command)}"
    try:
        subprocess.call(ssh_command, shell=True)
    except KeyboardInterrupt:
        return

@timing
def rsync_from_remote(public_ip, key_file_path, remote_path, local_path):
    rsync_command = f"rsync --info=progress2 -a -r -e '{base_ssh_command}' ubuntu@{public_ip}:{remote_path} {local_path}"
    subprocess.check_call(rsync_command, shell=True)
def load_build_bot_from_cache(cache_directory):
    return json.load(open(cache_directory + "/build_bot.json", "r"))

def store_build_bot_cache(cache_directory, build_bot):
    try:
        os.makedirs(cache_directory)
    except OSError as e:
        from errno import EEXIST
        if e.errno != EEXIST:
            raise
    return json.dump(build_bot, open(cache_directory + "/build_bot.json", "w"))

def is_build_bot_alive(public_ip):
    shutdown_command = reschedule_shutdown_command(10)
    ssh_command = f"{base_ssh_command} -o ConnectTimeout=3 ubuntu@{public_ip} {quote(shutdown_command)} >/dev/null 2>/dev/null"
    try:
        subprocess.check_call(ssh_command, shell=True)
        return True
    except subprocess.CalledProcessError:
        return False

@timing
def invalidate_build_bot_cache(cache_directory):
    from shutil import rmtree
    if "/build-bot/" not in cache_directory:
        raise Exception("failing because cache directory looks wrong")
    rmtree(cache_directory)

def reschedule_shutdown_command(minutes):
    # Only cancel existing shutdown jobs after we've scheduled a new
    # shutdown, so that there's always a shutdown scheduled even in the
    # face of incoming interrupts.
    return f'sudo bash -c "to_cancel=$(atq | cut -f 1); echo shutdown -h | at now + {minutes} min; echo "$to_cancel" | xargs atrm; exit 0" >/dev/null 2>/dev/null'
if __name__ == "__main__":
    build_bot = None
    key_file_path = "~/.ssh/build-bot.pem"
    base_ssh_command = f"ssh -i {key_file_path} -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/%C -o ControlPersist=10m"
    instance_name = "build-bot"
    instance_type = "c7a.8xlarge"

    cache_directory = os.path.expanduser("~/.cache/build-bot/")
    if os.path.isdir(cache_directory):
        build_bot = load_build_bot_from_cache(cache_directory)
        for retries in range(60):
            if is_build_bot_alive(build_bot["public_ip"]):
                break
            sleep(1)
            if get_instance_information(instance_name) is None:
                # The build bot went away (as expected).
                invalidate_build_bot_cache(cache_directory)
                build_bot = None
                break
        else:
            raise Exception("build bot is present but not responsive")

    user_command = " ".join(sys.argv[1:])
    local_out_directory = "~/v8/v8/out/x64.release"
    commands_to_run = [f"cd {fp}", "export PATH=/b/depot_tools:$PATH"]
    os.environ["GIT_SSH_COMMAND"] = base_ssh_command

    if build_bot is None:
        build_bot = get_instance_information(instance_name)

    if build_bot is None:
        spot_request_id = request_spot_instance(instance_name, instance_type)
        instance_id = wait_for_spot_instance_request(spot_request_id)
        for retries in range(60):
            try:
                sleep(1)
                build_bot = get_instance_information(instance_name)
                if build_bot is not None:
                    break
            except KeyError:  # network not ready yet...
                pass
        else:
            raise Exception("spot instance request fulfilled but no bot")

    public_ip = build_bot["public_ip"]
    if not build_bot["has_volume"]:
        instance_id = build_bot["instance_id"]
        attach_volume(instance_id)
        build_bot["has_volume"] = True

    if not build_bot.get("has_mount"):
        mount_when_ready(public_ip, key_file_path)
        build_bot["has_mount"] = True

    if "gm.py" in user_command:
        git_sha = sync_changes_to_git()
        push_git_sha(public_ip, key_file_path, git_sha)
        commands_to_run.append(f"git checkout {git_sha} >/dev/null")

    commands_to_run.append(user_command)
    commands_to_run.append(reschedule_shutdown_command(3))
    run_command(public_ip, key_file_path, ";".join(commands_to_run))

    store_build_bot_cache(cache_directory, build_bot)

    # add as necessary
    # rsync_from_remote(public_ip, key_file_path, f"{fp}/out/x64.release/", local_out_directory)
Packer Code
packer {
required_plugins {
amazon = {
version = ">= 0.0.1"
source = "github.com/hashicorp/amazon"
}
}
}
source "amazon-ebs" "ubuntu" {
ami_name = "build-bot-ami"
instance_type = "c7a.xlarge"
region = "us-west-2"
source_ami_filter {
filters = {
name = "ubuntu/images/*ubuntu-focal-20.04-amd64-server-*"
root-device-type = "ebs"
virtualization-type = "hvm"
}
most_recent = true
owners = ["099720109477"]
}
ssh_username = "ubuntu"
}
build {
name = "build-bot"
sources = [
"source.amazon-ebs.ubuntu"
]
provisioner "file" {
source = "../build-bot.pub"
destination = "/tmp/build-bot.pub"
}
provisioner "shell" {
script = "setup.sh"
}
}
setup.sh
Shell Code (used by Packer)
#!/bin/bash
set -Eeuox pipefail
# Update & upgrade all packages.
export DEBIAN_FRONTEND=noninteractive
sudo apt-get update -y
sudo apt-get -y -o Dpkg::Options::="--force-confdef" \
-o Dpkg::Options::="--force-confold" dist-upgrade
sudo apt-get update
# Install the SSH key.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cp /tmp/build-bot.pub ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Some dependencies I use.
sudo apt-get -y -qq install ninja-build ccache
# Dependencies for V8/Chromium.
curl https://chromium.googlesource.com/chromium/src/+/main/build/install-build-deps.py?format=TEXT \
| base64 -d > /tmp/install_build_deps.py
sudo python3 /tmp/install_build_deps.py
# By default you can't log in while the machine is booting or
# shutting down. Comment that out - lets us SSH in faster.
sudo sed -i '/account[[:space:]]*required[[:space:]]*pam_nologin.so/s/^/#/' /etc/pam.d/sshd
# Add the mountpoint.
sudo mkdir -p /b
sudo chown ubuntu:ubuntu /b
# Add some configuration files outside of the root mount.
mkdir -p ~/.ccache
cat > ~/.ccache/ccache.conf <<EOF
max_size = 32.0G
cache_dir = /b/ccache
EOF
echo . /b/.bashrc >> ~/.bashrc
echo source /b/.vimrc >> ~/.vimrc
P.S. After doing all this, I’m thinking I’ll just buy a new computer.