Issue 165 – September 23, 2023

Spot checkpoint

Hey there 👋

It’s Robin from CFD Engine & I’ve done it again.

Last week I said I’d write fewer AWS emails & this week I’m writing an AWS email 🤦‍♂️

Bear with me though, you might like how this one works, even if you never touch AWS.

I’m sharing an easy way to save your simulation data before your spot instance gets taken away.

I’m using two standard function objects & the command-line tool, curl – no fancy scripting or AWS shenanigans involved, at all.

Still with me? Let’s go…

Spot recap

Spot instances are awesome (who doesn’t like a 70% discount on their compute costs?). But… having your simulation cancelled before it finishes is not awesome, & that’s the gamble with spot instances.

If someone wants to pay full price for your discounted machine, then AWS will kick you off & give it to them.

There’s some nuance to how it works, but that’s the general gist.

We can reduce the amount of data we lose when this happens by increasing our save frequency. But on bigger cases, those extra saves can take a while.

We can probably do a little better…

What’s the idea?

When AWS wants your cheap machine back, they’ll give it (not you) 2 mins’ notice before it gets shut down.

If we can capture that notification, we can use those 2 mins to interrupt our simulation & write the data before we get biffed.

2 mins isn’t long enough to manually intervene, so we’ll have to automate it.

Here’s the plan: we’ll use the systemCall function object to check if our machine has received a shutdown notice.

If it has, we’ll interrupt the simulation using the abort function object & write out our data.

If it hasn’t, we’ll just keep on iterating (& checking) 💪

spotCheck function

Here’s how we can build that spotCheck function…

First up is a checker function that uses systemCall to run 2 command-line tools during our simulation.

[Screenshot: using the systemCall function object to check for a spot interrupt notification during our simulation using curl]

It’s only running on the master process & it runs every 10s (you can tweak it to suit your case, or just run it every iteration, up to you).

The commands we’re running are in the executeCalls section.

Every time this function runs it uses echo to print a message & then it uses curl to check for a termination notice.

This next bit is the AWS-specific bit.

Our 2min warning lives in our instance metadata at a specific URL (see below). If curl finds something at that URL it will save it to our current working directory as a file called OF.interrupt.

If no warning has been issued, the URL doesn’t exist (the request comes back as a 404), so curl will fail & the OF.interrupt file won’t be written.

Note: if you want to use this on AWS then the URL_TO_CHECK is http://169.254.169.254/latest/meta-data/spot/instance-action - it’s not machine specific, but it’s too long to fit in my screenshot 😊
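
In text form, a checker along those lines could look something like the sketch below. Treat it as an illustration rather than the exact file: the entry name spotCheck, the echo message, the clockTime control & the curl flags (-s quiet, -f fail on HTTP errors, --max-time to stop it hanging, -o to name the output file) are assumptions about one way to get the behaviour described above. It also assumes an OpenFOAM version whose systemCall supports the master keyword, & an instance whose metadata endpoint answers plain, token-free requests.

```
spotCheck
{
    type            systemCall;
    // Short library name as accepted by recent OpenFOAM.com releases (assumption);
    // older versions may want ("libutilityFunctionObjects.so")
    libs            (utilityFunctionObjects);

    // Only run the commands on the master process
    master          true;

    // Run roughly every 10s of wall-clock time (assumed control; tweak to suit)
    executeControl  clockTime;
    executeInterval 10;

    executeCalls
    (
        "echo spotCheck: looking for a spot interruption notice"
        // -f stops curl writing a file when the URL returns an error (no notice yet)
        "curl -s -f --max-time 2 -o OF.interrupt http://169.254.169.254/latest/meta-data/spot/instance-action"
    );
}
```

The -f flag is what gives the "no file unless there’s a notice" behaviour: without it, curl could save the error response itself as OF.interrupt & stop your run for no reason.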

On to the next bit, our interrupt function…

[Screenshot: using the abort function object to stop our simulation early]

This uses an abort function object to look for our OF.interrupt file.

If it finds it, it’ll write the data & end the simulation.

If not, it’ll carry on as normal.
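
In text form, an interrupt function like that might look like the sketch below. The entry name spotInterrupt is made up for illustration, & pointing file at $FOAM_CASE assumes the solver was launched from the case directory, so that’s where curl drops OF.interrupt.

```
spotInterrupt
{
    type    abort;
    // Short library name as accepted by recent OpenFOAM.com releases (assumption)
    libs    (utilityFunctionObjects);

    // Trigger file to watch for (assumed to land in the case directory)
    file    "$FOAM_CASE/OF.interrupt";

    // Write the current data, then stop the run
    action  writeNow;
}
```

The abort function object also offers noWriteNow (stop without writing) & nextWrite (stop at the next scheduled write), but neither is much use when the clock is ticking.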

And that’s it. Combine the two functions into a single file, save it in your system directory, reference it in the functions section of your controlDict & give it a try.
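
As a sketch of that last step, assuming both entries were saved in a file called system/spotCheck (a name picked for illustration), the functions section of controlDict could pull them in like this:

```
functions
{
    // Pulls in the function objects from system/spotCheck (hypothetical file name)
    #include "spotCheck"
}
```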

Gotchas

Check you’ve got curl installed, or this won’t work 🤦‍♂️ It’s on most Linux systems already & if it’s missing you can get it via your package manager.

You need to be saving your data to persistent storage, otherwise when your machine goes down, your data will go with it.

Things get a bit more complicated if we’re running on clusters of spot instances, but maybe we can look at that another day?

Spot, Check ✅

That’s it, two function objects plus curl & we have a way to grab our data before AWS reclaims our spot instance.

If you’re not on AWS you could still use systemCall to do other tasks. If you can script it, then systemCall can run it during your simulation. Check out the docs & let your imagination run.

Drop me a note if you have your own uses for systemCall or if you think you might give this a go.

Until next week, stay safe,

Signed Robin K