Hey there 👋
It’s Robin from CFD Engine & I’ve done it again.
Last week I said I’d write fewer AWS emails & this week I’m writing an AWS email 🤦♂️
Bear with me though, you might like how this one works, even if you never touch AWS.
I’m sharing an easy way to save your simulation data before your spot instance gets taken away.
I’m using two standard function objects & the command-line tool curl, with no fancy scripting or AWS shenanigans involved at all.
Still with me? Let’s go…
Spot recap
Spot instances are awesome (who doesn’t like a 70% discount on their compute costs?). But having your simulation cancelled before it finishes is not awesome, & that’s the gamble with spot instances.
If someone wants to pay full price for your discounted machine, then AWS will kick you off & give it to them.
There’s some nuance to how it works, but that’s the general gist.
We can reduce the amount of data we lose when this happens by increasing our save frequency. But on bigger cases, those extra saves can take a while.
We can probably do a little better…
What’s the idea?
When AWS wants your cheap machine back, they’ll give it (not you) 2mins notice before it gets shut down.
If we can capture that notification, we can use those 2mins to interrupt our simulation & write the data before we get biffed.
2mins isn’t long enough to manually intervene, so we’ll have to automate it.
Here’s the plan: we’ll use the systemCall function object to check if our machine has received a shutdown notice.
If it has, we’ll interrupt the simulation using the abort function object & write out our data.
If it hasn’t, we’ll just keep on iterating (& checking) 💪
spotCheck function
Here’s how we can build that spotCheck function…
First up is a checker function that uses systemCall to run 2 command-line tools during our simulation.
It’s only running on the master process & it runs every 10s (you can tweak it to suit your case, or just run it every iteration, up to you).
The commands we’re running are in the executeCalls section.
Every time this function runs it uses echo to print a message & then it uses curl to check for a termination notice.
This next bit is the AWS-specific bit.
Our 2min warning lives in our instance metadata at a specific URL (see below). If curl finds something at that URL it will save it to our current working directory as a file called OF.interrupt.
If no warning has been issued, the URL doesn’t exist, so curl will fail & the OF.interrupt file won’t be written.
Note: if you want to use this on AWS then the URL_TO_CHECK is http://169.254.169.254/latest/meta-data/spot/instance-action - it’s not machine specific, but it’s too long to fit in my screenshot 😊
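For anyone reading without the screenshot, here’s a rough sketch of what a checker like this might look like. It isn’t a copy of my original file, so treat the details as assumptions: the echo message is made up, I’ve read “every 10s” as wall-clock time (clockTime), the curl flags (-s, -f, -o) are my pick, and keywords like master & the libs entry can differ between OpenFOAM releases. The -f flag matters, because without it curl would save the “not found” error page as OF.interrupt instead of failing.

```
// Sketch of a checker function object (OpenFOAM dictionary syntax).
// Keyword names can vary between releases, check the docs for yours.
checker
{
    type            systemCall;
    libs            ("libutilityFunctionObjects.so");

    // Only run the commands on the master process
    master          true;

    // Assumption: check every 10 s of wall-clock time
    executeControl  clockTime;
    executeInterval 10;

    // -s: stay quiet, -f: fail on HTTP errors (so no file is written),
    // -o: save whatever comes back as OF.interrupt
    executeCalls
    (
        "echo 'spotCheck: looking for a termination notice'"
        "curl -s -f -o OF.interrupt http://169.254.169.254/latest/meta-data/spot/instance-action"
    );
}
```

Two things to check if it doesn’t fire: systemCall only runs when system operations are allowed (the allowSystemOperations switch in OpenFOAM’s global controlDict), & instances set to require IMDSv2 will reject a plain request like this one, since they expect a session token to be fetched first.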
On to the next bit, our interrupt function…
This uses an abort function object to look for our OF.interrupt file.
If it finds it, it’ll write the data & end the simulation.
If not, it’ll carry on as normal.
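As a rough sketch (again, not a copy of my original), the interrupt function could look something like this. The file path is an assumption: it presumes curl was run from the case directory, so OF.interrupt ends up in $FOAM_CASE. The writeNow action is what gives us the “write the data & end the simulation” behaviour.

```
// Sketch of an interrupt function object (OpenFOAM dictionary syntax).
interrupt
{
    type    abort;
    libs    ("libutilityFunctionObjects.so");

    // Assumption: curl saved the file in the case directory
    file    "$FOAM_CASE/OF.interrupt";

    // Write the current state straight away, then stop the run
    action  writeNow;
}
```

The abort function object only cares whether the file exists, not what’s in it, so the termination notice itself is never parsed.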
And that’s it. Combine the two functions into a single file, save it in your system directory, reference it in the functions section of your controlDict & give it a try.
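One way to wire that up, assuming you saved the combined file as system/spotCheck (the file name is my choice here, call it whatever you like):

```
// system/controlDict (excerpt)
functions
{
    // Assumption: both function objects live in system/spotCheck
    #include "spotCheck"
}
```

You don’t need a real spot interruption to try it out. On a short test run, creating an empty OF.interrupt file by hand in the case directory should trigger the same write-and-stop.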
Gotchas
Check you’ve got curl installed, or this won’t work 🤦♂️ It’s on most Linux systems & you can get it via your package manager.
You need to be saving your data to persistent storage, otherwise when your machine goes down, your data will go with it.
Things get a bit more complicated if we’re running on clusters of spot instances, but maybe we can look at that another day?
Spot, Check ✅
That’s it, two function objects plus curl & we have a way to grab our data before AWS reclaims our spot instance.
If you’re not on AWS you could still use systemCall to do other tasks. If you can script it, then it can run it during your simulation. Check out the docs & let your imagination run.
Drop me a note if you have your own uses for systemCall or if you think you might give this a go.
Until next week, stay safe,