Ready, Steady, Stop

Hey there,

It’s Robin from CFD Engine and I’m excited about this one. You know I like AWS & you know I like spot instances. Well, this week I made them even better.

Not really, but I finally got to check out hibernation on AWS (I’m only three years late) and I reckon it could be a big one for CFD on AWS.

Let’s start with the golden rule of EC2 – the one that you (& your chequebook) ignore at your peril:

The Golden Rule of EC2: “Turn. It. Off.”

You’ve basically got three options…

Stopping your instance

Like shutting down your workstation. When you start it up again, your data will be there waiting for you 🤞 You don’t pay for compute while it’s stopped, but you do need to pay AWS to hang onto your data.

Terminating your instance

As drastic as it sounds – terminated machines are gone, forever. And so is your data, unless you ticked the right boxes or moved it somewhere safe. There’s nothing left to incur charges, it’s all gone.

Hibernating your instance

A bit like closing the lid on your laptop. Everything is frozen, persisted to disk and the instance is shut down. You’re billed for the storage, but there’s no charge for the compute while it’s asleep. When it’s time to wake up, the process is reversed & everything should pick up where it left off when you “closed the lid”.

So, if you’re going home for the night & won’t be using it, stop your instance.

If you’ve broken it & it doesn’t work properly anymore, terminate it & start a new one.

I’m not really sure where hibernation fits day-to-day. But, combine it with spot instances, and we might just be onto something.

Spot (again 🥱)

Indulge me a minute while I harp on about spot instances again. You might like this one.

Spot instances are great. I hope I’ve established that in previous emails. What’s not great is when there aren’t any (it can happen) or when you get kicked off.

If the price/demand goes up enough, you get kicked, and your instance is terminated. Gone. Forever. Bummer.

Hopefully you’re built for this & you didn’t lose the data from your run 🤞 But now it’s up to you to get it started again on a different instance.

But what if, when spot capacity dries up, AWS hibernated your instance instead of terminating it? Then, when there’s spare compute again, it wakes it up and carries on like nothing happened.

It sounds like cheap compute without much downside.

Wait a minute

Call me cynical, but this sounds too good to be true. What are the chances of hibernating a running CFD simulation and then it carrying on where it left off without skipping a beat?

Maybe I’m battle-scarred, but once a simulation is running, I like it to be left in peace until it’s finished. I’m not even that keen on restarts. But this is different, so maybe it could work?

I need to be convinced.

So, I took it for a spin & I think I am convinced.

Wait another minute

We can’t just hibernate any old instance (unfortunately). There are a number of hoops we need to jump through with hibernation. Here are the main prerequisites in 5 bullet points…

- Not all instances support hibernation. We can use most C3, C4 and C5 instances, but not the metal variants, or the ones with more than 150GB RAM.
- Not all operating systems support hibernation. Recent Ubuntu releases are good to go, plus you can grab the OpenFOAM binaries for them too.
- You have to use an encrypted EBS volume as your root storage (for the OS etc) – no problems here.
- The aforementioned EBS volume has to be big enough to hold the contents of the RAM while the instance is asleep. You’ll probably need to make it pretty big: at least OS + RAM + OpenFOAM install + your case data.
- There are also a couple of (easy) config tweaks needed before it will fall asleep on demand (unlike me, who just needs 5 laps of a Grand Prix & I’m out).
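
As a rough sanity check on that volume-sizing rule, here’s a minimal sketch that adds up the pieces. The function name, the default sizes for the OS, OpenFOAM install and case data, and the 10% headroom are all my own placeholder assumptions (not AWS guidance), so swap in your own numbers.

```python
import math


def min_root_volume_gib(ram_gib, os_gib=8.0, openfoam_gib=2.0,
                        case_gib=10.0, headroom=0.1):
    """Rough lower bound for the root EBS volume, in GiB.

    Implements: OS + RAM + OpenFOAM install + case data, plus some headroom.
    All default sizes are illustrative guesses, not measured values.
    """
    sizes = (ram_gib, os_gib, openfoam_gib, case_gib)
    if any(s < 0 for s in sizes) or headroom < 0:
        raise ValueError("sizes and headroom must be non-negative")
    total = os_gib + ram_gib + openfoam_gib + case_gib
    # Round up to a whole GiB, as EBS volumes are sized in whole GiB.
    return math.ceil(total * (1 + headroom))


# Example: an instance with 72 GiB of RAM and the placeholder defaults.
print(min_root_volume_gib(ram_gib=72))  # -> 102
```

If in doubt, round up; a root volume that’s a bit too big seems a much smaller problem than one that can’t hold the RAM image.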

One more thing

Hibernation must be enabled when you start your instance – you can’t enable it on a running instance.
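
If you launch from a script rather than the console, the launch-time settings might look something like the sketch below. It only builds the keyword arguments for boto3’s `run_instances` call (`HibernationOptions` and an encrypted root volume in `BlockDeviceMappings`); it doesn’t talk to AWS. The helper name, the AMI ID, the instance type, the volume size and the `/dev/sda1` root device name are placeholders I’ve assumed for illustration, so check them against your own AMI.

```python
def hibernation_launch_params(ami_id, instance_type, root_volume_gib,
                              root_device="/dev/sda1"):
    """Build run_instances kwargs for a hibernation-enabled instance.

    root_device is an assumption: it must match the root device name
    of the AMI you are actually launching.
    """
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # Hibernation has to be switched on at launch time.
        "HibernationOptions": {"Configured": True},
        # The root volume must be encrypted and big enough for the RAM image.
        "BlockDeviceMappings": [
            {
                "DeviceName": root_device,
                "Ebs": {"VolumeSize": root_volume_gib, "Encrypted": True},
            }
        ],
    }


params = hibernation_launch_params("ami-EXAMPLE", "c5.4xlarge", 100)
print(params["HibernationOptions"])

# To actually launch (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("ec2").run_instances(**params)
```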

Hoops jumped through – let’s give it a go.

Testing, Testing

My highly-unscientific testing involved starting an instance that fitted the above rules, installing OpenFOAM v2006 & then running the motorbike tutorial, uninterrupted, as a baseline.

Then I ran it again – hibernating it a couple of times during the simulation.

It went to sleep pretty quickly & took about 5 mins to resume on waking.

Each time it resumed with no issues.
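
For the record, hibernating on demand is just a “stop” with an extra flag, and waking up is an ordinary “start”. Here’s a minimal sketch using the boto3 EC2 client calls `stop_instances(..., Hibernate=True)` and `start_instances`. The two helper names and the instance ID are my own placeholders, and I’m not claiming this is exactly how I drove my test.

```python
def hibernate(ec2_client, instance_id):
    """Ask EC2 to hibernate (suspend-to-disk) rather than plain stop."""
    return ec2_client.stop_instances(InstanceIds=[instance_id], Hibernate=True)


def resume(ec2_client, instance_id):
    """Wake a hibernated instance; it is a normal start request."""
    return ec2_client.start_instances(InstanceIds=[instance_id])


# Usage (needs boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2")
#   hibernate(ec2, "i-0123456789abcdef0")
#   ...wait for the instance to reach the "stopped" state...
#   resume(ec2, "i-0123456789abcdef0")
```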

Once complete, I diff-d the logs & results & they were identical (except for the expected – file paths, PIDs and timestamps etc).
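
If you want to repeat that comparison without wading through the expected differences, something like this sketch would do: it drops the lines that are bound to differ between runs and diffs the rest. The list of “volatile” lines (the `Date`, `Time`, `Host`, `PID` and `Case` header entries, plus the `ExecutionTime` lines) is my assumption about what a typical OpenFOAM solver log contains, so adjust the pattern to suit your logs.

```python
import difflib
import re

# Lines expected to differ between two otherwise identical runs (assumed).
VOLATILE = re.compile(r"^\s*(Date|Time|Host|PID|Case)\s*:|^ExecutionTime\s*=")


def stable_lines(log_text):
    """Return the log lines that should match from run to run."""
    return [ln for ln in log_text.splitlines() if not VOLATILE.match(ln)]


def log_diff(baseline_text, test_text):
    """Unified diff of two logs, ignoring the volatile lines."""
    return list(difflib.unified_diff(
        stable_lines(baseline_text), stable_lines(test_text),
        fromfile="baseline", tofile="hibernated", lineterm=""))


# Usage: an empty list means the two runs match where it matters.
#   with open("log.baseline") as a, open("log.hibernated") as b:
#       print("\n".join(log_diff(a.read(), b.read())) or "identical")
```

Note that the iteration markers (`Time = 1` and so on) are kept, as the pattern only matches the header-style entries with a colon.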

When I combed through the run log, I could see that the interrupted iteration took 17 mins, while the others took around a second each. Nice.
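
Combing through the log by eye gets old quickly, so here’s a small sketch that finds the slow iterations for you. It assumes the solver prints the usual `ExecutionTime = … s  ClockTime = … s` line each iteration and treats a big jump in `ClockTime` as a nap. The function names, the 60-second threshold and the sample log are made up for illustration.

```python
import re

CLOCK = re.compile(r"ClockTime\s*=\s*([0-9.]+)\s*s")


def clock_gaps(log_text):
    """Wall-clock seconds between consecutive ClockTime readings."""
    clocks = [float(m.group(1)) for m in CLOCK.finditer(log_text)]
    return [later - earlier for earlier, later in zip(clocks, clocks[1:])]


def slow_steps(log_text, threshold_s=60.0):
    """(reading number, gap in seconds) for each gap above the threshold.

    Reading numbers are 1-based and refer to the ClockTime line that
    ended the gap.
    """
    return [(i + 2, gap) for i, gap in enumerate(clock_gaps(log_text))
            if gap > threshold_s]


sample = """Time = 1
ExecutionTime = 0.9 s  ClockTime = 1 s
Time = 2
ExecutionTime = 1.8 s  ClockTime = 2 s
Time = 3
ExecutionTime = 2.7 s  ClockTime = 1022 s
Time = 4
ExecutionTime = 3.6 s  ClockTime = 1023 s
"""
print(slow_steps(sample))  # -> [(3, 1020.0)]
```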

Colour me impressed.

Possible gotchas

This test used an on-demand instance. It’s not easy to test spot-instance hibernation as the market isn’t transparent enough.

But I’ve no reason to think that it wouldn’t work on spot. The mechanics are the same (suspending to disk); it’s just that the trigger is spot capacity rather than a request from me.
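
I haven’t tested this bit, so treat it as a sketch only. When requesting a spot instance through boto3’s `run_instances`, the interruption behaviour goes in `InstanceMarketOptions`, and asking for `hibernate` needs a `persistent` spot request. The helper name is mine, and spot hibernation has its own prerequisites that you should check in the AWS docs before relying on it.

```python
def spot_hibernate_market_options(max_price=None):
    """Build InstanceMarketOptions asking for hibernate-on-interruption.

    Untested sketch: merge the result into your run_instances kwargs as
    InstanceMarketOptions=...
    """
    spot_options = {
        # Hibernate (or stop) on interruption requires a persistent request.
        "SpotInstanceType": "persistent",
        "InstanceInterruptionBehavior": "hibernate",
    }
    if max_price is not None:
        # The API expects the maximum hourly price as a string.
        spot_options["MaxPrice"] = str(max_price)
    return {"MarketType": "spot", "SpotOptions": spot_options}


print(spot_hibernate_market_options())
```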

Also, if there’s no spot capacity then your jobs could be hibernating for a while. Spot isn’t really for priority jobs. If you need something quick smart then it’s probably worth paying retail.

Our secret

So, now I’m conflicted. If you’re on AWS, I think you should try this out. However, I don’t want all my spot instances to be hoovered up, so let’s keep this just between us 🤫 OK?

Check out the AWS Hibernation Docs for more info.

If you don’t use AWS then I thank you for getting this far down an “AWS” email 🙏

Let me know if you think you might take this for a spin. Or drop me a note if you’re already hibernating your spot instances. I’m especially keen to know if there be dragons that I haven’t spotted.

Until next week, sim-you-later,