Issue 061 – August 28, 2021

AWS storage for CFD: a mini-guide

Hey there,

It’s Robin from CFD Engine & I’ve been neglecting the AWS issues recently, so this week I want to have a go at untangling AWS storage options for CFD.

Don’t skip this if you don’t use AWS for CFD though – they have a couple of great options for archiving your stuff & it makes a great off-site backup solution.

Stick with me & I’ll give you the rundown…

The Big Five

AWS (currently) has 5 main storage options plus another umpteen solutions if you include databases, container storage & code repositories. Storage, storage everywhere, but what’s the difference?

Here’s a summary of the big 5, each in roughly 100 words, including a top tip – let’s go…

S3 (Simple Storage Service)

Amazon S3 is an essentially-bottomless object store that looks a bit like a file system, but isn’t. Think of it like an FTP site – you can upload & download files, but you don’t save to it directly from your applications.

S3 is your gateway to AWS storage – you can transfer your files here without having to provision anything or leave any machines running.

Great for medium-to-long-term archiving of CFD datasets, sharing files with clients & off-site backup of your best meshes.

Top Tip: Make use of S3’s built-in feature to automatically move your infrequently accessed files to cheaper storage tiers – it can really help to keep your bills in check.
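One way to set this up is with an S3 lifecycle rule. Here's a minimal sketch in Python, assuming you use boto3 (the AWS SDK for Python) & have credentials configured. The bucket name, the "archive/" prefix & the 30-day threshold are made-up placeholders, so swap in your own.

```python
def lifecycle_rule(prefix, ia_days=30, rule_id="tier-down-old-cases"):
    """Build one lifecycle rule that moves objects under `prefix` to the
    cheaper Standard-IA storage class once they are `ia_days` old.

    The prefix, day count and rule ID are illustrative assumptions.
    """
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
        ],
    }


def apply_lifecycle(bucket, rules):
    """Attach the rules to a bucket (replaces any existing lifecycle config)."""
    import boto3  # third-party; imported here so the rule builder runs without it

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


# Example (hypothetical bucket name, needs AWS credentials):
# apply_lifecycle("my-cfd-archive", [lifecycle_rule("archive/")])
```

Building the rule as a plain dictionary means you can check it before it touches your account. S3 Intelligent-Tiering is another route to the same goal, if you'd rather let AWS decide what counts as infrequently accessed.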

S3 Glacier

Transferring data to S3 Glacier is like archiving stuff to tape ( 🤔 – ask your parents). Your data is no longer immediately accessible & if you want it back you’ll have to wait (& pay). It’s really cheap storage though, if you don’t touch it.

It’s exclusively for long-term storage – for those files that you’ll only need when disaster (or an audit team) strikes. If you need access to something (even just occasionally) leave it in S3 – don’t transfer it to Glacier.

Ideal for that project (or that client) that isn’t coming back anytime soon.

Top Tip: Glacier gets expensive when used with lots of small files (like an OpenFOAM case) so it’s often cheaper to store cases as an archive (tarred &/or zipped into a single file) rather than as naked directories.
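As a rough sketch of that tip, this is how you might bundle a case directory into a single compressed file with Python's standard library before uploading it. The function name & the paths are hypothetical, not part of any OpenFOAM tooling.

```python
import tarfile
from pathlib import Path


def archive_case(case_dir, out_dir="."):
    """Pack a whole case directory into one <case-name>.tar.gz file.

    Write the archive somewhere outside the case directory, otherwise the
    half-written archive would be picked up as part of the case.
    """
    case_dir = Path(case_dir).resolve()
    archive = Path(out_dir) / f"{case_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths relative to the case folder, not the full path
        tar.add(case_dir, arcname=case_dir.name)
    return archive


# Example (hypothetical paths):
# archive_case("runs/cavity", out_dir="to_upload")
```

You end up with one object to store instead of thousands of tiny ones.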

EBS

Think of Amazon EBS (Elastic Block Store) as the hard drive in your workstation or in a cluster node. EBS volumes are generally only accessible by their host machine, but they’re pretty darn fast.

This is where your simulation data will live while you’re working on it – dictionaries, solution data, post-processing etc – if a machine needs fast access to a file then stick it on its EBS drive.

Top Tip: You pay for your EBS volumes whether they’re attached to a machine or not, so keep an eye out for those unattached volumes that are doing nothing, but still inflating your bill.
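If you'd rather not hunt for them in the console, here's a small sketch that lists unattached volumes, again assuming boto3 & configured credentials. The region in the example is just a placeholder. EBS reports an unattached volume as being in the "available" state.

```python
def unattached_volumes(response):
    """Pick out (volume ID, size in GiB) for volumes that aren't attached
    to anything, given a describe_volumes-style response dictionary."""
    return [
        (vol["VolumeId"], vol["Size"])
        for vol in response.get("Volumes", [])
        if vol.get("State") == "available"
    ]


def list_unattached(region):
    """Ask EC2 for unattached volumes in one region.

    A single call is enough for a small account; for lots of volumes you'd
    want to page through the results.
    """
    import boto3  # third-party; imported here so the filter above runs without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    return unattached_volumes(response)


# Example (placeholder region, needs AWS credentials):
# for volume_id, size_gib in list_unattached("eu-west-1"):
#     print(volume_id, size_gib, "GiB")
```

This only reports what it finds. Deleting is left to you, once you're sure nothing important lives on those volumes.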

EFS

Amazon Elastic File System is network file storage in the cloud – the equivalent of your shared drives at work. Lots of machines can connect & use the storage at any time, although the performance will drop off when all your cluster nodes hit it at the same time.

I don’t like EFS. I like the idea of a central file system to share data between machines, but man-alive is it expensive. Coupled with the fact that it’s essentially bottomless (unlike my wallet) it can be a costly place to store CFD-sized data on all but the shortest of time scales.

If you do use it, be ruthless about what’s in there, how long it stays there & whether it could live elsewhere.

Top Tip: Consider whether you could replace a shared EFS drive with a combination of S3 & EBS to provide essentially the same thing, but at a fraction of the cost.

FSx for Lustre

FSx for Lustre is the high-performance cousin of EFS. It provides lightning-fast access to storage for clusters of machines & it stays fast even when they all attack it at once.

If you need me to tell you about Lustre then you probably don’t need it (yet). As you scale up your AWS clusters, then you might start to benefit from what it offers.

That said, for many of us, read-write time is minimal compared to solve time, so it may not yield that much of a performance benefit.
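To put a rough number on that, here's a back-of-envelope estimate in the style of Amdahl's law. It assumes you know (or can guess) what fraction of your wall-clock time goes on reading & writing, & how much faster the new storage would make that part. Both inputs are your own estimates, not AWS figures.

```python
def overall_speedup(io_fraction, io_speedup):
    """Estimate the whole-run speedup when only the I/O part gets faster.

    io_fraction: share of wall-clock time spent on read/write (0 to 1)
    io_speedup:  how many times faster the I/O part becomes (> 0)
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if io_speedup <= 0:
        raise ValueError("io_speedup must be positive")
    # Solve time is unchanged; only the I/O share shrinks.
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)


# Example with made-up numbers: 5% of the run is I/O & storage gets 10x faster
# print(round(overall_speedup(0.05, 10), 3))
```

If I/O is only a small slice of your run, even a huge storage speedup barely moves the total. If you write a lot, the picture changes quickly.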

It’s not too expensive though & it’s probably worth a look if you write A LOT or run big clusters.

Over to you

Hopefully this speedrun through the main AWS storage options was useful. But I’m keen to know, how do you do storage for CFD?

Do you have much CFD data on AWS? Where does it live? Am I missing something?

How about local storage? Are quotas still a thing, along with the Friday evening disc space juggle or is storage too cheap to worry about these days?

Please drop me a note & share your experiences or ask me a question. I’m always keen to hear how you do your CFD & in this case how you handle your data storage.

Until next week, stay safe,

Signed Robin K