Hey there,
It’s Robin from CFD Engine & this email isn’t just for AWS users – it’s for all OpenFOAMers.
A couple of weeks ago Neil Ashton & his colleagues at AWS published a blog post with a handful of recommendations for getting the best bang for your buck when running OpenFOAM on AWS.
The focus was mainly on cost-efficiency but as time is money on AWS, there are some speedup tips too.
Some of it is AWS-specific, but most of it isn’t, so I thought I’d share their takeaways with you, plus some commentary & see what you reckon.
Their benchmark
Their tests were based on the large motorbike benchmark from the OpenFOAM HPC technical committee, which they ran in v2012.
One oddity of this benchmark is that it’s made by meshing a single motorbike, then mirroring & merging it twice, to end up with a case that contains 4 motorbikes.
You mesh 8M cells in snappyHexMesh but solve 32M cells 🤔 I think it would’ve been nice to mesh the big model in snappyHexMesh (at least once) to see if that had any impact on the recommendations, but we can always try that for ourselves.
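If you want to poke at that construction yourself, the doubling step might look something like the sketch below – this is my guess at the workflow, not the committee’s actual script, & the mirror planes live in system/mirrorMeshDict:

    # hedged sketch: mirrorMesh reflects the current mesh about a plane
    # (defined in system/mirrorMeshDict) & keeps both halves
    mirrorMesh    # 1 motorbike -> 2
    # point system/mirrorMeshDict at the second symmetry plane, then:
    mirrorMesh    # 2 motorbikes -> 4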
I’m going to skip the AWS-specific stuff (for now), but they ran this case across a lot of different compute options to come up with the following…
Their recommendations
Use fewer cores for meshing than solving
I know snappyHexMesh doesn’t really scale, but I was surprised to see that it didn’t scale past 2 or 3 nodes. I’m not sure how practical it would be to mesh on just a couple of nodes on local hardware though.
AWS instances typically have more memory than you can efficiently use. If you don’t have similarly “fat nodes” then you’ll need to use extra cores just to gather enough RAM to mesh your model. After all, it’s not uncommon with SHM to need more memory to mesh a case than to solve it.
Similarly, on AWS you can use a different instance type to mesh, without affecting your solve resource. I’m sure your queue would figure it out, but meshing & solving on different core counts feels like it could be a “pinch point.”
It’s an interesting recommendation & well worth testing, but this might not fit your hardware.
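If you do want to try it, one way to decouple the two core counts is to keep a separate dictionary just for meshing – the file name & counts below are invented for illustration, & the -decomposeParDict option assumes a reasonably recent (ESI) build:

    # decompose for meshing on fewer cores, using a meshing-only dictionary
    # (system/decomposeParDict.mesh is a made-up name)
    decomposePar -decomposeParDict system/decomposeParDict.mesh

    # mesh on 32 cores, say, then redistribute for the solve (see below)
    mpirun -np 32 snappyHexMesh -parallel -overwrite \
        -decomposeParDict system/decomposeParDict.mesh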
Meshing more than 100K cells per core is fine
As previously noted, if you’ve got enough RAM, then running snappyHexMesh on a single node (36–96 cores in these tests) was the cheapest option. Running on two (or three) nodes was a little faster, but not much.
It would be nice to try this with different-sized meshes – what’s the upper limit?
Thinking aloud: snappyHexMesh spends a lot of time re-balancing the decomposition as the mesh grows. I wonder if there’s much to be gained by increasing the maxLoadUnbalance in snappyHexMeshDict to reduce the number of times it has to re-balance 🤔
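If you fancy trying that too, maxLoadUnbalance lives in castellatedMeshControls – the 0.3 below is an arbitrary “looser” value for illustration, not a recommendation:

    // system/snappyHexMeshDict (fragment)
    castellatedMeshControls
    {
        // allow up to 30% load imbalance before re-balancing
        // (tutorials typically use 0.10)
        maxLoadUnbalance    0.3;

        // ... the rest of your usual castellated settings ...
    }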
Use hierarchical decomposition for meshing
This one was a bit of a surprise, or rather the difference between hierarchical & scotch was a surprise. Meshing with scotch decomposition took approximately twice as long as using hierarchical 😬 I’m going to have to take this one for a spin.
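For reference, a minimal hierarchical setup looks something like this – the core count & splits are example numbers only:

    // system/decomposeParDict (fragment)
    numberOfSubdomains  72;         // example count

    method              hierarchical;

    hierarchicalCoeffs
    {
        n               (6 4 3);    // splits in x, y & z – must multiply to 72
        delta           0.001;
        order           xyz;
    }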
Use scotch decomposition for solving
There was very little solve-time difference between hierarchical & scotch decomposition until they got beyond 600 cores (& even then the difference was relatively small) – but then again, any speedup is a good speedup, especially if it’s free.
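& switching to scotch for the solve is about as simple as decomposition gets – there are no coefficients to keep in sync with the core count:

    // system/decomposeParDict (fragment)
    numberOfSubdomains  576;        // example count

    method              scotch;     // no coeffs block needed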
Aim for 50K–100K cells per core for solving
One of the nice things about cloud computing is that it puts the cost of a simulation front & centre. So, whilst you can go faster with more & more cores (up to a point), you can easily identify the “sweet spot.”
On AWS that appears to be somewhere between 50K & 100K cells per core when solving.
I think this has been a rule of thumb in CFD-land for a while, but it’s one that I routinely ignore 🤫
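For the 32M-cell benchmark, that rule of thumb translates into a concrete core-count bracket:

    # rough sweet-spot core counts for a 32M-cell solve
    echo $((32000000 / 100000))   # 320 cores at 100K cells/core
    echo $((32000000 / 50000))    # 640 cores at 50K cells/core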
Don’t use reconstructParMesh
I’ve mentioned this one before but it bears repeating…
If you’re meshing & solving on different numbers of cores please don’t reconstruct your mesh & then re-decompose it – it takes ages.
Instead, edit your decomposeParDict (change the numberOfSubdomains etc.) & run redistributePar (in parallel) to make the change.
Much quicker than running reconstructParMesh then decomposePar – it can still take a while on bigger meshes though.
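In practice it’s a two-liner – assuming v2012-ish tooling, a mesh made on 16 cores & a solve on 64 (both counts invented for the example):

    # bump the subdomain count in place (foamDictionary ships with OpenFOAM)
    foamDictionary -entry numberOfSubdomains -set 64 system/decomposeParDict

    # spread the existing processor* meshes across the new core count
    mpirun -np 64 redistributePar -parallel -overwrite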
Any thoughts?
I recommend adding the original AWS blog post to your reading list, if you’ve not done it already.
Do their recommendations fit with your experience? Are you going to give any new ones a try?
Personally, I’m interested to see how the “mesh-on-fewer-cores & redistribute” approach fares on a bigger mesh – I’ll be giving that a go.
Also, how do you decide how many cores to use? Do you usually mesh & run on different counts? Or do you use “all of the cores, always”? What’s the sweet spot for your hardware, models & workflow?
Drop me a note – I’m always keen to hear how you do yours.
Until next week, stay safe