ParMETIS and HUGE meshes

hi everyone,

short question:

Does anyone around here have experience with using ParMETIS on massively parallel machines and really large unstructured meshes (and their dual graphs)? I am talking about finite element meshes with up to 1 billion elements, and running ParMETIS on more than 4000 MPI ranks. Can I expect ParMETIS to work correctly in such an environment, or are there known issues?

Anyone with such experience, please let me know. Thanks!

Andreas

RE: Hello, As far as I know, all

Hello,

As far as I know, all those machines have a shared filesystem: the Crays use a Lustre filesystem, while the Blue Gene/Qs use GPFS. I am quite sure the Blue Gene/Qs don't have local scratch, and I doubt the Crays do.

In any case, when running his tests, Charles burned through his time allocations more due to I/O than to computing time (as he only ran a few time steps), so I'm not sure adding I/O is the best option.
In most tests, he was using quite a few more cells per rank than we usually do, and was quite near the memory limit. Since ParMetis didn't run with 8K MPI tasks, he had no reason to try it with 32K, as he was testing the scalability of the CFD code, not ParMetis itself. So chances are pretty good that running the case on a few more tasks would work.

I'll contact him to check (and am visiting his lab in about 10 days from now).
I don't have access to quite as big machines (the best I can get is a 4-rack Blue Gene/Q, which would be enough for the same tests, but requesting more than one rack could lead to a long wait in the queue).

Note also that for very large cases with our code (which is downloadable at http://code-saturne.org, for those who would want to check the ParMetis calls in src/mesh/cs_partition.c), we have observed that I/O and preprocessing operations using MPI_Alltoallv stop scaling well, so the performance of those stages decreases. This was expected, and is not an issue at 4K tasks. Even at 32K, the decreased performance is not much of an issue compared to the time required for the numerics. But I would expect libraries such as ParMETIS or PT-SCOTCH (which both contain MPI_Alltoallv calls, I presume for matching algorithms similar to those we use internally) to exhibit similar behavior, making tests at that scale a bit more costly than if "linear" performance were expected. On our side, we have 2 mitigation strategies: hybrid MPI/OpenMP, to reduce communicator size, which has its own issues, and moving to binomial or crystal router algorithms for those collectives (Argonne's Nek5000 seems to report very good scalability with those).

In any case, our code reports more detailed timings for its different stages, so if the logs are still available, we might also have access to the timings of ParMetis itself, if you are interested.

Regards,

Yvan Fournier

RE: Re: ParMETIS and HUGE meshes

Hello,

Some people I work with have recently run a finite volume code using several partitioners, including ParMetis, on up to 800 million hexahedral cells and up to 16000 cores on several PRACE machines. The mesh is a hexahedral mesh, and the connectivity graph is face-based, so for about 800 million graph nodes (mesh cells), it has a little less than 4.8 billion graph edges (faces), as boundary faces are removed from the graph. The graph is symmetric.

If I am not mistaken, at 3.2 billion cells, neither ParMETIS nor PT-SCOTCH could run on this case, due to memory constraints (the code they are linked with already requires a fair amount of memory, so they might have been able to run on more cores).

Of course, compiling with 64-bit ids is a must.

You may find more details in the following paper:

http://www.prace-project.eu/IMG/pdf/Optimisation_of_Code_Saturne_for_Petascale_Simulations.pdf

Regards,

Yvan Fournier

RE: Yvan, Do these machines have

Yvan,

Do these machines have access to local disk /scratch space? There is a way to reduce the memory of ParMetis by writing a lot of the data that it does not need to disk.

george