ParMetis Parallel Scaling Issues

Some setup for my question: I am working on a project that reads in an unstructured mesh and uses ParMetis to partition the mesh cells over the calling processors. We currently treat each mesh cell as a vertex in a graph and then run ParMetis PartKway on the "graph" with 16 splitting processors. The code has a Python layer that controls execution of the simulation, while the lower level is Fortran for all of the heavy lifting, so ParMetis is called from the Fortran layer through a wrapper function. At compile time, all of the lower-level code is compiled into shared object (.so) libraries for the Python layer to link against.
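To make the call path concrete, the call we issue looks roughly like the sketch below. This is a simplified, illustrative version of the Fortran-callable C wrapper, not our actual code: the wrapper name, the no-weights settings, the 1-based numbering, and the 32-bit idx_t/real_t ParMetis build are all assumptions for the example.

    /* Simplified sketch of a Fortran-callable wrapper around ParMETIS_V3_PartKway.
     * Assumes: 32-bit idx_t/real_t build of ParMetis, trailing-underscore Fortran
     * name mangling, communicator passed from Fortran as an MPI_Fint, no weights. */
    #include <stdlib.h>
    #include <mpi.h>
    #include <parmetis.h>

    void partkway_wrapper_(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy,
                           idx_t *nparts, idx_t *part, idx_t *edgecut,
                           MPI_Fint *fcomm, idx_t *status)
    {
        MPI_Comm comm = MPI_Comm_f2c(*fcomm);   /* Fortran handle -> C handle */

        idx_t wgtflag = 0;                  /* no vertex or edge weights    */
        idx_t numflag = 1;                  /* 1-based (Fortran) numbering  */
        idx_t ncon    = 1;                  /* one balance constraint       */
        idx_t options[3] = {0, 0, 0};       /* default ParMetis options     */
        real_t ubvec[1]  = {1.05};          /* 5% load-imbalance tolerance  */

        /* Uniform target weights for the nparts partitions. */
        real_t *tpwgts = (real_t *) malloc((size_t)(*nparts) * sizeof(real_t));
        for (idx_t i = 0; i < *nparts; i++)
            tpwgts[i] = (real_t)1.0 / (real_t)(*nparts);

        *status = ParMETIS_V3_PartKway(vtxdist, xadj, adjncy,
                                       NULL, NULL,        /* vwgt, adjwgt */
                                       &wgtflag, &numflag, &ncon, nparts,
                                       tpwgts, ubvec, options,
                                       edgecut, part, &comm);
        free(tpwgts);
    }

The Fortran side binds to this through an ISO_C_BINDING interface; none of the memory behavior below should hinge on these details, the sketch is only so the call path is concrete.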

We have some issues with memory when trying to scale above 512 processors.

For instance, on a machine with these characteristics: 16 cores per compute node, ~28 GB usable memory per node, CC-NUMA style architecture.
And a mesh with these characteristics: ~4.5 million cells, ~10.6 million faces, ~1.8 million nodes.

This behavior is observed:

Total Procs   Memory free at call START (MB)   Memory free at call END (MB)   Used by the call (MB)
16            26583.53125                      26394.73828                    188.79
32            27702.12500                      27627.90625                    74.22
64            28543.28516                      27842.94922                    700.33
128           28500.00000                      26907.56641                    1592.43
256           28394.66406                      24988.57812                    3406.09
512           28106.56641                      21152.76953                    6953.80
1024          27587.71094                      14539.19531                    13048.52
2048          execution fails, not enough memory on the node to execute

So, from 64 processors onward, the amount of memory consumed on the node during the ParMetis call roughly doubles with every doubling of the total processor count.
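To put numbers on that, the ratios of consecutive "used" figures above are 1592.43 / 700.33 ≈ 2.3, 3406.09 / 1592.43 ≈ 2.1, 6953.80 / 3406.09 ≈ 2.0, and 13048.52 / 6953.80 ≈ 1.9. So the per-node footprint of the call grows roughly linearly with the total number of processes, and if every node behaves like the one we sampled, the aggregate memory across the whole job (total procs / 16 nodes) grows roughly quadratically with processor count.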

A couple of caveats:
1.) We know we should move to the PartMeshKway routine, and a project is currently in the works to do that. However, I doubt that alone will reduce the amount of memory we are currently seeing used.
2.) We could do a ParMetis split once across the compute nodes (splitting procs = number of executing compute nodes) and then subsequently split within each compute node to minimize memory usage; a rough sketch of what that plumbing could look like is below.
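For concreteness, here is a minimal sketch of the communicator setup that option 2.) would need, assuming MPI-3 for MPI_Comm_split_type. Everything here is illustrative: the function name is hypothetical, and the actual ParMetis calls and the cell migration between the two splits are only indicated as comments.

    /* Hedged sketch of the two-level split: a coarse partition across compute
     * nodes, then a second partition within each node. Only the MPI
     * communicator plumbing is shown. */
    #include <mpi.h>

    void two_level_partition(MPI_Comm world)
    {
        int world_rank;
        MPI_Comm_rank(world, &world_rank);

        /* Group the ranks that share a compute node (MPI-3). */
        MPI_Comm node_comm;
        MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, world_rank,
                            MPI_INFO_NULL, &node_comm);

        int node_rank, node_size;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_size(node_comm, &node_size);

        /* One leader rank per node takes part in the coarse split. */
        MPI_Comm leader_comm;
        MPI_Comm_split(world, node_rank == 0 ? 0 : MPI_UNDEFINED,
                       world_rank, &leader_comm);

        if (node_rank == 0) {
            /* Coarse split on leader_comm with nparts = number of nodes,
             * e.g. via the PartKway wrapper sketched earlier. */
        }

        /* ... broadcast the coarse partition over node_comm, migrate cells to
         * their target nodes, then run a second split within node_comm with
         * nparts = node_size ... */

        if (leader_comm != MPI_COMM_NULL)
            MPI_Comm_free(&leader_comm);
        MPI_Comm_free(&node_comm);
    }

The point of the leader communicator is that only (total procs / 16) ranks would ever participate in the coarse split, which is what should keep the per-node footprint down.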

Setting those caveats aside for now, is there any other reason why we are seeing the memory balloon as much as it is? From what I can tell, ParMetis does a good job of cleaning up all of the memory it uses during the partitioning process, so I would expect its total memory use to grow at most linearly with the number of calling procs (i.e., to stay roughly flat per node), rather than the per-node growth we are seeing. But I could be wrong about this.

I guess I am just looking for ideas about why this behavior occurs, and whether there are any other fixes besides the two workarounds above.

Thanks!