Segfault using ParMETIS
Hello,
I am using ParMETIS for partitioning tetrahedral grids via:
ParMETIS_V3_PartKway (idxtype *vtxdist, idxtype *xadj, idxtype *adjncy, idxtype *vwgt, idxtype *adjwgt,
int *wgtflag, int *numflag, int *ncon, int *nparts, float *tpwgts, float *ubvec,
int *options, int *edgecut, idxtype *part, MPI_Comm *comm);
Grids with up to 13000 elements (tetrahedra) are decomposed as expected; larger grids, however, result in a segfault:
[~-devel:20359] *** Process received signal ***
[~-devel:20359] Signal: Segmentation fault (11)
[~-devel:20359] Signal code: Address not mapped (1)
[~-devel:20359] Failing at address: 0x48ee
[~-devel:20359] [ 0] [0xffffe600]
[~-devel:20359] [ 1] /home/ckonrad/lib/libparmetis.so.3.1(ParMETIS_V3_PartMeshKway+0x189) [0xf7fb6789]
[~-devel:20359] [ 2] parpart(main+0x6fd) [0x804e3cd]
[~-devel:20359] [ 3] /lib/libc.so.6(__libc_start_main+0xe0) [0xa55f70]
[~-devel:20359] [ 4] parpart [0x804dbc1]
[~-devel:20359] *** End of error message ***
I really have no idea what causes this, so here is how I call the function:
ParMETIS_V3_PartMeshKway( elmdist, eptr, eind, NULL,
&wgtflag, &numflag, &ncon, &ncommonnodes, &nparts,
tpwgts, &ubvec, options,
&edgecut, part, &comworld );
I assume elmdist, eptr and eind are correct. Maybe I am making a mistake with the weights. I want equally sized partitions. The documentation of this function seems to be copied and pasted from the corresponding graph partitioning routine: the description of wgtflag, for example, references parameter names that do not appear in the definition of PartMeshKway.
I set: wgtflag = 0; numflag = 0; ncon = 0; ncommonnodes = 2; nparts = (between 2 and 50); tpwgts = [1/nparts ... 1/nparts] (nparts elements);
ubvec = 1.05 (1 element); options[0] = 0; options[1] and options[2] are arbitrary.
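For reference, the equal-weight setup described above can be sketched as follows. This is only an illustration: the helper names uniform_tpwgts and uniform_ubvec are hypothetical, and the array sizes (ncon * nparts entries for tpwgts, ncon entries for ubvec) are my reading of the ParMETIS manual and should be checked against it. With ncon = 0, as set above, both arrays would come out empty, so the sketch insists on ncon >= 1; that is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: target partition weights for equally sized parts.
// Assumes tpwgts holds ncon * nparts entries, each equal to 1/nparts.
std::vector<float> uniform_tpwgts(int ncon, int nparts) {
    assert(ncon > 0 && nparts > 0);
    return std::vector<float>(static_cast<std::size_t>(ncon) * nparts,
                              1.0f / static_cast<float>(nparts));
}

// Hypothetical helper: one imbalance tolerance per constraint (e.g. 1.05).
std::vector<float> uniform_ubvec(int ncon, float tolerance) {
    assert(ncon > 0);
    return std::vector<float>(static_cast<std::size_t>(ncon), tolerance);
}
```

The resulting vectors could then be passed to the partitioner as tpwgts.data() and ubvec.data() instead of a hand-filled array and the address of a single float.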
Thank you for any hints,
Christian
RE: Can you open an issue on the
Can you open an issue on the flyspray issue tracking system?
thanks.
RE: even Mesh2Dual segfaults
Hello,
I have now used Mesh2Dual to convert my mesh to a graph as a first step. The mesh-partitioning function itself begins by converting the mesh to a graph, and since I also get a segfault when I call the conversion function manually, there must be something wrong with the parameters I pass to the mesh-partitioning routine.
ParMETIS_V3_Mesh2Dual(elmdist, eptr, eind, &numflag,
&ncommonnodes, &r1, &r2, &comworld);
The example that crashes has 19589 tetrahedra, so elmdist comes out as [0 4897 9794 14691 19589], which seems okay.
eptr = [0 4 8 12 ... 4*4897] for the first processor (tetrahedra have 4 nodes). The eind array has 4*4897 entries on the first processor and stores, one element after another, the 4 node numbers of each of the first 4897 tetrahedra.
Again numflag = 0 and ncommonnodes = 2; r1 and r2 are the arrays that receive the result, and comworld is the communicator.
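The layout described above can be sketched like this. It is a minimal illustration, not the code from my program: build_elmdist and build_eptr are hypothetical helper names, plain int stands in for idxtype (an assumption about how idxtype is defined in this ParMETIS build), and the "equal chunks, last rank takes the remainder" split is inferred from the numbers quoted above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: split n_elements over nprocs ranks in chunks of
// n_elements / nprocs; the last rank additionally takes the remainder.
std::vector<int> build_elmdist(int n_elements, int nprocs) {
    assert(nprocs > 0 && n_elements >= 0);
    std::vector<int> elmdist(nprocs + 1);
    const int chunk = n_elements / nprocs;
    for (int p = 0; p < nprocs; ++p) {
        elmdist[p] = p * chunk;       // first element owned by rank p
    }
    elmdist[nprocs] = n_elements;     // one past the last element overall
    return elmdist;
}

// Hypothetical helper: 0-based offsets into eind for n_local elements
// that all have the same number of nodes (4 for tetrahedra).
std::vector<int> build_eptr(int n_local, int nodes_per_elem) {
    assert(n_local >= 0 && nodes_per_elem > 0);
    std::vector<int> eptr(n_local + 1);
    for (int e = 0; e <= n_local; ++e) {
        eptr[e] = e * nodes_per_elem; // element e starts at eind[eptr[e]]
    }
    return eptr;
}
```

For the crashing case, build_elmdist(19589, 4) yields the elmdist array quoted above, and build_eptr(4897, 4) yields the eptr of the first processor, ending at 4*4897.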
Do you have any ideas what I am doing wrong?
thank you, chris
RE: Can you try the command-line
Can you try the command-line version of METIS's mesh2dual on the mesh to make sure that there are no "bugs" in the mesh itself?
RE: proceeding
Hello,
I tried mesh2dual and did in fact get a segfault. The cause is that the node numbers in my meshes start from 0 rather than from 1. I modified the meshes by adding one to each node number, and mesh2dual now works on all my mesh files.
I did the same with the input for ParMETIS. However, this didn't help; I see the same behavior as before.
Does a successful mesh2dual call tell me that my file is okay, or is it possible that it produces a graph that also contains errors?
I have now also applied ParMETIS_V3_Mesh2Dual to my modified meshes, with the same result as in the partitioning: a segfault...
Is there some tool that checks the mesh file? If I apply the process:
* renumbering from [0 to N] to [1 to N+1]
* mesh2dual
* graphchk
then I get that the resulting graph file is correct. I am puzzled.
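In the absence of such a tool, a basic consistency check of the local arrays could look like the sketch below. This is a hypothetical helper, not part of METIS or ParMETIS: the name check_local_mesh and its parameters are made up for illustration, eptr is assumed to hold 0-based offsets into the local eind array, first_node is the lowest valid node number (0 or 1), and n_nodes is the total number of nodes in the mesh.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sanity check for one rank's element arrays.
// Returns an empty string if nothing suspicious is found, otherwise a
// short description of the first problem encountered.
std::string check_local_mesh(const std::vector<int>& eptr,
                             const std::vector<int>& eind,
                             int first_node, int n_nodes) {
    assert(n_nodes >= 0);
    if (eptr.empty() || eptr.front() != 0) {
        return "eptr must start at 0";
    }
    // Offsets must never decrease from one element to the next.
    for (std::size_t e = 1; e < eptr.size(); ++e) {
        if (eptr[e] < eptr[e - 1]) {
            return "eptr decreases after element " + std::to_string(e - 1);
        }
    }
    // The last offset must point exactly one past the end of eind.
    if (static_cast<std::size_t>(eptr.back()) != eind.size()) {
        return "eptr.back() does not match the length of eind";
    }
    // Every node number must lie in [first_node, first_node + n_nodes).
    for (std::size_t i = 0; i < eind.size(); ++i) {
        if (eind[i] < first_node || eind[i] >= first_node + n_nodes) {
            return "node number out of range at eind[" + std::to_string(i) + "]";
        }
    }
    return "";
}
```

Running this on every processor before the library call would at least show whether an out-of-range node number or a wrong offset is present on the rank that crashes; it says nothing about errors inside the library itself.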
chris
RE: partly
Hello,
I ran the code for the conversion to a graph like this:
ParMETIS_V3_Mesh2Dual(...);
cout << world_id << ": done" << endl;
Only one processor crashes; all the others work fine and display the "done" message. Maybe this is a hint that the problem has to do with writing the result values?
chris
RE: Source code and example files
Hello,
I generated a zip file containing these files:
* par.cpp (my source code)
* Makefile
* 3000.ele
* 5000.ele
You can reproduce my problem with these files. Just run 'make' and call:
mpirun -c 4 ./par_part 3000.ele (this will work)
mpirun -c 4 ./par_part 5000.ele (this won't)
The zip file is available at: http://home.in.tum.de/~konrad/par.zip
Thanks, chris