
ParMETIS - Parallel Graph Partitioning: Bug Reports



Matthias
Posted on Tuesday, February 03, 2004 - 06:04 am:   

Hi
My parallel program crashes with a segmentation violation
when I call ParMETIS_V3_AdaptiveRepart with 20 processes
on an IBM Regatta with ParMetis-3.0.
It crashes somewhere in PQueueUpdate (in Metis).

Because the Regatta is a Supercomputer where a
20-processor job only runs over night, I tried to
reconstruct this crash on one of our own IBM RS6000.
It also crashes with the same input graph.
When I switched on debugging in metis.h (#define DEBUG 1), I
noticed that some assertions fail:
-----
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 72 of file debug.c: nbnd == graph->nbnd
16 17
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 72 of file debug.c: nbnd == graph->nbnd
12 13
-----
The assertions are in the function FM_2WayEdgeRefine.
In this function PQueueUpdate is called.
I also tried ParMetis-3.1, and the same assertions fail there too.

I would be grateful for any help.

Matthias

Matthias
Posted on Tuesday, February 03, 2004 - 10:36 am:   

Hi
Some additions to my problem. I found out that in SplitGraphPart
the new lgraph has an invalid value in adjncy
(the graph is not consistent). I think it comes
from a wrong value of graph->bndptr: the vertex is handled as an
interior vertex, but it isn't interior.
I'll see what's going on there tomorrow.

And the assertions... is it "normal" that they fail?



george
Posted on Tuesday, February 03, 2004 - 03:18 pm:   

Are you sure that the graph that you are giving to ParMetis is actually undirected with no self-loops?

Matthias
Posted on Wednesday, February 04, 2004 - 11:02 am:   

Yes, I think so. I wrote a short program that checks the
distributed graph for:
- edges starting and ending at the same vertex
- edges ending at undefined vertices
- edges that are not bidirectional

Is that enough checking?
Does ParMETIS provide such a checking function?

It may be relevant that this error does
not occur on an Intel machine.
There the result seems to be all right.

If you wish, I can send you a short test program
that calls ParMETIS_V3_AdaptiveRepart and the graph
data.

Thanks for your help!
Matthias
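
The three checks listed in this post can be sketched for a graph held in a single, non-distributed CSR pair (xadj, adjncy). This is only an illustration under that assumption: the function name check_csr_graph and its return codes are hypothetical and are not part of ParMETIS, and a distributed graph would first have to be gathered onto one process.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a ParMETIS routine.
 * Checks a serial CSR graph with 0-based numbering.
 * Returns 0 if the graph passes, otherwise:
 *   1 = an edge starts and ends at the same vertex (self-loop)
 *   2 = an edge ends at an undefined vertex (index out of range)
 *   3 = an edge u->v has no matching edge v->u (not bidirectional) */
int check_csr_graph(int nvtxs, const int *xadj, const int *adjncy)
{
    for (int v = 0; v < nvtxs; v++) {
        for (int e = xadj[v]; e < xadj[v + 1]; e++) {
            int u = adjncy[e];
            if (u < 0 || u >= nvtxs)
                return 2;
            if (u == v)
                return 1;
            /* Search u's adjacency list for the reverse edge u->v. */
            int found = 0;
            for (int k = xadj[u]; k < xadj[u + 1]; k++) {
                if (adjncy[k] == v) {
                    found = 1;
                    break;
                }
            }
            if (!found)
                return 3;
        }
    }
    return 0;
}
```

The reverse-edge search is linear in the degree of each neighbor, which is fine for debugging but may be slow on very large graphs.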

george
Posted on Wednesday, February 04, 2004 - 03:31 pm:   

Matthias,

How large is the graph? ParMetis seems to have problems with small graphs :-)

You can email me the test program and sample graph.

Lisandro Dalcin
Posted on Tuesday, August 31, 2004 - 11:26 am:   

George, I am developing a module to access ParMETIS 3.1
functionality from Python. In the process of wrapping, I have found an
inconsistency in the way ParMETIS treats the 'tpwgts' and 'ubvec'
arguments.

In PartMeshKway(), if 'tpwgts' or 'ubvec' are NULL, the routine aborts
(you have explicitly coded the functions to do that, as I saw in
source 'mmetis.c'). In other functions, like PartKway(), a default
value is used (via a call to CheckInputs(..)).

Is there any reason for this behavior? Why not use a default value
for the partition weights in PartMeshKway and issue a warning, like the
other functions do?
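
As an illustration of what such defaults could look like on the caller's side, here is a minimal sketch. It assumes the ParMETIS 3.x convention that 'tpwgts' holds ncon*nparts floats and 'ubvec' holds ncon floats; the helper names are hypothetical, and the tolerance (for example 1.05) is left to the caller rather than taken from the library.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: uniform target weights, 1/nparts for every
 * (partition, constraint) pair. Caller frees the result. */
float *default_tpwgts(int ncon, int nparts)
{
    if (ncon <= 0 || nparts <= 0)
        return NULL;
    size_t n = (size_t)ncon * (size_t)nparts;
    float *tpwgts = malloc(n * sizeof *tpwgts);
    if (tpwgts == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        tpwgts[i] = 1.0f / (float)nparts;
    return tpwgts;
}

/* Hypothetical helper: the same imbalance tolerance for every
 * constraint. Caller frees the result. */
float *default_ubvec(int ncon, float tolerance)
{
    if (ncon <= 0)
        return NULL;
    float *ubvec = malloc((size_t)ncon * sizeof *ubvec);
    if (ubvec == NULL)
        return NULL;
    for (int i = 0; i < ncon; i++)
        ubvec[i] = tolerance;
    return ubvec;
}
```

A wrapper could call these when the user passes NULL and hand the resulting arrays to PartMeshKway, freeing them afterwards.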

Another problem in PartMeshKway occurs if the user passes a 'wgtflag'
with value '3':

the test in line 43 will pass:

43: if (((*wgtflag)&2) && elmwgt == NULL) {

but the call to PartKway() in line 79 will fail:

79: ParMETIS_V3_PartKway(elmdist, xadj, adjncy, elmwgt, NULL, wgtflag, ...
as 'wgtflag' is '3' but you passed NULL for 'adjwgt'.


Do you think it would be dangerous to do something like

wgtflag = elmwgt?2:0;

before calling PartKway() and issue a warning? In any case, the first bit of
wgtflag does not make any sense in PartMeshKway and can be discarded (is this correct?)...


Thanks in advance.

Lisandro Dalcin
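
The idea in this post can be written out as a small standalone helper. This is a sketch only: the function name and the 'changed' out-parameter are hypothetical, and it assumes the usual meaning of wgtflag (bit value 1 for edge weights, bit value 2 for vertex weights).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: compute the wgtflag to forward to PartKway for
 * the dual graph of a mesh. The dual graph built here carries no edge
 * weights, so only the vertex-weight bit (value 2) can be set, and only
 * when element weights were actually supplied.
 * If 'changed' is non-NULL it is set to 1 when the effective flag
 * differs from what the user asked for, so the caller can warn. */
int dual_graph_wgtflag(int user_wgtflag, const int *elmwgt, int *changed)
{
    int effective = (elmwgt != NULL) ? 2 : 0;
    if (changed != NULL)
        *changed = (effective != user_wgtflag);
    return effective;
}
```

With a user value of 3 and non-NULL element weights this yields 2 and reports a change, which is the case described above.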

Lisandro Dalcin
Posted on Tuesday, August 31, 2004 - 05:17 pm:   

George, sorry about this... I didn't know my email address was going to be published! You know, spam is annoying and this page is public. Could you remove the link to my email address in the previous post, and clean up this post, please? Sorry again...

Alex
Posted on Friday, September 30, 2005 - 10:50 pm:   

Hi,
I just downloaded ParMETIS as a required component of another code. However, when running the standard tests in the Graphs/ subdir, I found that mtest crashes. The stack trace is given below.
Machine data: SGI Altix 350 (4xItanium2)
compilers: Intel cc (icc), ver. 9
loader: also ver. 9
I think the MPI comes from the SGI ProPack 3

Any help on this?
THANKS!
-alex
----------- stack trace -----------

$mpirun -np 4 ./mtest bricks.hex3d 2
Nelements: 117649, Nnodes: 125000, EType: 3
MGCNUM: 2
Completed Dual Graph -- Nvtxs: 117649, Nedges: 2046222
[117649 2046222 29412 29413] [100] [ 0.000] [ 0.000]
[ 60230 1195682 15032 15075] [100] [ 0.000] [ 0.000]
[ 30905 630126 7712 7738] [100] [ 0.000] [ 0.000]
[ 15878 308518 3958 3976] [100] [ 0.000] [ 0.000]
[ 8188 150574 2039 2055] [100] [ 0.000] [ 0.000]
[ 4240 74750 1057 1063] [100] [ 0.000] [ 0.000]
[ 2208 37340 547 555] [100] [ 0.000] [ 0.001]
[ 1155 18804 285 291] [100] [ 0.000] [ 0.001]
[ 606 9272 149 154] [100] [ 0.000] [ 0.002]
[ 321 4636 79 82] [100] [ 0.000] [ 0.004]
[ 175 2312 43 46] [100] [ 0.000] [ 0.008]
[ 96 1166 23 25] [100] [ 0.000] [ 0.016]
nvtxs: 96, balance: 1.029
MPI: On host [blank], Program /s1/ParMetis/ParMetis-3.1/Graphs/mtest, Rank 3, Process 7296 received signal SIGSEGV(11)


MPI: --------stack traceback-------
Source file not found or not readable, tried...
./../sysdeps/unix/sysv/linux/waitpid.c
/s1/ParMetis/ParMetis-3.1/Graphs/../sysdeps/unix/sysv/linux/waitpid.c
./waitpid.c
/s1/ParMetis/ParMetis-3.1/Graphs/waitpid.c
(Cannot find source file ../sysdeps/unix/sysv/linux/waitpid.c)
MPI: Linux Application Debugger for Itanium(R)-based applications, Version 9.0-12, Build 20050729
MPI: Reading symbolic information from /s1/ParMetis/ParMetis-3.1/Graphs/mtest...No debugging symbols found
MPI: Attached to process id 7296 ....
MPI: stopped at [__pid_t __libc_waitpid(__pid_t, int*, int):32 0x20000000006c5481]
MPI: >0 0x20000000006c5481 in __libc_waitpid(pid=7297, stat_loc=0x60000ffffff77e10, options=0) "../sysdeps/unix/sysv/linux/waitpid.c":32
MPI: #1 0x20000000000da440 in mpi_sgi_system(...) in /usr/lib/libmpi.so
MPI: #2 0x20000000000da930 in first_arriver_handler(...) in /usr/lib/libmpi.so
MPI: #3 0x20000000000da280 in slave_sig_handler(...) in /usr/lib/libmpi.so
MPI: #4 0xa0000000000040c0
MPI: #5 0x400000000007f340 in iidxsort__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #6 0x40000000000380a0 in Moc_KWayFM__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #7 0x40000000000258b0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #8 0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #9 0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #10 0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #11 0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #12 0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #13 0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #14 0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #15 0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #16 0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #17 0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #18 0x4000000000024cc0 in ParMETIS_V3_PartKway(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #19 0x4000000000016fa0 in ParMETIS_V3_PartMeshKway(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #20 0x4000000000002b40 in main(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #21 0x20000000005a6990 in __libc_start_main(main=0x4000000000141510, argc=3, ubp_av=0x60000fffffffae98, init=0x200000000077c200, fini=0x200000000077c200, rtld_fini=0x20000000000da440, stack_end=0xc000000000000288) "../sysdeps/generic/libc-start.c":205
MPI: #22 0x4000000000001d80 in _start(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest

MPI: -----stack traceback ends-----
MPI: On host [blank], Program /s1/ParMetis/ParMetis-3.1/Graphs/mtest, Rank 3, Process 7296: Dumping core on signal SIGSEGV(11) into directory /s1/ParMetis/ParMetis-3.1/Graphs
MPI: MPI_COMM_WORLD rank 3 has terminated without calling MPI_Finalize()
MPI: aborting job
MPI: Received signal 11

Alex
Posted on Tuesday, October 04, 2005 - 03:02 am:   

So, I recompiled with the latest SGI SCSL and MPT libraries and added the -debug option. It seems that kmetis.c in ParMETISLIB is the culprit. A quick Google search on the problem turns up vague suggestions that the order in which MPI routines are called in Moc_Global_Partition__(...) in kmetis.c is to blame. Help, please?

mpirun -np 4 ./mtest bricks.hex3d 2 >logb
MPI: On host *****.com, Program /s1/ParMetis/ParMetis-3.1/Graphs/mtest, Rank 3, Process 17994 received signal SIGSEGV(11)


MPI: --------stack traceback-------
Source file not found or not readable, tried...
./../sysdeps/unix/sysv/linux/waitpid.c
/s1/ParMetis/ParMetis-3.1/Graphs/../sysdeps/unix/sysv/linux/waitpid.c
./waitpid.c
/s1/ParMetis/ParMetis-3.1/Graphs/waitpid.c
(Cannot find source file ../sysdeps/unix/sysv/linux/waitpid.c)
MPI: Linux Application Debugger for Itanium(R)-based applications, Version 9.0-12, Build 20050729
MPI: Reading symbolic information from /s1/ParMetis/ParMetis-3.1/Graphs/mtest...done
MPI: Attached to process id 17994 ....
MPI: stopped at [__pid_t __libc_waitpid(__pid_t, int*, int):32 0x2000000003f95481]
MPI: >0 0x2000000003f95481 in __libc_waitpid(pid=17995, stat_loc=0x60000ffffff782a0, options=0) "../sysdeps/unix/sysv/linux/waitpid.c":32
MPI: #1 0x20000000000fa700 in MPI_SGI_stacktraceback(...) in /usr/lib/libmpi.so
MPI: #2 0x20000000000fb3e0 in slave_sig_handler(...) in /usr/lib/libmpi.so
MPI: #3 0xa0000000000040c0
MPI: #4 0x400000000007f780 in iidxsort__(total_elems=9, pbase=0x30013) "iidxsort.c":136
MPI: #5 0x40000000000380e0 in Moc_KWayFM__(ctrl=0x97d956d952972800, graph=0x30013, wspace=0x1, npasses=65598) "kwayfm.c":489
MPI: #6 0x4000000000025930 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":257
MPI: #7 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":228
MPI: #8 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":228
MPI: #9 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":228
MPI: #10 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":228
MPI: #11 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":228
MPI: #12 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":228
MPI: #13 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":228
MPI: #14 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":228
MPI: #15 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":228
MPI: #16 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x97d956d952c77800, graph=0x30013, wspace=0x0) "kmetis.c":228
MPI: #17 0x4000000000024d40 in ParMETIS_V3_PartKway(vtxdist=0x60000ffffffba5e0, xadj=0x400000000013e170, adjncy=0x1cb91, vwgt=0x1f390e, adjwgt=0x6000000000008908, wgtflag=0x0, numflag=0x60000ffffffba550, ncon=0x1, nparts=0x60000fffffffa948, tpwgts=0x60000000001541d0, ubvec=0x60000fffffffa8c0, options=0x60000fffffffa840, edgecut=0x60000fffffffa8b0, part=0x6000000000e01ff0, comm=0x60000ffffffba694) "kmetis.c":137
MPI: #18 0x4000000000017020 in ParMETIS_V3_PartMeshKway(elmdist=0xffffffffffffffff, eptr=0x1003e, eind=0x200000000484d690, elmwgt=0x1003e, wgtflag=0x400000001, numflag=0x600000000001097e, ncon=0x0, ncommonnodes=0x0, nparts=0x60000fffffffa948, tpwgts=0x60000000001541d0, ubvec=0x60000fffffffa8c0, options=0x60000fffffffa840, edgecut=0x60000fffffffa8b0, part=0x6000000000e01ff0, comm=0x60000fffffffa93c) "mmetis.c":79
MPI: #19 0x4000000000002b40 in main(argc=3, argv=0x60000fffffffae18) "mtest.c":72
MPI: #20 0x2000000003e76990 in __libc_start_main(main=0x4000000000141950, argc=3, ubp_av=0x60000fffffffae18, init=0x200000000404c200, fini=0x200000000404c200, rtld_fini=0x20000000000fa700, stack_end=0xc00000000000040e) "../sysdeps/generic/libc-start.c":205
MPI: #21 0x4000000000001d80 in _start(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest

MPI: -----stack traceback ends-----
MPI: On host ******.com, Program /s1/ParMetis/ParMetis-3.1/Graphs/mtest, Rank 3, Process 17994: Dumping core on signal SIGSEGV(11) into directory /s1/ParMetis/ParMetis-3.1/Graphs
MPI: MPI_COMM_WORLD rank 3 has terminated without calling MPI_Finalize()
MPI: aborting job
MPI: Received signal 11

Alex
Posted on Tuesday, October 04, 2005 - 07:50 pm:   

A bit of an update here. It seems that SGI MPT 1.12 and icc v9 conflict when
the compiler tries to optimize the code. The ParMETIS tests run when optimizations are disabled (-O0), but crash with the messages above at any other optimization level (-O1, -O2, -O3). Weird.

Alex
Posted on Thursday, October 06, 2005 - 12:14 am:   

A bit more of an update, entitled "cursed C pointer arithmetic" :-) :-)

It turns out that the icc optimizer does not like what looks like a pretty standard routine: iidxsort (which implements a mix of quicksort and insertion sort).
If I compile *JUST* that routine with optimizations disabled (-O0), while all the others are compiled with -O2, everything works fine.
I don't know if this is just a memory coincidence on my machine, but...

"how bizarre, how bizarre..."

Alex
Posted on Thursday, October 06, 2005 - 09:54 pm:   

hmmm..........it seems that it is more likely an Intel bug in the optimization code :-) :-) :-)
("cursed Intel!!!! ")

Anyhow, the following simple program causes a segmentation fault when compiled with any of the optimization flags of icc (i.e., -O1, -O2, -O3).
However, when compiled with -O0, it runs just fine!

Now, it's been about 13 years since I hacked away at pointer arithmetic, but it seems to me that iidxsort is perfectly fine. Any comments out there?

Thanks!
-a
---------------------------------
$ icc -v
Version 9.0
$ icc -O0 idxt.c iidxsort.c -o idxt
$./idxt
>>>>
0 1 1 1 2 3 3 3 3 6 6 8 8 9 9
<<<<
$ icc -O1 idxt.c iidxsort.c -o idxt
$./idxt
Segmentation fault
$
------------------------------------------------
#include <stdio.h>
//#include "parmetis.h"

typedef int idxtype;

/* Defined in iidxsort.c from the ParMETIS sources. */
void iidxsort(int total_elems, idxtype *pbase);

int main(void)
{
    int i;
    idxtype nelem = 15;
    idxtype niz[] = {3,6,2,3,0,6,8,9,1,3,8,9,3,1,1};

    /* Sort in place; this is the call that segfaults with icc -O1 and above. */
    iidxsort(nelem, niz);

    printf(">>>>\n");
    for (i = 0; i < nelem; i++)
        printf(" %d", niz[i]);
    printf("\n<<<<\n");

    return 0;
}
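
A reproducer like the one above only shows a problem when the sort crashes. To also catch a build that silently returns wrong results, one option is to compare the routine against the standard library's qsort. The sketch below is an assumption-laden illustration: sort_matches_qsort and reference_insertion_sort are hypothetical names, and the insertion sort is only a stand-in so the example compiles without iidxsort.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int idxtype;
typedef void (*sort_fn)(int total_elems, idxtype *pbase);

/* Three-way comparison for qsort that cannot overflow. */
static int cmp_idx(const void *a, const void *b)
{
    idxtype x = *(const idxtype *)a;
    idxtype y = *(const idxtype *)b;
    return (x > y) - (x < y);
}

/* Hypothetical checker: returns 1 if 'sorter' gives the same result as
 * qsort on a copy of 'input', 0 if it differs, -1 if allocation fails.
 * The caller's array is left untouched. */
int sort_matches_qsort(sort_fn sorter, const idxtype *input, int n)
{
    if (n <= 0)
        return 1;
    size_t bytes = (size_t)n * sizeof(idxtype);
    idxtype *got = malloc(bytes);
    idxtype *want = malloc(bytes);
    if (got == NULL || want == NULL) {
        free(got);
        free(want);
        return -1;
    }
    memcpy(got, input, bytes);
    memcpy(want, input, bytes);
    sorter(n, got);
    qsort(want, (size_t)n, sizeof(idxtype), cmp_idx);
    int same = (memcmp(got, want, bytes) == 0);
    free(got);
    free(want);
    return same;
}

/* Stand-in sorter with the same signature as iidxsort. */
void reference_insertion_sort(int n, idxtype *a)
{
    for (int i = 1; i < n; i++) {
        idxtype key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;
    }
}
```

When linking against the real iidxsort.c, passing iidxsort in place of reference_insertion_sort should exercise the routine under test at each optimization level.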


Alex
Posted on Friday, October 07, 2005 - 03:39 pm:   

To wrap this up, Intel confirms this is a bug with their compiler:
Intel(R) C Itanium(R) Compiler for Itanium(R)-based applications, Version 9.0 Build 20050912 Package ID: l_cc_c_9.0.026

:-). :-).


george
Posted on Friday, October 07, 2005 - 04:01 pm:   

Alex,

I'm glad that you resolved this problem :-)

george

lijian
Posted on Friday, January 27, 2006 - 04:50 pm:   

ParMETIS_V3_AdaptiveRepart crash.

When I use ParMETIS_V3_AdaptiveRepart to partition a very small graph (only 6 vertices) using more than 3 processes, it crashes.
(A very simple initial partition is given.)

When the number of processes is <= 3, it works.
When I use ParMETIS_V3_PartKway for the same small graph, it works.

I would like my code to work for problems of various scales, from hundreds to billions.
Can you give me any suggestions?

The following is the input data
using 6 processes:
vtxdist =[0 1 2 3 4 5 6]
rank 0: xadj =[0 5 ]
rank 0: adjncy =[2 3 4 5 1 ]
rank 1: xadj =[0 1 ]
rank 1: adjncy =[0 ]
rank 2: xadj =[0 2 ]
rank 2: adjncy =[0 4 ]
rank 3: xadj =[0 2 ]
rank 3: adjncy =[0 4 ]
rank 4: xadj =[0 4 ]
rank 4: adjncy =[0 3 5 2 ]
rank 5: xadj =[0 2 ]
rank 5: adjncy =[0 4 ]
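
For a graph this small it can help to rebuild the global CSR arrays from the per-rank pieces and inspect them on a single process. The following is a minimal sketch of that concatenation without any MPI calls; assemble_global_csr is a hypothetical name, and it assumes each rank's xadj starts at 0 and its adjncy holds global vertex numbers, as in the data above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: concatenate per-rank CSR pieces into one global
 * CSR graph. vtxdist has nranks+1 entries; rank r owns global vertices
 * vtxdist[r] .. vtxdist[r+1]-1. The output xadj needs vtxdist[nranks]+1
 * entries and adjncy needs room for every edge entry.
 * Returns the total number of adjacency entries written. */
int assemble_global_csr(int nranks, const int *vtxdist,
                        const int *const *local_xadj,
                        const int *const *local_adjncy,
                        int *xadj, int *adjncy)
{
    int nedges = 0;
    xadj[0] = 0;
    for (int r = 0; r < nranks; r++) {
        int nlocal = vtxdist[r + 1] - vtxdist[r];
        for (int i = 0; i < nlocal; i++) {
            int gv = vtxdist[r] + i;  /* global index of local vertex i */
            for (int e = local_xadj[r][i]; e < local_xadj[r][i + 1]; e++)
                adjncy[nedges++] = local_adjncy[r][e];
            xadj[gv + 1] = nedges;
        }
    }
    return nedges;
}
```

Applied to the six pieces listed above, this produces 16 adjacency entries, that is, 8 undirected edges each stored in both directions.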

