Author | Message
Matthias
Posted on Tuesday, February 03, 2004 - 06:04 am:

Hi,

My parallel program crashes with a segmentation violation when I call ParMETIS_V3_AdaptiveRepart with 20 processes on an IBM Regatta with ParMetis-3.0. It crashes somewhere in PQueueUpdate (in Metis). Because the Regatta is a supercomputer where a 20-processor job only runs overnight, I tried to reproduce this crash on one of our own IBM RS6000 machines. It also crashes there with the same input graph. After switching on debugging in metis.h (#define DEBUG 1), I noticed that some assertions fail:

-----
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
  (the line above appears ten times in a row)
***ASSERTION failed on line 72 of file debug.c: nbnd == graph->nbnd
16 17
***ASSERTION failed on line 65 of file fm.c: ComputeCut(graph, where) == graph->mincut
***ASSERTION failed on line 72 of file debug.c: nbnd == graph->nbnd
12 13
-----

The assertions are in the function FM_2WayEdgeRefine, which calls PQueueUpdate. I also tried ParMetis-3.1, but the assertions fail there as well. I would be glad for any help.

Matthias
Matthias
Posted on Tuesday, February 03, 2004 - 10:36 am:

Hi,

Some additions to my problem. I found out that in SplitGraphPart the new lgraph has an invalid value in adjncy (the graph is not consistent). I think it comes from a wrong value in graph->bndptr: the vertex is handled as an interior vertex, but it is not interior. I'll see what's going on there tomorrow. And the assertions... is it "normal" for them to fail?
george
Posted on Tuesday, February 03, 2004 - 03:18 pm:

Are you sure that the graph you are giving to ParMetis is actually undirected with no self-loops?
Matthias
Posted on Wednesday, February 04, 2004 - 11:02 am:

Yes, I think so. I wrote a short program that checks the distributed graph for:
- edges starting and ending at the same vertex (self-loops)
- edges ending at undefined vertices
- edges that are not bidirectional

Is that enough checking? Does ParMETIS provide such a checking function? It may be interesting that this error does not occur on an Intel machine; there the result seems to be all right. If you wish, I can send you a short test program that calls ParMETIS_V3_AdaptiveRepart, together with the graph data. Thanks for the help!

Matthias
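For reference, a check along those lines can be sketched in C. This is a serial, 0-based CSR version only; a distributed check would additionally have to exchange the adjacency lists of non-local vertices between ranks. The function name is illustrative and not part of the ParMETIS API:

```c
/* Check a CSR graph (0-based) for the three conditions listed above:
 * self-loops, edges to undefined vertices, and edges that are not
 * bidirectional. Returns 1 if the graph is consistent, 0 otherwise.
 * Serial sketch only; ParMETIS itself is not involved. */
static int check_csr_graph(int nvtxs, const int *xadj, const int *adjncy)
{
    int u, j, k;
    for (u = 0; u < nvtxs; u++) {
        for (j = xadj[u]; j < xadj[u + 1]; j++) {
            int v = adjncy[j];
            if (v == u)
                return 0;                /* self-loop */
            if (v < 0 || v >= nvtxs)
                return 0;                /* undefined vertex */
            /* look for the reverse edge v -> u */
            int found = 0;
            for (k = xadj[v]; k < xadj[v + 1]; k++) {
                if (adjncy[k] == u) { found = 1; break; }
            }
            if (!found)
                return 0;                /* edge is not bidirectional */
        }
    }
    return 1;
}
```

A graph passing this check satisfies exactly the "undirected, no self-loops" requirement George asks about below.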
george
Posted on Wednesday, February 04, 2004 - 03:31 pm:

Matthias, how large is the graph? ParMetis seems to have problems with small graphs. You can email me the test program and a sample graph.
Lisandro Dalcin
Posted on Tuesday, August 31, 2004 - 11:26 am:

George, I am developing a module to access ParMETIS 3.1 functionality from Python. While wrapping it, I found an inconsistency in the way ParMETIS treats the 'tpwgts' and 'ubvec' arguments. In PartMeshKway(), if 'tpwgts' or 'ubvec' is NULL, the routine aborts (the function is explicitly coded to do that, as I saw in the source file 'mmetis.c'). In other functions, like PartKway(), a default value is used instead (via a call to CheckInputs(...)). Is there a reason for this behavior? Why not use a default value for the partition weights in PartMeshKway and issue a warning, as the other functions do?

Another problem in PartMeshKway occurs if the user passes a 'wgtflag' with value 3: the test on line 43 passes,

43: if (((*wgtflag)&2) && elmwgt == NULL) {

but the call to PartKway() on line 79 fails,

79: ParMETIS_V3_PartKway(elmdist, xadj, adjncy, elmwgt, NULL, wgtflag, ...

because 'wgtflag' is 3 but NULL was passed for 'adjwgt'. Do you think it would be dangerous to do something like

wgtflag = elmwgt ? 2 : 0;

before calling PartKway() and issue a warning? In any case, the first bit of wgtflag does not make any sense in PartMeshKway and can be discarded (is this correct?)... Thanks in advance.

Lisandro Dalcin
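The suggested fix could be sketched as follows. The helper name is illustrative, not ParMETIS code; the bit meanings (bit 0 = edge weights, bit 1 = vertex weights) follow the wgtflag convention described in the post above:

```c
#include <stddef.h>

/* Sketch of the workaround suggested above. PartMeshKway forwards
 * elmwgt as the vertex weights and has no edge weights to pass on,
 * so the edge-weight bit (bit 0) is dropped unconditionally and the
 * vertex-weight bit (bit 1) is kept only if elmwgt was actually given. */
static int sanitize_wgtflag(const int *elmwgt)
{
    return (elmwgt != NULL) ? 2 : 0;
}
```

A wrapper would call this just before forwarding to PartKway(), optionally warning the user that bit 0 was ignored.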
Lisandro Dalcin
Posted on Tuesday, August 31, 2004 - 05:17 pm:

George, sorry about this... I didn't know my email address was going to be published! You know... spam is annoying and this page is public. Could you erase the link to my email address in the previous post, and clean up this post, please? Sorry again...
Alex
Posted on Friday, September 30, 2005 - 10:50 pm:

Hi, I just downloaded ParMETIS as a required component of another code. However, when running the standard tests in the Graphs/ subdirectory, I found that mtest crashes... The stack trace is given below.

Machine data:
- SGI Altix 350 (4x Itanium2)
- compilers: Intel cc (icc), ver. 9
- loader: also ver. 9
- I think the MPI comes from SGI ProPack 3

Any help on this? THANKS!
-alex

----------------------- stack trace -----------------------
$ mpirun -np 4 ./mtest bricks.hex3d 2
Nelements: 117649, Nnodes: 125000, EType: 3
MGCNUM: 2
Completed Dual Graph -- Nvtxs: 117649, Nedges: 2046222
[117649 2046222 29412 29413] [100] [ 0.000] [ 0.000]
[ 60230 1195682 15032 15075] [100] [ 0.000] [ 0.000]
[ 30905  630126  7712  7738] [100] [ 0.000] [ 0.000]
[ 15878  308518  3958  3976] [100] [ 0.000] [ 0.000]
[  8188  150574  2039  2055] [100] [ 0.000] [ 0.000]
[  4240   74750  1057  1063] [100] [ 0.000] [ 0.000]
[  2208   37340   547   555] [100] [ 0.000] [ 0.001]
[  1155   18804   285   291] [100] [ 0.000] [ 0.001]
[   606    9272   149   154] [100] [ 0.000] [ 0.002]
[   321    4636    79    82] [100] [ 0.000] [ 0.004]
[   175    2312    43    46] [100] [ 0.000] [ 0.008]
[    96    1166    23    25] [100] [ 0.000] [ 0.016]
nvtxs: 96, balance: 1.029
MPI: On host [blank], Program /s1/ParMetis/ParMetis-3.1/Graphs/mtest, Rank 3, Process 7296 received signal SIGSEGV(11)
MPI: --------stack traceback-------
Source file not found or not readable, tried...
  ./../sysdeps/unix/sysv/linux/waitpid.c
  /s1/ParMetis/ParMetis-3.1/Graphs/../sysdeps/unix/sysv/linux/waitpid.c
  ./waitpid.c
  /s1/ParMetis/ParMetis-3.1/Graphs/waitpid.c
(Cannot find source file ../sysdeps/unix/sysv/linux/waitpid.c)
MPI: Linux Application Debugger for Itanium(R)-based applications, Version 9.0-12, Build 20050729
MPI: Reading symbolic information from /s1/ParMetis/ParMetis-3.1/Graphs/mtest...No debugging symbols found
MPI: Attached to process id 7296 ....
MPI: stopped at [__pid_t __libc_waitpid(__pid_t, int*, int):32 0x20000000006c5481]
MPI: >0  0x20000000006c5481 in __libc_waitpid(pid=7297, stat_loc=0x60000ffffff77e10, options=0) "../sysdeps/unix/sysv/linux/waitpid.c":32
MPI: #1  0x20000000000da440 in mpi_sgi_system(...) in /usr/lib/libmpi.so
MPI: #2  0x20000000000da930 in first_arriver_handler(...) in /usr/lib/libmpi.so
MPI: #3  0x20000000000da280 in slave_sig_handler(...) in /usr/lib/libmpi.so
MPI: #4  0xa0000000000040c0
MPI: #5  0x400000000007f340 in iidxsort__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #6  0x40000000000380a0 in Moc_KWayFM__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #7  0x40000000000258b0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #8  0x40000000000255c0 in Moc_Global_Partition__(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
  (frames #9 through #17 are identical to frame #8: recursive calls to Moc_Global_Partition__)
MPI: #18 0x4000000000024cc0 in ParMETIS_V3_PartKway(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #19 0x4000000000016fa0 in ParMETIS_V3_PartMeshKway(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #20 0x4000000000002b40 in main(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: #21 0x20000000005a6990 in __libc_start_main(main=0x4000000000141510, argc=3, ubp_av=0x60000fffffffae98, init=0x200000000077c200, fini=0x200000000077c200, rtld_fini=0x20000000000da440, stack_end=0xc000000000000288) "../sysdeps/generic/libc-start.c":205
MPI: #22 0x4000000000001d80 in _start(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: -----stack traceback ends-----
MPI: On host [blank], Program /s1/ParMetis/ParMetis-3.1/Graphs/mtest, Rank 3, Process 7296: Dumping core on signal SIGSEGV(11) into directory /s1/ParMetis/ParMetis-3.1/Graphs
MPI: MPI_COMM_WORLD rank 3 has terminated without calling MPI_Finalize()
MPI: aborting job
MPI: Received signal 11
Alex
Posted on Tuesday, October 04, 2005 - 03:02 am:

So, I recompiled with the latest SGI SCSL and MPT libraries and included the -debug option. It seems that kmetis.c in ParMETISLIB is the culprit. A quick Google search on the problem turns up vague suggestions that the order of MPI calls in Moc_Global_Partition__(...) in kmetis.c is to blame... Help, please!

$ mpirun -np 4 ./mtest bricks.hex3d 2 > logb
MPI: On host *****.com, Program /s1/ParMetis/ParMetis-3.1/Graphs/mtest, Rank 3, Process 17994 received signal SIGSEGV(11)
MPI: --------stack traceback-------
Source file not found or not readable, tried...
  ./../sysdeps/unix/sysv/linux/waitpid.c
  /s1/ParMetis/ParMetis-3.1/Graphs/../sysdeps/unix/sysv/linux/waitpid.c
  ./waitpid.c
  /s1/ParMetis/ParMetis-3.1/Graphs/waitpid.c
(Cannot find source file ../sysdeps/unix/sysv/linux/waitpid.c)
MPI: Linux Application Debugger for Itanium(R)-based applications, Version 9.0-12, Build 20050729
MPI: Reading symbolic information from /s1/ParMetis/ParMetis-3.1/Graphs/mtest...done
MPI: Attached to process id 17994 ....
MPI: stopped at [__pid_t __libc_waitpid(__pid_t, int*, int):32 0x2000000003f95481]
MPI: >0  0x2000000003f95481 in __libc_waitpid(pid=17995, stat_loc=0x60000ffffff782a0, options=0) "../sysdeps/unix/sysv/linux/waitpid.c":32
MPI: #1  0x20000000000fa700 in MPI_SGI_stacktraceback(...) in /usr/lib/libmpi.so
MPI: #2  0x20000000000fb3e0 in slave_sig_handler(...) in /usr/lib/libmpi.so
MPI: #3  0xa0000000000040c0
MPI: #4  0x400000000007f780 in iidxsort__(total_elems=9, pbase=0x30013) "iidxsort.c":136
MPI: #5  0x40000000000380e0 in Moc_KWayFM__(ctrl=0x97d956d952972800, graph=0x30013, wspace=0x1, npasses=65598) "kwayfm.c":489
MPI: #6  0x4000000000025930 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":257
MPI: #7  0x4000000000025640 in Moc_Global_Partition__(ctrl=0x1, graph=0x1003e, wspace=0x0) "kmetis.c":228
  (frames #8 through #15 are identical to frame #7: recursive calls at kmetis.c:228)
MPI: #16 0x4000000000025640 in Moc_Global_Partition__(ctrl=0x97d956d952c77800, graph=0x30013, wspace=0x0) "kmetis.c":228
MPI: #17 0x4000000000024d40 in ParMETIS_V3_PartKway(vtxdist=0x60000ffffffba5e0, xadj=0x400000000013e170, adjncy=0x1cb91, vwgt=0x1f390e, adjwgt=0x6000000000008908, wgtflag=0x0, numflag=0x60000ffffffba550, ncon=0x1, nparts=0x60000fffffffa948, tpwgts=0x60000000001541d0, ubvec=0x60000fffffffa8c0, options=0x60000fffffffa840, edgecut=0x60000fffffffa8b0, part=0x6000000000e01ff0, comm=0x60000ffffffba694) "kmetis.c":137
MPI: #18 0x4000000000017020 in ParMETIS_V3_PartMeshKway(elmdist=0xffffffffffffffff, eptr=0x1003e, eind=0x200000000484d690, elmwgt=0x1003e, wgtflag=0x400000001, numflag=0x600000000001097e, ncon=0x0, ncommonnodes=0x0, nparts=0x60000fffffffa948, tpwgts=0x60000000001541d0, ubvec=0x60000fffffffa8c0, options=0x60000fffffffa840, edgecut=0x60000fffffffa8b0, part=0x6000000000e01ff0, comm=0x60000fffffffa93c) "mmetis.c":79
MPI: #19 0x4000000000002b40 in main(argc=3, argv=0x60000fffffffae18) "mtest.c":72
MPI: #20 0x2000000003e76990 in __libc_start_main(main=0x4000000000141950, argc=3, ubp_av=0x60000fffffffae18, init=0x200000000404c200, fini=0x200000000404c200, rtld_fini=0x20000000000fa700, stack_end=0xc00000000000040e) "../sysdeps/generic/libc-start.c":205
MPI: #21 0x4000000000001d80 in _start(...) in /s1/ParMetis/ParMetis-3.1/Graphs/mtest
MPI: -----stack traceback ends-----
MPI: On host ******.com, Program /s1/ParMetis/ParMetis-3.1/Graphs/mtest, Rank 3, Process 17994: Dumping core on signal SIGSEGV(11) into directory /s1/ParMetis/ParMetis-3.1/Graphs
MPI: MPI_COMM_WORLD rank 3 has terminated without calling MPI_Finalize()
MPI: aborting job
MPI: Received signal 11
Alex
Posted on Tuesday, October 04, 2005 - 07:50 pm:

A bit of an update here... It seems that SGI MPT 1.12 and icc v9 have some conflicts when the compiler tries to optimize the code. The ParMETIS tests run when optimizations are disabled (-O0), but duly crash with the aforementioned messages at any other optimization level (-O1, -O2, -O3). Weird.
Alex
Posted on Thursday, October 06, 2005 - 12:14 am:

A bit more of an update, entitled "cursed C pointer arithmetic"... It turns out that the icc optimizer does not like what looks like a fairly standard routine: iidxsort (which implements a mix of the quicksort and insertion-sort algorithms). If I compile *just* that routine with optimizations disabled (-O0), while all the others are compiled with -O2, everything works fine. I don't know if this is just a coincidence on my machine, but... "how bizarre, how bizarre..."
Alex
Posted on Thursday, October 06, 2005 - 09:54 pm:

Hmmm... it now seems more likely to be a bug in Intel's optimization code ("cursed Intel!!!"). Anyhow, the following simple program causes a segmentation fault when compiled with any of icc's optimization flags (-O1, -O2, -O3), but runs just fine when compiled with -O0! Now, it's been about 13 years since I hacked away at pointer arithmetic, but it seems to me that iidxsort is perfectly fine. Any comments out there? Thanks!

-a

---------------------------------
$ icc -v
Version 9.0
$ icc -O0 idxt.c iidxsort.c -o idxt
$ ./idxt
>>>>
 0 1 1 1 2 3 3 3 3 6 6 8 8 9 9
<<<<
$ icc -O1 idxt.c iidxsort.c -o idxt
$ ./idxt
Segmentation fault
---------------------------------

#include <stdio.h>
/* #include "parmetis.h" */

typedef int idxtype;

/* quicksort/insertion-sort hybrid from ParMETISLIB (iidxsort.c) */
void iidxsort(int total_elems, idxtype *pbase);

int main(void)
{
    int i;
    idxtype nelem = 15;
    idxtype niz[] = {3,6,2,3,0,6,8,9,1,3,8,9,3,1,1};

    iidxsort(nelem, niz);

    printf(">>>>\n");
    for (i = 0; i < nelem; i++)
        printf(" %d", niz[i]);
    printf("\n<<<<\n");
    return 0;
}
Alex
Posted on Friday, October 07, 2005 - 03:39 pm:

To wrap this up, Intel confirms this is a bug in their compiler:

Intel(R) C Itanium(R) Compiler for Itanium(R)-based applications, Version 9.0, Build 20050912, Package ID: l_cc_c_9.0.026
george
Posted on Friday, October 07, 2005 - 04:01 pm:

Alex, I'm glad that you resolved this problem.

george
lijian
Posted on Friday, January 27, 2006 - 04:50 pm:

ParMETIS_V3_AdaptiveRepart crashes. When I use ParMETIS_V3_AdaptiveRepart to partition a very small graph (only 6 vertices) with more than 3 processes, it crashes (a very simple initial partition is given). When the number of processes is <= 3, it works. When I use ParMETIS_V3_PartKway on the same small graph, it also works. I would like my code to work for problem sizes ranging from hundreds to billions of vertices. Can you give me any suggestions?

The following is the input data, using 6 processes:

vtxdist = [0 1 2 3 4 5 6]
rank 0: xadj = [0 5]   adjncy = [2 3 4 5 1]
rank 1: xadj = [0 1]   adjncy = [0]
rank 2: xadj = [0 2]   adjncy = [0 4]
rank 3: xadj = [0 2]   adjncy = [0 4]
rank 4: xadj = [0 4]   adjncy = [0 3 5 2]
rank 5: xadj = [0 2]   adjncy = [0 4]
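Given the earlier remark in this thread that ParMetis has trouble with small graphs, one pragmatic pattern (a sketch of a caller-side workaround, not a ParMETIS facility) is to route tiny graphs to a fallback path, e.g. gathering the graph and partitioning serially on one rank, and only use the parallel repartitioner above a size threshold. The function name and threshold below are purely illustrative:

```c
/* Decide whether a graph is large enough to repartition in parallel.
 * With ~1 vertex per process (as in the 6-vertex example above),
 * parallel refinement has very little to work with; below the
 * (illustrative) threshold the caller would gather the graph and
 * partition it serially instead. */
static int large_enough_for_parallel(long global_nvtxs, int nprocs)
{
    const long min_vtxs_per_proc = 25;   /* illustrative safety margin */
    return global_nvtxs >= min_vtxs_per_proc * (long)nprocs;
}
```

The same predicate also helps at the other end of the stated range: a code meant to scale from hundreds to billions of vertices needs both paths anyway.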