Parallel Multilevel Graph Partitioning *

George Karypis and Vipin Kumar
University of Minnesota, Department of Computer Science, Minneapolis, MN 55455

Abstract

In this paper we present a parallel formulation of a graph partitioning and sparse matrix ordering algorithm that is based on a multilevel algorithm we developed recently. Our parallel algorithm achieves a speedup of up to 56 on a 128-processor Cray T3D for moderate size problems, further reducing its already moderate serial run time. Graphs with over 200,000 vertices can be partitioned into 128 parts on a 128-processor Cray T3D in less than 3 seconds. This is at least an order of magnitude better than any previously reported run time on 128 processors for obtaining a 128-way partition. This also makes it possible to use our parallel graph partitioning algorithm to partition meshes dynamically in adaptive computations. Furthermore, the quality of the produced partitions and orderings is comparable to that of the serial multilevel algorithm, which has been shown to substantially outperform both spectral partitioning and multiple minimum degree.

1 Introduction

Graph partitioning is an important problem that has extensive applications in many areas, including scientific computing, VLSI design, task scheduling, geographical information systems, and operations research. The problem is to partition the vertices of a graph into $p$ roughly equal parts, such that the number of edges connecting vertices in different parts is minimized. The efficient implementation of many parallel algorithms usually requires the solution to a graph partitioning problem, where vertices represent computational tasks, and edges represent data exchanges. A $p$-way partition of the computation graph can be used to assign tasks to $p$ processors. Because the partition assigns an equal number of computational tasks to each processor, the work is balanced among the $p$ processors, and because it minimizes the edge-cut, the communication overhead is also minimized. For example, the solution of a sparse system of linear equations $Ax = b$ via iterative methods on a parallel computer gives rise to a graph partitioning problem. A key step in each iteration of these methods is the multiplication of a sparse matrix and a (dense) vector. Partitioning the graph that corresponds to matrix $A$ can significantly reduce the amount of communication [18]. If parallel direct methods are used to solve a sparse system of equations, then a graph partitioning algorithm can be used to compute a fill reducing ordering that leads to a high degree of concurrency in the factorization phase [18, 6].

The graph partitioning problem is NP-complete. However, many algorithms have been developed that find a reasonably good partition. Spectral partitioning methods [21, 11] provide good quality graph partitions, but have very high computational complexity. Geometric partition methods [9, 20] are quite fast but they often provide worse partitions than those of more expensive methods such as spectral. Furthermore, geometric methods are applicable only if coordinate information for the graph is available. Recently, a number of researchers have investigated a class of algorithms that are based on multilevel graph partitioning and that have moderate computational complexity [4, 11, 14, 15, 13]. Some of these multilevel schemes [4, 11, 14, 15, 13] provide excellent (even better than spectral) graph partitions. Even though these multilevel algorithms are quite fast compared with spectral methods, performing a multilevel partitioning in parallel is desirable for many reasons, including adaptive grid computations, computing fill reducing orderings for parallel direct factorizations, and taking advantage of the aggregate amount of memory available on parallel computers.

A significant amount of work has been done in developing parallel algorithms for partitioning unstructured graphs and for producing fill reducing orderings for sparse matrices [2, 5, 8, 7, 12]. Only moderately good speedups have been obtained for parallel formulations of graph partitioning algorithms that use geometric methods [9, 5], despite the fact that geometric partitioning algorithms are inherently easier to parallelize. All parallel formulations presented so far for spectral partitioning have reported fairly small speedups [2, 1, 12] unless the graph has been distributed to the processors so that a certain degree of data locality is achieved [1].

In this paper we present a parallel formulation of a graph partitioning and sparse matrix ordering algorithm that is based on a multilevel algorithm we developed recently [14]. A key feature of our parallel formulation (that distinguishes it from other proposed parallel formulations of multilevel algorithms [2, 1, 22]) is that it partitions the vertices of the graph into $\sqrt{p}$ parts while distributing the overall adjacency matrix of the graph among all $p$ processors. As shown in [16], this mapping is usually much better than one-dimensional distribution, when no partitioning information about the graph is known. Our parallel algorithm achieves a speedup of up to 56 on 128 processors for moderate size problems, further reducing the already moderate serial run time of multilevel schemes. Furthermore, the quality of the produced partitions and orderings is comparable to that of the serial multilevel algorithm, which has been shown to outperform both spectral partitioning and multiple minimum degree [14]. The parallel formulation in this paper is described in the context of the serial multilevel graph partitioning algorithm presented in [14]. However, nearly all of the discussion in this paper is applicable to other multilevel graph partitioning algorithms [4, 11, 15].

---

*This work was supported by NSF CCR-9423082 and by the Army Research Office contract DAAH04-95-1-0538, and by Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Access to computing facilities was provided by Minnesota Supercomputer Institute, Cray Research Inc., and by the Pittsburgh Supercomputing Center. Related papers are available via WWW at URL: http://www.cs.umn.edu/users/kumar/papers.html

1063-7133/96 $5.00 © 1996 IEEE
Proceedings of IPPS '96

2 Multilevel Graph Partitioning

The $p$-way graph partitioning problem is defined as follows: Given a graph $G = (V, E)$ with $|V| = n$, partition $V$ into $p$ subsets $V_1, V_2, \ldots, V_p$ such that $V_i \cap V_j = \emptyset$ for $i \neq j$, $|V_i| = n/p$, and $\bigcup_i V_i = V$, and the number of edges of $E$ whose incident vertices belong to different subsets is minimized. A $p$-way partition of $V$ is commonly represented by a partition vector $P$ of length $n$, such that for every vertex $v \in V$, $P(v)$ is an integer between 1 and $p$, indicating the part to which vertex $v$ belongs. Given a partition $P$, the number of edges whose incident vertices belong to different subsets is called the edge-cut of the partition.
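As a small illustration of these definitions, the following sketch computes the edge-cut of a partition vector. The function name `edge_cut` and the edge-list representation are assumptions made for this example only; they are not taken from the implementation in [14].

```python
def edge_cut(edges, part):
    """Count edges whose endpoints lie in different parts.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    part:  list or dict mapping vertex -> part index (the partition vector P).
    """
    return sum(1 for u, v in edges if part[u] != part[v])


# A 4-cycle 0-1-2-3-0 split into parts {0, 1} and {2, 3}.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
P = [1, 1, 2, 2]          # P[v] is the part of vertex v, numbered from 1
cut = edge_cut(edges, P)  # edges (1, 2) and (3, 0) cross the partition
```

Here the balanced split cuts two of the four edges, while the alternating assignment `[1, 2, 1, 2]` would cut all four.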

The $p$-way partition problem is most frequently solved by recursive bisection. That is, we first obtain a 2-way partition of $V$, and then we further subdivide each part using 2-way partitions. After $\log p$ phases, graph $G$ is partitioned into $p$ parts. Thus, the problem of performing a $p$-way partition is reduced to that of performing a sequence of 2-way partitions or bisections. Even though this scheme does not necessarily lead to an optimal partition [15], it is used extensively due to its simplicity [6].
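The recursion can be sketched as follows, assuming $p$ is a power of two. The helper `halve` is a hypothetical placeholder that ignores the edges entirely; in practice it would be replaced by a real bisection routine such as the multilevel bisection discussed next.

```python
def recursive_bisection(vertices, p, bisect):
    """Split `vertices` into p parts (p a power of two) by repeated bisection.

    bisect: callable taking a list of vertices and returning two lists.
    Returns a list of p vertex lists.
    """
    if p == 1:
        return [list(vertices)]
    left, right = bisect(vertices)
    return (recursive_bisection(left, p // 2, bisect)
            + recursive_bisection(right, p // 2, bisect))


def halve(vertices):
    # Placeholder bisection: ignores edges and simply splits the list in half.
    mid = len(vertices) // 2
    return vertices[:mid], vertices[mid:]


parts = recursive_bisection(list(range(8)), 4, halve)
```

With four parts the recursion has depth two, matching the $\log p$ phases described above.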

The basic idea behind the multilevel graph bisection algorithm is very simple. The graph $G$ is first coarsened down to a few hundred vertices, a bisection of this much smaller graph is computed, and then this partition is projected back towards the original graph (finer graph), by periodically refining the partition. Since the finer graph has more degrees of freedom, such refinements usually decrease the edge-cut. This process is graphically illustrated in Figure 1. The reader should refer to [14] for further details.

3 Parallel Multilevel Graph Partitioning Algorithm

There are two types of parallelism that can be exploited in the $p$-way graph partitioning algorithm based on the multilevel bisection algorithms. The first type of parallelism is due to the recursive nature of the algorithm. Initially a single processor finds a bisection of the original graph. Then, two processors find bisections of the two subgraphs just created and so on. However, this scheme by itself can use only up to $\log p$ processors, and reduces the overall run time of the algorithm only by a factor of $O(\log p)$. We will refer to this type of parallelism as the parallelism associated with the recursive step.

The second type of parallelism that can be exploited is during the bisection step. In this case, instead of performing the bisection of the graph on a single processor, we perform it in parallel. We will refer to this type of parallelism as the parallelism associated with the bisection step. By parallelizing the divide step, the speedup obtained by the parallel graph partitioning algorithm is not bounded by $O(\log p)$, and can be significantly higher than that.

The parallel graph partitioning algorithm we describe in this section exploits both of these types of parallelism. Initially all the processors cooperate to bisect the original graph $G$, into $G_0$ and $G_1$. Then, half of the processors bisect $G_0$, while the other half of the processors bisect $G_1$. This step creates four subgraphs $G_{00}, G_{01}, G_{10},$ and $G_{11}$. Then each quarter of the processors bisect one of these subgraphs and so on. After $\log p$ steps, the graph $G$ has been partitioned into $p$ parts.
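The way processor groups shrink from step to step can be sketched as below. Assigning contiguous ranks to each group is an assumption made for illustration; the text only states that the processors are split in half at every step.

```python
def processor_groups(p, step):
    """Processor ranks cooperating on each subgraph at a given recursion step.

    At step 0 all p processors share one graph; each later step halves the
    groups. p is assumed to be a power of two and 0 <= step < log2(p).
    """
    group_size = p >> step
    return [list(range(start, start + group_size))
            for start in range(0, p, group_size)]


groups_step0 = processor_groups(8, 0)  # one group of 8 bisects G
groups_step1 = processor_groups(8, 1)  # two groups of 4 bisect G_0 and G_1
groups_step2 = processor_groups(8, 2)  # four groups of 2 bisect G_00 .. G_11
```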

In the next three sections we describe how we have parallelized the three phases of the multilevel bisection algorithm.

3.1 Coarsening Phase

During the coarsening phase, a sequence of coarser graphs is constructed. A coarser graph $G_{i+1} = (V_{i+1}, E_{i+1})$ is constructed from the finer graph $G_i = (V_i, E_i)$ by finding a maximal matching $M_i$ and contracting the vertices and edges of $G_i$ to form $G_{i+1}$. This is the most time consuming of the three phases; hence, it needs to be parallelized effectively. Furthermore, the amount of communication required during the contraction of $G_i$ to form $G_{i+1}$ depends on how the matching is computed.
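One serial coarsening step can be sketched as follows. This is a simplified illustration, not the implementation from [14]: the adjacency representation (a dict of dicts holding edge weights), the fixed random seed, and the tie-breaking rule are all assumptions. Each unmatched vertex is paired with the unmatched neighbour reached through its heaviest edge, and matched pairs are then collapsed into single coarse vertices.

```python
import random


def heavy_edge_matching(adj, seed=0):
    """adj: dict vertex -> dict neighbour -> edge weight (symmetric).

    Returns `match` with match[v] == partner, or v itself if v stays unmatched.
    """
    order = list(adj)
    random.Random(seed).shuffle(order)      # visit vertices in random order
    match = {}
    for v in order:
        if v in match:
            continue
        free = [(w, u) for u, w in adj[v].items() if u not in match and u != v]
        if free:
            _, u = max(free)                # heaviest edge to a free neighbour
            match[v], match[u] = u, v
        else:
            match[v] = v                    # no free neighbour left
    return match


def contract(adj, match):
    """Collapse matched pairs into coarse vertices, summing parallel edges."""
    coarse_id = {}
    for v in adj:
        if v not in coarse_id:
            coarse_id[v] = coarse_id[match[v]] = len(coarse_id)
    coarse = {c: {} for c in set(coarse_id.values())}
    for v, nbrs in adj.items():
        cv = coarse_id[v]
        for u, w in nbrs.items():
            cu = coarse_id[u]
            if cu != cv:                    # edges inside a pair disappear
                coarse[cv][cu] = coarse[cv].get(cu, 0) + w
    return coarse, coarse_id


# Path 0-1-2-3 with a heavy middle edge.
adj = {0: {1: 1}, 1: {0: 1, 2: 5}, 2: {1: 5, 3: 1}, 3: {2: 1}}
match = heavy_edge_matching(adj)
coarse, coarse_id = contract(adj, match)
```

Which pairs are matched depends on the visiting order, but the result is always a maximal matching, and the coarse graph has one vertex fewer for every matched pair.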

On a serial computer, computing a maximal matching can be done very efficiently using randomized algorithms. However, computing a maximal matching in parallel, and particularly on a distributed memory parallel computer, is hard. A direct parallelization of the serial randomized algorithms or of algorithms based on depth first graph traversals requires a significant amount of communication. Communication overhead can be reduced if the graph is initially partitioned among processors in such a way that the number of edges going across processor boundaries is small. But this requires solving the $p$-way graph partitioning problem, which is what we are trying to solve in the first place.

Another way of computing a maximal matching is to divide the $n$ vertices among $p$ processors and then compute matchings between the vertices locally assigned within each processor. The advantage of this approach is that no communication is required to compute the matching, and since each pair of vertices that gets matched belongs to the same processor, no communication is required to move adjacency lists between processors. However, this approach causes problems because each processor has very few vertices to match from. Also, even though there is no need to exchange the adjacency lists among processors, each processor needs to know matching information about all the vertices that its local vertices are connected to in order to properly form the contracted graph. As a result, a significant amount of communication is required. In fact, this computation is very similar in nature to the multiplication of a randomly sparse matrix (corresponding to the graph) with a vector (corresponding to the matching vector).

In our parallel coarsening algorithm, we retain the advantages of the previous scheme, but minimize its drawbacks by computing the matchings between groups of $n/\sqrt{p}$ vertices. This increases the size of the computed matchings, and also, as discussed in [16], the communication overhead for constructing the coarse graph is decreased. Specifically, our parallel coarsening algorithm treats the $p$ processors as a two-dimensional array of $\sqrt{p} \times \sqrt{p}$ processors (where $p = 2^l$). The vertices of the graph $G_0 = (V_0, E_0)$ are distributed among this processor grid using a cyclic mapping [18]. The vertices $V_0$ are partitioned into $\sqrt{p}$ subsets $V_0^0, V_0^1, \ldots, V_0^{\sqrt{p}-1}$. Processor $P_{i,j}$ stores the edges of $E_0$ between the subsets of vertices $V_0^i$ and $V_0^j$. Having distributed the data in this fashion, the algorithm then proceeds to find a matching. This matching is computed by the processors along the diagonal of the processor grid. In particular, each diagonal processor $P_{i,i}$ finds a heavy-edge matching $M_0^i$ using the set of edges it stores locally. The union of these $\sqrt{p}$ matchings is taken as the overall matching $M_0$. Since the vertices are split into $\sqrt{p}$ parts, this scheme finds larger matchings than the one that partitions vertices into $p$ parts.
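The two-dimensional distribution can be sketched as follows. The rule "vertex $v$ belongs to subset $v \bmod \sqrt{p}$" is one possible reading of the cyclic mapping and is an assumption of this sketch, as is the requirement that $p$ be a perfect square. The dictionary `grid` stands in for the local edge storage of each processor.

```python
import math


def distribute_edges(edges, p):
    """Map adjacency entries onto a sqrt(p) x sqrt(p) processor grid.

    Vertex v is placed in subset v mod q (q = sqrt(p)). The entry (u, v) is
    stored on processor (subset(u), subset(v)); off-diagonal entries are also
    stored mirrored, as in a symmetric adjacency matrix.
    """
    q = math.isqrt(p)
    assert q * q == p, "this sketch assumes p is a perfect square"
    grid = {(i, j): [] for i in range(q) for j in range(q)}
    for u, v in edges:
        i, j = u % q, v % q
        grid[(i, j)].append((u, v))
        if i != j:
            grid[(j, i)].append((v, u))
    return grid


edges = [(0, 2), (0, 1), (1, 3), (2, 3)]
grid = distribute_edges(edges, 4)
# Diagonal processors hold the edges internal to one vertex subset, which
# are the only candidates for the local heavy-edge matchings.
diagonal = {i: grid[(i, i)] for i in range(2)}
```

In this example processor $P_{0,0}$ holds the edge between the even vertices 0 and 2, and $P_{1,1}$ holds the edge between the odd vertices 1 and 3, so each diagonal processor can match within its own subset without communication.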

The coarsening algorithm continues until the number of vertices between successive coarser graphs does not substantially decrease. Assume that this happens after $k$ coarsening levels. At this point, graph $G_k = (V_k, E_k)$ is folded into the lower quadrant of the processor subgrid. The coarsening algorithm then continues by creating coarser graphs. Since the subgraph of the diagonal processors of this smaller processor grid contains more vertices and edges, larger matchings can be found and thus the size of the graph is reduced further. This process of coarsening followed by folding continues until the entire coarse graph has been folded down to a single processor, at which point the sequential coarsening algorithm is employed to coarsen the graph.

Since, between successive coarsening levels, the size of the graph decreases, the coarsening scheme just described utilizes more processors during the coarsening levels in which the graphs are large and fewer processors for the smaller graphs. As our analysis in [16] shows, decreasing the size of the processor grid does not affect the overall performance of the algorithm as long as the graph size shrinks by a certain factor between successive graph foldings.

3.2 Initial Partitioning Phase

At the end of the coarsening phase, the coarsest graph resides on a single processor. We use the Greedy Graph Growing (GGGP) algorithm described in [14] to partition the coarsest graph. We perform a small number of GGGP runs starting from different random vertices, and the one with the smallest edge-cut is selected as the partition. Instead of having a single processor perform these different runs, the coarsest graph can be replicated to all (or a subset of) processors, and each of these processors can perform its own GGGP partition. We did not implement this, since the run time of the initial partition phase is only a very small fraction of the run time of the overall algorithm.
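The idea of growing a part from a seed vertex and keeping the best of several runs can be sketched as below. This is a simplified, unweighted illustration written for this section; the actual GGGP algorithm is specified in [14], and the gain rule, the tie-breaking, and the names `grow_bisection` and `gggp` are assumptions.

```python
import random


def grow_bisection(adj, start):
    """Grow a region from `start` until it holds half of the vertices.

    adj: dict vertex -> set of neighbours (unweighted, symmetric).
    """
    region = {start}
    target = len(adj) // 2
    while len(region) < target:
        frontier = {u for v in region for u in adj[v]} - region
        if not frontier:                    # disconnected graph: take any vertex
            frontier = set(adj) - region

        def gain(u):
            # Edges that stop being cut minus edges that become cut.
            inside = len(adj[u] & region)
            return inside - (len(adj[u]) - inside)

        region.add(max(sorted(frontier), key=gain))
    return region


def cut_size(adj, region):
    return sum(1 for v in region for u in adj[v] if u not in region)


def gggp(adj, trials=4, seed=0):
    """Run several growths from random start vertices and keep the best cut."""
    rng = random.Random(seed)
    starts = rng.sample(sorted(adj), min(trials, len(adj)))
    return min((grow_bisection(adj, s) for s in starts),
               key=lambda r: cut_size(adj, r))


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
best = gggp(adj)
best_cut = cut_size(adj, best)
```

On this graph every start vertex grows into one of the two triangles, so the selected bisection cuts only the bridging edge.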

3.3 Uncoarsening Phase

During the uncoarsening phase, the partition of the coarsest graph $G_m$ is projected back to the original graph by going through the intermediate graphs $G_{m-1}, G_{m-2}, \ldots, G_1$. After each step of projection, the resulting partition is further refined by using vertex swap heuristics (based on Kernighan-Lin [17]) that decrease the edge-cut [14].

For refining the coarser graphs that reside on a single processor, we use the boundary Kernighan-Lin refinement algorithm (BKLR) described in [14]. However, the BKLR algorithm is sequential in nature and it cannot be used in its current form to efficiently refine a partition when the graph is distributed among a grid of processors [8]. In this case we use a different algorithm that tries to approximate the BKLR algorithm but is more amenable to parallel computations. The key idea behind our parallel refinement algorithm is to select a group of vertices to swap from one part to the other instead of selecting a single vertex. Refinement schemes that use similar ideas are described in [5]. However, our algorithm differs in two important ways from the other schemes: (i) it uses a different method for selecting vertices; (ii) it uses a two-dimensional partition to minimize communication.

The parallel refinement algorithm consists of a number of phases. During each phase, at each diagonal processor a group of vertices is selected from one of the two parts and is moved to the other part. The group of vertices selected by each diagonal processor corresponds to the vertices that lead to a decrease in the edge-cut. This process continues by alternating the part from where vertices are moved, until no further improvement in the overall edge-cut can be made, or a maximum number of iterations has been reached. In our experiments, the maximum number of iterations was set to six. Balance between partitions is maintained by (a) starting the sequence of vertex swaps from the heavier part of the partition, and (b) by employing an explicit balancing iteration at the end of each refinement phase if there is more than 2% load imbalance between the parts of the partition.

Our parallel variation of the Kernighan-Lin refinement algorithm has a number of interesting properties that positively affect its performance and its ability to refine the partition. First, the task of selecting the group of vertices to be moved from one part to the other is distributed among the diagonal processors instead of being done serially. Second, the task of updating the internal and external degrees of the affected vertices is distributed among all the $p$ processors. Furthermore, by restricting the moves in each phase to be unidirectional (i.e., they go only from one part to the other) instead of being bidirectional (i.e., allowing both types of moves in each phase), we can guarantee that each vertex in the group of vertices being moved reduces the edge-cut.
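A serial sketch of the unidirectional group moves is given below. It is an illustration only: it runs on one process, uses unweighted edges, selects every vertex whose external degree exceeds its internal degree, and omits the explicit balancing iteration. The function names and the stopping rule (two consecutive phases without a move) are assumptions.

```python
def cut(adj, part):
    """Edge-cut of a 2-way partition; adj maps vertex -> list of neighbours."""
    return sum(1 for v in adj for u in adj[v] if v < u and part[v] != part[u])


def refine_phase(adj, part, source):
    """Move, as one group, every vertex of part `source` whose external degree
    exceeds its internal degree. Returns the list of moved vertices."""
    movers = []
    for v, nbrs in adj.items():
        if part[v] != source:
            continue
        external = sum(1 for u in nbrs if part[u] != source)
        internal = len(nbrs) - external
        if external > internal:
            movers.append(v)
    for v in movers:                        # all moves go in the same direction
        part[v] = 1 - source
    return movers


def refine(adj, part, max_iters=6):
    sizes = [sum(1 for v in part if part[v] == s) for s in (0, 1)]
    source = 0 if sizes[0] > sizes[1] else 1    # start from the heavier part
    idle = 0
    for _ in range(max_iters):
        moved = refine_phase(adj, part, source)
        idle = 0 if moved else idle + 1
        if idle == 2:                       # neither direction can improve
            break
        source = 1 - source                 # alternate the direction
    return part


# Two triangles joined by edge 2-3; vertex 2 starts on the wrong side.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}
cut_before = cut(adj, part)
refine(adj, part)
cut_after = cut(adj, part)
```

Because all vertices in a group leave the same part, an edge between two moved vertices stays uncut, so the real improvement of a group move is never smaller than the sum of the individually computed gains.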

In the serial implementation of BKLR, it is possible to make vertex moves that initially lead to a worse partition, but eventually (when more vertices are moved) a better partition is obtained. Thus, the serial implementation has the ability to climb out of local minima. However, the parallel refinement algorithm lacks this capability, as it never moves vertices if they increase the edge-cut. Also, the parallel refinement algorithm is not as precise as the serial algorithm, as it swaps groups of vertices rather than one vertex at a time. However, our experimental results show that it produces results that are not much worse than those obtained by the serial algorithm. The reason is that the graph coarsening process provides enough global view and the refinement phase only needs to provide minor local improvements.

4 Experimental Results

We evaluated the performance of the parallel multilevel graph partitioning and sparse matrix ordering algorithm on a wide range of matrices arising in finite element applications. The characteristics of these matrices are described in Table 1.

<table>
<thead>
<tr>
<th>Matrix Name</th>
<th>No. of Vertices</th>
<th>No. of Edges</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>4ELT</td>
<td>15606</td>
<td>45878</td>
<td>2D Finite element mesh</td>
</tr>
<tr>
<td>BCSSTK31</td>
<td>35588</td>
<td>572914</td>
<td>3D Stiffness matrix</td>
</tr>
<tr>
<td>BCSSTK32</td>
<td>44609</td>
<td>983066</td>
<td>3D Stiffness matrix</td>
</tr>
<tr>
<td>BRACK2</td>
<td>62561</td>
<td>365559</td>
<td>3D Finite element mesh</td>
</tr>
<tr>
<td>CANT</td>
<td>54195</td>
<td>1900797</td>
<td>3D Stiffness matrix</td>
</tr>
<tr>
<td>COPTER2</td>
<td>55476</td>
<td>122238</td>
<td>3D Finite element mesh</td>
</tr>
<tr>
<td>CYLINDER93</td>
<td>45594</td>
<td>1196266</td>
<td>3D Stiffness matrix</td>
</tr>
<tr>
<td>ROTOR</td>
<td>99517</td>
<td>662931</td>
<td>3D Finite element mesh</td>
</tr>
<tr>
<td>SHELL93</td>
<td>181200</td>
<td>2315965</td>
<td>3D Stiffness matrix</td>
</tr>
<tr>
<td>WAVE</td>
<td>156317</td>
<td>1059331</td>
<td>3D Finite element mesh</td>
</tr>
</tbody>
</table>

Table 1: Various matrices used in evaluating the multilevel graph partitioning and sparse matrix ordering algorithm.

We implemented our parallel multilevel algorithm on a 128-processor Cray T3D parallel computer. Each processor on the T3D is a 150MHz DEC Alpha chip. The processors are interconnected via a three-dimensional torus network that has a peak unidirectional bandwidth of 150 MBytes per second and a small latency. We used the SHMEM message passing library for communication. In our experimental setup, we obtained a peak bandwidth of 90 MBytes per second and an effective startup time of 4 microseconds.

Since each processor on the T3D has only 64 MBytes of memory, some of the larger matrices could not be partitioned on a single processor. For this reason, we compare the parallel run time on the T3D with the run time of the serial multilevel algorithm running on an SGI Challenge with 1.2 GBytes of memory and a 150MHz MIPS R4400. Even though the R4400 has a peak integer performance that is 10% lower than the Alpha, due to the significantly higher amount of secondary cache available on the SGI machine (1 Mbyte on SGI versus 0 Mbytes on T3D processors), the code running on a single processor T3D is about 15% slower than that running on the SGI. The computed speedups in the rest of this section are scaled to take this into account.$^1$ All times reported are in seconds. Since our multilevel algorithm uses randomization in the coarsening step, we performed all experiments with a fixed seed.
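As a worked example of this scaling, the footnoted formula $1.15 \cdot T_{SGI}/T_{T3D}$ can be applied to the 16-way partition of 4ELT, using the serial time 2.49 from Table 3 and the parallel time 0.48 from Table 2. The function name `scaled_speedup` is an assumption made for this illustration.

```python
def scaled_speedup(t_sgi, t_t3d, scale=1.15):
    """Speedup over the serial code, corrected for the slower T3D processor."""
    return scale * t_sgi / t_t3d


# 4ELT, 16-way partition: serial time on the SGI vs. parallel time on the T3D.
s = scaled_speedup(2.49, 0.48)
s_rounded = round(s, 1)
```

The result rounds to 6.0, which agrees with the speedup listed for 4ELT on 16 processors in Table 2.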

4.1 Graph Partitioning

The performance of the parallel multilevel algorithm for the matrices in Table 1 is shown in Table 2 for a $p$-way partition on $p$ processors, where $p$ is 16, 32, 64, and 128. The performance of the serial multilevel algorithm for the same set of matrices running on an SGI is shown in Table 3. For both the parallel and the serial multilevel algorithm, the edge-cut and the run time are shown in the corresponding tables. In the rest of this section we will first compare the quality of the partitions produced by the parallel multilevel algorithm, and then the speedup obtained by the parallel algorithm.

Figure 2 shows the size of the edge-cut of the parallel multilevel algorithm compared to the serial multilevel algorithm. Any bars above the baseline indicate that the parallel algorithm produces partitions with higher edge-cut than the serial algorithm. From this graph we can see that for most matrices, the edge-cut of the parallel algorithm is worse than that of the serial algorithm. This is due to the fact that the coarsening and refinement performed by the parallel algorithm are less powerful. But in most cases, the difference in edge-cut is quite small. For nine out of the ten matrices, the edge-cut of the parallel algorithm is within 10% of that of the serial algorithm. Furthermore, the difference in quality decreases as the number of partitions increases. The only exception is 4ELT, for which the edge-cut of the parallel 16-way partition is about 27% worse than the serial one. However, even for this problem, when larger partitions are considered, the relative difference in the edge-cut decreases; and for the 128-way partition, parallel multilevel does slightly better than the serial multilevel.

Figure 3 shows the size of the edge-cut of the parallel algorithm compared to the Multilevel Spectral Bisection algorithm (MSB) [3]. The MSB algorithm is a widely used algorithm that has been found to generate high quality partitions with small edge-cuts. We used the Chaco [10] graph partitioning package to produce the MSB partitions. As before, any bars above the baseline indicate that the parallel algorithm generates partitions with higher edge-cuts. From this figure we see that the quality of the parallel algorithm

---

$^1$ The speedup is computed as $1.15 \cdot T_{SGI}/T_{T3D}$, where $T_{SGI}$ and $T_{T3D}$ are the run times on SGI and T3D, respectively.

Table 2: The performance of the parallel multilevel graph partitioning algorithm. For each matrix, the performance is shown for 16, 32, 64, and 128 processors. \( T_p \) is the parallel run time for a \( p \)-way partition on \( p \) processors, \( E_{C_p} \) is the edge-cut of the \( p \)-way partition, and \( S \) is the speedup over the serial multilevel algorithm.

<table>
<thead>
<tr>
<th rowspan="2">Matrix</th>
<th colspan="3">\( p = 16 \)</th>
<th>\( p = 32 \)</th>
</tr>
<tr>
<th>\( T_p \)</th>
<th>\( E_{C_p} \)</th>
<th>\( S \)</th>
<th>\( T_p \)</th>
</tr>
</thead>
<tbody>
<tr>
<td>4ELT</td>
<td>0.48</td>
<td>1443</td>
<td>6.0</td>
<td>0.48</td>
</tr>
<tr>
<td>BCSSTK31</td>
<td>1.84</td>
<td>22754</td>
<td>17.0</td>
<td>0.32</td>
</tr>
<tr>
<td>BCSSTK32</td>
<td>1.69</td>
<td>43987</td>
<td>12.9</td>
<td>1.35</td>
</tr>
<tr>
<td>BRACK2</td>
<td>1.14</td>
<td>14987</td>
<td>8.6</td>
<td>1.83</td>
</tr>
<tr>
<td>CANT</td>
<td>3.30</td>
<td>159565</td>
<td>13.8</td>
<td>2.29</td>
</tr>
<tr>
<td>COPTER2</td>
<td>2.05</td>
<td>22109</td>
<td>7.4</td>
<td>1.78</td>
</tr>
<tr>
<td>CYLINDER93</td>
<td>2.35</td>
<td>131514</td>
<td>14.3</td>
<td>1.71</td>
</tr>
<tr>
<td>ROTOR</td>
<td>3.16</td>
<td>26532</td>
<td>11.0</td>
<td>2.39</td>
</tr>
<tr>
<td>SHELL93</td>
<td>5.80</td>
<td>54765</td>
<td>13.9</td>
<td>4.40</td>
</tr>
<tr>
<td>WAVE</td>
<td>5.10</td>
<td>57543</td>
<td>10.3</td>
<td>4.70</td>
</tr>
</tbody>
</table>

Table 3: The performance of the serial multilevel graph partitioning algorithm on an SGI for 16-, 32-, 64-, and 128-way partition. \( T_p \) is the run time for a \( p \)-way partition, and \( E_{C_p} \) is the edge-cut of the \( p \)-way partition.

<table>
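<thead>
<tr>
<th rowspan="2">Matrix</th>
<th colspan="2">\( p = 16 \)</th>
<th colspan="2">\( p = 32 \)</th>
<th colspan="2">\( p = 64 \)</th>
<th colspan="2">\( p = 128 \)</th>
</tr>
<tr>
<th>\( T_p \)</th>
<th>\( E_{C_p} \)</th>
<th>\( T_p \)</th>
<th>\( E_{C_p} \)</th>
<th>\( T_p \)</th>
<th>\( E_{C_p} \)</th>
<th>\( T_p \)</th>
<th>\( E_{C_p} \)</th>
</tr>
</thead>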
<tbody>
<tr>
<td>4ELT</td>
<td>2.49</td>
<td>1141</td>
<td>2.91</td>
<td>1836</td>
<td>3.55</td>
<td>2965</td>
<td>4.62</td>
<td>4600</td>
</tr>
<tr>
<td>BCSSTK31</td>
<td>1.96</td>
<td>2583</td>
<td>15.04</td>
<td>42305</td>
<td>17.82</td>
<td>65529</td>
<td>21.40</td>
<td>97819</td>
</tr>
<tr>
<td>BCSSTK32</td>
<td>16.22</td>
<td>43740</td>
<td>22.31</td>
<td>70654</td>
<td>25.92</td>
<td>106440</td>
<td>30.29</td>
<td>152081</td>
</tr>
<tr>
<td>BRACK2</td>
<td>16.02</td>
<td>14679</td>
<td>19.48</td>
<td>21665</td>
<td>22.78</td>
<td>29983</td>
<td>25.72</td>
<td>42625</td>
</tr>
<tr>
<td>CANT</td>
<td>37.32</td>
<td>199955</td>
<td>47.22</td>
<td>71986</td>
<td>56.53</td>
<td>442398</td>
<td>61.50</td>
<td>574553</td>
</tr>
<tr>
<td>COPTER2</td>
<td>15.32</td>
<td>21922</td>
<td>20.17</td>
<td>31364</td>
<td>19.50</td>
<td>43721</td>
<td>25.50</td>
<td>58809</td>
</tr>
<tr>
<td>CYLINDER93</td>
<td>29.21</td>
<td>126532</td>
<td>36.48</td>
<td>195532</td>
<td>45.68</td>
<td>289629</td>
<td>51.39</td>
<td>416190</td>
</tr>
<tr>
<td>ROTOR</td>
<td>30.71</td>
<td>24718</td>
<td>50.09</td>
<td>35100</td>
<td>41.93</td>
<td>533258</td>
<td>48.13</td>
<td>75110</td>
</tr>
<tr>
<td>SHELL93</td>
<td>60.97</td>
<td>51687</td>
<td>68.23</td>
<td>81384</td>
<td>90.65</td>
<td>124826</td>
<td>115.86</td>
<td>185323</td>
</tr>
<tr>
<td>WAVE</td>
<td>45.75</td>
<td>51300</td>
<td>54.37</td>
<td>71339</td>
<td>64.44</td>
<td>97978</td>
<td>71.98</td>
<td>129785</td>
</tr>
</tbody>
</table>

Figure 2: Quality (size of the edge-cut) of our parallel multilevel algorithm relative to the serial multilevel algorithm.

is almost never worse than that of the MSB algorithm. For eight out of the ten matrices, the parallel algorithm generated partitions with smaller edge-cuts, up to 50% better in some cases. On the other hand, for the matrices on which the parallel algorithm performed worse, it is only by a small factor (less than 6%). This figure (along with Figure 2) also indicates that our serial multilevel algorithm outperforms the MSB algorithm. An extensive comparison between our serial multilevel algorithm and MSB can be found in [14].

Tables 2 and 3 also show the run time of the parallel algorithm and the serial algorithm, respectively. A number of conclusions can be drawn from these results. First, as \( p \) increases, the time required for the \( p \)-way partition on \( p \) processors decreases. Depending on the size and characteristics of the matrix, this decrease is quite substantial. The decrease in the parallel run time is not linear in the increase in \( p \) but somewhat smaller, for the following reasons: (a) As \( p \) increases, the time required to perform the \( p \)-way partition also increases (there are more partitions to perform). (b) The parallel multilevel algorithm incurs communication and idling overhead that limits the asymptotic speedup to \( O(\sqrt{p}) \) unless a good partition of the graph is available even before the partitioning process starts [16].

4.2 Sparse Matrix Ordering

We used the parallel multilevel graph partitioning algorithm to find a fill reducing ordering via nested dissection. The performance of the parallel multilevel nested dissection algorithm (MLND) for various matrices is shown in Table 4. For each matrix, the table shows the parallel run time and the number of nonzeros in the Cholesky factor \( L \) of the resulting matrix for 16, 32, and 64 processors. On \( p \) processors, the ordering is computed by using nested dissection for the first \( \log p \) levels, and then multiple minimum degree [19] (MMD) is used to order the submatrices stored locally on each processor.

Table 4: The performance of the parallel MLND algorithm on 16, 32, and 64 processors for computing a fill reducing ordering of a sparse matrix. $T_p$ is the run time in seconds and $|L|$ is the number of nonzeros in the Cholesky factor of the matrix.

Figure 4 shows the relative quality of both serial and parallel MLND versus the MMD algorithm. These graphs were obtained by dividing the number of operations required to factor the matrix using MLND by that required by MMD. Any bars above the baseline indicate that the MLND algorithm requires more operations than the MMD algorithm. From this graph, we see that in most cases, the serial MLND algorithm produces orderings that require fewer operations than MMD. The only exception is BCSSTK32, for which the serial MLND requires twice as many operations.

Comparing the parallel MLND algorithm against the serial MLND, we see that the orderings produced by the parallel algorithm require more operations. However, as seen in Figure 4, the overall quality of the parallel MLND algorithm is usually within 20% of the serial MLND algorithm; the only exception is SHELL93. Also, the relative quality changes slightly as the number of processors used to find the ordering increases.

Figure 4: Quality of our parallel MLND algorithm relative to the multiple minimum degree algorithm and the serial MLND algorithm.

Comparing the run time of the parallel MLND algorithm (Table 4) with that of the parallel multilevel partitioning algorithm (Table 2), we see that the time required by ordering is somewhat higher than the corresponding partitioning time. This is due to the extra time taken by the approximate minimum cover algorithm and the MMD algorithm used during ordering. But the relative speedup between 16 and 64 processors is quite similar in both cases.

References


