Figure 5 shows what happens to scalability when the total problem size
is fixed and the number of processors is varied using PETSc on IA32. Ideal
scaling would look like
. Total problem size is fixed at 8
million unknowns and the number of processors
ranges from 1 to 256. Again, BiCGStab preconditioned
with Block Jacobi is the best choice.
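
A minimal sketch of how fixed-total-size scaling is commonly quantified: speedup and parallel efficiency relative to the smallest run. The function name `strong_scaling` and the timings below are hypothetical and are not taken from the measurements in the figures; the metric plotted in the figures may differ.

```python
def strong_scaling(times):
    """Return {p: (speedup, efficiency)} for a fixed total problem size.

    `times` maps a processor count p to a wall-clock solve time in seconds.
    The smallest processor count present is used as the baseline.
    """
    base_p = min(times)
    base_t = times[base_p]
    result = {}
    for p in sorted(times):
        speedup = base_t / times[p]
        # Ideal speedup over the baseline is p / base_p, so efficiency is
        # the measured speedup divided by that ideal value.
        efficiency = speedup * base_p / p
        result[p] = (speedup, efficiency)
    return result


# Hypothetical timings in seconds (assumed values, not data from the paper).
example_times = {1: 64.0, 4: 20.0, 16: 8.0}
for p, (s, e) in strong_scaling(example_times).items():
    print(f"p={p:3d}  speedup={s:5.2f}  efficiency={e:4.2f}")
```

An efficiency near 1 indicates near-ideal scaling. Values that fall as the processor count grows indicate that communication or synchronization is taking a larger share of the run time.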
Figure 6 shows a cross-platform comparison between the IA32 cluster
and the Origin2000 with fixed total problem size scaling. The much smaller
theoretical peak flop rate of the Origin2000 is most evident in the single
processor case and is less pronounced as the number of processors
increases.
Figure 7 is the non-symmetric version of Figure 3. Since this is a non-symmetric problem, the best non-multigrid method is now BiCGStab preconditioned with block Jacobi. Again, it is evident that the multigrid-preconditioned method scales far better than the others because of the superior algorithmic scaling of multigrid.
In Figure 8, the Linux cluster shows better single-processor performance than the SGI Origin2000 (about 15 to 20% faster). The Linux cluster scales very well as the number of processors increases, while the SGI Origin2000 does not scale as well. Comparing the two numerical techniques, GMRES with a multigrid preconditioner shows better single-processor performance. However, multigrid scales better. Therefore, as the number of processors increases, multigrid becomes more effective.
Table 1 demonstrates the scalability of different components of the solver using information from PETSc's built-in instrumentation. The tests were run on a non-symmetric problem using BiCGStab preconditioned with block Jacobi on IA32. Using the log files generated by PETSc, we have tabulated the percentage of time spent in a subset of the function calls made during a linear solve. PETSc inserts a barrier call in the dot-product function when profiling is turned on. The VecDotBarrier times reflect the synchronization delay in the dot products. The MatMult time scales well with the total solve time, but the VecDotBarrier time does not. This indicates that, as the number of processors increases, VecDotBarrier increasingly becomes the bottleneck for good performance and scaling. On the other hand, the MatSolve portion of the solve generally decreases as the number of processors grows, due to the almost embarrassingly parallel nature of the solves on the individual blocks of the block Jacobi preconditioner.
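
To illustrate why the block solves need no communication, the following sketch applies a block Jacobi preconditioner to a residual vector in plain Python. It is an illustration under stated assumptions, not the PETSc implementation: the helper names `solve_dense` and `block_jacobi_apply` are hypothetical, the matrix is stored densely, and each diagonal block is solved exactly by Gaussian elimination, whereas a production solver would typically use a sparse (possibly incomplete) factorization of each block.

```python
def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def block_jacobi_apply(a, r, block_size):
    """Return z such that (block-diagonal part of a) z = r."""
    n = len(r)
    z = [0.0] * n
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # Only the diagonal block is read: entries coupling this block to
        # other blocks are ignored, so every iteration is independent.
        block = [row[start:end] for row in a[start:end]]
        z[start:end] = solve_dense(block, r[start:end])
    return z


# Small assumed example: the 9.0 entries couple the two blocks and are
# deliberately ignored by the preconditioner.
A = [
    [4.0, 1.0, 9.0, 9.0],
    [1.0, 3.0, 9.0, 9.0],
    [9.0, 9.0, 2.0, 0.0],
    [9.0, 9.0, 1.0, 5.0],
]
r = [6.0, 7.0, 4.0, 12.0]
print(block_jacobi_apply(A, r, block_size=2))
```

Each pass through the loop in `block_jacobi_apply` touches only its own rows of the residual and its own diagonal block, so in a parallel run each processor can solve its block without exchanging data. A dot product, by contrast, requires a global reduction across all processors.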