David Biermann

April 26, 2000

Independent Study with Dr. Board

Graduation with Distinction Paper

 

Experiments in Parallel Computing

 

Matrix Multiplication

            Matrix multiplication is a simple and common algorithm used in everyday calculation and scientific computing. The purpose of this experiment was to speed up matrix multiplication by employing a cluster of closely networked computers working in parallel to complete a single multiplication. Having done this, the matrix multiplication program was used as a tool for evaluating the performance and efficiency of this kind of operation.

            The matrix multiplication algorithm was implemented using two different libraries for parallel computing. The first is PVM, or Parallel Virtual Machine, and the second is MPI, or Message Passing Interface. Both libraries allow the user to write parallel programs using an explicit message-passing model. This means that the user can easily spawn independent processes on different processors but is responsible for explicitly controlling all communication, in the form of sending and receiving data, between these processes. The performance differences between these parallel programming packages will be discussed in detail later; the data in this section are based on code written in MPI.
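            To illustrate the explicit message-passing style that both libraries impose, the sketch below shows a simple row-block decomposition of the multiplication in MPI: the root process broadcasts the full B matrix, scatters contiguous blocks of rows of A, and gathers the corresponding rows of C. This is a minimal sketch under assumed names and an assumed decomposition, not the exact code used in these experiments, and it assumes the matrix dimension divides evenly among the processes.

/* Hypothetical sketch of a row-block parallel matrix multiply in MPI.
 * Assumption for illustration only: N is divisible by the process count. */
#include <mpi.h>
#include <stdlib.h>

#define N 400   /* matrix dimension (assumed for the example) */

int main(int argc, char **argv)
{
    int rank, nprocs, rows, i, j, k;
    double *A = NULL, *C = NULL, *B, *Ablock, *Cblock, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    rows   = N / nprocs;                       /* rows of A per process */
    B      = malloc(N * N * sizeof(double));
    Ablock = malloc(rows * N * sizeof(double));
    Cblock = malloc(rows * N * sizeof(double));

    if (rank == 0) {
        A = malloc(N * N * sizeof(double));
        C = malloc(N * N * sizeof(double));
        /* ... fill A and B with input data on the root process ... */
    }

    /* Explicit communication: every process needs all of B ... */
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* ... but only its own contiguous block of rows of A. */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Ablock, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    /* Local computation: Cblock = Ablock * B */
    for (i = 0; i < rows; i++)
        for (j = 0; j < N; j++) {
            sum = 0.0;
            for (k = 0; k < N; k++)
                sum += Ablock[i * N + k] * B[k * N + j];
            Cblock[i * N + j] = sum;
        }

    /* Collect the row blocks of the result back on the root process. */
    MPI_Gather(Cblock, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}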

            One of the most important performance issues to consider is how the algorithm performs as the number of processors is increased. The biggest bottleneck in parallel computing on a network of workstations is the delay of sending messages between processes. In addition to network latency, the act of packing a message and sending it has significant computing overhead associated with it. For this reason, small problems do not scale well to parallel environments. By the same token, large problems can be significantly sped up by parallel implementations, but there is a limit to how many processors can be used effectively. As a large problem is broken up into smaller and smaller pieces, communication delays begin to become a significant portion of the total execution time. At some point, it becomes ineffective to add any more processors to the job. In order to explore this problem, the matrix multiplication algorithm was run on a Beowulf cluster consisting of 450 MHz dual-Pentium machines with 512 MB of RAM, connected via 100 Mb/s Fast Ethernet. On this cluster, the matrix multiplication algorithm was able to use around ten processors effectively on matrices of size 800x800 or more. For matrices as small as 400x400, only about six processors could be used effectively.
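            A rough back-of-the-envelope calculation makes this scaling behavior plausible. Under the row-block decomposition sketched above, each process performs about 2n^3/p floating-point operations but must still receive all n^2 entries of B plus its n^2/p rows of A, so the ratio of computation to data received shrinks roughly like 2n/(p+1) as processors are added and grows with the matrix size. The small program below is an illustration of that arithmetic under those assumptions, not a measurement, and prints the ratio for a few of the problem sizes and processor counts discussed here.

/* Illustrative computation-to-communication ratio for a row-block
 * decomposition: flops per process vs. doubles received per process.
 * These are assumed formulas for a sketch, not measured results. */
#include <stdio.h>

int main(void)
{
    int sizes[] = {400, 800};
    int procs[] = {2, 6, 10, 16};
    int s, q;

    for (s = 0; s < 2; s++)
        for (q = 0; q < 4; q++) {
            double n = sizes[s], p = procs[q];
            double flops  = 2.0 * n * n * n / p;   /* local multiply-adds   */
            double dwords = n * n + n * n / p;     /* doubles in: B + A block */
            printf("n=%4.0f p=%2.0f  flops per word received = %.0f\n",
                   n, p, flops / dwords);
        }
    return 0;
}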

            As mentioned above, each Beowulf machine has two processors per box. For this reason, it is interesting to consider the performance difference between using processors on different machines versus processors on the same machine. For example, if we were to run a parallel program on four processors, would there be a difference in performance if those four processors were distributed across four different Beowulf machines rather than only two? Using multiple processors in the same box decreases network traffic, but it means that the two processors will have to share the available system resources. The question really becomes: can running two independent processes on the same box make effective use of both processors? Tests show

 

Strassen’s Algorithm

            A normal, naïve implementation of matrix multiplication has an asymptotic running time of O(n^3). There are several matrix multiplication algorithms that improve slightly on this. One of the most famous is Strassen's Algorithm. Strassen's algorithm replaces the eight block multiplications of an ordinary divide-and-conquer multiply with seven, which reduces the asymptotic running time to O(n^(log2 7)), roughly O(n^2.81). Unfortunately, it does this at the cost of a significant amount of additional O(n^2) work in the form of extra matrix additions and subtractions. For this reason, Strassen's algorithm is only effective on very large problem sizes. For this experiment, I implemented sequential and parallel versions of Strassen's Algorithm to see if it would be effective in speeding up matrix multiplication on large problems. The results were somewhat disappointing. The sequential version of the program was able to realize small performance gains for large problem sizes. Unfortunately, for the problem sizes that this program was able to undertake, parallelizing the algorithm simply did not yield any performance gains. In any case, the performance gains do not seem to offset the extreme increase in complexity of implementing the algorithm.
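            For reference, the sketch below shows one level of Strassen's recursion on plain scalars. In the full algorithm each entry stands for an (n/2)x(n/2) block, each multiplication below is itself a recursive matrix multiplication, and the additions and subtractions are the extra O(n^2) work mentioned above. This is the textbook formulation of the seven products, not necessarily the exact code used in these experiments.

/* One level of Strassen's recursion, shown on scalars for clarity.
 * A = [a11 a12; a21 a22], B = [b11 b12; b21 b22]. */
#include <stdio.h>

int main(void)
{
    double a11 = 1, a12 = 2, a21 = 3, a22 = 4;
    double b11 = 5, b12 = 6, b21 = 7, b22 = 8;

    /* Seven products instead of the usual eight. */
    double m1 = (a11 + a22) * (b11 + b22);
    double m2 = (a21 + a22) * b11;
    double m3 = a11 * (b12 - b22);
    double m4 = a22 * (b21 - b11);
    double m5 = (a11 + a12) * b22;
    double m6 = (a21 - a11) * (b11 + b12);
    double m7 = (a12 - a22) * (b21 + b22);

    /* Reassemble C = A*B from the seven products. */
    double c11 = m1 + m4 - m5 + m7;
    double c12 = m3 + m5;
    double c21 = m2 + m4;
    double c22 = m1 - m2 + m3 + m6;

    printf("C = [%g %g; %g %g]\n", c11, c12, c21, c22);  /* [19 22; 43 50] */
    return 0;
}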

 

PVM vs. MPI

            The purpose of implementing algorithms in parallel is to increase performance. Therefore, it is reasonable to study which tools will give the user the most speed, all other factors being equal. For this reason, I implemented essentially identical versions of matrix multiplication in both the PVM and MPI parallel programming packages in order to determine their relative performance. PVM is the older system and is traditionally believed to have lower performance. MPI is relatively new and is supposed to have more efficient message passing and less overhead associated with communication, and therefore superior overall performance. However, the release of the latest version of PVM may have reversed this trend. My tests show that PVM actually has a small performance advantage in this application. However, as the number of processors is increased, the performance advantage diminishes, indicating that MPI may in fact have superior message passing.
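            The fragments below illustrate the difference in point-to-point send style between the two libraries. They are a sketch of the general API pattern rather than the programs used here, and the two functions would live in the separate PVM and MPI versions of the code. PVM requires the message to be explicitly packed into a send buffer before transmission, which is part of the per-message overhead discussed above, while MPI describes the user's buffer in place with a datatype argument.

/* --- PVM 3 style: pack, then send ------------------------------------- */
#include <pvm3.h>

void send_row_pvm(double *row, int n, int dest_tid, int tag)
{
    pvm_initsend(PvmDataDefault);   /* prepare a fresh send buffer */
    pvm_pkdouble(row, n, 1);        /* pack n doubles with stride 1 */
    pvm_send(dest_tid, tag);        /* ship the packed buffer */
}

/* --- MPI style: datatype describes the buffer directly ----------------- */
#include <mpi.h>

void send_row_mpi(double *row, int n, int dest_rank, int tag)
{
    /* no separate packing step */
    MPI_Send(row, n, MPI_DOUBLE, dest_rank, tag, MPI_COMM_WORLD);
}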