David Biermann
April 26, 2000
Independent Study with Dr. Board
Graduation with Distinction Paper
Matrix multiplication is a simple and common algorithm used in everyday calculation and
scientific computing. The purpose of this experiment was to attempt to speed up
matrix multiplication by employing a cluster of closely networked computers
working in parallel to complete a single matrix multiplication. The resulting
program was then used as a tool for evaluating the performance and efficiency of
this kind of operation.
The matrix multiplication algorithm was implemented using two different
libraries for parallel computing. The first is PVM, or Parallel Virtual Machine,
and the second is MPI, or Message Passing Interface. Both libraries allow the
user to write parallel programs using an explicit message passing model. This
means that the user can easily spawn independent processes on different
processors but is responsible for explicitly controlling all communication, in
the form of sending and receiving data, between these processes. The performance
differences between these parallel programming packages will be discussed in
detail later, but the data in this section will be based on code written in MPI.
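As an illustration of this message passing model, the sketch below shows how a master process might distribute a matrix multiplication over worker processes in MPI. This is a minimal sketch of the general approach rather than the exact code used in the experiment; the matrix size N, the assumption that N divides evenly among the workers, and the simple row-block distribution are choices made purely for illustration.

    #include <mpi.h>
    #include <stdlib.h>

    #define N 800                        /* illustrative problem size */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int workers = size - 1;          /* rank 0 acts as the master; assumes size >= 2 */
        int rows = N / workers;          /* rows of A (and C) per worker; assumes N % workers == 0 */

        if (rank == 0) {
            double *A = malloc(N * N * sizeof(double));
            double *B = malloc(N * N * sizeof(double));
            double *C = malloc(N * N * sizeof(double));
            /* ... initialize A and B here ... */
            for (int w = 1; w <= workers; w++) {
                /* each worker receives its block of rows of A and a full copy of B */
                MPI_Send(A + (w - 1) * rows * N, rows * N, MPI_DOUBLE, w, 0, MPI_COMM_WORLD);
                MPI_Send(B, N * N, MPI_DOUBLE, w, 1, MPI_COMM_WORLD);
            }
            for (int w = 1; w <= workers; w++)
                MPI_Recv(C + (w - 1) * rows * N, rows * N, MPI_DOUBLE, w, 2,
                         MPI_COMM_WORLD, &status);
            free(A); free(B); free(C);
        } else {
            double *a = malloc(rows * N * sizeof(double));
            double *B = malloc(N * N * sizeof(double));
            double *c = malloc(rows * N * sizeof(double));
            MPI_Recv(a, rows * N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Recv(B, N * N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
            for (int i = 0; i < rows; i++)           /* compute the local rows of C = A * B */
                for (int j = 0; j < N; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < N; k++)
                        sum += a[i * N + k] * B[k * N + j];
                    c[i * N + j] = sum;
                }
            MPI_Send(c, rows * N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
            free(a); free(B); free(c);
        }

        MPI_Finalize();
        return 0;
    }

Each MPI_Send and MPI_Recv pair here is exactly the kind of explicit communication described above, and each transfer carries the packing and network costs discussed below.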
One of the most important performance issues to consider is how the algorithm
performs as the number of processors used is increased. The biggest bottleneck
in parallel computing on a network of workstations is the delay of sending
messages between processes. In addition to network latency, the act of packing a
message and sending it has significant computing overhead associated with it.
For this reason, small problems do not scale well to parallel environments. By
the same token, large problems can be significantly sped up by parallel
implementations, but there is a limit to how many processors can be effectively
used. As a large problem is broken up into smaller and smaller problems,
communication delays begin to become a significant portion of the total
execution time. At some point, it becomes ineffective to add any more processors
to the job. In order to explore this problem, this matrix multiplication
algorithm was run on a Beowulf cluster consisting of 450 MHz dual Pentium
machines with 512 MB of RAM, connected via 100 Mb/s Fast Ethernet. On this
cluster, the matrix multiplication algorithm was able to use around ten
processors effectively on matrices of size 800x800 or larger. For matrices as
small as 400x400, only about six processors could be used effectively.
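A rough model illustrates why (this is a back-of-the-envelope estimate, not a measurement from these runs). If an n x n multiplication is divided by rows among p worker processes, each worker receives roughly n^2/p elements of A plus all n^2 elements of B, returns n^2/p elements of C, and performs about 2n^3/p floating-point operations. The ratio of computation to communication per worker is therefore on the order of 2n/p operations per element transferred, so it shrinks as processors are added or as the matrices get smaller, which is consistent with the smaller 400x400 matrices saturating at fewer processors than the 800x800 ones.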
As mentioned above, each Beowulf machine has two processors per box. For
this reason, it is interesting to consider the performance difference between
using processors on different machines versus on the same machine. For example,
if we were to run a parallel program on four processors, would there be a
difference in performance if those four processors were distributed across four
different Beowulf machines, or across only two? Using multiple processors in the
same box decreases the network traffic, but means that the two processors will
have to share available system resources. The question really becomes: can
running two independent processes on the same box make effective use of both
processors?
Tests show
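In the MPI runs, this placement can be controlled through the machine file given to an MPICH-style mpirun; the file name, program name, and host layout below are placeholders rather than the actual cluster configuration:

    mpirun -np 4 -machinefile machines ./matmult

where the machines file either lists four different hosts, one per line (one process per box), or lists two hosts twice each (two processes per box).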
A normal, naïve implementation of matrix multiplication has an asymptotic running
time of O(n^3). There are several matrix multiplication algorithms
that improve slightly on this. One of the most famous is Strassen's Algorithm.
Strassen's algorithm reduces the number of multiplications needed in a matrix
multiplication, performing seven block multiplications instead of eight at each
level of recursion, which reduces the asymptotic running time to O(n^(log2 7)),
or roughly O(n^2.81). Unfortunately, it does this at the cost of significant
additional O(n^2) work in the form of extra matrix additions and subtractions.
For this reason, Strassen's algorithm is only effective
on very large problem sizes. For this experiment, I implemented sequential and
parallel versions of Strassen’s Algorithm to see if it would be effective in
speeding up matrix multiplication on large problems. The results were somewhat
disappointing. The sequential version of the program was able to realize small
performance improvements for large problem sizes. Unfortunately, for the problem
sizes that this program was able to undertake, parallelizing the algorithm
simply did not yield any performance gains. In any case, the performance gains
do not seem to offset the extreme increase in complexity of implementing the
algorithm.
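For reference, the standard formulation of Strassen's algorithm (a well-known result, not specific to this implementation) divides each n x n matrix into four n/2 x n/2 blocks and computes seven block products:

    M1 = (A11 + A22)(B11 + B22)
    M2 = (A21 + A22) B11
    M3 = A11 (B12 - B22)
    M4 = A22 (B21 - B11)
    M5 = (A11 + A12) B22
    M6 = (A21 - A11)(B11 + B12)
    M7 = (A12 - A22)(B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

Each of the seven products is computed recursively, and the surrounding additions and subtractions are the source of the O(n^2) overhead described above.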
The
purpose of implementing algorithms in parallel is to increase performance.
Therefore, it is reasonable to study which tools will give the user the
advantage of the most speed, all other factors being equal. For this reason, I
implemented essentially identical versions of matrix multiplication in both the
PVM and MPI parallel programming packages in order to determine their relative
performance. PVM is an older system and is traditionally believed to have lower
performance. MPI is relatively new and is supposed to have more efficient
message passing and less overhead associated with communication, and therefore
superior overall performance. However, the release of the latest version of PVM
may have reversed this trend. My tests show that PVM actually has a small
performance advantage in this application. However, as the number of processors
is increased, that advantage diminishes, indicating that MPI may in fact have
superior message passing.
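As a small illustration of the difference between the two interfaces (the buffer, count, tag, and process identifiers below are placeholders, not names from the actual programs), sending and receiving a block of doubles looks roughly like this in each library:

    /* PVM: the message is packed into a send buffer explicitly before sending */
    pvm_initsend(PvmDataDefault);
    pvm_pkdouble(buf, count, 1);        /* buf: double array, count elements */
    pvm_send(dest_tid, tag);

    /* ...and unpacked explicitly after receiving */
    pvm_recv(src_tid, tag);
    pvm_upkdouble(buf, count, 1);

    /* MPI: a typed buffer is sent and received in single calls */
    MPI_Send(buf, count, MPI_DOUBLE, dest_rank, tag, MPI_COMM_WORLD);
    MPI_Recv(buf, count, MPI_DOUBLE, src_rank, tag, MPI_COMM_WORLD, &status);

The explicit pack and unpack steps in PVM are part of the per-message overhead discussed earlier, which is one reason MPI is generally expected to pass messages more efficiently.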