Date: Oct 19, 1994

This is the directory for the second release of the Stanford Parallel
Applications for Shared-Memory (SPLASH-2) programs. For further
information contact splash@mojave.stanford.edu.

PLEASE NOTE: Due to our limited resources, we will be unable to spend
much time answering questions about the applications.

splash.tar contains the tarred version of all the files. Grabbing this
file will get you everything you need. We also keep the files
individually untarred for partial retrieval. The splash.tar file is not
compressed, but the large files in it are. We attempted to compress the
splash.tar file to reduce the file size further, but the compressed
file came out larger than the original.


DIFFERENCES BETWEEN SPLASH AND SPLASH-2:
----------------------------------------

The SPLASH-2 suite contains two types of codes: full applications and
kernels. Each of the codes uses the Argonne National Laboratory
(ANL) parmacs macros for parallel constructs. Unlike the codes in the
original SPLASH release, each of the codes assumes the use of a
"lightweight threads" model (which we hereafter refer to as the "threads"
model) in which child processes share the same virtual address space as
their parent process. In order for the codes to function correctly,
the CREATE macro should call the proper Unix system routine (e.g. "sproc"
in the Silicon Graphics IRIX operating system) instead of the "fork"
routine that was used for SPLASH. The difference is that processes
created with the Unix "fork" routine receive their own private copies of
all global variables. In the threads model, child processes share the
same virtual address space, and hence all global data. Some of the
codes also function correctly when the Unix "fork" routine is used for
child process creation. Comments in the code header denote those
applications which function correctly with "fork."

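The distinction between the two process-creation models can be seen in
a few lines of C. The following is a minimal sketch, not code from the
suite: it assumes a POSIX system and uses pthread_create as a stand-in
for a threads-model routine such as "sproc". The names counter,
counter_after_fork, and counter_after_thread are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 0;   /* an ordinary global variable */

static void *thread_body(void *arg) {
    (void)arg;
    counter = 1;   /* same address space: the parent sees this write */
    return NULL;
}

/* Returns the parent's view of `counter` after a fork()ed child writes it. */
int counter_after_fork(void) {
    counter = 0;
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        counter = 1;   /* modifies only the child's private copy */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return counter;
}

/* Returns the parent's view of `counter` after a thread writes it. */
int counter_after_thread(void) {
    pthread_t t;
    counter = 0;
    int rc = pthread_create(&t, NULL, thread_body, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);
    return counter;
}
```

Under these assumptions counter_after_fork() returns 0, because the
child changed only its own copy, while counter_after_thread() returns 1.
Some systems require the -pthread compiler option to build this.
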
MACROS:
-------

Macros for the previous release of the SPLASH application suite can be
obtained via anonymous ftp to www-flash.stanford.edu. The macros are
contained in the pub/old_splash/splash/macros subdirectory. HOWEVER,
THE MACRO FILES MUST BE MODIFIED IN ORDER TO BE USED WITH SPLASH-2 CODES.
The CREATE macros must be changed so that they call the proper process
creation routine (see the DIFFERENCES section above) instead of "fork."

In this macros subdirectory, macros and sample makefiles are provided
for three machines:

    Encore Multimax (CMU Mach 2.5: C and Fortran)
    SGI 4D/240 (IRIX System V Release 3.3: C only)
    Alliant FX/8 (Alliant Rev. 5.0: C and Fortran)

These macros work for us with the above operating systems. Unfortunately,
our limited resources prevent us from supporting them in any way or
even fielding questions about them. If they don't work for you, please
contact Argonne National Labs for a version that will. An e-mail address
to try might be monitor-users-request@mcs.anl.gov. An excerpt from
a message, received from Argonne, concerning obtaining the macros follows:

"The parmacs package is in the public domain. Approximately 15 people at
Argonne (or associated with Argonne or students) have worked on the
parmacs package at one time or another. The parmacs package is
implemented via macros using the M4 macropreprocessor (standard on most
Unix systems). Current distribution of the software is somewhat ad hoc.
Most C versions can be obtained from netlib (send electronic mail to
netlib@ornl.gov with the message send index from parmacs). Fortran
versions have been emailed directly or sent on tape. The primary
documentation for the parmacs package is the book ``Portable Programs for
Parallel Processors'' by Lusk, et al, Holt, Rinehart, and Winston 1987."

The makefiles provided in the individual program directories specify
a null macro set that will turn the parallel programs into sequential
ones. Note that we do not have a null macro set for FORTRAN.

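To illustrate what a null macro set does, here is a hedged sketch
written with C preprocessor macros. The macros shipped with the suite
are m4 definitions and may differ in detail. LOCK, UNLOCK, and
WAIT_FOR_END are parmacs names that this file does not otherwise
mention, and the worker function, the variable total, and
run_with_null_macros are hypothetical names made up for the example.

```c
#include <assert.h>

/* Null stand-ins: every parallel construct collapses to sequential code.
   These are C preprocessor approximations, not the suite's m4 macros. */
#define LOCK(l)          ((void)0)   /* no other process to exclude */
#define UNLOCK(l)        ((void)0)
#define BARRIER(b, n)    ((void)0)   /* a single process never waits */
#define CREATE(fn)       (fn)()      /* run the "child" inline */
#define WAIT_FOR_END(n)  ((void)0)

static long total = 0;

/* Hypothetical worker body, written as it would be for the parallel case. */
static void worker(void) {
    for (long i = 1; i <= 100; i++) {
        LOCK(total_lock);       /* argument is discarded by the null macro */
        total += i;
        UNLOCK(total_lock);
    }
    BARRIER(done, 1);
}

/* Runs the program with one process; returns the accumulated sum. */
long run_with_null_macros(int nprocs) {
    assert(nprocs == 1);        /* null macros only model the sequential case */
    total = 0;
    CREATE(worker);
    WAIT_FOR_END(nprocs - 1);
    return total;
}
```

With these definitions the source text of the worker is unchanged, yet
the compiled program contains no synchronization or process creation.
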
CODE ENHANCEMENTS:
------------------

All of the codes are designed for shared address space multiprocessors
with physically distributed main memory. For these types of machines,
process migration and poor data distribution can degrade performance.
The applications contain comments indicating potential enhancements
that can improve performance. Each potential enhancement is denoted by
a comment beginning with "POSSIBLE ENHANCEMENT". The potential
enhancements which we identify are:

    (1) Data Distribution

        Comments are placed in the code indicating where directives should
        be placed so that data can be migrated to the local memories of
        nodes, thus allowing for remote communication to be minimized.

    (2) Process-to-Processor Assignment

        Comments are placed in the code indicating where directives should
        be placed so that processes can be "pinned" to processors,
        preventing them from migrating from processor to processor.

In addition, to facilitate simulation studies, we note points in the
codes where statistics gathering routines should be turned on so that
cold-start and initialization effects can be avoided.

As previously mentioned, processes are assumed to be created through calls
to a "threads" model creation routine. One important side effect is that
this model causes all global variables to be shared (whereas the fork model
causes all processes to get their own private copy of global variables).
In order to mimic the behavior of global variables in the fork model, many
of the applications provide arrays of structures that can be accessed by
process ID, such as:

    struct per_process_info {
      char pad_before[PAD_LENGTH];
      unsigned start_time;
      unsigned end_time;
      char pad_after[PAD_LENGTH];
    } PPI[MAX_PROCS];

In these structures, padding is inserted to ensure that the structure
information associated with each process can be placed on a different
page of memory, and can thus be explicitly migrated to that processor's
local memory system. We follow this strategy for certain variables since
these data really belong to a process and should be allocated in its local
memory. A programming model that had the ability to declare global private
data would have automatically ensured that these data were private, and
that false sharing did not occur across different structures in the
array. However, since the threads model does not provide this capability,
it is provided by explicitly introducing arrays of structures with padding.
The padding constants used in the programs (PAD_LENGTH in this example)
can easily be changed to suit the particular characteristics of a given
system. The actual data that is manipulated by individual applications
(e.g. grid points, particle data, etc) is not padded, however.

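The effect of the padding can be checked with a little pointer
arithmetic. The sketch below is illustrative only: the values 4096 for
PAD_LENGTH and 64 for MAX_PROCS are assumptions, the two padding members
are given distinct names so that the structure compiles, and gap_to_next
is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

#define PAD_LENGTH 4096   /* assumed page size in bytes; tune per system */
#define MAX_PROCS  64     /* assumed upper bound on processes */

struct per_process_info {
    char pad_before[PAD_LENGTH];
    unsigned start_time;
    unsigned end_time;
    char pad_after[PAD_LENGTH];
};

static struct per_process_info PPI[MAX_PROCS];

/* Bytes between the end of process p's live fields and the start of
   process p+1's live fields. */
size_t gap_to_next(int p) {
    assert(p >= 0 && p + 1 < MAX_PROCS);
    char *end_of_p = (char *)&PPI[p].end_time + sizeof(PPI[p].end_time);
    char *start_of_next = (char *)&PPI[p + 1].start_time;
    return (size_t)(start_of_next - end_of_p);
}
```

Because the gap is at least PAD_LENGTH bytes, the timing fields of two
different processes can never fall in the same PAD_LENGTH-sized page or
cache line, which is what rules out false sharing between them.
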
Finally, for some applications we provide less-optimized versions of the
codes. The less-optimized versions utilize data structures that lead to
simpler implementations, but which do not allow for optimal data
distribution (and can thus generate false sharing).


REPORT:
-------

A report will be put together shortly describing the structure, function,
and performance characteristics of each application. The report will be
similar to the original SPLASH report (see the original report for the
issues discussed). The report will provide quantitative data (for two
different cache line sizes) for characteristics such as working set size
and miss rates (local versus remote, etc.). The report will also
discuss the cache behavior and synchronization behavior of the
applications. In the meantime, each application directory has
a README file that describes how to run each application. In addition,
most applications have comments in their headers describing how to run
each application.


README FILES:
-------------

Each application has an associated README file. It is VERY important to
read these files carefully, as they discuss the important parameters to
supply for each application, as well as other issues involved in running
the programs. In each README file, we discuss the impact of explicitly
distributing data on the Stanford DASH Multiprocessor. Unless otherwise
specified, we assume that the default data distribution mechanism is
through round-robin page allocation.

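Round-robin page allocation can be sketched as follows, under the
assumption that consecutive pages of a shared region are handed to
consecutive nodes in turn. The function name, the page size, and the
node count are illustrative and are not taken from DASH or from the
suite.

```c
#include <assert.h>
#include <stddef.h>

/* Node that holds the page containing byte `offset` of a shared region
   when pages are dealt out to nodes in round-robin order. */
int home_node_round_robin(size_t offset, size_t page_size, int num_nodes) {
    assert(page_size > 0 && num_nodes > 0);
    return (int)((offset / page_size) % (size_t)num_nodes);
}
```

With 4096-byte pages and 4 nodes, for example, offsets 0 through 4095
map to node 0, the next page maps to node 1, and the fifth page wraps
around to node 0 again. A large array is therefore spread evenly over
the nodes regardless of which process uses which part of it.
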
PROBLEM SIZES:
--------------

For each application, the README file describes a recommended base
problem size that is small enough to be simulated, yet not unrealistically
small for a machine with up to 64 processors.
For the purposes of studying algorithm performance, the parameters
associated with each application can be varied. However, for the
purposes of comparing machine architectures, the README files describe
which parameters can be varied, and which should remain constant (or at
their default values) for comparability. If the specific "base"
parameters that are specified are not used, then results which are
reported should explicitly state which parameters were changed, what
their new values are, and why they were changed.


CORE PROGRAMS:
--------------

Since the number of programs has increased over SPLASH, and since not
everyone may be able to use all the programs in a given study, we
identify some of the programs as "core" programs that should be used
in most studies for comparability. In the currently available set, these
core programs include:

    (1) Ocean Simulation
    (2) Hierarchical Radiosity
    (3) Water Simulation with Spatial data structure
    (4) Barnes-Hut
    (5) FFT
    (6) Blocked Sparse Cholesky Factorization
    (7) Radix Sort

The less optimized versions of the programs, when provided, should be
used only in addition to these.


MAILING LIST:
-------------

Please send a note to splash@mojave.stanford.edu if you have copied over
the programs, so that we can put you on a mailing list for update reports.


AUTHORSHIP:
-----------

The applications provided in the SPLASH-2 suite were developed by a number
of people. The report lists authors primarily responsible for the
development of each application code. The codes were made ready for
distribution and the README files were prepared by Steven Cameron Woo and
Jaswinder Pal Singh.


CODE CHANGES:
-------------

If modifications are made to the codes which improve their performance,
we would like to hear about them. Please send email to
splash@mojave.stanford.edu detailing the changes.


UPDATE REPORTS:
---------------

Watch this file for information regarding changes to codes and additions
to the application suite.


CHANGES:
--------

10-21-94: Ocean code, contiguous partitions, line 247 of slave1.C changed
          from

            t2a[0][0] = hh3*t2a[0][0]+hh1*psi[procid][1][0][0];

          to

            t2a[0][0] = hh3*t2a[0][0]+hh1*t2c[0][0];

          This change does not affect correctness; it is an optimization
          that was performed elsewhere in the code but overlooked here.

11-01-94: Barnes, file code_io.C, line 55 changed from

            in_real(instr, tnow);

          to

            in_real(instr, &tnow);

11-01-94: Raytrace, file main.C, lines 216-223 changed from

            if ((pid == 0) || (dostats))
               CLOCK(end);

            gm->partime[0] = (end - begin) & 0x7FFFFFFF;
            if (pid == 0) gm->par_start_time = begin;

            /* printf("Process %ld elapsed time %lu.\n", pid, lapsed); */

            }

          to

            if ((pid == 0) || (dostats)) {
               CLOCK(end);
               gm->partime[pid] = (end - begin) & 0x7FFFFFFF;
               if (pid == 0) gm->par_start_time = begin;
            }

11-13-94: Raytrace, file memory.C

          The use of the word MAIN_INITENV in a comment in memory.C causes
          m4 to expand this macro, and some implementations may get confused
          and generate the wrong C code.

11-13-94: Radiosity, file rad_main.C

          rad_main.C uses the macro CREATE_LITE. All three instances of
          CREATE_LITE should be changed to CREATE.

11-13-94: Water-spatial and Water-nsquared, file makefile

          The makefiles were changed so that the compilation phases include
          the CFLAGS options instead of the CCOPTS options, which did not
          exist.

11-17-94: FMM, file particle.C

          The comment regarding data distribution of the particle_array
          data structure is incorrect. Round-robin allocation should be
          used.

11-18-94: OCEAN, contiguous partitions, files main.C and linkup.C

          Eliminated a problem which caused non-doubleword aligned
          accesses to doublewords for the uniprocessor case.

          main.C: Added lines 467-471:

            if (nprocs%2 == 1) {   /* To make sure that the actual data
                                      starts double word aligned, add an extra
                                      pointer */
              d_size += sizeof(double ***);
            }

          Added same lines in file linkup.C at line numbers 100 and 159.

07-30-95: RADIX has been changed. A tree-structured parallel prefix
          computation is now used instead of a linear one.

          LU has been modified. A comment describing how to distribute
          data (one of the POSSIBLE ENHANCEMENTS) was incorrect for the
          contiguous_blocks version of LU. Also, a modification was made
          that reduces false sharing at line 206 of lu.C:

            last_malloc[i] = (double *) (((unsigned) last_malloc[i]) + PAGE_SIZE -
                             ((unsigned) last_malloc[i]) % PAGE_SIZE);

          A subdirectory shmem_files was added under the codes directory.
          This directory contains a file that can be compiled on SGI machines
          which replaces the libsgi.a file distributed in the original SPLASH
          release.

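The difference between a linear and a tree-structured prefix
computation, as mentioned for RADIX in the 07-30-95 entry, can be
sketched in sequential C. This is not the RADIX code: it is the
textbook up-sweep/down-sweep scheme, assumed here for illustration, and
it requires the array length to be a power of two. The function names
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Linear exclusive prefix sum: each step depends on the previous one. */
void linear_exclusive_prefix(long *a, size_t n) {
    long running = 0;
    for (size_t i = 0; i < n; i++) {
        long value = a[i];
        a[i] = running;
        running += value;
    }
}

/* Tree-structured exclusive prefix sum, in place.
   The iterations of each inner loop are independent of one another. */
void tree_exclusive_prefix(long *a, size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);   /* power of two only */

    /* Up-sweep: build partial sums toward the root of the tree. */
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];

    a[n - 1] = 0;                          /* root receives the identity */

    /* Down-sweep: push prefixes from the root back to the leaves. */
    for (size_t stride = n / 2; stride >= 1; stride /= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            long left = a[i - stride];
            a[i - stride] = a[i];
            a[i] += left;
        }
}
```

Both functions produce the same result. In a parallel setting, the
inner-loop iterations of the tree version could be divided among
processes with a barrier between levels, so the number of dependent
steps grows with log2(n) instead of with n.
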
09-26-95: Fixed a bug in LU. Line 201 was changed from

            last_malloc[i] = (double *) G_MALLOC(proc_bytes[i])

          to

            last_malloc[i] = (double *) G_MALLOC(proc_bytes[i] + PAGE_SIZE)

          Fixed similar bugs in WATER-NSQUARED and WATER-SPATIAL. Both
          codes needed a barrier added into the mdmain.C files. In both
          codes, the line

            BARRIER(gl->start, NumProcs);

          was added. In WATER-NSQUARED, it was added in mdmain.C at line
          84. In WATER-SPATIAL, it was added in mdmain.C at line 107.
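
The two LU changes (07-30-95 and 09-26-95) fit together: rounding a
pointer up to the next page boundary skips up to PAGE_SIZE bytes, so
the allocation must be that much larger. The sketch below shows the
arithmetic under stated assumptions. It uses malloc in place of
G_MALLOC, uintptr_t in place of the original unsigned cast (which would
truncate a 64-bit pointer), and a hypothetical constant DEMO_PAGE_SIZE
of 4096 standing in for PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096u   /* assumed; stands in for PAGE_SIZE */

/* Same arithmetic as the lu.C line, applied to an integer address.
   Always moves forward by between 1 and DEMO_PAGE_SIZE bytes. */
uintptr_t next_page_boundary(uintptr_t addr) {
    return addr + DEMO_PAGE_SIZE - (addr % DEMO_PAGE_SIZE);
}

/* Allocates `bytes` usable bytes that start on a page boundary.
   *raw receives the pointer that must later be passed to free(). */
double *alloc_page_aligned(size_t bytes, void **raw) {
    assert(raw != NULL);
    *raw = malloc(bytes + DEMO_PAGE_SIZE);   /* extra page, as in the fix */
    if (*raw == NULL) return NULL;
    return (double *)next_page_boundary((uintptr_t)*raw);
}
```

Note that an address already on a page boundary is still advanced by a
full page, which is why a whole extra page is needed rather than
PAGE_SIZE - 1 bytes.
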