gem5/splash2
Sanchayan Maity 2fcc51c2c1 Commit splash2 benchmark
While at it also add the libpthread static library and m5op_x86
for matrix multiplication test code as well.

Note that the splash2 benchmark code does not comply with gem5
coding guidelines. Academic guys never seem to follow 80 columns
and no whitespace guideline :(.
2017-04-26 20:50:15 +05:30

Date:  Oct 19, 1994

This is the directory for the second release of the Stanford Parallel 
Applications for Shared-Memory (SPLASH-2) programs. For further 
information contact splash@mojave.stanford.edu.

PLEASE NOTE:  Due to our limited resources, we will be unable to spend
much time answering questions about the applications.

splash.tar contains the tarred version of all the files.  Grabbing this
file will get you everything you need.  We also keep the files 
individually untarred for partial retrieval.  The splash.tar file is not
compressed, but the large files in it are.  We attempted to compress the
splash.tar file to reduce the file size further, but the compressed file
came out larger than the original.


DIFFERENCES BETWEEN SPLASH AND SPLASH-2:
----------------------------------------

The SPLASH-2 suite contains two types of codes: full applications and 
kernels.  Each of the codes utilizes the Argonne National Laboratory 
(ANL) parmacs macros for parallel constructs.  Unlike the codes in the 
original SPLASH release, each of the codes assumes the use of a 
"lightweight threads" model (which we hereafter refer to as the "threads" 
model) in which child processes share the same virtual address space as 
their parent process.  In order for the codes to function correctly, 
the CREATE macro should call the proper Unix system routine (e.g. "sproc" 
in the Silicon Graphics IRIX operating system) instead of the "fork" 
routine that was used for SPLASH.  The difference is that processes 
created with the Unix fork command receive their own private copies of 
all global variables.  In the threads model, child processes share the 
same virtual address space, and hence all global data.  Some of the 
codes function correctly when the Unix "fork" command is used for child 
process creation as well.  Comments in the code header denote those 
applications which function correctly with "fork."
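The difference between the two models can be seen in a minimal sketch.
It is not part of the SPLASH-2 distribution: POSIX threads stand in for
a "threads" model routine such as "sproc" (an assumption), and the names
shared_counter, counter_after_fork and counter_after_thread are
hypothetical.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A global variable, standing in for the globals in the SPLASH-2 codes. */
static int shared_counter = 0;

/* Body of the thread child: writes to the global. */
static void *thread_body(void *arg)
{
    (void)arg;
    shared_counter = 1;
    return NULL;
}

/* Fork model: the child writes to its own private copy of the global.
 * Returns the value the parent sees afterwards, or -1 on error. */
int counter_after_fork(void)
{
    shared_counter = 0;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        shared_counter = 1;      /* only the child's copy changes */
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return shared_counter;       /* parent's copy is untouched */
}

/* Threads model: the child shares the parent's address space.
 * Returns the value the parent sees afterwards, or -1 on error. */
int counter_after_thread(void)
{
    pthread_t t;
    shared_counter = 0;
    if (pthread_create(&t, NULL, thread_body, NULL) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return shared_counter;       /* the child's write is visible */
}
```

Under these assumptions the parent still sees 0 after the forked child
exits, but sees 1 after the thread finishes.  Codes that rely on children
updating global data therefore need the threads model.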


MACROS:
-------

Macros for the previous release of the SPLASH application suite can be 
obtained via anonymous ftp to www-flash.stanford.edu.  The macros are 
contained in the pub/old_splash/splash/macros subdirectory.  HOWEVER, 
THE MACRO FILES MUST BE MODIFIED IN ORDER TO BE USED WITH SPLASH-2 CODES.
The CREATE macros must be changed so that they call the proper process
creation routine (See DIFFERENCES section above) instead of "fork."

In this macros subdirectory, macros and sample makefiles are provided
for three machines:

Encore Multimax (CMU Mach 2.5: C and Fortran)
SGI 4D/240      (IRIX System V Release 3.3: C only)
Alliant FX/8    (Alliant Rev. 5.0: C and Fortran)

These macros work for us with the above operating systems.  Unfortunately,
our limited resources prevent us from supporting them in any way or
even fielding questions about them.   If they don't work for you, please
contact Argonne National Labs for a version that will.  An e-mail address
to try might be monitor-users-request@mcs.anl.gov. An excerpt from
a message, received from Argonne, concerning obtaining the macros follows:

"The parmacs package is in the public domain.  Approximately 15 people at
Argonne (or associated with Argonne or students) have worked on the 
parmacs package at one time or another.  The parmacs package is 
implemented via macros using the M4 macropreprocessor (standard on most 
Unix systems).  Current distribution of the software is somewhat ad hoc.  
Most C versions can be obtained from netlib (send electronic mail to 
netlib@ornl.gov with the message send index from parmacs).  Fortran 
versions have been emailed directly or sent on tape.  The primary 
documentation for the parmacs package is the book ``Portable Programs for 
Parallel Processors'' by Lusk, et al, Holt, Rinehart, and Winston 1987."

The makefiles provided in the individual program directories specify
a null macro set that will turn the parallel programs into sequential
ones.  Note that we do not have a null macro set for FORTRAN.
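To illustrate what a null macro set does, here is a rough sketch using
the C preprocessor.  The distributed macros are written for m4, so this
is only an analogy; the LOCK and UNLOCK definitions and the names worker,
total and run_sequential are assumptions made for the example.

```c
/* Hypothetical sequential stand-ins for the parallel constructs.
 * With these definitions, a "parallel" program runs as one process. */
#define CREATE(fn)      (fn)()        /* run the child routine inline */
#define BARRIER(b, n)   ((void)0)     /* nothing to wait for          */
#define LOCK(l)         ((void)0)     /* no contention, no lock       */
#define UNLOCK(l)       ((void)0)

static long total = 0;

/* A routine written as if it were one of several parallel workers.
 * The lock and barrier arguments are discarded by the null macros,
 * so they need not even be declared in this sketch. */
static void worker(void)
{
    LOCK(total_lock);
    total += 42;
    UNLOCK(total_lock);
    BARRIER(start_barrier, 1);
}

/* Driver: "creates" one worker, which simply executes in place. */
long run_sequential(void)
{
    total = 0;
    CREATE(worker);
    return total;
}
```

Because CREATE expands to a plain function call, the worker finishes
before the driver continues, and the synchronization macros can safely
expand to nothing.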


CODE ENHANCEMENTS:
------------------

All of the codes are designed for shared address space multiprocessors
with physically distributed main memory.  For these types of machines, 
process migration and poor data distribution can decrease performance 
to suboptimal levels.  Comments in the applications indicate potential 
enhancements that can improve performance.  Each potential enhancement 
is denoted by a comment beginning with "POSSIBLE ENHANCEMENT".
The potential enhancements which we identify are:

  (1) Data Distribution
  
      Comments are placed in the code indicating where directives should 
      be placed so that data can be migrated to the local memories of 
      nodes, thus allowing for remote communication to be minimized. 

  (2) Process-to-Processor Assignment

      Comments are placed in the code indicating where directives should 
      be placed so that processes can be "pinned" to processors, 
      preventing them from migrating from processor to processor.

In addition, to facilitate simulation studies, we note points in the 
codes where statistics gathering routines should be turned on so that 
cold-start and initialization effects can be avoided.

As previously mentioned, processes are assumed to be created through calls
to a "threads" model creation routine.  One important side effect is that 
this model causes all global variables to be shared (whereas the fork model 
causes all processes to get their own private copy of global variables).  
In order to mimic the behavior of global variables in the fork model, many 
of the applications provide arrays of structures that can be accessed by 
process ID, such as:

     struct per_process_info {
       char pad_before[PAD_LENGTH];
       unsigned start_time;
       unsigned end_time;
       char pad_after[PAD_LENGTH];
     } PPI[MAX_PROCS];

In these structures, padding is inserted to ensure that the structure 
information associated with each process can be placed on a different 
page of memory, and can thus be explicitly migrated to that processor's
local memory system.  We follow this strategy for certain variables since 
these data really belong to a process and should be allocated in its local
memory.  A programming model that had the ability to declare global private
data would have automatically ensured that these data were private, and 
that false sharing did not occur across different structures in the 
array.  However, since the threads model does not provide this capability,
it is provided by explicitly introducing arrays of structures with padding.
The padding constants used in the programs (PAD_LENGTH in this example)
can easily be changed to suit the particular characteristics of a given
system.  The actual data that is manipulated by individual applications
(e.g. grid points, particle data, etc) is not padded, however.
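The effect of the padding can be checked with a small sketch.  The values
chosen for PAD_LENGTH and MAX_PROCS, the member names pad_before and
pad_after, and the helper ppi_distance are assumptions made for the
example, not values taken from the codes.

```c
#include <stddef.h>

#define PAD_LENGTH 4096   /* assumed page size; tune for the target system */
#define MAX_PROCS  4      /* assumed; kept small for the example           */

struct per_process_info {
    char pad_before[PAD_LENGTH];   /* keeps the previous entry's data away */
    unsigned start_time;
    unsigned end_time;
    char pad_after[PAD_LENGTH];    /* keeps the next entry's data away     */
};

static struct per_process_info PPI[MAX_PROCS];

/* Byte distance between the timing fields of entries a and b. */
size_t ppi_distance(int a, int b)
{
    const char *pa = (const char *)&PPI[a].start_time;
    const char *pb = (const char *)&PPI[b].start_time;
    return (size_t)(pb > pa ? pb - pa : pa - pb);
}
```

With padding on both sides, the timing fields of neighboring entries are
more than 2*PAD_LENGTH bytes apart, so no aligned block of PAD_LENGTH
bytes can hold data belonging to two different processes.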

Finally, for some applications we provide less-optimized versions of the 
codes.  The less-optimized versions utilize data structures that lead to 
simpler implementations, but which do not allow for optimal data 
distribution (and can thus generate false-sharing).


REPORT:
-------

A report will be put together shortly describing the structure, function,
and performance characteristics of each application.  The report will be
similar to the original SPLASH report (see the original report for the 
issues discussed).  The report will provide quantitative data (for two
different cache line sizes) for characteristics such as working set size
and miss rates (local versus remote, etc.).  The report will also discuss
the cache behavior and synchronization behavior of the applications.  In
the meantime, each application directory has a README file that describes
how to run that application.  In addition, most applications have comments
in their headers describing how to run them.


README FILES:
-------------

Each application has an associated README file.  It is VERY important to
read these files carefully, as they discuss the important parameters to 
supply for each application, as well as other issues involved in running 
the programs.  In each README file, we discuss the impact of explicitly
distributing data on the Stanford DASH Multiprocessor.  Unless otherwise
specified, we assume that the default data distribution mechanism is 
through round-robin page allocation.


PROBLEM SIZES:
--------------

For each application, the README file describes a recommended base
problem size: one that is small enough to be simulated, yet not
unrealistically small for a machine with up to 64 processors.
For the purposes of studying algorithm performance, the parameters 
associated with each application can be varied.  However, for the 
purposes of comparing machine architectures, the README files describe 
which parameters can be varied, and which should remain constant (or at 
their default values) for comparability.  If the specific "base" 
parameters that are specified are not used, then results which are 
reported should explicitly state which parameters were changed, what 
their new values are, and address why they were changed.


CORE PROGRAMS:
--------------

Since the number of programs has increased over SPLASH, and since not
everyone may be able to use all the programs in a given study, we
identify some of the programs as "core" programs that should be used
in most studies for comparability.  In the currently available set, these
core programs include:

(1) Ocean Simulation
(2) Hierarchical Radiosity
(3) Water Simulation with Spatial data structure
(4) Barnes-Hut
(5) FFT
(6) Blocked Sparse Cholesky Factorization
(7) Radix Sort

The less optimized versions of the programs, when provided, should be
used only in addition to these.


MAILING LIST:
-------------

Please send a note to splash@mojave.stanford.edu if you have copied over 
the programs, so that we can put you on a mailing list for update reports.


AUTHORSHIP:
-----------

The applications provided in the SPLASH-2 suite were developed by a number
of people.  The report lists authors primarily responsible for the 
development of each application code.  The codes were made ready for 
distribution and the README files were prepared by Steven Cameron Woo and 
Jaswinder Pal Singh.


CODE CHANGES:
-------------

If modifications are made to the codes which improve their performance, 
we would like to hear about them.  Please send email to 
splash@mojave.stanford.edu detailing the changes.


UPDATE REPORTS:
---------------

Watch this file for information regarding changes to codes and additions
to the application suite.


CHANGES:
--------

10-21-94: Ocean code, contiguous partitions, line 247 of slave1.C changed 
          from

          t2a[0][0] = hh3*t2a[0][0]+hh1*psi[procid][1][0][0];
        
          to

          t2a[0][0] = hh3*t2a[0][0]+hh1*t2c[0][0];

          This change does not affect correctness; it is an optimization 
          that was performed elsewhere in the code but overlooked here.

11-01-94: Barnes, file code_io.C, line 55 changed from
          
          in_real(instr, tnow);

          to 
   
          in_real(instr, &tnow);

11-01-94: Raytrace, file main.C, lines 216-223 changed from

          if ((pid == 0) || (dostats))
          CLOCK(end);

          gm->partime[0] = (end - begin) & 0x7FFFFFFF;
          if (pid == 0) gm->par_start_time = begin;

/*      printf("Process %ld elapsed time %lu.\n", pid, lapsed); */

          }

          to

          if ((pid == 0) || (dostats)) {
            CLOCK(end);
            gm->partime[pid] = (end - begin) & 0x7FFFFFFF;
            if (pid == 0) gm->par_start_time = begin;
          }

11-13-94: Raytrace, file memory.C
 
          The use of the word MAIN_INITENV in a comment in memory.C causes 
          m4 to expand this macro, and some implementations may get confused 
          and generate the wrong C code.

11-13-94: Radiosity, file rad_main.C

          rad_main.C uses the macro CREATE_LITE.  All three instances of 
          CREATE_LITE should be changed to CREATE.

11-13-94: Water-spatial and Water-nsquared, file makefile

          makefiles were changed so that the compilation phases included the 
          CFLAGS options instead of the CCOPTS options, which did not exist.

11-17-94: FMM, file particle.C

          Comment regarding data distribution of particle_array data
          structure is incorrect.  Round-robin allocation should be used.

11-18-94: OCEAN, contiguous partitions, files main.C and linkup.C

          Eliminated a problem which caused non-doubleword aligned 
          accesses to doublewords for the uniprocessor case.

          main.C: Added lines 467-471:

          if (nprocs%2 == 1) {         /* To make sure that the actual data
                                          starts double word aligned, add an extra
                                          pointer */
            d_size += sizeof(double ***);
          }

          Added same lines in file linkup.C at line numbers 100 and 159.

07-30-95: RADIX has been changed.  A tree-structured parallel prefix 
          computation is now used instead of a linear one.
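For readers unfamiliar with the idea, the following is a generic sketch
of a tree-structured prefix sum (an up-sweep followed by a down-sweep).
It is not the RADIX code: it runs sequentially, requires the length to be
a power of two, and the name tree_prefix_sum is hypothetical.

```c
#include <stddef.h>

/* In-place exclusive prefix sum over a[0..n-1]; n must be a power of two.
 * The iterations of each inner loop are independent of one another, so
 * on a parallel machine each tree level could be split across processes,
 * giving about log2(n) dependent steps instead of n for a linear scan. */
void tree_prefix_sum(unsigned *a, size_t n)
{
    if (n == 0)
        return;

    /* Up-sweep: build partial sums from the leaves toward the root. */
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];

    /* The root receives the identity; the total is discarded. */
    a[n - 1] = 0;

    /* Down-sweep: push prefixes from the root back to the leaves. */
    for (size_t stride = n / 2; stride >= 1; stride /= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            unsigned left = a[i - stride];
            a[i - stride] = a[i];     /* left child gets parent's prefix */
            a[i] += left;             /* right child adds left subtree   */
        }
}
```

For example, the input 3 1 7 0 4 1 6 3 becomes 0 3 4 11 11 15 16 22.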

          LU has been modified.  A comment describing how to distribute
          data (one of the POSSIBLE ENHANCEMENTS) was incorrect for the 
          contiguous_blocks version of LU.  Also, a modification was made
          that reduces false sharing at line 206 of lu.C:

          last_malloc[i] = (double *) (((unsigned) last_malloc[i]) + PAGE_SIZE -
                           ((unsigned) last_malloc[i]) % PAGE_SIZE);
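The arithmetic in that line rounds a pointer up to the next page
boundary.  A minimal sketch of the same computation follows.  It uses
uintptr_t rather than unsigned, since unsigned can be narrower than a
pointer on 64-bit systems; the constant PAGE_BYTES, the use of malloc in
place of G_MALLOC, and both function names are assumptions made for the
example.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BYTES 4096u   /* assumed page size */

/* Same formula as the lu.C line: advance to the next page boundary.
 * An address that is already aligned moves forward by a whole page. */
uintptr_t next_page_boundary(uintptr_t addr)
{
    return addr + PAGE_BYTES - (addr % PAGE_BYTES);
}

/* Allocate `bytes` usable bytes starting on a page boundary.
 * One extra page is requested so the aligned region still fits.
 * The original pointer is returned through raw_out for free(). */
double *alloc_page_aligned(size_t bytes, void **raw_out)
{
    void *raw = malloc(bytes + PAGE_BYTES);
    if (raw == NULL)
        return NULL;
    *raw_out = raw;
    return (double *)next_page_boundary((uintptr_t)raw);
}
```

Because the formula always moves the pointer forward by between 1 and
PAGE_BYTES bytes, the allocation has to be that much larger than the data
it holds, which is what the 09-26-95 LU fix below provides.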

          A subdirectory shmem_files was added under the codes directory.
          This directory contains a file that can be compiled on SGI machines
          which replaces the libsgi.a file distributed in the original SPLASH
          release.

09-26-95: Fixed a bug in LU. Line 201 was changed from

            last_malloc[i] = (double *) G_MALLOC(proc_bytes[i]);

          to

            last_malloc[i] = (double *) G_MALLOC(proc_bytes[i] + PAGE_SIZE);

          Fixed similar bugs in WATER-NSQUARED and WATER-SPATIAL.  Both
          codes needed a barrier added into the mdmain.C files. In both
          codes, the line 
            
            BARRIER(gl->start, NumProcs);
          
          was added.  In WATER-NSQUARED, it was added in mdmain.C at line
          84.  In WATER-SPATIAL, it was added in mdmain.C at line 107.