Sanchayan Maity
0f4b39775c
During the last commit of splash2 benchmark it seems before committing when we ran "make clean", it effectively undid what the patch at below link did http://www.capsl.udel.edu/splash/Download.html Fix this since without this it is not possible to build the arcane splash2 benchmark. |
||
---|---|---|
.. | ||
fft.C | ||
fft.c | ||
libpthread.a | ||
m5op_x86.o | ||
Makefile | ||
README.fft |
GENERAL INFORMATION: The FFT program is a complex, one-dimensional version of the "Six-Step" FFT described in Bailey, D. H. FFTs in External or Hierarchical Memory. Journal of Supercomputing, 4(1):23-35, March 1990. Specific optimizations in this implementation include: (1) Performing staggered, blocked transposes that exploit cache-line reuse; (2) The roots of unity data structure is arranged and distributed for only local accesses during application of the roots of unity step; (3) A small set of roots of unity elements are replicated locally at each processor for computation of the 1D FFTs; and (4) The matrix data structures are padded to reduce cache mapping conflicts. The algorithm used in this implementation is described in: Woo, S. C., Singh, J. P., and Hennessy, J. L. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), October 1994. This program works under both the Unix FORK and SPROC models. RUNNING THE PROGRAM: To see how to run the program, please see the comment at the top of the file fft.C, or run the application with the "-h" command line option. Four command-line parameters MUST be specified: The number of points to transform, the number of processors, the log base 2 of the cache line size, and the number of cache lines. The number of complex data points to be transformed and the number of processors being used can be varied according to the following rules. The number of data points must be an even power of 2 (2**16, 2**18, etc). The command-line option "-mM" specifies the even power of two (M), so for 65,536 data points, the command line option is "-m16." The number of processors must be a power of 2. The main data structures are padded to reduce cache mapping conflicts. The constant PAGE_SIZE should be changed to reflect the page size (in bytes) of the target system. In addition, the program uses a blocking technique to exploit cache line reuse. Both the log base 2 of the cache line size and the number of cache lines in the cache should be specified at the command line in order to allow the blocking algorithm to work effectively. BASE PROBLEM SIZE: The base problem size for an upto-64 processor machine is 65,536 complex data points (M=16). The PAGE_SIZE constant should be changed to reflect the page size (in bytes) of the target system. In addition, the log base 2 of the cache line size, the number of cache lines, and the number of processors must be specified. DATA DISTRIBUTION: Our "POSSIBLE ENHANCEMENT" comments in the source code tell where one might want to distribute data and how. Data distribution has an impact on performance on the Stanford DASH multiprocessor.