Last Revision: Fri Aug 18 11:24:54 EDT 1995
Author's Email Address: jics@cs.utk.edu
Copyright (C) October 1991, Computational Science Education Project
Acknowledgments
Thanks to Maui High Performance Computing Center and Cornell High Performance Center for their on-line information and permission to use the documents and labs to promote education on the IBM SP2.
Thanks to Christian Halloy and Linda Mellott for critiquing and editing drafts of this guide.
Any mention of specific products, trademarks, or brand names
within this document is for identification purposes only.
All trademarks belong to their respective owners.
The purpose of this document is to give an overview of the IBM SP2 for beginning users. Since your access might be to any one of several different IBM SP2 machines, this document remains rather general and does not discuss machine-dependent features.
It presents some of the SP2's important features and capabilities, including concise information on hardware, software tools, and compilers that is essential for beginners who want to program in parallel.
The IBM SP2 is a Scalable POWERparallel2 system. POWER stands for Performance Optimized With Enhanced RISC, and RISC stands for Reduced Instruction Set Computer.
The SP2 consists of up to several hundred nodes, all of which are RISC System/6000 superscalar processors. (Superscalar means that more than one instruction can be performed per clock cycle.) The SP2 can be considered a distributed memory machine because each node has its own local memory.
When programming in parallel on the SP2, you can run programs in SPMD (single program, multiple data) mode or MIMD (multiple instruction, multiple data) mode.
First of all, you will want to get an account on an SP2 machine. Once you have that account, you can access the machine by typing telnet SP2machine_name.
If you are working on an X-Windows terminal and wish to create new windows or open X-Window based tools on your workstation, type xhost + in a window on your workstation. Then, once you are logged in to the SP2 machine, set your DISPLAY environment variable by typing
setenv DISPLAY workstation_name:0.0
(There are also other important environment variables which may
need to be set. These will be discussed later in this document.)
So, how can an SP2 be configured? There can be 4 to 16 processors per frame (or cabinet) and 1 to 16 frames per system. A frame can consist of BOTH thin and wide nodes. The main difference between thin and wide nodes is that wide nodes generally have the capacity for more memory and a larger memory cache. Standard SP2 systems can contain up to 256 nodes, and up to 512 nodes can be ordered from IBM by special request. (ORNL has a 16 node SP2, Maui has a 400 node SP2, and Cornell has a 512 node machine.) An optional high performance switch (HPS) for inter-processor communication is included on most systems. Such a switch can interconnect up to 16 nodes with up to 16 other nodes. The high performance switch has a 40 MB/sec peak bi-directional bandwidth, and more than one HPS may be included when the number of nodes in the SP2 machine is greater than 32.
The IBM SP2 is fairly robust in that it has concurrent maintenance, which simply means each processor node can be removed, repaired, and rebooted without interrupting operations on other nodes.
Table 1: Processor Nodes - Wide vs. Thin
Each node of the SP2 runs the AIX Operating System, IBM's version of UNIX. The SP2 has a Parallel Operating Environment, or POE, which is used for handling parallel tasks running on multiple nodes. (POE will be discussed further in the subsequent section.) The POE also provides a Message Passing Library (MPL), as well as some important tools for debugging, profiling, and system monitoring; these are also discussed later. Software available for system monitoring, performance profiling, and debugging includes the Visualization Tool (VT), the prof and gprof profilers, and the pdbx and xpdbx parallel debuggers. PVM and/or PVMe are often available on the IBM SP2, and some systems may even have the Message Passing Interface (MPI) and/or High Performance FORTRAN (HPF). Important math libraries may be available, and there may be software, such as FORGE90, to assist you in parallelizing codes. More information about what is available on the SP2 you are using can usually be found on the WWW pages for that machine.
The POE consists of parallel compiler scripts, POE environment variables, parallel debugger(s) and profiler(s), MPL, and parallel visualization tools. These tools allow one to develop, execute, profile, debug, and fine-tune parallel code.
A few important terms you may wish to know are Partition Manager, Resource Manager, and Processor Pools. The Partition Manager controls your partition, or group of nodes on which you wish to run your program. The Partition Manager requests the nodes for your parallel job, acquires the nodes necessary for that job (if the Resource Manager is not used), copies the executables from the initiating node to each node in the partition, loads executables on every node in the partition, and sets up standard I/O.
The Resource Manager keeps track of the nodes currently processing a parallel task and, when nodes are requested by the Partition Manager, it allocates nodes for use. The Resource Manager attempts to enforce a ``one parallel task per node" rule.
The Processor Pools are sets of nodes dedicated to a particular type of process (such as interactive, batch, I/O intensive) which have been grouped together by the system administrator(s).
For more information about the processor, or node, pools available on your system, simply type jm_status -P when on the SP2 machine.
There are many environment variables and command line flags that can be set to influence the operation of POE tools and the execution of parallel programs. A complete list of the POE environment variables can be found in the IBM ``AIX Parallel Environment Operation and Use" manual. Some environment variables which critically affect program execution are discussed below.
Information about available pools may be obtained by typing the command jm_status -P at the Unix prompt.
MP_HOSTFILE does not need to be set if the host list file is the default, host.list.
A host list file must be present if any of the following is true:
1) specific node allocation is required,
2) non-specific node allocation from a number of system pools is requested, or
3) a host list file named something other than the default host.list will be used.
You may wish to use a shell script to set the appropriate environment variables or you could set important environment variables in your .login file.
Some IBM SP2 systems will already have a .rhosts file set up for you when you receive your account. However, on other systems, you may have to set up your own .rhosts file. This file should include the names of all the nodes and switches you may ever want to use.
Make sure that the node you are logged onto is in your .rhosts file. If it is not, you will want to add that node into the .rhosts file to avoid problems.
A sample .rhosts file might look similar to this (but contain more lines):
sp2101.ccs.ornl.gov
hps101.ccs.ornl.gov
sp2102.ccs.ornl.gov
hps102.ccs.ornl.gov
Some systems require you to place your user_name after the node or switch name on each line. You will not need the dot extensions (the domain suffix) if you only wish to access nodes of that one machine. Ask your system support staff about the naming of the nodes on the system you wish to use.
The host file should contain a list of all the nodes (or pools of nodes, but not both) on which you wish to run your code. The first task will run on the first node or pool listed, the second task will run on the second node or pool listed, and so on. If you are using pools in the host file and do not list enough pools for all the tasks, the remaining tasks will use additional nodes within the last pool listed. If you are listing nodes, however, you must list at least as many nodes in the host file as there are tasks to run. You are allowed to repeat a node name within a host file; doing so will cause your program to run multiple tasks on one node.
The default host file is host.list, but you can change the MP_HOSTFILE environment variable to be some other file name. If you have decided to run your code on ONE pool and have set MP_HOSTFILE to Null and RM_POOL to the appropriate pool number, you do not need to have a host file.
A sample host file using nodes:
!This is a comment line
!Use an exclamation at the beginning of any comment
r25n09.tc.cornell.edu shared multiple
r25n10.tc.cornell.edu shared multiple
r25n11.tc.cornell.edu shared multiple
r25n12.tc.cornell.edu shared multiple
!
!Host nodes are named r25n09, r25n10, r25n11, and r25n12
!When using MPL, shared means you share with others.
!multiple means you allow multiple MPL tasks on one node.
!
!dedicated in place of shared or multiple means you do not want
!to share with other people's MPL tasks, or you do not want
!to allow multiple MPL tasks of your own on one node.
A sample host file using pools:
!This line is a comment
@0 shared multiple
@1 shared multiple
@3 shared multiple
@0 shared multiple
!0, 1, and 3 are the pool numbers.
!Again, shared means you share with others.
!multiple means you allow multiple tasks on one node.

In this example, one node is chosen from pool 0 by the Resource Manager for the first task, one node from pool 1 is chosen for the next task, one node from pool 3 is chosen for the following task, and the nodes for any remaining task(s) are chosen from pool 0.
The following compiler flags are available for both the Fortran (xlf, mpxlf) and C (cc, mpcc) compilers, in addition to the usual flags available for these compilers.
(If neither -ip nor -us is used at compile time, a CSS library will be dynamically linked with the executable at run time. This library is determined by the MP_EUILIB environment variable.)
There are two types of communication methods available on the IBM SP2: the User Space protocol (us) and the Internet Protocol (ip). The User Space protocol is much quicker, but does not allow the communicating nodes to be shared with other Message Passing Library (MPL) processes; it always uses the high performance switch. The Internet Protocol is slower, but allows the communicating nodes to be shared by other MPL processes. The Internet Protocol can be used over either the ethernet or the high performance switch (which, as you may have guessed, is quicker than using the ethernet).
MPL is the Message Passing Library designed by IBM for message passing over the tightly coupled, distributed memory system of the SP2. The programming model is MIMD (multiple instruction multiple data). Inter-task communication occurs through message passing. The number of tasks dedicated to run a parallel program in MPL is fixed at load time. Unlike PVM, new tasks cannot be ``spawned" during runtime.
Note: The programmer is explicitly responsible for identifying and implementing parallelism within each program in MPL.
MPL allows both process-to-process communication and collective (or global) communication. Process-to-process communication is used when two processes communicate with one another using sends and receives. Collective communication is used when one process communicates with several other processes at one time using broadcast, scatter, gather, or reduce operations.
The basic blocking send and receive operations are:
If you are using FORTRAN:
mp_bsend(msg_buffer, msglength, destination, tag)
mp_brecv(msg_buffer, max_length, source, tag, rec_length)

OR, if you are using C:
mpc_bsend(&msg_buffer, msglength, destination, tag)
mpc_brecv(&msg_buffer, max_length, source, tag, rec_length)

where msg_buffer is the buffer containing (or receiving) the message, msglength is the length of the outgoing message in bytes, destination is the task id of the receiving task, tag is the message type used to match sends with receives, max_length is the largest message length (in bytes) that can be received, source is the task id of the sending task, and rec_length is the number of bytes actually received.
Sends and receives can be blocking or non-blocking. Blocking operations wait until the send or receive has finished before continuing with other instructions; non-blocking operations continue with other instructions even if the send or receive has not yet finished. (Blocking operations may also be referred to as synchronous, and non-blocking operations as asynchronous.)
The non-blocking send and receive routines in FORTRAN are mp_send and mp_recv.
NOTE: The C version of each MPL command inserts a c between the mp and the _ of the FORTRAN name. Another MPL command which you may find useful is mp_environ, which returns the total number of tasks and the task id of the calling task.
For example, in Fortran, use
CALL mp_environ(totaltasks, task_id)
In C, use
rc = mpc_environ(&totaltasks, &task_id)

where rc is an integer return code.
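To see how these pieces fit together, below is a minimal sketch of an SPMD-style C program in which task 0 sends a single integer to task 1. It is only an illustration built from the call forms shown above; the header name mpproto.h, the exact argument types, and the passing of the last three mpc_brecv arguments by address are assumptions, so check the man pages on your machine before relying on them.

/* mpl_hello.c -- a hypothetical illustration, not part of the original labs */
#include <stdio.h>
#include <mpproto.h>                      /* assumed name of the MPL C header */

int main(void)
{
    int totaltasks, task_id;              /* filled in by mpc_environ */
    int value, source, tag, nbytes;

    mpc_environ(&totaltasks, &task_id);   /* how many tasks, and which one am I? */

    if (task_id == 0) {
        value = 42;
        /* task 0 sends one integer to task 1 with message tag 99 */
        mpc_bsend(&value, sizeof(value), 1, 99);
    } else if (task_id == 1) {
        source = 0;                       /* accept a message only from task 0 */
        tag    = 99;                      /* ... and only with tag 99          */
        mpc_brecv(&value, sizeof(value), &source, &tag, &nbytes);
        printf("task %d received %d (%d bytes)\n", task_id, value, nbytes);
    }
    return 0;
}

A program like this would be compiled with the mpcc script mentioned earlier and started under POE with the poe command; the exact compile and run flags vary from site to site, so consult your local documentation or the man pages.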
This is an X-Windows analysis tool, which allows a quick survey of the
utilization of processor nodes. You can get into the System Status Array
by typing
poestat&
Within this System Status Array, each node is represented
by a colored square whose color reflects that node's current utilization.
Remember, if you want the status of all active MPL jobs running on the nodes, the command is jm_status -j. This could be lengthy if many jobs are running on the SP2 at the time.
The Visualization Tool (VT) is a graphical user interface which enables you to perform both trace visualization and (real time) performance monitoring within IBM's Parallel Environment. Note that this tool is only useful for monitoring and visualizing MPL jobs.
You can get into VT by typing
vt &
OR by typing vt tracefile &
where tracefile is the name of a previously
created tracefile.
Many ``views" are available under trace visualization for examining Computation, Communication, System, Network, and Disk utilization. All of these ``views" except Communication are also available under performance monitoring. The ``views" can take the form of pie charts, bar charts, and grids, and there is one 3-D representation of processor CPU utilization.
More information about VT can be found using the InfoExplorer. Also, the Maui High Performance Computing Center has some very detailed information available on the World Wide Web. Their URL is listed in the ``More Information" section of this document.
The LoadLeveler is a batch scheduling system available through IBM for the SP2, which provides the facility for building, submitting and processing serial or parallel (PVM and MPL) batch jobs within a network of machines. LoadLeveler scheduling matches user-defined job requirements with the best available machine resources. A user is allowed to load level his or her own jobs only.
The entire collection of machines available for LoadLeveler scheduling is called a ``pool". Note that this is NOT the same as the processor, or node pools discussed earlier. The LoadLeveler ``pool" is the group of nodes which the LoadLeveler manages. On Cornell's SP2 these nodes are the nodes in the subpools called ``batch". (A subpool is a smaller division of a larger node pool.) On some smaller machines, such as the 16 node SP2 at ORNL, the LoadLeveler pool includes every node on the machine.
Every machine in the pool has one or more LoadLeveler daemons running on it.
The LoadLeveler pool has one Central Manager (CM) machine, whose principal function is to coordinate LoadLeveler-related activities on all machines in the pool. The CM maintains status information on all machines and jobs and decides where jobs should run. If the Central Manager machine goes down, job information is not lost: jobs executing on other machines continue to run, jobs waiting to run will start once the CM is restarted, and new jobs may still be submitted from other machines (such jobs will be dispatched when the Central Manager is restarted). Normally, users do not even need to know about the Central Manager.
Other machines in the pool may be used to submit jobs, execute jobs, and/or schedule submitted jobs (in cooperation with the Central Manager).
Every LoadLeveler job must be defined in a job command file, whose filename ends in .cmd. Only after defining a job command file may a user submit the job for scheduling and execution.
A job command file, such as Sample1.cmd in the following examples, can be submitted to the LoadLeveler by typing
llsubmit Sample1.cmd
Lines in a .cmd file that begin with a # not followed by
a @
are considered
comment lines, which the LoadLeveler ignores. Lines that begin with
a # followed by a @
(even if these two symbols are separated
by several spaces) are considered to be command lines for the LoadLeveler.
Listed below are three sample .cmd files.
Sample1.cmd is a simple job command file which submits the serial
job pi
in the ~jsmith/labs/poe/C
subdirectory once.
Sample2.cmd submits that
same serial job (in the same directory) four different times,
most likely on four different SP2 nodes.
Sample3.cmd is a script
.cmd file which submits a parallel job.
Sample1.cmd:

#The executable is ~/labs/poe/C/pi in user jsmith's home directory
#
#The serial job is submitted just one time
#
# @ executable = /user/user14/jsmith/labs/poe/C/pi
# @ input = /dev/null
# @ output = sample1.out
# @ error = sample1.err
# @ notification = complete
# @ checkpoint = no
# @ restart = no
# @ requirements = (Arch == "R6000") && (OpSys == "AIX32")
# @ queue

Sample2.cmd:

#The executable is ~/labs/poe/C/pi in user jsmith's home directory
#
#This submits the serial pi job four times by listing "queue" four times.
# Starting on August 18, 1995 at 4:35 PM
#
#@ executable = /user/user14/jsmith/labs/poe/C/pi
#@ input = /dev/null
#@ output = sample2.out
#@ error = sample2.err
#@ startdate = 16:35 08/18/95
#@queue
#@queue
#@queue
#@queue

Sample3.cmd:

#!/bin/csh
#The executable is ~/labs/poe/C/pi_reduce in jsmith's home directory
#
#This time, a script command file is used to submit a parallel job
#
#@ job_name = pi_reduce
#@ output = sample3.out
#@ error = sample3.err
#@ job_type = parallel
#@ requirements = (Adapter == "hps_user")
#@ min_processors = 4
#@ max_processors = 4
#@ environment = MP_INFOLEVEL=1;MP_LABELIO=yes
#@ notification = complete
#@ notify_user = jsmith@cs.utk.edu
# This sends e-mail to jsmith@cs.utk.edu when the job completes
#@ queue
echo $LOADL_PROCESSOR_LIST >! sample3.hosts
/usr/lpp/poe/bin/poe /user/user14/jsmith/labs/poe/C/pi_reduce
Script .cmd files similar to the Sample3.cmd file shown here are necessary for parallel jobs. After the LoadLeveler processes its command lines, the rest of the script file is run as the executable.
For more information on LoadLeveler and XLoadLeveler commands and .cmd file command lines, try browsing through the information available on the InfoExplorer. (The InfoExplorer is discussed in the Additional Information section of this SP2 guide.)
One can get into the XLoadLeveler, the X-Windows version of
the LoadLeveler, while on the SP2 by typing
xloadl&
A large window sectioned into three parts will appear on the screen. The three sub-windows are the Jobs window, the Machines window, and the Messages window.
The Jobs Window will list the current jobs that have been submitted to LoadLeveler, whether they are running or not. This window also allows one to build jobs, prioritize them, cancel or hold them, and to display each job along with its status. A newly built job can be saved into a .cmd file for further use, such as job submission.
The Machines Window lists the machines, or nodes, available to the LoadLeveler CM. From this window, you can also create jobs, prioritize them, stop, or hold them. Jobs can be displayed by the machines they run on.
The Messages Window gives information on LoadLeveler activities. (Each activity is time-stamped.)
The best way to learn more about this X-Windows tool is to actually get into the XLoadLeveler and try it. The more you practice using a tool such as the XLoadLeveler, the better you will understand what the tool is capable of doing.
The InfoExplorer is a powerful Window-based information tool which allows the user to access some useful reference manuals for the SP2.
To get information specific to the parallel environment,
type
info -l pe &
The -l option allows one to look into a particular library available on the InfoExplorer. Other libraries may also be interesting to view. Two of these are the mpxlf and mpcc compiler libraries.
To get general information, type info &.
Again, the best way to learn this, or any other, X-based tool is to actually get into the tool and use it.
Here are some URLs that may also prove useful:
* http://www.mhpcc.edu/training/workshop/html/workshop.html
http://www.tc.cornell.edu/Edu/Talks/Education.html
http://www.tc.cornell.edu/Edu/Education.and.Training.Materials/sp.html
http://ibm.tc.cornell.edu/ibm/pps/otherwww.html
http://www.mcs.anl.gov/Projects/sp/index.html
http://wwwsp2.cern.ch/Welcome.html
http://www.qpsf.edu.au/sites/sites.html
http://www.ornl.gov/olc/SP2/SP2.html
http://math.nist.gov/acmd/Staff/KRemington/Primer/tutorial.html
http://ibm.tc.cornell.edu/ibm/pps/doc/primer/
http://lscftp.kgn.ibm.com/pps/aboutsp2/sp2sys.html
** http://lscftp.kgn.ibm.com/pps/aboutsp2/sp2new.html
* Document on which this is mainly based.
** Where I learned some useful background information.
``IBM AIX Parallel Environment -
Parallel Programming Subroutine Reference"
``IBM AIX Parallel Environment - Operations and Use"
``Message Passing Libraries on the SP1/SP2." Presentation materials
from the Cornell Theory Center, Ithaca, New York.
Man pages for the SP1/SP2 on the machine. (Use the C language name for the routine when looking for the FORTRAN version.)