Stephanie Wolf
Joint Institute for Computational Science
University of Tennessee
Knoxville, TN 37996--1301, USA

Last Revision: Fri Aug 18 11:24:54 EDT 1995
Author's Email Address: jics@cs.utk.edu

A Beginner's Guide to the IBM SP2

Copyright (C) October 1991, Computational Science Education Project


Acknowledgments

Thanks to Maui High Performance Computing Center and Cornell High Performance Center for their on-line information and permission to use the documents and labs to promote education on the IBM SP2.

Thanks to Christian Halloy and Linda Mellott for critiquing and editing drafts of this guide.


Any mention of specific products, trademarks, or brand names within this document is for identification purposes only. All trademarks belong to their respective owners.


Foreword

The purpose of this document is to give an overview of the IBM SP2 for beginning users. Since your access to an IBM SP2 might be on any one of several IBM SP2 machines, this document is rather general and does not discuss machine-dependent features.

Some of the SP2's important features and capabilities are presented, including concise information on hardware, software tools, and compilers that is essential for beginners who want to program in parallel.

Introduction

The IBM SP2 is a Scalable POWERparallel2 system. POWER stands for Performance Optimized With Enhanced RISC, and RISC stands for Reduced Instruction Set Computer.

The SP2 consists of up to several hundred nodes, each of which is a RISC System/6000 superscalar processor. (Superscalar means that more than one instruction can be performed per clock cycle.) The SP2 can be considered a distributed memory machine because each node has its own local memory.

When programming in parallel on the SP2, you can run programs in SPMD (single program, multiple data) mode or MIMD (multiple instruction, multiple data) mode.

Accessing the IBM SP2

First of all, you will want to get an account on an SP2 machine. Once you have that account, you can access the machine by typing telnet SP2machine_name.

If you are working on an X-Windows terminal and wish to create new windows or open X-Window based tools on your workstation, type xhost + in a window on your workstation. Then, once you are logged in to the SP2 machine, you will want to set your DISPLAY environment variable by typing

setenv DISPLAY workstation_name:0.0
(There are also other important environment variables which may need to be set. These will be discussed later in this document.)
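
For example, a complete login sequence might look like the following, where mysun.cs.utk.edu and sp2.jics.utk.edu are hypothetical workstation and SP2 names that you would replace with your own:

   mysun% xhost +
   mysun% telnet sp2.jics.utk.edu
   ...
   sp2% setenv DISPLAY mysun.cs.utk.edu:0.0

(The setenv syntax shown here assumes the csh or tcsh shell.)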

IBM SP2 Hardware

So, how can an SP2 be configured? There can be 4 to 16 processors per frame (or cabinet) and 1 to 16 frames per system. Frames can contain BOTH thin and wide nodes. The main difference between thin and wide nodes is that wide nodes generally have the capacity for larger memory and a larger memory cache. Standard SP2 systems can contain up to 256 nodes, and up to 512 nodes can be ordered from IBM by special request. (ORNL has a 16 node SP2, Maui has a 400 node SP2, and Cornell has a 512 node machine.) An optional high performance switch (HPS) for inter-processor communication is included on most systems. Such a switch can interconnect up to 16 nodes with up to 16 other nodes, and it has a 40 MB/sec peak bi-directional bandwidth. More than one HPS may be included when the number of nodes in the SP2 machine is greater than 32.

The IBM SP2 is fairly robust in that it has concurrent maintenance, which simply means each processor node can be removed, repaired, and rebooted without interrupting operations on other nodes.

  
Table 1: Processor Nodes - Wide vs. Thin

IBM SP2 Software

Each node of the SP2 runs the AIX Operating System, IBM's version of UNIX. The SP2 has a Parallel Operating Environment, or POE, which is used for handling parallel tasks running on multiple nodes. (POE will be discussed further in the subsequent section.) The POE also provides a Message Passing Library (MPL), as well as some important tools for debugging, profiling, and system monitoring. These shall also be discussed later. Software available for system monitoring, performance profiling, and debugging includes the Visualization Tool (VT), the prof and gprof profilers, and the pdbx and xpdbx parallel debuggers. Often, PVM and/or PVMe are available on the IBM SP2. Some systems may even have the Message Passing Interface (MPI) and/or High Performance FORTRAN (HPF) available, important math libraries may be installed, and there may be software, such as FORGE90, to assist you in parallelizing codes. More information about what is available on the SP2 you are using can usually be found on the WWW pages for that machine.

Parallel Operating Environment (POE)

The POE consists of parallel compiler scripts, POE environment variables, parallel debugger(s) and profiler(s), MPL, and parallel visualization tools. These tools allow one to develop, execute, profile, debug, and fine-tune parallel code.

A few important terms you may wish to know are Partition Manager, Resource Manager, and Processor Pools. The Partition Manager controls your partition, or group of nodes on which you wish to run your program. The Partition Manager requests the nodes for your parallel job, acquires the nodes necessary for that job (if the Resource Manager is not used), copies the executables from the initiating node to each node in the partition, loads executables on every node in the partition, and sets up standard I/O.

The Resource Manager keeps track of the nodes currently processing a parallel task and, when nodes are requested by the Partition Manager, it allocates nodes for use. The Resource Manager attempts to enforce a ``one parallel task per node" rule.

The Processor Pools are sets of nodes dedicated to a particular type of process (such as interactive, batch, I/O intensive) which have been grouped together by the system administrator(s).

For more information about the processor, or node, pools available on your system, simply type jm_status -P when on the SP2 machine.

Important Environment Variables

There are many environment variables and command line flags that can be set to influence the operation of POE tools and the execution of parallel programs. A complete list of the POE environment variables can be found in the IBM ``AIX Parallel Environment Operation and Use" manual. Some environment variables which critically affect program execution are listed below.

MP_PROCS - sets number of processes, or tasks, to allocate for your partition. Often one process, or task, is assigned to each node, but you can also run multiple tasks on one node.

MP_RESD - specifies whether or not the Partition Manager should connect to the POWERparallel system Resource Manager to allocate nodes. If the value is set to ``no,'' the Partition Manager will allocate nodes without going through the Resource Manager.

MP_RMPOOL - sets the number of the POWERparallel system pool that should be used by the Resource Manager for non-specific node allocation. This is only valid if you are using the POWERparallel Resource Manager for non-specific node allocation (from a single pool) without a host list file.

Information about available pools may be obtained by typing the command jm_status -P at the Unix prompt.

MP_HOSTFILE - specifies the name of a host file for node allocation. This can be any file name, NULL, or an empty string (`` "). The default host list file is host.list in the current directory.

MP_HOSTFILE does not need to be set if the host list file is the default, host.list.

A host list file must be present if either:

1) specific node allocation is required,
2) non-specific node allocation from a number of system pools is requested, or
3) a host list file named something other than the default host.list will be used.

MP_SAVEHOSTFILE - names the output host file to be generated by the partition manager. This designated file will contain the names of the nodes on which your parallel program actually ran.

MP_RETRY - specifies the period of time (in seconds) between allocation retries if not enough processors are available.

MP_RETRYCOUNT - specifies the number of times the partition manager should attempt to allocate processor nodes before returning without having run your program.

MP_EUILIB - specifies which Communication SubSystem (CSS) library implementation to use for communication. Set to ip (for Internet Protocol CSS) or us (for User Space CSS, which allows one to drive the high-performance switch directly from parallel tasks without going through the kernel or operating system).

MP_EUIDEVICE - sets the adaptor used for message passing: en0 (Ethernet), fi0 (FDDI), tr0 (token-ring), or css0 (the high-performance switch adaptor). [Note: This variable is ignored if the US CSS library is used.]

MP_EUIDEVELOP - specifies whether MPL should do more detailed checking during program execution. (This takes a value of yes or no.)

MP_TRACEFILE - specifies the name of the trace file to be created during the execution of a program in which tracing commands appear within the code.

MP_TRACELEVEL - sets the level of VT tracing (0 = no trace records, 9 = all trace records on; 1, 2, and 3 each turn on a different combination of trace records).

MP_INFOLEVEL - sets the amount of diagnostic information your program displays as it runs (0 through 5, with 0 giving the least information).

You may wish to use a shell script to set the appropriate environment variables, or you could set the most important ones in your .login file.
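
For example, a minimal csh fragment (for a shell script or your .login file) that sets several of these variables might look like the following; the values shown are only placeholders and should be adjusted for your system:

   setenv MP_PROCS 4          # run four tasks
   setenv MP_RESD yes         # let the Resource Manager allocate the nodes
   setenv MP_RMPOOL 0         # take nodes from pool 0
   setenv MP_HOSTFILE NULL    # no host list file; allocate from the pool
   setenv MP_EUILIB us        # User Space communication over the switch
   setenv MP_INFOLEVEL 1      # a small amount of diagnostic output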

.rhosts file

Some IBM SP2 systems will already have a .rhosts file set up for you when you receive your account. However, on other systems, you may have to set up your own .rhosts file. This file should include the names of all the nodes and switches you may ever want to use.

Make sure that the node you are logged onto is in your .rhosts file. If it is not, you will want to add that node into the .rhosts file to avoid problems.

A sample .rhosts file might look similar to this (but contain more lines):

sp2101.ccs.ornl.gov 
hps101.ccs.ornl.gov 
sp2102.ccs.ornl.gov 
hps102.ccs.ornl.gov

Some systems require you to place your user_name on each line after the node or switch name. You do not need the domain extensions if you only wish to access nodes of that one machine. Ask your system administrators about the naming of the nodes on the system you wish to use.

host.list file

The host file should contain a list of all the nodes (or pools of nodes, but not both) on which you wish to run your code. The first task will run on the first node or pool listed, the second task will run on the second node or pool listed, and so on. If you are using pools in the host file and do not have enough pools listed for all the tasks, the last tasks will use additional nodes within the last pool listed. However, if you are listing nodes, you must list at least as many nodes in the host file as tasks you wish to run. You are allowed to repeat a node name within a host file; doing so will cause your program to run multiple tasks on one node.

The default host file is host.list, but you can change the MP_HOSTFILE environment variable to some other file name. If you have decided to run your code on ONE pool and have set MP_HOSTFILE to NULL and MP_RMPOOL to the appropriate pool number, you do not need a host file.

A sample host file using nodes:

!This is a comment line 
!Use an exclamation at the beginning of any comment
r25n09.tc.cornell.edu shared multiple
r25n10.tc.cornell.edu shared multiple
r25n11.tc.cornell.edu shared multiple
r25n12.tc.cornell.edu shared multiple
!
!Host nodes are named r25n09, r25n10, r25n11, and r25n12
!When using MPL, shared means you share with others.
!multiple means you are allowing multiple MPL tasks on one node. 
!
!dedicated in place of shared or multiple means you do not want 
!to share with other people's MPL tasks, or you do not want 
!to allow multiple MPL tasks of your own on one node.

A sample host file using pools:

!This line is a comment
@0 shared multiple
@1 shared multiple
@3 shared multiple
@0 shared multiple
!0, 1, and 3 are the pool numbers.  
!Again, shared means you share with others.
!multiple means you allow multiple tasks on one node.

In this example, the Resource Manager chooses one node from pool 0 for the first task, one node from pool 1 for the next task, one node from pool 3 for the following task, and nodes from pool 0 for any remaining task(s).

Compiler Options

The following compiler flags are available for both the Fortran (xlf, mpxlf) and C (cc, mpcc) compilers, in addition to the usual flags available for these compilers.

-p or -pg provides information necessary for the use of the profilers prof or gprof, respectively.

-g makes the compiled program suitable for debugging with pdbx or xpdbx. This option is also necessary to use the Source Code view in the Visualization Tool, vt.

-O optimizes the output code. (The -O can be followed by an optimization level: -O2, for example.) Remember, optimize only if NOT debugging. (One cannot use -O with -p, -pg, or -g.)

-ip causes the IP CSS library to be statically bound with the executable. Communication during execution will use the Internet Protocol.

-us causes the US CSS library to be statically bound with the executable. Communication will occur over the high-performance switch.

(If neither -ip nor -us is used at compile time, a CSS library will be dynamically linked with the executable at run time. This library is determined by the MP_EUILIB environment variable.)
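
For example (the program and file names here are only placeholders), a Fortran program could be compiled for debugging, or a C program compiled with optimization and the US library, as follows:

   mpxlf -g -o myprog myprog.f
   mpcc -O2 -us -o myprog myprog.c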

Communication Methods

There are two communication methods available on the IBM SP2: the User Space protocol (us) and the Internet Protocol (ip). The User Space protocol is much quicker, but it does not allow one to share the communicating nodes with other Message Passing Library (MPL) processes; it always uses the high performance switch. The Internet Protocol is slower, but it allows communicating nodes to be shared by other MPL processes. One can use the Internet Protocol over the Ethernet or over the high performance switch (which, as you may have guessed, is quicker than using the Ethernet).
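
For example, to use the Internet Protocol over the high performance switch rather than over the Ethernet, you might set (csh syntax):

   setenv MP_EUILIB ip
   setenv MP_EUIDEVICE css0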

Message Passing Library (MPL)

MPL is the Message Passing Library designed by IBM for message passing over the tightly coupled, distributed memory system of the SP2. The programming model is MIMD (multiple instruction multiple data). Inter-task communication occurs through message passing. The number of tasks dedicated to run a parallel program in MPL is fixed at load time. Unlike PVM, new tasks cannot be ``spawned" during runtime.

Note: The programmer is explicitly responsible for identifying and implementing parallelism within each program in MPL.

MPL allows both process-to-process communication and collective (or global) communication. Process-to-process communication is used when two processes communicate with one another using sends and receives. Collective communication is used when one process communicates with several other processes at one time using broadcast, scatter, gather, or reduce operations.

The basic blocking send and receive operations are:
If you are using FORTRAN:

     mp_bsend(msg_buffer, msglength, destination, tag)
     mp_brecv(msg_buffer, max_length, source, tag, rec_length)
OR, if you are using C:
     mpc_bsend(&msg_buffer, msglength, destination, tag)
     mpc_brecv(&msg_buffer, max_length, source, tag, rec_length)
where
msg_buffer - memory buffer containing the data to be sent or received. (in FORTRAN, this is simply the name of the buffer; in C, it is the buffer's corresponding address.)

msglength - message length (in bytes)

max_length - expected length of receive buffer

rec_length - number of bytes actually received

destination - process id number of the process, or task, to which the message is being sent

source - process, or task, id number of the source from which you want to receive a message

tag - user-defined non-negative integer used to identify the messages transferred (perhaps you want one process to receive only messages tagged with 12 at some point in the program execution).

Sends and receives can be blocking or non-blocking. Blocking operations wait until the send or receive has finished before continuing with other instructions. Non-blocking operations continue with other instructions even when the send or receive has not yet finished. (Blocking operations may also be referred to as synchronous; non-blocking operations are also called asynchronous.)

The non-blocking send and receives in FORTRAN are mp_send and mp_recv.

NOTE: The C versions of the MPL commands insert a c between the mp and the _ of the FORTRAN names (mp_send becomes mpc_send, for example). Other MPL commands (in FORTRAN) which you may find useful are:

mp_environ returns the total number of tasks and the task's own id number.

For example, in Fortran, use

CALL mp_environ(totaltasks, task_id)

In C, use

return_code = mpc_environ(&totaltasks, &task_id);

mp_probe checks whether a message has arrived yet,
mp_status provides a non-blocking check of the status of a specified non-blocking send or receive, and
mp_wait blocks to ``wait" for a non-blocking send or receive to finish.
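
To tie these calls together, here is a minimal sketch (in C) of a two-task program in which task 0 sends one integer to task 1. This sketch is not taken from the IBM manuals: the header name mpproto.h and the convention of passing the source, tag, and received-length arguments of mpc_brecv by address are assumptions that should be checked against the ``Parallel Programming Subroutine Reference".

#include <stdio.h>
#include <mpproto.h>   /* assumed name of the MPL C prototypes header */

int main(void)
{
    int ntasks, taskid;            /* filled in by mpc_environ */
    int msg, nbytes;
    int source, tag;

    mpc_environ(&ntasks, &taskid); /* total number of tasks and this task's id */

    if (taskid == 0) {
        msg = 42;
        tag = 12;
        mpc_bsend(&msg, sizeof(msg), 1, tag);                  /* blocking send to task 1 */
    } else if (taskid == 1) {
        source = 0;
        tag = 12;
        mpc_brecv(&msg, sizeof(msg), &source, &tag, &nbytes);  /* blocking receive from task 0 */
        printf("task %d received %d (%d bytes)\n", taskid, msg, nbytes);
    }
    return 0;
}

Such a program would be compiled with mpcc and run under POE with MP_PROCS set to 2.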

Many other MPL operations are available. You may find more information on those, as well as the appropriate syntax for the above-listed commands, in IBM's ``AIX Parallel Environment - Parallel Programming Subroutine Reference", or try the InfoExplorer (discussed in the Additional Information section of this document).

System Status Array

This is an X-Windows analysis tool, which allows a quick survey of the utilization of processor nodes. You can get into the System Status Array by typing

poestat &

Within this System Status Array, each node is represented by a colored square:

Pink squares
- nodes with low or no utilization.
Yellow squares
- nodes with higher utilization.
Green frames within the squares
- nodes with running parallel MPL processes.
Grey squares
- nodes not available for monitoring.

To the right appears a list of node names; nodes are listed in order from left to right and from top to bottom.

Remember, if you want the status of all active MPL jobs running on the nodes, the command is jm_status -j. This could be lengthy if many jobs are running on the SP2 at the time.

Visualization Tool (VT)

The Visualization Tool (VT) is a graphical user interface which enables you to perform both trace visualization and (real time) performance monitoring within IBM's Parallel Environment. Note that this tool is only useful in the monitoring and visualization of MPL jobs.

You can get into VT by typing
vt &

OR by typing vt tracefile &
where tracefile is the name of a previously created tracefile.

Under trace visualization, many ``views" are available for examining Computation, Communication, System, Network, and Disk utilization. All of these ``views" except Communication are also available under performance monitoring. These ``views" can take the form of pie charts, bar charts, and grids, and there is one 3-D representation of processor CPU utilization.

More information about VT can be found using the InfoExplorer. Also, the Maui High Performance Computing Center has some very detailed information available on the World Wide Web. Their URL is listed in the ``More Information" section of this document.

LoadLeveler

The LoadLeveler is a batch scheduling system available through IBM for the SP2, which provides the facility for building, submitting and processing serial or parallel (PVM and MPL) batch jobs within a network of machines. LoadLeveler scheduling matches user-defined job requirements with the best available machine resources. A user is allowed to load level his or her own jobs only.

LoadLeveler Overview

The entire collection of machines available for LoadLeveler scheduling is called a ``pool". Note that this is NOT the same as the processor, or node pools discussed earlier. The LoadLeveler ``pool" is the group of nodes which the LoadLeveler manages. On Cornell's SP2 these nodes are the nodes in the subpools called ``batch". (A subpool is a smaller division of a larger node pool.) On some smaller machines, such as the 16 node SP2 at ORNL, the LoadLeveler pool includes every node on the machine.

Every machine in the pool has one or more LoadLeveler daemons running on it.

The LoadLeveler pool has one Central Manager (CM) machine, whose principal function is to coordinate LoadLeveler related activities on all machines in the pool. The CM maintains status information on all machines and jobs, and it makes decisions about where jobs should run. If the Central Manager machine goes down, job information is not lost: jobs executing on other machines will continue to run, jobs waiting to run will start once the CM is restarted, and new jobs may continue to be submitted from other machines (such jobs will be dispatched when the Central Manager is restarted). Normally, users do not even need to know about the Central Manager.

Other machines in the pool may be used to submit jobs, execute jobs, and/or schedule submitted jobs (in cooperation with the Central Manager).

Every LoadLeveler job must be defined in a job command file, whose filename ends in .cmd. Only after defining a job command file may a user submit the job for scheduling and execution.

A job command file, such as Sample1.cmd in the following examples, can be submitted to the LoadLeveler by typing

llsubmit Sample1.cmd

Lines in a .cmd file that begin with a # not followed by a @ are considered comment lines, which the LoadLeveler ignores. Lines that begin with a # followed by a @ (even if these two symbols are separated by several spaces) are considered to be command lines for the LoadLeveler.

Listed below are three sample .cmd files.

Sample1.cmd is a simple job command file which submits the serial job pi in the ~jsmith/labs/poe/C subdirectory once. Sample2.cmd submits that same serial job (in the same directory) four different times, most likely on four different SP2 nodes. Sample3.cmd is a script .cmd file which submits a parallel job.

Sample1.cmd:

#The executable is ~/labs/poe/C/pi in user jsmith's home directory
#
#The serial job is submitted just one time
#
# @ executable = /user/user14/jsmith/labs/poe/C/pi
# @ input = /dev/null
# @ output = sample1.out
# @ error = sample1.err
# @ notification = complete
# @ checkpoint = no
# @ restart = no
# @ requirements = (Arch == "R6000") && (OpSys == "AIX32")
# @ queue

Sample2.cmd:

#The executable is ~/labs/poe/C/pi in user jsmith's home directory
#
#This submits the serial pi job four times by listing "queue" four times.
#  Starting on August 18, 1995 at 4:35 PM  
#
#@ executable = /user/user14/jsmith/labs/poe/C/pi
#@ input = /dev/null
#@ output = sample2.out
#@ error = sample2.err
#@ startdate = 16:35 08/18/95 
#@queue
#@queue
#@queue
#@queue

Sample3.cmd:

#!/bin/csh
#The executable is ~/labs/poe/C/pi_reduce in jsmith's home directory
#
#This time, a script command file is used to submit a parallel job
#
#@ job_name        = pi_reduce
#@ output          = sample3.out
#@ error           = sample3.err
#@ job_type        = parallel
#@ requirements    = (Adapter == "hps_user")
#@ min_processors  = 4
#@ max_processors  = 4
#@ environment     = MP_INFOLEVEL=1;MP_LABELIO=yes
#@ notification    = complete
#@ notify_user     = jsmith@cs.utk.edu
# This sends e-mail to jsmith@cs.utk.edu once the job has been submitted
#@ queue
echo $LOADL_PROCESSOR_LIST >! sample3.hosts
/usr/lpp/poe/bin/poe /user/user14/jsmith/labs/poe/C/pi_reduce

Script .cmd files similar to the Sample3.cmd file shown here are necessary for parallel jobs. After the LoadLeveler processes its command lines, the rest of the script file is run as the executable.
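
Once a job has been submitted, other LoadLeveler commands can be used to follow it. For example, llq lists the status of queued and running jobs, and llcancel cancels a job given its LoadLeveler job id (the id shown here is only a placeholder):

   llq
   llcancel sp2101.42.0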

For more information on LoadLeveler and XLoadLeveler commands and .cmd file command lines, try browsing through the information available on the InfoExplorer. (The InfoExplorer is discussed in the Additional Information section of this SP2 guide.)

XLoadLeveler Overview

One can get into the XLoadLeveler, the X-Windows version of the LoadLeveler, while on the SP2 by typing
xloadl&

A large window sectioned into three parts will appear on the screen. The three sub-windows are the Jobs window, the Machines window, and the Messages window.

The Jobs Window will list the current jobs that have been submitted to LoadLeveler, whether they are running or not. This window also allows one to build jobs, prioritize them, cancel or hold them, and to display each job along with its status. A newly built job can be saved into a .cmd file for further use, such as job submission.

The Machines Window lists the machines, or nodes, available to the LoadLeveler CM. From this window, you can also create jobs, prioritize them, and stop or hold them. Jobs can be displayed by the machines they run on.

The Messages Window gives information on LoadLeveler activities. (Each activity is time-stamped.)

The best way to learn more about this X-Windows tool is to actually get into the XLoadLeveler and try it. The more you practice using a tool such as the XLoadLeveler, the better you will understand what the tool is capable of doing.

Additional Information on the IBM SP2

InfoExplorer

The InfoExplorer is a powerful Window-based information tool which allows the user to access some useful reference manuals for the SP2.

To get information specific to the parallel environment, type
info -l pe &

The -l option allows one to look into a particular library available on the InfoExplorer. Other libraries may also be interesting to view. Two of these are the mpxlf and mpcc compiler libraries.

To get general information, type info &.

Again, the best way to learn this, or any other, X-based tool is to actually get into the tool and use it.

Information Available on the World Wide Web

Here are some URLs that may also prove useful:

* http://www.mhpcc.edu/training/workshop/html/workshop.html
http://www.tc.cornell.edu/Edu/Talks/Education.html
http://www.tc.cornell.edu/Edu/Education.and.Training.Materials/sp.html
http://ibm.tc.cornell.edu/ibm/pps/otherwww.html
http://www.mcs.anl.gov/Projects/sp/index.html
http://wwwsp2.cern.ch/Welcome.html
http://www.qpsf.edu.au/sites/sites.html
http://www.ornl.gov/olc/SP2/SP2.html
http://math.nist.gov/acmd/Staff/KRemington/Primer/tutorial.html
http://ibm.tc.cornell.edu/ibm/pps/doc/primer/
http://lscftp.kgn.ibm.com/pps/aboutsp2/sp2sys.html
** http://lscftp.kgn.ibm.com/pps/aboutsp2/sp2new.html

* Document on which this is mainly based.
** Where I learned some useful background information.

Typewritten Manuals and man Pages

``IBM AIX Parallel Environment - Parallel Programming Subroutine Reference"

``IBM AIX Parallel Environment - Operations and Use"

``Message Passing Libraries on the SP1/SP2." Presentation materials from the Cornell Theory Center, Ithaca, New York.

Man pages for the SP1/SP2 on the machine. (Use the C language name for the routine when looking for the FORTRAN version.)
