IBM LoadLeveler for AIX 5L: Using and Administering
LoadLeveler overview
Overview summary
What is LoadLeveler?
LoadLeveler basics
How LoadLeveler works
Network job management and job scheduling systems
How LoadLeveler schedules jobs
LoadLeveler daemons
The LoadLeveler job cycle
Consumable resources
LoadLeveler interfaces
Interface summary
LoadLeveler command line interface
Summary of LoadLeveler commands
Using the Graphical User Interface
Starting the Graphical User Interface
Specifying options
The LoadLeveler main window
Getting help using the Graphical User Interface
Differences between LoadLeveler's Graphical User Interface and other Graphical User Interfaces
Graphical User Interface typographic conventions
Customizing the Graphical User Interface
Syntax of an Xloadl file
Modifying windows and buttons
Creating your own pull-down menus
Customizing fields on the Jobs window and the Machines window
Modifying help panels
Administrative uses for the Graphical User Interface
Job related administrative actions
Machine related administrative actions
LoadLeveler API interface
Summary of LoadLeveler APIs
User tasks
User task summary
Submitting and managing jobs
Building a job command file
Job command file syntax
Submitting a job command file
Managing jobs
Editing job command files
Querying the status of a job
Placing and releasing a hold on a job
Cancelling a job
Checkpointing a job
Setting and changing the priority of a job
Working with machines
Run-time environment variables
Managing jobs that consume resources
Specifying the consumption of resources by a job step
Displaying currently available resources
Special considerations for parallel jobs
Supported parallel environments
Keyword considerations for parallel jobs
Scheduler considerations
Task assignment considerations
Submitting jobs that use striping
Understanding striping
Using striping
Running interactive POE jobs
Job command file examples
Obtaining status of parallel jobs
Obtaining allocated host names
Administrator tasks
Administrator task summary
Administering and configuring LoadLeveler
Overview
Planning considerations
Where to begin?
Quick set up
Administering LoadLeveler
Administration file structure and syntax
Configuring LoadLeveler
The configuration files
Configuration file structure and syntax
Considerations for integrating LoadLeveler with AIX Workload Manager
Keyword summary
Administration tasks for parallel jobs
Scheduling considerations for parallel jobs
Allowing users to submit interactive POE jobs
Allowing users to submit PVM jobs
Restrictions and limitations for PVM jobs
Setting up a class for parallel jobs
Setting up a parallel master node
Gathering job accounting data
Collecting job resource data on serial and parallel jobs
Collecting job resource data based on machines
Collecting job resource data based on events
Collecting job resource information based on user accounts
Collecting the accounting information and storing it into files
Accounting reports
Job accounting setup procedure
Routing jobs to NQS machines
Setting up the NQS environment
Designating machines to which jobs will be routed
NQS scripts
NQS machine job routing procedure
Detailed descriptions
Descriptions summary
Job command file keywords
account_no
arguments
blocking
checkpoint
ckpt_dir
ckpt_file
ckpt_time_limit
class
comment
core_limit
cpu_limit
data_limit
dependency
environment
error
executable
file_limit
group
hold
image_size
initialdir
input
job_cpu_limit
job_name
job_type
max_processors
min_processors
network
node
node_usage
notification
notify_user
output
parallel_path
preferences
queue
requirements
resources
restart
restart_from_ckpt
restart_on_same_nodes
rss_limit
shell
stack_limit
startdate
step_name
task_geometry
tasks_per_node
total_tasks
user_priority
wall_clock_limit
Job command file variables
Example 1
Example 2
Administration and Configuration file keywords
Administration file keywords
Configuration file keywords and LoadLeveler variables
Keywords
User-defined keywords
LoadLeveler variables
LoadLeveler daemons and job states
Daemons
The master daemon
The schedd daemon
The startd daemon
The negotiator daemon
The kbdd daemon
The gsmonitor daemon
Job states
Commands
llacctmrg - Collect machine history files
llcancel - Cancel a submitted job
llckpt - Checkpoint a running job step
llclass - Query class information
llctl - Control LoadLeveler daemons
lldcegrpmaint - LoadLeveler DCE group maintenance utility
llextSDR - Extract adapter information from the SDR
llfavorjob - Reorder system queue by job
llfavoruser - Reorder system queue by user
llhold - Hold or release a submitted job
llinit - Initialize machines in the LoadLeveler cluster
llmatrix - Query Gang matrix
llmodify - Change attributes of a submitted job step
llpreempt - Preempt a submitted job step
llprio - Change the user priority of submitted job steps
llq - Query job status
llstatus - Query machine status
llsubmit - Submit a job
llsummary - Return job resource information for accounting
Application Programming Interfaces (APIs)
Accounting API
Account validation user exit
Report generation subroutine
Checkpointing API
ckpt subroutine
ll_init_ckpt
ll_ckpt
ll_set_ckpt_callbacks
ll_unset_ckpt_callbacks
Data Access API
Using the data access API
ll_query subroutine
ll_set_request subroutine
ll_reset_request subroutine
ll_get_objs subroutine
Understanding the LoadLeveler job object model
ll_get_data subroutine
ll_next_obj subroutine
ll_free_objs subroutine
ll_deallocate subroutine
Examples of using the Data Access API
Error Handling API
ll_error subroutine
Parallel Job API
Interaction between LoadLeveler and the parallel API
ll_get_hostlist subroutine
ll_start_host subroutine
Examples
Query API
ll_get_jobs subroutine
ll_free_jobs subroutine
ll_get_nodes subroutine
ll_free_nodes subroutine
Submit API
llsubmit subroutine
llfree_job_info subroutine
Monitoring programs
Workload Management API
ll_control subroutine
ll_modify subroutine
ll_preempt subroutine
ll_start_job subroutine
ll_terminate_job subroutine
Usage notes
User exits
Handling DCE security credentials
Handling an AFS token
Filtering a job script
Using your own mail program
Writing prolog and epilog programs
Procedures
Using the Graphical User Interface
Step 1: Building jobs
Step 2: Edit the job command file
Step 3: Submit a job command file
Step 4: Display, refresh, and obtain job status
Step 5: Sort the Jobs window
Step 6: Change priorities of jobs in a queue
Step 7: Hold a job
Step 8: Release a hold on a job
Step 9: Cancel a job
Step 10: Modify consumable CPUs and consumable memory
Step 11: Take checkpoint
Step 12: Display and refresh machine status
Step 13: Sort the Machines window
Step 14: Find the location of the central manager
Step 15: Find the location of the public scheduling machines
Step 16: Find the type of scheduler in use
Step 17: Specify which jobs appear in the Jobs window
Step 18: Specify which machines appear in the Machines window
Step 19: Save LoadLeveler messages in a file
Customizing the administration file
Step 1: Specify machine stanzas
Step 2: Specify user stanzas
Step 3: Specify class stanzas
Step 4: Specify group stanzas
Step 5: Specify adapter stanzas
Customizing the global and local configuration file
Step 1: Define LoadLeveler administrators
Step 2: Define LoadLeveler cluster characteristics
Step 3: Define LoadLeveler machine characteristics
Step 4: Define consumable resources
Step 5: Specify how many jobs a machine can run
Step 6: Prioritize the queue maintained by the negotiator
Step 7: Prioritize the order of executing machines maintained by the negotiator
Step 8: Manage a job's status using control expressions
Step 9: Define job accounting
Step 10: Specify alternate central managers
Step 11: Specify where files and directories are located
Step 12: Record and control log files
Step 13: Define network characteristics
Step 14: Enable checkpointing
Planning considerations for checkpointing jobs
How to checkpoint a job
Remove old checkpoint files
Step 15: Specify process tracking
Step 16: Configuring LoadLeveler to use DCE security services
Step 17: Specify additional configuration file keywords
Setting up job accounting files
Task 1: Update the configuration file
Task 2: Merge multiple files collected from each machine into one file
Task 3: Report job information on all the jobs in the history file
Task 4: Using account numbers and setting up account validation
Task 5: Specifying machines and their weights
Routing jobs to NQS machines
Task 1: Modify the administration file
Task 2: Modify the configuration file
Task 3: Submit the jobs
Task 4: Obtain status of NQS jobs
Task 5: Cancel NQS jobs
Using Gang scheduling
Overview
Gang scheduling concepts
Hierarchical communication
Task switching
Supported hardware
Application support
Preemption
Keywords specific to Gang scheduling
Configuration file keywords for Gang scheduling
Sample configuration file
Administration file keywords for Gang
Sample administration file
Gang scheduling interactions and restrictions
Network Time Protocol (NTP)
Consumable resource enforcement
Reconfiguration
Circular preemption
Restrictions for Gang scheduling and preemption
Implied START_CLASS values
Last one wins rule
Job command file and Gang scheduling
LoadLeveler commands for Gang
APIs used with Gang scheduling
Support for 64-bit applications
64-bit support for Job Command, Configuration, and Administration keywords
64-bit support for Job Command file keywords
64-bit support for Administration keywords
64-bit support for Configuration keywords and expressions
64-bit support for Command line interfaces and the GUI
64-bit support for Command line interfaces
64-bit support for the GUI
64-bit support for the LoadLeveler APIs
64-bit support for Accounting functions
Appendix contents
Appendixes
Appendix A. Examples
User tasks: building job command files
Using commands
Additional examples of building job command files
User tasks: building parallel job command files
POE
PVM 3.3 (non-SP)
PVM 3.3.11+ (SP2MPI architecture)
Appendix B. Customer case studies
Customer 1: technical computing at the Cornell Theory Center
System configuration
LoadLeveler configuration
Customer 2: circuit simulation
System configuration
LoadLeveler configuration
Customer 3: high-energy physics
System configuration
LoadLeveler batch configuration
LoadLeveler interactive configuration
Processor configuration
Customer 4: computer chip design
System configuration
Interactive configuration
Batch configuration
Configuration for a machine that schedules (but doesn't run) jobs
Appendix C. Troubleshooting
Troubleshooting LoadLeveler
Frequently Asked Questions
Helpful hints
Getting help from IBM
Appendix D. Bibliography
Information formats
Finding documentation on the World Wide Web
Accessing PSSP documentation online
Manual pages for public code
RS/6000 SP planning publications
RS/6000 SP hardware publications
RS/6000 SP Switch Router publications
Related hardware publications
RS/6000 SP software publications
AIX publications
DCE publications
Redbooks
Non-IBM publications
Appendix E. Notices
Trademarks and service marks
Appendix F. Glossary
Index