Here is a collection of scripts I've written to try to politely keep users
in line on our Rocks Linux cluster.
They come with no warranty and are poorly written in Python, so be forewarned.
Furthermore, they may be customized to the particularities of our cluster.
All scripts are copyright ©2007 by David Black-Schaffer, but permission is given to use and/or modify them.
Head Node
- Checking for old user processes on the head node - check_for_old_processes.py:
- Description:
This script finds, for each user, all processes that were started between
minAgeDays and maxAgeDays ago. It then puts the PID and name of each such
process in a ~/.processes_to_be_killed file and sends the user an email
saying those processes will be killed the next time the script runs. On
subsequent runs it issues a kill -9 as root to whatever processes are listed
in that file and regenerates the file from any other old processes. If the
user wants to keep a process around, he or she can simply remove it from the
~/.processes_to_be_killed file. (A sketch of this two-pass logic follows the
cron entry below.)
- How I run it (cron on the head node):
# Kill old processes on the head node and generate warnings about new ones every week.
# Note that we only deal with processes between 8 and 9 weeks old.
00 00 */7 * * check_for_old_processes.py 56 63
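- Sketch (not the original script):
A minimal sketch of the two-pass logic in Python. The ps invocation, the
kill-list format, and the paths are assumptions; the real script also emails
the user and is more careful (e.g. about PID reuse).
#!/usr/bin/env python
# Sketch: kill the PIDs recorded on the previous run, then rewrite each
# user's ~/.processes_to_be_killed with the currently old processes.
import os, pwd, signal, subprocess, sys

def old_processes(min_days, max_days):
    """Yield (user, pid, command) for processes aged between the bounds."""
    out = subprocess.check_output(  # "etimes" needs a reasonably recent GNU ps
        ["ps", "-eo", "user:32,pid,etimes,comm", "--no-headers"], text=True)
    for line in out.splitlines():
        user, pid, etimes, comm = line.split(None, 3)
        if min_days <= int(etimes) / 86400.0 <= max_days:
            yield user, int(pid), comm

def main(min_days, max_days):
    per_user = {}
    for user, pid, comm in old_processes(min_days, max_days):
        per_user.setdefault(user, []).append((pid, comm))
    for user, procs in per_user.items():
        hit_list = os.path.join(pwd.getpwnam(user).pw_dir,
                                ".processes_to_be_killed")
        # Pass 2: kill whatever the user left in the file from last time.
        if os.path.exists(hit_list):
            for entry in open(hit_list):
                try:
                    os.kill(int(entry.split()[0]), signal.SIGKILL)
                except (OSError, ValueError):
                    pass  # already gone, or a line the user edited
        # Pass 1: regenerate the file; the real script emails the user here.
        with open(hit_list, "w") as f:
            for pid, comm in procs:
                f.write("%d %s\n" % (pid, comm))

if __name__ == "__main__":
    main(int(sys.argv[1]), int(sys.argv[2]))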
- Checking for updated packages for the head node - check_for_package_updates.py:
- Description:
This script asks yum whether newer versions of the given package are
available and emails the cluster administrator if any are found. (A sketch
follows the cron entries below.)
- How I run it (cron on the head node):
# Check if there is a new version of OpenSSH and samba available.
# These are the two most critical security packages.
00 00 * * * check_for_package_updates.py openssh
00 02 * * * check_for_package_updates.py samba
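- Sketch (not the original script):
A minimal sketch in Python, relying on the fact that "yum check-update <pkg>"
exits with status 100 when an update is available. The admin address and mail
delivery are assumptions.
#!/usr/bin/env python
# Sketch: email the administrator when yum reports a newer package.
import smtplib, subprocess, sys
from email.mime.text import MIMEText

ADMIN = "root@localhost"  # assumption: your admin address here

def check(package):
    p = subprocess.run(["yum", "-q", "check-update", package],
                       capture_output=True, text=True)
    if p.returncode == 100:  # yum's "updates available" exit status
        msg = MIMEText(p.stdout)
        msg["Subject"] = "Updated package available: %s" % package
        msg["From"] = msg["To"] = ADMIN
        smtplib.SMTP("localhost").sendmail(ADMIN, [ADMIN], msg.as_string())

if __name__ == "__main__":
    check(sys.argv[1])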
- Warning users about running jobs on the head node - watchHeadNodeUsers.py:
- Description:
This script watches for excessive CPU or memory usage by users on the
head node and emails the offending user once every few hours while they
exceed the limits. It works by measuring how much CPU and memory each user
is consuming and storing this in a state file. On each subsequent run, any
user who is over either limit has a per-user "temperature" increased until
it exceeds a given level, at which point the email is sent. This allows
users to briefly use lots of CPU or memory while making sure they don't do
so in the long run. (A sketch of the temperature mechanism follows the cron
entry below.)
- How I run it (cron on the head node):
# Monitor head node usage every 2 minutes
*/2 * * * * /home/shared/system/bin/utilities/watchHeadNodeUsers.py
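- Sketch (not the original script):
A minimal sketch of the temperature mechanism in Python. The limits, the
state-file path, the cooling rate, and the print-instead-of-email stub are
all assumptions.
#!/usr/bin/env python
# Sketch: users over the CPU limit heat up each run; under it they cool
# off. Crossing MAX_TEMP triggers the (stubbed) warning email.
import json, os, subprocess

STATE_FILE = "/var/tmp/head_node_temps.json"  # assumption
CPU_LIMIT = 100.0  # total percent CPU per user (assumption)
MAX_TEMP = 10      # hot runs tolerated before we complain

def cpu_by_user():
    out = subprocess.check_output(
        ["ps", "-eo", "user:32,pcpu", "--no-headers"], text=True)
    usage = {}
    for line in out.splitlines():
        user, pcpu = line.split()
        usage[user] = usage.get(user, 0.0) + float(pcpu)
    return usage

def main():
    temps = json.load(open(STATE_FILE)) if os.path.exists(STATE_FILE) else {}
    for user, cpu in cpu_by_user().items():
        if cpu > CPU_LIMIT:
            temps[user] = temps.get(user, 0) + 1
            if temps[user] == MAX_TEMP:
                print("would email %s about head-node usage" % user)  # stub
        else:
            temps[user] = max(0, temps.get(user, 0) - 1)  # cool off
    json.dump(temps, open(STATE_FILE, "w"))

if __name__ == "__main__":
    main()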
Resource Management
- Warning users about their disk quotas - login_disk_space_check.py:
- Description:
This script runs every time a user opens a new connection to the head
node and tells them their current disk quota usage. It does so by connecting
to the storage server and retrieving their disk quota; then, depending on
how much of the quota they are using, it displays a more or less friendly
message. (A sketch follows the /etc/bashrc entry below.)
- How I run it (/etc/bashrc on the head node):
Add the following to /etc/bashrc after the "#Turn on checkwinsize":
# Check for disk quota over-runs
login_disk_space_check.py
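- Sketch (not the original script):
A minimal sketch in Python. The storage host name, the use of "quota -w"
over ssh, and the column positions in its output are all assumptions tied to
our setup; yours will differ.
#!/usr/bin/env python
# Sketch: fetch the user's quota from the storage server and print a
# message whose tone scales with how full they are.
import getpass, subprocess

STORAGE_HOST = "storage-0-0"  # hypothetical storage server

def quota_fraction(user):
    out = subprocess.check_output(  # -w keeps each filesystem on one line
        ["ssh", STORAGE_HOST, "quota", "-w", "-u", user], text=True)
    fields = out.splitlines()[-1].split()
    used = float(fields[1].rstrip("*"))  # '*' marks being over quota
    return used / float(fields[3])      # fields[3]: hard block limit

def main():
    frac = quota_fraction(getpass.getuser())
    if frac > 0.95:
        print("Your disk quota is %.0f%% full. Clean up NOW." % (frac * 100))
    elif frac > 0.80:
        print("Heads up: your disk quota is %.0f%% full." % (frac * 100))
    else:
        print("Disk quota: %.0f%% used." % (frac * 100))

if __name__ == "__main__":
    main()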
- Cleaning up temporary files on all the nodes - cleanup_temp_files.py
and run_cleanup_temp_files.sh:
- Description:
Goes through and deletes files older than a certain age in certain
locations. There are other scripts that do this, but this one works nicely
for me because I can tell it to ignore files owned by logged-in users. (A
sketch follows the cron entry below.)
- How I run it (cron on the head node):
# Clean up temporary files across the cluster at 2am every day.
00 02 * * * run_cleanup_temp_files.sh > /dev/null
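- Sketch (not the original script):
A minimal sketch in Python. The directory list and age cutoff are
illustrative; the "skip logged-in users" check leans on the coreutils
"users" command.
#!/usr/bin/env python
# Sketch: delete files older than MAX_AGE_DAYS under TEMP_DIRS, skipping
# anything owned by a user who is currently logged in.
import os, pwd, subprocess, time

MAX_AGE_DAYS = 7                   # assumption
TEMP_DIRS = ["/tmp", "/scratch"]   # assumption

def logged_in_uids():
    names = set(subprocess.check_output(["users"], text=True).split())
    return set(pwd.getpwnam(n).pw_uid for n in names)

def main():
    skip = logged_in_uids()
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for top in TEMP_DIRS:
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.lstat(path)
                except OSError:
                    continue  # vanished while we were walking
                if st.st_uid not in skip and st.st_mtime < cutoff:
                    os.remove(path)

if __name__ == "__main__":
    main()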
- Killing non-queued user processes - kill_non_queued_processes.py:
- Description:
This script looks at all running user processes on a node and, for any
user who does not have jobs registered on that node, either generates
warning emails (when called with "warning") or kills the processes. This is
particularly useful for users who use "qlogin" to get an interactive session
and then leave jobs hanging around: once their qlogin session ends, any
processes they left running will be killed by this script. (A sketch follows
the cron entries below.)
- How I run it (cron on the head node):
# Send out warning emails about unreserved node usage at 10:50pm, right before killing them.
50 22 * * * /opt/rocks/bin/cluster-fork kill_non_queued_processes.py warning > /dev/null
# Kill those unreserved processes at 11pm.
0 23 * * * /opt/rocks/bin/cluster-fork kill_non_queued_processes.py
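- Sketch (not the original script):
A minimal sketch of the per-node check in Python (the real script runs on
each node via cluster-fork). The qstat column positions and the list of
exempt system accounts are simplifying assumptions.
#!/usr/bin/env python
# Sketch: any user with processes on this node but no running SGE job
# registered here is warned (stubbed) or has those processes killed.
import os, signal, socket, subprocess, sys

EXEMPT = ("root", "sge")  # assumption: system accounts to leave alone

def sge_users_here(host):
    out = subprocess.check_output(["qstat", "-u", "*", "-s", "r"], text=True)
    users = set()
    for line in out.splitlines()[2:]:       # skip the two header lines
        f = line.split()
        if len(f) > 7 and host in f[7]:     # f[7] is queue@hostname
            users.add(f[3])                 # f[3] is the job owner
    return users

def main(warn_only):
    allowed = sge_users_here(socket.gethostname().split(".")[0])
    out = subprocess.check_output(
        ["ps", "-eo", "user:32,pid,comm", "--no-headers"], text=True)
    for line in out.splitlines():
        user, pid, comm = line.split(None, 2)
        if user in EXEMPT or user in allowed:
            continue
        if warn_only:
            print("would warn %s about %s (pid %s)" % (user, comm, pid))
        else:
            os.kill(int(pid), signal.SIGKILL)

if __name__ == "__main__":
    main(len(sys.argv) > 1 and sys.argv[1] == "warning")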
- Suspending low-priority SGE Jobs - suspend_low_priority_jobs.py:
- Description:
This script defines multiple "priorityQueues" and then finds nodes with
more jobs than CPUs and suspends enough of the non-priority jobs on those
nodes to ensure that the jobs in the priority queues each run on their own
CPU. If it finds a node that has free CPUs and suspended jobs, it
un-suspends those jobs. The effect is to preempt jobs in non-priority queues
with jobs from the priority queues. This script assumes that the priority
queues and the non-priority queues overlap in resources. (A sketch follows
the cron entry below.)
- How I run it (cron on the head node):
# Suspend low-priority jobs on the cluster every 2 minutes
*/2 * * * * /home/shared/system/bin/utilities/suspend_low_priority_jobs.py
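- Sketch (not the original script):
A minimal sketch of the preemption logic in Python, using SGE's qmod -sj /
-usj to suspend and resume jobs. The queue names, CPUs per node, and qstat
parsing are assumptions.
#!/usr/bin/env python
# Sketch: on an oversubscribed node, suspend non-priority jobs until the
# priority jobs fit; on a node with free CPUs, resume suspended jobs.
import subprocess

PRIORITY_QUEUES = ["priority.q"]  # assumption
CPUS_PER_NODE = 2                 # assumption

def jobs_by_host():
    """Map host -> [(job_id, queue, state)] from plain qstat output."""
    out = subprocess.check_output(["qstat", "-u", "*"], text=True)
    hosts = {}
    for line in out.splitlines()[2:]:
        f = line.split()
        if len(f) > 7 and "@" in f[7]:
            queue, host = f[7].split("@")
            hosts.setdefault(host, []).append((f[0], queue, f[4]))
    return hosts

def main():
    for host, jobs in jobs_by_host().items():
        running = [j for j in jobs if "s" not in j[2].lower()]
        suspended = [j for j in jobs if "s" in j[2].lower()]
        excess = len(running) - CPUS_PER_NODE
        if excess > 0:
            # Suspend low-priority jobs until the priority jobs fit.
            for job_id, queue, _ in running:
                if excess <= 0:
                    break
                if queue not in PRIORITY_QUEUES:
                    subprocess.call(["qmod", "-sj", job_id])
                    excess -= 1
        elif excess < 0 and suspended:
            # Free CPUs: resume suspended jobs to fill them.
            for job_id, _, _ in suspended[:-excess]:
                subprocess.call(["qmod", "-usj", job_id])

if __name__ == "__main__":
    main()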
- Preventing multiple qlogin reservations - single_qlogin.py:
- Description:
Our users use "qlogin" to reserve interactive CPUs on our cluster,
but most users log in multiple times and run qlogin over and over again,
once per terminal session. This results in them reserving multiple CPUs
when what they mostly want is simply more terminals open at once. To avoid
this I have the users run single_qlogin.py instead of qlogin (by putting
it first on their path). This script checks whether the user already has a
qlogin session in SGE and, if so, opens an ssh connection to that node
rather than reserving a new qlogin session. If the user wants to force
a new session, he or she can run it with "-new". This means that
a user who ssh's in 5 times and runs qlogin 5 times reserves only one
CPU and gets 5 connections to that CPU. (A sketch of the wrapper follows
below.)
Unfortunately this doesn't do everything we want, so I had to modify the
rocks-qlogin.sh script to call qlogin_logout_check.py
when it closes a qlogin session. That script runs when a user exits
their qlogin session and warns the user if they have abandoned their SGE
qlogin reservation but left other processes running on that node. (Since
each qlogin does not equal a reserved SGE qlogin, they can close the
reservation while they still have other terminals open to that node.)
- How I run it:
I made a link to it called "qlogin" and made sure that the link comes
before the original qlogin on the users' paths.
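- Sketch (not the original script):
A minimal sketch of the wrapper in Python. The path to the real qlogin
binary and the qstat column positions are assumptions; interactive SGE jobs
show up with the name QLOGIN.
#!/usr/bin/env python
# Sketch: reuse an existing QLOGIN reservation by ssh-ing to its node;
# "-new" (or having no reservation) falls through to the real qlogin.
import getpass, os, subprocess, sys

REAL_QLOGIN = "/opt/gridengine/bin/qlogin"  # hypothetical path

def existing_qlogin_host(user):
    out = subprocess.check_output(["qstat", "-u", user, "-s", "r"], text=True)
    for line in out.splitlines()[2:]:
        f = line.split()
        if len(f) > 7 and f[2] == "QLOGIN" and "@" in f[7]:
            return f[7].split("@")[1]  # node half of queue@node
    return None

def main():
    if "-new" not in sys.argv:
        host = existing_qlogin_host(getpass.getuser())
        if host:
            os.execvp("ssh", ["ssh", host])  # replaces this process
    args = [a for a in sys.argv[1:] if a != "-new"]
    os.execv(REAL_QLOGIN, [REAL_QLOGIN] + args)

if __name__ == "__main__":
    main()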
- Removing abandoned qlogin sessions - qlogin_cleanup.py:
- Description:
Often users close their terminal sessions without exiting their qlogin
sessions, and SGE does not clean up from this nicely. This script checks for
users who have qlogin sessions reserved but are no longer logged into the
head node, and kills those sessions. (A sketch follows the cron entry
below.)
- How I run it (cron on the head node):
# Remove orphaned qlogin sessions every day at 5:30am.
30 05 * * * /home/shared/system/bin/utilities/qlogin_cleanup.py
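- Sketch (not the original script):
A minimal sketch in Python; the qstat parsing is simplified and "users"
(coreutils) stands in for however the real script decides who is logged in.
#!/usr/bin/env python
# Sketch: qdel any QLOGIN job whose owner is no longer logged into the
# head node.
import subprocess

def main():
    logged_in = set(subprocess.check_output(["users"], text=True).split())
    out = subprocess.check_output(["qstat", "-u", "*"], text=True)
    for line in out.splitlines()[2:]:  # skip the two header lines
        f = line.split()
        if len(f) > 4 and f[2] == "QLOGIN" and f[3] not in logged_in:
            subprocess.call(["qdel", f[0]])  # orphaned reservation

if __name__ == "__main__":
    main()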