Here is a collection of scripts I've written to try to politely keep users
in line on our Rocks Linux cluster.
They come with no warranty and are poorly written in Python, so be forewarned.
Furthermore, they may be customized to the particularities of our cluster.
All scripts are copyright ©2007 by David Black-Schaffer, but permission is given to use and/or modify them.
Head Node
- Checking for old user processes on the head node - check_for_old_processes.py:
- Description:
This script finds, for each user, all processes that were started between
minAgeDays and maxAgeDays ago. It then puts the PID and name of each such
process in a ~/.processes_to_be_killed file and sends the user an email
saying those processes will be killed the next time the script runs. On
subsequent runs it issues a kill -9 as root to whatever processes are listed
in that file and regenerates the file from any other old processes. If the
user wants to keep a process around, he or she can simply remove it from the
~/.processes_to_be_killed file. (A sketch of this two-pass logic follows the
cron entry below.)
- How I run it (cron on the head node):
# Kill old processes on the head node and generate warnings about new ones every week.
# Note that we only deal with processes between 8 and 9 weeks old.
00 00 */7 * * check_for_old_processes.py 56 63
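- Sketch (not the original script):
A minimal sketch of the two-pass logic in Python. The ps invocation, the
kill-list format, and the paths are assumptions; the real script also emails
the user and is more careful (e.g. about PID reuse).
#!/usr/bin/env python
# Sketch: kill the PIDs recorded on the previous run, then rewrite each
# user's ~/.processes_to_be_killed with the currently old processes.
import os, pwd, signal, subprocess, sys

def old_processes(min_days, max_days):
    """Yield (user, pid, command) for processes aged between the bounds."""
    out = subprocess.check_output(  # "etimes" needs a reasonably recent GNU ps
        ["ps", "-eo", "user:32,pid,etimes,comm", "--no-headers"], text=True)
    for line in out.splitlines():
        user, pid, etimes, comm = line.split(None, 3)
        if min_days <= int(etimes) / 86400.0 <= max_days:
            yield user, int(pid), comm

def main(min_days, max_days):
    per_user = {}
    for user, pid, comm in old_processes(min_days, max_days):
        per_user.setdefault(user, []).append((pid, comm))
    for user, procs in per_user.items():
        hit_list = os.path.join(pwd.getpwnam(user).pw_dir,
                                ".processes_to_be_killed")
        # Pass 2: kill whatever the user left in the file from last time.
        if os.path.exists(hit_list):
            for entry in open(hit_list):
                try:
                    os.kill(int(entry.split()[0]), signal.SIGKILL)
                except (OSError, ValueError):
                    pass  # already gone, or a line the user edited
        # Pass 1: regenerate the file; the real script emails the user here.
        with open(hit_list, "w") as f:
            for pid, comm in procs:
                f.write("%d %s\n" % (pid, comm))

if __name__ == "__main__":
    main(int(sys.argv[1]), int(sys.argv[2]))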
- Checking for updated packages for the head node - check_for_package_updates.py:
- Description:
This script asks yum whether newer versions of the given package are
available and emails the cluster administrator if any are found. (A sketch
follows the cron entries below.)
- How I run it (cron on the head node):
# Check if there is a new version of OpenSSH and samba available.
# These are the two most critical security packages.
00 00 * * * check_for_package_updates.py openssh
00 02 * * * check_for_package_updates.py samba
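- Sketch (not the original script):
A minimal sketch in Python, relying on the fact that "yum check-update <pkg>"
exits with status 100 when an update is available. The admin address and mail
delivery are assumptions.
#!/usr/bin/env python
# Sketch: email the administrator when yum reports a newer package.
import smtplib, subprocess, sys
from email.mime.text import MIMEText

ADMIN = "root@localhost"  # assumption: your admin address here

def check(package):
    p = subprocess.run(["yum", "-q", "check-update", package],
                       capture_output=True, text=True)
    if p.returncode == 100:  # yum's "updates available" exit status
        msg = MIMEText(p.stdout)
        msg["Subject"] = "Updated package available: %s" % package
        msg["From"] = msg["To"] = ADMIN
        smtplib.SMTP("localhost").sendmail(ADMIN, [ADMIN], msg.as_string())

if __name__ == "__main__":
    check(sys.argv[1])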
- Warning users about running jobs on the head node - watchHeadNodeUsers.py:
- Description:
This script watches for excessive CPU or memory usage by users on the
head node and emails the offending user once every few hours while they
exceed the limits. It works by measuring how much CPU and memory each user
is consuming and storing this in a state file. On each subsequent run, any
user who is over either limit has a per-user "temperature" increased until
it exceeds a given level, at which point the email is sent. This allows
users to briefly use lots of CPU or memory while making sure they don't do
so in the long run. (A sketch of the temperature mechanism follows the cron
entry below.)
- How I run it (cron on the head node):
# Monitor head node usage every 2 minutes
*/2 * * * * /home/shared/system/bin/utilities/watchHeadNodeUsers.py
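- Sketch (not the original script):
A minimal sketch of the temperature mechanism in Python. The limits, the
state-file path, the cooling rate, and the print-instead-of-email stub are
all assumptions.
#!/usr/bin/env python
# Sketch: users over the CPU limit heat up each run; under it they cool
# off. Crossing MAX_TEMP triggers the (stubbed) warning email.
import json, os, subprocess

STATE_FILE = "/var/tmp/head_node_temps.json"  # assumption
CPU_LIMIT = 100.0  # total percent CPU per user (assumption)
MAX_TEMP = 10      # hot runs tolerated before we complain

def cpu_by_user():
    out = subprocess.check_output(
        ["ps", "-eo", "user:32,pcpu", "--no-headers"], text=True)
    usage = {}
    for line in out.splitlines():
        user, pcpu = line.split()
        usage[user] = usage.get(user, 0.0) + float(pcpu)
    return usage

def main():
    temps = json.load(open(STATE_FILE)) if os.path.exists(STATE_FILE) else {}
    for user, cpu in cpu_by_user().items():
        if cpu > CPU_LIMIT:
            temps[user] = temps.get(user, 0) + 1
            if temps[user] == MAX_TEMP:
                print("would email %s about head-node usage" % user)  # stub
        else:
            temps[user] = max(0, temps.get(user, 0) - 1)  # cool off
    json.dump(temps, open(STATE_FILE, "w"))

if __name__ == "__main__":
    main()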
Resource Management
- Warning users about their disk quotas - login_disk_space_check.py:
- Description:
This script runs every time a user opens a new connection to the head
node and tells them their current disk quota usage. It does so by connecting
to the storage server and retrieving their disk quota; then, depending on
how much of the quota they are using, it displays a more or less friendly
message. (A sketch follows the /etc/bashrc entry below.)
- How I run it (/etc/bashrc on the head node):
Add the following to /etc/bashrc after the "#Turn on checkwinsize":
# Check for disk quota over-runs
login_disk_space_check.py
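- Sketch (not the original script):
A minimal sketch in Python. The storage host name, the use of "quota -w"
over ssh, and the column positions in its output are all assumptions tied to
our setup; yours will differ.
#!/usr/bin/env python
# Sketch: fetch the user's quota from the storage server and print a
# message whose tone scales with how full they are.
import getpass, subprocess

STORAGE_HOST = "storage-0-0"  # hypothetical storage server

def quota_fraction(user):
    out = subprocess.check_output(  # -w keeps each filesystem on one line
        ["ssh", STORAGE_HOST, "quota", "-w", "-u", user], text=True)
    fields = out.splitlines()[-1].split()
    used = float(fields[1].rstrip("*"))  # '*' marks being over quota
    return used / float(fields[3])      # fields[3]: hard block limit

def main():
    frac = quota_fraction(getpass.getuser())
    if frac > 0.95:
        print("Your disk quota is %.0f%% full. Clean up NOW." % (frac * 100))
    elif frac > 0.80:
        print("Heads up: your disk quota is %.0f%% full." % (frac * 100))
    else:
        print("Disk quota: %.0f%% used." % (frac * 100))

if __name__ == "__main__":
    main()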
- Cleaning up temporary files on all the nodes - cleanup_temp_files.py
and run_cleanup_temp_files.sh:
- Description:
Goes through and deletes files older than a certain age in certain
locations. There are other scripts that do this, but this one works nicely
for me because I can tell it to ignore files owned by logged-in users. (A
sketch follows the cron entry below.)
- How I run it (cron on the head node):
# Clean up temporary files across the cluster at 2am every day.
00 02 * * * run_cleanup_temp_files.sh > /dev/null
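- Sketch (not the original script):
A minimal sketch in Python. The directory list and age cutoff are
illustrative; the "skip logged-in users" check leans on the coreutils
"users" command.
#!/usr/bin/env python
# Sketch: delete files older than MAX_AGE_DAYS under TEMP_DIRS, skipping
# anything owned by a user who is currently logged in.
import os, pwd, subprocess, time

MAX_AGE_DAYS = 7                   # assumption
TEMP_DIRS = ["/tmp", "/scratch"]   # assumption

def logged_in_uids():
    names = set(subprocess.check_output(["users"], text=True).split())
    return set(pwd.getpwnam(n).pw_uid for n in names)

def main():
    skip = logged_in_uids()
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for top in TEMP_DIRS:
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.lstat(path)
                except OSError:
                    continue  # vanished while we were walking
                if st.st_uid not in skip and st.st_mtime < cutoff:
                    os.remove(path)

if __name__ == "__main__":
    main()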
- Killing non-queued user processes - kill_non_queued_processes.py:
- Description:
This script looks at all running user processes on a node and, for any
user who does not have jobs registered on that node, either generates
warning emails (when called with "warning") or kills the processes. This is
particularly useful for users who use "qlogin" to get an interactive session
and then leave jobs hanging around: once their qlogin session ends, any
processes they left running will be killed by this script. (A sketch follows
the cron entries below.)
- How I run it (cron on the head node):
# Send out warning emails about unreserved node usage at 10:50pm, right before killing them.
50 22 * * * /opt/rocks/bin/cluster-fork kill_non_queued_processes.py warning > /dev/null
# Kill those unreserved processes at 11pm.
0 23 * * * /opt/rocks/bin/cluster-fork kill_non_queued_processes.py
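- Sketch (not the original script):
A minimal sketch of the per-node check in Python (the real script runs on
each node via cluster-fork). The qstat column positions and the list of
exempt system accounts are simplifying assumptions.
#!/usr/bin/env python
# Sketch: any user with processes on this node but no running SGE job
# registered here is warned (stubbed) or has those processes killed.
import os, signal, socket, subprocess, sys

EXEMPT = ("root", "sge")  # assumption: system accounts to leave alone

def sge_users_here(host):
    out = subprocess.check_output(["qstat", "-u", "*", "-s", "r"], text=True)
    users = set()
    for line in out.splitlines()[2:]:       # skip the two header lines
        f = line.split()
        if len(f) > 7 and host in f[7]:     # f[7] is queue@hostname
            users.add(f[3])                 # f[3] is the job owner
    return users

def main(warn_only):
    allowed = sge_users_here(socket.gethostname().split(".")[0])
    out = subprocess.check_output(
        ["ps", "-eo", "user:32,pid,comm", "--no-headers"], text=True)
    for line in out.splitlines():
        user, pid, comm = line.split(None, 2)
        if user in EXEMPT or user in allowed:
            continue
        if warn_only:
            print("would warn %s about %s (pid %s)" % (user, comm, pid))
        else:
            os.kill(int(pid), signal.SIGKILL)

if __name__ == "__main__":
    main(len(sys.argv) > 1 and sys.argv[1] == "warning")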
- Suspending low-priority SGE Jobs - suspend_low_priority_jobs.py:
- Description:
This script defines multiple "priorityQueues" and then finds nodes with
more jobs than CPUs and suspends enough of the non-priority jobs on those
nodes to ensure that the jobs in the priority queues each run on their own
CPU. If it finds a node that has free CPUs and suspended jobs, it
un-suspends those jobs. The effect is to preempt jobs in non-priority queues
with jobs from the priority queues. This script assumes that the priority
queues and the non-priority queues overlap in resources. (A sketch follows
the cron entry below.)
- How I run it (cron on the head node):
# Suspend low-priority jobs on the cluster every 2 minutes
*/2 * * * * /home/shared/system/bin/utilities/suspend_low_priority_jobs.py
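- Sketch (not the original script):
A minimal sketch of the preemption logic in Python, using SGE's qmod -sj /
-usj to suspend and resume jobs. The queue names, CPUs per node, and qstat
parsing are assumptions.
#!/usr/bin/env python
# Sketch: on an oversubscribed node, suspend non-priority jobs until the
# priority jobs fit; on a node with free CPUs, resume suspended jobs.
import subprocess

PRIORITY_QUEUES = ["priority.q"]  # assumption
CPUS_PER_NODE = 2                 # assumption

def jobs_by_host():
    """Map host -> [(job_id, queue, state)] from plain qstat output."""
    out = subprocess.check_output(["qstat", "-u", "*"], text=True)
    hosts = {}
    for line in out.splitlines()[2:]:
        f = line.split()
        if len(f) > 7 and "@" in f[7]:
            queue, host = f[7].split("@")
            hosts.setdefault(host, []).append((f[0], queue, f[4]))
    return hosts

def main():
    for host, jobs in jobs_by_host().items():
        running = [j for j in jobs if "s" not in j[2].lower()]
        suspended = [j for j in jobs if "s" in j[2].lower()]
        excess = len(running) - CPUS_PER_NODE
        if excess > 0:
            # Suspend low-priority jobs until the priority jobs fit.
            for job_id, queue, _ in running:
                if excess <= 0:
                    break
                if queue not in PRIORITY_QUEUES:
                    subprocess.call(["qmod", "-sj", job_id])
                    excess -= 1
        elif excess < 0 and suspended:
            # Free CPUs: resume suspended jobs to fill them.
            for job_id, _, _ in suspended[:-excess]:
                subprocess.call(["qmod", "-usj", job_id])

if __name__ == "__main__":
    main()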
- Preventing multiple qlogin reservations - single_qlogin.py:
- Description:
Our users use "qlogin" to reserve interactive CPUs on our cluster,
but most users log in multiple times and run qlogin over and over again,
once per terminal session. This results in them reserving multiple CPUs
when what they mostly want is simply more terminals open at once. To avoid
this I have the users run single_qlogin.py instead of qlogin (by putting
it first on their path). This script checks whether the user already has a
qlogin session in SGE and, if so, opens an ssh connection to that node
rather than reserving a new qlogin session. If the user wants to force
a new session, he or she can run it with "-new". This means that
a user who ssh's in 5 times and runs qlogin 5 times reserves only one
CPU and gets 5 connections to that CPU. (A sketch of the wrapper follows
below.)
Unfortunately this doesn't do everything we want, so I had to modify the
rocks-qlogin.sh script to call qlogin_logout_check.py
when it closes a qlogin session. That script runs when a user exits
their qlogin session and warns the user if they have abandoned their SGE
qlogin reservation but left other processes running on that node. (Since
each qlogin does not equal a reserved SGE qlogin, they can close the
reservation while they still have other terminals open to that node.)
- How I run it:
I made a link to it called "qlogin" and made sure that the link comes
before the original qlogin on the users' paths.
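- Sketch (not the original script):
A minimal sketch of the wrapper in Python. The path to the real qlogin
binary and the qstat column positions are assumptions; interactive SGE jobs
show up with the name QLOGIN.
#!/usr/bin/env python
# Sketch: reuse an existing QLOGIN reservation by ssh-ing to its node;
# "-new" (or having no reservation) falls through to the real qlogin.
import getpass, os, subprocess, sys

REAL_QLOGIN = "/opt/gridengine/bin/qlogin"  # hypothetical path

def existing_qlogin_host(user):
    out = subprocess.check_output(["qstat", "-u", user, "-s", "r"], text=True)
    for line in out.splitlines()[2:]:
        f = line.split()
        if len(f) > 7 and f[2] == "QLOGIN" and "@" in f[7]:
            return f[7].split("@")[1]  # node half of queue@node
    return None

def main():
    if "-new" not in sys.argv:
        host = existing_qlogin_host(getpass.getuser())
        if host:
            os.execvp("ssh", ["ssh", host])  # replaces this process
    args = [a for a in sys.argv[1:] if a != "-new"]
    os.execv(REAL_QLOGIN, [REAL_QLOGIN] + args)

if __name__ == "__main__":
    main()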
- Removing abandoned qlogin sessions - qlogin_cleanup.py:
- Description:
Often users close their terminal sessions without exiting their qlogin
sessions, and SGE does not clean up from this nicely. This script checks for
users who have qlogin sessions reserved but are no longer logged into the
head node, and kills those sessions. (A sketch follows the cron entry
below.)
- How I run it (cron on the head node):
# Remove orphaned qlogin sessions every day at 5:30am.
30 05 * * * /home/shared/system/bin/utilities/qlogin_cleanup.py
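- Sketch (not the original script):
A minimal sketch in Python; the qstat parsing is simplified and "users"
(coreutils) stands in for however the real script decides who is logged in.
#!/usr/bin/env python
# Sketch: qdel any QLOGIN job whose owner is no longer logged into the
# head node.
import subprocess

def main():
    logged_in = set(subprocess.check_output(["users"], text=True).split())
    out = subprocess.check_output(["qstat", "-u", "*"], text=True)
    for line in out.splitlines()[2:]:  # skip the two header lines
        f = line.split()
        if len(f) > 4 and f[2] == "QLOGIN" and f[3] not in logged_in:
            subprocess.call(["qdel", f[0]])  # orphaned reservation

if __name__ == "__main__":
    main()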