TODO:
- (1-dec-05) get sge to correctly distribute jobs and suspend jobs (tickets)
- (1-dec-05) add qmon to user docs
- (1-dec-05) add cluster-fork to user docs
- (1-dec-05) enable automatic updates via yum and sync this with tripwire
Make the addUsers script check for invalid user names
archive all of the dryer stuff? mount dryer? change the UIDs to match? Use dryer as backup?
CHANGES:
- (31-jan-06) updated rsnapshot config to have all the directories on the head node
- (1-Feb-06) removed /share and /sys from the head node backup since they are not regular directories
- (2-Feb-06) changed /var/www to be owned by the webadmin, but /var/www/wordpress needs to be owned by apache
- (2-Feb-06) changed the https://localhost/admin/phpMyAdmin/ to support modifying any mySQL database on the machine
- (4-Feb-06) imposed hard resource limits for users on the head node (30 processes max)
- (7-Feb-06) changed the rsnapshot crontab to be separated by 1 hour and fixed the weekly to go on the 28th.
- (8-Feb-06) added details on adding a regression user
- (11-Feb-06) added description of NFS tuning parameters
- (13-Feb-06) added description of script for monitoring servers and sending email (check_server_status.py)
- (21-Feb-06) added explicit limits for root to /etc/security/limits.conf on the head node to allow root processes more memory
- (23-Feb-06) updated check_server_status.py script to use the "serverCheckUser" to check server status and added those users to cva and bagels.
- (23-Feb-06) added cleanup_temp_files.py script to cleanup the temporary locations on the cluster.
- (24-Feb-06) modified the rsnapshot setup on the file server to do emergency snapshots when the check_server_status script finds the backup host down.
- (25-Feb-06) added disk_space_report.py to the head node to send out disk space usage reports.
- (10-Mar-06) added description of queue changes
- (31-Mar-06) added description of new node changes
- (06-Apr-06) added changes for node final packages and disk quotas
- (17-Apr-06) added changes to smartd.conf and adding it to 411
- (26-Apr-06) added changes to watchHeadNodeUsers.py
- (8-June-06) modified scheduler configuration to discuss script to suspend low-priority jobs
- (6-June-06) added description of check_for_old_processes
- (9-Aug-06) added details on how to re-install a stuck/crashed node
- (12-Dec-06) added details on setting up a second ssh server for backups
- (30-Apr-06) added new firewall documentation, cleaned up other sections
- (14-Aug-07) added logrotate compression and process accounting
##############################
Thoughts for future installs:
##############################
It would be a good idea to have the /home directory organized as:
/home/dally
/home/aiken
/home/mendel
etc. so people not in our group can be handled easily.
*** The headnode should have its disks RAID 1 mirrored. ***
The file system should be stored on a RAID 10 or a ZFS RAIDZ array.
The backup server should be running ZFS for snapshots.
We should use an RSA electronic key style one-time password authentication to prevent account compromises.
##############################
Cluster Install Hints:
##############################
Basic information
bagels.stanford.edu/172.24.72.184
gateway 172.24.72.1
netmask 255.255.255.0
DNS 171.64.7.55, 171.64.7.77
N37.25 W122.10
Partition the head node as:
/ 30GB
swap 1GB
/state/partition1 remainder
Partition the storage node as:
/ 6GB
swap 1GB
/state/partition1 remainder
11/1/05
ROCKS 4.1 install:
After installing the head node we need to patch the DVD to get the storage node to install:
# cd /home/install/contrib/4.1/*/RPMS
# wget http://www.rocksclusters.org/ftp-site/pub/rocks/fixes/4.1/noarch/roll-hpc-kickstart-4.1-1.noarch.rpm
# cd /home/install
# rocks-dist dist
Now try to install your NAS. It will force you into manual
partitioning. To interact with the install, execute on the frontend:
# ssh nas-0-0 -p 2200
In the future the head node should be built with a RAID1 mirror for the main disks. Having a single IDE drive as a point of failure is foolish.
##############################
Upgrading:
##############################
See the instructions on the ROCKS website. Basically, if you need to re-install the head node, you want to save the node database, re-install the head node, re-import the node database, and then tell all the nodes to rebuild themselves.
##############################
Updating the head node:
##############################
We can use 'up2date' (or I guess 'yum') to update the install on the head node. We need to do this for security reasons; however, we need to be careful. We only want to update openssh, samba, and maybe openssl, since ROCKS depends on various other things and doing a blanket update will break things such as adding users. We really don't care about the individual cluster nodes being up to date for security reasons because they are on the private network. However, it is probably a good idea to update them from time to time, but I'm not sure of the best way to do that.
Generate an up2date key:
bagels% rpm --import /usr/share/rhn/RPM-GPG-KEY
Run up2date graphically and update ssh and openssl. (These are the only security things we really care about -- we'll install Samba later.) Under Retrieval/Installation enable "Enable RPM rollbacks".
To keep software up-to-date there is a script check_for_package_updates.py that uses yum to see if the currently installed package is up to date. It is run nightly to check samba and openssh. If they are not up to date an email alert will be sent to the security account.
For this script to work with the firewall blocking outgoing data you need to update the /etc/yum.repos.d/CentOS-Base.repo file to point to http://mirror.stanford.edu/yum/pub/ instead of http://mirror.centos.org/. You'll have to change all 14 instances of this. I created a separate file "CentOS-Base-Stanford.repo" for this purpose and renamed the old one "CentOS-Base.disabled".
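All 14 instances can be changed in one pass with sed. Here is the substitution shown on a sample baseurl line (the sample line itself is illustrative, not copied from the real repo file):

```shell
# Rewrite the CentOS mirror URL to the Stanford mirror in one sed expression.
echo 'baseurl=http://mirror.centos.org/centos/4/os/x86_64/' \
  | sed -e 's|http://mirror.centos.org|http://mirror.stanford.edu/yum/pub|g'
# -> baseurl=http://mirror.stanford.edu/yum/pub/centos/4/os/x86_64/
```

Run the same sed expression over CentOS-Base.repo to produce the Stanford copy.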
##############################
Custom configuration:
##############################
We want to change the node configuration to do the following:
- change the partitioning to get 15GB for the / partition
- install all the standard linux tools
- install WINE and any other custom RPMS
- auto mount the /dryer directory and link /cad to /dryer/cad
- add the /home/shared/system/bin path to the path
- display a welcome message on login to the head node
To make these modifications we copy /home/install/site-profiles/4.1/nodes/skeleton.xml to extend-auto-partition.xml and extend-compute.xml.
Change the partitioning (http://www.rocksclusters.org/rocks-documentation/4.1/customization-partitioning.html):
Configure the node setup to do the new partitioning. Above the "main" section in /home/install/site-profiles/4.1/nodes/extend-auto-partition.xml add:
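The partitioning XML itself is not recorded in these notes. A sketch following the ROCKS 4.1 auto-partitioning format might look like the following (the disk name hda and the swap size are assumptions; the 15GB / comes from the requirement above):

```xml
<kickstart>
  <main>
    <part> / --size 15000 --ondisk hda </part>
    <part> swap --size 1000 --ondisk hda </part>
    <part> /state/partition1 --size 1 --grow --ondisk hda </part>
  </main>
</kickstart>
```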
We also need to remove the current partitioning information from the database:
bagels% rocks-partition --list --delete --nodename compute-0-0
To do this on all the nodes run the /home/shared/system/bin/utilities/remove_node_partition_info.py script.
We then need to make sure it will re-install by removing the .rocks-release file from each partition:
bagels% ssh compute-0-0 '/bin/rm -f /.rocks-release'
Adding all standard packages: (You can get the list of packages from /home/install/rocks-dist/lan/x86_64/RedHat/base/comps.xml)
1) We would like to add the following, but these are all categories, not groups, so it doesn't work.
Desktops
Applications
Development
System
Instead add:
base-x
xfce-desktop
gnome-desktop
kde-desktop
editors
engineering-and-scientific
graphical-internet
text-internet
office
sound-and-video
authoring-and-publishing
graphics
games
development-tools
kernel-development
x-software-development
gnome-software-development
kde-software-development
xfce-software-development
compat-arch-development
legacy-software-development
x86-compat-arch-development
admin-tools
system-tools
printing
compat-arch-support
x86-compat-libs
There are other packages we want to add. We can figure out what the difference between the head node and a compute node is by doing a diff --side-by-side on the /root/install.log on the head node and the compute node. Let's add these packages as well:
gcc4
gcc4-c++
gcc4-gfortran
graphviz
graphviz-devel
graphviz-doc
graphviz-graphs
graphviz-tcl
aspell
ruby
subversion
subversion-devel
subversion-perl
thunderbird
xemacs
tcl-devel
pump-devel
elfutils-devel
syslinux
tclx
tclx-devel
tclx-doc
vnc
xfig
xpdf
tetex
tetex-latex
libtermcap-devel
dialog
expat-devel
vim-common
vim-enhanced
ElectricFence
This will install:
Desktops
base-x
xfce-desktop
gnome-desktop
kde-desktop
Applications
editors
engineering-and-scientific
graphical-internet
text-internet
office
sound-and-video
authoring-and-publishing
graphics
games
Development
development-tools
kernel-development
x-software-development
gnome-software-development
kde-software-development
xfce-software-development
compat-arch-development
legacy-software-development
x86-compat-arch-development
System
admin-tools
system-tools
samba-client
printing
compat-arch-support
x86-compat-libs
The total install is now about 7GB. More space is needed for the actual downloading of the files for the install.
Adding WINE:
1) download the latest CentOS 4 WINE RPM from winehq.org into /home/install/contrib/4.1/x86_64/RPMS/
2) Add wine to /home/install/site-profiles/4.1/nodes/extend-compute.xml
Automounts:
1) Change the /etc/auto.master on bagels to have the following for /home:
/home /etc/auto.home -fstype=nfs,rsize=32768,wsize=32768,soft --ghost --timeout=1200
2) Add a line to automount dryer:
/dryer /etc/auto.dryer -fstype=nfs,rw,exec,soft,rsize=32768,wsize=32768 --ghost --timeout=60
3) Create the /etc/auto.dryer file:
cad dryer.stanford.edu:/vol/vol1/cad
4) We do not need to add auto.dryer to 411 as it automatically adds all auto.* files when we re-build the file list.
5) Edit the extend-compute.xml file to link /cad to /dryer/cad:
/bin/ln -s /dryer/cad /cad
6) Add an automount for the backup server by editing /etc/auto.home to include:
backup -fstype=nfs,tcp,ro,soft 172.24.94.72:/snapshots
Re-build the cluster configuration/installation:
bagels% cd /home/install
bagels% rocks-dist dist
Add the sudoers file to 411:
1) Edit /var/411/Files.mk and add:
FILES += /etc/sudoers
Add the login cluster-specific shell initialization scripts:
1) put scripts in /etc/profile.d/zz-bagels.csh and zz-bagels.sh that add the /home/shared/system/bin and /home/shared/system/bin/utilities to the path.
2) Add them to 411 by editing /var/411/Files.mk and add:
FILES += /etc/profile.d/zz-bagels.csh
FILES += /etc/profile.d/zz-bagels.sh
Add the cluster-login message to all interactive logins to the head node:
1) Modify /etc/bashrc
At the end of the "are we an interactive shell" add:
# Output the cluster welcome message
echo -e "`cat /home/shared/system/bin/utilities/cluster-welcome.txt`"
Add the login disk quota check right after the login-message to /etc/bashrc:
# Check for disk quota over-runs
/home/shared/system/bin/utilities/login_disk_space_check.py
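Putting the two hooks together, the tail of the interactive-shell section of /etc/bashrc would read roughly as follows (the $PS1 test shown here is an assumption about how the stock file detects an interactive shell):

```shell
# Sketch of the end of the interactive-shell block in /etc/bashrc
if [ "$PS1" ]; then
    # ... existing interactive-shell setup ...
    # Output the cluster welcome message
    echo -e "`cat /home/shared/system/bin/utilities/cluster-welcome.txt`"
    # Check for disk quota over-runs
    /home/shared/system/bin/utilities/login_disk_space_check.py
fi
```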
Re-build the 411 database by executing:
bagels% cd /var/411
bagels% make clean
Re-distribute the changed files:
bagels% make -C /var/411
Now re-install the nodes. Test this first on one node:
bagels% ssh compute-0-0 '/boot/kickstart/cluster-kickstart'
You can watch the install by doing:
bagels% ssh -p 2200 compute-0-0
Or by using "shoot-node compute-0-0" instead of the cluster-kickstart.
If it works then do the whole cluster:
bagels% cluster-fork '/boot/kickstart/cluster-kickstart'
##############################
Setup Bagels to allow key-based ssh logins just for cream-cheese
##############################
We need to run a second sshd program on the cluster to allow the backup server to connect via key-based ssh. To set this up see the cream-cheese setup.txt document.
##############################
Configuring SMART disk monitoring:
##############################
Modify the /etc/smartd.conf file on bagels to:
# Setup the internal drive to send daily emails if something goes wrong.
# Run a short self-test every Sunday at 2am. (-s S/../../7/02)
# Report changes to the health status (-H)
/dev/hda -H -M daily -m root -s S/../../7/02
Add "-M test" and then execute /sbin/service smartd restart to make sure the email gets through. Then remove the "-M test".
Now add /etc/smartd.conf to 411 so it is updated to all the nodes. To do this edit /var/411/Files.mk and add:
FILES += /etc/smartd.conf
Re-build the 411 database by executing:
bagels% cd /var/411
bagels% make clean
Re-distribute the changed files:
bagels% make -C /var/411
Now restart smartd on all the cluster nodes:
bagels% cluster-fork "/sbin/service smartd restart"
##############################
Configuring the NAS for user storage:
##############################
The default formatting of our NAS gave us one large partition on our NAS (which is what we want).
We want to have the following structure:
/state/partition1/
archive/
emergency_snapshots/
home/
shared/
system/
bin/
/utilities
cad/
cva/
WWW/
EMAIL/
local_documentation/
projects/
user/
tmp_nobackup/
We will put the users' home directories under home/ and the archived material will go under archive/. We will then export home/ with read/write permissions and archive/ with read-only permissions. This means that to modify the archive/ directory you have to log into the storage node.
Create the directories if they aren't already on the NAS.
nas-0-0% mkdir /state/partition1/emergency_snapshots
nas-0-0% mkdir /state/partition1/archive
nas-0-0% mkdir /state/partition1/home
Set up the archive directory so users can only read it:
nas-0-0% chmod 0755 /state/partition1/archive
Configure the home directory to be readable but not editable by users
nas-0-0% chmod 0755 /state/partition1/home
Now we need to export the home and archive directories. Edit nas-0-0:/etc/exports:
Add to export the home directory:
/state/partition1/home 10.0.0.0/255.0.0.0(rw,no_root_squash,sync)
Add to export the backup directory:
/state/partition1/archive 10.0.0.0/255.0.0.0(ro,no_root_squash,sync)
Re-export the file systems:
nas-0-0% exportfs -r
nas-0-0% exportfs -a
Verify it with:
nas-0-0% exportfs
You should see the two exports listed:
/state/partition1/archive
10.0.0.0/255.0.0.0
/state/partition1/home
10.0.0.0/255.0.0.0
/export/data1 10.0.0.0/255.0.0.0
Now when we create users we want to point them to that user directory.
NFS Performance tuning:
On the file server:
Let's try increasing the number of nfsd instances to 256 from 8. Create the file /etc/sysconfig/nfs on the file server and put in:
RPCNFSDCOUNT="256"
Modify the /etc/exports file to export everything as "async" not "sync" and export home with "wdelay" to try and aggregate writes.
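Combining that with the export line added earlier, the home export would become something like this (a sketch; keep whatever other options your exports file already carries):

```
/state/partition1/home 10.0.0.0/255.0.0.0(rw,no_root_squash,async,wdelay)
```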
Tune the network receive memory parameters to 256KB:
/sbin/sysctl -w net.core.rmem_default=262144
/sbin/sysctl -w net.core.rmem_max=262144
We can add these to /etc/sysctl.conf if it makes a big difference.
On the head node:
Modify the /etc/auto.master on the head node to include "-fstype=nfs,rsize=32768,wsize=32768" after "/etc/auto.home" and then reload autofs with:
service autofs reload
Propagate the changes to the rest of the cluster with:
make -C /var/411
Reload the auto.master for the rest of the cluster
unsetenv DISPLAY
cluster-fork "/sbin/service autofs reload"
You may have to do a "service autofs condrestart" to get some of them to reload, and if users are logged in, the changes won't take effect.
##############################
Setting up user disk quotas:
##############################
First create the quota files for the partition:
nas-0-0% touch /state/partition1/home/aquota.user
nas-0-0% chmod 600 /state/partition1/home/aquota.user
nas-0-0% touch /state/partition1/home/aquota.group
nas-0-0% chmod 600 /state/partition1/home/aquota.group
On the file server we need to change /etc/fstab to mount the /home directory with disk quotas enabled. Change nas-0-0%/etc/fstab to:
LABEL=/state/partition /state/partition1 ext3 defaults,usrquota,grpquota 1 2
Then restart the file server (we need to unmount the partition and this is the easiest way to do it):
nas-0-0% /sbin/shutdown -r now
Once the server comes back up, build the quota files. This may take quite a while; users should not be using the system during this time.
nas-0-0% quotacheck -vguma
Then enable quotas:
nas-0-0% quotaon -av
You can then check user quotas by running:
nas-0-0% repquota -avs
Now we need to establish disk quotas for all users. To do this we will set the disk quotas for one user and then propagate them to the rest.
We use "edquota -p" to propagate one user's quota to the others, and "awk" to select all usernames whose user ID (field 3) is strictly between 499 and 1024 (i.e., 500-1023).
nas-0-0% edquota -p davidbbs `awk -F: '(($3 > 499) && ($3 < 1024)) {print $1}' /etc/passwd`
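The awk filter can be sanity-checked on its own against sample passwd lines (the usernames here are made up):

```shell
# Worked example of the awk UID filter on made-up passwd entries
printf 'root:x:0:0:root:/root:/bin/bash\nalice:x:501:501::/home/alice:/bin/tcsh\nnobody:x:99:99::/:/sbin/nologin\n' \
  | awk -F: '(($3 > 499) && ($3 < 1024)) {print $1}'
# -> alice   (only UID 501 falls in the 500-1023 range)
```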
Set the default grace periods to 50 days. This way users will get at least one warning from the monthly disk reports.
nas-0-0% edquota -t
Check to make sure that the quotas are correct:
nas-0-0% /usr/sbin/repquota -as
Then check who is over-quota:
nas-0-0% quota -qs
The disk_quota_warning.py script should be run by cron on the head node every week to send out warning emails about being over-quota.
bagels% crontab -e
# Generate over-quota warnings every week
00 00 */7 * * /home/shared/system/bin/utilities/disk_quota_warning.py
##############################
Setting user limits on the head node:
##############################
We need to limit the resources users can use on the head node to prevent silly mistakes (such as recursively launching 21,000 processes or using up 8GB of memory) from taking down the whole cluster. To do this we will use pam_limits, which reads a config file from /etc/security/limits.conf.
You can get info on pam_limits from: /usr/share/doc/pam-0.77/
Add the following to /etc/security/limits.conf:
* - core 0 # Disallow core dumps
* - data 102400 # Max data size of 100MB per login
* - memlock 102400 # Max memory locked of 100MB per login
* - rss 204800 # Max resident memory set size of 200MB per login
* - stack 102400 # Max stack size of 100MB per login
* - nproc 100 # Maximum of 100 processes TOTAL
* - maxlogins 50 # Maximum of 50 logins
Make sure that root can still do a lot. (This is important for things like rsync.)
root - data 2097152 # 2GB
root - memlock 524288 # 512MB
root - rss 2097152 # 2GB
root - stack 2097152 # 2GB
root - nproc 5000
root - maxlogins 1000
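After re-logging in as a normal user you can spot-check that pam_limits took effect; compare each value against the limits.conf entries above:

```shell
# Each of these reports the current per-login limit for this shell
ulimit -c    # core file size -- should be 0 for non-root users
ulimit -u    # max user processes (nproc)
ulimit -s    # stack size, in KB
ulimit -l    # max locked memory (memlock), in KB
```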
##############################
Creating a universal tmp directory:
##############################
We want to create a universal tmp directory on the fileserver so people can put files there without having them snapshotted. This is kind of worthless since the files still count against their disk quota.
Create the directory:
bagels% mkdir /home/shared/user/tmp_nobackup
bagels% chmod o+rwx /home/shared/user/tmp_nobackup
##############################
Creating a universal bin directory:
##############################
We want to create a directory for binaries that can be installed for general use.
bagels% mkdir /home/shared/system/bin
bagels% chgrp admins /home/shared/system/bin
bagels% chmod g+w /home/shared/system/bin
bagels% chmod o+r /home/shared/system/bin
##############################
Setting up tripwire:
##############################
Tripwire is a checksumming program that checksums critical system files and stores those checksums in a cryptographically secure file. This means that if you set up and update tripwire, it can detect changes to any of the files it monitors. However, it will also flag your own modifications to those files unless you explicitly update the tripwire database after making them. (See below.)
We need to adjust the defaults because the reports we get otherwise have so many false positives that they are worthless. After we have done that we will re-build the policy file.
Go to /opt/tripwire/etc/twpol-parts/base and edit config to include:
/etc/samba/smb.conf -> $(SEC_CONFIG) ;
/etc/samba/smbpasswd -> $(SEC_CONFIG) ;
/etc/samba/smbusers -> $(SEC_CONFIG) ;
Edit boot-volatile to change /var/log to: (this causes it to check for files existing but not their contents)
/var/log/ -> $(IgnoreAll)
!/var/log/sa ;
Uncomment:
/var/lock/subsys/named -> $(SEC_CONFIG) ; #Uncomment when this file exists
/var/lock/subsys/nfs -> $(SEC_CONFIG) ; #Uncomment when this file exists
/var/lock/subsys/smb -> $(SEC_CONFIG) ; #Uncomment when this file exists
Edit devices to comment out /proc/kcore.
Edit root-home to not check various files that will always change when we do something as root:
!/root/.emacs.d/auto-save-list ;
!/root/.ICEauthority ;
!/root/.fonts.cache* ;
Add to osbin two lines to prevent checking the backup sanity check script files:
!/bin/.backup_sanity_check ;
!/lib/.backup_sanity_check ;
Do the same for userbin:
!/sbin/.backup_sanity_check ;
Rebuild the policy:
bagels% cd /opt/tripwire/etc && make policy
Note: after changing the policy you will always need to make updatedb. (See below.)
To set up tripwire for the first time we need to generate site and machine keys with known passwords so we can update the database. Subsequent uses only need to update the database, not regenerate the keys.
On bagels, cd to /opt/tripwire/etc.
Edit /opt/tripwire/etc/config.tw to set the TWEDITOR to "/usr/bin/emacs".
Remove the old keys: the bagels....key and site.key files in /opt/tripwire/etc.
Create a new key set with a known password:
bagels% cd /opt/tripwire/etc && make initialize-interactive
Tripwire will then ask you to set the password for the site and local databases and then generate them (which will require entering those passwords a few times.)
Set tripwire to send mail to security@cva.stanford.edu:
bagels% /opt/tripwire/etc/tw-email-to-set security@cva.stanford.edu
Now update the tripwire database following the instructions below for "using tripwire."
##############################
Using tripwire:
##############################
Whenever changes are made to the system (users added, programs updated, etc.) the tripwire database should be updated so it knows that the system has been legitimately changed.
To do this run:
bagels% cd /opt/tripwire/etc && make update
Tripwire will then open a text editor with a report listing all the things that have changed. Scroll down to the "Object Summary" part and it will list all objects that have changed.
By default a "[x]" will be next to all of them. For each object that has changed and you recognize as being a legitimate change leave the "x" there. For others remove the "x" so it does not update the database.
Save the file and exit the editor. You will then be prompted for the tripwire password.
Once you've done that we want to make sure everything is okay. Check it by running:
bagels% cd /opt/tripwire/etc && make check
You should see no files marked as modified. (You might see the emacs save list from the file you just modified but that can safely be ignored.) You may also get some "file system errors" but I think those can be ignored.
You can get the most recent tripwire report by executing:
bagels% cd /opt/tripwire/etc && make print-report
By default tripwire is run daily and reports are mailed to root. You can forward this mail to another account by adding a .forward address to the root account.
##############################
Installing pam_abl to block repeated login attempts:
##############################
http://sourceforge.net/projects/pam-abl/
pam_abl is a PAM module which will automatically (and temporarily) blacklist any accounts or host that execute multiple repeated failed login attempts. We want to install this on the head node.
We need to install pam_abl and then configure it.
Try "yum install pam_abl"; if that doesn't work, download the package from SourceForge, unpack it, and build it with make, then run "make install".
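The build-from-source path sketched above is roughly the following (the tarball name depends on the release you download; <version> is a placeholder):

```shell
tar xzf pam_abl-<version>.tar.gz
cd pam_abl-<version>
make
make install
```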
Copy conf/pam_abl.conf to /etc/security.
The default /etc/security/pam_abl.conf is configured as follows:
host_db=/var/lib/abl/hosts.db -- host database
host_purge=2d -- purge hosts from the database after 2 days
host_rule=*:10/1h,30/1d -- rule: any host 10 attempts in 1 hour or 30 in 1 day
user_db=/var/lib/abl/users.db -- user database
user_purge=2d -- purge users after 2 days
user_rule=!root:10/1h,30/1d -- any user except root for 10 attempts in 1 hour or 30 in 1 day
In /etc/pam.d/system-auth we need to add:
auth required /lib/security/pam_abl.so config=/etc/security/pam_abl.conf
after:
auth required /lib/security/$ISA/pam_env.so
To make this work with one-time-passwords we should move this to /etc/pam.d/otp-auth right before the pam_sotp.so line.
To test it try logging in with a bad password. To view the status execute:
/usr/bin/pam_abl -v /etc/security/pam_abl.conf
Let's also add a weekly blacklist report by adding a cron job:
crontab -e
00 00 */7 * * /usr/bin/pam_abl -v /etc/security/pam_abl.conf
Also add a crontask to purge the list every two days according to our rules above:
crontab -e
00 00 */2 * * /usr/bin/pam_abl -p /etc/security/pam_abl.conf
##############################
Configuring one time passwords with pam_sotp-0.3.3
##############################
Download pam_sotp-0.3.3 and make the package. To build the package you need to have the pam libraries installed for development, so you may need to build it on a compute node and not the head node.
To install it, copy the pam module (pam_sotp.so) into /lib64/security on the head node and set it to be owned by root with 755 permissions.
bagels% cp pam_sotp-0.3.3/src/pam/pam_sotp.so /lib64/security/
bagels% chown root.root /lib64/security/pam_sotp.so
bagels% chmod 755 /lib64/security/pam_sotp.so
Next you need to copy the otppasswd executable to the /bin directory on the head node and set it to be owned by root and have 2755 permissions.
bagels% cp pam_sotp-0.3.3/src/utils/otppasswd /bin/
bagels% chown root.root /bin/otppasswd
bagels% chmod 2755 /bin/otppasswd
When the otppasswd executable runs it uses PAM to check the /etc/pam.d/otppasswd file for authentication. We can configure this file to require that the user enter his or her regular (non-one-time) password to generate more passwords, or set it up so they just run the command to generate more. For the latter case we need to create the file /etc/pam.d/otppasswd and put:
auth sufficient pam_permit.so
auth required pam_unix.so
Then change the permissions so only root can write it, but others can read it:
bagels% chmod 644 /etc/pam.d/otppasswd
We aren't going to actually allow users to use the otppasswd program to generate their own passwords, since it would spit them out to the terminal and they would just store them on their local computer in some file named passwords.txt, which defeats the whole point. Instead users will run the /home/shared/system/bin/utilities/passwd_generate.py script to generate new passwords. This script prints two copies of the passwords to a printer and sends the user mail telling them that more passwords were generated.
Now create the directory for pam_sotp to store its databases.
bagels% mkdir /etc/sotp
bagels% chown root.root /etc/sotp
bagels% chmod 770 /etc/sotp
We are going to setup one time passwords to only be required for ssh/sftp access and sudo access on the head node. To do this we need to modify the PAM setup for sshd and sudo, and change the sshd_config to disable RSA key-based authentication.
Set up sshd to show the one-time password number and to accept no form of authentication other than passwords.
Edit the /etc/ssh/sshd_config file:
RSAAuthentication no
PubkeyAuthentication no
HostbasedAuthentication no
PasswordAuthentication yes
KerberosAuthentication no
GSSAPIAuthentication no
UsePAM yes
Once you have changed the sshd_config you need to re-load it by executing kill -HUP on the sshd process; you can get the PID for sshd from "ps aux | grep sshd".
Create the otp-auth pam file in /etc/pam.d/ on the head node:
auth required /lib/security/$ISA/pam_env.so
auth sufficient /lib/security/$ISA/pam_sotp.so prompt_number=yes
auth required /lib/security/$ISA/pam_deny.so
Modify the /etc/pam.d/sshd and /etc/pam.d/sudo files such that the
auth required pam_stack.so service=system-auth
is changed to:
auth required pam_stack.so service=otp-auth
##############################
Adjusting logrotate on the head node:
##############################
Change logrotate to keep 8 weeks (2 months) worth of logs. Logrotate runs automatically to move the logs in /var/log and delete old ones.
We want to make sure we have enough to not miss any when the weekly snapshots are thrown out in favor of the monthly ones.
Edit /etc/logrotate.conf and change the rotate count to 52.
Change the /var/log/wtmp {} entry to rotate monthly and keep 12 rotations (rotate 12).
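As a sketch, the modified wtmp entry would read like this (any other options already in the stock entry, such as its create line, stay as they are):

```
/var/log/wtmp {
    monthly
    rotate 12
}
```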
Add a few extra files to logrotate:
# system-specific logs may be also be configured here.
/var/log/watchHeadNodeUsers.log {}
/var/log/qlogin_cleanup {}
/var/log/check_server_status {}
/var/log/user {}
Enable compression, add:
compress
Enable mailing of expired (compressed) log entries:
mail cluster-admin@cva.stanford.edu
##############################
Enabling user process accounting:
##############################
Turn on process accounting on the head node so we'll have a record of what was run when and by whom. To do this execute:
bagels% touch /var/log/pacct
bagels% chmod 0644 /var/log/pacct
bagels% /sbin/accton /var/log/pacct
Add the log to logrotate by editing /etc/logrotate.conf and adding:
# Pacct needs to be restarted when we change the log
# Before we rotate turn it off, then turn it on.
# We'll lose process accounting for a bit, though...
/var/log/pacct {
prerotate
/sbin/accton
endscript
postrotate
touch /var/log/pacct
chmod 0600 /var/log/pacct
/sbin/accton /var/log/pacct
endscript
}
You can then use lastcomm to analyze the results of this log file. Note that the log may get large, as it records every process, so it is important to have logrotate compress it.
##############################
Configuring sudo:
##############################
Generally we don't want people becoming root to do things. Instead we want to have them use the 'sudo' command.
We should make a group of admins who are allowed to execute commands. To do this create an admin group:
bagels% /usr/sbin/groupadd admins
Put it into samba as well so they can access directories over the network
bagels% net groupmap add ntgroup="admins" unixgroup="admins"
Put appropriate users in the group:
bagels% /usr/sbin/usermod -a -G admins davidbbs
Update across the cluster:
bagels% make -C /var/411
To activate sudo we need to edit the config file using /usr/sbin/visudo.
First export EDITOR="emacs"
Then run visudo and add the following to the defaults to ensure that people don't accidentally run multiple root commands.
Defaults:ALL timestamp_timeout=0
We want to allow the group, so add:
%admins ALL=(ALL) ALL
Allow root full access
root ALL=(ALL) ALL
And we want users to be able to update the web site
ALL ALL=/home/shared/system/cva/update.sh
Make sudo log all actions by adding:
Defaults logfile=/var/log/sudo.log
Then you can add users to the list by adding them as:
user-name ALL=(ALL) ALL
This specifies that the given user can do anything. You can also be more restrictive (see the man page).
We need to configure the cluster so that changes to sudo get propagated to all the nodes. To do this edit /var/411/Files.mk to add
FILES += /etc/sudoers
to the end. Then update across the cluster:
bagels% make -C /var/411
Now all changes to sudoers on the head node will be propagated to all nodes when 411 is updated.
##############################
samba:
##############################
We want to enable samba but limit it to the part of the network with the cva machines and the Gates VPN.
Since we are paranoid we are going to limit it both within samba and using the firewall.
Check if samba is installed:
bagels% which smbd
Use yum to install samba:
bagels% yum install samba
Use yum to install swat:
bagels% yum install swat
This will put the "swat" configuration file in /etc/xinetd.d, but it will have disable = yes, which will prevent it from running.
bagels% chkconfig swat on
Restart xinetd
bagels% service xinetd restart
Now run a web browser on the head node and go to:
http://localhost:901
Log in as root.
Go to globals and choose "Advanced"
Allow only our hosts:
hosts allow: 171.64.72., 172.24.72., 171.67.73., 172.24.76.
Change the workgroup to CVA
workgroup: CVA
Run only on ethernet 1
interfaces: eth1
Click "Commit Changes" to save the changes.
Make it browseable
Click "Commit Changes"
Note that since the directories are automounted, users who log in and view the root /home/ directory will see a free disk size of 0. If users want to browse it we can add another share:
[other-users]
path = /home/
browseable = Yes
Commit the changes.
Turn SWAT off:
bagels% chkconfig swat off
Set samba to log accesses by user:
In /etc/samba/smb.conf change it to have: "log file = /var/log/samba/%U.log"
Start samba:
bagels% service smb start
Configure samba to come up at startup in run levels 3, 4, and 5.
chkconfig --levels 345 smb on
Check that it's configured correctly:
chkconfig --list smb
To add additional shares to Samba we can manually edit the configuration file.
Add a share for the projects/ and shared/ directories. Add these manually to the bagels%/etc/samba/smb.conf file as:
[projects]
comment = CVA Projects
path = /home/shared/projects/
valid users = @students, @faculty
read only = No
inherit permissions = Yes
[shared]
comment = CVA Shared Directory
path = /home/shared/
valid users = @students, @faculty
read only = No
inherit permissions = Yes
Then restart smbd:
bagels% service smb reload
Apparently we should open ports 137, 138, 139 and 445 for both udp and tcp traffic. See the firewall configuration section for how this is set up. Basically we create a Gates chain which has all the IPs we want to allow and then we jump to that chain whenever ports 137, 138, 139, or 445 are accessed. We could do this manually as follows, but it's a pain:
bagels% /sbin/iptables --insert INPUT --source 171.64.72.0/255.255.255.0 --in-interface eth1 --proto tcp --dport 137 --jump ACCEPT
bagels% /sbin/iptables --insert INPUT --source 171.64.72.0/255.255.255.0 --in-interface eth1 --proto tcp --dport 138 --jump ACCEPT
bagels% /sbin/iptables --insert INPUT --source 171.64.72.0/255.255.255.0 --in-interface eth1 --proto tcp --dport 139 --jump ACCEPT
bagels% /sbin/iptables --insert INPUT --source 171.64.72.0/255.255.255.0 --in-interface eth1 --proto tcp --dport 445 --jump ACCEPT
bagels% /sbin/iptables --insert INPUT --source 171.64.72.0/255.255.255.0 --in-interface eth1 --proto udp --dport 137 --jump ACCEPT
bagels% /sbin/iptables --insert INPUT --source 171.64.72.0/255.255.255.0 --in-interface eth1 --proto udp --dport 138 --jump ACCEPT
bagels% /sbin/iptables --insert INPUT --source 171.64.72.0/255.255.255.0 --in-interface eth1 --proto udp --dport 139 --jump ACCEPT
bagels% /sbin/iptables --insert INPUT --source 171.64.72.0/255.255.255.0 --in-interface eth1 --proto udp --dport 445 --jump ACCEPT
bagels% iptables --insert INPUT --source 172.24.72.0/255.255.255.0 --in-interface eth1 --proto tcp --dport 139 --jump ACCEPT
bagels% iptables --insert INPUT --source 171.64.72.0/255.255.255.0 --in-interface eth1 --proto tcp --dport 139 --jump ACCEPT
bagels% iptables --insert INPUT --source 171.67.73.0/255.255.255.0 --in-interface eth1 --proto tcp --dport 139 --jump ACCEPT
And allow access for Bill's office:
bagels% iptables --insert INPUT --source 172.24.76.0/255.255.255.0 --in-interface eth1 --proto tcp --dport 139 --jump ACCEPT
Verify that this added the rules for netbios:
bagels% iptables --list
Save it.
bagels% service iptables save
Now we need to configure PAM to update the samba passwords when users change their linux ones.
To do this we edit bagels:/etc/pam.d/system-auth
We need to insert a line for smb into the password stack. (http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/pam.html)
Insert the following line in the password section before any "sufficient" directives (they will override it) and before the pam_deny directive.
password required /lib64/security/pam_smbpass.so nullok use_authtok try_first_pass
Now when users type passwd it should change both their SMB and their unix password.
When creating users we need to add them to the samba password database with:
bagels% smbpasswd -na username
Remember that when deleting users we need to remove them from the samba database with
bagels% smbpasswd -x username
The addBagelsUser.py script does this.
To allow anyone to mount samba shares on the head node we can modify the sudoers file using /usr/sbin/visudo to include:
# Allow any user to mount smb file systems
ALL ALL=/usr/bin/smbmount
ALL ALL=/usr/bin/smbumount
This will allow all users on the head node to mount samba file systems.
Samba may report the wrong disk space and give the warning "WARNING: dfree is broken on this system" when connecting.
This can be fixed by setting the dfree command to a script like: df $1 | tail -1 | awk '{print $1" "$3}'
That script will try to get the free disk space on the requested share. This will work for users (since their share exists on the file system) but the other-users share will not work because it is an automounted directory (/home). To fix this we just replace the $1 in the script with /home/backup. This will accurately report the free space on the server. Put this in a script at /home/shared/system/bin/utilities/cluster_dfree.sh and then configure samba to use that script for the dfree command. Add the following line to the smb.conf file under [global]:
dfree command = /home/shared/system/bin/utilities/cluster_dfree.sh
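As a sanity check of the parsing, here is a sketch of what cluster_dfree.sh does, run against a canned df output (the sample device name and numbers are made up). Samba expects "total available" in blocks on stdout; because the NFS device name is long, df wraps onto two lines, so on the last line $1 is the total and $3 is the available space, which is exactly what the awk grabs:

```shell
# The real script would simply be:
#   #!/bin/sh
#   df /home/backup | tail -1 | awk '{print $1" "$3}'
# Demonstrated here on canned wrapped df output:
sample='Filesystem           1K-blocks      Used Available Use% Mounted on
nas-0-0.local:/state/partition1/home
                     10000000   2000000   8000000  20% /home/backup'
dfree=$(printf '%s\n' "$sample" | tail -1 | awk '{print $1" "$3}')
echo "$dfree"
```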
Does having this in the NT-CVA domain mean that the CVA PDC is used to try and access it so it will fail? Should we put it in another domain? Set it as a domain master?
Do we need a guest account to browse it?
##############################
firewall:
##############################
We want the following:
Allow samba from our specific Gates subnets.
Allow ssh on port 22.
Allow ssh on port 26 from cream-cheese.
Allow any out-going connection to stanford networks.
Disallow any and all other incoming and outgoing traffic.
To do this we use the following /etc/sysconfig/iptables script. Copy it into /etc/sysconfig/iptables then restart iptables to load it.
bagels% /sbin/service iptables restart
# Generated by David Black-Schaffer on 27-April-2007
*filter
:INPUT ACCEPT [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [16072211570:11381251689854]
# Create our two chains for Gates and Stanford
-N Gates
-N Stanford
# ###
# INPUT chain
# ###
# Accept continuing traffic
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
# Allow all internal cluster traffic
-A INPUT -i eth0 -j ACCEPT
-A INPUT -i lo -j ACCEPT
# Allow ssh
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
#-A INPUT -i eth1 -p tcp -m tcp --dport 22 -j ACCEPT
# Cream-cheese backup ssh on port 26
-A INPUT -s 172.24.94.72 -i eth1 -p tcp -m tcp --dport 26 -j ACCEPT
# Samba from Gates
-A INPUT -p tcp --dport 137 -j Gates
-A INPUT -p tcp --dport 138 -j Gates
-A INPUT -p tcp --dport 139 -j Gates
-A INPUT -p tcp --dport 445 -j Gates
-A INPUT -p udp --dport 137 -j Gates
-A INPUT -p udp --dport 138 -j Gates
-A INPUT -p udp --dport 139 -j Gates
-A INPUT -p udp --dport 445 -j Gates
# Other stuff
-A INPUT -p icmp -m icmp --icmp-type any -j ACCEPT
-A INPUT -p udp -m udp --dport 0:1024 -j REJECT --reject-with icmp-port-unreachable
-A INPUT -p tcp -m tcp --dport 0:1024 -j REJECT --reject-with icmp-port-unreachable
-A INPUT -p udp -m udp --dport 8649 -j REJECT --reject-with icmp-port-unreachable
# Else reject it
-A INPUT -i eth1 -j REJECT --reject-with icmp-port-unreachable
# ###
# FORWARD Chain
# ###
# Should we allow new forwarded inputs?
-A FORWARD -i eth1 -o eth0 -m state --state NEW,RELATED,ESTABLISHED -j ACCEPT
#-A FORWARD -i eth1 -o eth0 -m state --state NEW,RELATED,ESTABLISHED -j ACCEPT
#-A FORWARD -i eth0 -j ACCEPT
# Process all forwarded traffic to make sure it's going on-campus.
-A FORWARD -i eth0 -j Stanford
-A FORWARD -i eth0 -j REJECT --reject-with icmp-port-unreachable
# ###
# OUTPUT Chain
# ###
-A OUTPUT -o eth1 -j Stanford
-A OUTPUT -o eth1 -j REJECT --reject-with icmp-host-prohibited
# ###
# Gates IP check. This is for our machines and Bill's office
# ###
-A Gates -s 172.24.76.0/255.255.255.0 -i eth1 -j ACCEPT
-A Gates -s 171.67.73.0/255.255.255.0 -i eth1 -j ACCEPT
-A Gates -s 171.64.72.0/255.255.255.0 -i eth1 -j ACCEPT
-A Gates -s 172.24.72.0/255.255.255.0 -i eth1 -j ACCEPT
# ###
# Stanford Main Campus IP check
# ###
# Main campus
-A Stanford -d 171.64.0.0/255.255.0.0 -j ACCEPT
-A Stanford -d 171.67.0.0/255.255.0.0 -j ACCEPT
# Private VLANs
-A Stanford -d 172.0.0.0/255.0.0.0 -j ACCEPT
# Main campus DSL
-A Stanford -d 171.66.0.0/255.255.0.0 -j ACCEPT
# Student residences
-A Stanford -d 128.12.0.0/255.255.0.0 -j ACCEPT
# Medical center
#-A Stanford -d 171.65.0.0/255.255.0.0 -j ACCEPT
# SLAC
#-A Stanford -d 134.79.0.0/255.255.0.0 -j ACCEPT
COMMIT
*nat
:PREROUTING ACCEPT [33699306:5199993235]
:POSTROUTING ACCEPT [2600364:205594290]
:OUTPUT ACCEPT [2696100:215351183]
-A POSTROUTING -o eth1 -j MASQUERADE
COMMIT
##############################
.stanford.edu domain:
##############################
Edit /etc/sysconfig/network-scripts/ifcfg-eth1 to include
DOMAINNAME=stanford.edu
##############################
Creating groups:
##############################
Create a student group:
bagels% groupadd students
bagels% groupadd alumni
bagels% groupadd faculty
bagels% groupadd staff
bagels% groupadd regression
Remember when creating groups to add appropriate mappings to samba so file sharing works. To do this use the "net" command:
bagels% net groupmap add ntgroup="students" unixgroup="students"
##############################
Creating users:
##############################
Users should be added to the appropriate groups. In general this will be the "students" group, but it may also include project groups. However, students of other professors should be added to a group for their professor. I.e., if Ron Fedkiw has a student using the cluster there should be a prof-fedkiw group and that student should be added to it. Groups should be used both for access control and account management.
Users should be added with their numeric UID equal to their Stanford SUID. To find out this number we need to use an LDAP query. To make that work we need to install the openldap client on the head node:
bagels% yum install openldap-clients
Once that is installed we just use the script in /home/shared/system/bin/utilities/addBagelsUser.py. To run the script execute:
bagels% sudo /home/shared/system/bin/utilities/addBagelsUser.py
This script does the following:
Create the user account. This puts the home directory on the NAS.
bagels% /usr/sbin/useradd --netserver nas-0-0.local -d /state/partition1/home/test-user -c "Test User" test-user
Add the user to the appropriate group
bagels% /usr/sbin/usermod -a -G students test-user
Create a samba user entry for the user
bagels% smbpasswd -an test-user
Set the password
bagels% passwd test-user
Note that the directory may not show up in the home directory on the head node until autofs restarts. You can still cd into it, though. The group change may not propagate and may need to be manually propagated by issuing a 411 make:
bagels% make -C /var/411/
Update the tripwire database:
bagels% /opt/tripwire/etc/make update
##############################
Creating a cadmgr user:
##############################
Create a user with UID 413 as the "cadmgr" user. This user will own all the cad tools.
/usr/sbin/useradd -c "Cad Manager" -u 413 -M -n cadmgr
Now we want to set the home directory to the cad directory and the group to cad and lock the account:
/usr/sbin/usermod -d "/home/shared/system/cad" -L -g "cad" cadmgr
Make sure you change the owner of the cad tools directory to cadmgr:
chown -R cadmgr /home/shared/system/cad
Now make sure all of the members of the cad group can sudo su cadmgr. Run /usr/sbin/visudo to edit the sudoers file.
Add:
# Allow all cad users to become cadmgr
%cad ALL= (root) /bin/su cadmgr
We also want to allow people in the cad group to change directories to be owned by the cadmgr user. To do this I created a script /home/shared/system/bin/utilities/set_cadmgr_user.sh which simply calls:
chown -R cadmgr.cad $1
This way we can allow all users in the cad group to change files to be owned by the cadmgr by adding the following to sudo via visudo:
# Allow all cad users to change the owner to cadmgr
%cad ALL= (root) /home/shared/system/bin/utilities/set_cadmgr_user.sh
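A minimal sketch of what set_cadmgr_user.sh could look like, with a small argument guard added (the guard is our addition, not part of the original one-liner). Only the guard path is exercised here, since the chown itself needs root and the cadmgr user:

```shell
# Write the sketch to a temp file and exercise its usage check.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/sh
# Refuse to run without an argument so a sudo invocation
# can't accidentally chown the wrong thing.
if [ -z "$1" ]; then
    echo "usage: set_cadmgr_user.sh directory" >&2
    exit 1
fi
chown -R cadmgr:cad "$1"
EOF
chmod +x "$script"
guard_status=0
"$script" 2>/dev/null || guard_status=$?
echo "$guard_status"
```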
##############################
Creating a shared directory:
##############################
We want to create a directory that can be shared by a group. To do this we first create a group for it:
bagels% /usr/sbin/groupadd merrimac
Add the users we want to the group:
bagels% /usr/sbin/usermod -a -G merrimac davidbbs
Propagate the changes to the cluster:
bagels% make -C /var/411/
Create the directory:
bagels% mkdir /home/shared/projects/merrimac
Change the privileges on the shared directory:
bagels% chmod g+rwx /home/shared/projects/merrimac
Change the group on the shared directory:
bagels% chgrp merrimac /home/shared/projects/merrimac
Set the setgid bit so new files inherit the group:
bagels% chmod g+s /home/shared/projects/merrimac
Verify that it's correct:
bagels% ls -l /home/shared/projects/
We should see:
drwxrwsr-x 7 root merrimac 4.0K Dec 14 00:20 merrimac
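The setgid behavior can be sanity-checked in a scratch directory (the chgrp to merrimac is skipped here since the group may not exist where you test; only the mode bits are verified):

```shell
# Make a scratch "merrimac" directory and apply the same chmods as above.
umask 022
d=$(mktemp -d)/merrimac
mkdir -p "$d"
chmod g+rwx "$d"
chmod g+s "$d"
# The octal mode should now lead with 2 -- the setgid bit,
# shown as the "s" in drwxrwsr-x.
mode=$(stat -c %a "$d")
echo "$mode"
```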
##############################
Creating regression users or shared accounts:
##############################
Regression user accounts are used to run cron jobs and allow tests to run outside of the individual users' accounts.
They should be normal user accounts except that they should:
1) be in the regression group
2) have a .noaliases file so they don't get added to the cva-users list
3) have their .forward pointing to a member of the group
4) allow members of the project group to "sudo su regression-user" to become that user.
Create the user account using the standard addBagelsUser script. Do not add them to the students group, but instead add them to the regression group.
Create a .noaliases file in the account's home directory and remove the .aliases file.
Modify the default .forward file to point to a real user who is responsible for the account.
Allow the members to su to the regression user. Run visudo and add:
%group ALL = (root) /bin/su regression-user
Where the group is the appropriate group (i.e., "imagine") and the regression-user is the appropriate account (i.e., "imagine-regression"). Now anyone in the group can execute "sudo su regression-user" to become that user.
Disable the password for the regression-user by executing:
bagels% usermod -L regression-user
After changing the sudo settings propagate them to the cluster via:
bagels% make -C /var/411
##############################
Email forwarding and Web syncing with cva.stanford.edu:
##############################
The cva group WWW and email forwarding service works by having a cron job on bagels run every hour to synchronize the contents of the group web directory (/home/shared/system/cva/WWW) and email forwarding files (/home/shared/system/cva/MAIL/) with the user files (~/WWW and ~/.forward).
There are two scripts (/home/shared/system/cva/WWW/WWWupdate.py and /home/shared/system/cva/MAIL/MAILupdate.py) that handle the updates.
The WWWupdate.py script goes through all user directories and if the user has a WWW directory in his or her home directory a soft link is created to the group web directory (/home/shared/system/cva/WWW/people/). Then rsync is used to synchronize /home/shared/system/cva/WWW to /var/www/html/ on the web server. This rsync command follows links, so links should not be used in the website hierarchy. If a user who no longer has an account wants a web directory it can simply be copied into the people/ directory or linked from their archived account.
The MAILupdate.py script builds an email aliases file (/home/shared/system/cva/MAIL/aliases) by merging the contents of the "alumni_mail", "mail_lists", "other_mail", and "system_mail" files and any .aliases files found in the users' directories. The ".aliases" files should be of the form "address: forward-address". The script also builds a cva-users list of all the users who have .aliases files, using their first listed alias. This file is then rsync-ed to the web server and the aliases database is updated. The static files above should be used to hold aliases for people who have cva forwarding but do not have cva accounts.
There are two directories on the mail server, /etc/mail_includes/ and /etc/mail_archive/, which are synced to bagels. If you want to use :include: to include a file in an email aliases file you must reference it as :include:/etc/mail_includes/filename.include. Then when the sync happens it will work. Similarly, if you want to archive a mailing list you should pipe it to "/etc/mail_archive/listname.mail" and then it will be synced to bagels every hour.
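A hypothetical sketch of the merge MAILupdate.py performs, run against a throwaway layout instead of the real MAIL/ and /home paths (only two of the four static files are shown, and the names/addresses are made up):

```shell
# Build a scratch layout mimicking MAIL/ static files and a user .aliases.
tmp=$(mktemp -d)
mkdir -p "$tmp/MAIL" "$tmp/home/alice"
printf 'old-grad: grad@example.com\n' > "$tmp/MAIL/alumni_mail"
printf 'cva-staff: staff@example.com\n' > "$tmp/MAIL/system_mail"
printf 'alice: alice@example.com\n' > "$tmp/home/alice/.aliases"
# Concatenate the static files, then append every user's .aliases:
cat "$tmp/MAIL/alumni_mail" "$tmp/MAIL/system_mail" > "$tmp/MAIL/aliases"
for f in "$tmp"/home/*/.aliases; do
    [ -f "$f" ] || continue
    cat "$f" >> "$tmp/MAIL/aliases"
done
# Each user's first listed alias also seeds the cva-users list:
head -1 "$tmp/home/alice/.aliases" | awk -F': ' '{print $1}'
wc -l < "$tmp/MAIL/aliases"
# (rm -rf "$tmp" when done)
```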
To add the update job to cron, scheduled every hour, modify the crontab on bagels as follows:
bagels% export EDITOR=emacs (unless you like vi)
bagels% crontab -e
0 */1 * * * /home/shared/system/cva/update.sh
These scripts can be run by any user, if sudo is set up correctly, by executing:
bagels% sudo /home/shared/system/cva/update.sh
##############################
Web statistics pages with Visitor
##############################
Download and make Visitors from http://www.hping.org/visitors. Put it in /home/shared/system/bin/visitors-web-log-analyzer.
Create the process_logfiles.sh script as:
#! /bin/sh
#
# File to update the webserver stat files based on the logs that are in the snapshots.
LOG_FILES=/home/backup/daily.0/cva/var/log/httpd/access_log*
DEST=/home/shared/system/cva/WWW/WWW/stats
./visitors $LOG_FILES -A -m 30 --error404 --trails --prefix http://cva.stanford.edu --output-file $DEST/stats.html
./visitors $LOG_FILES --prefix http://cva.stanford.edu --graphviz > $DEST/graph.dot
dot $DEST/graph.dot -Tpng > $DEST/graph.png
Create the stats directory in WWW on bagels.
Add the process_logfiles.sh script to cron as:
crontab -e
30 02 * * * /home/shared/system/bin/visitors-web-log-analyzer/process_logfiles.sh > /dev/null
We also need to make this directory restricted. On cva create the file /etc/httpd/conf.d/stats.conf:
# Limit stats access to stanford cs only.
#
Alias /stats /var/www/html/stats
<Directory /var/www/html/stats>
    Order deny,allow
    Deny from all
    Allow from bagels.stanford.edu
    Allow from 172.24
    Allow from 171.64
    Allow from 171.67
    Allow from 127.0.0.1
    Allow from ::1
</Directory>
And restart apache:
/sbin/service httpd restart
##############################
Archiving data from other machines:
##############################
We want to create an archive shared directory and move the contents of another machine to it. However, since this is an archive we don't want people to accidentally write to it, so we will mount it the same way we mounted the backup, i.e., read only.
Create the archive directory on the storage server:
nas-0-0% mkdir /state/partition1/archive
Export it from the storage server. Edit nas-0-0%/etc/exports to include:
/state/partition1/archive 10.0.0.0/255.0.0.0(ro,no_root_squash,sync)
Update the exports:
nas-0-0% exportfs -a
Update the head node to import it under users. Modify bagels%/etc/auto.home to include:
# Mount the archive directory. Note it is not in the home directory so it can be read only
archive nas-0-0.local:/state/partition1/archive
Propagate the information to the cluster:
bagels% make -C /var/411
Now the archive directory will be available under /home/archive with read-only permissions on all machines except the storage server. This means that to get files into it you should log in as root to the storage server and copy them to the /state/partition1/archive directory by hand.
Moving data over from other machines:
The easiest way to archive data from another machine is to rsync it to bagels, put it in the /home/shared/user/tmp_nobackup/ directory, and then log into the storage node and move it to the archive. That way you can write it to bagels as root to preserve all the permissions and time stamps.
Here's an example of how I moved the dim-sum files over:
The easiest way to do it is to use rsync from the storage server to get the files you want. This requires that you be able to log in as root via ssh to the machine you want to archive. To do this you need to modify /etc/ssh/sshd_config on that machine.
Once you've done that run rsync as follows:
nas-0-0% rsync -rlpt root@source-machine:/path/to/filesystem/ /state/partition1/archive/archivename/
The "-rlpt" will do it recursively, keeping links as links (i.e., not following them), keeping permissions, and keeping time stamps. Unless the UIDs and GIDs are the same between the two systems there is no point in using -g and -u to keep those the same.
Note that there will be problems here. Any symbolic links that leave the file system you're archiving (i.e., those that are not relative) will break. There's not much you can do about this.
If you are archiving local data you can just do:
nas-0-0% rsync -a /state/partition1/user-to-archive/ /state/partition1/archive/users/user-to-archive
This will keep the user and group information. If you want to maintain email aliases or web pages for the user you are archiving you will need to manually copy them to the /home/shared/system/cva/WWW/people directory and the /home/shared/system/cva/MAIL/alumni_mail file.
You may have to be explicit if you are using a broken machine like dim-sum and specify the path to ssh using the --rsh=path-to-ssh.
##############################
Editing the web interface:
##############################
The WordPress login is "admin" and the root password.
Run firefox on the head node and log in.
You can then configure the interface.
We'll add a FAQ section.
Go to the "Write" section and choose "Write Page" and create a new page called "Bagels Info/FAQ".
In that page put a link to http://bagels/local_documentation/
Now we need to tell the web server to look in the right place. To do this we'll put a link in bagels%/var/www/html for local_documentation.
bagels% ln -s /home/shared/system/local_documentation/ /var/www/html/local_documentation
Now test it by putting a test file in the local_documentation directory.
bagels% touch /home/shared/system/local_documentation/test
##############################
Modifying mySQL configuration:
##############################
The default install has two mySQL databases: wordpress and cluster.
By default, accessing https://bagels.local/admin/phpMyAdmin and entering the root password will only let you modify the cluster database. (Note that it will allow you to do ANYTHING to that database, so be careful!) Note: this is "https" not "http".
We will modify this to allow modification to any database on the system. To do so we need to edit the configuration file.
Edit the file:
bagels%/var/www/html/admin/phpMyAdmin/rocks.conf
Change:
$cfg['Servers'][1]['user'] = 'apache';
to use 'root'.
Add:
$cfg['Servers'][1]['auth_type'] = 'http';
And comment out:
$cfg['Servers'][1]['only_db'] = 'cluster';
Save it and now you can access any of the databases from that URL with the root password. There is also a http://localhost/phpMyAdmin/ installation which runs as the user 'nobody' and should not be able to do anything except view the data. (I hope.)
##############################
3ware utility:
##############################
We should install the GUI config tool for the 3ware card on the storage node.
Download the latest for the 3Ware 9000 series controller from 3ware.com.
Download the 3DM2 management utility. Unzip it and run the installer.
Accept all the defaults except: set the mail server to bagels.localhost and allow remote administration. Have mail sent to cluster-admin@cva.stanford.edu. If you make a mistake you can always edit /etc/3dm2/3dm2.conf.
Then set up the run levels as suggested. First unzip the archive:
nas-0-0% gunzip 3dm-lnx.tgz
nas-0-0% tar -xf 3dm-lnx.tar
nas-0-0% cp -f rc.redhat /etc/rc.d/init.d/3dm2
nas-0-0% chmod 755 /etc/rc.d/init.d/3dm2
nas-0-0% chkconfig --add 3dm2
nas-0-0% chkconfig --level 345 3dm2 on
Start up the server
nas-0-0% service 3dm2 start
Now start firefox on the head node and go to http://nas-0-0:888 and you should get the 3dm2 manager. The admin password is 3ware by default. Set it to the root password.
You should set it to email on info and send messages to root-nas-raid, host cva.stanford.edu, and set the from to be "Bagels 3ware disk array ". The address is important or Stanford won't deliver it.
##############################
install emacs (or any other RPM) by hand on the cluster:
##############################
To install emacs on all the nodes we need to add the RPMs to the distribution and then rebuild the nodes.
To do this copy the emacs RPMs from bagels:/home/install/rocks-dist/lan/x86_64/RedHat/RPMS/ to bagels:/home/install/contrib/4.0.0/x86_64/RPMS/.
bagels% cp /home/install/rocks-dist/lan/x86_64/RedHat/RPMS/emacs-* /home/install/contrib/4.0.0/x86_64/RPMS/
Then create a configuration file for the packages in bagels:/home/install/site-profiles/4.0.0/nodes/ if one does not already exist.
bagels% cd /home/install/site-profiles/4.0.0/nodes/
bagels% cp skeleton.xml extend-compute.xml
Then modify the extend-compute.xml file to include the packages by name:
<package>emacs</package>
<package>emacs-common</package>
<package>emacs-el</package>
<package>emacs-leim</package>
Check really carefully that there aren't any errors here. Forgetting a "/" will cause the node to reboot and not re-install. (You just have to fix it then manually reboot it in that case.)
Rebuild the distribution:
bagels% cd /home/install
bagels% rocks-dist dist
Try installing it on one node first. Log into compute-0-0:
bagels% ssh compute-0-0 /boot/kickstart/cluster-kickstart
The node will rebuild itself. When it's done see if it works. If it does re-install all of them
bagels% cluster-fork /boot/kickstart/cluster-kickstart
##############################
Installing non-RPM files:
##############################
To install non-RPM files you should try very hard to put them in the /home/shared/system/bin/ directory. If you can't do that then you have two options:
1) build an RPM (http://asic-linux.com.mx/~izto/checkinstall/) and install that.
2) use cluster-fork or a post-install script to install it on the nodes.
##############################
User RPM install:
##############################
Create a local db for the RPMs (apparently you don't need to do this and can just do the install):
user@compute-0-0% rpm --initdb --dbpath /home/user/rpminstall/ --root /home/user/rpminstall/
Do the install using those paths (and hope it works...)
user@compute-0-0% rpm -i --dbpath /home/user/rpminstall/ --root /home/user/rpminstall/ --nodeps package.rpm
Note that this DB doesn't have all the actually installed packages so you need to tell it to ignore dependencies.
Apparently there is a bug in RPM with this: https://lists.dulug.duke.edu/pipermail/rpm-devel/2005-January/000233.html
##############################
Configure job submission
##############################
We want to reserve 5 machines for interactive-only logins and use the rest for submitted jobs. To do this we need to create a host list for the interactive jobs (@interactive) and one for the submitted jobs (@jobhosts). Then we need separate queues, where some are interactive only and the others are batch only.
For the cluster queues we need 5 queues:
1) cva-qlogin.q for interactive sessions from CVA group users. This queue will run on the @interactive machines.
2) cva-batch.q for batch jobs from CVA group users and over-flow interactive sessions from CVA group members. This queue will run on the @jobhosts machines.
3) non-cva-qlogin1.q for interactive sessions from non-cva users. This queue will run on the @interactive machines at a lower priority.
4) non-cva-qlogin2.q for interactive over-flow sessions from non-cva users. This queue will run on the @jobhosts machines at a lower priority.
5) non-cva-batch.q for batch jobs from non-CVA users. This queue will run on the @jobhosts machines at a lower priority.
We will create a userset called "cva-users" which will allow access to the cva queues. All users who are not members of this userset will be denied access to those queues. Additionally, users in the cva-users userset will be denied access to the non-cva queues to prevent overflow jobs from over-subscribing the cluster.
The non-cva queues will be configured to automatically suspend jobs when certain load limits are reached. The goal here is that the non-cva queues should be able to run as much as they want as long as no cva jobs are trying to run on the same node. Since the queues overlap resources they will be scheduled independently of each other so multiple jobs will be scheduled on each node. The non-cva queues will use the load averages and the swap space usage to determine if they should suspend jobs. This will effectively cause non-cva jobs to be suspended as soon as cva jobs start using those nodes.
For the interactive non-cva queues we don't want to suspend jobs immediately if the CPU usage goes up, since most interactive sessions are bursty by nature and won't use much CPU most of the time. Additionally, it's really obnoxious to suspend an interactive session since it causes the user's terminal to freeze.
Note: To see what the various options for suspending jobs are you can go to the Cluster Queues Instances window and click on a machine and click Load. It will then show you all the values for that machine at the current time.
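As a sketch of the decision SGE makes from these thresholds (SGE applies them itself once they are configured in qmon; the sensor values below are made up, the limits match the non-cva-qlogin1.q settings described later):

```shell
# Sample sensor values for a node:
load_avg=3.8      # load average
swap_used=1200    # swap in use, MB
# Suspend thresholds:
load_limit=3.5
swap_limit=1000
suspend=no
# awk handles the floating-point comparison plain sh cannot:
over_load=$(echo "$load_avg $load_limit" | awk '{print ($1 > $2) ? 1 : 0}')
if [ "$over_load" -eq 1 ] || [ "$swap_used" -gt "$swap_limit" ]; then
    suspend=yes
fi
echo "$suspend"
```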
We will use the graphical program qmon to modify the cluster queue. Run it as root:
bagels% qmon
1) Create the host lists:
Under "Host Configuration":
Create a "Host Group" for interactive jobs.
Call it @interactive.
Manually add compute-0-0.local to compute-0-5.local.
Create a "Host Group" for jobs.
Call it "@jobhosts".
Manually add all the other computes.
2) Create the userset:
Under "User Configuration":
Create a userset "cva-users" and add all the users from the group to it. We cannot use the unix group "@cva" because the grid engine only looks at the primary group for the user, which is always their own private group. The addBagelsUser script will automatically add new users to this userset if they are added to the cva group.
3) Create the queues:
Under "Queue Control" configure the following queues:
cva-qlogin.q:
Hostlist: @interactive
sequence number: 0
slots: 3
interactive only
temp directory: /state/partition1
user access: allow only cva-users
(Note: this allows 3 interactive sessions per 2-CPU node. Combined with the non-cva-qlogin1.q there can be up to 4 interactive sessions per 2-CPU node. In general interactive sessions will have bursty CPU requirements so this should not be a problem.)
cva-batch.q:
Hostlist: @jobhosts
sequence number: 1
slots: 2
interactive and batch
user access: allow only cva-users
non-cva-qlogin1.q:
Hostlist: @interactive
sequence number: 3
slots: 1
nice: 5
interactive only
user access: deny only cva-users
load/suspend threshold: suspend threshold
swap_used = 1000M
load_avg = 3.5
suspend interval: 00:05:00
jobs suspended per interval: 1
(We want to be quite generous with interactive jobs on the interactive machines and only suspend them if they are using a lot of the CPU when others want the CPU.)
non-cva-qlogin2.q:
Hostlist: @jobhosts
sequence number: 4
slots: 1
nice: 5
interactive only
user access: deny only cva-users
load/suspend threshold: suspend threshold
swap_used = 500M
load_avg = 2.5
suspend interval: 00:05:00
jobs suspended per interval: 1
(We want to be reasonably generous with interactive jobs on the job hosts and only suspend them if they are using a lot of the CPU when others want the CPU.)
non-cva-batch.q:
Hostlist: @jobhosts
sequence number: 5
slots: 2
nice: 1
batch only
user access: deny only cva-users
load threshold: np_load_avg = 0.8 // Prevents more jobs from being scheduled here if it is reasonably busy to start with
These jobs will be suspended by the suspend_low_priority_jobs.py script so we don't set suspend thresholds.
4) Configure the scheduler to prioritize queues by sequence number
Under Scheduler Configuration:
Change it to sort by sequence number so jobs will go to the queues in the order we have specified rather than going directly to the one with the lowest load.
Change the schedule interval and the reprioritize interval to 15 seconds.
5) Configure the scheduler to allow job submission from all nodes
Under Host Configuration:
Go to the Submit Host list and add the nodes you want to allow to submit. In this case add all the nodes in the cluster.
6) Install the suspend_low_priority_jobs.py script to run in cron every 2 minutes to suspend low-priority jobs on heavily used nodes.
Note: we may need to add, under the Scheduler Configuration, a Load Adjustment parameter of 0.5 for np_load_avg. This immediately bumps the reported load when a new job is scheduled so we don't have to wait for the real load to rise over time.
The way we want things scheduled is with a share ticket policy. This would give all users the same relative portion of the jobs. To do this we need to edit the policy and either use the functional shares and give each user the same (arbitrary) number of shares or create a share tree with all users in it and give them all the same number of tickets. However, this won't suspend jobs, so long jobs will still clobber other jobs. The problem is that we need to do an entry for every user who submits jobs! There is a way to automatically assign a default number of tickets to each user which we might consider.
To enable qmake (a cluster make command) we need to replace rsh with ssh. To do this run qmon and in the "Cluster Configuration" window for "global" click Modify, then under Advanced add "/usr/bin/ssh" as the "rsh_command."
##############################
qlogin cleanup script
##############################
qlogin sessions do not seem to go away on their own unless the user exits cleanly. Since this fills up the queue after a while, I wrote a script which counts how many of a user's qlogin processes on bagels have no known tty and compares this to the number of sessions that user has in the queue. If the counts match for a given compute node then all of the sessions are zombied and we can kill them. If the counts differ, at least one session may still be valid, so we can't.
This is in the script /home/shared/system/bin/utilities/qlogin_cleanup.py.
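A minimal sketch of that decision, with illustrative names (the real logic lives in qlogin_cleanup.py):

```python
def all_sessions_zombied(tty_less_qlogins, queued_sessions):
    """Decide whether a user's qlogin sessions on one compute node
    are all zombied and therefore safe to kill.

    tty_less_qlogins: the user's qlogin processes on bagels with no
                      known tty for this node.
    queued_sessions:  the user's qlogin sessions in the queue for the
                      same node.

    Only when the counts match are all sessions certainly dead; if
    they differ, at least one may still be valid, so kill nothing.
    """
    return tty_less_qlogins > 0 and tty_less_qlogins == queued_sessions
```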
To install this as a cron job on bagels:
bagels% crontab -e
And add the following line to have it run every day at 6.30 in the morning.
30 06 * * * /home/shared/system/bin/utilities/qlogin_cleanup.py
Check that it is there with:
bagels% crontab -l
##############################
qlogin abuse and the single_qlogin and qlogin_logout_check scripts
##############################
People tend to queue up multiple qlogin sessions to get multiple terminals open. The problem with this is that it fills up lots of slots in the queue. One solution is to increase the interactive nodes to have 4 slots per node, but this is really not what we want. Instead the single_qlogin.py script checks if the user already has a qlogin session and if so just ssh's into that node. If the user does not then this script will run qlogin. Running the script with "-new" will force a new session.
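The reuse-or-new decision can be sketched like this, assuming the user's existing qlogin nodes have already been parsed out of the queue (function and parameter names are illustrative):

```python
def choose_login_command(users_qlogin_nodes, force_new=False):
    """Pick the command single_qlogin-style logic would run.

    users_qlogin_nodes: compute nodes where the user already holds a
    qlogin slot. Reuse the first one over ssh unless -new was given,
    so extra terminals don't consume extra queue slots.
    """
    if users_qlogin_nodes and not force_new:
        return ["ssh", users_qlogin_nodes[0]]
    return ["qlogin"]
```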
To get this script to run we need to make sure that it is in the path as "qlogin" ahead of the existing script.
To do that add a profile file bagels%/etc/profile.d/zz-bagels.sh and export the paths we want in it:
export PATH=/home/shared/system/bin:/home/shared/system/bin/utilities:$PATH
Do the same for csh users in the file bagels%/etc/profile.d/zz-bagels.csh:
setenv PATH "/home/shared/system/bin:/home/shared/system/bin/utilities:${PATH}"
Now add these files to the 411 system so they will be propagated throughout the cluster. In bagels%/var/411/Files.mk add:
# Add the cluster-specific shell initializations
FILES += /etc/profile.d/zz-bagels.csh
FILES += /etc/profile.d/zz-bagels.sh
Then execute:
make clean
make
Then create a link to the script as qlogin in the utilities directory:
bagels% ln -s single_qlogin.py qlogin
Now when users log in, /home/shared/system/bin and /home/shared/system/bin/utilities will be at the beginning of their path and they will execute the single_qlogin.py script instead of the default qlogin.
So any additional runs of qlogin will just ssh into the node the user has reserved. This is fine, but if they exit from that qlogin session they may keep their non-qlogin sessions going and use a node without having it queued. To mitigate this we modify the command that gets run when qlogin is really executed to check for leftover processes on the compute node by running the qlogin_logout_check script.
Edit: bagels%/opt/gridengine/bin/rocks-qlogin.sh
add the following at the end:
/home/shared/system/bin/utilities/qlogin_logout_check.py $HOST $PORT
##############################
kill_non_qlogin script
##############################
Users may end up with jobs that are not attached to queue reservations. This may be caused by the single_qlogin script generating ssh sessions which are not terminated when the user terminates the qlogin session.
To resolve this the kill_non_queued_processes.py script will kill any processes owned by users on a node if the user does not have a job scheduled on that node. Users who are not in the cva group will have their processes niced.
However, to be gentler, this script can be run with the "warning" option, which sends emails warning users that their jobs will be killed instead of actually killing them.
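The per-process decision amounts to the following sketch (function and argument names are illustrative, and this is one reading of the policy described above):

```python
def stray_process_action(owner, scheduled_users, cva_users, warn_only):
    """What to do with a process found on a node.

    - Owners with a job scheduled on this node are left alone.
    - In "warning" mode, stray processes trigger an email instead of
      being touched.
    - Otherwise cva users' stray processes are killed and non-cva
      users' stray processes are reniced.
    """
    if owner in scheduled_users:
        return "leave"
    if warn_only:
        return "warn"
    return "kill" if owner in cva_users else "renice"
```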
To make this work we need to add two cron jobs to the head node, one to run:
cluster-fork /home/shared/system/bin/utilities/kill_non_queued_processes.py warning
and one to run:
cluster-fork /home/shared/system/bin/utilities/kill_non_queued_processes.py
Obviously the first one should happen before the second one so the users have time to clean things up. We'll send a warning at 19.00 and kill the processes at 23.00.
On the head node add:
crontab -e
0 19 * * * /opt/rocks/bin/cluster-fork /home/shared/system/bin/utilities/kill_non_queued_processes.py warning
0 23 * * * /opt/rocks/bin/cluster-fork /home/shared/system/bin/utilities/kill_non_queued_processes.py
##############################
watchHeadNodeUsers script
##############################
To make sure users don't run jobs on the head node, the script watchHeadNodeUsers.py is run every 2 minutes by cron on the head node. This script checks for users who are using too much CPU or memory and sends them a warning if they keep doing it for more than 4 runs (8 minutes). It then sends them another warning every 2 hours if they continue using too much memory or CPU time.
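The warning cadence can be sketched as follows (the constants are derived from the 2-minute cron interval; the names are illustrative, not the script's):

```python
RUNS_BEFORE_WARNING = 4     # 4 runs x 2 minutes = 8 minutes
RUNS_BETWEEN_WARNINGS = 60  # 60 runs x 2 minutes = 2 hours

def should_warn(consecutive_over_limit_runs):
    """True on runs where a warning email should go out: first once
    the user has been over the CPU/memory limits for 4 consecutive
    runs, then every 2 hours for as long as they stay over."""
    n = consecutive_over_limit_runs
    if n < RUNS_BEFORE_WARNING:
        return False
    return (n - RUNS_BEFORE_WARNING) % RUNS_BETWEEN_WARNINGS == 0
```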
To run the script add it to cron:
crontab -e
# Monitor head node usage every 2 minutes
*/2 * * * * /home/shared/system/bin/utilities/watchHeadNodeUsers.py
##############################
check_for_old_processes script
##############################
This script checks for user processes running on the head node that are old and kills them. What it does is as follows:
1) Look for a .processes_to_kill file in the user's home directory and kill all processes listed in that file
2) Find any processes that are between minAge days and maxAge days old and write them to a new .processes_to_kill file.
3) Send an email to the user informing him/her that these processes will be killed.
This script is added to cron to run every week and target processes between 8 and 9 weeks old. That is, if a user removes the .processes_to_kill file the processes will then be more than 9 weeks old and will never be targeted again.
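The age-window selection, matching the `check_for_old_processes.py 56 63` cron invocation, can be sketched as:

```python
def processes_to_target(process_ages_days, min_age=56, max_age=63):
    """Keep only processes inside the [min_age, max_age] day window
    (56-63 days = 8-9 weeks). A process that ages past max_age, e.g.
    because the user deleted .processes_to_kill last week, falls out
    of the window and is never targeted again."""
    return [age for age in process_ages_days
            if min_age <= age <= max_age]
```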
On the head node:
crontab -e
# Kill old processes on the head node and generate warnings about new ones every week.
# Note that we only deal with processes between 8 and 9 weeks old.
00 00 */7 * * /home/shared/system/bin/utilities/check_for_old_processes.py 56 63
##############################
disk usage report generator
##############################
The disk_space_report.py script generates disk usage reports for both the administrator and the users. This script should be run on bagels once a month by cron:
To install this as a cron job on bagels:
bagels% crontab -e
And add the following line to have it run once a month at 3.30 in the morning.
30 03 1 * * /bin/nice /home/shared/system/bin/utilities/disk_space_report.py
Check that it is there with:
bagels% crontab -l
##############################
disk usage tracking
##############################
The disk_quota_delta_report.py script runs every night and updates a CSV file (disk_quota_delta_report.csv) with the disk usage for all users. This file can be graphed in Excel very easily to see disk usage trends.
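A sketch of the nightly CSV update, assuming a layout of one date column plus one column per user (the real file is disk_quota_delta_report.csv; the layout and function name here are assumptions):

```python
import csv
import datetime
import io

def append_usage_row(csv_text, usage_by_user):
    """Append today's per-user disk usage as a new row.

    Each row is the date followed by one usage figure per user, in
    the column order given by the existing header, so Excel can chart
    the columns directly to show usage trends."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]  # e.g. ["date", "alice", "bob"]
    today = datetime.date.today().isoformat()
    new_row = [today] + [str(usage_by_user.get(u, "")) for u in header[1:]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows + [new_row])
    return out.getvalue()
```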
##############################
temporary files cleanup script
##############################
Users put temporary files in the local node scratch space (/state/partition1) and in the local temp space (/tmp -- although they shouldn't put anything here).
Typically they don't clean up after themselves so the cleanup_temp_files.py script will do this. It goes through all the files in the temporary directories and deletes them if they are older than "deleteAgeHours" and the user is not logged in. This script should be run by the run_cleanup_temp_files.sh script which executes this on the head node and all the cluster nodes.
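The per-file test can be sketched as follows (the 24-hour default is an illustrative placeholder; the script's deleteAgeHours setting is authoritative):

```python
def should_delete(age_hours, owner, logged_in_users,
                  delete_age_hours=24):
    """Remove a temp file only when it is older than deleteAgeHours
    AND its owner has no live login on the node, so an active user's
    scratch data is left alone."""
    return age_hours > delete_age_hours and owner not in logged_in_users
```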
Add this to cron on the head node to check every day.
bagels% crontab -e
Add:
30 01 * * * /home/shared/system/bin/utilities/run_cleanup_temp_files.sh
##############################
Snapshot sanity check script
##############################
The backup_sanity_checker.py script writes files to a list of places in the file system with today's date and verifies that these files exist in the same places in yesterday's snapshot with yesterday's date.
If this is not the case an email is sent to the cluster-admin and the backup-admin indicating that the snapshots have failed.
This script should be added to cron to run daily after the daily snapshot:
# Verify that the snapshots are working. This sends emails if they are not.
00 05 * * * /home/shared/system/bin/utilities/backup_sanity_checker.py
##############################
Backup and web server status alert messages
##############################
The script in /home/shared/system/bin/utilities/check_server_status.py should be run every day on the head node. This script sends email warnings if either cva.stanford.edu (the web/email server) or cream-cheese.stanford.edu (the backup server) goes down.
This script by default logs into the machines and checks that they are up and the load is low. The script logs in as the "serverCheckUser", so we need to create such a user on each machine and put root's public rsa key into its authorized_keys file so root can log in as that user to execute the command. (Remember this will be a cron job running on the head node, so the serverCheckUser needs to have root@bagels as an authorized ssh key for passwordless login.)
The "servers" variable in the script contains a mapping of servers to email addresses and the "alwaysCCAddress" will always be CC'd on warnings. These should be set to appropriately responsible individual email addresses. Remember that if cva is down sending mail to a cva address doesn't really help.
The "emergencyCommand" variable contains an optional command that should be executed when the host is down. Currently this is configured so that if the backup server (cream-cheese) is non-responsive it will execute an "rsnapshot emergency" on the storage node. This will snapshot the current home directories, the head node, and the file server to the storage server. Since this runs once a day at midnight it will keep up to 7 snapshots of this data around.
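The configuration shape and down-host handling can be sketched as follows (the names "servers", "alwaysCCAddress", and "emergencyCommand" come from the script; the addresses and command strings below are placeholders):

```python
# Placeholder values; the real ones live in check_server_status.py.
servers = {
    "cva.stanford.edu":          "web-admin@example.edu",
    "cream-cheese.stanford.edu": "backup-admin@example.edu",
}
alwaysCCAddress = "cluster-admin@example.edu"
emergencyCommand = {
    # If the backup server is down, take an emergency snapshot on the
    # storage node so we don't lose backup coverage entirely.
    "cream-cheese.stanford.edu": "rsnapshot emergency",
}

def handle_down_host(host):
    """Return (warning recipients, optional emergency command) for a
    host that failed its up/load check."""
    recipients = [servers[host], alwaysCCAddress]
    return recipients, emergencyCommand.get(host)
```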
On both cream-cheese and cva create the user with no password so nobody can log in as it directly. Make the UID high so it won't collide with any real user.
% useradd -c "Check Server Status Script User" -u 10000 serverCheckUser
Then create a .ssh/authorized_keys file in the user's directory and add root's public key from bagels. Make sure the authorized_keys file is readable and writable only by the owner and the .ssh directory is accessible only by the owner.
Check that it works by executing:
bagels% ssh serverCheckUser@cva
To install it add it to the crontab on bagels
bagels% export EDITOR=emacs
bagels% crontab -e
0 0 * * * /home/shared/system/bin/utilities/check_server_status.py
##############################
rsnapshot for emergency local snapshots
##############################
This is disabled as of April 2006. It takes up a lot of space and uses up users' quotas.
The check_server_status.py script will execute "rsnapshot emergency" on the storage server if the backup server goes down. This will be executed once a day at midnight. We need to install and configure rsnapshot to do a local snapshot when this happens. We want to snapshot the head node, the web server, and the user directories.
To do so we need to install rsnapshot. Download the rpm and execute:
nas-0-0% rpm -ihv rsnapshot
Configure it: (Remember there are TABS between each item)
# Set the root to our snapshots directory
snapshot_root /state/partition1/emergency_snapshots/
# Uncomment the ssh command
cmd_ssh /usr/bin/ssh
# 7 snapshots
interval emergency 7
# Only generate reports on errors.
#This should prevent root from getting email when a file disappears during a snapshot.
verbose 1
# Exclude anything with "nobackup" anywhere in the path
exclude **nobackup**
# Exclude the cad directories mounted from dryer
exclude /cad*
exclude /share*
# Exclude our own backup mount point
exclude /home/backup*
# Exclude the backup directory on cream-cheese
exclude /backup*
exclude /snapshots*
# Exclude special directories
exclude /dev*
exclude /mnt*
exclude /proc*
exclude /media*
exclude /sys*
exclude /tmp*
# User home directories
backup /state/partition1/home/ nas-0-0/
# Head node
backup root@bagels:/ bagels/
# Web server
backup root@171.64.72.176:/ cva/
Test the configuration:
nas-0-0% rsnapshot configtest
Create the emergency snapshot directory and make it readable, writable, and executable only by root:
nas-0-0% mkdir /state/partition1/emergency_snapshots
nas-0-0% chmod 700 /state/partition1/emergency_snapshots
##############################
Replacing Node Disks
##############################
If one of the hard disks is failing in a node the root user should receive SMARTd notification emails ahead of time. Replacing the disk requires removing the old disk and re-installing the node.
If the disk is unformatted this should just happen automatically when the node is booted up with the new disk.
If the disk has data on it you may run into problems.
First remove the partition information from the Rocks database:
bagels% rocks-partition --list --delete --nodename compute-2-4
If the disk was used for another node you should remove the /.rocks-release file on the disk too.
Then re-install the node:
bagels% ssh compute-2-4 '/boot/kickstart/cluster-kickstart'
If that works then you're set.
If it stops with an error that there are no valid partitions to initialize, then you have to manually create a partition. To do this go to the install shell (alt-F3? alt-F2?) and use fdisk to delete any old partitions and create one new one.
The fdisk program is located in /mnt/runtime/usr/sbin/fdisk.
Once you have done that restart and the installation should work as expected.
##############################
Fixing Dead Nodes
##############################
If a node crashes just restart it and it should re-install itself.
To force a node that is crashed to re-install restart it and press F11 or F12 (I don't remember which) while it's booting up. This will prompt you to select the boot disk. Choose the last ethernet card and it will do a network boot and re-install.
If the re-install fails at some point go to the install shell (alt-F3 or alt-F4 -- you should get a # prompt) then go to /mnt/runtime/usr/sbin/ and run fdisk. Run fdisk /dev/sda and then delete all the partitions. Restart the machine and it should install. If it still doesn't then you need to delete the machine's partition map from the rocks database by executing:
bagels% rocks-partition --list --delete --nodename compute-x-y
And try it again.
##############################
Changing user UIDs
##############################
This is what the change_UID_to_lealand_UID.py script does:
Use ldap to find the right Stanford UID:
/usr/bin/ldapsearch -x -h ldap.stanford.edu -b "cn=accounts,dc=stanford,dc=edu" uid=plegresl
/usr/sbin/usermod -u NEWID username
Change the group ID
/usr/sbin/groupmod -g NEWID username
Then assign the user to that group:
/usr/sbin/usermod -g NEWGID username
Now we just need to change all files in the whole file system with the old UID to the new one and with the old GID to the new GID.
To do this we execute:
find /home/ -gid OLDGID -exec chgrp username "{}" \;
find /home/ -uid OLDUID -exec chown username "{}" \;
##############################
Forcing a 411 Update
##############################
Sometimes
bagels% make -C /var/411
doesn't get all the nodes to update as it should. You can force them to all update by executing:
bagels% cluster-fork 411get --all
##############################
OUT-OF-DATE: Checking password strength:
##############################
30-April-2007: This is no longer needed with one time passwords.
The program "John the Ripper" should be run to check the strength of the passwords for users on the system. This takes a long time to run (several days) and checks passwords against a large multi-lingual dictionary with various replacement patterns.
A script, /home/shared/system/bin/utilities/john-password-cracker/submit_password_job.sh will submit a password cracking job to the cluster queue. This is added to cron to run once a month:
crontab -e
# Try to crack user passwords once a month
00 01 1 * * /home/shared/system/bin/utilities/john-password-cracker/john/submit_password_job.sh
Install the current version of John the Ripper in /home/shared/system/bin/utilities/.
Download the large dictionary which includes multiple languages, "all.lst" which can be found at http://www.openwall.com/wordlists/.
To run it you need to first un-shadow the /etc/passwd file. This creates a sensitive file that should not be left readable by anyone except root. To do this you can use the included unshadow utility:
% unshadow /etc/passwd /etc/shadow > mypasswd
Make sure you set that file to be root access only:
% chmod 600 mypasswd
Then you run the john program on that file using a small dictionary and word-mangling rules. This can take several days.
% john --wordlist=password.lst --rules mypasswd
It will process all the valid accounts and try to find passwords that work with those accounts. The results are stored in the john.pot file which can be read by running
% john --show mypasswd
The state of the process is stored every 10 minutes by default in the john.rec file. If the process is stopped it can be restarted where it last checkpointed by running
% john --restore
To see the current status send the process a SIGHUP signal and then use the show command
% kill -SIGHUP pid-of-process
If no passwords are broken that way you can try the larger all.lst wordlist. You can also increase the length that is checked by editing the john.conf file by adding:
MaxLen = 12
We should have a policy where users have to change their password every 12 months and their accounts are disabled if they do not. We're not concerned about them re-using a password, but we want to make sure they pick good passwords.
The addBagelsUser.py script automatically sets accounts to expire in 1 year. For other accounts you must run the "chage" command to manually change their expiration dates.
In /etc/pam.d/system-auth we want to change pam_cracklib.so to enforce the length requirement. This requires that the new password have a minimum length of 10, but if users use one uppercase letter, one digit, and one other character they can have a password as short as 10-3=7 characters. The difok=0 means that users can re-use their old password identically. (Remember, the only reason we are expiring passwords is to disable unused accounts.)
password requisite /lib/security/$ISA/pam_cracklib.so retry=3 minlen=10 difok=0 lcredit=0 ucredit=1 dcredit=1 ocredit=1
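The credit arithmetic behind the "as short as 10-3=7" claim can be sketched as follows (a simplified model of pam_cracklib's credits, not its full algorithm):

```python
def effective_min_length(minlen, lcredit=0, ucredit=1,
                         dcredit=1, ocredit=1):
    """Simplified model: each character class with a positive credit
    contributes up to that many length credits when used, so a
    password containing one uppercase letter, one digit, and one
    other character needs minlen minus the summed credits literal
    characters."""
    return minlen - (lcredit + ucredit + dcredit + ocredit)
```

With a minimum length of 10 and one credit each for uppercase, digit, and other characters, this gives 10 - 3 = 7.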
##############################
OUT OF DATE: rsnapshot for snapshotting locally
##############################
NOTE: This has performance problems. Instead we should put the snapshots on the backup server. This information is no longer used as of 2/24/06. (davidbbs)
Basically we want to install this to backup to /backup/.snapshots/ and then we want to export that as read-only for the users via NFS. Download it into the "bin" user's directory.
Download it from www.rsnapshot.org and build it, OR
nas-0-0% ./configure --sysconfdir=/etc
nas-0-0% su
nas-0-0% make install
nas-0-0% cp /etc/rsnapshot.conf.default /etc/rsnapshot.conf
download the RPM and install it:
nas-0-0% rpm -i rsnapshot-...
Edit the nas-0-0:/etc/rsnapshot.conf file. (Remember you need TABS between things!) I changed:
snapshot_root /state/partition1/backup/snapshots/
# define the ssh command for the head-node backup
# Uncomment this to enable remote ssh backups over rsync.
cmd_ssh /usr/bin/ssh
# 4 hourly ones
interval hourly 4
# 3 monthly ones
interval monthly 3
logfile /var/log/rsnapshot
# Ignore all files that have "nobackup" somewhere in their name.
exclude **nobackup**
# backup directly into the backup directory (which is /backup/snapshots/)
backup /state/partition1/home/ nas-0-0/
backup /etc/ nas-0-0/
backup /usr/local/ nas-0-0/
# backup /etc/ on the head node
#backup root@bagels.local:/etc/ bagels/
#backup root@baegls.local:/usr/local/ bagels/
#backup root@bagels.local:/opt/ bagels/
#backup root@bagels.local:/var/ bagels/
# Backup the headnode bagels.local.
# Note we can't just do the whole thing since it mounts the
# shared directory as well and we'd rather not pull that all
# over the network back and forth. We also don't want to backup
# the proc/ filesystem.
backup root@bagels.local:/bin/ bagels/
backup root@bagels.local:/boot/ bagels/
backup root@bagels.local:/etc/ bagels/
backup root@bagels.local:/initrd/ bagels/
backup root@bagels.local:/lib/ bagels/
backup root@bagels.local:/lib64/ bagels/
backup root@bagels.local:/misc/ bagels/
backup root@bagels.local:/opt/ bagels/
backup root@bagels.local:/root/ bagels/
backup root@bagels.local:/sbin/ bagels/
backup root@bagels.local:/selinux/ bagels/
backup root@bagels.local:/srv/ bagels/
backup root@bagels.local:/state/ bagels/
backup root@bagels.local:/tftpboot/ bagels/
backup root@bagels.local:/usr/ bagels/
# backup the group file server
backup root@cva.stanford.edu:/etc/ cva/
backup root@cva.stanford.edu:/var/ cva/
NOTE: for rsnapshot to backup cva the public key for root@nas-0-0 must be copied to the authorized_keys on cva.
Check the file with:
nas-0-0:% rsnapshot configtest
Now we want to configure it so users can't mess with them. To do this we'll follow the instructions to change the permissions so users can't get at /backup and have to get to it through the NFS mount.
nas-0-0% mkdir /state/partition1/backup/snapshots
nas-0-0% chmod 0700 /state/partition1/backup/
nas-0-0% chmod 0755 /state/partition1/backup/snapshots/
add to nas-0-0:/etc/exports:
/state/partition1/backup/snapshots/ 10.0.0.0/255.0.0.0(ro,no_root_squash)
restart nfs:
nas-0-0% exportfs -r
nas-0-0% exportfs -a
verify it with:
nas-0-0% exportfs
you should see the export listed:
/state/partition1/backup/snapshots
10.0.0.0/255.0.0.0
Before we add the job to cron so it will actually run we need to set up a password-less login to the head node so rsnapshot will run without typing in a password. Do this as root:
nas-0-0% ssh-keygen -t rsa
(accept the default location and leave the password blank)
Now we want to add the generated key on the storage server (id_rsa.pub) to the .ssh/authorized_keys in root's home directory on the head node. Log into the head node as root, go into .ssh/authorized_keys and add the contents of the file to the end.
Test that it works by executing:
nas-0-0% ssh bagels.local
And verify that you don't need to type in your password.
Now test the rsnapshot configuration by running it:
nas-0-0% rsnapshot hourly
You should see the correct hourly.0 directories created in the backup location.
Add the backup jobs to cron:
nas-0-0% export EDITOR=nano
nas-0-0% crontab -e
00 00,06,12,18 * * * /bin/nice /usr/bin/rsnapshot hourly
00 03 * * * /bin/nice /usr/bin/rsnapshot daily
20 03 7,14,21,28 * * /bin/nice /usr/bin/rsnapshot weekly
40 03 1 * * /bin/nice /usr/bin/rsnapshot monthly
This will do an hourly snapshot at 00:00, 06:00, 12:00, and 18:00; a daily snapshot at 03:00, a weekly snapshot on the 7th, 14th, 21st, and 28th at 03:20, and a monthly snapshot at 03:40 on the 1st.
Note that the daily, weekly, and monthly runs are staggered from the hourly runs and from each other to prevent race conditions.
Check that it's there with:
nas-0-0% crontab -l
Now we need to mount that exported file system on all the nodes. Adding it to auto.home makes the most sense since it will be preserved across cluster upgrades. It's possible that this should be put in /etc/fstab instead so the directory isn't dynamically created, but I'm not sure.
edit bagels.stanford.edu:/etc/auto.home add:
# Mount the snapshots directory
backup nas-0-0.local:/state/partition1/backup/snapshots
Restart autofs:
bagels% service autofs reload
and check the status:
bagels% service autofs status
you should see the mount only if you've gone into the /backup directory.
Now we need to update all the compute nodes with this new information:
bagels% make -C /var/411
bagels% service autofs restart
This tells the 411 service to update all the auto.master and auto.misc files on the whole cluster. However, they may not restart, so you can force all the nodes to reload them with
bagels% cluster-fork service autofs reload
The autofs restart may fail. I don't know if this matters.
To be nice to people we should put in a link so they don't have to dig through the directory to get to their backup.
nas-0-0% cd /state/partition1/backup/snapshots/
nas-0-0% ln -s hourly.0/nas-0-0/state/partition1/home last_home_backup
##############################
OUT OF DATE: install workstation utilities:
##############################
From the file comps.xml in /home/install/rocks-dist/lan/*/RedHat/base/comps.xml you can find a list of all meta packages we can install. We want to include the development and workstation ones.
We need to make a new extend-compute.xml file which will include the "Development" and the "Workstation" meta packages.
To do this copy the skeleton.xml file to extend-compute.xml in /home/install/site-profiles/4.1/nodes.
Edit the file to have the following packages included.
Development Tools
Workstation Common
Check really carefully that there aren't any errors here. Forgetting a "/" will cause the node to reboot and not re-install. (You just have to fix it then manually reboot it in that case.)
Now re-build the cluster configuration:
Rebuild the distribution:
bagels% cd /home/install
bagels% rocks-dist dist
Try installing it on one node first. Log into compute-0-0:
bagels% ssh compute-0-0 /boot/kickstart/cluster-kickstart
The node will rebuild itself. When it's done see if it works. If it does re-install all of them
bagels% cluster-fork '/boot/kickstart/cluster-kickstart'
##############################
OUT OF DATE: Mounting dryer's cad directory:
##############################
Re-exporting NFS doesn't work, so to mount the dryer cad directory we will modify auto.share so that all the nodes mount it as needed through the NAT in the head node.
Modify bagels:/etc/auto.share to include:
# Mount the dryer cad directory. We use the numeric IP address to avoid DNS spoofing attacks.
cad -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad
Propagate the changes:
bagels% make -C /var/411
Now each of the nodes should see:
/share/cad/
With the appropriate files, although it may take a while to mount it as it has to go through the NAT.
Note that this is problematic because that file system is set up with lots of hard links which only work if it is mounted in a particular place. We should mount it at /cad, which involves changing all the fstab files on all the nodes.
Another mounting approach to get it in /cad is to set up an automount for /cad and manually add each of the sub-directories to it. This will not allow us access to plain files in /cad, but it will allow access to any sub-directory. To do this we need to modify auto.master on the head node to point /cad to auto.cad and then add an auto.cad file with the appropriate mounts for each tool. After that we need to tell 411 to propagate auto.cad and then do it. (Note that 411 automatically propagates any /etc/auto.* file so we don't need to do anything special to make it propagate.)
On bagels, change /etc/auto.master and add:
/cad /etc/auto.cad --ghost --timeout=1200
In the /etc/auto.cad file add one mount for every directory in the /cad directory. Note that this means that files (such as /cad/SITEID.txt) will not be available in this directory. To get them you will need to go to /share/cad.
# Mount the dryer cad directory. We use the numeric IP address to avoid DNS spoofing attacks.
altera -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/altera
Archive_Logs -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/Archive_Logs
automodel -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/automodel
avant -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/avant
avanti -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/avanti
cadence -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/cadence
cascade -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/cascade
hyperplot -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/hyperplot
lager -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/lager
lisatek -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/lisatek
local -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/local
magma -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/magma
mathworks -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/mathworks
mentor -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/mentor
mmi -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/mmi
nassda -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/nassda
novas -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/novas
octtool -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/octtools
papers -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/papers
quad -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/quad
snaketech -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/snaketech
starsim -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/starsim
synopsys -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/synopsys
tensilica -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/tensilica
tsmc -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/tsmc
xilinx -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/xilinx
xpedion -fstype=nfs,ro,exec,nosuid,soft,rsize=8192 172.24.72.150:/vol/vol1/cad/xpedion
Now reload autofs:
bagels% /sbin/service autofs reload
Make sure the directory works appropriately on bagels.
Then propagate the changes:
bagels% make -C /var/411 clean
bagels% make -C /var/411
Now we need to tell all the nodes to reload their autofs. To do this we will use cluster-fork, but we should first unset the DISPLAY variable so it doesn't try to forward our X session for each node.
bagels% export DISPLAY=""
bagels% cluster-fork "/sbin/service autofs reload"
Check that it's all set on the compute nodes and then you're done.