Cluster Security Hints for an Academic Environment, or why SSH keys are not safe

ssh keys are only as secure as the machine on which they are stored

How many of your users are using poorly administered or third-party Windows machines? And you let them use ssh keys!?

Preface

Here are some collected notes on my experiences securing my own cluster and watching other people's machines get hacked over the period of a few years in the CVA Group of the Computer System Lab at Stanford University. The cluster in question consists of 39 dual Opteron blades, a head node, a storage server, a backup server, and a web/email server.

Requirements

This cluster was purchased to replace an aging 12-way Sun box. The Sun machine served as the NFS/NIS server for a bunch of Sun machines. The cluster needed to allow users to submit batch jobs and work interactively, as well as integrate group and personal email and web services.

The easiest thing to do (and it was pretty easy) was to install ROCKS, which is a fully functional cluster installation package, including tripwire! Once that was up and running I needed to figure out how to harden the cluster and provide public web/email services.

Logging

Disk space is cheap and logs are invaluable for determining what happened, or, more importantly, if something happened at all. In addition to the basic logging on the system, I enable process accounting on the head node so that I can see what is being run, by whom, and when. I also use logwatch to receive emails every day with basic system status information, and I have my own cluster-status script that outputs the disk usage, quota usage, host status, queue status, and user logins and emails me that information every night. I read these two emails and the tripwire email (see below) every morning to get a quick feel for what's going on with the cluster.
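As a rough illustration of the nightly status email idea, here is a minimal Python sketch. It is not the cluster-status script described above: it only covers disk usage, and the function names, the subject line, and the assumption of a mail server on localhost are all made up for the example.

```python
import shutil
import smtplib
from email.message import EmailMessage


def disk_usage_line(path):
    """Return one human-readable line of disk usage for a mount point."""
    usage = shutil.disk_usage(path)
    percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    gib = 1024 ** 3
    return (f"{path}: {usage.used / gib:.1f} GiB used of "
            f"{usage.total / gib:.1f} GiB ({percent:.0f}%)")


def build_report(mount_points):
    """Assemble the plain-text body of the status email."""
    lines = ["Cluster status report", ""]
    lines += [disk_usage_line(p) for p in mount_points]
    return "\n".join(lines)


def send_report(body, sender, recipient, smtp_host="localhost"):
    """Mail the report; assumes an SMTP server is listening on smtp_host."""
    msg = EmailMessage()
    msg["Subject"] = "Nightly cluster status"  # hypothetical subject line
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Print instead of mailing so the sketch can be tried without a mail server.
    print(build_report(["/"]))
```

Run from cron once a night, something like this gives you one short email to skim each morning. Quota, host, and queue status would have to be added by calling whatever tools your site uses.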

It is also important to make sure that your logs are securely backed up, and stored for long enough that you can go back and examine events in the past. We currently set logrotate to rotate logs every week and keep them for a year. As with everything else on our system (see below) the logs are snapshotted and backed up every day. This is somewhat useful because it means that a hacker would have to erase his or her traces from the logs before the next snapshot to avoid detection. (Of course our snapshots are only every 8 hours so this may not be of any real value.)

Hardening the cluster

There are a bunch of simple things that can be easily done to harden the cluster, but it is essential that you understand what they actually buy you. For example, ssh is great, but if you allow users to access it from random machines it becomes only as secure as those random machines.

Put the cluster on a private VLAN

This way the machine is not accessible from off-campus addresses. This has the effect of eliminating attacks from the internet as a whole. Unfortunately Stanford has many thousands of poorly managed (or not at all managed) machines on campus, so this really doesn't solve the problem.

VLANs only reduce your exposure; they are not enough by themselves.

Put up firewalls

This is obvious, but the only packets I allow into my cluster are ssh from within Stanford and smb from within a subset of the subnets in the computer science building. Nothing else is allowed in.

Firewalls do not help with compromised accounts on third-party machines.
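The inbound policy above boils down to a default-deny rule with two exceptions. Here is a small Python sketch of that decision logic, not actual firewall configuration. The address ranges are placeholder documentation networks rather than real Stanford subnets, and the choice of port 445 for smb is an assumption (older setups also use 139).

```python
import ipaddress

# Hypothetical placeholder ranges (RFC 5737 documentation networks),
# standing in for "all of campus" and "a few subnets in one building".
CAMPUS_NETS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("192.0.2.0/24"),
]
SMB_NETS = [ipaddress.ip_network("192.0.2.0/26")]

SSH_PORT = 22
SMB_PORT = 445  # assumption: SMB over TCP


def in_any(addr, nets):
    """True if the address falls inside at least one of the networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in nets)


def allow_inbound(src_addr, dst_port):
    """Default deny: only ssh from campus and smb from selected subnets."""
    if dst_port == SSH_PORT:
        return in_any(src_addr, CAMPUS_NETS)
    if dst_port == SMB_PORT:
        return in_any(src_addr, SMB_NETS)
    return False
```

For example, allow_inbound("198.51.100.7", 22) is True, while the same source asking for port 80 is refused. The point of writing it this way is that anything you forgot to think about falls through to the final "return False".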

Block all off-campus out-going traffic

The first thing most attackers do once they get onto the system is download a rootkit from some web server. By blocking all out-going traffic to non-Stanford IPs we can at least make this harder. This is mostly security through obscurity, but it will stop script kiddies in their tracks. Note that you have to make sure that the NAT is blocking out-going requests from the internal ethernet interface as well as out-going requests on the head node.

Auto-blacklist failed login attempts

pam_abl (auto black list) is configured to block access to any account or from any host with more than 10 failed login attempts in one hour, or 30 in one day. This prevents brute-force password guessing and allows me to not worry about weak passwords. (I couldn't care less if someone keeps their password written on a sticky note on their desk as anyone who can break into their office can also just break in and steal the whole cluster...if they have a suitable forklift to carry it.)

Auto-blacklisting login attempts reduces the risk from weak passwords.
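To make the blacklisting rule concrete, here is a Python sketch of the "more than 10 failures in an hour, or 30 in a day" logic. This is only an illustration of the counting rule, not how pam_abl is implemented or configured. The class and method names are invented for the example, and timestamps are plain seconds.

```python
from collections import defaultdict, deque

HOUR = 3600
DAY = 24 * HOUR

# (maximum tolerated failures, window in seconds), taken from the rule above.
RULES = [(10, HOUR), (30, DAY)]


class FailureTracker:
    """Track failed logins per key (a user name or a source host)."""

    def __init__(self, rules=RULES):
        self.rules = rules
        self.longest = max(window for _, window in rules)
        self.failures = defaultdict(deque)

    def record_failure(self, key, now):
        """Remember a failure and forget ones older than the longest window."""
        times = self.failures[key]
        times.append(now)
        while times and times[0] <= now - self.longest:
            times.popleft()

    def is_blocked(self, key, now):
        """Blocked if any rule's limit is exceeded within its window."""
        times = self.failures.get(key, ())
        for limit, window in self.rules:
            recent = sum(1 for t in times if t > now - window)
            if recent > limit:
                return True
        return False
```

You would keep one tracker keyed by account and another keyed by source host, since the rule applies to both. Note that the block lifts by itself once the old failures age out of the window.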

Use one-time passwords

SSH is great, but there is one incredibly weak link: passwords. Now I'm not at all concerned about people using easy-to-guess passwords (see above), but what I am concerned about are stolen passwords. This happens one of two ways (and I've seen both):

1) users type in their password to some third-party machine which has a keylogger, or

2) their own machine gets a keylogger.

Either way, what then happens (and let me repeat that I've seen this many times) is that the account they first used is compromised and any accounts they've connected to from there are then tried. Note that these machines are on the private VLAN, so merely "hiding" your machine is worthless against these attacks.

If the machine has a keylogger, then it is trivial for the hacker to get any private ssh keys on that machine. The same goes for the compromised account: all of its ssh keys can be trivially stolen as well. Here's the take-home message: ssh keys are only as secure as the machine on which they are stored. Don't trust the security of ssh keys or passwords unless you control all the machines from which they will be used. Most of my users store ssh keys on Windows laptops. This is the largest threat to my cluster.

There's only one way to deal with this, and that's one-time passwords. I know these are a pain in the ass, and my users complained bitterly until I shoved them down their throats. If you've got lots of money, the right thing to do is get RSA key fobs and use those. If you don't, then you can use pam_sotp to protect your cluster. The trick here is that you have to force users not to store their one-time passwords electronically on their machines. (If they do, you might as well use ssh keys, since the users will store the passwords in a file named "passwords.txt" which is easy for the person who installed the keylogger to find.) My setup is that when they generate new passwords, two copies of the list are printed out on a local printer. The users can easily generate more passwords when they need them, but they do not get the option to store them electronically.
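The basic mechanics of a printed one-time-password list can be sketched in a few lines of Python. This is not pam_sotp and says nothing about how it stores or checks passwords. The names below are invented, and the sketch assumes that any unused code on the list is accepted, which may not match your tool.

```python
import secrets


def generate_otp_list(count, digits=5):
    """Return `count` random numeric codes, zero-padded to `digits` digits."""
    return [f"{secrets.randbelow(10 ** digits):0{digits}d}" for _ in range(count)]


class OneTimePasswords:
    """Accept each code at most once (assumption: any unused code is valid)."""

    def __init__(self, codes):
        self._unused = set(codes)

    def verify(self, attempt):
        matched = None
        for code in self._unused:
            # Constant-time comparison of two ASCII strings.
            if secrets.compare_digest(code, attempt):
                matched = code
        if matched is None:
            return False
        self._unused.discard(matched)  # burn the code so it cannot be replayed
        return True
```

With store = OneTimePasswords(["12345", "54321"]), the first store.verify("12345") succeeds and the second fails, which is exactly the property that makes a keylogged password worthless afterwards. The generated list would go to the printer, not to a file in the user's home directory.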

So how does this help? Well, if the users now use a third-party machine, even if it has a keylogger the password is only valid once. If the hacker is really smart and realizes that it's a 5-digit one-time-password, after 10 attempts to hack it the account is blocked. So basically the human factor is gone.
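To put a number on the guessing attack, here is a small Python calculation. It assumes the codes are uniformly random, that the attacker gets 10 distinct guesses before the blacklist kicks in, and (by default) that only one code is valid at a time. None of those assumptions come from pam_sotp itself.

```python
from fractions import Fraction


def guess_success_probability(digits, attempts, valid_codes=1):
    """Chance that at least one of `attempts` distinct guesses is valid."""
    total = 10 ** digits
    miss = Fraction(1)
    for i in range(attempts):
        # Probability that guess i also misses, given all earlier ones missed.
        miss *= Fraction(total - valid_codes - i, total - i)
    return 1 - miss


if __name__ == "__main__":
    print(guess_success_probability(digits=5, attempts=10))  # prints 1/10000
```

So under these assumptions a blind guesser has about a 1 in 10,000 chance per lockout period. If many codes on the list are valid at once the odds get worse for you, which is one reason to keep the lists short.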

Unfortunately, as Bill Broadley pointed out, this is really just another level of security through obscurity. The hacker's job is a bit more complicated, but no less feasible. (They need to install a trojan ssh client on the user's machine such that when the user connects with a one-time password the ssh client installs a program in the user's account automatically and opens a back door.) So is this worth it? Well, given the number of successful attacks I've seen via reusable passwords and keyloggers I'd say yes, but if you want a better solution I'm not sure what to say.

One-time passwords eliminate the risk of accounts being compromised via keyloggers as long as ssh keys are disabled, but trojaned ssh clients may still get through.

Detecting intrusions

There is no wholly reliable way to tell if a machine has or hasn't been hacked. A good hacker will both replace all the system binaries and clean up his or her tracks so well you won't see anything without a central logging server or an external network traffic manager. (We happen to have both, luckily.) The best you can do is to use tripwire. Tripwire determines cryptographic signatures for various files in your file system and verifies that they have not changed. Of course this only works if someone both 1) updates the database any time something legitimately changes, and 2) regularly verifies that nothing has been changed. I get an email every day with the results of the last night's tripwire scan and if I see anything odd I know there's a problem. In addition, all our logs are snapshotted to our read-only backup server every 6 hours.

Without tripwire and constant human vigilance you will not detect even a half-competent intrusion.
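The idea behind this kind of integrity checking is simple enough to sketch in Python: hash every file, save the result as a baseline, and later compare a fresh scan against it. This is only an illustration of the principle, not a replacement for tripwire. The function names and the choice of SHA-256 are my own for the example.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """SHA-256 of a file, read in chunks so large files are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root):
    """Map each file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def compare(baseline, current):
    """Report files added, removed, or changed since the baseline."""
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "changed": sorted(p for p in baseline.keys() & current.keys()
                          if baseline[p] != current[p]),
    }
```

The comparison is only trustworthy if the baseline and the checking program live somewhere the attacker cannot modify, and only useful if a human actually reads the report.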

Providing hardened email/web services

The requirements were that users could maintain a WWW directory in their accounts on the cluster and have that show up on the group web server. Further, the users should be able to configure and manage email aliases and lists from the cluster. Historically the way to do this is just export the file system to the web server. Unfortunately this is not very secure. To completely isolate the two machines I decided to simply have a cron job that rsyncs the appropriate data from the cluster to the web server. This means that if the web server is hacked the only damage that can be done is to that machine. No user data can be destroyed and no other machines can be compromised.

Web and email are necessarily open to the world and as such they should be as thoroughly isolated from the rest of the system as possible.
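A one-way sync like this can be driven by a very small script run from cron. The Python sketch below just builds and runs an rsync command. The host name and paths are hypothetical, and it assumes the cluster pushes to the web server, so that the web server never holds credentials for the cluster.

```python
import subprocess


def build_rsync_command(source_dir, remote_host, remote_dir):
    """Build an rsync-over-ssh command that mirrors source_dir to the remote."""
    return [
        "rsync",
        "-a",        # archive mode: recurse and preserve permissions and times
        "--delete",  # remove remote files that no longer exist locally
        "-e", "ssh",
        source_dir.rstrip("/") + "/",  # trailing slash: copy contents only
        f"{remote_host}:{remote_dir}",
    ]


def push_web_content(source_dir, remote_host, remote_dir):
    """Run the sync; raises CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(source_dir, remote_host, remote_dir),
                   check=True)


if __name__ == "__main__":
    # Hypothetical names, shown only to illustrate the shape of the command.
    print(build_rsync_command("/home/alice/WWW", "webserver.example.org",
                              "/var/www/users/alice"))
```

Because the data only ever flows outward from the cluster, a break-in on the web server can deface the copy but cannot reach the originals.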

Providing semi-hardened backups/snapshots

We didn't buy a fancy NAS, so our backup server is just a simple RAID 50 box sitting in another building together with quarterly tape archives. However, we did want to provide snapshots of the cluster file system to the users. To make this somewhat secure, the cluster runs a second sshd which only accepts RSA keys, and the firewall only allows connections to this port from the backup server's IP address. The backup server then runs rsnapshot (rsync) to do the backups locally, and the results are NFS exported as read-only to the cluster. The backup server allows connections only from the cluster on any port. This means that no one (even root) on the cluster can erase the backups without also having the separate root password to the backup server. Pretty nice setup, even if rsnapshot takes over an hour per snapshot. (Maybe we'll move to ZFS in the future...)

RAID is not a backup. Backups are not archives. Backups should be stored off-site. Archives should be stored elsewhere in a fire-proof safe. Do not ever reuse backup tapes as they will fail much faster.

Conclusion

Well, that's about all I have to say. If you take away anything, please remember that ssh keys and passwords are not safe unless you control the machines from which they are used. If your users are using Windows laptops or third-party machines, then ssh passwords or private keys are a major security hole of which you should be aware. I hope this has been helpful, and remember: I'm not an expert on any of these topics.