The wholesale move to virtualization in the data center often leaves an ugly problem in its wake: virtual machine sprawl. When it costs almost nothing to spin up a server, VMs tend to multiply. The count goes up and so does the burden on power, processing, memory and disk space — not to mention the security liability of an unattended server. These tips can help you identify ghost servers and keep systems running efficiently and safely.
The best way to manage ghost servers, of course, is to avoid creating them in the first place. If employees lack incentives to maintain or shut down servers, they may not do so. Create proper documentation by first identifying an owner for every system in the data center. Then create a chargeback system — even if it’s a zero-sum game — to make server owners actively acknowledge their servers by paying to keep them running. Finally, establish policies for patching, remote auditing and secure configuration to ensure that unattended servers don’t become a security liability.
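The ownership step above lends itself to a simple audit. As a minimal sketch, assuming a hypothetical inventory format where each server record carries a `name` and an `owner` field, you can flag the servers that nobody can be billed for:

```python
def unowned_servers(inventory):
    """Flag servers with no registered owner: these can't be charged
    back to anyone, and they are the first ghosts to chase."""
    return [s["name"] for s in inventory if not s.get("owner")]

# Hypothetical inventory records; a real CMDB export would be richer.
inventory = [
    {"name": "web01", "owner": "alice"},
    {"name": "test-vm-7", "owner": None},
    {"name": "db02", "owner": "bob"},
]
print(unowned_servers(inventory))
```

Any list that comes back non-empty is a gap in the documentation policy, regardless of whether the servers themselves turn out to be in use.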
Use network monitoring to find servers that have fallen under the radar. For physical devices, you can use switch port statistics, but in a virtual world you’ll have to rely on IP-level statistics from devices such as firewalls. If you’re concerned about the load that such logs might put on existing firewalls, drop an older, transparent one into the network with “permit all” rules that log to a dedicated server. One alternative method of capturing traffic is a dedicated NetFlow or IPFIX sensor. However, firewalls are the better choice because they count connections; NetFlow and IPFIX report traffic volume, which can be deceptive: a server that handles nothing but background chatter can still show plenty of bytes on the wire.
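Once the firewall logs are flowing, tallying connections per destination is straightforward. The sketch below assumes a simplified, hypothetical log format (`permit <src_ip> -> <dst_ip>:<port>`); a real firewall's syslog output would need its own parser:

```python
from collections import Counter

def inbound_connection_counts(log_lines):
    """Tally inbound connections per destination host from
    simplified firewall log lines of the hypothetical form:
    'permit <src_ip> -> <dst_ip>:<port>'."""
    counts = Counter()
    for line in log_lines:
        if not line.startswith("permit"):
            continue  # skip denies and noise
        _, src, _, dst_port = line.split()
        dst = dst_port.split(":")[0]
        counts[dst] += 1
    return counts

logs = [
    "permit 10.0.0.5 -> 10.0.1.20:443",
    "permit 10.0.0.6 -> 10.0.1.20:443",
    "permit 10.0.0.5 -> 10.0.1.21:22",
]
print(inbound_connection_counts(logs))
```

Hosts that never appear as a destination over a long sampling window are exactly the ghosts this exercise is after.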
Now start looking for incoming connections to each server. Be sure to drop any traffic from services such as backups, anti-malware and patch updates before generating your “bottom 10 percent” statistics. Outgoing traffic will be a mix of software updates and administrative traffic (such as Active Directory). But a server that nothing talks to is a prime candidate for investigation.
CPU usage can also help to identify these candidates, but it’s often misleading, especially if a server has a handful of Java services that have been CPU-bound for years without being noticed. Don’t laugh; this happens more often than you’d think.
Use other tools to expand the list of candidates. Add in any servers that are known security problems based on your vulnerability analyzer, and check your patch management system to see if it is complaining about systems far out of date. Keep investigating and whittle down your final list, which should be about 10 percent of the servers in your data center. Don’t worry about making a definitive list. This is an exercise you can repeat annually; any servers that slip through the cracks this time will be there next time.
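Merging the traffic list with the vulnerability and patch findings is a simple union, but keeping the reason each server was flagged makes the follow-up investigation easier. A minimal sketch, assuming you have already exported each tool's findings as a list of hostnames:

```python
def expand_candidates(quiet, vulnerable, unpatched):
    """Union the low-traffic list with known security problems and
    badly out-of-date systems, tagging each server with every
    reason it was flagged."""
    reasons = {}
    for label, group in (("low traffic", quiet),
                         ("vulnerable", vulnerable),
                         ("unpatched", unpatched)):
        for host in group:
            reasons.setdefault(host, []).append(label)
    return reasons

flagged = expand_candidates(["old-test"], ["old-test", "web02"], ["web02"])
print(flagged)
```

A server that appears under two or three labels at once is a stronger candidate than one flagged by traffic statistics alone.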
Make sure you have management buy-in. The next step is the most dangerous, so you need to be sure that management has your back. With a list of candidate servers to shut down, make sure everyone knows what you’re doing and why you’re doing it: to be more cost-effective, save resources and eliminate security holes.
Finally, start shutting servers down — with plenty of notice, a well-advertised plan and a willingness to reboot anything if someone shouts. This is where things get interesting. Don’t rush this step; it should be done in stages over a period of weeks. Once a server is down, leave it in place for up to six months to see if anyone notices and complains. If not, then finish the job: Back up the data and get rid of obsolete hardware or VM pointers.
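Tracking the grace period described above is worth automating so that powered-off servers don't linger forever. A minimal sketch, assuming a simple mapping of server name to shutdown date and the article's six-month window:

```python
from datetime import date, timedelta

GRACE = timedelta(days=180)  # roughly the six-month waiting period

def ready_to_decommission(shutdowns, today=None):
    """Given {server: shutdown_date}, return the servers whose grace
    period has expired with no complaints -- safe to back up and
    remove the hardware or VM pointers."""
    today = today or date.today()
    return [s for s, d in shutdowns.items() if today - d >= GRACE]

shutdowns = {"old-test": date(2024, 1, 1), "app01": date(2024, 6, 1)}
print(ready_to_decommission(shutdowns, today=date(2024, 7, 15)))
```

Anything the function returns has sat dark for half a year; at that point, archiving the data and reclaiming the resources is the safe final step.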