Automation's Impact on Data Center Monitoring Alerts – The Data Center Journal

Posted: February 14, 2017 at 11:17 am

In my last installment, I discussed a few different areas where data center monitoring automation can not only make life in the data center more convenient but also become a force multiplier. I ran out of space, however, before I ran out of ideas (the story of my life). The one thing I didn't cover was the automation you can implement in response to an alert.

As a data center professional, you probably have a solid understanding of monitoring and alerting already, but to truly appreciate how automation can relieve an enormous burden, it may be helpful to review a few examples.

What follows are some clippings from my garden of automation alert responses that have had a huge impact on the environments where they were implemented.

Example 1: Disk Full

Disk-full alerting is a simple concept with a deceptively large number of moving parts, so I want to break it down into specifics. First, get the alert right. As my fellow SolarWinds Head Geek Thomas LaRock and I discussed in a recent episode of SolarWinds Lab, simplistic disk alerts help nobody. If you have a 2TB disk, alerting when it's 90 percent used means the alert fires with 204.8GB of disk space still remaining.

A good solution to this problem is to check both percent used and remaining space. A better solution is to include logic in the alert that tests for the total space of the drive, so that drives with less than 1TB of space have one set of criteria and drives with more than 1TB have another. These tests should all be in the same alert, if possible, because who wants to manage hundreds of alert rules? Either way, you want to ensure you are monitoring disk space in a way that is reasonable for the volumes in question, and create only the alerts you need.
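As a rough illustration, the tiered test might look like the Python sketch below. Only the 1TB cutoff comes from the discussion above; the function name, the percentages and the free-space floors are placeholder assumptions that you would tune for your own volumes.

```python
GB = 1024 ** 3  # bytes per gigabyte (binary)
TB = 1024 ** 4  # bytes per terabyte (binary)

def disk_alert(total_bytes, used_bytes,
               small_pct=90.0, small_free_floor=10 * GB,
               large_pct=95.0, large_free_floor=50 * GB):
    """Return True when a volume should raise a disk-full alert.

    All four thresholds are illustrative assumptions, not recommendations.
    """
    free_bytes = total_bytes - used_bytes
    pct_used = 100.0 * used_bytes / total_bytes
    # Pick the criteria set based on the total size of the drive.
    if total_bytes < 1 * TB:
        pct_limit, free_floor = small_pct, small_free_floor
    else:
        pct_limit, free_floor = large_pct, large_free_floor
    # Alert only when percent used is high AND remaining space is low.
    return pct_used >= pct_limit and free_bytes < free_floor
```

With these assumed thresholds, a 2TB disk at 90 percent used stays quiet, while a 100GB disk with 5GB left raises the alert, and both cases live in one rule.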

Next, clear unnecessary files out of various directories. For the purpose of this article, I'll just say that all systems have a temporary directory and that you can delete all files out of that folder with impunity. The challenge in doing so comes down to a problem of impersonation. Many monitoring solutions run on the server as the system account. As a result, performing certain actions requires the script to impersonate a privileged user account. There are a variety of ways to do so, which is why I'll leave the problem here for you to solve in a way that best fits your individual environment.

Once the impersonation issue is resolved, there's another challenge specific to the disk-full alert: knowing that the correct directories for the specific server are being targeted. The best approach is to use a common shared folder that maps to all servers and place a script file there. That script can be set up to first detect the proper directories and then clear them out with all the necessary safeguards and checks in place to avoid accidental damage.
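A minimal sketch of such a script, in Python, could look like the following. The function name, the seven-day age cutoff and the dry-run default are assumptions added here as examples of safeguards; the impersonation question is deliberately left out.

```python
import os
import time
from pathlib import Path

def clear_old_files(directory, max_age_seconds=7 * 24 * 3600, dry_run=True):
    """Delete regular files under `directory` older than `max_age_seconds`.

    Returns the paths that were removed (or would be removed in a dry run).
    """
    root = Path(directory).resolve()
    # Safeguard: refuse to run against a filesystem root such as "/" or "C:\".
    if root == Path(root.anchor):
        raise ValueError(f"refusing to clean filesystem root: {root}")
    cutoff = time.time() - max_age_seconds
    removed = []
    for path in root.rglob("*"):
        # Only plain files are candidates; skip directories and symlinks.
        if path.is_symlink() or not path.is_file():
            continue
        try:
            if path.stat().st_mtime < cutoff:
                if not dry_run:
                    path.unlink()
                removed.append(path)
        except OSError:
            # The file vanished or is locked by another process; leave it alone.
            continue
    return removed
```

To detect the proper directory on whatever server the script lands on, something like `clear_old_files(tempfile.gettempdir(), dry_run=False)` is one option, since Python's `tempfile.gettempdir()` reports the platform's temporary directory.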

Example 2: Restart an IIS Application Pool

Sadly, restarting application pools is often the easiest and best fix for website-related issues. I'm not saying that running appcmd stop... and then appcmd start... from the server command line is a quick kludge that ignores the bigger issues. I'm saying that often, resetting the application pool is the fix.

If your web team finds itself in this situation, waking a human being to do the honors is absolutely your most expensive option. But automatically restarting the application pool becomes slightly more challenging because one server could be running multiple websites, which in turn have multiple application pools. Or you could have one big application pool controlling multiple websites. It all depends on how the server and websites were configured, and you have no way of knowing.

If your monitoring solution can monitor the application pool, it will provide the name for you. Most mature monitoring solutions do so already. Once you have the name, you can do the following:
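One way to sketch that step in Python is to wrap the appcmd stop and start commands mentioned earlier. The appcmd path below is the usual location on a stock IIS install and is an assumption, as are the function names; the pool name would come from your monitoring solution's alert.

```python
import subprocess

# Assumed default location of appcmd.exe; adjust if your install differs.
APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"

def apppool_commands(pool_name):
    """Build the stop and start command lines for one application pool."""
    target = f"/apppool.name:{pool_name}"
    return [[APPCMD, "stop", "apppool", target],
            [APPCMD, "start", "apppool", target]]

def restart_apppool(pool_name, run=subprocess.run):
    """Stop, then start, the named pool; raises if the start command fails."""
    stop_cmd, start_cmd = apppool_commands(pool_name)
    # The stop may fail if the pool is already down, so its result is ignored.
    run(stop_cmd, check=False)
    run(start_cmd, check=True)
```

Passing the runner in as a parameter is only a convenience so the logic can be exercised without touching a real IIS server.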

Example 3: Restart IIS

Running a close second behind restarting application pools is resetting IIS. Doing so is, of course, the nuclear option of website fixes since you are bouncing all websites and all connections. Even though it's drastic, it's a necessary step in some cases.

As with restarting application pools, getting a human involved in this incredibly simple action is a waste of everyone's time and the company's money. It's far better to automatically restart and then recheck the website a minute or two later. If all is well, the server logs can be investigated in the morning as part of a postmortem. If the website is still down, it's time to send in the troops.

You can restart the IIS web server in a number of ways:
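One of those ways, sketched in Python, is to call the iisreset command and then recheck the site after a pause. The function names, the 90-second wait and the "recovered"/"escalate" return values are assumptions for illustration; how you escalate is up to your alerting tool.

```python
import subprocess
import time
import urllib.error
import urllib.request

def site_is_up(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections and timeouts.
        return False

def reset_iis_and_recheck(url, wait_seconds=90, run=subprocess.run,
                          check_site=site_is_up, sleep=time.sleep):
    """Restart IIS on the local server, wait, then recheck the website."""
    run(["iisreset", "/restart"], check=True)
    sleep(wait_seconds)  # give the sites a minute or two to come back
    return "recovered" if check_site(url) else "escalate"
```

A result of "recovered" means the logs can wait until morning; "escalate" is the cue to page a human.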

Example 4: Restart a Server

If restarting the IIS service is the nuclear option, restarting the entire server is akin to nuclear Armageddon. Yet we all know there are times when restarting the server is the best option, given a certain set of conditions that you can monitor. Assuming your monitoring solution doesn't support a built-in capability for this function, some options include the following:
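As one hedged example for Windows servers, the built-in shutdown command can restart a remote machine. The Python helper below only builds the command line; the function name, the 60-second delay and the comment text are assumptions, and the account running it needs rights to restart the target.

```python
def reboot_command(host, delay_seconds=60,
                   reason="Automated restart from monitoring alert"):
    """Build a Windows `shutdown` command line that restarts a remote host.

    /r = restart, /m = target machine, /t = delay in seconds, /c = comment.
    """
    return ["shutdown", "/r", "/m", rf"\\{host}",
            "/t", str(delay_seconds), "/c", reason]
```

The resulting list can be handed to `subprocess.run(..., check=True)` from the monitoring server. The delay leaves a short window in which the restart can still be aborted.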

Example 5: Restart a Service

Occasionally, services stop. They are sometimes even services that you, as a data center professional who needs to monitor your infrastructure, care about, such as SNMP. So, you are cutting dozens of service-down alerts. Have you thought about restarting them? In some cases, a restart doesn't really help much. But in far more situations it does. Computers are funny things. After all, "Screws fall out all the time. The world is an imperfect place." (From The Breakfast Club.)

Sometimes, they just need a gentle nudge. If this is the case, you can do the following:
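A small Python sketch of that nudge follows. It uses net stop and net start on Windows and systemctl restart on Linux; the function names are assumptions, and the service name (for example SNMP on Windows or snmpd on many Linux distributions) has to match what your systems actually call it.

```python
import subprocess

def service_restart_commands(service, system):
    """Return the command lines that restart `service` on the given platform."""
    if system == "Windows":
        return [["net", "stop", service], ["net", "start", service]]
    if system == "Linux":
        return [["systemctl", "restart", service]]
    raise ValueError(f"unsupported platform: {system}")

def restart_service(service, system, run=subprocess.run):
    """Run the restart; only the final command is required to succeed."""
    commands = service_restart_commands(service, system)
    for cmd in commands[:-1]:
        # A stop fails harmlessly when the service is already down.
        run(cmd, check=False)
    run(commands[-1], check=True)
```

On the machine itself, `restart_service("snmpd", platform.system())` would pick the right branch, since `platform.system()` returns "Windows" or "Linux".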

Example 6: Back Up a Network-Device Configuration

Everything I've gone over so far covers direct remediation-type actions. But in some cases, automation can be defensive and informational. Network-device configuration backups are a good example, in that they don't fix anything, but instead gather additional information to help you fix the issue faster.

It's important to note that between 40 and 80 percent of all corporate-network downtime is the result of unauthorized or uncontrolled changes to network devices. These changes aren't always malicious. Often, the change simply went unreviewed by another set of eyes or an otherwise simple error slipped past the team.

So, having the ability to spontaneously pull a device configuration based on an event trigger is super helpful. To do so, you can use the following approach:

There are two general cases when you may want to execute this automatic action. The first is when your monitoring solution receives a config change trap. Although the details of SNMP traps are beyond the scope of this article, you can configure your network devices to send spontaneous alerts on the basis of certain events. One of these events is a configuration change. The second is when the behavior of a device changes drastically, such as when ping success drops below 75 percent or ping latency increases. In the second case, the device is often in the process of becoming unavailable. But in some situations, it's wobbly, and there's a chance to grab the configuration before it drops completely.

In both of those situations, having the latest configuration provides valuable forensic information that can help troubleshoot the issue. It also gives you a chance to restore the absolute last-known-good configuration, if necessary. And if it leads you to think, "Well, if I have the last-known-good configuration, why can't I just push that one back?" then you, my friend, have caught the automation bug! Run with it.
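The two trigger cases can be expressed as a small decision function. In this Python sketch the 75 percent ping threshold comes from the text above, while the function name and the "three times baseline" latency rule are assumptions chosen only to make "latency increases" concrete.

```python
def should_back_up_config(got_config_change_trap, ping_success_pct,
                          latency_ms, baseline_latency_ms,
                          latency_factor=3.0):
    """Decide whether to pull a device configuration right now."""
    # Case 1: the device reported a configuration change via SNMP trap.
    if got_config_change_trap:
        return True
    # Case 2: the device's behavior has changed drastically.
    if ping_success_pct < 75.0:
        return True
    return latency_ms > latency_factor * baseline_latency_ms
```

How the configuration is then pulled depends on the device and the tooling you have, so that part is left out here.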

Example 7: Reset a User Session

Somewhere in the murky past, the first computer went online and became Node 1 in the vast network we now call the Internet. The next thing that probably happened, mere seconds later, was that the first user forgot to log off their session and left it hanging.

For any system that supports remote connections (whether in the form of telnet/ssh, drive mappings or RDP sessions), having the ability to monitor and manage remote-connection user sessions can make running weekly, if not daily, restarts unnecessary. Or at least much smoother.

For Linux, use the who command to discover current sessions or, for greater granularity, remotely run netstat -tnpa | grep 'ESTABLISHED.*sshd'. Once you have the process ID, you can kill it. For Windows, you can list the active sessions on a system using the query session command and disconnect a session using the reset session command. Or you can use the PowerShell cmdlet Invoke-RDUserLogoff.
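For the Linux side, a minimal Python sketch might parse the output of who and build a command to end a session by terminal. The function names are assumptions, and so is ending the session by terminal with pkill -t, which is used here in place of looking up the process ID. Only the first two columns of who are read, since the date format varies between systems.

```python
def parse_who(output):
    """Extract user and terminal from each line of `who` output."""
    sessions = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            sessions.append({"user": fields[0], "tty": fields[1]})
    return sessions

def kill_session_command(tty):
    """Build a command that kills every process attached to a terminal."""
    # pkill -t expects the terminal name without the /dev/ prefix.
    return ["pkill", "-KILL", "-t", tty]

# Made-up sample in the usual `who` layout, for illustration only.
SAMPLE = """alice    pts/0        2017-02-14 11:17 (10.0.0.5)
bob      pts/1        2017-02-14 09:02 (10.0.0.9)
"""
```

Feeding `SAMPLE` to `parse_who` yields one entry per session, and `kill_session_command("pts/1")` gives the command that would end bob's session when run with sufficient privileges.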

Example 8: Clear DNS Cache

At times, a server and/or application will misbehave because it can't contact an external system. This misbehavior occurs either because the DNS cache (the list of known systems and their IP addresses) is corrupt, or because the remote system has moved. In either case, a really easy fix is to clear the DNS cache and let the server attempt to contact the system at its new location.

In Windows, use the command ipconfig /flushdns. In Linux, the command varies from one distribution to another, so it's possible that sudo /etc/init.d/nscd restart will do the trick, or /etc/init.d/dns-clean, or perhaps another command. Research may be necessary for this one.
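Because the Linux command varies, an automated response has to pick whichever mechanism is present. The Python sketch below tries the two candidates named above in order; the function name and the "first script that exists" rule are assumptions, and your distribution may need a different command (or an argument to the script) altogether.

```python
import os

# Candidates taken from the commands mentioned above, tried in order.
LINUX_CANDIDATES = [
    ["/etc/init.d/nscd", "restart"],
    ["/etc/init.d/dns-clean"],
]

def flush_dns_command(system, exists=os.path.exists):
    """Return a command that clears the DNS cache, or None if none is known."""
    if system == "Windows":
        return ["ipconfig", "/flushdns"]
    for candidate in LINUX_CANDIDATES:
        # Use the first init script that is actually installed.
        if exists(candidate[0]):
            return ["sudo"] + candidate
    return None
```

A return value of None is the signal that research is still needed for that server.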

Hopefully at least a few of the things I've shared here, and in this series on automation as a whole, have inspired you to give automation a try in your data center. If so, or if you're already well on your way to automating all the things, I'd love to hear about your experiences and perspective in the comments section.

Leading article image courtesy of Leonardo Rizzi under a Creative Commons license

Leon Adato, SolarWinds Head Geek and long-time IT systems management and monitoring expert, discusses all things data center in this ongoing series.

Automation's Impact on Data Center Monitoring Alerts was last modified: February 13th, 2017 by Leon Adato
