
Root cause analysis, alert suppression, and dependencies.

Among those of us who implement and maintain network management systems, the topic of root cause analysis is one of much angst and frustration. Of course we'd all like our monitoring systems to be smart enough to correlate hundreds of seemingly unrelated events into some sort of epiphany about what's "really wrong," but as many of us have found out, that's a pretty lofty goal.

Back when I ran a consulting company I loved root cause analysis/event correlation projects. Who wouldn't? It's a great way to generate tons of revenue and keep your engineers highly engaged without ever really delivering anything useful to the customer. Most times, these projects don't end until the customer either runs out of money or your contact there gets fired for funding a project that never went anywhere. Perfect, right?

I've had the privilege of working in and visiting some of the largest NOCs in the world, and in almost all of them there sits a system that is supposed to correlate network events into meaningful information about where the problems exist. And then there's another product, or in some cases even a home-grown ping tool, that actually monitors the network...

Root cause analysis is a good thing. The concept of correlating events to get a better understanding of the big picture is also a good thing. Where people tend to go wrong is that they head down this road without clear, achievable milestones in mind and end up basically driving around forever. Failing to define what is "good enough" is a good way to ensure that a project like this never ends.

So, how do you get what you really need in terms of suppressing alerts, defining dependencies, and correlating events without getting lost on the road to the Holy Grail of root cause analysis? Well, next week I'll tell you, but right now I'm heading out to Northern Illinois for an early goose hunt with Captain Bob from Migratory Outfitters. If you've got suggestions or comments, post them here and I'll include them in the list. Until next week...

Flame on...
Josh


Posted Oct 21 2008, 11:47 AM by Josh Stephens

Comments

TGhosh wrote re: Root cause analysis, alert suppression, and dependencies.
on 10-22-2008 10:01 AM

Josh,

This is a great topic to dive into, as I am researching the few products in the marketplace that are really geared towards application performance management and event correlation. The ones that seem to be geared towards what I'm looking for cost upwards of 10x what we paid for NPM 9 and APM2 combined. I'll be fully tuned in to your thoughts on this topic.

the_toilet wrote re: Root cause analysis, alert suppression, and dependencies.
on 11-28-2008 4:05 AM
This topic is very high on our agenda at the moment. The key component is a unified, multi-vendor, multi-technology CMDB. SolarWinds are making leaps and bounds in all areas, and just need to bolt it all together. You have asset discovery, but it is network-centric and does not speak to Windows, UNIX, etc. You have network mapping, but it is not integrated into the CMDB, and then onto NPM.

Imagine where you would be if it all tied together: you would have something that already did root-cause identification and alert correlation based on network dependencies. The corporate infrastructure would be visible on a screen, from the firewall down to the servers and then on to the services and applications that run on them. All you need then is to start plugging your Exchange tool into NPM, and then add SQL, Oracle and a few others into the SNPM, and you will be unstoppable, competing in the CA, BMC and HP space for a one-tool-fits-all (which even they have not got to yet....).

Definitely elaborate on your thoughts; it is exciting stuff and the way of the future.... Just ensure you treat it as root-cause identification rather than analysis, as in reality you will need human interaction for the analysis of the proposed identification....
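
To make the idea of alert correlation based on network dependencies a little more concrete, here is a minimal Python sketch. It is not how NPM or any SolarWinds product works; the node names, the `PARENTS` map, and the `probable_root_causes` function are all hypothetical, and the sketch assumes a simple tree in which each node depends on a single parent and there are no cycles. Given the set of nodes currently reported down, it keeps only those with no down ancestor and treats everything else as a suppressible downstream symptom.

```python
from typing import Dict, List, Optional, Set

# Hypothetical dependency chain, firewall -> switch -> servers -> services.
# Each node maps to the single parent it depends on (None = top of the tree).
PARENTS: Dict[str, Optional[str]] = {
    "firewall": None,
    "core-switch": "firewall",
    "mail-server": "core-switch",
    "db-server": "core-switch",
    "exchange-service": "mail-server",
    "sql-service": "db-server",
}


def probable_root_causes(
    down: Set[str], parents: Dict[str, Optional[str]]
) -> List[str]:
    """Return the down nodes that have no down ancestor.

    These are candidates for a human to investigate; every other down
    node sits behind one of them and its alert can be suppressed.
    Assumes the parent map is acyclic.
    """
    roots = []
    for node in sorted(down):
        parent = parents.get(node)
        # Walk up the chain until we hit a down ancestor or run out of parents.
        while parent is not None and parent not in down:
            parent = parents.get(parent)
        if parent is None:
            roots.append(node)
    return roots


if __name__ == "__main__":
    down_now = {"core-switch", "mail-server", "exchange-service", "sql-service"}
    roots = probable_root_causes(down_now, PARENTS)
    print("investigate:", roots)
    print("suppress:", sorted(down_now - set(roots)))
```

The output is only an identification of where to look first, which fits the point above: a person still has to do the analysis.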