
Only Contextual Situation Awareness efficiently enables issue Socialization and Collaboration

Friends. Recently this AppDynamics article concerning DevOps and Collaboration was sent my way and it led me off on one of my usual rants about the value of contextual collaboration versus socializing individual Alert or threshold exceptions…

My point is simple: You need contextual Situation Awareness to efficiently take advantage of Socialization and Collaboration.

One of the great things about DevOps methodologies in action is that they build upon the enthusiasm of smart and motivated people who are all interested in moving faster by working together to minimize downtime and ensure continuous innovation. It’s all about maintaining a sticky, reliable experience for end-users and customers.

Monitoring tools vendors see this and are trying to respond to DevOps needs. For example, it’s cool that AppDynamics is trying to encourage collaboration around AppDynamics-related fault or performance threshold exception Alerts.

However…in most environments, APM is only a partial source of the information needed to ensure application and service quality. It supports isolated support domains and reinforces ‘Linear’ processes…review Alert, assess Application performance, review log files, run diagnostics…can’t resolve?…escalate…to a more experienced person…

Linear Process

Linear processes (Alert aggregation, noise filtering, domain assignment, sea-of-Red alert assessment, forensics, ticket escalation…yada yada yada) are highly inefficient in both resources and MTTR

These linear, highly silo-oriented processes leave our operations staff working in isolation from each other. They have no Situation Awareness.

Add to this the long-standing issue (which led to the DevOps revolution in the first place) where the Dev teams have no view into the underlying infrastructure, which leads to the resource-inefficient investigation of phantom application faults…and…

Boom! Inefficient use of highly skilled people. Reacting after issues have occurred and caused collateral or business impact.

The point here – you can socialize a ‘single Alert’ with many people, but a single Alert does not offer situation awareness nor does it offer a reason or context to collaborate with others on.

Situation Awareness comes from having a ‘context’ that is pertinent to the stakeholder; something that they should be aware of.

“Situational Awareness” enables Collaborative Remediation and real Efficiency Savings

Here at Moogsoft, we started from the premise that Application, Compute Infrastructure and Network operations teams all need to be “Situationally” aware in the context of a given issue that relates to all of them.

In other words, if there is an issue with some infrastructure componentry that has caused collateral impact to Applications, the DevOps teams and the respective infrastructure operations teams are stakeholders to that Situation.

By providing context to the Situation (the Alerts which indicate the causality and the collateral impact), the appropriate parties can more quickly diagnose whether they need to action resolution activities or enact business continuity activities, significantly reducing time wasted on ‘spam’ diagnosis and forensics and potentially enabling proactive notifications to impacted customers/end-users.

There is another important nuance here. Most DevOps operations support instrumentation today is architected around the principle of detecting ‘Performance’ (or Time Series) deviations. If performance or capacity trends deviate dramatically from a Historic or previous Baseline Model it probably indicates that something is wrong with the Application…and support resources react accordingly.
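To make the baseline approach concrete, here is a minimal sketch of the kind of check it implies: compare the latest measurement against a historic baseline and flag it when it strays too far. The function name, the three-standard-deviation threshold and the sample response times are all assumptions for illustration; they are not taken from any particular monitoring product.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, threshold=3.0):
    """Return True when `current` lies more than `threshold` standard
    deviations from the mean of `history` (the historic baseline)."""
    if len(history) < 2:
        # Not enough history to build a baseline, so nothing can be flagged.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical response times in milliseconds.
response_times_ms = [120, 118, 125, 122, 119, 121, 123]

print(deviates_from_baseline(response_times_ms, 124))  # within normal variation
print(deviates_from_baseline(response_times_ms, 480))  # far outside the baseline
```

Note what the check depends on: a history that is still representative of today's environment. It also says nothing about which Event caused the deviation.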

At Moogsoft, we have a slightly different perspective on the world. Modern infrastructures, whether pure Cloud elastic compute (such as AWS or Google) or Disruptive Enterprise Infrastructures (with combinations of legacy, outsourced and elastic compute), cannot be modeled: neither from a topological nor from a Performance Behavior Model perspective.

Also, performance behavior typically deviates due to Events (Black Swans, if you like!) of some kind…so ignoring Events in your decision making is kind of like driving down the Freeway at 100MPH (not that I’d do that of course!) with your eyes closed for 5 seconds and then open for 5 seconds (or texting while you drive I guess – again, something else I do not advocate!).

So, with this in mind, we’ve innovated Incident.MOOG around two guiding principles:

1. That anomalous behavior should be detected without a preconceived notion of what an anomaly is.

In other words, Moogsoft believes that one needs to treat the world of modern IT Infrastructures and DevOps in a manner eloquently described by Donald Rumsfeld (not that I advocate anything Donald Rumsfeld believes in either by the way) – we should realize that we are blind to the behavior of modern IT and so need to be able to detect the “Unknown Unknowns”.

If we can detect unknown unknown anomalies, by implication, we can detect known unknowns and known knowns too. So, no models. Models need to be maintained. Maintenance of models is impossible at disruptive enterprise IT infrastructure scale.

2. That anomalous IT behavior is typically not exclusive in nature; anomalous IT behavior will concern multiple parties or domains of support.

So, all the stakeholders to an anomaly should be Situationally Aware of their relationship with that anomaly as soon as that context is detected.

The ideal approach is a collaborative remediation environment – what we call the “Situation Room” – a virtual incident or Facebook ‘wall’ type concept for ensuring that all the stakeholders to an issue are situationally aware and that individual domains of support can collaborate to efficiently resolve the issue.

So, if you want collaboration, you need a reason to collaborate.

Incident.MOOG automatically detects that some form of anomalous behavior is occurring (without models, topology or previous history – yes, we really can infer unknown unknowns!), then ‘push notifies’ the issue to the appropriate stakeholders, who can diagnose and resolve the issue in a collaborative remediation environment.

Simple, huh?

Incident.MOOG uses real-time “big-data-centric” methods that do not rely on outdated and inaccurate behavior models, topology and rules to inform the right stakeholders that there is a service-affecting situation they need to collaborate around, and contextualizes their relationship with that situation (am I the owner of the causal indicators, or am I the impacted party?).
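To illustrate the "owner of the causal indicators or impacted party" idea, here is a rough sketch of how a Situation might carry its Alerts and tell each stakeholder team what its relationship to the issue is. This is purely illustrative and is not Incident.MOOG's data model or API: the `Alert` and `Situation` classes, the `role` labels and the example teams are all hypothetical, and the sketch assumes each Alert already arrives tagged as causal or impacted (how that is inferred is not covered here).

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    team: str     # team that supports the affected component (hypothetical)
    role: str     # "causal" or "impacted" (assumed to be supplied upstream)
    message: str


@dataclass
class Situation:
    summary: str
    alerts: list = field(default_factory=list)

    def stakeholders(self):
        """Map each team to its relationship with this Situation."""
        roles = {}
        for alert in self.alerts:
            # A team that owns any causal Alert is treated as the causal
            # owner, even if it also has impacted Alerts.
            if alert.role == "causal" or alert.team not in roles:
                roles[alert.team] = alert.role
        return roles


def notifications(situation):
    """Build one contextual message per stakeholder team."""
    return [
        f"[{role}] {team}: {situation.summary}"
        for team, role in sorted(situation.stakeholders().items())
    ]


situation = Situation(
    summary="Core switch fault degrading checkout service",
    alerts=[
        Alert("app", "impacted", "Checkout latency high"),
        Alert("network", "impacted", "Packet loss on uplink"),
        Alert("network", "causal", "Switch port flapping"),
        Alert("database", "impacted", "Replication lag rising"),
    ],
)

for line in notifications(situation):
    print(line)
```

The design point is that every team gets the same Situation, not a lone Alert, plus a label saying why it is on the list.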

This approach is inclusive, providing a top-to-bottom view across the entire stack. As a result, Moogsoft’s customers – which range from the largest Internet portals to global banks and cloud service providers – are showing a more than 60% reduction in actionable work items and are receiving warning signs of developing situations hours (and sometimes Days!) earlier than their previous processes did.

More importantly, DevOps teams are able to reduce their reactive responses and war rooms and become proactive with their end-users and customers, rather than reacting after the customer calls to complain (when the disruption has already occurred!).

So embrace Situational Awareness for greater collaboration between Dev, Ops, and DevOps teams. The new era for IT and software requires a new generation of tools that accelerate work across technology domains and supplier silos, and Moogsoft is leading the new way.
