So I’m in the middle of a World Tour. Well, I say ‘World’ but I use it in the same sense that Major League Baseball uses the word. In my case, the World according to Moogsoft: leaving San Francisco and ‘doing’ the East Coast US and then Europe.
And it struck me…I love New York…but, how does any simpleton (like me) who is used to the beautiful simplicity of the London Underground or Paris Metro systems pick up the New York Subway system quickly? I have my A’s, B’s, C’s, 1’s, 2’s and 3’s (etc.) going from and to the same Streets.
- There is Canal Street and there is 50th Street.
- yes “there is good service between Canal and 50th”.
- There is always another train if your first one is out of service
- There is another entrance
- There is another exit
- Making it necessary for the Service Provider to integrate their operations processes with third parties to offer the customer a seamless support service.
- Increasing the volume of available data by the power of n.
- Challenging the notion of domains of Monitoring, Service Provisioning and Control by federating Services and Service ownership.
If your goal is to seek a Root-Cause, you’re wasting your time…for the following reasons:
(i) The issue may have been transient and so although there was an issue, it has gone away
(ii) The issue was probably caused by one or more things rather than a singular fault
(iii) Because operations support is ignorant of the relationships between the layers in the stack (e.g. net/compute/storage/middleware/database/apps) and possibly ignorant of any external / 3rd party / outsource / cloud supplier activities too, the first indication operations gets of an issue is when the end-users/customers call. At this point, we’re totally behind the curve because the first thing Ops will do is investigate the Application…then slowly ‘cycle’ down the stack. Meanwhile, the customer is down!
(iv) The move to virtualized infrastructures and users’ adoption of mobile applications in the IT world has increased raw Event and Log rates by a significant factor, overwhelming operations and support staff. This previews the burden that 5G next generation mobile network operations support will experience.
The new world of Service Delivery requires a new approach that addresses all three challenges faced by the providers of Service and that is where Moogsoft has risen to the challenge.
So, Moogsoft is extremely excited to be announced as a partner to Deutsche Telekom (see here, here and here) for the delivery of Big Data driven OSS in their 5G:haus, using streaming big data to help them detect and remediate complex system problems holistically. Incident.MOOG is a service assurance platform specifically designed to help increase the availability of modern virtualized infrastructures underpinning next generation mobile networking, NFV, and SDN. The important differentiator for our customers is that Incident.MOOG offers a single pane of glass for operations and service management across the old and the new technologies.
Moogsoft applies an agile real-time big data approach to Service Assurance. We call it Situation Management.
- Agile means not relying on models of topology or historic trends to detect unusual behavior. Instead, unsupervised machine learning techniques automatically identify patterns in the data. Models or patterns do not need to be pre-defined; ‘simply’ feed data into the Incident.MOOG algorithms and let them detect anomalies within the data. The advantage is obvious: less maintenance, because when the fabric changes, these systems continue to work. Significantly, with this approach, anomalous behavior can be detected without prescribing those anomalies.
- Real-time means detecting anomalous or unusual behavior from the data as it is streaming by, in order to warn you that issues are occurring that you should be aware of, before they become major Incidents or create service disruptions.
The fundamental Moogsoft innovation in the agile real-time approach is that Incident.MOOG uses no models to detect anomalous or unusual behavior.
If you think of a modern IT fabric as quicksand, constantly changing, then it is full of (as Donald Rumsfeld called them) Unknown Unknowns: things we do not know that we do not know. This is profound. As our IT fabric changes (the infrastructure and the Applications), new behavior occurs that we have not experienced before.
If we have to have seen that behavior before we can action / diagnose / resolve it, then we are destined to inflict unknown numbers of service interruptions on our customers.
The agile real-time approach is the only approach which can truly help us increase our quality of Service. Reflective Modelling means looking to resolve the issue after it has happened. Reflective Models are always incomplete. You need to have seen behavior before you can look for it.
With agile real-time, our customers can often pre-empt major issues, because they are ‘push notified’ of anomalous behavior before it transitions into a major issue. Incident.MOOG offers early warning.
Moogsoft is the first and only company to apply agile real-time techniques to IT Fault and Incident Management. Moogsoft works with “Event” data today, specifically because Event data can detect ‘Black Swan’ issues which Time Series approaches may not. A Single Event may or may not lead to Time Series deviations.
Popular ‘big data analytics’ approaches to finding anomalies use ‘reflective’ techniques to search for anomalies in large data sets. In order to produce statistically significant (useful!) results, these techniques require a large and broad “complete” set of data. This means that these techniques are limited to post-issue analysis (hence reflective).
An important but often overlooked nuance of the Moogsoft approach is that Moogsoft does not require a ‘complete’ data set to provide statistically significant results. Compare the Incident.MOOG agile real-time approach and the reflective approach in a simple scenario:
A customer-facing Service experiences an interruption. Was the Service Interruption caused by some fault within the Application (running on a set of AppServers) or somewhere in the IT infrastructure underpinning the Application?
- With the reflective approach (sometimes referred to as ‘predictive’), if one only has data pertaining to the performance of the Application, the perception will be that the fault is within the Application. The results will show Time Series data deviations as a number of anomalies, each requiring further investigation. With the reflective approach, incomplete data does not offer enough data to produce statistically significant results.
- With the real-time approach (Incident.MOOG), again, if one only has data pertaining to the performance of the Application, the cluster of Alerts produced by Incident.MOOG will indicate that multiple AppServers are suffering the same effect and so immediately it will be understood that the issue is not within the Application but external to the Application, somewhere in the infrastructure.
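To make the real-time bullet concrete, here is a minimal sketch of the reasoning, not Incident.MOOG’s actual algorithm. It assumes a hypothetical `Alert` record with a timestamp, host and symptom, groups alerts that share a symptom and arrive close together in time, and treats a cluster spanning several AppServers as a hint that the cause is external to the Application. The names, the 60-second window and the two-host threshold are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are assumptions for this sketch."""
    timestamp: float  # seconds since some reference point
    host: str         # e.g. the AppServer that raised the alert
    symptom: str      # normalized description of what was observed


def cluster_alerts(alerts, window_seconds=60.0):
    """Group alerts with the same symptom whose arrival times are close together."""
    by_symptom = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        by_symptom[alert.symptom].append(alert)

    clusters = []
    for symptom_alerts in by_symptom.values():
        current = [symptom_alerts[0]]
        for alert in symptom_alerts[1:]:
            # Extend the cluster while each alert follows the previous one closely.
            if alert.timestamp - current[-1].timestamp <= window_seconds:
                current.append(alert)
            else:
                clusters.append(current)
                current = [alert]
        clusters.append(current)
    return clusters


def likely_scope(cluster, min_hosts=2):
    """Same symptom on several hosts at once suggests a shared, external cause."""
    hosts = {alert.host for alert in cluster}
    if len(hosts) >= min_hosts:
        return "shared infrastructure"
    return "single host or application instance"


if __name__ == "__main__":
    sample = [
        Alert(0.0, "appserver-1", "slow_io"),
        Alert(10.0, "appserver-2", "slow_io"),
        Alert(20.0, "appserver-3", "slow_io"),
        Alert(500.0, "appserver-1", "out_of_memory"),
    ]
    for cluster in cluster_alerts(sample):
        print(cluster[0].symptom, len(cluster), likely_scope(cluster))
```

Note that the sketch only ever sees application-side alerts, which is the point of the scenario: even with incomplete data, the shape of the cluster says something about where to look.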
Incident.MOOG would save a significant amount of resource time (usually spent investigating phantom issues) and offer early warning to the App Support team of an issue affecting their service, meaning they have the opportunity to proactively notify their user base. This increases the perception of quality of service, even when a disruption has occurred.
Bottom line, the agile real-time approach offers significant value even with incomplete sets of real-time data.
Moogsoft offers a ‘Single Pane of Glass’ across IT Operations data.
Moogsoft’s anomaly detection is performed at the “Manager of Managers” layer, using both complete and incomplete data to deliver statistically significant results. Provide Incident.MOOG with a real-time streaming set of Big Data and it will offer top-to-bottom stack anomaly detection, relating issues within the network and IT fabric to application and service behavior, in real time. All without hard-coded models and topology.
Today, IBM has Netcool for Events, SCAPI for Event Analytics, NI for Performance Analytics, yada yada for APM etc., but no single tool for the Operations and Incident Management processes.
None of them offers a Single Pane of Glass across IT Operations.
Moogsoft offers a Single Pane of Glass across IT Operations and IT Service Management.
Those Stakeholders are brought together in a virtual Incident room (the Situation Room) relating specifically to that Situation.
Within the Situation Room, all parties become situationally aware of their relationship to the anomaly by assessing the Alerts which correspond to their domain. The App and Database people can quickly comprehend that they are not part of the cause, whereas the Storage person can work out quickly that they are the cause of the issue and that there are impacted parties.
This reduces the number of support resources disrupted in the investigation of an issue.
The Incident.MOOG Situation Room has some other business value too though:
- If the first responder is unable to resolve the issue, they are able to escalate to a more experienced support person. When that person joins the room, they do not need to re-create all the activities that the first responder has done (logging into the device, assessing the log file, running diagnostics) because they can see what actions the first responder has taken. This significantly reduces the mean time to resolve issues.
- The resolution knowledge that is captured within the Situation Room is recycled. When a new Situation is inferred in the future, that new Situation is compared with previous Situations. If there are high similarities, the previous Situation’s knowledge is presented to users within the new Situation Room, further reducing the time and resources required to resolve issues and industrializing the process of Knowledge Article creation.
- Then of course there is the important point that the Service Desk also gets an early warning of a Service impacting Incident, so when the End Users call, they are able to respond knowledgeably, reducing the number of service calls / tickets created.
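The idea of comparing a new Situation with previous ones can be illustrated with a small sketch. This is not how Incident.MOOG measures similarity; it assumes, purely for illustration, that each Situation is summarized as a set of `"source:symptom"` strings and uses Jaccard similarity (shared items divided by total distinct items) with a hypothetical threshold of 0.5.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_situations(new_signature, history, threshold=0.5):
    """Return (score, past_situation) pairs at or above the threshold, best first.

    `history` is a list of dicts with hypothetical keys "signature" and "resolution".
    """
    scored = [(jaccard(new_signature, past["signature"]), past) for past in history]
    matches = [(score, past) for score, past in scored if score >= threshold]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return matches


if __name__ == "__main__":
    history = [
        {
            "signature": {"san01:latency", "db01:timeout", "app01:slow"},
            "resolution": "Storage team replaced a failing controller",
        },
        {
            "signature": {"fw01:packet_drop"},
            "resolution": "Firewall rule rolled back",
        },
    ]
    new_signature = {"san01:latency", "db01:timeout", "app02:slow"}
    for score, past in similar_situations(new_signature, history):
        print(f"{score:.2f}", past["resolution"])
```

Whatever the real similarity measure is, the design point is the same: the stored resolution travels with the matched Situation, so responders start from what worked last time.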
So, if I had Moogsoft helping me with my New York Subway journey, I might have made it to my meetings on time, or dynamically changed my schedule!