
The New York Subway vs OpenStack, NfV and 5G NextGen Mobile Networks

So I’m in the middle of a World Tour. Well, I say ‘World’ but I use it in the same sense that Major League Baseball uses the word. In my case, the World according to Moogsoft: leaving San Francisco and ‘doing’ the East Coast US and then Europe.


A NY Subway journey is as complex as tracking an application through a network

And it struck me…I love New York…but, how does any simpleton (like me) who is used to the beautiful simplicity of the London Underground or Paris Metro systems pick up the New York Subway system quickly? I have my A’s, B’s, C’s, 1’s, 2’s and 3’s (etc.) going from and to the same Streets.

Easy to provision, not easy to manage the journey - just like OpenStack and virtualized infrastructures

Simple example: I want to get from Canal Street to 50th Street…yes, I have optionality, but boy oh boy, where do I start from and end at?
Then there is simply spotting/finding the Subway entrance on the street – no raised/illuminated signs like Europe – just a railing at waist height…and not necessarily on the street the Subway Map claims it to be (because the platforms can span multiple City blocks…with the Station Street number being ‘mid-multiple-blocks’).
…Then there is the choice of local and express – and navigating the system to find the appropriate platform.
…Then there is the frequency of the trains themselves…which, if I am being completely honest to those I had meetings with, was the major cause of my delays (of course, in NYC you are damned either way – neither public transport nor the road system assures the likelihood of getting between meetings in a timely manner). The double whammy of the Subway is inconsistent frequencies and finding the damn entrance…sometimes even getting your bearings when you exit the system!
The good news of course is that there is optionality. If one 50th Street station is closed, there are others you can use…just deploying a little Monty Python here: I’m looking on the Bright Side of Life!
(I guess I should not complain too much – at least it is a US City with a public transit system – wake up America, this is the 21st Century!!!)
So how could I possibly compare the first time use of the NY Subway to Service Assurance in the 21st Century? It’s a big leap I know, but bear with me:
So, I want to get from Canal to 50th using the Subway in NYC. Looks simple enough from an Application Layer.
  •  There is Canal Street and there is 50th Street.
Is the Service operational between Canal and 50th?
  • Yes: “there is good service between Canal and 50th”.
The journey is scheduled to be 10 minutes…but my ‘packet’ significantly exceeds 10 minutes…in fact, exceeds 20 minutes and more. WT…???
Well, any number of things could have gone wrong:
   (i) It took longer to find a Canal Street Station entrance
   (ii) When I found the Station entrance I then found I was on the wrong line [repeat (i) perhaps several times!]
   (iii) The train didn’t arrive for ages
   (iv) When I exited the 50th Street Station I found I was (a) not on 50th Street and (b) many many Avenues from my destination
Now, while I was ‘in the Subway system’ I had intermittent communications, meaning I could only transmit my location periodically. Bottom line, there was a severe lack of management information available to either me, or those wishing to track my progress in reaching them.
But, I did have alternative routes if my Station was closed…
And here is the comparison and the issue with the modern IT and Service delivery fabric: Virtualized Networks (firstly MPLS and now NfV), Virtualized Compute (VMware, Hyper-V and OpenStack) and Virtualized Storage offer all the great features of the New York Subway. At the surface, the Service or Application layer, it looks simple enough:
  •  There is always another train if your first one is out of service
  •  There is another entrance
  •  There is another exit
But the getting from A to B is both unclear and not necessarily as reliable or timely as one may think. Added to that, we are left with a somewhat translucent view of our traversal of the system. Yes, at the App layer we know we are not there yet, sometimes we may get indications of where we have been, but we do not know at any given time necessarily where we are or why.
We in the Service Assurance industry (whether vendor or consumer) have been spoiled for years. We have owned the service delivery timeline. “No, you cannot roll out your new Application (or Service) until we have ensured we are able to monitor that Service. After all, would you drive down the Freeway with your eyes closed, only opening them at random locations and times?”
Well, that was when the Power Base was with the IT Department: “Compute is complex you know, we need to plan”.
Today, the Power Base has shifted to the Application Owner. As an Application owner, I have optionality. I can wait for my IT department to provision my Compute and Network, or I can simply use a cloud service. Neither is without its Service Assurance risks though. My IT department may take longer. My cloud service provider will give me limited operations information.
OpenStack, NfV and 5G Next Generation Mobile Networking are becoming the new Service Assurance Battlefield. To the outside, it looks simple. Underneath the skin…well…there’s the issue, you can’t look under the skin, it’s too complex. Beauty really is only skin deep.
And we’re fast moving towards a continuously virtually connected world: 5G Next Generation Mobile Networking offers the true dream of omnipresent operations; running business applications and services transparently over mobile and transmission networks.
5G will offer the potential to transition the use of mobile networks from ‘transit connectivity to ubiquitous Business Service delivery’. 5G Next Generation Mobile Networks utilizes Wideband Wireless, Mobile, Virtualization, Software Defined Networking, Virtualized Network functions all overlaid by layers and layers of application Services; but it requires a game changer of an approach to Service Assurance. 5G Next Generation Mobile Networking heralds the first true convergence of modern world telecommunications and compute technologies and processes.
Service Assurance approaches that have served Telecoms companies and their customers for years will no longer work. OpenStack and NfV are already challenging the traditional notion of a Service Assurance platform:
  1. Making it necessary for the Service Provider to integrate their operations processes with third parties to offer the customer a seamless support service.
  2. Increasing the volume of available data by the power of n.
  3. Challenging the notion of domains of Monitoring, Service Provisioning and Control by federating Services and Service ownership.

If your goal is to seek a Root-Cause, you’re wasting your time…for the following reasons:
(i) The issue may have been transient and so although there was an issue, it has gone away
(ii) The issue was probably caused by one or more things rather than a singular fault
(iii) Because operations support is ignorant of the relationships between the layers in the stack (e.g. net/compute/storage/middleware/database/apps) and possibly ignorant of any external / 3rd party / outsource / cloud supplier activities too, the first indication operations gets of an issue is when the end-users/customers call. At this point, we’re totally behind the curve because the first thing Ops will do is investigate the Application…then slowly ‘cycle’ down the stack. Meanwhile, the customer is down!

(iv) The move to virtualized infrastructures and users’ adoption of mobile applications in the IT world has increased raw Event and Log rates by a significant factor, overwhelming  operations and support staff. This previews the burden that 5G next generation mobile network operations support will experience.

The new world of Service Delivery requires a new approach that addresses all three challenges faced by the providers of Service and that is where Moogsoft has risen to the challenge.

So, Moogsoft is extremely excited to be announced as a partner to Deutsche Telekom (see here, here and here) for the delivery of Big Data driven OSS in their 5G:haus, using streaming big data to help them detect and remediate complex system problems holistically. Incident.MOOG is a service assurance platform specifically designed to help increase the availability of modern virtualized infrastructures underpinning next generation mobile networking, NfV, and SDN. The important differentiator for our customers is that Incident.MOOG offers a single pane of glass for operations and service management across the old and the new technologies.

Deutsche Telekom is the first of the service providers to strategically invest in big data for OSS and to benefit from the value that Moogsoft provides. The inclusion of Moogsoft in DT’s 5G:haus must be seen as recognition of their belief in the value of the Moogsoft ‘real-time agile’ approach to Service Assurance, Fault Management and Service Management.
Traditional monitoring tools and 3rd Generation mobile networks cannot scale to address the complexity and load of a 5G-era environment. Through the use of streaming big data, Moogsoft is enabling its customers to uncap their Event/Log telemetry data and rate of infrastructure change, while at the same time detecting faults and the impact of those faults earlier, all without increases in operations and support resources, ensuring Situational Awareness for all the stakeholders involved in the delivery of Services.

Moogsoft applies an agile real-time big data approach to Service Assurance. We call it Situation Management:

  • Agile means not relying on models of topology or historic trends to detect unusual behavior. Agile uses unsupervised machine learning techniques to automatically identify patterns in the data. Models or patterns do not need to be pre-defined; ‘simply’ feed data into the Incident.MOOG algorithms and let them detect anomalies within the data. The advantage is obvious: less maintenance of the system, since when the fabric changes, these systems continue to work. Significantly, with this approach, anomalous behavior can be detected without prescribing those anomalies.
  • Real-time means detecting anomalous or unusual behavior from the data as it is streaming by, in order to warn you that issues are occurring that you should be aware of, before they become major Incidents or create service disruptions.
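
To make the ‘no pre-defined models’ idea a little more concrete, here is a minimal sketch of one way to score events as they stream by. This is emphatically not the Incident.MOOG algorithm; it is an illustrative stand-in that assumes events are plain text messages, and the class name StreamingSurprise is hypothetical. It learns word frequencies online and scores each new event by how surprising its words are compared with everything seen so far.

```python
import math
from collections import Counter


class StreamingSurprise:
    """Illustrative only: scores each event by the average surprisal of its
    words, using counts learned online (no topology, no historic baseline)."""

    def __init__(self):
        self.word_counts = Counter()
        self.total_words = 0

    def score(self, message: str) -> float:
        words = message.lower().split()
        if not words:
            return 0.0
        vocab = len(self.word_counts) + 1
        # Laplace-smoothed surprisal (-log p), averaged over the event's words
        surprisal = sum(
            -math.log((self.word_counts[w] + 1) / (self.total_words + vocab))
            for w in words
        ) / len(words)
        # Update counts only after scoring, so each event is judged
        # against the past and not against itself
        self.word_counts.update(words)
        self.total_words += len(words)
        return surprisal


if __name__ == "__main__":
    detector = StreamingSurprise()
    for _ in range(50):
        routine = detector.score("link up on port eth0")
    rare = detector.score("disk failure on array san7")
    print(f"routine={routine:.2f} rare={rare:.2f}")
```

An event made of words the stream has rarely carried scores higher than routine chatter, and nothing had to be prescribed in advance. A real system would need far more than word counts, but the shape of the idea is the same.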

The fundamental Moogsoft innovation, the agile real-time approach, is that Incident.MOOG uses no models to detect anomalous or unusual behavior.

If you think of a modern IT fabric as quicksand, constantly changing, then it is full of (as Donald Rumsfeld called them) Unknown Unknowns. Things we do not know that we do not know. This is profound. As our IT fabric changes (the infrastructure and the Applications), new behavior occurs that we have not experienced before.

If we have to have seen that behavior before we can action / diagnose / resolve it, then we are destined to inflict unknown numbers of service interruptions on our customers.

The agile real-time approach is the only approach which can truly help us increase our quality of Service. Reflective Modelling means looking to resolve the issue after it has happened. Reflective Models are always incomplete. You need to have seen behavior before you can look for it.

With agile real-time, our customers can often pre-empt major issues, because they are ‘push notified’ of anomalous behavior before it transitions into a major issue. Incident.MOOG offers early warning.

Moogsoft is the first and only company to apply agile real-time techniques to IT Fault and Incident Management. Moogsoft works with “Event” data today, specifically because Event data can detect ‘Black Swan’ issues which Time Series approaches may not. A Single Event may or may not lead to Time Series deviations. 

Popular ‘big data analytics’ approaches to finding anomalies use ‘reflective’ techniques to search for anomalies in large data sets.  In order to produce statistically significant (useful!) results, these techniques require a large and broad “complete” set of data. This means that these techniques are limited to post-issue analysis (hence reflective).

An important but often overlooked nuance of the Moogsoft approach is that Moogsoft does not require a ‘complete’ data set to provide statistically significant results. Compare the Incident.MOOG agile real-time approach and the reflective approach in a simple scenario:

A customer-facing Service experiences an interruption. Was the Service Interruption caused by some fault within the Application (running on a set of AppServers) or somewhere in the IT infrastructure underpinning the Application?

  1. With the reflective approach (sometimes referred to as ‘predictive’), if one only has data pertaining to the performance of the Application, the perception will be that the fault is within the Application. The results will show Time Series data deviations as a number of anomalies, each requiring further investigation. With the reflective approach, incomplete data does not offer enough data to produce statistically significant results.
  2. With the real-time approach (Incident.MOOG), again, if one only has data pertaining to the performance of the Application, the cluster of Alerts produced by Incident.MOOG will indicate that multiple AppServers are suffering the same effect and so immediately it will be understood that the issue is not within the Application but external to the Application, somewhere in the infrastructure.
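
The reasoning in the second case can be sketched in a few lines. This is a simplified illustration, not how Incident.MOOG clusters Alerts: it assumes alerts arrive as (timestamp, host, symptom) tuples from the application tier only, and the function name likely_external and its thresholds are hypothetical. If the same symptom shows up on several distinct AppServers within a short window, the cause is probably shared, and therefore outside the Application.

```python
from collections import defaultdict


def likely_external(alerts, window_seconds=60, min_hosts=3):
    """alerts: iterable of (timestamp_seconds, host, symptom) tuples.
    Returns {symptom: [hosts]} for symptoms seen on at least min_hosts
    distinct hosts within window_seconds (assumed thresholds)."""
    by_symptom = defaultdict(list)
    for ts, host, symptom in alerts:
        by_symptom[symptom].append((ts, host))

    shared = {}
    for symptom, hits in by_symptom.items():
        hits.sort()  # order by timestamp
        start = 0
        for end in range(len(hits)):
            # Shrink the window from the left until it spans <= window_seconds
            while hits[end][0] - hits[start][0] > window_seconds:
                start += 1
            hosts = {h for _, h in hits[start:end + 1]}
            if len(hosts) >= min_hosts:
                shared[symptom] = sorted(hosts)
                break
    return shared


if __name__ == "__main__":
    sample = [
        (0, "app1", "db timeout"),
        (10, "app2", "db timeout"),
        (20, "app3", "db timeout"),
        (500, "app1", "null pointer"),
    ]
    print(likely_external(sample))
```

Note that only application-tier data went in. The incomplete data set was still enough to say ‘look below the Application’, which is the point of the comparison.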

Incident.MOOG would save a significant amount of resource time (usually spent investigating phantom issues) and offer early warning to the App Support team of an issue affecting their service, meaning they have the opportunity to proactively notify their user base. This increases the perception of quality of service, even when a disruption has occurred.

Bottom line, the agile real-time approach offers significant value even with incomplete sets of real-time data.

Moogsoft offers a ‘Single Pane of Glass’ across IT Operations data. 


Moogsoft’s anomaly detection is performed at the “Manager of Managers” layer, using both complete and incomplete data to deliver statistically significant results. Provide Incident.MOOG with a real-time streaming set of Big Data and it will offer top-to-bottom stack anomaly detection, relating issues within the network and IT fabric to application and service behavior, in real time. All without hard-coded models and topology.

Today, IBM has Netcool for Events, SCAPI for Event Analytics, NI for Performance Analytics, yada yada for APM etc., no single tool for the Operations and Incident Management processes.

Incident.MOOG can offer value whether architected in isolation or as a bridge across the existing operations tools (Netcool, SCAPI, NI, Nimsoft, BMC Patrol, Splunk, Logstash, AppDynamics, New Relic, Dynatrace, Riverbed, etc.). Incident.MOOG can also aggregate data directly from original sources (App Logs, SNMP, TCP Sockets, Files, APIs, JSON, etc.). (Hey – we love that IBM has just bought to integrate Weather data…we’ve been demonstrating the ability to incorporate non-IT data into our anomaly detection since day one of our company. We have even demonstrated the integration of customer sentiment from Twitter !)

None of them offers a Single Pane of Glass across IT Operations.

Moogsoft offers a Single Pane of Glass across IT Operations and IT Service Management.

Incident.MOOG manifests anomalies as clusters (or groups) of ‘informationally similar’ Alerts; we call these Alert clusters Situations, because they represent actionable issues. Each Situation can contain Alerts from multiple Entities; for example, an anomaly may include Storage Alerts, Database Alerts and Application Alerts. Incident.MOOG uses these Entity classifications to notify the appropriate support Stakeholders about their relationship to the anomaly, giving all stakeholders earlier warning of anomalous behavior than the existing Alert Centric approach to Fault Management.
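
As a rough illustration of the idea (and only that: the real Incident.MOOG algorithms are not described here), the sketch below groups alerts whose text overlaps and that arrive close together in time, then works out which teams to notify from the Entity types inside each group. The alert fields, the Jaccard word-overlap measure, the thresholds and the owners mapping are all assumptions made for the example.

```python
def tokens(text):
    """Lower-cased set of words in an alert's text."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def group_into_situations(alerts, min_similarity=0.3, window_seconds=300):
    """alerts: list of dicts with 'ts', 'entity' and 'text', sorted by 'ts'.
    Greedy single pass: each alert joins the first recent group it resembles,
    otherwise it starts a new group (a stand-in for a Situation)."""
    situations = []
    for alert in alerts:
        words = tokens(alert["text"])
        for s in situations:
            recent = alert["ts"] - s["last_ts"] <= window_seconds
            if recent and jaccard(words, s["words"]) >= min_similarity:
                s["alerts"].append(alert)
                s["words"] |= words
                s["last_ts"] = alert["ts"]
                break
        else:
            situations.append(
                {"alerts": [alert], "words": set(words), "last_ts": alert["ts"]}
            )
    return situations


def stakeholders(situation, owners):
    """Teams to notify, derived from the Entity types present in the group.
    owners is a hypothetical {entity: team} mapping."""
    return sorted(
        {owners[a["entity"]] for a in situation["alerts"] if a["entity"] in owners}
    )


if __name__ == "__main__":
    alerts = [
        {"ts": 0, "entity": "Storage", "text": "latency high on volume vol7"},
        {"ts": 20, "entity": "Database", "text": "query latency high on volume vol7"},
        {"ts": 40, "entity": "Application", "text": "checkout latency high on vol7"},
        {"ts": 60, "entity": "Network", "text": "bgp peer flap router r2"},
    ]
    owners = {"Storage": "storage-team", "Database": "dba-team",
              "Application": "app-team", "Network": "net-team"}
    for s in group_into_situations(alerts):
        print(len(s["alerts"]), stakeholders(s, owners))
```

The Storage, Database and Application alerts land in one group, so all three teams hear about the same issue at the same time, while the unrelated Network alert stays on its own.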

Those Stakeholders are brought together in a virtual Incident room (the Situation Room) relating specifically to that Situation.

Within the Situation Room, all parties become situationally aware of their relationship to the anomaly by assessing the Alerts which correspond to their domain (the App and Database people can quickly comprehend that they are not part of the cause, whereas the Storage person can work out quickly that they are the cause of the issue and that there are impacted parties).

This reduces the number of support resources disrupted in the investigation of an issue.

The Incident.MOOG Situation Room has some other business value too though:
  • If the first responder is unable to resolve the issue, they are able to escalate to a more experienced support person. When that person joins the room, they do not need to re-create all the activities that the first responder has done (logging into the device, assessing the log file, running diagnostics) because they can see what actions the first responder has taken. This significantly reduces the mean time to resolve issues.
  • The resolution knowledge that is captured within the Situation Room is recycled. When a new Situation is inferred in the future, that new Situation is compared with previous Situations. If there are high similarities, the previous Situation’s knowledge is presented to users within the new Situation Room, further reducing the time and resources required to resolve issues and industrializing the process of Knowledge Article creation.
  • Then of course there is the important point that the Service Desk also gets an early warning of a Service impacting Incident, so when the End Users call, they are able to respond knowledgeably reducing the number of service calls / tickets created.
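
The knowledge recycling in the second point boils down to a similarity lookup. Here is a minimal sketch under stated assumptions: each Situation is reduced to a set of descriptive strings (a made-up ‘signature’ format), similarity is plain Jaccard overlap, and the function name and threshold are hypothetical rather than anything from the product.

```python
def similar_past_situations(new_signature, history, threshold=0.5):
    """new_signature: set of strings describing the new Situation (assumed format).
    history: list of (signature_set, resolution_note) for resolved Situations.
    Returns (similarity, resolution_note) pairs at or above threshold,
    best match first."""
    matches = []
    for signature, note in history:
        union = new_signature | signature
        score = len(new_signature & signature) / len(union) if union else 0.0
        if score >= threshold:
            matches.append((score, note))
    return sorted(matches, reverse=True)


if __name__ == "__main__":
    history = [
        ({"storage", "latency", "vol7", "database"}, "Failed over to controller B"),
        ({"network", "bgp", "flap"}, "Reset BGP peer"),
    ]
    new = {"storage", "latency", "vol7", "application"}
    for score, note in similar_past_situations(new, history):
        print(f"{score:.2f} {note}")
```

Whoever joins the new Situation Room then starts with what worked last time, rather than from a blank page.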

So, if I had Moogsoft helping me with my New York Subway journey, I may have made it to my meetings on time, or, dynamically changed my schedule!
Thanks for getting through this monologue 🙂
