MLOps: From Theory to Practice + Tips on Data Center Optimization

Catch up on highlights from the first episode of AIOps Evolution Weekly: taking MLOps from theory to practice and tips on data center optimization.

Insights from AIOps Evolution Weekly | Episode 1

In the latest AIOps Evolution Weekly episode, Sean McDermott, CEO of Windward Consulting Group, and Bill Driscoll, Consulting Director at Windward, discuss the latest developments in the AIOps landscape, namely how to help early adopters with AIOps integration beyond IT departments.

Topics on AIOps integration included:

  • What to consider as you deploy AIOps
  • 4 ways to optimize your data center with AI/ML
  • How to take MLOps (AIOps) from research to applied business solution

Let’s take a deeper dive into the article topics and the takeaways from Bill and Sean.

Four Ways to Optimize Your Data Center with AI/ML

As Sean and Bill put it, the interesting/not interesting article that kicked off the conversation revolved around the topic of how AI can be advantageous for operations beyond an IT environment. In this article, Subhankar Pal, AVP of Technology and Innovation at Capgemini Engineering, makes a case for different ways AI can optimize a data center.

Pal notes that, according to Algorithmia’s “2021 Enterprise Trends in Machine Learning,” 83 percent of organizations have increased their AI/ML budgets year-on-year. Major hyperscalers, for example, have developed in-house AI to support use cases such as cooling.

But smaller operators can achieve AI/ML benefits too, by leveraging AI-as-a-Service on cloud platforms. Some highlights from the article include the following:

Embedded AI

AI chips can be trained for a specific task such as pattern recognition, natural language processing, network security, robotics and automation. Also, AI/ML can be applied to the data center’s mechanical and electrical equipment to enable actionable insights and automation, saving money for the operator.

These AI integrations reduce latency for workloads that run close to users, such as applications, 5G, and VR/gaming devices.

Use AI/ML for New Construction and Retrofits

Operators should make AI/ML a key part of their planning and construction process, especially for retrofits; AI/ML enables predictive maintenance at an existing facility.

Leverage Digital Twins

Digital twins provide a 3D virtual replica to simulate physical behavior under any operating scenario. This, in turn, brings all stakeholders together to strategize and take control of the performance and business impact of operations on a data center. The benefit? It gives teams oversight, visibility, prediction and quantification capabilities over any changes in a complex data center.

Sean and Bill’s Take

Sean noted, “The point of this is that we are going to be seeing AI showing up everywhere.” Bill mentioned how AI is an intuitive resource that only continues to improve – “As higher and higher compute is available on smaller and smaller chip sets, the algorithms only continue to get better.”

As it learns and improves, AI goes beyond the IT operations platform. It can detect problems such as packet loss, work out how to re-route traffic, and self-heal issues. To draw a comparison, this is kind of like “super-smart IoT,” Sean said. “Ultimately, you have billions and billions of devices that are going to get smarter and smarter.”

The Road To MLOps: Machine Learning As An Engineering Discipline

As machine learning becomes more integrated with our digital technologies to optimize systems and operations, companies are chomping at the bit to invest in MLOps. Yet the article’s author, Cristiano Breuel, argues that applying MLOps is easier said than done, citing a report that “only 22 percent of companies using machine learning have successfully deployed a model.” What makes it so hard? And what do we need to do to improve the situation?

What Are The Challenges With Deploying An ML Model?

Breuel starts by defining some basic differences between DevOps and MLOps. DevOps relies on code, automation, tools, and workflows to abstract away accidental complexity and let developers focus on the real problems that need to be solved. Adopting it has been relatively simple for companies, so what gives with ML?

At its core, ML is not only code; it is code plus data. Code is carefully crafted and developed for a specific function, but data comes from an infinite source called “the real world”. Data is constantly in flux, and one cannot know how it will change. Breuel notes that the relationship between code and data “is as if they live on separate planes that share the time dimension but are independent in all other ways.” So, the challenge of an ML process is to create a bridge between these two planes in a controlled environment.

What Does MLOps Look Like In Practice?

Breuel goes on to describe some of the areas that will need to evolve to deploy effective MLOps in enterprises:

Hybrid teams – A successful MLOps deployment needs to have the right people with the right set of skills. The ideal team would include a data scientist or ML engineer, a DevOps engineer, and a data engineer.

ML pipelines – ML models require data transformation to run reliably. Switching to proper data pipelines provides advantages in code reuse, run time visibility, management, and scalability. ML models will require two versions of the pipeline: one for training and one for serving.

Model and data versioning – In ML, we need to track model versions, the data used to train them, and meta-information such as training hyperparameters. However, there is no ideal solution at this moment.

Model validation – We need reliable ML model validation tests that are statistical in nature, judged against metrics and thresholds rather than a binary (Y/N) status.

Data validation – ML pipelines should also validate higher-level statistical properties of the input. It is important to rule out systematic errors as causes that could contaminate the model and fix them as needed.

Monitoring – Monitoring ML systems is important because their performance depends not only on controllable factors, such as infrastructure, but also on data, which is less controllable.
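To make the versioning and validation items above more concrete, here is a minimal Python sketch using only the standard library. It is not taken from Breuel’s article: the function names (fingerprint, validate_data, validate_model), the thresholds, and the sample values are all hypothetical and chosen purely for illustration.

```python
import hashlib
import json
import statistics


def fingerprint(rows, hyperparams):
    """Versioning: hash the training data and hyperparameters together so a
    model can be traced back to exactly what produced it (illustrative only)."""
    payload = json.dumps({"rows": rows, "hyperparams": hyperparams}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def validate_data(values, expected_mean, tolerance):
    """Data validation: check a higher-level statistical property (here the
    mean) of an input column, not just types or missing values."""
    observed = statistics.mean(values)
    return abs(observed - expected_mean) <= tolerance, observed


def validate_model(predictions, labels, min_accuracy):
    """Model validation: compare a statistical metric against a threshold
    instead of expecting every individual prediction to be correct."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    accuracy = correct / len(labels)
    return accuracy >= min_accuracy, accuracy


if __name__ == "__main__":
    # Hypothetical sample values, for demonstration only.
    print(fingerprint([[1, 2], [3, 4]], {"learning_rate": 0.1}))
    print(validate_data([10.0, 12.0, 11.0], expected_mean=11.0, tolerance=0.5))
    print(validate_model([1, 0, 1, 1], [1, 0, 0, 1], min_accuracy=0.7))
```

In a real pipeline, checks like these would typically run automatically before training and before a model is promoted to serving, and the same statistics could be tracked over time as part of monitoring.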

What Does The Future Look Like?

Breuel sums it all up by stating that as MLOps matures, we need to improve the maturity of its operational processes. However, the greatest challenge with any technological advancement is that it sometimes evolves faster than humans can update their practices.

Sean and Bill’s Take

Bill’s takeaway from the article was that machine learning is moving beyond the confines of IT operations into other business and enterprise applications. ML can use the same teams and disciplines regardless of what the business applies it to, whether that is customer interfacing or running an IT team. The article shows that as ML continues to mature, it will be an evolution of both business and technology.

Sean found the tie back to DevOps interesting. Over the last few years, IT has matured into a discipline of codifying the concept of running an operation into the applications themselves, so that the operation runs more effectively.

He said, “The article was interesting because it had a premise that coding is predictable. And we do that consistently – we run predictable test cases with our code. But machine learning is unpredictable – you put data in and you don’t know what the outcome may be since we don’t necessarily know how the machine learns. So the idea of bringing data engineering into this is interesting.”

Ultimately, companies looking into this need to build hybrid teams of people who understand how to build machine learning models; how to consume, version and interpret data; and how DevOps can make these things operational.

The challenge is that to take advantage of this, you’re now building multi-disciplinary organizations where some of these skill sets are specific and need subject matter experts. Sean noted, “I can see larger enterprises using this model, but what about mid-sized companies? Will they be able to attract this type of talent and make these changes?

“It will be interesting to see companies that come out with these ‘package solutions’ on how to have the engine, do automated data engineering, and having disciplines and models embedded in DevOps. We’re years away from that though.”

Why Your AIOps Deployments Could Fail

Gab Menachem, Senior Director, Product Management, ITOM at ServiceNow and founder and CEO of Loom Systems (a ServiceNow company)

Why do we need AIOps? Two words: digital transformation. Our work environments continue to evolve and that necessitates agile, efficient, and continuous digital transformation. According to IDC, digital transformation investments worldwide will total more than $7.8 trillion by 2024.

With proper usage, AIOps enables IT teams to act with speed and efficiency and to respond to issues proactively and in real time by drawing on the historical context of IT issues to provide valuable diagnosis and resolution.

In his article, Menachem writes that IT leaders need to make some crucial considerations to set themselves and their teams up for success and drive ROI on AIOps investments.

Focus Leads To Big Outcomes

When it comes to AIOps implementation, Menachem says less is more. Start with a focused approach built around a single use case. Where should you start? Take a look at IT incidents and identify regularly occurring issues. Once that first use case succeeds, IT leaders can use it to scale across the business.
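As a rough illustration of what “identify regularly occurring issues” could look like in practice, the Python sketch below counts incidents by category and keeps the ones that repeat. The incident records, the "category" field, and the recurring_issues helper are hypothetical; a real incident export from an ITSM tool would have its own schema.

```python
from collections import Counter


def recurring_issues(incidents, min_count=2):
    """Return (category, count) pairs seen at least min_count times,
    most frequent first."""
    counts = Counter(incident["category"] for incident in incidents)
    return [(category, n) for category, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Hypothetical incident log, for demonstration only.
    incidents = [
        {"id": 1, "category": "disk_full"},
        {"id": 2, "category": "cert_expired"},
        {"id": 3, "category": "disk_full"},
        {"id": 4, "category": "dns_timeout"},
        {"id": 5, "category": "cert_expired"},
        {"id": 6, "category": "disk_full"},
    ]
    for category, n in recurring_issues(incidents):
        print(f"{category}: {n} incidents")
```

The category at the top of such a list is a reasonable candidate for the first focused AIOps use case.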

Continuous Data Flows Are Essential

As with any new technology implementation, testing and analyzing data on performance is necessary. However, IT departments often struggle to organize and consolidate the multitude of information from their data sources into one place.

This obstacle hinders AIOps deployments since they rely on historical and real-time data. Menachem says consolidating solutions allows IT to look at all assets holistically, which helps guarantee all data is funneled to a single location, making it easier for AIOps solutions to make educated decisions.

Sean and Bill’s Take

Bill thinks it is too early to call AIOps a “failure”; rather, companies are learning as they implement and deploy AIOps. As you deploy AIOps and machine learning, you have to decide: Is it domain agnostic or domain-centric?

Define your problems and break them down; that way, you can work out how to implement machine learning effectively in your organization. Bill agrees with Menachem’s assertion that taking a focused approach to AIOps implementation will be the smartest route to success. Another key point is that ITOps should not try to do this on their own without being part of the C-suite conversation.

Sean’s takeaway was to understand what you want to accomplish. If AIOps is a journey and a strategy, then you have to build momentum. He says, “Building an AIOps strategy is a visionary thing – you’ve got to be looking to the future; you’ve got to be change tolerant.”

The reality is that most people don’t like change, especially those who have been managing operations systems a specific way “forever”. Starting small and building ROI around that use case will build that momentum and cause the smallest waves of change.

It also dispels myths about AI replacing people. Finally, since AI adoption is rather nascent at the moment, early adopters may have questions about where to begin with their strategic vision for implementation. This is an area where vendors can provide expertise, guidance, and support to organizations looking to take advantage of AI as an investment piece.

Catch the full details from AIOps Evolution Weekly.
