Artificial intelligence (AI) is a huge, controversial topic. We often hear about it in the mainstream media as if it were this deliberate evil provocateur, like a Skynet, that is about to control all of humanity. However, actual use cases like self-driving cars and robotics are a little less threatening. And when it comes to our digital devices, AI is already embedded in most of the products we use today — our phones and apps, for example, already use AI for things like spell checking, noise reduction, and face recognition. Banks are now using AI to detect fraud, and healthcare is using AI to make MRI scans faster and cheaper, improving outcomes for patients.
At the same time, most companies are making the transition to cloud-native technologies. This new software stack adopts containerization and leverages open-source tools like Docker and Kubernetes to increase agility, performance, and scalability for digital services. Even highly regulated industries are becoming cloud native.
While maintenance of cloud-native technology can start easily, the burden can quickly mount as multi-cluster and multi-cloud modes begin to materialize. And to optimize their products and software release processes, companies often look to run more complex workloads and extract insights from their data.
I recently sat down with Tobi Knaup, CEO and co-founder of D2iQ, to explore the current and future role of AI within the cloud-native stack. According to Knaup, “The companies that can figure out how to use AI in their products will be the market leaders of tomorrow.” The relationship cuts both ways: cloud-native platforms can host and run AI/ML computations, and AI can in turn enhance the management of cloud-native architecture.
Using AI and cloud-native architecture
Running AI with cloud-native tools offers many benefits. One benefit of Kubernetes is centralization: it makes sense to run related components, such as microservices, data services, and AI workloads, on the same platform. “Kubernetes is a fantastic platform for running AI workloads,” said Knaup. “You need an intelligent cloud-native platform to run these AI/ML workloads – many of the AI problems have been solved in cloud-native.”
Another critical challenge facing AI/ML projects is figuring out day 2 operations. While organizations may have many data science experts to build and train models, actively deploying and running those models is a different story altogether. This lack of understanding could be the reason why 85% of AI projects ultimately fail to deliver on their intended business promises. Cloud-native technologies like Kubernetes offer a way to actively run these models as an online service that adds value to the mission-critical product, says Knaup.
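To make the “model as an online service” idea concrete, here is a minimal sketch of serving a trained model over HTTP using only the Python standard library. The linear `predict` function is a stand-in for a real trained model, and all names and weights are invented for illustration; a production deployment would package something like this in a container and run it on Kubernetes behind a proper serving framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a trained model: a fixed linear scorer. In a real system
# this would be a model loaded from a registry or artifact store.
def predict(features):
    weights = [0.4, 0.6]
    return sum(w * x for w, x in zip(weights, features))

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and score it with the model.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep request logging quiet

def make_server(port=0):
    # Port 0 asks the OS for any free port; convenient for local testing.
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Deployed this way, the model becomes just another stateless service that the usual cloud-native machinery — health checks, autoscaling, rolling updates — can manage like any microservice.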
Benefits of running AI with cloud-native components
AI/ML and cloud native have similar deployment patterns. The AI/ML field is still relatively young. As it turns out, many of the best practices that DevOps teams have developed around cloud-native software can also be applied to AI/ML. For example, CI/CD, observability, and blue-green deployments fit well with the specific needs of AI/ML. “You can build a very similar delivery pipeline for AI/ML as for microservices,” says Knaup. This is another reason it makes sense to run such workloads on Kubernetes.
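As a toy illustration of how a blue-green pattern carries over to models, the sketch below gates promotion of a candidate (“green”) model on an evaluation score. The accuracy numbers and the 0.90 quality bar are invented for illustration, not taken from the article.

```python
# Toy blue-green promotion gate for a model service: route traffic to the
# candidate ("green") model only if it clears a quality bar on holdout
# data and beats the currently live ("blue") model.
def choose_live(blue_accuracy, green_accuracy, bar=0.90):
    if green_accuracy >= bar and green_accuracy >= blue_accuracy:
        return "green"
    return "blue"
```

In a real pipeline this decision would drive a traffic switch (for example, updating a Kubernetes Service selector) rather than returning a string, but the gating logic is the same idea.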
Cloud native brings elasticity and resource allocation to AI. AI/ML typically requires very elastic computation – training a model on a new dataset can become quite resource intensive and demand GPUs. And when many data scientists are building models and competing for resources, you need an intelligent way to allocate compute and disk space. Cloud-native schedulers can solve this problem by allocating resources intelligently. Some toolsets, like Fluid and Volcano, are explicitly designed for AI/ML scenarios.
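As a concrete sketch of how a training job declares its GPU needs to the scheduler, the snippet below builds a Kubernetes pod spec as a Python dict (real manifests are usually YAML) using the standard `nvidia.com/gpu` extended resource name. The image and job names are placeholders.

```python
import json

# Build a pod spec for a GPU training job, expressed as a Python dict for
# illustration; in practice this would be a YAML manifest applied with
# kubectl. The name and image below are invented placeholders.
def training_pod_spec(name, image, gpus):
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": image,
                "resources": {
                    # Declaring GPU limits lets the Kubernetes scheduler
                    # place the pod only on nodes with free accelerators.
                    "limits": {"nvidia.com/gpu": gpus},
                },
            }],
            "restartPolicy": "Never",
        },
    }

if __name__ == "__main__":
    print(json.dumps(training_pod_spec("train-job", "example/trainer:latest", 2), indent=2))
```

Because the resource request is declarative, competing training jobs simply queue until GPUs free up — the scheduler, not the data scientist, arbitrates access.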
You benefit from the agility of open source. Cloud-native open source projects tend to move very quickly when the community works together. There is similar activity around AI/ML open source tools like Jupyter Notebook, Torch, or TensorFlow that are cloud and Kubernetes native. While there are concerns about the security of open source software, at the end of the day, the more eyes on open source code, the better. “With AI being built into so many things, we need to be able to question what decisions AI is making,” explains Knaup.
Cloud native does not mean cloud dependent. First, a machine learning model must be trained with a large data set. It’s typically far more cost-effective to run heavy number-crunching AI on-premises than in the cloud. But after these models are trained, companies will likely want to perform inference at the edge, closer to where new data is ingested. Kubernetes is great in this regard because it is flexible enough to run in these different operating environments.
“Data has gravity,” says Knaup, and the computation should follow it. With K8s as your abstraction layer, you can design a workload once and run it in any environment, be it a security camera system, the manufacturing floor, or even aboard F-16 fighter jets.
Using AI/ML to improve cloud-native
On the other hand, there are many ways that artificial intelligence can help manage and optimize cloud-native technology. “You can make an endless list,” says Knaup.
Using AI to automate root cause analysis. First, AI could help human operators diagnose problems with their cloud-native tools more efficiently. Kubernetes is quite complex and can be integrated with many other components, such as a service mesh for ingress control or OPA for policy management.
When an error occurs in such a complex distributed system, it is often difficult to find the cause of the problem. Engineers have to deal with metrics and data sources from many sources to troubleshoot the problem. In doing so, they often follow similar patterns in aggregating this data. Using AI to find these patterns could help human operators diagnose problems more effectively. This would reduce time to resolution, which in turn would increase overall availability and security.
Using AI to predict and prevent problems. Another prospect is using AI to detect and prevent problems before they occur. It is common in marketing to use end-user data to inform predictive analytics. But what valuable insights could we uncover by applying predictive analytics to cloud-native metrics? Suppose a monitoring tool can predict that a given disk will be at 80% utilization in four hours based on past usage. Platform engineers could then make the appropriate changes with enough time to avoid service disruption. Such predictive service level indicators could become another useful benchmark for SREs.
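The disk example above can be sketched with a simple linear extrapolation over past utilization samples; a real predictive monitoring tool would use far more robust models, and the sample data here is invented.

```python
# Fit a least-squares line to hourly disk-utilization samples and
# estimate how many hours remain until usage crosses a threshold.
def hours_until(samples, threshold=80.0):
    """samples: utilization % at hourly intervals, oldest first."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None  # usage flat or shrinking; no crossing predicted
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (n - 1))  # hours from the latest sample

usage = [60, 62, 64, 66, 68, 70]  # invented samples: +2% per hour
```

With the invented data above, the fit predicts the 80% threshold is five hours away — exactly the kind of early-warning signal a predictive SLI would expose.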
Using AI to optimize performance. There is also plenty of room for AI to suggest optimizations that improve how cloud-native infrastructure runs. The results could shed light on which controls need tuning to improve computational efficiency, or on how best to schedule machine learning workloads.
If we consider AI/ML and cloud native, it’s a win-win situation. Cloud-native technology can support AI/ML’s goals of elasticity, scalability, and performance. At the same time, there are many benefits that AI can bring to streamline maintenance of cloud-native architecture.
AI is an emerging field with thousands of algorithms now in open source. TensorFlow Hub alone has hundreds of free, open-source machine learning models for working with text, images, audio, and video. Knaup therefore recommends relying on an open source strategy for AI.
However, working successfully with AI boils down to finding the right algorithm for your use case. Although there are a relatively small number of algorithm classifications, it takes expertise to figure out which one works best for your problem and apply it to your situation, Knaup explains. “You have to understand the problem space and know how to apply these top-notch AI algorithms,” he said.