Machine & Deep Learning on-premises with Red Hat OpenShift 4.7
By providing vast amounts of data as training sets, computers have been made capable of making decisions and learning autonomously. AI is usually undertaken in conjunction with machine learning, deep learning, and big data analytics. Throughout the history of AI, the major limitations have been computational power, CPU-memory-GPU bandwidth, and high-performance storage. Machine learning demands immense computational power, keeping processors and GPUs at extremely high utilization sustained for hours, days, or weeks. In this article we discuss the alternatives for these types of workloads, from cloud environments such as Azure, AWS, IBM Cloud, or Google Cloud to on-premises deployments with secure containers running on Red Hat OpenShift on IBM Power Systems.
Before running an AI model, there are a few things to keep in mind: the choice of hardware and software tools matters as much as the algorithm chosen to solve a particular problem. Before evaluating the best options, we must first understand the prerequisites for an AI runtime environment.
What Are the Hardware Prerequisites to Run an AI Application?
- High computing power (GPUs can accelerate deep learning by up to 100x compared with standard CPUs)
- Storage capacity and disk performance
- Networking infrastructure (10 GbE and up)
What Are the Software Prerequisites to Run an AI Application?
- Operating system
- Computing environment and its variables
- Libraries and other binaries
- Configuration files
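To make these prerequisites concrete, here is a minimal sanity-check script. It assumes a Python environment with TensorFlow installed; the framework choice and the `CUDA_VISIBLE_DEVICES` variable are illustrative assumptions, not requirements:

```python
# check_env.py -- quick sanity check of the software prerequisites above.
import os
import platform

import tensorflow as tf  # assumes `pip install tensorflow`

# Operating system and Python runtime
print("OS:", platform.platform())
print("Python:", platform.python_version())

# Computing environment and its variables
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))

# Libraries and other binaries
print("TensorFlow:", tf.__version__)
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))
```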
Now that we know the prerequisites for an AI setup, let's dive into the components and the best possible combinations. There are two choices for an AI deployment: cloud and on-premises. As noted above, neither is inherently better; the right choice depends on each situation.
Cloud infrastructure
The best-known general-purpose cloud providers are:
- Amazon Web Services (AWS)
- Microsoft Azure
- Google Cloud Platform (GCP)
- IBM Cloud
In addition to these, there are clouds specialized for machine learning. These ML-specific clouds offer GPUs rather than CPUs alone for better computation, plus specialized software environments. They are ideal for small workloads that involve no confidential or sensitive data. However, when we have to upload and download many terabytes of data, or run intensive models for weeks at a time, being able to cut those times down to days or hours on our own servers saves a great deal of cost. We will talk about this next.
On-premises servers and Platform-as-a-Service (PaaS) deployments
These are specialized servers hosted on an AI company's own premises. Many vendors provide highly customized as well as built-from-scratch on-prem AI servers. For example, IBM's AC922 and IC922 are well suited to an on-premises AI setup.
When choosing between the two, companies must consider future growth and the trade-off between current needs and expenses. If your company is a startup, cloud AI servers are usually best, because they eliminate the worry of installation at fairly affordable rates. But as the company grows and more data scientists join, cloud computing stops easing the burden. In that case, technology experts recommend on-prem AI infrastructure for better security, customization, and room to expand.
Choosing the best hardware architecture
Almost all cloud platforms now offer GPU-backed computation, since a GPU can be nearly 100 times more powerful than the average CPU, especially for machine learning on computer vision. The real problem, however, is the data transfer rate between your site and the cloud server, no matter how many GPUs are attached. That is why on-prem AI setups keep gaining favor: data movement is no longer a bottleneck.
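As a rough illustration of the CPU/GPU compute gap mentioned above, the sketch below times a large matrix multiplication on both devices. It assumes PyTorch is installed; the actual speedup depends heavily on the model and hardware:

```python
import time
import torch

def time_matmul(device: str, n: int = 4096, reps: int = 10) -> float:
    """Average seconds per n-by-n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for setup before timing
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # GPU kernels launch asynchronously
    return (time.perf_counter() - t0) / reps

cpu = time_matmul("cpu")
print(f"CPU: {cpu * 1e3:.1f} ms per matmul")
if torch.cuda.is_available():
    gpu = time_matmul("cuda")
    print(f"GPU: {gpu * 1e3:.1f} ms per matmul ({cpu / gpu:.0f}x faster)")
```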
The second point to consider is the bandwidth between GPUs and CPUs. On traditional architectures such as Intel, this traffic travels over PCIe channels. NVIDIA developed a high-speed interconnect called NVLink, which IBM integrated into its POWER9 processors so that NVIDIA GPUs and POWER9 cores can communicate directly, without intermediate layers. This multiplies the bandwidth, which is already more than twice as high between processor and memory. The result: no more bottlenecks.
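The interconnect difference can be probed the same way. The sketch below, again assuming PyTorch, measures host-to-GPU copy bandwidth; on an NVLink-attached POWER9 system this figure is markedly higher than over a PCIe-only link, though absolute numbers vary by machine:

```python
import time
import torch

assert torch.cuda.is_available(), "no CUDA-capable GPU visible"

# ~1 GiB of float32 data; pinned host memory enables fast DMA transfers
x = torch.empty(1024, 1024, 256).pin_memory()

torch.cuda.synchronize()
t0 = time.perf_counter()
x.to("cuda", non_blocking=True)
torch.cuda.synchronize()  # wait for the asynchronous copy to finish
elapsed = time.perf_counter() - t0

gib = x.numel() * x.element_size() / 2**30
print(f"Host -> GPU: {gib / elapsed:.1f} GiB/s")
```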
Having covered the software prerequisites for running AI applications, let us now consider the best software environment for optimal AI performance.
What data-center architecture is best for AI/DL/ML?
When talking about servers for AI software infrastructure, the traditional design was virtualization: simply dividing computational resources among separate operating systems. Each independent operating-system environment is called a "virtual machine." Running AI applications on virtual machines brings multiple constraints. One is overhead: each virtual machine needs resources for the entire system, OS operations as well as AI operations, consuming extra computation and storage. Moreover, it is not easy to move a running program from one virtual machine to another without resetting the environment variables.
What is a container?
To solve this virtualization problem, the concept of the "container" comes in. A container is an independent software environment on top of a shared operating system, bundling a complete runtime environment: the AI application together with its dependencies, libraries, binaries, and configuration files, as a single entity. Containerization brings extra advantages: AI operations execute directly in the container, and the OS does not have to mediate every command (saving massive numbers of data-flow round trips). Second but no less important, it is relatively easy to move a container from one platform to another, since the transfer does not require changing environment variables. This approach lets data scientists focus on the application rather than the environment.
Red Hat OpenShift Container Platform
The leading containerization platform built on Linux is Red Hat's OpenShift Container Platform (an on-prem PaaS) based on Kubernetes. Its foundations are CRI-O containers, with Kubernetes handling container orchestration. The latest version of OpenShift is 4.7; its major updates are independence from Docker and improved security.
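You can verify the CRI-O runtime on a cluster with the Kubernetes Python client. This is a minimal sketch, assuming `pip install kubernetes` and a valid kubeconfig for the cluster:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    info = node.status.node_info
    # On OpenShift 4.x this prints something like "cri-o://1.20.0"
    print(node.metadata.name, info.container_runtime_version)
```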
NVIDIA GPU Operator for OpenShift containers
NVIDIA and Red Hat OpenShift have come together to assist in running AI applications. When using GPUs as high-compute processors, the biggest problem is virtualizing or distributing GPU power across containers. The NVIDIA GPU Operator for Red Hat OpenShift is a Kubernetes Operator that mediates the scheduling and distribution of GPU resources. Since the GPU is a special resource in the cluster, a few components must be installed before application workloads can be deployed onto it (see the pod sketch after this list):
- NVIDIA drivers
- a GPU-enabled container runtime for Kubernetes
- container device plugin
- automatic node labelling rules
- monitoring components
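Once the operator has these components in place, pods request GPUs through the `nvidia.com/gpu` resource. Below is a sketch using the Kubernetes Python client; the pod name, namespace, and container image are illustrative assumptions, not fixed values:

```python
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="cuda-smoke-test"),  # illustrative name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:11.2.2-base-ubi8",  # illustrative image tag
                command=["nvidia-smi"],                # prints the visible GPUs
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}     # one GPU from the operator
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```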
The most common use cases that leverage GPUs for acceleration are image processing, computer audition, conversational AI using NLP, and computer vision using artificial neural networks.
Computing Environment for Machine Learning
There are several AI computing environments for testing and running AI applications; at the top are TensorFlow, Microsoft Azure, Apache Spark, and PyTorch. Among these, TensorFlow (created by Google) is the most widely chosen. TensorFlow is a production-grade, end-to-end open-source platform with libraries for machine learning. The primary data unit in both TensorFlow and PyTorch is the tensor. The best thing about TensorFlow is that it uses dataflow graphs for its operations. A graph works much like a flowchart, and the program keeps track of the success or failure of each data flow. This saves a great deal of time: if one data flow fails, there is no need to go back to the baseline of the operation, and other sub-models can still be tested.
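A small example makes the dataflow-graph idea concrete. In TensorFlow 2.x, `tf.function` traces ordinary Python code into a reusable graph; this is a minimal sketch:

```python
import tensorflow as tf

@tf.function
def affine(x, w, b):
    # each operation below becomes a node in the traced dataflow graph
    return tf.matmul(x, w) + b

x = tf.random.normal([4, 3])
w = tf.random.normal([3, 2])
b = tf.zeros([2])

print(affine(x, w, b))  # the first call traces the graph; later calls reuse it

# inspect the underlying graph that TensorFlow built from the Python code
graph = affine.get_concrete_function(x, w, b).graph
print([op.name for op in graph.get_operations()])
```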
So far, we have discussed the choices for establishing an AI infrastructure, both hardware and software. It is not easy to select a single product that covers all the desired components of an AI system. IBM offers both hardware and software for efficient, cost-effective AI research and development: IBM Power Systems and IBM PowerAI, respectively.
IBM Power Systems
IBM Power Systems provides flexible, need-based components for running an AI system, with accelerated servers such as the IBM AC922 for ML training and the IBM IC922 for ML inference.
IBM PowerAI
IBM PowerAI is an execution platform for AI environments that enables efficient deep learning, machine learning, and AI applications by exploiting the full power of IBM Power Systems and NVIDIA GPUs. It provides many optimizations that accelerate performance, improve resource utilization, simplify installation and customization, and prevent management problems. It also ships ready-to-use deep learning frameworks such as Theano, TensorFlow, and Caffe (BVLC and IBM variants), along with supporting tools and libraries such as Bazel and OpenBLAS.
Which is the best server for on-premises deployments? Comparing the IBM AC922 and the IBM IC922
If you need servers that can withstand machine learning loads at high utilization for months or years without interruption, Intel systems are not an option. NVIDIA DGX systems are also available, but since they cannot virtualize GPUs, running several different learning models means buying more graphics cards, which makes them much more expensive. The choice of server will also depend on budget: the IC922 (designed for AI inference and high-performance Linux workloads) costs about half as much as the AC922 (designed for training AI models), so for small projects it can be perfectly suitable.
If you are interested in these technologies, request a demonstration without obligation. We have programs in which we can assign you a complete server for a month, so you can test the advantages of this hardware architecture first-hand.