Deploying A Scalable Deep Learning Solution in Production with Tensorflow: A Reference Design with Mellanox and ParallelM

Version 3

    This document describes a reference design for an orchestration platform for deep learning that enables enterprises to quickly train models and then deploy and monitor them in production using state-of-the-art technologies from Mellanox, ParallelM and Open Source. For our reference design, we chose TensorFlow, one of the most popular ML/DL frameworks, but the solution can easily be extended to other frameworks such as SparkML, Caffe, Torch and others.






    Devices from sensors to robots are generating increasing amounts of richer data, ranging from simple value time series to text, images, sound and video. Similarly, applications (such as image gathering and video recording) generate massive amounts of data. While the data itself is valuable, its ultimate benefit to the business's bottom line comes from the analytics that extract the insights hidden within.

    The increasing volume and richness of this data require more complex Machine Learning (ML) and Deep Learning (DL) approaches to extract insights. The ubiquity of high-performance commodity computing, driven by massive core-count increases in individual CPUs and low-cost cloud computing services, together with hardware suited specifically for DL (such as GPUs) and high-speed interconnects, has made it possible to match data growth with similarly scalable DL capabilities.

    However, effectively deploying deep learning in production remains a challenge due to the complexity of configurations, the need for efficient hardware to scale training performance, and the complexities of continuously managing and supporting DL in production.

    This document describes a reference design for realizing the value of Deep Learning in production using state of the art technologies from Mellanox, ParallelM and Open Source. For our reference design, we chose Tensorflow, one of the most popular ML/DL frameworks, but the solution can be easily extended to other frameworks such as SparkML, Caffe, Torch and others.

    This reference design accomplishes two key objectives:

    • Fastest Time to Train, with support for leading tools and frameworks right out of the box

    • Fastest Time to Inference and retrain, with the ability to rapidly move models into production and manage and optimize them there while ensuring ML prediction quality in a dynamic environment


    Mellanox Machine Learning Solutions

    Mellanox Solutions accelerate many of the world’s leading artificial intelligence and machine learning platforms and a wide range of applications, ranging from security, finance, and image and voice recognition, to self-driving cars and smart cities. Mellanox solutions enable companies and organizations such as Baidu, NVIDIA, Facebook, PayPal and more to leverage machine learning platforms to enhance their competitive advantage.

    In this document, we will show how to build a highly efficient Machine Learning cluster enhanced by native RDMA (Remote Direct Memory Access) over a 100Gb/s InfiniBand or RoCE (RDMA over Converged Ethernet) network.


    ParallelM MLOps Suite

    ParallelM is the MLOps software company. The biggest challenge in driving successful AI initiatives is effectively managing the performance and business impact of Machine Learning in production. Taking ML into production incurs unique challenges brought on by the need for collaboration between the diverse practices of Ops and Data Science, the complicated nature of ML behavior, and the complexities of runtime analytics environments. ParallelM's software solution solves these issues so that enterprises can apply the power of Machine Learning in production to transform and scale their businesses.


    Example Use Case

    We illustrate the deployment with an example use case of facial recognition. In this use case, individual facial images are analyzed in real time and various actions are taken depending on whether the individual is recognized. An example below, based on a Building Security system, illustrates the expected usage:

    • At the entrance to a building is a camera-based entry authorization system. The camera takes a picture of the entering individual, and the picture is analyzed to determine whether the individual should be granted entry.
    • This analysis can utilize deep learning models that are trained on both general facial recognition and on data specific to the building tenants. Examples can include pre-training on individuals who are allowed admission as well as on specific individuals who should be denied admission (security threats, recently departed employees, etc.). The admission decision needs to be made in real time.
    • The deep learning models used in the real-time prediction will need to be periodically refreshed to (a) factor in new employees, (b) factor in new individuals to be denied entry, and (c) accommodate new model baselines for better facial recognition. These retrained models will then need to be uploaded to the real-time prediction system.
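As a purely hypothetical sketch of the prediction-side logic described above (the function names, threshold, and matching scheme are our own illustration, not part of the reference design), the real-time admission decision can be viewed as matching a face embedding from the trained model against allow and deny lists:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def admission_decision(embedding, allowed, denied, threshold=0.8):
    """Return 'deny', 'allow', or 'unknown' for a face embedding.

    `allowed` and `denied` map person IDs to reference embeddings
    produced by the (periodically retrained) recognition model.
    """
    # Deny-list matches take precedence over allow-list matches.
    for ref in denied.values():
        if cosine_similarity(embedding, ref) >= threshold:
            return "deny"
    for ref in allowed.values():
        if cosine_similarity(embedding, ref) >= threshold:
            return "allow"
    return "unknown"  # e.g. escalate to a human guard
```

Periodic retraining (points (a)–(c) above) simply replaces the reference embeddings and, potentially, the embedding model itself; the decision logic is unchanged.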

      Note that this use case can be applied beyond Building Security. For example, it can be used in a Retail or Hotel context to provide better customer service.


    Reference Design

    Hardware and Software Configuration

    The overall reference configuration is shown in Figure 1.

    • Image data coming in from the camera is fed in real time to the Prediction System, which analyzes the images and decides whether to accept or reject the individuals.
    • The models used by the Prediction System are updated periodically by a Training System which uses an image database to train the Deep Learning models.
    • When new models are generated by the Training System, they are applied to the Prediction System based on the administrator specified policy.


    The software components of the reference design include the ParallelM MLOps Center solution for production Machine Learning management, the TensorFlow analytic engine for deep learning training, and the Flink analytic engine for deep learning real-time prediction. Each of these is described briefly below.

    • ParallelM’s MLOps Center software enables deployment and management of ML pipelines in production while at the same time ensuring they scale and work properly. ParallelM organizes your pipelines into virtual containers that abstract away the underlying compute infrastructure while managing logical relationships between pipelines and maintaining data-awareness across training and inference pipelines no matter where they are running. By combining sophisticated code instrumentation, distributed data correlation, and advanced ML analytic algorithms, ParallelM provides insight into ML prediction quality in real-time, alerts on issues, and drives corrective action.
    • Flink Streaming Analytic Engine: The Flink streaming analytic engine is based on open source Flink 1.3 and provides both streaming and batch ML and DL support. DL support is provided via the Flink-TensorFlow integration, by which deep learning network inference can be performed on streaming data. Data can be ingested either as streams (via stream brokers such as Kafka) or as batches (from data lakes such as HDFS).
    • TensorFlow: TensorFlow is an open source software library developed by the Google Brain team for conducting machine learning and deep neural network research. The library performs numerical computation using data flow graphs, where the nodes in the graph represent mathematical operations and the graph edges represent the multidimensional data arrays (tensors) that are communicated between the nodes. In order to use TensorFlow with GPU and RDMA support, you must have an NVIDIA GPU (minimum compute capability of 3.0) and a Mellanox network (InfiniBand or Ethernet).
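To make the dataflow-graph idea concrete, here is a minimal pure-Python sketch (not TensorFlow itself) in which nodes are operations and edges carry the values flowing between them; TensorFlow applies the same build-then-execute model to multidimensional tensors:

```python
class Node:
    """A graph node: an operation applied to the values of its input nodes."""
    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = list(inputs)

def constant(value):
    """A source node with no inputs, analogous to a constant tensor."""
    return Node(lambda: value)

def add(a, b):
    return Node(lambda x, y: x + y, [a, b])

def mul(a, b):  # scalar stand-in for a tensor op such as a matrix multiply
    return Node(lambda x, y: x * y, [a, b])

def run(node):
    """Evaluate a node by recursively evaluating its inputs,
    analogous to executing part of the graph in a session."""
    args = [run(n) for n in node.inputs]
    return node.op(*args)

# Build the graph first, then execute it: y = (2 + 3) * 4
y = mul(add(constant(2), constant(3)), constant(4))
```

The separation between graph construction and graph execution is what lets a framework place different nodes on different devices (CPUs, GPUs) or machines.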


    The hardware configuration includes the Training System (composed of the network, the Parameter Server, and 4 Worker Servers) and the Prediction System (composed of one Prediction Server).

    • The configuration of the Training System, including the network, the Parameter Server and the Worker Servers, is shown in Figure 2
    • The configuration of the Prediction System is the same as that of the Worker Servers



    Use Case Operation

    The figure below shows how the configuration defined above executes the use case.


    • MLOps Center executes the entire use case as an Intelligent Overlay Network (ION) which links the training and prediction pipelines.
    • The ION used in this reference design (shown in Figure 3) is composed of two parallel executing pipelines, each built and deployed using MLOps Center. The first – Pipeline A, executes on the Training System and performs periodic retraining of the model to be used for the facial recognition. The second, Pipeline B, runs on the Prediction System and performs the real-time determination based on the stream of images coming in from the camera.
    • These two pipelines are connected via a Policy that executes within MLOps Center. Models generated by Pipeline A are passed to MLOps Server which executes the policy to determine whether the new model should be sent to Pipeline B. In this example – the policy of “Always update” is used – implying that all models generated by Pipeline A will be passed to Pipeline B for update.
    • On the Training System, Pipeline A executes on the TensorFlow analytic engine. On the Prediction System, Pipeline B executes on the Flink analytic engine. Both pipelines are deployed and managed by MLOps Server via the MLOps Agent which manages each analytic engine.
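The model-update flow between the two pipelines can be sketched as follows. This is an illustrative toy, not ParallelM's API: `apply_policy` and `push_to_prediction` are invented names standing in for the MLOps Server step that links Pipeline A to Pipeline B.

```python
def always_update(candidate_model, current_model):
    """The 'Always update' policy: every newly trained model is promoted."""
    return True

def apply_policy(policy, candidate_model, current_model, push_to_prediction):
    """Sketch of the policy step linking the training and prediction pipelines.

    `push_to_prediction` stands in for transferring an approved model
    to the prediction pipeline; returns the model now serving.
    """
    if policy(candidate_model, current_model):
        push_to_prediction(candidate_model)
        return candidate_model  # new model is now serving
    return current_model        # keep serving the previous model

# Example: with 'Always update', every model from the training pipeline is pushed.
pushed = []
serving = None
for model in ["model_v1", "model_v2", "model_v3"]:
    serving = apply_policy(always_update, model, serving, pushed.append)
```

A more conservative policy could, for instance, compare validation metrics of the candidate and current models before promoting.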


    Note: a number of different DNN (Deep Neural Network) algorithms can be used within Pipeline A. Section 4.0 shows the performance possible for several state of the art image detection Deep Learning algorithms (Inception V3, ResNet 50 and ResNet 152). Each is evaluated with publicly available data (ImageNet) for which accuracy for each of these algorithms has already been determined and demonstrated.

    Performance and Scale

    Distributed Training with RDMA and High-Speed Network

    Modern neural networks are trained on large datasets to obtain the highest prediction accuracy during the inference phase. The applications using neural networks range from simple speech and image recognition to industry-focused real-time security and fraud detection. Training these large neural networks is, however, computationally intensive. Though huge strides have been made in GPU hardware as well as storage infrastructure, training such workloads on a single machine can still take a long time, sometimes months. Hence, in order to enable faster Time-To-Train (TTT), all the leading deep learning frameworks support distributed training with RDMA.

    RDMA enables efficient parallelization of neural network training across many nodes with many GPUs and reduces training time by up to 2x across multiple frameworks when compared to a regular TCP/IP network stack. Further, a high-bandwidth network becomes critical to support ingestion of large training datasets from HDFS (or any other scale-out storage).
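The synchronous parameter-server pattern used by the Training System can be sketched with a toy, single-process model (our own illustration, using a trivial one-parameter regression rather than a real DNN). In the actual cluster, every gradient gather and weight broadcast below is a network transfer between the 4 Worker Servers and the Parameter Server, which is why interconnect bandwidth and RDMA latency directly bound the time to train:

```python
def worker_gradient(weights, shard):
    """Each worker computes a gradient on its own data shard.
    Toy model: mean-squared-error gradient for y = w * x."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    return [grad]

def parameter_server_step(weights, shards, lr=0.05):
    """One synchronous step: gather gradients from all workers,
    average them, update the weights, and broadcast them back."""
    grads = [worker_gradient(weights, s) for s in shards]  # gather (network)
    avg = [sum(g[i] for g in grads) / len(grads) for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, avg)]      # broadcast (network)

# 4 workers, each holding a shard of data generated by y = 3 * x
shards = [[(x, 3 * x)] for x in (1.0, 2.0, 3.0, 4.0)]
weights = [0.0]
for _ in range(200):
    weights = parameter_server_step(weights, shards)
# weights[0] converges to the true parameter, 3.0
```

In real training the "gradients" are tensors with millions of parameters per step, so the per-step transfer volume quickly saturates slower networks.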

    Here we provide our performance benchmark results for Inception V3 and ResNet-50 over TCP and RDMA for the training part of the system.



    We ran two popular benchmarks, Inception V3 and ResNet-50, using synthetic data. Testing with synthetic data was done by using a tf.Variable set to the same shape as the data expected by each model for ImageNet. This load tests both the underlying hardware and the framework's ability to prepare data for actual training.

    This document provides a high-level summary of potential performance. The server hardware and configurations used for the TCP and RDMA benchmarks are identical. More details on the experiments (including specific configuration settings for the network, NVIDIA GPUs and algorithms) are available in detailed Mellanox community posts for both InfiniBand and RoCE.

    The server setup for the runs included 4 worker servers and is described in Section 3 of this document.

    To produce results that are as repeatable as possible, each test was run 3 times and the results were averaged. GPUs were run in their default state on the given platform. For each test, 10 warmup steps were performed and the following 100 steps were averaged.
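The measurement procedure above can be expressed as a small helper. This is illustrative only (the published benchmark scripts are the authoritative source), assuming we record per-step durations for each run:

```python
def run_throughput(step_times, warmup=10, measured=100):
    """Average step rate over `measured` steps after `warmup` steps.

    `step_times` is a list of per-step durations in seconds for one run;
    returns None if the run is too short to measure. Multiply the result
    by the per-step batch size to obtain images/sec.
    """
    window = step_times[warmup:warmup + measured]
    if len(window) < measured:
        return None
    return measured / sum(window)  # steps per second over the window

def benchmark(runs, warmup=10, measured=100):
    """Average the per-run throughputs across repeated runs (3 in the text)."""
    rates = [run_throughput(r, warmup, measured) for r in runs]
    rates = [r for r in rates if r is not None]
    return sum(rates) / len(rates)
```

Discarding the warmup steps avoids counting one-time costs (graph setup, GPU clock ramp-up, memory allocation) against steady-state throughput.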



    Inception V3

    Training with Synthetic Data


    ResNet-50

    Training with Synthetic Data

    Managing the Deep Learning Lifecycle in Production

    This section covers the steps of running and managing the above configuration and its full lifecycle in production.

    • Build and Deploy: The build and deploy process occurs via the MLOps Workspace (part of the MLOps Center software). Users (both operators and data scientists) can define the required pipelines, link them into an Intelligent Overlay Network (ION) and deploy it into production. An example of the ION builder window in the MLOps Workspace is shown below.


    • Orchestrate: Once deployed, MLOps Center automates all of the orchestration of the running ION, including model (re)training, updates of new models into the prediction system, and the application of policies.
    • Monitor and Diagnose: All running IONs can be monitored by both Operations and Data Science via the MLOps Workspace, in order to verify that the system is predicting correctly. Examples of dashboard views are shown below.






    • Collaborate: Different organizations (such as operations and data science) can collaborate efficiently via the MLOps Workspace. Users from different disciplines can share data via Snapshots (Figure 8), jointly test new algorithms on production data via the Sandbox, and deploy proven algorithms directly into production via the Promote to Production feature. An example of collaboration via a Snapshot is shown in the figure below.