Deploying a Scalable Deep Learning Solution in Production with TensorFlow: A Reference Design with Mellanox and ParallelM

Version 7

    This document describes an orchestration platform reference design for deep learning that enables enterprises to quickly train, deploy and monitor a model in production using state-of-the-art technologies from Mellanox, ParallelM and open source.

    TensorFlow, one of the most popular Machine Learning and Deep Learning frameworks, is used in our reference design, but the solution can be extended to other frameworks such as SparkML, Caffe and Torch.






    Many devices, from sensors to robotic components, generate increasingly rich data: simple time series of values, text, images, sound and video. Similarly, applications such as image gathering and video recording generate tremendous amounts of data. While the data itself is valuable, the ultimate benefit to the business's bottom line comes from the insights hidden within it, which analytics must extract.

    The increasing richness and volume of data require more complex Machine Learning (ML) and Deep Learning (DL) approaches to extract these valuable insights.

    The ubiquity of high-performance commodity computing has made it possible to match data growth with similarly scalable Deep Learning capabilities. This ubiquity is driven by massive core count increases in individual CPUs, low-cost cloud computing services, and hardware suited specifically for Deep Learning, such as GPUs and high-speed interconnects.

    However, deploying Deep Learning effectively in a production environment remains a challenge: configurations are complex, Deep Learning must be managed and supported continuously once in production, and scaling training performance requires more efficient hardware.

    This document describes a reference design for realizing the value of Deep Learning in production using state-of-the-art technologies from Mellanox, ParallelM and open source. For our reference design, we chose TensorFlow, one of the most popular Machine Learning and Deep Learning frameworks.

    This reference design accomplishes two key objectives:

    • Fastest time to train, using leading tools and frameworks

    • Fastest time to inference and retrain with the ability to rapidly manage, optimize and move to production while ensuring Machine Learning prediction quality in a dynamic environment


    Mellanox's Machine Learning

    Mellanox solutions accelerate many of the world’s leading artificial intelligence and machine learning platforms and a wide range of applications, from security, finance, and image and voice recognition to self-driving cars and smart cities.

    Mellanox solutions enable companies and organizations such as Baidu, NVIDIA, Facebook and PayPal, among others, to leverage Machine Learning platforms to enhance their competitive advantages.

    This post demonstrates how to build an efficient Machine Learning cluster using native RDMA (Remote Direct Memory Access) over a 100Gbps InfiniBand or RoCE (RDMA over Converged Ethernet) network.


    ParallelM MLOps Suite

    ParallelM is the company behind MLOps software.

    The biggest challenge in driving a successful AI initiative is effectively managing the performance and business impact of Machine Learning in production.

    Taking Machine Learning into production incurs unique challenges brought on by the need to collaborate between the diverse practices of Ops and Data Science, the complicated nature of Machine Learning behavior, and the complexities of runtime analytics environments.

    ParallelM's software solution addresses these issues so that enterprises can apply the power of Machine Learning in production to transform and scale their businesses.


    Example Use Case

    Facial recognition is the use case we will use to demonstrate how individual facial images are analyzed in real time, and the actions taken depending on whether an individual is recognized.

    The following example illustrates the expected scenario based on a building's security system:

    • A camera-based entry authorization system is installed at the entrance.
      The camera takes a picture of the individual which is then analyzed to determine whether or not the individual should be granted entry.
    • This analysis can utilize Deep Learning models that are trained on both general facial recognition algorithms and data specific to the building tenants.
      Examples can include pre-training on individuals who have entry permission as well as specific individuals without entry permission (such as security threats, recently departed employees etc.). The entry decision must be made in real time.
    • The deep learning models used in the real-time prediction need to be refreshed periodically to factor in (a) new employees, (b) new individuals to be denied entry, and (c) new model baselines for better facial recognition.
      These retrained models will then need to be uploaded to a real-time prediction system.

      Note that this use case can be applied beyond building security. For example, it can be used in a Retail or Hotel context to provide better customer service etc.
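    As an illustration of the real-time decision described above, the following is a minimal Python sketch, not the reference design's actual implementation. It assumes that a Deep Learning model has already converted each face image into an embedding vector, and that entry is decided by cosine similarity against two hypothetical galleries (allowed and denied). The function names, the 0.8 threshold and the deny-first ordering are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Embedding = List[float]


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(embedding: Embedding, gallery: Dict[str, Embedding]) -> Tuple[str, float]:
    """Return the (name, score) of the closest gallery entry, or ("", -1.0) if empty."""
    best_name, best_score = "", -1.0
    for name, reference in gallery.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def decide_entry(
    embedding: Embedding,
    allowed: Dict[str, Embedding],
    denied: Dict[str, Embedding],
    threshold: float = 0.8,  # hypothetical value, would be tuned in practice
) -> Tuple[str, str]:
    """Return (decision, matched_name); decision is "grant" or "deny"."""
    # Assumption: a match on the deny list takes priority over the allow list.
    denied_name, denied_score = best_match(embedding, denied)
    if denied_score >= threshold:
        return "deny", denied_name
    allowed_name, allowed_score = best_match(embedding, allowed)
    if allowed_score >= threshold:
        return "grant", allowed_name
    # Unknown individuals are rejected by default.
    return "deny", ""
```

    For example, with allowed = {"alice": [1.0, 0.0, 0.0]} and denied = {"mallory": [0.0, 1.0, 0.0]}, an embedding close to Alice's reference vector is granted entry, while an embedding that matches neither gallery is denied. Retraining, as described in the last bullet, would change the model that produces the embeddings and the contents of both galleries.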


    Reference Design

    Hardware and Software Configuration

    The overall reference configuration shown in Figure 1 illustrates the following:

    • Image data arrives from the camera at the real-time prediction system, which analyzes the images and decides whether to accept or reject each individual
    • The models used by the prediction system are updated periodically by a training system, which uses an image database to train the Deep Learning models
    • When new models are generated by the training system, they are applied to the prediction system based on the administrator-specified policy


    Figure 1



    The software components of the reference design include the ParallelM MLOps Center solution for Machine Learning management in production, the TensorFlow analytics engine for Deep Learning training and the Flink analytic engine for real-time Deep Learning prediction, each of which is described briefly below:

    • ParallelM MLOps Center software:
      MLOps enables deployment and management of Machine Learning pipelines in production while ensuring they scale and work properly together. ParallelM organizes your pipelines into virtual containers that abstract away the underlying compute infrastructure while managing logical relationships between pipelines and maintaining data-awareness across training and inference pipelines no matter where they are running.
      By combining sophisticated code instrumentation, distributed data correlation and advanced Machine Learning analytic algorithms, ParallelM provides insight into Machine Learning prediction quality in real-time, alerts on issues and drives corrective actions.
    • Flink Streaming Analytic Engine:
      The Flink Streaming analytic engine is based on open source Flink 1.3, which supports streaming and batch Machine Learning as well as Deep Learning.
      Deep Learning is supported via Flink-TensorFlow integration, by which network inference can be performed on data ingested either as streams (from stream brokers such as Kafka etc.) or in batch (from data lakes such as HDFS etc.).
    • TensorFlow:
      The Google Brain team developed TensorFlow, an open source software library, as a Machine Learning and Deep Neural Networks research tool.
      The library performs numerical computation using data flow graphs, where the nodes in the graph represent mathematical operations and the graph edges represent the multidimensional data arrays (tensors) communicated between the nodes. To use TensorFlow with GPU and RDMA support, the system must have both an NVIDIA GPU (minimum compute capability of 3.0) and a Mellanox network (InfiniBand or Ethernet).


    The hardware configuration includes the Training System (composed of the Network, the Parameter Server, and 4 Worker Servers) and the Prediction System (composed of one Prediction Server).

    • The configuration of the Training System includes the network, the Parameter Server and the Worker Servers, as shown in Figure 2
    • The configuration of the Prediction System is the same as that of the Worker Servers

    Figure 2
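    The Training System layout described above (one Parameter Server and 4 Worker Servers) can be written down as a cluster description. The sketch below is an assumption-laden illustration: the host names and port are hypothetical placeholders, and the helper functions are not part of any product. The resulting dictionary has the job-name-to-address-list shape that TensorFlow's tf.train.ClusterSpec accepts in TensorFlow 1.x, but the sketch itself does not import TensorFlow.

```python
from typing import Dict, List, Tuple


def build_cluster_spec(
    num_workers: int = 4,  # 4 Worker Servers, as in the reference design
    num_ps: int = 1,       # one Parameter Server, as in the reference design
    port: int = 2222,      # hypothetical port
) -> Dict[str, List[str]]:
    """Map each job name ("ps", "worker") to its list of host:port addresses."""
    # Host names below are placeholders, not real machines.
    return {
        "ps": [f"ps{i}.example.local:{port}" for i in range(num_ps)],
        "worker": [f"worker{i}.example.local:{port}" for i in range(num_workers)],
    }


def task_assignments(cluster: Dict[str, List[str]]) -> List[Tuple[str, int, str]]:
    """List (job_name, task_index, address) for every task in the cluster."""
    return [
        (job, index, address)
        for job in sorted(cluster)
        for index, address in enumerate(cluster[job])
    ]


if __name__ == "__main__":
    for job, index, address in task_assignments(build_cluster_spec()):
        print(f"{job}:{index} -> {address}")
```

    Each server in the Training System would be started with its own job name and task index from this list, so that workers know where to find the Parameter Server.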



    Use Case Operation

    Figure 3 illustrates how the configuration defined above executes the use case:

    Figure 3


    • MLOps Center executes the entire use case as an Intelligent Overlay Network (ION), which links together the training and prediction pipelines.
    • The ION used in this reference design (Figure 3) is composed of two parallel executing pipelines, each built and deployed using MLOps Center.
      Pipeline (A) executes on the Training System and performs periodic retraining of the model to be used for the facial recognition while Pipeline (B) runs on the Prediction System and performs the real-time determination based on the stream of images intercepted from the camera.
    • These two pipelines are connected via a Policy that executes within MLOps Center. Models generated by Pipeline A are transferred to MLOps Server which executes the policy to determine whether the new model should be transferred to Pipeline B.
      In this example, the policy of “Always update” is used, meaning that all models generated by Pipeline A will be transferred to Pipeline B for update.
    • On the Training System, Pipeline A executes on the TensorFlow analytic engine.
      On the Prediction System, Pipeline B executes on the Flink analytic engine.
      Both pipelines are deployed and managed by MLOps Server via the MLOps Agent which manages each analytic engine.
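    The role of the policy between the two pipelines can be sketched in a few lines of Python. This is a conceptual illustration only and does not use or represent the MLOps Center API. The Model and PredictionPipeline classes, the accuracy field and the update_if_better policy are hypothetical names introduced here; only the "Always update" behavior comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Model:
    version: int
    accuracy: float  # hypothetical validation metric attached by training


# A policy receives the currently active model and a candidate, and
# returns True if the candidate should replace the active model.
Policy = Callable[[Optional[Model], Model], bool]


def always_update(current: Optional[Model], candidate: Model) -> bool:
    """The "Always update" policy: every new model is accepted."""
    return True


def update_if_better(current: Optional[Model], candidate: Model) -> bool:
    """Hypothetical alternative: accept only if accuracy does not regress."""
    return current is None or candidate.accuracy >= current.accuracy


@dataclass
class PredictionPipeline:
    """Stand-in for Pipeline B: holds the model used for real-time prediction."""
    active: Optional[Model] = None
    history: List[int] = field(default_factory=list)

    def offer(self, candidate: Model, policy: Policy) -> bool:
        """Apply the policy to a model produced by Pipeline A."""
        if policy(self.active, candidate):
            self.active = candidate
            self.history.append(candidate.version)
            return True
        return False
```

    With always_update, a PredictionPipeline that is offered versions 1 and 2 ends up serving version 2 regardless of their metrics. Swapping in a stricter policy changes which retrained models reach prediction without touching either pipeline.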


    Note: a number of different DNN (Deep Neural Network) algorithms can be used within Pipeline A. Section 4.0 shows the performance possible for several state-of-the-art image detection Deep Learning algorithms (Inception V3, ResNet 50 and ResNet 152). Each is evaluated with publicly available data (ImageNet), for which the accuracy of each of these algorithms has already been demonstrated.

    Performance and Scale

    Distributed Training with RDMA and High-Speed Network

    Modern Neural Networks are trained on large datasets to obtain the highest prediction accuracy during the inference phase.
    Neural Network applications range from simple speech and image recognition to industry-focused real-time security and fraud detection.

    However, training these large Neural Networks is computationally intensive. Although huge strides have been made in GPU hardware and storage infrastructure, training such workloads on a single machine can still take a long time, sometimes months. This is why all leading deep learning frameworks support distributed training with RDMA, which allows a faster Time-To-Train (TTT).

    RDMA enables efficient parallelization of Neural Network training on many nodes with many GPUs, making training up to 2x faster across multiple frameworks compared to a regular TCP/IP network stack.
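    To make the relationship between training throughput and Time-To-Train concrete, here is a small arithmetic sketch. The throughput figures in the usage note are arbitrary placeholders chosen to show a 2x ratio; they are not measurements from this reference design. The dataset size is the number of images in the ImageNet (ILSVRC-2012) training set, and the function names are hypothetical.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size


def time_to_train_hours(
    images_per_sec: float,
    epochs: int,
    dataset_size: int = IMAGENET_TRAIN_IMAGES,
) -> float:
    """Hours needed to process dataset_size * epochs images at a given throughput."""
    if images_per_sec <= 0:
        raise ValueError("images_per_sec must be positive")
    return dataset_size * epochs / images_per_sec / 3600.0


def speedup(baseline_images_per_sec: float, improved_images_per_sec: float) -> float:
    """Throughput ratio; at a fixed number of epochs this is also the TTT ratio."""
    if baseline_images_per_sec <= 0:
        raise ValueError("baseline_images_per_sec must be positive")
    return improved_images_per_sec / baseline_images_per_sec
```

    For instance, if a TCP-based run sustained a placeholder 500 images/sec and an RDMA-based run sustained 1000 images/sec, speedup(500, 1000) would be 2.0 and the time to train for a fixed number of epochs would be halved.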

    Furthermore, a high-bandwidth network becomes critical to support ingestion of large training datasets from HDFS (or any other scale-out storage).

    Below you will find performance benchmark results for InceptionV3 and ResNet-50 over TCP and RDMA for the training part of the system:



    We ran two popular benchmarks, Inception V3 and ResNet-50, on synthetic data.
    Testing was done using a tf.Variable set to the same shape as the data expected by each model for ImageNet. This tests both the underlying hardware and the framework at preparing data for actual training.

    The server hardware and configurations used for the TCP and RDMA benchmarks are identical. More details on the experiments, including specific configuration settings for the network, NVIDIA GPUs and algorithms, can be found in the Mellanox community posts for both InfiniBand and RoCE.

    The server setup for the runs includes 4 worker servers (as explained in Section 3 of this document).

    To create reproducible results, each test was run 3 times and the results were averaged together.
    GPUs were run in their default state on the given platform.
    For each test, 10 warmup steps were done and the following 100 steps were averaged.
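    The measurement procedure described above (3 runs, each with 10 warmup steps followed by 100 measured steps) can be sketched as a small timing harness. This is an illustrative sketch and not the benchmark code actually used. The step_fn callable stands in for one training step, and the choice to average per-step durations before converting to images per second is an assumption, since the text does not say exactly how the 100 steps were averaged.

```python
import time
from statistics import mean
from typing import Callable, List


def measure_run(
    step_fn: Callable[[], None],
    batch_size: int,
    warmup_steps: int = 10,
    measured_steps: int = 100,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run warmup steps untimed, then return images/sec over the measured steps."""
    for _ in range(warmup_steps):
        step_fn()  # warmup: results are discarded
    durations: List[float] = []
    for _ in range(measured_steps):
        start = clock()
        step_fn()
        durations.append(clock() - start)
    # Assumption: throughput is batch size divided by the mean step duration.
    return batch_size / mean(durations)


def benchmark(
    step_fn: Callable[[], None],
    batch_size: int,
    runs: int = 3,
    **run_kwargs,
) -> float:
    """Repeat measure_run several times and average the throughput results."""
    return mean([measure_run(step_fn, batch_size, **run_kwargs) for _ in range(runs)])
```

    In a real setup, step_fn would execute one training step of the model on a batch of synthetic data, and batch_size would be the total number of images processed per step across all GPUs.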

    The following results provide a high-level summary of potential performance:



    [Benchmark charts: Inception V3, training with synthetic data; two further charts, training with synthetic data]


    Managing Deep Learning Lifecycle in Production

    These are the steps for running and managing the above configuration and its full lifecycle in production.

    • Build and Deploy: The build and deploy process occurs via the MLOps Workspace (part of the MLOps Center software).
      User accounts (both operator and data scientist) can define the required pipelines, link them into an Intelligent Overlay Network (ION) and deploy into production.
      Example of the ION builder window in MLOps Workspace:


    • Orchestrate: Once deployed, MLOps Center automates all of the orchestration of the running ION, including model (re)training, updates of new models into the prediction system, and the application of policies.
    • Monitor and Diagnose: All running IONs can be monitored by both Operations and Data Science via the MLOps Workspace, in order to verify that the system is predicting correctly. Examples of dashboard views are shown below.






    • Collaborate: Different organizations (such as operations and data science) can collaborate efficiently via the MLOps Workspace. Users from different disciplines can share data via Snapshots (Figure 8), jointly test new algorithms on production data via Sandbox and deploy proven algorithms directly into production via the Promote to Production feature. An example of collaboration via Snapshot is shown in the figure below.