HowTo Configure Mellanox Nagios Plugin to support Mellanox Switches


    This post shows how to configure the Nagios monitoring tool to support Mellanox switches using the "nagios4mlnxos" plugin.

    The package generates the Nagios configuration needed to monitor Mellanox switches over SNMP.

     


     

    This procedure sets up the following Nagios services for Mellanox Ethernet switches:

    1. PING

    2. Uptime

    3. Software version

    4. CPU Load

    5. Memory utilization

    6. Network interfaces administrative status

    7. Network interfaces operational status

    8. Power supplies status

    9. Fans status

    10. Temperature status

     

    Server Requirements

    1. Python 2.6 or Python 2.7, with the following Python modules:

    •   argparse
    •   yaml
    •   voluptuous

     

    2. Perl modules: Net::SNMP
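    The Python part of these requirements can be checked before building the package. The following is a minimal sketch, not part of nagios4mlnxos: the helper name missing_modules is hypothetical, and the script should be run with the same interpreter that will run the plugin. It does not check the Perl Net::SNMP module. Note that the yaml module is provided by the PyYAML package.

```python
# Python modules listed in the server requirements above.
REQUIRED_MODULES = ["argparse", "yaml", "voluptuous"]


def missing_modules(names):
    """Return the module names from `names` that fail to import."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    else:
        print("All required Python modules are available.")
```

    The helper only attempts the imports, so it works on Python 2.6, 2.7 and 3.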

     

     

    Configuration

    1. Clone the Mellanox nagios4mlnxos Git repository:

    # git clone https://github.com/Mellanox/nagios4mlnxos

     

    2. Build the package

     

    Enter the build directory:

    # cd build

    Invoke the build.sh script:

    # ./build.sh

    The nagios4mlnxos package will be created in the build directory.

     

    Extract the package to a temporary location:

    # cd /tmp

    # tar xvf nagios4mlnxos-1.0.0.tar.gz

    ...

    # cd nagios4mlnxos

     

    3. Edit the configuration file. For example:

    # cat conf/nagios4mlnxos.yaml

    snmp:

      port: 161

      timeout: 5

      retries: 3

      # SNMP protocol version. Possible values: 2, 3 (default: 2)

      version: 2

      snmpv2:

        community: public

      snmpv3:

        username: admin

        # Possible authentication protocols: MD5, SHA (default: MD5)

        auth_protocol: MD5

        auth_password: adminauth123

        # Possible privacy protocols: DES, AES (default: DES)

        privacy_protocol: DES

        privacy_password: adminpriv123

     

    hostgroups:

      #<hostgroup>: <hostgroup switches CSV file path> e.g. new-york: /tmp/nyc-switches.csv

      new-york: /tmp/nyc-switches.csv

     

    services:

      ping:

        warning_threshold: 200.0,20%

        critical_threshold: 600.0,60%

      cpu_load:

        warning_threshold: 71

        critical_threshold: 90

      memory_utilization:

        warning_threshold: 88

        critical_threshold: 90
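    The thresholds in the services section are easy to get wrong, for example by setting a warning level above the critical level. The sketch below is one way to sanity-check them before running the script. It assumes the ping thresholds follow the "<round-trip ms>,<packet loss>%" format used by the standard Nagios check_ping plugin, and it uses a plain dictionary that mirrors the example above instead of loading the YAML file. The function names are hypothetical and are not part of nagios4mlnxos.

```python
def parse_ping_threshold(value):
    """Split '<rta>,<loss>%' into (round-trip time in ms, packet loss in percent)."""
    rta_text, loss_text = value.split(",")
    loss_text = loss_text.strip()
    if not loss_text.endswith("%"):
        raise ValueError("packet loss must end with '%': " + value)
    return float(rta_text), float(loss_text[:-1])


def badly_ordered(services):
    """Return the names of services whose warning level is not below critical."""
    bad = []
    for name in sorted(services):
        warning = services[name]["warning_threshold"]
        critical = services[name]["critical_threshold"]
        if name == "ping":
            w_rta, w_loss = parse_ping_threshold(warning)
            c_rta, c_loss = parse_ping_threshold(critical)
            ok = w_rta < c_rta and w_loss < c_loss
        else:
            ok = warning < critical
        if not ok:
            bad.append(name)
    return bad


# Mirrors the "services" section of the example configuration.
SERVICES = {
    "ping": {"warning_threshold": "200.0,20%", "critical_threshold": "600.0,60%"},
    "cpu_load": {"warning_threshold": 71, "critical_threshold": 90},
    "memory_utilization": {"warning_threshold": 88, "critical_threshold": 90},
}

if __name__ == "__main__":
    print("Badly ordered thresholds: %s" % badly_ordered(SERVICES))
```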


    4. Create or edit the CSV file that lists the switches in each hostgroup. In this example, edit /tmp/nyc-switches.csv:

     

    # cat /tmp/nyc-switches.csv

    name,address,alias

    NYC-SX1,10.209.24.102,NYC-SX1012-1

    NYC-SX2,10.209.25.107,NYC-SX1400-2
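    A malformed CSV file is a common source of errors at this step. The following sketch shows one way to validate the file before invoking the script. It is written for Python 3 and is independent of the plugin. It assumes the header is exactly name,address,alias as in the example and that every address is a literal IP address; hostnames, if the plugin accepts them, would be rejected by this check. The name load_switches is hypothetical.

```python
import csv
import io
import ipaddress

EXPECTED_COLUMNS = ["name", "address", "alias"]

# Same content as the example /tmp/nyc-switches.csv above.
SAMPLE = """name,address,alias
NYC-SX1,10.209.24.102,NYC-SX1012-1
NYC-SX2,10.209.25.107,NYC-SX1400-2
"""


def load_switches(handle):
    """Read switch rows from an open CSV file and validate them."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError("unexpected header: %r" % (reader.fieldnames,))
    switches = []
    seen = set()
    for row in reader:
        # Raises ValueError when the address is not a valid IPv4/IPv6 address.
        ipaddress.ip_address(row["address"])
        if row["name"] in seen:
            raise ValueError("duplicate switch name: " + row["name"])
        seen.add(row["name"])
        switches.append(dict(row))
    return switches


if __name__ == "__main__":
    for switch in load_switches(io.StringIO(SAMPLE)):
        print(switch["name"], switch["address"], switch["alias"])
```

    To check a real file, pass an open file object instead of the in-memory sample, for example open("/tmp/nyc-switches.csv", newline="").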

     

    5. Invoke the script

    # ./nagios4mlnxos.py -c ./conf/nagios4mlnxos.yaml
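    The files the script generates are not reproduced in this post. Purely as an illustration of the kind of object definition Nagios expects for each switch, the sketch below renders a host definition from one CSV row. It is not the plugin's code and may not match its actual output: the render_host function, the choice of directives, and the generic-switch template name (taken from the Nagios Core sample configuration) are all assumptions.

```python
# Illustrative Nagios host object; the directive set is an assumption.
HOST_TEMPLATE = """define host {{
    use         {template}
    host_name   {name}
    alias       {alias}
    address     {address}
    hostgroups  {hostgroup}
}}
"""


def render_host(switch, hostgroup, template="generic-switch"):
    """Render a Nagios host definition from a dict with name, address and alias."""
    return HOST_TEMPLATE.format(
        template=template,
        hostgroup=hostgroup,
        name=switch["name"],
        alias=switch["alias"],
        address=switch["address"],
    )


if __name__ == "__main__":
    row = {"name": "NYC-SX1", "address": "10.209.24.102", "alias": "NYC-SX1012-1"}
    print(render_host(row, "new-york"))
```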

     

    6. Copy the files from the /tmp/nagios4mlnxos_output folder to the corresponding Nagios directories:

       - conf: generated Nagios configuration files

       - plugins: Nagios plugins which are used for monitoring Mellanox switches
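    The copy can be scripted. The sketch below is written for Python 3 and assumes that conf and plugins are subfolders of the output folder, as the list above suggests. The target directories depend on how Nagios was installed, so they are passed in as parameters; the paths in the usage comment are placeholders, and copy_output is a hypothetical helper, not part of nagios4mlnxos.

```python
import os
import shutil


def copy_output(output_dir, nagios_conf_dir, nagios_plugin_dir):
    """Copy the 'conf' and 'plugins' subfolders into the given Nagios directories.

    Returns the list of destination paths that were written.
    """
    copied = []
    pairs = (("conf", nagios_conf_dir), ("plugins", nagios_plugin_dir))
    for subfolder, target in pairs:
        source = os.path.join(output_dir, subfolder)
        if not os.path.isdir(source):
            raise IOError("missing folder: " + source)
        for root, _dirs, files in os.walk(source):
            relative = os.path.relpath(root, source)
            dest_root = os.path.normpath(os.path.join(target, relative))
            os.makedirs(dest_root, exist_ok=True)
            for name in files:
                destination = os.path.join(dest_root, name)
                # copy2 preserves permissions, which plugins need to stay executable.
                shutil.copy2(os.path.join(root, name), destination)
                copied.append(destination)
    return copied


# Example call (placeholder paths; adjust to the local Nagios installation):
# copy_output("/tmp/nagios4mlnxos_output",
#             "/path/to/nagios/etc/objects",
#             "/path/to/nagios/libexec")
```

    Existing files with the same names in the target directories are overwritten, so back them up first if they were edited by hand.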

     

    For examples of these files, refer to the ZIP file attached to this post.

     

    7. Reload the Nagios configuration:

    # service nagios reload

     

     

    Examples

     

    The examples are based on two switches, NYC-SX1 and NYC-SX2:

     

    1. Global status (hostgroup)

    In the global status, you can see that switch NYC-SX1 is OK while NYC-SX2 has two critical errors.

     

    [Screenshot: nagios_hostgroups.jpg]

     

     

    2. Switch NYC-SX1 - Good status

    [Screenshot: nagios_sx1_no_errors.jpg]

     

     

     

    3. Switch NYC-SX2 - 2 critical errors

     

    In this example, you can see that some ports are down and the power supply is in a critical state.

     

    [Screenshot: nagios_sx2_with_errors.jpg]