4 Replies Latest reply on Feb 2, 2014 5:23 AM by ale

    Mellanox eSwitchd issue on Openstack Havana nova-compute

    ale

      Hi all,

      I have an SR-IOV enabled installation of OpenStack Havana (RDO) on CentOS 6.5 compute nodes. I followed the docs at Mellanox-Neutron-Havana-Redhat - OpenStack to set up Neutron on the nova-compute nodes to use the openstack-neutron-mellanox plugin and eSwitchd (Mellanox eSwitchd Installation for OpenFlow and OpenStack).

      I'm able to boot a VM with a properly configured IB VF, but if I reboot the hypervisor the nova-compute service always crashes with the following error (/var/log/nova/compute.log):


      2014-01-28 15:02:11.187 6608 DEBUG mlnxvif.vif [req-608df241-5698-4e52-9b99-3d196179e725 None None] vif_type=hostdev plug /usr/lib/python2.6/site-packages/mlnxvif/vif.py:92

      2014-01-28 15:02:11.188 6608 DEBUG nova.openstack.common.processutils [req-608df241-5698-4e52-9b99-3d196179e725 None None] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf ebrctl add-port fa:16:3e:62:c2:c3 75c41eaf-77e1-49ba-bff5-368bedea66ca default hostdev None execute /usr/lib/python2.6/site-packages/nova/openstack/common/processutils.py:147

      2014-01-28 15:02:11.406 6608 DEBUG nova.openstack.common.processutils [req-608df241-5698-4e52-9b99-3d196179e725 None None] Result was 1 execute /usr/lib/python2.6/site-packages/nova/openstack/common/processutils.py:172

      2014-01-28 15:02:11.406 6608 DEBUG mlnxvif.vif [req-608df241-5698-4e52-9b99-3d196179e725 None None] Error in Plug: Unexpected error while running command.

      Command: sudo nova-rootwrap /etc/nova/rootwrap.conf ebrctl add-port fa:16:3e:62:c2:c3 75c41eaf-77e1-49ba-bff5-368bedea66ca default hostdev None

      Exit code: 1

      Stdout: ''

      Stderr: 'ERROR:eswitchd.cli.conn_utils:Action  plug_nic failed: Plug vnic failed\nError in add-port commandAction  plug_nic failed: Plugvnic failed' plug /usr/lib/python2.6/site-packages/mlnxvif/vif.py:106

      2014-01-28 15:02:11.409 6608 ERROR nova.openstack.common.threadgroup [-] Processing Failure during vNIC plug

      2014-01-28 15:02:11.409 6608 TRACE nova.openstack.common.threadgroup Traceback (most recent call last):

       

      The eSwitchd log shows (/var/log/eswitchd/eswitchd.log):

       

      2014-01-28 15:02:10,780 DEBUG Handling message - {u'action': u'get_vnics', u'fabric': u'*'}

      2014-01-28 15:02:10,780 DEBUG fabrics =['default']

      2014-01-28 15:02:10,780 DEBUG vnics are {'14:05:00:00:00:08': {'mac': '14:05:00:00:00:08', 'device_id': '75c41eaf-77e1-49ba-bff5-368bedea66ca'}, '14:05:00:00:00:07': {'mac': '14:05:00:00:00:07', 'device_id': '29314aa9-6ddc-421b-ad82-5090f8ccaecb'}, '14:05:00:00:00:06': {'mac': '14:05:00:00:00:06', 'device_id': 'd481dfde-0795-4a2e-89ca-3369cb49cbe1'}}

      2014-01-28 15:02:11,393 DEBUG Handling message - {u'fabric': u'default', u'dev_name': u'None', u'vnic_type': u'hostdev', u'action': u'plug_nic', u'vnic_mac': u'fa:16:3e:62:c2:c3', u'device_id': u'75c41eaf-77e1-49ba-bff5-368bedea66ca'}

      2014-01-28 15:02:11,394 ERROR Plug NIC: Didn't find dev for MAC:14:05:00:00:00:06 and device_id:75c41eaf-77e1-49ba-bff5-368bedea66ca

      2014-01-28 15:02:11,394 DEBUG Resync devices

      2014-01-28 15:02:12,780 DEBUG Handling message - {u'action': u'get_vnics', u'fabric': u'*'}

      2014-01-28 15:02:12,780 DEBUG fabrics =['default']

      2014-01-28 15:02:12,780 DEBUG vnics are {'14:05:00:00:00:08': {'mac': '14:05:00:00:00:08', 'device_id': '75c41eaf-77e1-49ba-bff5-368bedea66ca'}, '14:05:00:00:00:07': {'mac': '14:05:00:00:00:07', 'device_id': '29314aa9-6ddc-421b-ad82-5090f8ccaecb'}, '14:05:00:00:00:06': {'mac': '14:05:00:00:00:06', 'device_id': 'd481dfde-0795-4a2e-89ca-3369cb49cbe1'}}

       

      As you can see from the logs above, eSwitchd reports the correct IDs of the VMs, but the vNIC MAC addresses are wrong (for example, instance 75c41eaf-77e1-49ba-bff5-368bedea66ca should have MAC address fa:16:3e:62:c2:c3).
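To make the mismatch explicit, here is a minimal sketch (Python 3) that parses the "vnics are" debug line quoted above and compares it with the MAC that nova passes to ebrctl add-port. The function names parse_vnics and find_mismatches are hypothetical helpers written for this illustration and are not part of eSwitchd; the sketch also assumes the payload after "vnics are" is always a plain Python dict literal, as it is in the log excerpt.

```python
import ast

# Log line copied from the eSwitchd log quoted in this post.
LOG_LINE = (
    "2014-01-28 15:02:10,780 DEBUG vnics are "
    "{'14:05:00:00:00:08': {'mac': '14:05:00:00:00:08', "
    "'device_id': '75c41eaf-77e1-49ba-bff5-368bedea66ca'}, "
    "'14:05:00:00:00:07': {'mac': '14:05:00:00:00:07', "
    "'device_id': '29314aa9-6ddc-421b-ad82-5090f8ccaecb'}, "
    "'14:05:00:00:00:06': {'mac': '14:05:00:00:00:06', "
    "'device_id': 'd481dfde-0795-4a2e-89ca-3369cb49cbe1'}}"
)


def parse_vnics(line):
    """Return {device_id: mac} from a 'vnics are {...}' debug line.

    Hypothetical helper; assumes the payload is a Python dict literal.
    """
    marker = "vnics are "
    payload = line[line.index(marker) + len(marker):]
    vnics = ast.literal_eval(payload)
    return {entry["device_id"]: entry["mac"] for entry in vnics.values()}


def find_mismatches(reported, expected):
    """Return {device_id: (reported_mac, expected_mac)} where they differ."""
    return {
        dev: (reported.get(dev), mac)
        for dev, mac in expected.items()
        if reported.get(dev) != mac
    }


# Expected MAC taken from the 'ebrctl add-port' command in compute.log.
expected = {"75c41eaf-77e1-49ba-bff5-368bedea66ca": "fa:16:3e:62:c2:c3"}
reported = parse_vnics(LOG_LINE)
mismatches = find_mismatches(reported, expected)
print(mismatches)
```

For the instance above, the result pairs the MAC eSwitchd reports (14:05:00:00:00:08) with the one nova expects (fa:16:3e:62:c2:c3).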

      The Neutron Mellanox agent log also shows a configuration inconsistency (/var/log/neutron/mlnx-agent.log):


      2014-01-28 12:15:00.438 4499 INFO neutron.plugins.mlnx.agent.eswitch_neutron_agent [-] eSwitch Agent Started!

      2014-01-28 12:15:00.438 4499 INFO neutron.plugins.mlnx.agent.eswitch_neutron_agent [-] Agent out of sync with plugin!

      2014-01-28 12:15:00.439 4499 DEBUG neutron.plugins.mlnx.agent.utils [-] get_attached_vnics get_attached_vnics /usr/lib/python2.6/site-packages/neutron/plugins/mlnx/agent/utils.py:75

      2014-01-28 12:15:00.440 4499 DEBUG neutron.plugins.mlnx.agent.eswitch_neutron_agent [-] Agent loop process devices! daemon_loop /usr/lib/python2.6/site-packages/neutron/plugins/mlnx/agent/eswitch_neutron_agent.py:391

      2014-01-28 12:15:00.440 4499 DEBUG neutron.plugins.mlnx.agent.eswitch_neutron_agent [-] Ports added! process_network_ports /usr/lib/python2.6/site-packages/neutron/plugins/mlnx/agent/eswitch_neutron_agent.py:298

      2014-01-28 12:15:00.441 4499 INFO neutron.plugins.mlnx.agent.eswitch_neutron_agent [-] Adding port with mac 14:05:00:00:00:08

      2014-01-28 12:15:00.443 4499 DEBUG neutron.plugins.mlnx.agent.utils [-] get_attached_vnics get_attached_vnics /usr/lib/python2.6/site-packages/neutron/plugins/mlnx/agent/utils.py:75

      2014-01-28 12:15:00.548 4499 DEBUG neutron.plugins.mlnx.agent.eswitch_neutron_agent [-] Device with mac_address 14:05:00:00:00:08 not defined on Neutron Plugin treat_devices_added /usr/lib/python2.6/site-packages/neutron/plugins/mlnx/agent/eswitch_neutron_agent.py:350

       

      For more information, this is the configuration of the 'hostdev' device of VM 75c41eaf-77e1-49ba-bff5-368bedea66ca:

       

      [root@n08 ~]# virsh dumpxml instance-00000014 |grep -A5 '<hostdev'

          <hostdev mode='subsystem' type='pci' managed='no'>

            <source>

              <address domain='0x0000' bus='0x01' slot='0x01' function='0x0'/>

            </source>

            <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>

          </hostdev>

      [root@n08 ~]# cat /sys/class/infiniband/mlx4_0/iov/0000\:01\:01.0/ports/1/gid_idx/0

      8

      [root@n08 ~]# cat /sys/class/infiniband/mlx4_0/iov/ports/1/{admin_guids,gids}/8

      14050000000008

      fe80:0000:0000:0002:0014:0500:0000:0008

      [root@n08 ~]# cat /sys/class/infiniband/mlx4_0/iov/0000\:01\:01.0/ports/1/pkey_idx/{0,1}

      0

      none

       

      It seems that, after a reboot, eSwitchd does not write the correct MAC address to /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids/8.
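The two sysfs values above describe the same GUID: the lower 64 bits of an InfiniBand GID are the port GUID, and the admin_guids entry appears to be printed without leading zeros. A minimal sketch (Python 3) of that comparison follows. guid_from_gid and same_guid are hypothetical helper names, and the sketch assumes the admin_guids value is hexadecimal, which is inferred from the output shown here rather than taken from driver documentation.

```python
def guid_from_gid(gid):
    """Return the lower 64 bits of a fully expanded GID as 16 hex digits."""
    groups = gid.split(":")
    if len(groups) != 8 or any(len(g) != 4 for g in groups):
        raise ValueError("expected 8 colon-separated groups of 4 hex digits")
    # The first four groups are the subnet prefix; the last four are the GUID.
    return "".join(groups[4:]).lower()


def same_guid(admin_guid, gid):
    """Compare an admin_guids value with a GID, ignoring leading zeros.

    Assumption: admin_guid is a hexadecimal string.
    """
    return int(admin_guid, 16) == int(guid_from_gid(gid), 16)


# Values copied from the sysfs output in this post.
gid = "fe80:0000:0000:0002:0014:0500:0000:0008"
admin_guid = "14050000000008"

print(guid_from_gid(gid))
print(same_guid(admin_guid, gid))
```

So GUID slot 8 holds a value that does not encode the VM's MAC address fa:16:3e:62:c2:c3, which is consistent with the wrong MAC that eSwitchd reports for this VF.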

       

      Can anybody help me, please?

       

      Thank you very much in advance.

       

      Ale

       

      Message was edited by: Alessandro Federico Added more info and logs.

        • Re: Mellanox eSwitchd issue on Openstack Havana nova-compute
          ale

          I was able to successfully restart the nova-compute service by adding the correct mac address to the correct iov ports in the following way.

          • find the MAC address of the VM using nova list and neutron port-list commands

          [root@n01 ~(keystone_admin)]# nova list | grep centos0

          | 75c41eaf-77e1-49ba-bff5-368bedea66ca | centos0 | SHUTOFF | None      | Shutdown    | net1=10.0.0.2 |

          [root@n01 ~(keystone_admin)]# neutron port-list | grep '10.0.0.2'

          | 00e377b5-c050-47f5-b3f9-8a807ff2ac7e |      | fa:16:3e:62:c2:c3 | {"subnet_id": "a82ea062-53e3-4dc3-8e1e-1524d7dbb2c4", "ip_address": "10.0.0.2"}      |

          • on the nova-compute host owning the VM, find the IOV VF attached to the VM (in my case it is 0000:01:01.0)

          [root@n08 ~]# virsh dumpxml instance-00000014 | grep -A5 '<hostdev'

              <hostdev mode='subsystem' type='pci' managed='no'>

                <source>

                  <address domain='0x0000' bus='0x01' slot='0x01' function='0x0'/>

                </source>

                <alias name='hostdev0'/>

                <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>

          • find the GUID index of the VF

          [root@n08 ~]# cat /sys/class/infiniband/mlx4_0/iov/0000\:01\:01.0/ports/1/gid_idx/0

          8

          • add the MAC address of the VM (fa:16:3e:62:c2:c3) to the GUID table at index 8

          [root@n08 ~]# ebrctl write-sys /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids/8 fa163e000062c2c3

          • restart eswitchd (optional?) and neutron-mlnx-agent. The eSwitchd log now shows vNICs with the correct MAC addresses

          2014-02-01 15:58:01,460 DEBUG vnics are {'fa:16:3e:e3:8f:98': {'mac': 'fa:16:3e:e3:8f:98', 'device_id': 'd481dfde-0795-4a2e-89ca-3369cb49cbe1'}, 'fa:16:3e:1f:a5:94': {'mac': 'fa:16:3e:1f:a5:94', 'device_id': '29314aa9-6ddc-421b-ad82-5090f8ccaecb'}, 'fa:16:3e:62:c2:c3': {'mac': 'fa:16:3e:62:c2:c3', 'device_id': '75c41eaf-77e1-49ba-bff5-368bedea66ca'}}

          2014-02-01 15:58:02,048 DEBUG Handling message - {u'action': u'set_vlan', u'vlan': 1, u'fabric': u'default', u'port_mac': u'fa:16:3e:62:c2:c3'}

          2014-02-01 15:58:02,059 DEBUG Running command: sudo eswitch-rootwrap /etc/eswitchd/rootwrap.conf ebrctl write-sys /sys/class/infiniband/mlx4_0/iov/0000:01:01.0/ports/1/pkey_idx/0 1

          2014-02-01 15:58:02,228 DEBUG Command: ['sudo', 'eswitch-rootwrap', '/etc/eswitchd/rootwrap.conf', 'ebrctl', 'write-sys', '/sys/class/infiniband/mlx4_0/iov/0000:01:01.0/ports/1/pkey_idx/0', '1']

          • from the logs above, eswitchd correctly maps VF pkey index 0 to PF pkey index 1 (vlan 1), but it does not map VF pkey index 1 to the default pkey index 0, so I set that mapping by hand

          [root@n08 ~]# cat /sys/class/infiniband/mlx4_0/iov/0000\:01\:01.0/ports/1/pkey_idx/{0,1}

          1

          none

          [root@n08 ~]# echo 0 > /sys/class/infiniband/mlx4_0/iov/0000\:01\:01.0/ports/1/pkey_idx/1

          [root@n08 ~]# cat /sys/class/infiniband/mlx4_0/iov/0000\:01\:01.0/ports/1/pkey_idx/{0,1}

          1

          0

          • nova-compute service should now be happy to start ;-)
          • start your VM
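The MAC-to-GUID conversion used in the write-sys step above can be sketched as follows (Python 3). The rule, inserting 0000 between the first three and last three bytes of the MAC, is inferred only from the single example in this thread (fa:16:3e:62:c2:c3 became fa163e000062c2c3) and has not been checked against the eSwitchd source. mac_to_admin_guid and admin_guid_path are hypothetical helper names.

```python
import re

MAC_RE = re.compile(r"(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}")


def mac_to_admin_guid(mac):
    """Build the 16-hex-digit value written to admin_guids for a MAC.

    Assumption: the GUID is the MAC with '0000' inserted in the middle,
    as seen in the one example in this thread.
    """
    if not MAC_RE.fullmatch(mac):
        raise ValueError("not a colon-separated MAC address: %r" % mac)
    digits = mac.replace(":", "").lower()
    return digits[:6] + "0000" + digits[6:]


def admin_guid_path(device, port, index):
    """Return the sysfs path of one admin GUID slot (layout as in this thread)."""
    return "/sys/class/infiniband/%s/iov/ports/%d/admin_guids/%d" % (
        device, port, index)


guid = mac_to_admin_guid("fa:16:3e:62:c2:c3")
path = admin_guid_path("mlx4_0", 1, 8)

# Print the command from the step above instead of running it.
print("ebrctl write-sys %s %s" % (path, guid))
```

The sketch only prints the command, so it is safe to run on any machine; the index (8 here) still has to come from the VF's gid_idx file as shown in the steps.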

           

          ale