17 Replies Latest reply on May 14, 2013 6:46 AM by renderfarmer

    40GbE optimization and bandwidth testing

      Hi all.

       

      I just got my MCX313A-BCBT HCAs and MC2207310-020 optical cable.

       

      I installed and connected them without a hitch in Win2008R2 using WinOF-3_2_0, but when I ran a quick test to measure the bandwidth I got a very disappointing 1.37GB/s.

       

      I set up the benchmark by creating a StarWind RAM disk on my file server, sharing the directory, and then using Iometer to run a 1MB sequential read test with a queue depth of 16.

       

      I tried varying the read size and the queue depth, along with a couple of other parameters, but the results were always around 1.3GB/s.

       

      My questions are:

      1. Is my testing methodology flawed? If so, what's the best way to test max bandwidth in Win2008R2?
      2. What specific performance optimizations can I perform in the drivers to get closer to 40Gbps?

       

      Thanks.

        • Re: 40GbE optimization and bandwidth testing
          justinclift

          rimblock - Any thoughts for this?

          • Re: 40GbE optimization and bandwidth testing

            To make sure that the issue didn't lie elsewhere I did a few tests:

             

            1- I used SiSoftware Sandra 2013 to check that my HCAs were in PCIe-3 x8 mode.

            2- I tested the RAM disk: Using 32 workers, 1MB transfers, 16 I/Os per target I got 21GB/s.

             

            I re-tested across the HCAs and got about the same as before: 1.3GB/s.

             

            I ran the single Port Performance tuning on the card in the driver performance tab and it didn't change a thing.

             

            Does anyone have any real performance tuning suggestions, or answers as to why my HCAs appear so slow?

             

            Is there a better way of testing the bandwidth of my ad-hoc network?

             

            Thanks.

              • Re: 40GbE optimization and bandwidth testing

                Hi,

                I tried StarWind on Windows as an iSCSI server. It's quite bad.

                I suggest you build a Linux server with SCST on it as the target, and connect from the Windows host with iSCSI.

                You should get at least 2GB/s on iSCSI and 3.5GB/s on SRP. IPoIB is much slower than SRP.

                Also, datagram vs connected mode with a 56K MTU, rather than the max 4096 MTU in connected mode, will likely improve things.

                  • Re: 40GbE optimization and bandwidth testing
                    justinclift

                    bmac20 - Isn't connected mode the one with the big mtu's, not datagram mode?

                     

                    Unless I'm just reading your post wrong (kind of tired atm).

                    • Re: 40GbE optimization and bandwidth testing

                      Thanks, Bmac!

                       

                      My Mellanox card isn't IPoIB though. It's an actual 40Gbps Ethernet HCA. I'm sure the silicon is the same as the IB cards, but it's hard-wired to function as a pure Ethernet controller with the proper header and MAC address of an Ethernet controller.

                       

                      I appreciate the test suggestion, but I'm really only interested in getting my 40GbE cards working as efficiently as they can in Windows using TCP/IP. 1.3GB/s is pretty crap considering 40GbE is theoretically 5GB/s and PCIe3 x8 has a max throughput of about 6.5GB/s.
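
                      For reference, the arithmetic behind those figures can be sketched as below. This is only a rough illustration: it assumes decimal units (1 GB/s = 8 Gbps) and ignores Ethernet, TCP/IP, and SMB overhead, so the real ceiling is somewhat lower. The function names are made up for the example.

```python
# Convert between link rates (gigabits/s) and file-copy rates (gigabytes/s).
# Assumption: decimal units throughout, so 1 GB/s = 8 Gbps. Protocol overhead
# (Ethernet framing, TCP/IP, SMB) is ignored.

def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Gigabits per second -> gigabytes per second."""
    return gbps / 8.0

def gigabytes_per_s_to_gbps(gb_per_s: float) -> float:
    """Gigabytes per second -> gigabits per second."""
    return gb_per_s * 8.0

LINK_RATE_GBPS = 40.0     # nominal 40GbE line rate
MEASURED_GB_PER_S = 1.3   # Iometer result reported in this thread

ceiling = gbps_to_gigabytes_per_s(LINK_RATE_GBPS)
measured_gbps = gigabytes_per_s_to_gbps(MEASURED_GB_PER_S)
utilisation = measured_gbps / LINK_RATE_GBPS

print(f"40GbE ceiling:       {ceiling:.1f} GB/s")
print(f"Measured throughput: {measured_gbps:.1f} Gbps")
print(f"Link utilisation:    {utilisation:.0%}")
```

                      In other words, the share test is only using about a quarter of the nominal link rate.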

                       

                      My rendering software uses mapped network drives so that's what I'm testing against.

                       

                      I do plan on migrating to CentOS in the future but for now Windows will have to do.

                       

                      I'll definitely try datagram mode, though I tried larger MTUs and it didn't change anything.

                       

                      I updated the firmware on both cards and installed Windows 2012 on my file server as the 4.2 drivers have a lot more options than the 3.2 ones do in Win2008R2.

                       

                      There are 3 performance tuning options in the 4.2 drivers: Single Port, Multicast, and Single Stream. Which would be best suited for my type of application? I'm basically accessing large (500MB+) binary scene files and lots of textures (5-200MB images) off of a RAID array.

                        • Re: 40GbE optimization and bandwidth testing

                          Hmm, strange.

                          Have you tried turning TCP offload OFF in the driver? That used to make some difference for me with 20Gb cards using Ethernet. They used to achieve 1800MB/s, so you should get up to 3600MB/s (3.6GB/s) on 40Gb.

                          I'm guessing you're not using a 40Gb Mellanox managed switch either; I believe there are many benefits to doing so, like collision management.

                           

                          As for the tuning options: not Multicast; use either Single Port or Single Stream. But considering you're connecting via a Windows mapped network drive, which uses CIFS, that's your problem right there. CIFS just can't carry that much data.

                           

                          You need a SCSI transport like iSCSI, and then connect Windows using the iSCSI initiator over your IB IP network. 20Gb InfiniBand cards using firmware 2.7 and later get 1800MB/s read/write over iSCSI on an SCST target.

                           

                          Again, I'd seriously suggest that if you want real performance, forget Windows of any flavour as a SAN target. It is very bad at target mode.

                           

                          Use Ubuntu to set up SCST. It takes 30 mins. No need to recompile the kernel; it works just as well without that.

                           

                          Install Ubuntu 12.10,

                          then follow this doc: http://www.zimbio.com/Ubuntu+Linux/articles/5vq_mlaTjIT/How+To+Install+SCST+on+Ubuntu

                           

                          It's a bit fiddly understanding how to set up SCST, but I can send you the commands if you want. That will get you much better speed.

                          Also, make sure you use an LSI RAID card or something better than mobo RAID.

                           

                          Cheers

                            • Re: 40GbE optimization and bandwidth testing
                              Have you tried turning TCP offload OFF in the driver?

                              No, I'll give that a try.

                              But considering you're connecting via a Windows mapped network drive, which uses CIFS, that's your problem right there. CIFS just can't carry that much data.

                              A member of the Servethehome forums has a ConnectX 20Gb IB card that does 2000MB/s using the mapped starwind drive method in Windows2008R2, with IPoIB. So I figured my brand new ConnectX-3 40GbE card would get at least that...

                              I'm guessing you're not using a 40Gb Mellanox managed switch either; I believe there are many benefits to doing so, like collision management.

                              That might have something to do with it. I believe he had his card connected to a switch.

                              You need a SCSI transport like iSCSI, and then connect Windows using the iSCSI initiator over your IB IP network. 20Gb InfiniBand cards using firmware 2.7 and later get 1800MB/s read/write over iSCSI on an SCST target.

                              My limited understanding of iSCSI targets is that they can only be accessed by one machine; is that correct? I'm a 3D Artist, not an IT Pro so I really have limited knowledge of these things. I use Maya for my work and it requires a project directory where it organizes the scene assets. Each scene references textures and geometry that have to be accessible in the same directory structure to all of the render slaves that will be working on the frames. Directory mapping is very convenient for this purpose. SCST may be fast but if it doesn't accomplish what my specific job requires then it's just not suitable.

                              Use Ubuntu to set up SCST. It takes 30 mins. No need to recompile the kernel; it works just as well without that.

                              I'm game to give Linux a try, but Ubuntu has very limited support in 3D. I mostly use Autodesk and The Foundry products, which all target RHEL. It was always my intention to one day switch to CentOS.

                              Also, make sure you use an LSI RAID card or something better than mobo RAID.

                              I have a brand new LSI 9271-8iCC.

                          • Re: 40GbE optimization and bandwidth testing

                            bmac20 wrote:

                             

                            Hi,

                            I tried StarWind on Windows as an iSCSI server. It's quite bad.

                             

                             

                            May I ask what exactly was bad in your StarWind experience?

                        • Re: 40GbE optimization and bandwidth testing

                          Hi!

                           

                          Is there any chance you could try a kernel-mode RAM disk instead of StarWind's (which runs in user mode, as you may know)?

                           

                          Also, am I right in assuming that you are connecting to the disk via iSCSI and the connection is not local?

                            • Re: 40GbE optimization and bandwidth testing

                              Hi. I don't know if you're asking Bmac or me (the OP) but no, I'm not using iSCSI.

                               

                              As I already posted, a forum member on ServeTheHome was able to get 2GB/s using a 20Gbps ConnectX IB card from one Windows machine to another using plain network shares and a StarWind RAM disk. I'm simply trying to understand why my more advanced 40GbE ConnectX-3 won't give me more than 1.3GB/s in the same environment.

                               

                              I checked with the guy and he said he tried it both direct connect like I have mine, and through a 40Gbps switch and got the same 2GB/s results.

                            • Re: 40GbE optimization and bandwidth testing

                              Hi renderfarmer,

                               

                              Can you open a PowerShell window and enter: Get-SmbServerNetworkInterface

                              I'm curious to know if Windows thinks that RDMA is working. I don't often read the Mellanox forums, so it might be best to post on STH.

                                • Re: 40GbE optimization and bandwidth testing

                                  Hi, dba. I just posted this on the STH forums but here it is for anyone that is interested here:

                                   

                                  PS C:\Users\Administrator> Get-SmbServerNetworkInterface

                                   

                                  Scope Name          Interface Index     RSS Capable         RDMA Capable        Speed               IpAddress
                                  ----------          ---------------     -----------         ------------        -----               ---------
                                  *                   18                  True                True                40 Gbps             10.10.10.100
                                  *                   18                  True                True                40 Gbps             fe80::dd86:1d61:...
                                  *                   13                  True                False               1 Gbps              10.10.10.199
                                  *                   13                  True                False               1 Gbps              fe80::114d:f17d:...
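
                                  For anyone who wants to check output like this across several machines, here is a minimal sketch that parses the pasted table and lists the interfaces reporting RDMA Capable = True. It assumes the columns are separated by two or more spaces, as in the output above; parse_table and rdma_interfaces are hypothetical helper names, not part of any Windows tooling.

```python
import re

# Output pasted from the post above (IPv6 addresses are truncated in the original).
SAMPLE = """\
Scope Name          Interface Index     RSS Capable         RDMA Capable        Speed               IpAddress
----------          ---------------     -----------         ------------        -----               ---------
*                   18                  True                True                40 Gbps             10.10.10.100
*                   18                  True                True                40 Gbps             fe80::dd86:1d61:...
*                   13                  True                False               1 Gbps              10.10.10.199
*                   13                  True                False               1 Gbps              fe80::114d:f17d:...
"""

def parse_table(text):
    """Split a whitespace-aligned table into a list of dicts keyed by header.

    Assumption: columns are separated by runs of two or more spaces.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for ln in lines[1:]:
        if set(ln.strip()) <= {"-", " "}:
            continue  # skip the dashed separator row
        rows.append(dict(zip(header, re.split(r"\s{2,}", ln.strip()))))
    return rows

def rdma_interfaces(rows):
    """Return the sorted interface indexes that report RDMA Capable = True."""
    return sorted({int(r["Interface Index"]) for r in rows
                   if r["RDMA Capable"] == "True"})

rows = parse_table(SAMPLE)
print(rdma_interfaces(rows))
```

                                  Here only the 40 Gbps interface reports RDMA capability; the 1 Gbps interface does not.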

                                • Re: 40GbE optimization and bandwidth testing

                                  So I now have two servers with Windows 2012 on them, the latest firmware on each HCA, and the latest 4.2 drivers on each machine.

                                   

                                  I confirmed that RDMA is enabled on each machine, and I still can't break past the 1350MB/s mark using Windows file sharing.

                                   

                                  I ran NTttcp as per my Mellanox contact's advice and managed to get 23Gbps by throwing several threads at the job.

                                   

                                  I used the following settings, which I googled for (some guy was able to get 9.9Gbps on his Intel 10GbE card with these):

                                   

                                  ntttcps -m 8,0,10.10.10.111 -l 1048576 -n 100000 -w -a 16 -t 10

                                  ntttcpr -m 8,0,10.10.10.111 -l 1048576 -rb 2097152 -n 100000 -w -a 16 -fr -t 10

                                   

                                  On a single thread I get 15Gbps. On two threads I get 20Gbps.

                                   

                                  This is still a far cry from 40Gbps.

                                    • Re: 40GbE optimization and bandwidth testing

                                      Someone pointed me to the Chelsio website, which has white papers for 40GbE over SMB between Windows 2012 servers set up just like mine getting a whopping 36Gbps... Another poster confirmed that he's getting 31Gbps over SMB using Mellanox ConnectX-2 QDR cards with IPoIB.

                                       

                                      I've ordered myself a Mellanox 40GbE passive copper cable to see if my Fiber Optic FDR cable is what's causing the problem.

                                    • Re: 40GbE optimization and bandwidth testing

                                      4322MB/s using Starwind RAMDisk as a network share!!!

                                       

                                      As it turns out, the StarTech PCIe riser I was using in my render nodes was limiting the HCA to PCIe2 x8, which was having a disproportionate effect on performance.

                                       

                                      I had to jerry-rig the mobo out of the chassis so that I could install the 40GbE card upright temporarily for testing. Once it was running at PCIe3 x8 (confirmed using SIV) all was good.
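
                                      To put numbers on the slot difference, here is a rough sketch of the raw per-direction bandwidth of an x8 slot. It uses the standard per-lane signalling rates and line encodings (5 GT/s with 8b/10b for PCIe2, 8 GT/s with 128b/130b for PCIe3) and ignores packet and protocol overhead, so real-world figures are lower. The function name is made up for the example.

```python
# Raw per-direction PCIe bandwidth, before packet/protocol overhead.
# (transfer rate in GT/s per lane, line-encoding efficiency) per generation
PCIE_GEN = {
    2: (5.0, 8 / 10),     # PCIe 2.0: 8b/10b encoding
    3: (8.0, 128 / 130),  # PCIe 3.0: 128b/130b encoding
}

LINK_RATE_GBPS = 40.0  # nominal 40GbE line rate

def pcie_gbps(gen: int, lanes: int) -> float:
    """Usable raw bandwidth in Gbps for a given PCIe generation and lane count."""
    rate_gt_per_s, efficiency = PCIE_GEN[gen]
    return rate_gt_per_s * efficiency * lanes

for gen in (2, 3):
    gbps = pcie_gbps(gen, 8)
    verdict = "above" if gbps >= LINK_RATE_GBPS else "below"
    print(f"PCIe{gen} x8: {gbps:.1f} Gbps ({gbps / 8:.2f} GB/s), "
          f"{verdict} the 40GbE line rate")
```

                                      Even before overhead, a PCIe2 x8 slot tops out below 40Gbps, while PCIe3 x8 leaves headroom. That alone doesn't account for the full drop to 1.3GB/s, which is why the effect looked disproportionate.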


                                      Thanks to everyone for their help and advice.