Scaling Windows OS based clusters (InfiniBand and Ethernet)

Version 1

    On InfiniBand and Ethernet clusters running Windows OS, some of the OS default values do not scale well to large clusters of more than a couple of hundred nodes.

    Specifically, the settings that govern ARP handling (including IPoIB ARPs) need to be scaled up. The details follow:

     

    ARP Table size: The system default (for Server 2012 OS and earlier) is 256 entries per interface. To avoid rapid ARP table thrashing, it is recommended to increase this limit. Aim for 3 to 4 times the number of IPoIB or Ethernet interfaces in the fabric.

    Example: for a fabric with 1K nodes, set the table size to 4K.

    To show current settings:

    netsh interface ipv4 show global

     

    General Global Parameters

    ---------------------------------------------

    Default Hop Limit : 128 hops

    Neighbor Cache Limit : 256 entries per interface

    Route Cache Limit : 128 entries per compartment

    Reassembly Limit : 16773568 bytes

    ICMP Redirects : enabled

    Source Routing Behavior : dontforward

    Task Offload : enabled

    Dhcp Media Sense : enabled

    Media Sense Logging : disabled

    MLD Level : all

    MLD Version : version3

    Multicast Forwarding : disabled

    Group Forwarded Fragments           : disabled

    Randomize Identifiers : enabled

    Address Mask Reply : disabled

     

    Current Global Statistics

    ---------------------------------------------

    Number of Compartments : 1

    Number of NL clients : 7

    Number of FL providers : 4

     

    To change it to 4K:

    netsh interface ipv4 set global neighborcachelimit=4000
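
    The sizing rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are hypothetical, and the factor of 4 is the upper end of the "3 to 4 times" guideline, not a fixed requirement.

```python
def neighbor_cache_limit(interfaces: int, factor: int = 4) -> int:
    """Return a neighbor cache size of `factor` entries per fabric interface.

    `factor` is an assumption taken from the 3x-4x guideline in this article.
    """
    if interfaces <= 0 or factor <= 0:
        raise ValueError("interfaces and factor must be positive")
    return interfaces * factor


def set_limit_command(limit: int) -> str:
    """Build the netsh command shown in this article for a given limit."""
    return f"netsh interface ipv4 set global neighborcachelimit={limit}"


if __name__ == "__main__":
    # 1K-node fabric with the default factor of 4
    print(set_limit_command(neighbor_cache_limit(1000)))
```

    For a 1000-interface fabric this yields the same value (4000) used in the command above.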

     

    ARP Caching Reachable Time:

     

    The "Reachable Time" value is calculated as follows:

    Reachable Time = BaseReachable Time × (A random value between MIN_RANDOM_FACTOR and MAX_RANDOM_FACTOR)

    RFC 4861 (Neighbor Discovery) defines the following default values:

    BaseReachable Time : 30,000 milliseconds (ms)

    MIN_RANDOM_FACTOR : 0.5

    MAX_RANDOM_FACTOR : 1.5

    Therefore, the "Reachable Time" value is somewhere between 15 seconds (30 × 0.5 seconds) and 45 seconds (30 × 1.5 seconds). If an entry is not used for a period of between 15 and 45 seconds, it changes to the "Stale" state. The host must then send an ARP Request (for IPv4) to the network the next time an IP datagram is sent to that destination.
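
    As a quick check of the formula, the following Python sketch computes the range that "Reachable Time" can fall into for a given base value. The function names are made up for illustration; the two random factors are the values listed above.

```python
import random
from typing import Tuple

MIN_RANDOM_FACTOR = 0.5
MAX_RANDOM_FACTOR = 1.5


def reachable_time_range_ms(base_reachable_ms: int) -> Tuple[int, int]:
    """Return the (lowest, highest) Reachable Time in ms for a base value."""
    low = int(base_reachable_ms * MIN_RANDOM_FACTOR)
    high = int(base_reachable_ms * MAX_RANDOM_FACTOR)
    return low, high


def sample_reachable_time_ms(base_reachable_ms: int) -> float:
    """Draw one Reachable Time the way the formula describes it."""
    return base_reachable_ms * random.uniform(MIN_RANDOM_FACTOR, MAX_RANDOM_FACTOR)


if __name__ == "__main__":
    print(reachable_time_range_ms(30000))   # default base of 30,000 ms
    print(reachable_time_range_ms(600000))  # a base of 10 minutes
```

    With the default base of 30,000 ms the range is 15,000 to 45,000 ms, matching the 15 to 45 seconds stated above.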

    In large multicast domains (such as InfiniBand clusters), a lifetime of 15 to 45 seconds can be too short and can lead to thousands of ARP renewals every second. It is therefore recommended to increase the BaseReachable Time value. The right value depends on the size of the network and on how sensitive the application is to ARP changes.

     

    Example: for a 1500-node cluster we changed the BaseReachable Time to 10 minutes. In this case all IP addresses are assigned statically, so there is no concern about nodes swapping IP addresses across a reboot. We also factored in the time it takes a machine to reboot (approximately 5 to 7 minutes).
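
    To get a feel for the effect of this change, here is a rough back-of-the-envelope model in Python. It is a simplification and not a measurement: it assumes every node keeps active entries for a fixed number of peers (the hypothetical `peers_per_node` parameter, set to 100 here purely for illustration) and that each entry is renewed once per BaseReachable Time on average.

```python
def arp_renewals_per_second(nodes: int, peers_per_node: int,
                            base_reachable_ms: int) -> float:
    """Estimate fabric-wide ARP renewals per second.

    Assumption: each of `nodes` hosts renews `peers_per_node` entries
    once per BaseReachable Time (the average of the random range).
    """
    if base_reachable_ms <= 0:
        raise ValueError("base_reachable_ms must be positive")
    return nodes * peers_per_node / (base_reachable_ms / 1000.0)


if __name__ == "__main__":
    nodes, peers = 1500, 100  # peers per node is an assumed figure
    print(arp_renewals_per_second(nodes, peers, 30000))   # default base
    print(arp_renewals_per_second(nodes, peers, 600000))  # 10-minute base
```

    Under these assumptions, raising the base from 30 seconds to 10 minutes cuts the estimated renewal rate by a factor of 20 (from 5000 to 250 per second in this example).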

    To show the current value:

    netsh interface ipv4 show interface <if number>

     

    Interface Local Area Connection Parameters

    ----------------------------------------------

    IfLuid : ethernet_6

    IfIndex : 10

    State : connected

    Metric : 10

    Link MTU : 1500 bytes

    Reachable Time : 15000 ms

    Base Reachable Time : 30000 ms

    Retransmission Interval            : 1000 ms

    DAD Transmits : 3

    Site Prefix Length : 64

    Site Id : 1

    Forwarding : disabled

    Advertising : disabled

    Neighbor Discovery : enabled

    Neighbor Unreachability Detection  : enabled

    Router Discovery : dhcp

    Managed Address Configuration      : enabled

    Other Stateful Configuration       : enabled

    Weak Host Sends : disabled

    Weak Host Receives : disabled

    Use Automatic Metric : enabled

    Ignore Default Routes : disabled

    Advertised Router Lifetime         : 1800 seconds

    Advertise Default Route            : disabled

    Current Hop Limit : 0

    Force ARPND Wake up patterns       : disabled

    Directed MAC Wake up patterns      : disabled

    To set it to 10 minutes (600000 ms):

    netsh interface ipv4 set interface <if number> basereachable=600000

     

    ARP Retransmission Time: defines how long the stack waits for a reply to an ARP request before it retransmits the request.

    The stack's default is 1 second (1000 ms), which is usually enough. However, in large networks with multiple switching hops, an ARP reply can sometimes take over a second at peak times. If you decide to change this value, 3 seconds should be more than enough.

     

    To show the current value:

    netsh interface ipv4 show interface <if number>

    Interface Local Area Connection Parameters

    ----------------------------------------------

    IfLuid : ethernet_6

    IfIndex : 10

    State : connected

    Metric : 10

    Link MTU : 1500 bytes

    Reachable Time : 15000 ms

    Base Reachable Time : 30000 ms

    Retransmission Interval            : 1000 ms

    DAD Transmits : 3

    Site Prefix Length : 64

    Site Id : 1

    Forwarding : disabled

    Advertising : disabled

    Neighbor Discovery : enabled

    Neighbor Unreachability Detection  : enabled

    Router Discovery : dhcp

    Managed Address Configuration      : enabled

    Other Stateful Configuration       : enabled

    Weak Host Sends : disabled

    Weak Host Receives : disabled

    Use Automatic Metric : enabled

    Ignore Default Routes : disabled

    Advertised Router Lifetime         : 1800 seconds

    Advertise Default Route            : disabled

    Current Hop Limit : 0

    Force ARPND Wake up patterns       : disabled

    Directed MAC Wake up patterns      : disabled

     

    To set this value to 3 seconds (here for the interface with IfIndex 10):

    netsh interface ipv4 set interface 10 retransmittime=3000

     

    Notes:

    • Run the above netsh commands from a Windows command line with Administrator privileges.
    • The above settings persist across reboots.
    • For uniformity, make the above changes on all compute nodes in the network.
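
    Since the same changes have to be applied on every node, it can help to generate the command list once and reuse it. The Python sketch below only builds the three netsh command lines from this article as strings; it does not run them. The helper name and its default values (10 minutes and 3 seconds, taken from the examples above) are assumptions for illustration.

```python
from typing import Iterable, List


def tuning_commands(neighbor_cache_limit: int,
                    interface_indexes: Iterable[int],
                    base_reachable_ms: int = 600000,
                    retransmit_ms: int = 3000) -> List[str]:
    """Return the netsh commands from this article for one node."""
    # The neighbor cache limit is a global setting, so it is set once.
    commands = [
        f"netsh interface ipv4 set global neighborcachelimit={neighbor_cache_limit}"
    ]
    # The two timers are per-interface settings.
    for idx in interface_indexes:
        commands.append(
            f"netsh interface ipv4 set interface {idx} basereachable={base_reachable_ms}"
        )
        commands.append(
            f"netsh interface ipv4 set interface {idx} retransmittime={retransmit_ms}"
        )
    return commands


if __name__ == "__main__":
    # Example: 4K table size, one interface with IfIndex 10
    for line in tuning_commands(4000, [10]):
        print(line)
```

    The printed lines can be pasted into an elevated command prompt or saved to a batch file and distributed to the compute nodes.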

    References:

    http://support.microsoft.com/kb/949589

    http://technet.microsoft.com/en-us/library/cc731521%28v=ws.10%29.aspx