Blade Chassis to End-of-Row Switch Connectivity & High Availability Options

A Spanning Tree Protocol (STP)-free network inside the data center has been a major focus for network vendors, and many technologies have been introduced in the recent past to resolve STP issues in the data center and ensure optimal link utilization. The advent of switching modules inside blade enclosures, coupled with the requirement for optimal link utilization starting right from the blade server, has made today's data center network more complex.

In an earlier blog post I discussed design guidelines for port bundling / active-active NIC teaming between a Juniper Virtual Chassis Fabric (QFX 5100 VCF) and rack-mounted servers; however, a few questions remained unanswered.

Q-1. What if the Virtual Chassis option is not available, specifically in scenarios where Juniper EX 9200 / QFX 10K chassis-based switches are deployed as End of Row switches?

Q-2. What if, instead of rack-mounted servers, blade chassis enclosures (equipped with switching modules) are used as compute machines and there is a requirement for active/active or active/passive NIC teaming on the blade servers? How will the overall network look and work, starting from the blade server NIC?

This blog post will answer these questions.

Assumption: the network switches are deployed in an End of Row model, and to address the STP issue, Multi-Chassis Link Aggregation (MC-LAG) is deployed. Please see one of my earlier blog posts for an understanding of MC-LAG.

Option 1: Rack-mounted servers are used as compute machines, the servers have multiple NICs installed as pass-through modules, and the virtual machines hosted on the servers require active/active NIC teaming.

[Figure: Option 1 – rack-mounted servers with active/active NIC teaming to the End of Row switches]

Option 2: The blade chassis has multiple blade servers, and each blade server has more than one NIC (connected to the blade chassis switches through internal fabric links). Virtual machines hosted on the blade servers require active/active NIC teaming.

[Figure: Option 2 – blade chassis with switching modules, active/active NIC teaming]

Option 3: The blade chassis has multiple blade servers, and each blade server has more than one NIC (connected to the blade chassis switches through internal fabric links). Virtual machines hosted on the blade servers require active/passive NIC teaming.

[Figure: Option 3 – blade chassis with switching modules, active/passive NIC teaming]

Packet Walk-Through – Part 1

The objective of this post is to follow an end-to-end packet (client to server) as it traverses a service provider network, with special consideration of the factors affecting performance.

 

[Figure: Reference topology – client to server across the service provider network]

 

 

Suppose the client needs to access a service hosted on a server connected to CE-2, and that all network links and end-system NICs are Ethernet based. In normal circumstances, almost all compute machines (PCs/servers) generate IP datagrams of 1500 bytes (20-byte header + 1480 data bytes).

[Figure: IP header fields]

Fragmentation: If any link is unable to handle a 1500-byte IP datagram, the packet will be fragmented and forwarded to its destination, where it will be reassembled. Fragmentation and reassembly introduce overhead, and overall performance will definitely degrade. The following IP header fields are used to detect fragmentation and to reassemble the packets.

  • Identification – the same value is carried in every fragment of a datagram, so the receiver can tell which fragments belong together.
  • Flags – 3 bits. Bit 0 is always 0; bit 1 is DF (Don't Fragment: 0 = fragmentation allowed, 1 = not allowed); bit 2 is MF (More Fragments: 1 = more fragments follow, 0 = last fragment).
  • Fragment Offset – determines where the fragment's data (after removal of its IP header) is placed within the original datagram when the packet is reassembled.

The following example illustrates the transfer of an IP datagram between end systems. The total size of the IP datagram is 4860 bytes, which includes a 20-byte IP header, so the total data size is 4840 bytes.

[Figure: Fragmentation of the 4860-byte IP datagram on the sending host]

As explained earlier, Ethernet-based NICs assume a default IP packet size of 1500 bytes (20-byte header + 1480 data bytes). So the data will be divided into 1480-byte chunks, and each chunk will be transmitted after adding a 20-byte IP header.

The receiving host will reassemble the data once all fragments are received, placing the data bytes extracted from each fragment into their original position within the packet as it was on the sending host before fragmentation. The Fragment Offset value is used for this purpose (it is always a multiple of 8).

  • 1st fragment data = 1480 bytes (bytes 0–1479)
  • 2nd fragment data = 1480 bytes (placement starts at byte 1480 and ends at 2959; fragment offset 185 × 8 = 1480)
  • 3rd fragment data = 1480 bytes (placement starts at byte 2960 and ends at 4439; fragment offset 370 × 8 = 2960)
  • 4th fragment data = 400 bytes (placement starts at byte 4440 and ends at 4839; fragment offset 555 × 8 = 4440)
  • Total data is now 4840 bytes, and adding the 20-byte IP header gives 4860 bytes, which was the size of the IP datagram on the sending host before fragmentation.

[Figure: Reassembly of the fragments on the receiving host]
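To make the offset arithmetic concrete, here is a minimal Python sketch (not any vendor's implementation) that splits a datagram's payload into fragments the way the example above describes; the function name fragment_offsets and the 4840-byte payload are purely illustrative.

def fragment_offsets(data_len, mtu=1500, ip_header=20):
    """Split an IP payload of data_len bytes into fragments for a given MTU.

    Every fragment except the last carries a data length that is a multiple
    of 8, because the Fragment Offset field counts 8-byte units.
    Returns a list of (offset_field, data_bytes, first_byte, last_byte).
    """
    max_data = (mtu - ip_header) // 8 * 8   # largest multiple of 8 that fits
    fragments, start = [], 0
    while start < data_len:
        size = min(max_data, data_len - start)
        fragments.append((start // 8, size, start, start + size - 1))
        start += size
    return fragments

# The 4860-byte datagram from the example: 20-byte header + 4840 data bytes.
for offset, size, first, last in fragment_offsets(4840):
    print(f"offset field {offset:>3} -> data bytes {first}..{last} ({size} bytes)")
# Prints offset fields 0, 185, 370 and 555, matching the walk-through above.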

Everything was handled smoothly and no performance degradation was observed. End systems use two mechanisms to keep data transfer smooth:

Path MTU Discovery (PMTUD): Path MTU Discovery is defined in RFC 1191 and lets end systems detect the maximum MTU allowed on a communication path. The end system starts sending IP packets using the MTU of its outgoing interface (default 1500 bytes, i.e. 20-byte IP header and 1480 data bytes) with the DF bit set. If any link between source and destination cannot carry a 1500-byte IP packet, an "ICMP destination unreachable" message with type 3, code 4 is returned to the originating host, because the DF bit says "do not fragment this packet". The router that was unable to forward the 1500-byte packet also includes the MTU of its next hop in this message. The process continues until the originating host finds the maximum MTU allowed along the path to its destination and adjusts the packet size it sends on the outgoing interface accordingly.
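A rough way to picture PMTUD is the loop below. It is only a toy simulation (the path_mtus list and the helper name are made up, this is not a real socket implementation), but it mirrors the RFC 1191 behaviour of shrinking the packet size each time an ICMP "fragmentation needed" message reports a smaller next-hop MTU.

def path_mtu_discovery(path_mtus, start_mtu=1500):
    """Simulate RFC 1191 PMTUD over a path whose links have the given MTUs.

    Packets are sent with DF set; any hop whose MTU is smaller than the
    packet "returns" an ICMP type 3 code 4 carrying its next-hop MTU, and
    the sender retries with that smaller size.
    """
    mtu = start_mtu
    while True:
        # Find the first hop that cannot carry a packet of the current size.
        bottleneck = next((m for m in path_mtus if m < mtu), None)
        if bottleneck is None:
            return mtu          # packet fits on every link: path MTU found
        print(f"ICMP type 3 code 4: next-hop MTU {bottleneck}")
        mtu = bottleneck        # sender lowers its packet size and retries

# Example: client -> CE-1 -> PE-1 -> P -> PE-2 -> CE-2 with one 1400-byte link.
print("Path MTU:", path_mtu_discovery([1500, 1500, 1400, 1500]))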

TCP Maximum Segment Size (MSS) – RFC 2923: TCP MSS is the size of the TCP data carried in an IP packet (IP MTU − 40), e.g. 1500 − 40 = 1460, where 40 comes from the 20-byte IP header and the 20-byte TCP header. Modern TCP stacks exchange the MSS between end systems, so each side learns the maximum TCP segment the opposite end can accept and adjusts its sending MSS accordingly.
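The MSS arithmetic itself is trivial; the short snippet below (illustrative only) simply restates it in code.

def tcp_mss(ip_mtu, ip_header=20, tcp_header=20):
    """TCP MSS = IP MTU minus the IP and TCP headers."""
    return ip_mtu - ip_header - tcp_header

print(tcp_mss(1500))   # 1460 for a standard 1500-byte Ethernet IP MTU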

Impact of Fragmentation: End systems can handle fragmentation without much overhead, as they have sufficient resources (buffers to hold fragments until the last one arrives, and CPU to reassemble the packet in the original order it had on the originating host before fragmentation). Routers can also perform fragmentation fairly easily, as all the information required is available in the original packet header; the router just needs to replicate the header, set the fragment offset and MF flag, and copy the identification field. But devices that process packets at layer 4 and above need to reassemble all fragments into the original IP datagram before any further processing, and such devices may face performance issues with fragmentation and packet reassembly.

In our example above, suppose the originating host starts sending 1500-byte IP packets with the DF bit set to 1 (fragmentation not allowed). Now suppose the link between the P and PE-2 routers has an IP MTU of 1400 bytes; since fragmentation is not allowed, the packet will be dropped on the P router and, as a consequence, an ICMP destination unreachable message with type 3, code 4 will be returned to the originating host. If this message does not reach the originating host, or the router where the packet was dropped is unable to report the next-hop MTU (as per the RFC specification), then the only option available to the application administrator is to clear the DF bit on the originating host, which means that if a link with a smaller MTU is encountered, the packet can be fragmented.

[Figure: Packet dropped on the P router due to the 1400-byte MTU link while the DF bit is set]

 

Now suppose a 1500-byte IP packet with DF bit = 0 (fragmentation allowed) is sent from the client; it reaches the P router, which finds that the next-hop MTU is 1400, smaller than the packet size. Since the DF bit is clear, the P router will fragment the packet. Because the data length of every fragment except the last must be a multiple of 8, it transmits a 1396-byte fragment (20-byte IP header + 1376 data bytes) followed by a 124-byte fragment (20-byte header + the remaining 104 data bytes). The original 1500-byte packet that arrived at the P router is therefore forwarded towards the destination as two fragments.

In the earlier example the IP datagram size was 4860 bytes including the 20-byte IP header, and the originator transmitted it in four fragments (3 × 1500-byte fragments and one 420-byte fragment, each including a 20-byte IP header). The first three fragments will be fragmented again on the P router and each will be forwarded as two fragments (1396 and 124 bytes). This looks normal from a forwarding perspective, but suppose one of the small 124-byte fragments transmitted by the P router does not reach the destination due to congestion on a link; the destination host will then be unable to reassemble the packet because of that single lost fragment, the whole transmission has to be repeated, and one can imagine the impact on performance.
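Reusing the illustrative fragment_offsets helper from the earlier sketch shows why the P router ends up emitting 1396-byte and 124-byte fragments for each 1500-byte packet crossing the 1400-byte link:

# 1480 data bytes (a 1500-byte packet minus its 20-byte header) re-fragmented
# for a 1400-byte MTU: non-final fragment data must stay a multiple of 8.
for offset, size, first, last in fragment_offsets(1480, mtu=1400):
    print(f"offset field {offset:>3} -> {size} data bytes "
          f"({size + 20} bytes on the wire)")
# offset field   0 -> 1376 data bytes (1396 bytes on the wire)
# offset field 172 ->  104 data bytes (124 bytes on the wire)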

However, service providers usually have a Service Level Agreement (SLA) with corporate customers to support a minimum MTU (usually 1500 bytes). In the next blog post we will discuss the MTU overhead of various tunneling / VPN approaches (e.g. L3VPN, L2VPN/VPLS, GRE & IPsec from CE-1 to CE-2).

Junos MTU Handling on Access & Trunk Ports

MTU is one of the most important aspects for the proper functioning of any application. In this blog post I will highlight MTU handling by Junos-based devices for 802.3 untagged and 802.1Q tagged packets.

[Figure: 802.3 frame format]

A simple 802.3 packet header is shown above; the total frame size is 1514 bytes (14-byte header + 1500-byte maximum payload). Now we will see how Junos-based devices handle MTU on access ports.

 

  • LAB> show interfaces xe-1/0/32
    Physical interface: xe-1/0/32, Enabled, Physical link is Up
    Link-level type: Ethernet, MTU: 1514, MRU: 0, Link-mode: Auto, Speed: Auto, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering: Disabled, Flow control: Disabled, Auto-negotiation: Disabled
    ———-output omitted for brevity——————–
    Protocol eth-switch, MTU: 1514
  • LAB> monitor traffic interface xe-1/0/32 no-resolve layer2-headers print-hex
    02:09:00.266841 Out 00:31:46:52:dd:80 > 00:1e:0b:d3:1d:1a, ethertype 802.1Q (0x8100), length 1486: vlan 243, p 0, ethertype IPv4, truncated-ip – 32 bytes missing!
    (tos 0x0, ttl 64, id 49385, offset 0, flags [DF], proto: ICMP (1), length: 1500)
    192.168.243.1 > 192.168.243.52: ICMP echo reply, id 29316, seq 5, length 1480

 

  • As we can see, the access interface xe-1/0/32 shows an MTU of 1514, but when we monitor traffic on the same interface we can, surprisingly, see the Tag Protocol Identifier (0x8100), which indicates 802.1Q tagging. As per the port configuration 802.1Q should not appear, since the port is configured in access mode (the conclusion is given in the last section).

Let's explore the 802.1Q packet format and the Junos device behavior with respect to MTU.

[Figure: 802.1Q frame format]

The 802.1Q header adds an additional 4 bytes, so the new header is 18 bytes compared with the 14-byte 802.3 header, and the total frame size becomes 1518 bytes (18-byte header + 1500-byte payload). We have configured an aggregated Ethernet interface (ae1) in trunk mode (which enables the interface to receive 802.1Q tagged packets).

LAB> show interfaces ae1
Physical interface: ae1, Enabled, Physical link is Up
———-output omitted for brevity——————–
Protocol eth-switch, MTU: 1514
Flags: Trunk-Mode

As per the configuration, the ae1 interface should show an MTU value of 1518 due to "interface-mode trunk", but it is showing an MTU value of 1514.

Obviously this creates confusion: if the trunk interface shows an MTU value of 1514, how will it receive a packet with a 1500-byte payload plus an 18-byte header? The fact of the matter is that this interface will receive a 1500-byte payload with an 18-byte header even though the MTU value displayed in the CLI is 1514.

Conclusion

  • Trunk ports – even though the MTU size displayed in the CLI is 1514 bytes, at the hardware level 1518 bytes are handled for 802.1Q packets.
  • Access ports – in the CLI the MTU is shown as 1514, which is quite normal, but once we monitor traffic on the access port we can see the Tag Protocol Identifier (0x8100), which indicates 802.1Q tagging. So for access ports, the Junos hardware internally adds a tag as well.
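The frame-size arithmetic behind these numbers can be summarised in a few lines of Python (illustrative only; the constants are the standard header sizes, not anything read from the switch):

ETH_HEADER = 14        # destination MAC (6) + source MAC (6) + EtherType (2)
DOT1Q_TAG  = 4         # 802.1Q TPID + TCI inserted after the source MAC
IP_MTU     = 1500      # maximum payload carried in the frame

untagged_frame = ETH_HEADER + IP_MTU              # 1514 bytes (value shown in the CLI)
tagged_frame   = ETH_HEADER + DOT1Q_TAG + IP_MTU  # 1518 bytes (handled in hardware)
print(untagged_frame, tagged_frame)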

 

Juniper QFX 5100 & VMware ESXi Host NIC Teaming – Design Considerations

The objective of this article is to highlight design considerations for NIC teaming between a Juniper QFX 5100 (Virtual Chassis, VC) and a VMware ESXi host.

The reference topology is as follows:

We have 2 x Juniper QFX 5100-48S switches deployed as a VC in order to provide connectivity to the compute machines. All compute machines run the VMware ESXi hypervisor. A Link Aggregation Group (LAG, i.e. active/active NIC teaming) is required between the compute machines and the QFX 5100 VC.

  • Data traffic from server to switch – the xe-0/0/0 interface on both switches is connected to NICs 3 & 4 on a single compute machine.
  • ESXi host management and vMotion traffic from server to switch – the xe-0/0/45 interface on both switches is connected to NICs 1 & 2 on the compute machine.
  • VLAN IDs
    • Data VLANs – 116, 126
    • vMotion – 12
    • ESXi management – 11

Hence, the requirement is to configure a LAG (active/active NIC teaming) between the compute machines and the network switches for optimal link utilization, in addition to fault tolerance in case one physical link between a network switch and a compute machine goes down.

In order to achieve the required results, one needs to understand the default load-balancing mechanism over LAG member interfaces on Juniper devices, and the same load-balancing mechanism must be configured on VMware ESXi for NIC teaming.

  • Juniper's default load balancing over LAG member interfaces is based on the "layer 2 payload" hash mode, which takes into consideration the source IP, destination IP, source port and destination port (see the sketch after this list).
  • In order to support similar behavior on VMware ESXi hosts, active/active NIC teaming must be enabled with "Route Based on IP Hash".
  • Data traffic
    • A LAG will be configured on the Juniper switch with interface-mode trunk, and all required VLANs will be allowed.
    • Active/active NIC teaming must be enabled with "Route Based on IP Hash" (LACP is only supported with the vCenter vDS, whereas without vCenter we can configure a static LAG without LACP).
  • vMotion and ESXi management traffic
    • VMware does not recommend active/active NIC teaming on links carrying VMkernel (vMotion) traffic, so active/passive NIC teaming will be configured for these links with "Route Based on Originating Port ID".
    • Both links on the Juniper switches will be configured as trunks, allowing both the vMotion and ESXi management VLANs; in addition, the ESXi management VLAN must be set as the native VLAN.
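As a rough illustration of why the switch-side hash and the ESXi "Route Based on IP Hash" policy need to agree, the sketch below (the function name and the use of Python's built-in hash are purely illustrative, not Juniper's or VMware's actual algorithm) picks a LAG member from the source/destination IPs and ports, so all packets of one flow stay on one link while different flows spread across members.

def pick_lag_member(src_ip, dst_ip, src_port, dst_port, members):
    """Map a flow's 4-tuple onto one LAG member link.

    Any deterministic hash works for the illustration; the point is that
    both ends of the bundle should hash on the same fields, otherwise the
    ESXi host and the switch may place a flow on different links.
    """
    flow_key = (src_ip, dst_ip, src_port, dst_port)
    return members[hash(flow_key) % len(members)]

members = ["xe-0/0/0 (QFX-1)", "xe-0/0/0 (QFX-2)"]
print(pick_lag_member("192.168.116.10", "192.168.126.20", 49152, 443, members))
print(pick_lag_member("192.168.116.11", "192.168.126.20", 49153, 443, members))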