Blade Chassis to End of Row Switches Connectivity & High Availability Options

Building a Spanning Tree Protocol (STP)-free network inside the data center has been a main focus for network vendors, and many technologies have been introduced in the recent past to resolve STP issues and ensure optimal link utilization. The advent of switching modules inside blade enclosures, coupled with the requirement for optimal link utilization starting right from the blade server, has made today's data center network more complex.

In an earlier blog I discussed design guidelines for port bundling / active-active NIC teaming between a Juniper Virtual Chassis Fabric (QFX 5100 VCF) and rack-mounted servers; however, a few questions were still unanswered.

Q-1. What if the Virtual Chassis option is not available, specifically in scenarios where Juniper EX 9200 / QFX 10K chassis-based switches are deployed as End of Row switches?

Q-2. What if, instead of rack-mounted servers, blade chassis enclosures (equipped with switching modules) are used as compute machines and there is a requirement for active/active or active/passive NIC teaming on the blade servers? How will the overall network look and work, starting right from the blade server NIC?

This blog will answer these questions.

Assumption: network switches are deployed in an End of Row model, and Multi-Chassis Link Aggregation (MC-LAG) is deployed to avoid STP. Please see one of my earlier blogs for an understanding of MC-LAG.

Option 1: Rack-mounted servers are used as compute machines, each server has multiple NICs connected through a pass-through module, and the virtual machines hosted on the servers require active/active NIC teaming.

[Figure: Option 1 topology]

Option 2: The blade chassis has multiple blade servers and each blade server has more than one NIC (connected to the blade chassis switches through internal fabric links). Virtual machines hosted on the blade servers require active/active NIC teaming.

[Figure: Option 2 topology]

Option 3: The blade chassis has multiple blade servers and each blade server has more than one NIC (connected to the blade chassis switches through internal fabric links). Virtual machines hosted on the blade servers require active/passive NIC teaming.

[Figure: Option 3 topology]

Packet Walk-Through: Part 1

The objective of this blog is to discuss an end-to-end packet (client to server) traversing a service provider network, with special consideration of the factors affecting performance.

 

[Figure: Reference topology]

 

 

Suppose the client needs to access a service hosted on the server connected to CE-2, and all network links and NICs on the end systems are Ethernet based. In normal circumstances, compute machines (PCs/servers) from almost all vendors generate IP datagrams of 1500 bytes (20-byte header + 1480 data bytes).

[Figure: IP header]

Fragmentation: If any link is unable to handle a 1500-byte IP datagram, the packet is fragmented and forwarded to its destination, where it is reassembled. Fragmentation and reassembly introduce overhead, and overall performance will definitely be degraded. The following IP header fields are important for detecting fragmentation and reassembling the packets.

  • Identification: the same for all fragments of a given packet, so the receiver can tell which fragments belong together.
  • Flags: 3 bits. Bit 0 is always 0, bit 1 is DF (Don't Fragment; 0 = fragmentation allowed, 1 = not allowed), and bit 2 is MF (More Fragments; 1 = more fragments follow, 0 = last fragment).
  • Fragment Offset: determines where the fragment's data is placed (after removal of the IP header) within the original packet when it is reassembled (a small sketch of these bits follows this list).
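To make the flag bits concrete, here is a minimal Python sketch (my own illustration, not from the original post) that reads the DF/MF bits and the offset out of the 16-bit flags/fragment-offset field of an IPv4 header:

# Bytes 6-7 of the IPv4 header: 3 flag bits followed by a 13-bit fragment offset.
def parse_flags_offset(field):
    df = (field >> 14) & 1          # bit 1: Don't Fragment
    mf = (field >> 13) & 1          # bit 2: More Fragments
    offset = field & 0x1FFF         # offset in 8-byte units
    return df, mf, offset

print(parse_flags_offset(0x4000))   # (1, 0, 0)   -> DF set, not fragmented
print(parse_flags_offset(0x20B9))   # (0, 1, 185) -> MF set, offset 185 (185 x 8 = 1480 bytes)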

The example below illustrates the transfer of an IP datagram between end systems. The total size of the IP datagram is 4860 bytes, which includes a 20-byte IP header, so the data portion is 4840 bytes.

[Figure: Fragmentation of a 4860-byte IP datagram]

As explained earlier, Ethernet-based NICs assume a default IP packet size of 1500 bytes (20-byte header + 1480 data bytes). The data is therefore divided into 1480-byte chunks, each transmitted with a 20-byte IP header added.

The receiving host reassembles the data once all fragments are received, placing the data bytes extracted from each fragment into their original position inside the packet as it existed on the sending host before fragmentation. The fragment offset value is used for this purpose (it always counts in multiples of 8 bytes); the sketch after the list below reproduces these values.

  • 1st fragment data = 1480 bytes (bytes 0-1479)
  • 2nd fragment data = 1480 bytes (placement starts at byte 1480 and ends at 2959; fragment offset 185 x 8 = 1480)
  • 3rd fragment data = 1480 bytes (placement starts at byte 2960 and ends at 4439; fragment offset 370 x 8 = 2960)
  • 4th fragment data = 400 bytes (placement starts at byte 4440 and ends at 4839; fragment offset 555 x 8 = 4440)
  • Total data is now 4840 bytes; adding the 20-byte IP header gives 4860 bytes, which was the size of the IP datagram on the sending host before fragmentation.
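The small Python sketch below (my own illustration, under the assumptions stated in the comments) reproduces the fragment sizes and offsets listed above for a 4840-byte payload and a 1500-byte MTU:

# Split a payload into IP fragments; assumes a 20-byte header and that the
# data length of every non-final fragment is a multiple of 8 bytes.
def fragment(payload_len, mtu=1500, ip_header=20):
    max_data = (mtu - ip_header) // 8 * 8
    fragments, offset = [], 0
    while offset < payload_len:
        data = min(max_data, payload_len - offset)
        more = offset + data < payload_len
        fragments.append({"offset_field": offset // 8,  # value carried in the header
                          "data_bytes": data,
                          "MF": int(more)})
        offset += data
    return fragments

for f in fragment(4840):
    print(f)
# offset_field values 0, 185, 370, 555 -> byte positions 0, 1480, 2960, 4440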

[Figure: Reassembly of the fragments at the receiving host]

In this case everything is handled smoothly and no performance degradation is observed. End systems use two mechanisms for smooth data transfer:

Path MTU Discovery (PMTUD): Path MTU Discovery is defined in RFC 1191 and lets end systems detect the maximum MTU allowed on a communication path. The end system starts sending IP packets using the MTU of the outgoing interface (default 1500 bytes: 20-byte IP header plus 1480 data bytes) with the DF bit set. If any link between source and destination does not support a 1500-byte IP packet, an "ICMP destination unreachable" message (type 3, code 4) is returned to the originating host, because the DF bit tells routers "do not fragment this packet". The router that was unable to forward the 1500-byte packet includes the MTU of its next hop in this message. The process continues until the originating host learns the maximum MTU allowed along the path to the destination and adjusts the MTU used on its outgoing interface accordingly.
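The following Python sketch (my own illustration; send_probe and its reply object are hypothetical helpers, not a real API) captures the PMTUD loop described above:

# Keep probing with DF set; every ICMP type 3, code 4 reply carries the
# next-hop MTU, so lower the working MTU until no more errors come back.
def path_mtu_discovery(send_probe, start_mtu=1500):
    mtu = start_mtu
    while True:
        icmp_error = send_probe(size=mtu, df=True)   # hypothetical probe helper
        if icmp_error is None:                       # no "fragmentation needed" error returned
            return mtu                               # this MTU fits the whole path
        mtu = icmp_error.next_hop_mtu                # MTU reported by the dropping router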

TCP Maximum Segment Size (MSS), RFC 2923: TCP MSS is the size of the TCP data in an IP packet (IP MTU minus 40), e.g. 1500 - 40 = 1460, where 40 comes from the 20-byte IP header and the 20-byte TCP header. Modern TCP stacks negotiate the MSS between end systems to identify the maximum TCP segment the opposite end system will accept, and adjust the sending MSS accordingly.
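As a quick check of the arithmetic above (a trivial sketch, not part of the original post):

# MSS = IP MTU minus the 20-byte IP header and the 20-byte TCP header.
def tcp_mss(ip_mtu, ip_header=20, tcp_header=20):
    return ip_mtu - ip_header - tcp_header

print(tcp_mss(1500))   # 1460
print(tcp_mss(1400))   # 1360, e.g. after PMTUD discovers a 1400-byte link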

Impact of fragmentation: End systems can handle fragmentation without much overhead, as they have sufficient resources (buffers to hold fragments until the last one arrives, and CPU to reassemble the packet in the original order it had on the originating host). Routers can also perform fragmentation fairly easily, since all the information required is available in the original packet header; the router just replicates the header and sets the fragment offset, MF flag and identification. However, devices that process packets at layer 4 and above need to reassemble all fragments into the original IP datagram before any further processing, and such devices may face performance issues with fragmentation and reassembly.

Continuing the example above, suppose the originating host starts sending 1500-byte IP packets with the DF bit set to 1 (fragmentation not allowed). If the link between the P and PE2 routers has an IP MTU of 1400 bytes, the packet will be dropped on the P router and an "ICMP destination unreachable" message (type 3, code 4) will be returned to the originating host. If this message does not reach the originating host, or the router that dropped the packet is unable to report the next-hop MTU (as the RFC specifies it should), then the only option left to the application administrator is to clear the DF bit on the originating host, which means the packet can be fragmented whenever a link with a smaller MTU is encountered.

[Figure: Packet dropped on the P router when the DF bit is set and the next-hop MTU is 1400 bytes]

 

Now suppose a 1500-byte IP packet with the DF bit = 0 (fragmentation allowed) is initiated from the client. It reaches the P router, which finds that the next-hop MTU is 1400 bytes, less than the packet size. Since the DF bit is clear, the P router fragments the packet and transmits a 1400-byte fragment (20-byte IP header + 1380 data bytes) followed by a 120-byte fragment (20-byte header + the remaining 100 data bytes). The original 1500-byte packet that arrived at the P router is thus forwarded toward the destination as two fragments. (In practice the data length of a non-final fragment is rounded down to a multiple of 8 bytes, so the exact sizes differ slightly.)

In the example referred to above, the IP datagram was 4860 bytes including the 20-byte IP header, and it was transmitted by the originator in four fragments (three fragments of 1500 bytes and one fragment of 420 bytes, each fragment including a 20-byte IP header). The first three fragments are fragmented again on the P router and transmitted as two fragments of roughly 1400 and 120 bytes each. This looks normal as far as the router's forwarding operation is concerned, but suppose one of the 120-byte fragments transmitted by the P router does not reach the destination due to congestion on a link. The destination host will then be unable to reassemble the packet because a fragment was lost; the whole transmission has to be repeated, and one can imagine the impact on performance.

Usually, however, service providers have a Service Level Agreement (SLA) with corporate customers to support a minimum MTU (typically 1500 bytes). In the next blog we will discuss MTU overhead for various tunneling / VPN approaches (e.g. L3VPN, L2VPN/VPLS, GRE and IPsec from CE-1 to CE-2).

Junos MTU Handling on Access & Trunk Ports

MTU is one of the most important aspects of proper application functionality. In this blog post I will highlight MTU handling by Junos-based devices for untagged 802.3 and 802.1Q-tagged packets.

[Figure: 802.3 frame header]

A simple 802.3 frame header is shown above; the total frame size is 1514 bytes (14-byte header + 1500-byte maximum payload). Now let us see how Junos-based devices handle MTU on access ports.

 

LAB> show interfaces xe-1/0/32
Physical interface: xe-1/0/32, Enabled, Physical link is Up
Link-level type: Ethernet, MTU: 1514, MRU: 0, Link-mode: Auto, Speed: Auto, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering: Disabled, Flow control: Disabled, Auto-negotiation: Disabled,
———-output omitted for brevity——————–
Protocol eth-switch, MTU: 1514

LAB> monitor traffic interface xe-1/0/32 no-resolve layer2-headers print-hex
02:09:00.266841 Out 00:31:46:52:dd:80 > 00:1e:0b:d3:1d:1a, ethertype 802.1Q (0x8100), length 1486: vlan 243, p 0, ethertype IPv4, truncated-ip – 32 bytes missing!
(tos 0x0, ttl 64, id 49385, offset 0, flags [DF], proto: ICMP (1), length: 1500)
192.168.243.1 > 192.168.243.52: ICMP echo reply, id 29316, seq 5, length 1480

 

As we can see, the access interface "xe-1/0/32" shows an MTU of 1514, but when we monitor traffic on the same interface we can, surprisingly, see the Tag Protocol Identifier (0x8100), which indicates 802.1Q tagging, even though the port is configured in access mode and no 802.1Q tag is expected (see the Conclusion below).

Let's explore the 802.1Q frame and Junos device behavior with respect to MTU.

[Figure: 802.1Q frame header]

The 802.1Q header adds 4 bytes, so the new header is 18 bytes (compared to the 14-byte 802.3 header) and the total frame size becomes 1518 bytes (18-byte header + 1500-byte payload). We have configured an aggregated Ethernet interface (ae1) in trunk mode, which enables the interface to receive 802.1Q-tagged packets.

LAB> show interfaces ae1
Physical interface: ae1, Enabled, Physical link is Up
———-output omitted for brevity——————–
Protocol eth-switch, MTU: 1514
Flags: Trunk-Mode

Based on the "interface-mode trunk" configuration, one would expect the ae1 interface to show an MTU of 1518, but it shows an MTU of 1514.

Obviously this creates confusion: if the trunk interface shows an MTU of 1514, how can it receive a frame with a 1500-byte payload plus an 18-byte header? The fact of the matter is that this interface will accept a 1500-byte payload with an 18-byte header even though the MTU displayed in the CLI is 1514.
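A quick sketch of the frame-size arithmetic behind the two numbers (assuming, as the output above suggests, that the 4-byte FCS is not counted in the displayed MTU):

ETH_HEADER = 14      # destination MAC + source MAC + EtherType
DOT1Q_TAG  = 4       # TPID + TCI inserted by 802.1Q
PAYLOAD    = 1500    # maximum IP payload

print(ETH_HEADER + PAYLOAD)              # 1514 -> the value Junos displays in the CLI
print(ETH_HEADER + DOT1Q_TAG + PAYLOAD)  # 1518 -> what the hardware actually handles on trunk ports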

Conclusion

  • Trunk ports: even though the MTU displayed in the CLI is 1514 bytes, at the hardware level 1518 bytes are handled for 802.1Q frames.
  • Access ports: the CLI shows an MTU of 1514, which is normal, but when we monitor traffic on the access port we can see the Tag Protocol Identifier (0x8100), which indicates 802.1Q tagging. So for access ports the Junos hardware internally adds a tag as well.

 

Juniper QFX 5100 & VMware ESXi Host NIC Teaming: Design Considerations

The objective of this article is to highlight design considerations for NIC teaming between a Juniper QFX 5100 (Virtual Chassis, VC) and a VMware ESXi host.

The reference topology is as under:

We have 2 x Juniper QFX 5100-48S switches deployed as a VC to provide connectivity to compute machines. All compute machines run the VMware ESXi hypervisor. A Link Aggregation Group (LAG, i.e. active/active NIC teaming) is required between the compute machines and the QFX 5100 VC.

  • Data traffic from server to switch: the xe-0/0/0 interface on both switches is connected to NICs 3 & 4 on a single compute machine.
  • ESXi host management and vMotion traffic from server to switch: the xe-0/0/45 interface on both switches is connected to NICs 1 & 2 on the compute machine.
  • VLAN IDs
    • Data VLANs: 116, 126
    • vMotion: 12
    • ESXi management: 11

Hence, the requirement is to configure a LAG (active/active NIC teaming) between the compute machines and the network switch for optimal link utilization, in addition to fault tolerance in case one physical link between the network switch and the compute machine goes down.

In order to achieve the required results, one needs to understand the default load-balancing mechanism over LAG member interfaces in Juniper devices, and a matching load-balancing mechanism must be configured for NIC teaming on VMware ESXi.

  • Juniper's default load balancing over LAG member interfaces is based on the "layer2-payload" hash mode, which takes into consideration the source IP, destination IP, source port and destination port (a simplified hash sketch follows this list).
  • To get comparable behavior on VMware ESXi hosts, active/active NIC teaming must be enabled with "Route Based on IP Hash".
  • Data traffic
    • The LAG will be configured on the Juniper switch with interface-mode trunk and all required VLANs will be allowed.
    • Active/active NIC teaming must be enabled with "Route Based on IP Hash" (LACP is only supported with the vCenter vDS; without vCenter we can configure a static LAG without LACP).
  • vMotion and ESXi management traffic
    • VMware does not recommend active/active NIC teaming over a link used for the VMkernel (vMotion), so active/passive NIC teaming will be configured for this link with "Route Based on Originating Port ID".
    • Both links on the Juniper switch will be configured as trunk, allowing both the vMotion and ESXi management VLANs; in addition, the ESXi management VLAN must be allowed as the native VLAN.
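To illustrate the idea behind "Route Based on IP Hash" (the real Juniper and ESXi hash functions are not reproduced here, so this is only a simplified sketch of the principle):

import ipaddress

# Both ends of the LAG hash the packet's addresses and pick a member link,
# so a given flow sticks to one link while different flows spread across the LAG.
def pick_uplink(src_ip, dst_ip, n_uplinks=2):
    key = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    return key % n_uplinks

print(pick_uplink("192.168.116.10", "192.168.126.20"))  # flow A always maps to the same uplink
print(pick_uplink("192.168.116.11", "192.168.126.20"))  # flow B may land on the other uplink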

Integrating SRX in a Service Provider Network (Routing and Multi-Tenancy Considerations)

Service provider networks always have complex requirements for multi-tenancy, routing and security, and they pose challenges to network architects. In this blog I will write about SRX integration in a service provider network, highlighting methodologies for handling the challenges of implementing security features alongside multi-tenancy and routing considerations.

[Figure: Reference topology]

Devices have been classified into the following segments based on their role:

  • Remote customer network (consists of customer PCs connected to the Provider Edge through the Customer Edge).
  • Provider network (consists of Provider Edge routers and Provider backbone routers).
  • Data center network (consists of the Internet firewall and the servers inside the data center directly connected to it).
  • Internet edge (consists of the Internet router connected to the Internet firewall, providing Internet access to the customer networks that reach the data center through the provider network).

Traffic flow and security requirements are as under:

  • Customer 1's network (PC-1) requires access to Server-1 in the data center and to a public DNS server reachable via the Internet edge router.
  • Customer 2's network (PC-2) requires access to Server-2 in the data center and to a public DNS server reachable via the Internet edge router.
  • Complete isolation of customer routing domains (end to end).
  • Implementation of security features (stateful firewall, IPS, UTM and application firewall) on client traffic accessing the servers inside the data center and the public servers.

Provider network implementation:

  • An IGP (OSPF / IS-IS) between the Provider Edge and backbone routers to provide reachability for the signaling protocol (RSVP / LDP).
  • MP-BGP sessions among the Provider Edge routers to exchange MPLS VPN NLRI.
  • Routing between the Customer Edge and the provider can be static, an IGP, or BGP.

The main focus of this discussion will be the Internet firewall (SRX): security, multi-tenancy and routing considerations.

[Figure: Internet firewall (SRX) routing instances]

Multi-tenancy is ensured by creating separate routing instances for Customer-1 and Customer-2.

  • The interfaces connected to Server-1 and Server-2 are placed in the customer-specific routing instances.
  • For connectivity between the Internet firewall and the Provider Edge router, sub-interfaces (on the same physical interface) are configured in the customer-specific VRFs on both devices.
  • In order to ensure Internet reachability for the customers' remote networks, the following approach is used.
    • On the Internet firewall, the interface connected to the Internet router is automatically placed in the master routing instance, so the Internet routes received from the Internet router (through BGP) are available in the Internet firewall's master routing table.
    • For the exchange of routing information from the customer instances to the master routing instance, the following approach is used.
      • Separate RIB groups (one per customer) are configured with the "import-rib" statement and applied to "interface-routes" and to OSPF in the customer instances.
      • This enables sharing of the customer interface routes and the customer remote network routes (learned on the Internet firewall through OSPF connectivity with the Provider Edge router).
      • To further control the route leakage (per the customer requirement, routes for the servers in the data center must not be leaked to the master instance), a routing policy can be configured and applied to the RIB group as an import policy.
      • To share the master routing instance's interface routes and the Internet routes, a routing policy can be configured and applied as "instance-import" inside the customer instances.
      • At this point these routes are available in the customer instance routing table but are not yet advertised to the remote customer networks.
      • To achieve the desired result, the same policy needs to be applied to OSPF as an export policy inside the customer instance.
      • This enables the advertisement of these routes to the Provider Edge router, which will further redistribute them to the remote customer networks through the MP-BGP sessions between the Provider Edge routers.

Despite the desired route sharing, the remote customer networks are still unable to reach the Internet, because the Internet has no reachability information for these private subnets. To resolve this, source NAT must be configured on the Internet firewall for traffic originating from the customer instances and heading toward the Internet router. Of course, security policies are also required to allow the traffic from the customer WAN zone to the server zone / Internet zone.

All these scenarios and requirements can be simulated in GNS3 using vSRX. I used vSRX in packet mode for the provider and Internet routers and, of course, in stateful (flow) mode for the Internet firewall.


Multi-Chassis Link Aggregation (MC-LAG)

In my earlier blog (Junos High Availability Design Guide) I discussed how to make use of redundant Routing Engines by configuring features such as GRES, NSR and NSB to reduce downtime to the minimum possible level.

The limitation is that only one RE is active at a time, and all PFEs must be connected to the active RE. If the primary Routing Engine (RE) fails, the backup RE takes over and all PFEs then need to connect to the new primary RE. This scenario can cause momentary disruption of services.

MC-LAG (active-active) addresses this problem because it offers two active REs in two different devices/chassis. Important concepts for proper MC-LAG configuration and functionality are as under:

  • Inter-Chassis Control Protocol (ICCP). The MC-LAG peers use ICCP to exchange control information and coordinate with each other to ensure that data traffic is forwarded properly. ICCP replicates control traffic and forwarding states across the MC-LAG peers and communicates the operational state of the MC-LAG members. It uses TCP as a transport protocol and requires Bidirectional Forwarding Detection (BFD) for fast convergence. Because ICCP uses TCP/IP to communicate between the peers, the two peers must have IP reachability to each other. ICCP messages exchange MC-LAG configuration parameters and ensure that both peers use the correct LACP parameters. The ICCP configuration parameters are as under:
    • local-ip-addr: the IP address on the local MC-LAG member used to establish the ICCP session with the MC-LAG peer (the lo0 address is recommended for ICCP peering).
    • peer: the IP address on the peer MC-LAG member used to establish the ICCP session with the local MC-LAG device (again, the lo0 address is recommended).
      • session-establishment-hold-time: 50 seconds is the recommended value for faster ICCP connection establishment between MC-LAG peers.
      • redundancy-group-id-list: must be the same on both MC-LAG peers and is referenced in the mc-ae configuration.
      • liveness-detection minimum-interval: the BFD timer used to detect failure of the MC-LAG peer (60 ms is used in this topology).
      • liveness-detection multiplier: used together with the minimum interval to detect failure of the ICCP peer; the default value is 3, so in this topology BFD failure detection takes 60 x 3 = 180 ms.
  • Inter-Chassis Link (ICL): the ICL is used to forward data traffic across the MC-LAG peers.
  • Multichassis aggregated Ethernet interface (MC-AE): one interface from each MC-LAG peer is connected to the downstream or upstream network device or compute machine. The connected devices do not know they are connected to two different chassis; they treat the link as a normal aggregated link and load-balance traffic across the LAG member interfaces.
    • lacp system-id: must be the same on both MC-LAG peers but unique from the other MC-AE interfaces. It is the LACP system ID transmitted to the upstream or downstream devices from both MC-LAG peers; links from both peers advertising the same system ID are treated as members of the same LAG.
    • lacp admin-key: must be the same on both MC-LAG peers but unique from the other MC-AE interfaces.
    • mc-ae-id: must be the same on both MC-LAG peers but unique from the other MC-AE interfaces.
    • mc-ae redundancy-group: must be the same on both MC-LAG peers and must match the redundancy-group value configured under ICCP.
    • mc-ae chassis-id: specifies the chassis ID that LACP uses to calculate the port numbers of the MC-LAG physical member links. Values: 0 or 1.
    • mc-ae mode: active-active is used in this topology; it ensures both MC-LAG peers actively send and receive data even though VRRP is master on only one MC-LAG peer.
    • mc-ae status-control (active / standby): describes the status of the MC-AE interface when the ICL goes down. It must be active on one MC-LAG peer and standby on the other.
    • prefer-status-control-active: specifies that the node configured as status-control active becomes the active node if its peer goes down.
  • multi-chassis-protection: if the ICCP connection is up and the inter-chassis link (ICL) comes up, the peer configured as standby brings up the multichassis aggregated Ethernet interfaces shared with the peer. Multichassis protection must be configured on one interface for each peer.
  • Hold time: configure a hold-down timer on the ICL member links that is greater than the configured BFD timer for the ICCP interface. This prevents the ICL from being advertised as down before the ICCP link is down; if the ICL goes down before the ICCP link, the MC-AE interface on the status-control standby node flaps, which delays convergence.
  • service-id: the switch service ID is used to synchronize applications, IGMP, ARP and MAC learning across the MC-LAG members.
  • arp-l2-validate: enables periodic checking of the ARP (Layer 3) and MAC (Layer 2) tables and fixes entries if they become out of sync between the MC-LAG peers.

Note: On EX9200 switches, the prefer-status-control-active statement is required together with the mc-ae status-control standby configuration to prevent the MC-LAG LACP system ID from reverting to the default LACP system ID on ICCP failure.

The hold-time down value (at the [edit interfaces interface-name] hierarchy level) for the ICL with the mc-ae status-control standby configuration must  be higher than the ICCP Bidirectional Forwarding Detection (BFD) timeout. This configuration prevents data traffic loss by ensuring that when the router or switch with the mc-ae status-control active configuration goes down, the router or switch with the mc-ae status-control standby configuration does not go into standby mode.
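As a quick consistency check of the timers used in the configuration below (60 ms BFD minimum interval, multiplier 3, 300 ms ICL hold-down), a tiny Python sketch:

# The ICL hold-down timer must exceed the ICCP BFD detection time.
bfd_min_interval_ms = 60
bfd_multiplier = 3
icl_hold_down_ms = 300

bfd_detection_ms = bfd_min_interval_ms * bfd_multiplier      # 180 ms
assert icl_hold_down_ms > bfd_detection_ms, "increase hold-time down on the ICL member links"
print(f"BFD detection {bfd_detection_ms} ms < ICL hold-down {icl_hold_down_ms} ms -> OK")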


[Figure: MC-LAG reference topology]

Topology Description 

  1. In the above topology, 2 x EX9208 are deployed as the campus core routers/switches.
  2. ae0 is configured as the ICCP link and ae1 is configured as the ICL between the MC-LAG pair.
  3. The access layer device (an EX4300 VC) is connected to ae3 on the MC-LAG pair (a normal LAG on the access side, an MC-AE interface on the MC-LAG pair).
  4. The upstream device is the service provider router, connected to ae2 on the MC-LAG pair.
  5. VRRP over IRB (integrated routing and bridging interface; the IRB is the same as the RVI for old Junos EX folks or the SVI for Cisco folks) will be configured so that downstream access devices see a single gateway on the MC-LAG pair.
  6. OSPF will be configured between the campus core routers (i.e. the MC-LAG pair) and the service provider router. The service provider router will see the MC-LAG pair as two different next hops and will have two OSPF neighbors (one with each MC-LAG member device).

Note: VRRP over IRB will be configured to provide a single gateway on the MC-LAG pair to the access layer devices. The normal VRRP behavior is that only the VRRP master is active while the backup stands ready to take over if the master fails, but in active-active MC-LAG scenarios both VRRP members transmit and receive data.

Let's start with the step-by-step configuration and explanation.

ICCP Link Configuration

set chassis aggregated-devices ethernet device-count 4 # create 4 x ae interfaces (ae0-ae3) on both MC-LAG peers

#ICCP-link configuration 

set interfaces ae0 description ICCP-PL # EX-9208-1
set interfaces ae0 aggregated-ether-options lacp active
set interfaces ae0 aggregated-ether-options lacp periodic fast
set interfaces ae0 unit 0 family inet address 172.172.172.1/30

set interfaces xe-0/3/6 ether-options 802.3ad ae0
set interfaces xe-0/3/7 ether-options 802.3ad ae0

set interfaces lo0 unit 0 family inet address 1.1.1.1/32

 

set interfaces ae0 description ICCP-PL # EX-9208-2 
set interfaces ae0 aggregated-ether-options lacp active
set interfaces ae0 aggregated-ether-options lacp periodic fast
set interfaces ae0 unit 0 family inet address 172.172.172.2/30

set interfaces xe-0/3/6 ether-options 802.3ad ae0
set interfaces xe-0/3/7 ether-options 802.3ad ae0

set interfaces lo0 unit 0 family inet address 2.2.2.2/32

# Configure OSPF on ae0 and lo0.0 on both MC-LAG members; the lo0 address of each member will later be used for the Inter-Chassis Control Protocol (ICCP) configuration

set protocols ospf area 0.0.0.0 interface ae0.0 interface-type p2p

set protocols ospf area 0.0.0.0 interface lo0.0

# ICL Configuration

set interfaces xe-0/3/4 ether-options 802.3ad ae1 # EX-9208-1
set interfaces xe-0/3/5 ether-options 802.3ad ae1

set interfaces xe-0/3/4 hold-time up 0 # EX-9208-2 i.e mc-ae status-control standby 
set interfaces xe-0/3/4 hold-time down 300
set interfaces xe-0/3/4 ether-options 802.3ad ae1
set interfaces xe-0/3/5 hold-time up 0
set interfaces xe-0/3/5 hold-time down 300
set interfaces xe-0/3/5 ether-options 802.3ad ae1

set interfaces ae1 description ICL-LINK # Same configuration on both MC-LAG peers
set interfaces ae1 aggregated-ether-options lacp active
set interfaces ae1 aggregated-ether-options lacp periodic fast
set interfaces ae1 unit 0 family ethernet-switching interface-mode trunk
set interfaces ae1 unit 0 family ethernet-switching vlan members all

# ICCP Configuration

set protocols iccp local-ip-addr 1.1.1.1  #EX-9208-1
set protocols iccp peer 2.2.2.2 session-establishment-hold-time 50
set protocols iccp peer 2.2.2.2  redundancy-group-id-list 1
set protocols iccp peer 2.2.2.2  liveness-detection minimum-interval 60
set protocols iccp peer 2.2.2.2  liveness-detection multiplier 3

set protocols iccp local-ip-addr 2.2.2.2  #EX-9208-2
set protocols iccp peer 1.1.1.1 session-establishment-hold-time 50
set protocols iccp peer 1.1.1.1  redundancy-group-id-list 1
set protocols iccp peer 1.1.1.1 liveness-detection minimum-interval 60
set protocols iccp peer 1.1.1.1  liveness-detection multiplier 3

# Multi-Chassis protection 
set multi-chassis multi-chassis-protection 2.2.2.2 interface ae1 #EX-9208-1, ae1 is ICL

set multi-chassis multi-chassis-protection 1.1.1.1 interface ae1 #EX-9208-2, ae1 is ICL

# service-id

set switch-options service-id 1 # Must be same on both MC-LAG peers

# Periodic ARP synchronization  

set interfaces irb arp-l2-validate # Must be same on both MC-LAG peers

MC-AE interface ae2 (connected to the uplink device, i.e. the service provider router)

set interfaces xe-0/3/3 ether-options 802.3ad ae2 #EX-9208-1

set interfaces ae2 description to-WAN
set interfaces ae2 aggregated-ether-options lacp active
set interfaces ae2 aggregated-ether-options lacp periodic fast
set interfaces ae2 aggregated-ether-options lacp system-id 00:00:00:00:00:02
set interfaces ae2 aggregated-ether-options lacp admin-key 2
set interfaces ae2 aggregated-ether-options mc-ae mc-ae-id 2
set interfaces ae2 aggregated-ether-options mc-ae redundancy-group 1
set interfaces ae2 aggregated-ether-options mc-ae chassis-id 0
set interfaces ae2 aggregated-ether-options mc-ae mode active-active
set interfaces ae2 aggregated-ether-options mc-ae status-control active
set interfaces ae2 unit 0 family ethernet-switching interface-mode access
set interfaces ae2 unit 0 family ethernet-switching vlan members WAN

set interfaces xe-0/3/3 ether-options 802.3ad ae2 #EX-9208-2

set interfaces ae2 description to-WAN
set interfaces ae2 aggregated-ether-options lacp active
set interfaces ae2 aggregated-ether-options lacp periodic fast
set interfaces ae2 aggregated-ether-options lacp system-id 00:00:00:00:00:02
set interfaces ae2 aggregated-ether-options lacp admin-key 2
set interfaces ae2 aggregated-ether-options mc-ae mc-ae-id 2
set interfaces ae2 aggregated-ether-options mc-ae redundancy-group 1
set interfaces ae2 aggregated-ether-options mc-ae chassis-id 1
set interfaces ae2 aggregated-ether-options mc-ae mode active-active
set interfaces ae2 aggregated-ether-options mc-ae status-control standby
set interfaces ae2 aggregated-ether-options mc-ae events iccp-peer-down prefer-status-control-active
set interfaces ae2 unit 0 family ethernet-switching interface-mode access
set interfaces ae2 unit 0 family ethernet-switching vlan members WAN

 

MC-AE interface ae3 (connected to the downlink device, i.e. the EX4300 Virtual Chassis)

set interfaces xe-0/3/2 ether-options 802.3ad ae3 #EX-9208-1

set interfaces ae3 description to-ACCESS-Device
set interfaces ae3 aggregated-ether-options lacp active
set interfaces ae3 aggregated-ether-options lacp periodic fast
set interfaces ae3 aggregated-ether-options lacp system-id 00:00:00:00:00:03
set interfaces ae3 aggregated-ether-options lacp admin-key 3
set interfaces ae3 aggregated-ether-options mc-ae mc-ae-id 3
set interfaces ae3 aggregated-ether-options mc-ae redundancy-group 1
set interfaces ae3 aggregated-ether-options mc-ae chassis-id 0
set interfaces ae3 aggregated-ether-options mc-ae mode active-active
set interfaces ae3 aggregated-ether-options mc-ae status-control active
set interfaces ae3 unit 0 family ethernet-switching interface-mode access
set interfaces ae3 unit 0 family ethernet-switching vlan members DATA

set interfaces xe-0/3/2 ether-options 802.3ad ae3 #EX-9208-2

set interfaces ae3 description to-ACCESS-Device
set interfaces ae3 aggregated-ether-options lacp active
set interfaces ae3 aggregated-ether-options lacp periodic fast
set interfaces ae3 aggregated-ether-options lacp system-id 00:00:00:00:00:03
set interfaces ae3 aggregated-ether-options lacp admin-key 3
set interfaces ae3 aggregated-ether-options mc-ae mc-ae-id 3
set interfaces ae3 aggregated-ether-options mc-ae redundancy-group 1
set interfaces ae3 aggregated-ether-options mc-ae chassis-id 1
set interfaces ae3 aggregated-ether-options mc-ae mode active-active
set interfaces ae3 aggregated-ether-options mc-ae status-control standby

set interfaces ae3 aggregated-ether-options mc-ae events iccp-peer-down prefer-status-control-active
set interfaces ae3 unit 0 family ethernet-switching interface-mode access
set interfaces ae3 unit 0 family ethernet-switching vlan members DATA

# VRRP and IRB Configuration 

 

set interfaces irb unit 160 family inet address 10.102.160.2/29 arp 10.102.160.3 l2-interface ae1.0 # irb.160 will be used on EX 9208-1 to establish OSPF peering with the uplink router

#static ARP entry for MC-LAG peer is required for VRRP over IRB configuration , ae1 is ICL

set interfaces irb unit 160 family inet address 10.102.160.2/29 arp 10.102.160.3 mac cc:e1:7f:a7:43:f0

#mac address of IRB interface from other MC-LAG peer and can be obtained by operation mode command “show interface irb” once ICCP session is established.

set interfaces irb unit 160 family inet address 10.102.160.2/29 vrrp-group 160 virtual-address 10.102.160.1
set interfaces irb unit 160 family inet address 10.102.160.2/29 vrrp-group 160 priority 254 #VRRP master 
set interfaces irb unit 160 family inet address 10.102.160.2/29 vrrp-group 160 accept-data

 

set interfaces irb unit 160 family inet address 10.102.160.3/29 arp 10.102.160.2 l2-interface ae1.0 # irb.160 will be used on EX 9208-2 to establish OSPF peering with the uplink router

#static ARP entry for MC-LAG peer is required for VRRP over IRB configuration , ae1 is ICL
set interfaces irb unit 160 family inet address 10.102.160.3/29 arp 10.102.160.2 mac cc:e1:7f:a7:3f:f0

#mac address of IRB interface from other MC-LAG peer and can be obtained by operation mode command “show interface irb” once ICCP session is established.
set interfaces irb unit 160 family inet address 10.102.160.3/29 vrrp-group 160 virtual-address 10.102.160.1
set interfaces irb unit 160 family inet address 10.102.160.3/29 vrrp-group 160 priority 100 #VRRP backup
set interfaces irb unit 160 family inet address 10.102.160.3/29 vrrp-group 160 accept-data

 

set interfaces irb unit 50 family inet address 10.102.50.2/23 arp 10.102.50.3 l2-interface ae1.0 #irb.50 will be used on EX 9208-1  as gateway for  traffic coming from access devices 

#static ARP entry for MC-LAG peer is required for VRRP over IRB configuration , ae1 is ICL

set interfaces irb unit 50 family inet address 10.102.50.2/23 arp 10.102.50.3 mac cc:e1:7f:a7:43:f0

#mac address of IRB interface from other MC-LAG peer and can be obtained by operation mode command “show interface irb” once ICCP session is established.
set interfaces irb unit 50 family inet address 10.102.50.2/23 vrrp-group 50 virtual-address 10.102.50.1
set interfaces irb unit 50 family inet address 10.102.50.2/23 vrrp-group 50 priority 254 #VRRP master
set interfaces irb unit 50 family inet address 10.102.50.2/23 vrrp-group 50 accept-data

 

set interfaces irb unit 50 family inet address 10.102.50.3/23 arp 10.102.50.2 l2-interface ae1.0 #irb.50 will be used on EX 9208-2  as gateway for  traffic coming from access devices 

#static ARP entry for MC-LAG peer is required for VRRP over IRB configuration , ae1 is ICL

set interfaces irb unit 50 family inet address 10.102.50.3/23 arp 10.102.50.2 mac cc:e1:7f:a7:3f:f0

#mac address of IRB interface from other MC-LAG peer and can be obtained by operation mode command “show interface irb” once ICCP session is established
set interfaces irb unit 50 family inet address 10.102.50.3/23 vrrp-group 50 virtual-address 10.102.50.1
set interfaces irb unit 50 family inet address 10.102.50.3/23 vrrp-group 50 priority 100 #VRRP backup (assumed, mirroring the irb.160 backup configuration)
set interfaces irb unit 50 family inet address 10.102.50.3/23 vrrp-group 50 accept-data #assumed, mirroring irb.160

#VLAN Configuration  

set vlans DATA vlan-id 50 l3-interface irb.50

set vlans WAN vlan-id 160 l3-interface irb.160

#OSPF Configuration in MC-LAG Peers for uplink router 

set protocols ospf area 0.0.0.0 interface irb.160

Both MC-LAG peers will establish OSPF neighborship with the uplink router and with each other.

Junos High Availability Design Guide

High availability is one of the important considerations during the network design and deployment stages, and almost all network vendors support various high availability features.

The objective of this article is to describe Junos best practices required to achieve minimum downtime in case of fail-over scenarios.

The Routing Engine, or control plane, is the brain of a Junos-based device and runs all the management functions. Most Junos-based devices offer redundant Routing Engines (either by default in the chassis or through an explicitly configured Virtual Chassis). Only one Routing Engine can be active at a time (with the exception of active-active MC-LAG, which is beyond the scope of this blog). The mere presence of a second Routing Engine in a Junos device adds no high availability advantage unless certain features are configured.

  • Graceful Routing Engine Switchover (GRES). GRES enables synchronization of the kernel and chassis daemon between the master and backup Routing Engines; if the master Routing Engine fails, the Packet Forwarding Engines (PFEs) simply reconnect to the new master Routing Engine (which was the backup before the failover).

[Figure: Preparing for a Graceful Routing Engine Switchover]

 

[Figure: Graceful Routing Engine Switchover Process]

GRES can be configured with the following command:

set chassis redundancy graceful-switchover

Effects of above configuration can be monitored on backup RE

{backup:1}

show system switchover

fpc1:————————————————

Graceful switchover: On

Configuration database: Ready

Kernel database: Ready

Peer state: Steady State

If GRES is not enabled and the primary Routing Engine fails, the kernel daemon restarts on the new master Routing Engine, after which the chassis daemon restarts, and traffic drops continuously during this whole process.

  • Nonstop Active Routing (NSR). As described above, GRES only synchronizes the kernel and chassis daemon between the primary and backup Routing Engines; it does not synchronize the routing daemon (rpd). If the primary Routing Engine fails, the new primary starts the routing daemon and all peerings have to be re-established. Nonstop Active Routing avoids this scenario.

 

If we compare the GRES figure above with the NSR figures below, we can see that rpd now also runs on the backup Routing Engine.

[Figure: Nonstop Active Routing Switchover Preparation Process]

[Figure: Nonstop Active Routing During a Switchover]

 

Use the following commands to enable and verify NSR:

set system commit synchronize
set routing-options nonstop-routing
Results can be verified by:-

run show task replication

Stateful Replication: Enabled

RE mode: Master

Protocol Synchronization Status

OSPF Complete

  • Nonstop Bridging (NSB). NSB is very similar to NSR, but works for Layer 2 protocols such as xSTP, LLDP, LLDP-MED and LACP. It starts l2cpd, the daemon that controls Layer 2 protocols in Junos, on the backup RE.

set protocols layer2-control nonstop-bridging