NSX Dual Active/Active Datacenters BCDR

Overview

Modern data center design requires strong redundancy and demands Business Continuity (BC) and Disaster Recovery (DR) capabilities in case of a catastrophic failure in the datacenter. Planning a new data center with BCDR requires meeting certain fundamental design guidelines.

In this blog post I will describe an Active/Active datacenter design built with the full VMware SDDC product suite.

NSX runs in Cross-vCenter mode, a capability introduced in VMware NSX release 6.2.x. In this blog post we will focus on networking and security.

An introduction and overview blog post can be found in this link:

http://blogs.vmware.com/consulting/2015/11/how-nsx-simplifies-and-enables-true-disaster-recovery-with-site-recovery-manager.html

The goals that we are trying to achieve in this post are:

  1. The ability to deploy workloads with vRA in both datacenters.
  2. Business Continuity in case of a partial or full site failure.
  3. The ability to perform planned or unplanned migration of workloads from one datacenter to the other.

To demonstrate the functionality of this design, I created a demo ‘vPOD’ in the VMware internal cloud with the following products in each datacenter:

  • vCenter 6.0 with ESXi 6.0 hosts
  • NSX 6.2.1
  • vRA 6.2.3
  • vSphere Replication 6.1
  • SRM 6.1
  • Cloud Client 3.4.1

In this blog post I will not cover the recovery of the vRA/vRO components themselves, but this could be achieved with a separate SRM instance for the management infrastructure.

Environment overview

I’m adding a short video to introduce the environment.

NSX Manager

The NSX Manager in Site A has the IP address 192.168.110.15 and is configured as primary.

The NSX Manager in Site B is configured with the IP 192.168.210.15 and is set as secondary.

Each NSX Manager pairs with its own vCenter and learns its local inventory. Any configuration change related to the cross-site deployment is made on the primary NSX Manager and replicated automatically to the remote site.
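If you want to confirm the pairing from the API side, the minimal Python sketch below queries each NSX Manager for its cross-vCenter role. The universal-sync API path and the credentials are assumptions for illustration only; verify them against your NSX version before relying on them.

```python
# Minimal sketch (assumed NSX-v universal-sync API path): query each NSX
# Manager for its cross-vCenter role (primary / secondary / standalone).
import requests

MANAGERS = {
    "Site A": "192.168.110.15",
    "Site B": "192.168.210.15",
}

for site, ip in MANAGERS.items():
    resp = requests.get(
        f"https://{ip}/api/2.0/universalsync/configuration/role",  # assumed path
        auth=("admin", "password"),  # replace with real credentials
        verify=False,                # lab only: self-signed certificates
    )
    print(f"{site} ({ip}): HTTP {resp.status_code} -> {resp.text.strip()}")
```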

 

Universal Logical Switch (ULS)

Creating L2 logical switches between sites with VXLAN is not new to NSX; however, starting from version 6.2.x we introduced the ability to stretch L2 between NSX Managers paired to different vCenters. This new logical switch is known as a ‘Universal Logical Switch’ (ULS). Any new ULS we create on the primary NSX Manager is synced to the secondary.
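As a rough illustration of how a ULS could be created programmatically, here is a hedged Python sketch that POSTs a logical switch to the universal transport zone on the primary NSX Manager. The scope ID 'universalvdnscope' and the exact API path are assumptions based on the NSX-v vdn API, not configuration captured from this lab.

```python
# Minimal sketch: create a Universal Logical Switch on the primary NSX Manager
# by POSTing to the universal transport zone scope (assumed scope ID and path).
import requests

PRIMARY_NSX = "192.168.110.15"
UNIVERSAL_TZ = "universalvdnscope"   # assumed universal transport zone scope ID

body = """
<virtualWireCreateSpec>
  <name>ULS_Green_Web-A</name>
  <description>Green web tier, Site A ingress/egress</description>
  <tenantId>default</tenantId>
</virtualWireCreateSpec>
"""

resp = requests.post(
    f"https://{PRIMARY_NSX}/api/2.0/vdn/scopes/{UNIVERSAL_TZ}/virtualwires",
    data=body,
    headers={"Content-Type": "application/xml"},
    auth=("admin", "password"),
    verify=False,
)
print(resp.status_code, resp.text)   # on success the new virtualwire ID is returned
```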

I’ve created the following ULSs in my demo vPOD:

Universal Logical Switch (ULS)

Universal Distributed Logical Router (UDLR)

The concept of a Distributed Logical Router is the same as it was before NSX 6.2.x. The new functionality added in this release allows us to configure a Universal Distributed Logical Router (UDLR). When we deploy a UDLR, it shows up in the Universal Transport Zone of all NSX Managers.
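For reference, here is a small Python sketch that lists the edges known to the primary NSX Manager and picks out the distributed logical routers. The /api/4.0/edges path is part of the NSX-v edge API, but the XML element names used below (edgeSummary, edgeType, isUniversal) are assumptions about the response format; check them against your build.

```python
# Minimal sketch: list edges on the primary NSX Manager and print the
# distributed logical routers, flagging which ones are universal.
import requests
import xml.etree.ElementTree as ET

PRIMARY_NSX = "192.168.110.15"

resp = requests.get(
    f"https://{PRIMARY_NSX}/api/4.0/edges",
    auth=("admin", "password"),
    verify=False,
)

root = ET.fromstring(resp.content)
for edge in root.iter("edgeSummary"):          # assumed element name
    if edge.findtext("edgeType", default="") == "distributedRouter":
        name = edge.findtext("name", default="?")
        universal = edge.findtext("isUniversal", default="false")
        print(f"{name}: universal={universal}")
```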

The following UDLR was created:

Universal Distributed Logical Router (UDLR)

Universal Security Policy with Distributed Firewall (UDFW)

With version 6.2.x we introduced the universal security group and the universal IP set.

Any firewall rule configured in the universal section must use IP sets, or security groups that contain IP sets.

When we configure or change a universal policy, a sync process automatically runs from the primary to the secondary NSX Manager.

The recommended way to work with an IP set is to add it to a universal security group.

The following universal security policy is an example that allows communication with a 3-tier application. The policy is built from universal security groups; each group contains an IP set with the relevant IP addresses for its tier.
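As a rough sketch of how these grouping objects could be created through the API, the Python below creates a universal IP set for the green web tier and adds it to a universal security group. The 'universalroot-0' scope, the API paths, and the XML bodies are assumptions based on the NSX-v grouping-objects API, not configuration exported from this lab.

```python
# Minimal sketch: create a universal IP set and a universal security group
# containing it, against the primary NSX Manager (assumed paths and scope).
import requests

PRIMARY_NSX = "192.168.110.15"
SCOPE = "universalroot-0"            # assumed universal scope identifier
AUTH = ("admin", "password")

ipset_body = """
<ipset>
  <name>IPSet-Green-Web</name>
  <value>172.16.10.0/24</value>
</ipset>
"""
r = requests.post(
    f"https://{PRIMARY_NSX}/api/2.0/services/ipset/{SCOPE}",
    data=ipset_body,
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,
)
ipset_id = r.text.strip()            # assumption: the new object ID is returned

sg_body = f"""
<securitygroup>
  <name>USG-Green-Web</name>
  <member>
    <objectId>{ipset_id}</objectId>
  </member>
</securitygroup>
"""
requests.post(
    f"https://{PRIMARY_NSX}/api/2.0/services/securitygroup/bulk/{SCOPE}",
    data=sg_body,
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,
)
```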

Universal Security Policy with Distributed Firewall (UDFW)

vRA

On the automation side we create two unique machine blueprints, one per site. The MBPs are based on a classic CentOS image that allows us to perform some connectivity tests.

The MBP named “Center-Site_A” is deployed by vRA to Site A into the green ULS named ULS_Green_Web-A.

The IP address pool configured for this ULS is 172.16.10.0/24.

The MBP named “Center-Site_B” is deployed by vRA to Site B into the blue ULS named ULS_Blue_Web-B.

The IP address pool configured for this ULS is 172.17.10.0/24.

vRA Catalog

Cloud Client:

To quote from the official VMware documentation:

“Typically, a vSphere hosted VM managed by vRA belongs to a reservation, which belongs to a compute resource (cluster), which in turn belongs to a vSphere Endpoint. The VMs reservation in vRA needs to be accurate in order for vRA to know which vSphere proxy agent to utilize to manage that VM in the underlying vSphere infrastructure. This is all well and good and causes few (if any) problems in a single site setup, as the VM will not normally move from the vSphere endpoint it is originally located on.

With a multi-site deployment utilizing Site Recovery Manager all this changes as part of the site to site fail over process involves moving VMs from one vCenter to another. This has the effect in vRA of moving the VM to a different endpoint, but the reservation becomes stale. As a result it becomes no longer possible to perform day 2 operation on the VMs until the reservation is updated.”

When we fail over VMs from Site A to Site B, Cloud Client runs the following actions behind the scenes to solve this challenge.

Process Flow for Planned Failover:

Process Flow for Planned Failover
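To make that flow concrete, here is a purely conceptual Python sketch of the planned-failover steps. Every function below is a hypothetical placeholder that only logs the step; none of this is the real Cloud Client or vRA API, and the plan, endpoint, and reservation names are invented for illustration.

```python
# Conceptual sketch of the reservation fix-up performed during a planned
# failover. All functions are hypothetical placeholders that just log a step.

def unregister_from_vra(vm: str) -> None:
    print(f"[vRA] unregister {vm} from the Site A endpoint")

def run_srm_recovery_plan(plan: str) -> None:
    print(f"[SRM] run recovery plan {plan} (Site A -> Site B)")

def register_with_endpoint(vm: str, endpoint: str) -> None:
    print(f"[vRA] register {vm} under endpoint {endpoint}")

def update_reservation(vm: str, reservation: str) -> None:
    print(f"[vRA] move {vm} to reservation {reservation} so day-2 ops work again")

def planned_failover(vm: str) -> None:
    unregister_from_vra(vm)
    run_srm_recovery_plan("RP-Green-Workloads")       # hypothetical plan name
    register_with_endpoint(vm, "vCenter-Site-B")      # hypothetical endpoint
    update_reservation(vm, "Reservation-Site-B")      # hypothetical reservation

planned_failover("Center-Site_A-001")
```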

The Conceptual Routing Design with Active/Active Datacenter

The key point of this design is to run workloads Active/Active in both datacenters.

Workloads reside in both Site A and Site B. In the modern datacenter, the entry point is protected by a perimeter firewall.

In our design each site has its own perimeter firewall running independently: FW_A located in Site A and FW_B located in Site B.

Site A (shown in green) runs its own ESGs (Edge Services Gateways), Universal DLR (UDLR) and Universal Logical Switches (ULS).

Site B (shown in blue) has its own ESGs, UDLR and ULSs.

The main reason for separate ESGs, UDLR and ULSs per site is to force a single ingress/egress point for workload traffic per site.

Without this deterministic ingress/egress traffic flow, we may face asymmetric routing between the two sites: ingress traffic would enter via FW_A in Site A while egress traffic leaves via FW_B in Site B, and this asymmetric traffic would be dropped by FW_B.

Note: The ESGs in this blog run in ECMP mode; as a consequence, the firewall service is turned off on the ESGs.

The green networks will always be advertised via FW_A. For example, when the control VM (IP 192.168.110.10) shown in the figure below needs to access the green Web VM connected to ULS_Web_Green_A, the traffic from the client is routed via the core router to FW_A, from there to one of the ESGs working in ECMP mode, then to the green UDLR, and finally to the green Web VM itself.

Now assume the same client would like to access the blue Web VM connected to ULS_Web_Blue_B. This traffic is routed via the core router to FW_B, from there to one of the blue ESGs working in ECMP mode, then to the blue UDLR, and finally to the blue VM itself.

Routing Design with Active/Active Datacenter

What is the issue with this design?

What happens if we face a complete failure of one of our Edge clusters or of FW_A?

For our scenario I’ve combined failures of the Green Edge cluster and FW_A in the image below.

In that case we lose all north-south traffic to every ULS behind the green Edge cluster.

As a result, all clients outside the SDDC immediately lose connectivity to all of the green ULSs.

Please note: traffic forwarding to the blue ULSs continues to work in this event, regardless of the failure in Site A.

 

PIC7

If we had a stretched vSphere Edge cluster between Site A and Site B, we could leverage vSphere HA to restart the failed green ESGs in the remote blue site (this is not the case here; in our design each site has its own local cluster and storage), but even with vSphere HA the restart process can take a few minutes. Another way to recover from this failure is to manually deploy green ESGs in Site B and connect them to FW_B in Site B; the recovery time of this option can also be a few minutes. Neither option is suitable for a modern datacenter design.

In the next paragraph I will introduce a new way to design the ESGs in Active/Active datacenter architecture.

This design recovers much faster and more efficiently from such an event in Site A (or Site B).

Active/Active Datacenter with mirrored ESGs

In this architecture we deploy mirrored green ESGs in Site B and mirrored blue ESGs in Site A. Under normal datacenter operation the mirrored ESGs are up and running but do not forward traffic. Traffic from external clients to the Site A green ULSs always enters via the Site A ESGs (E1-Green-A, E2-Green-A) for all Site A prefixes and leaves through the same point.

Adding the mirrored ESGs adds some complexity to the single ingress/egress design, but improves the convergence time after a failure.

PIC8

How does ingress traffic flow work in this design?

Now we will explain how ingress traffic flows in this architecture with mirrored ESGs. To simplify the explanation, we will focus only on the green flow in both datacenters and remove the blue components from the diagrams, but the same explanation applies to the blue Site B network as well.

The Site A green UDLR control VM runs eBGP with all green ESGs (E1-Green-A through E4-Green-B). The UDLR redistributes all connected interfaces as Site A prefixes via eBGP. Note: “Site A prefixes” means any green segments that are part of the green ULSs.

The green ESGs (E1-Green-A through E4-Green-B) advertise Site A’s prefixes via BGP to both physical firewalls: FW_A located in Site A and FW_B located in Site B.

FW_B in Site B adds BGP AS-path prepending for the Site A prefixes.

From the core router’s point of view, there are two different paths to reach the Site A prefixes: one via FW_A (Site A) and the second via FW_B (Site B). Under normal operation this traffic flows only through Site A, because Site B prepends the AS path for the Site A prefixes, making its path longer.
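A toy Python sketch of that decision is shown below. The AS numbers are invented for illustration, and the comparison is reduced to AS-path length only, which is enough to show why the core sticks with FW_A under normal operation.

```python
# Toy illustration of why AS-path prepending keeps ingress on Site A: the core
# router compares the two advertisements for the Site A prefixes and prefers
# the shorter AS path. Simplified BGP decision, not a full implementation.

advertisements = {
    "via FW_A (Site A)": ["65001"],                      # normal AS path (example ASNs)
    "via FW_B (Site B)": ["65002", "65002", "65002"],    # prepended by FW_B
}

best = min(advertisements, key=lambda nbr: len(advertisements[nbr]))
print(f"Core router best path for Site A prefixes: {best}")
# -> "via FW_A (Site A)"; FW_B is used only if the Site A path disappears.
```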

PIC9

Egress Traffic

Egress traffic is handled by the UDLR control VM using different BGP weight values.

The Site A ESGs E1-Green-A and E2-Green-A have mirrored ESGs, E3-Green-B and E4-Green-B, located in Site B. The mirrored ESGs provide availability. Under normal operation the UDLR control VM always prefers to route traffic via E1-Green-A and E2-Green-A because of their higher BGP weight value; E3-Green-B and E4-Green-B do not forward any traffic and wait for E1/E2 to fail.
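The following toy sketch illustrates the weight-based choice on the green UDLR; the weight values are arbitrary examples, not the ones configured in the lab.

```python
# Toy illustration of the egress choice on the green UDLR: default routes
# learned from the local Site A ESGs carry a higher BGP weight than those from
# the mirrored Site B ESGs, so the mirrors forward nothing while E1/E2 are up.

default_routes = {
    "E1-Green-A": 60,   # example weights only
    "E2-Green-A": 60,
    "E3-Green-B": 30,
    "E4-Green-B": 30,
}

best_weight = max(default_routes.values())
active_next_hops = [esg for esg, w in default_routes.items() if w == best_weight]
print("UDLR ECMP next hops for 0.0.0.0/0:", active_next_hops)
# -> ['E1-Green-A', 'E2-Green-A']; E3/E4 take over only when both of these fail.
```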

In the figure below we can see a Web workload running on the Site A ULS_Green_A initiating traffic to the core. This egress traffic passes through the DLR kernel module, through the E1-Green-A ESG, and is then forwarded to FW_A in Site A.

PIC10

There are other options for ingress/egress within NSX 6.2:

There is a great new feature called ‘Locale ID’. Hany Michael wrote a blog post covering this option.

Hany’s design does not include a perimeter firewall like mine does, so please pay attention to a few minor differences.

http://www.networkskyx.com/2016/01/06/introducing-the-vmware-nsx-vlab-2-0/

Anthony Burke wrote a blog post about how to use Locale ID with a physical firewall:

https://networkinferno.net/ingress-optimisation-with-nsx-for-vsphere

Routing updates

Below, we demonstrate the routing updates for Site A, but the same mechanism works for Site B. The core router connected to FW_A in Site A peers with FW_A via eBGP.

The core advertises a 0.0.0.0/0 default route.

FW_A peers via eBGP with both E1-Green-A and E2-Green-A. FW_A forwards the 0/0 default route to the green ESGs and receives the Site A green prefixes from them. The green ESGs E1-Green-A and E2-Green-A peer via eBGP with the UDLR control VM.

The UDLR and the ESGs work in ECMP mode; as a result, the UDLR receives the 0/0 route from both ESGs. The UDLR redistributes its connected interfaces (LIFs) to both green ESGs.

The golden rule of iBGP is that each iBGP router must peer with all other iBGP neighbors unless we use route reflectors or confederations (currently not supported on NSX). Therefore, if iBGP were used, this restriction would force a full mesh of peerings between all ESGs and the UDLR control VMs, and the number of sessions grows quadratically as ESGs are added. As a result, it is a better decision to use eBGP between the ESGs and the UDLR control VM.
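The arithmetic behind that full-mesh objection is simple; the short sketch below counts the sessions an n-speaker iBGP mesh would need.

```python
# Quick arithmetic behind the iBGP full-mesh objection: without route
# reflectors, n iBGP speakers need n*(n-1)/2 sessions, which grows quickly
# as ESGs are added; eBGP between the ESGs and the UDLR avoids that mesh.

def full_mesh_sessions(n: int) -> int:
    return n * (n - 1) // 2

for speakers in (3, 5, 9, 17):
    print(f"{speakers} iBGP speakers -> {full_mesh_sessions(speakers)} sessions")
```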

To reduce the eBGP convergence time when the active UDLR control VM fails, we configure floating static routes on all of the green ESGs that point to the UDLR forwarding address for the internal LIFs.

Routing filters are applied on all ESGs to prevent unwanted prefix advertisements and to keep the ESGs from becoming transit gateways.

PIC11

Failure of One Green ESG in Site A

The green ESGs E1-Green-A and E2-Green-A work in ECMP mode. From the UDLR and FW_A point of view, both ESGs are Active/Active.

As long as there is at least one active green ESG in Site A, the green UDLR and the core router will always prefer to use the Site A green ESGs.

Let’s assume there is an active traffic flow from the green Web VM in Site A to an external client behind the core router, and this traffic initially passes through E1-Green-A. In the event of a failure of the E1-Green-A ESG, the UDLR reroutes the traffic via E2-Green-A, because this ESG has a better weight than the green ESGs in Site B (E3-Green-B and E4-Green-B).

FW_A still advertises a better AS path for the ‘ULS_Web_Green_A’ prefixes than FW_B (remember, FW_B always prepends the Site A prefixes).

We use aggressive BGP timer settings (hello/keepalive = 1 sec, hold-down = 3 sec) to improve BGP routing convergence.
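If you want to inspect those timers programmatically, here is a hedged Python sketch that reads an ESG’s BGP configuration. The edge ID, the API path, and the XML element names (bgpNeighbour, keepAliveTimer, holdDownTimer) are assumptions based on the NSX-v edge routing API; confirm them against your NSX Manager before automating any changes.

```python
# Minimal sketch: read an ESG's BGP configuration and print the per-neighbour
# timers (assumed path and XML element names).
import requests
import xml.etree.ElementTree as ET

PRIMARY_NSX = "192.168.110.15"
EDGE_ID = "edge-1"   # hypothetical ID of E1-Green-A

resp = requests.get(
    f"https://{PRIMARY_NSX}/api/4.0/edges/{EDGE_ID}/routing/config/bgp",
    auth=("admin", "password"),
    verify=False,
)

root = ET.fromstring(resp.content)
for neighbour in root.iter("bgpNeighbour"):
    print(
        neighbour.findtext("ipAddress"),
        "keepalive:", neighbour.findtext("keepAliveTimer"),
        "hold-down:", neighbour.findtext("holdDownTimer"),
    )
```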

 

PIC12

Complete Edge cluster failure in Site A

In this scenario we face a failure of the entire Edge cluster in Site A (green ESGs and blue ESGs); the failure might also include FW_A.

The core router no longer receives any BGP updates from Site A, so the core prefers to go through FW_B in order to reach any Site A prefix.

From the UDLR’s point of view there aren’t any working green ESGs left in Site A, so the UDLR uses the remaining green ESGs in Site B (E3-Green-B and E4-Green-B).

Traffic initiated from the external client is rerouted via the mirrored green ESGs (E3-Green-B and E4-Green-B) to the green ULSs in Site B. The reroute happens very quickly thanks to the aggressive BGP timer settings (hello = 1 sec, hold-down = 3 sec).

This solution is much faster than other options mentioned before.

The same recovery mechanism exists for a failure in the Site B datacenter.

PIC13

Note: The Green UDLR control VM was deployed to the payload cluster and isn’t affected by this failure.

 

Complete Site A failure:

In this catastrophic scenario all components in Site A have failed, including the management infrastructure (vCenter, NSX Manager, controllers, ESGs and the UDLR control VM). The green workloads face an outage until they are recovered in Site B, while the blue workloads continue to work without any interference.

The recovery procedure for this event covers both the management/control plane components and the workloads themselves.

Recovering the management/control plane:

  • Log in to the secondary NSX Manager and promote it to primary using “Assign Primary Role” (a minimal API sketch follows this list).
  • Deploy a new Universal Controller Cluster and synchronize all universal objects.
  • The Universal Controller Cluster configuration is pushed to the ESXi hosts managed by the (former) secondary NSX Manager.
  • Redeploy the UDLR control VM.
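Here is that minimal sketch. The ‘set-as-primary’ action and the universal-sync path are assumptions about the NSX-v API; in the lab this promotion was done from the vSphere Web Client as described above.

```python
# Minimal sketch of the first recovery step: promote the surviving NSX Manager
# in Site B to primary (assumed API path and action name).
import requests

SECONDARY_NSX = "192.168.210.15"

resp = requests.post(
    f"https://{SECONDARY_NSX}/api/2.0/universalsync/configuration/role",
    params={"action": "set-as-primary"},   # assumed action name
    auth=("admin", "password"),
    verify=False,
)
print("Promotion request returned HTTP", resp.status_code)
```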

The recovery procedure for the workloads runs the recovery plan from the SRM instance located in Site B.

PIC14

 

Summary:

In this blog post we demonstrated the power of NSX to create an Active/Active datacenter with the ability to recover very quickly from many failure scenarios.

  • We showed how NSX simplifies the Disaster Recovery process.
  • NSX and SRM integration is a reasonable approach to DR where we can’t use a stretched vSphere cluster.
  • NSX works in Cross-vCenter mode. Dual vCenters and NSX Managers improve our availability; even in the event of a complete site failure, we were able to continue working immediately in our management layer (the secondary NSX Manager and vCenter were up and running).
  • In this design, half of our environment (the blue segments) wasn’t affected by a complete site failure. SRM recovered the failed green workloads without any need to change our Layer 2/Layer 3 network topology.
  • We did not use any specific hardware to achieve BCDR, and we were 100% decoupled from the physical layer.
  • With SRM and vRO we were able to protect any deployed VM from day 0.

 

I would like to thank:

Daniel Bakshi, who helped me a lot by reviewing this blog post.

Also thanks to Boris Kovalev and Tal Moran, who helped with the vRA/vRO demo vPOD.

 

 

 

Comments
  1. Anonymous says:

    Great post! Shows the power of network virtualisation.

  2. Stunning job Roie, thanks for sharing!

  3. My understanding is that the MTU size should be set to at least 1600 bytes across the physical network connecting both DCs end to end.
    Also, VXLAN does not support fragmentation,
    so the service provider that provides the connectivity between my two DCs should also support an MTU of 1600 bytes without fragmentation.

    Let me know if my understanding above is right.

    • roie9876@gmail.com says:

      Yes, you are right. The MTU needs to be at least 1600 bytes between the datacenters for Cross-vCenter VXLAN.
