NSX Dual Active/Active Datacenters BCDR

Overview

The modern data center design requires better redundancy and demands the ability to have Business Continuity (BC) and Disaster Recovery (DR) in case of catastrophic failure in our datacenter. Planning a new data center with BCDR requires meeting certain fundamental design guidelines.

In this blog post I will describe an Active/Active datacenter design built with the full VMware SDDC product suite.

NSX runs in Cross-vCenter mode, a capability introduced in VMware NSX release 6.2.x. In this blog post we will focus on networking and security.

An introduction and overview blog post can be found in this link:

http://blogs.vmware.com/consulting/2015/11/how-nsx-simplifies-and-enables-true-disaster-recovery-with-site-recovery-manager.html

The goals that we are trying to achieve in this post are:

  1. Deploy workloads with vRA in both datacenters.
  2. Provide Business Continuity in case of a partial or a full site failure.
  3. Perform planned or unplanned migration of workloads from one datacenter to the other.

To demonstrate the functionality of this design I’ve created a demo ‘vPOD’ in the VMware internal cloud with the following products in each datacenter:

  • vCenter 6.0 with ESXi host 6.0
  • NSX 6.2.1
  • vRA 6.2.3
  • vSphere Replication 6.1
  • SRM 6.1
  • Cloud Client 3.4.1

In this blog post I will not cover the recovery of the vRA/vRO components, but this could be achieved with a separate SRM instance for the management infrastructure.

Environment overview

I’m adding a short video to introduce the environment.

NSX Manager

The NSX manager in Site A will have the IP address of 192.168.110.15 and will be configured as primary.

The NSX Manager in site B will be configured with the IP 192.168.210.15 and is set as secondary.

Each NSX manager pairs with its own vCenter and learns its local inventory. Any configuration change related to the cross site deployment will run at the primary NSX manager and will be replicated automatically to the remote site.

 

Universal Logical Switch (ULS)

Creating logical switches (L2) between sites with VXLAN is not new to NSX. However, starting with version 6.2.x we’ve introduced the ability to stretch L2 between NSX Managers paired with different vCenters. This new logical switch is known as a ‘Universal Logical Switch’ or ‘ULS’. Any new ULS we create on the primary NSX Manager will be synced to the secondary.

I’ve created the following ULS in my Demo vPOD:

Universal Logical Switch (ULS)

Universal Distributed Logical Router (UDLR)

The concept of a Distributed Logical Router is still the same as it was before NSX 6.2.x. The new functionality added in this release allows us to configure a Universal Distributed Logical Router (UDLR). When we deploy a UDLR, it shows up in every NSX Manager that is part of the Universal Transport Zone.

The following UDLR was created:

Universal Distributed Logical Router (UDLR)

Universal Security Policy with Distributed Firewall (UDFW)

With version 6.2.x we’ve introduced the universal security group and the universal IP set.

Any firewall rule configured in the Universal Section must use an IP set, or a security group that contains IP sets.

When we configure or change a universal policy, a sync process automatically runs from the primary to the secondary NSX Manager.

The recommended way to work with an IP set is to add it to a universal security group.

The following universal security policy is an example that allows communication to a 3-tier application. The security policy is built from universal security groups. Each group contains an IP set with the relevant IP addresses for each tier.

Universal Security Policy with Distributed Firewall (UDFW)
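To make the idea of a policy built from universal security groups and IP sets more concrete, here is a minimal Python sketch. It is only a model of the logic, not NSX configuration or an NSX API call. The group names, IP addresses and ports are hypothetical placeholders that I picked for illustration.

```python
import ipaddress

# Hypothetical universal IP sets; the addresses are placeholders, not the vPOD's real values.
UNIVERSAL_IP_SETS = {
    "IPSET_Web": ["172.16.10.11", "172.16.10.12"],
    "IPSET_App": ["172.16.20.11"],
    "IPSET_DB": ["172.16.30.11"],
}

# Universal security groups contain only IP sets (names are hypothetical).
UNIVERSAL_SECURITY_GROUPS = {
    "USG_Web": ["IPSET_Web"],
    "USG_App": ["IPSET_App"],
    "USG_DB": ["IPSET_DB"],
}

# Allowed flows as (source group, destination group, TCP port); anything else is denied.
UNIVERSAL_RULES = [
    ("ANY", "USG_Web", 443),
    ("USG_Web", "USG_App", 8443),
    ("USG_App", "USG_DB", 3306),
]


def group_members(group):
    """Expand a security group into the set of IP addresses from its IP sets."""
    return {
        ipaddress.ip_address(ip)
        for ip_set in UNIVERSAL_SECURITY_GROUPS[group]
        for ip in UNIVERSAL_IP_SETS[ip_set]
    }


def in_group(ip, group):
    """True if ip belongs to the group; the pseudo-group ANY matches everything."""
    return group == "ANY" or ipaddress.ip_address(ip) in group_members(group)


def is_allowed(src, dst, port):
    """True if at least one rule allows the flow."""
    return any(
        in_group(src, s) and in_group(dst, d) and port == p
        for s, d, p in UNIVERSAL_RULES
    )
```

With these placeholder values, a Web VM may talk to the App tier on port 8443, but it may not reach the DB tier directly, because no rule pairs USG_Web with USG_DB.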

vRA

On the automation side we’re creating two unique machine blueprints (MBPs), one per site. The MBPs are based on a classic CentOS image that allows us to perform some connectivity tests.

The MBP named “Center-Site_A” will be deployed by vRA to Site A into the green ULS named: ULS_Green_Web-A.

The IP address pool configured for this ULS is 172.16.10.0/24.

The MBP named “Center-Site_B” will be deployed by vRA to Site B into the blue ULS named: ULS_Blue_Web-B.

The IP address pool configured for this ULS is 172.17.10.0/24.

vRA Catalog

Cloud Client:

To quote from VMware Official documentation:

“Typically, a vSphere hosted VM managed by vRA belongs to a reservation, which belongs to a compute resource (cluster), which in turn belongs to a vSphere Endpoint. The VMs reservation in vRA needs to be accurate in order for vRA to know which vSphere proxy agent to utilize to manage that VM in the underlying vSphere infrastructure. This is all well and good and causes few (if any) problems in a single site setup, as the VM will not normally move from the vSphere endpoint it is originally located on.

With a multi-site deployment utilizing Site Recovery Manager all this changes as part of the site to site fail over process involves moving VMs from one vCenter to another. This has the effect in vRA of moving the VM to a different endpoint, but the reservation becomes stale. As a result it becomes no longer possible to perform day 2 operation on the VMs until the reservation is updated.”

When we fail over VMs from Site A to Site B, Cloud Client runs the following actions behind the scenes to solve this challenge.

Process Flow for Planned Failover:

Process Flow for Planned Failover

The Conceptual Routing Design with Active/Active Datacenter

The key point of this design is to run workloads Active/Active in both datacenters.

The workloads will reside in both Site A and Site B. In the modern datacenter the entry point is protected by a perimeter firewall.

In our design each site has its own perimeter firewall, and the two run independently: FW_A is located in Site A and FW_B is located in Site B.

Site A (shown in green) runs its own ESGs (Edge Services Gateways), Universal DLR (UDLR) and Universal Logical Switch (ULS).

Site B (shown in blue) has different ESGs, Universal DLR (UDLR) and Universal Logical Switch (ULS).

The main reason for the different ESGs, UDLR and ULS per site is to force a single ingress/egress point for workload traffic per site.

Without this deterministic ingress/egress traffic flow we may face asymmetric routing between the two sites. That means ingress traffic arrives via Site A through FW_A while egress traffic leaves via Site B through FW_B, and this asymmetric traffic will be dropped by FW_B.
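The reason FW_B drops the asymmetric traffic is that a stateful firewall only forwards replies for sessions it has seen being opened. Here is a toy Python model of that behavior. It is a sketch under simplifying assumptions (one session table per firewall, no timeouts), and the Web VM address is a hypothetical one from the Green pool.

```python
class StatefulFirewall:
    """Toy stateful firewall: replies pass only for sessions this firewall saw start."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()

    def inbound(self, client, server):
        # An ingress packet from an external client creates session state.
        self.sessions.add((client, server))
        return True

    def outbound(self, server, client):
        # A reply is forwarded only if this firewall holds the matching session.
        return (client, server) in self.sessions


fw_a = StatefulFirewall("FW_A")
fw_b = StatefulFirewall("FW_B")

client = "192.168.110.10"   # the Control VM from this post
web_vm = "172.16.10.11"     # hypothetical Green Web VM address

fw_a.inbound(client, web_vm)                    # request enters via Site A

symmetric_ok = fw_a.outbound(web_vm, client)    # reply leaves via Site A: forwarded
asymmetric_ok = fw_b.outbound(web_vm, client)   # reply leaves via Site B: dropped
```

FW_B never saw the request, so it has no session entry for the reply and drops it, which is exactly the situation the per-site ingress/egress design avoids.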

Note: The ESGs in this blog run in ECMP mode. As a consequence, we turned off the firewall service on the ESGs.

The Green network will always be advertised via FW_A. For example, when the Control VM (IP 192.168.110.10) shown in the figure below needs to access the Green Web VM connected to ULS_Web_Green_A, the traffic from the client is routed via the Core router to FW_A, from there to one of the ESGs working in ECMP mode, then to the Green UDLR and finally to the Green Web VM itself.

Now assume the same client would like to access the Blue Web VM connected to ULS_Web_Blue_B. This traffic will be routed via the Core router to FW_B, from there to one of the Blue ESGs working in ECMP mode, then to the Blue UDLR and finally to the Blue VM itself.

Routing Design with Active/Active Datacenter

What is the issue with this design?

What will happen if we face a complete failure of one of our Edge clusters or of FW_A?

For our scenario I’ve combined failures of the Green Edge cluster and FW_A in the image below.

In that case we will lose all N-S traffic to all of the ULS behind this Green Edge cluster.

As a result, all clients outside the SDDC will immediately lose connectivity to all of the Green ULS.

Please note: forwarding traffic to the Blue ULS will continue to work in this event regardless of the failure in Site A.

 

PIC7

If we had a stretched vSphere Edge cluster between Site A and Site B, we would be able to leverage vSphere HA to restart the failed Green ESGs in the remote Blue site. This is not the case here: in our design each site has its own local cluster and storage. Even if we had vSphere HA, the restart process can take a few minutes. Another way to recover from this failure is to manually deploy Green ESGs in Site B and connect them to FW_B in Site B. The recovery time of this solution could also be a few minutes. Neither of these options is suitable for a modern datacenter design.

In the next paragraph I will introduce a new way to design the ESGs in Active/Active datacenter architecture.

This design will be much faster and will work in a more efficient way to recover from such an event in Site A (or Site B).

Active/Active Datacenter with mirrored ESGs

In this design architecture we deploy mirrored Green ESGs in Site B and mirrored Blue ESGs in Site A. Under normal datacenter operation the mirrored ESGs are up and running but do not forward traffic. Traffic from external clients to the Site A Green ULS will always enter via the Site A ESGs (E1-Green-A, E2-Green-A) for all of the Site A prefix and leave through the same point.

Adding the mirrored ESGs adds some complexity to the single ingress/egress design, but improves the convergence time after any failure.

PIC8

How does ingress traffic flow work in this design?

Now we will explain how the Ingress traffic flow works in this architecture with mirrored ESGs. In order to simplify the explanation, we will be focusing only on the green flow in both of the datacenters and remove the blue components from the diagrams but the same explanation works for the Blue Site B network as well.

The Site A Green UDLR Control VM runs eBGP with all Green ESGs (E1-Green-A to E4-Green-B). The UDLR redistributes all connected interfaces as the Site A prefix via eBGP. Note: “Site A prefix” represents any Green segment that is part of the Green ULS.

The Green ESGs (E1-Green-A to E4-Green-B) advertise the Site A prefix via BGP to both physical firewalls: FW_A located in Site A and FW_B located in Site B.

FW_B in Site B adds BGP AS-path prepending for the Site A prefix.

From the Core router’s point of view, we have two different paths to reach the Site A prefix: one via FW_A (Site A) and the second via FW_B (Site B). Under normal operation this traffic flows only through Site A, because Site B prepends the Site A prefix and therefore advertises a longer AS path.

PIC9
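The ingress decision at the Core router can be sketched in a few lines of Python. This is a simplified model that compares only AS-path length and ignores every other BGP tie-breaker. The AS numbers and the number of prepends are hypothetical values I chose for illustration; they are not taken from the vPOD.

```python
# Hypothetical AS numbers, chosen only for illustration.
AS_ESG, AS_FW_A, AS_FW_B = 65001, 65010, 65020


def advertise(fw_as, origin_path, prepend=0):
    """AS path the Core sees: the firewall's AS (plus any prepends) in front of the ESG path."""
    return [fw_as] * (1 + prepend) + origin_path


def best_path(paths):
    """Pick the neighbor with the shortest AS path (other BGP tie-breakers are ignored)."""
    return min(paths, key=lambda name: len(paths[name]))


# Paths to the Site A prefix as received by the Core router.
paths_at_core = {
    "FW_A": advertise(AS_FW_A, [AS_ESG]),
    "FW_B": advertise(AS_FW_B, [AS_ESG], prepend=2),  # FW_B prepends its own AS
}
normal = best_path(paths_at_core)

# Site A failure: the Core stops receiving the prefix from FW_A.
del paths_at_core["FW_A"]
after_failure = best_path(paths_at_core)
```

In normal operation the shorter path via FW_A wins. Once the advertisement from FW_A disappears, the prepended path via FW_B is the only one left, so ingress traffic moves to Site B without any manual change.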

Egress Traffic

Egress traffic is handled by the UDLR Control VM using different BGP weight values.

The Site A ESGs, E1-Green-A and E2-Green-A, have mirror ESGs, E3-Green-B and E4-Green-B, located in Site B. The mirror ESGs provide availability. Under normal operation the UDLR Control VM will always prefer to route traffic via E1-Green-A and E2-Green-A, which have the higher BGP weight value. E3-Green-B and E4-Green-B will not forward any traffic and will wait for E1 and E2 to fail.

In the figure below, we can see a Web workload running on Site A ULS_Green_A initiating traffic to the Core. This egress traffic passes through the DLR kernel module, through the E1-Green-A ESG, and is then forwarded to Site A FW_A.

PIC10
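The egress choice made by the UDLR can be modeled the same way. The sketch below assumes that the UDLR uses all live neighbors that share the highest weight (ECMP) and falls back to the mirrored ESGs only when no Site A ESG is left. The weight numbers are hypothetical; the design only requires that the Site A ESGs carry the higher value.

```python
# Hypothetical weight values; only their relative order matters.
ESG_WEIGHTS = {
    "E1-Green-A": 60,
    "E2-Green-A": 60,
    "E3-Green-B": 30,
    "E4-Green-B": 30,
}


def egress_next_hops(alive):
    """Return the live ESGs with the highest weight; ties are used together (ECMP)."""
    candidates = {esg: w for esg, w in ESG_WEIGHTS.items() if esg in alive}
    if not candidates:
        return []
    top = max(candidates.values())
    return sorted(esg for esg, w in candidates.items() if w == top)
```

For example, `egress_next_hops({"E2-Green-A", "E3-Green-B", "E4-Green-B"})` keeps egress traffic on E2-Green-A after a failure of E1-Green-A, while `egress_next_hops({"E3-Green-B", "E4-Green-B"})` moves it to the mirrored ESGs in Site B.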

There are other options for ingress/egress within NSX 6.2:

A great new feature called ‘Locale ID’. Hany Michael wrote a blog post that covers this option.

Hany’s design doesn’t have a firewall like mine does, so please pay attention to a few minor differences.

http://www.networkskyx.com/2016/01/06/introducing-the-vmware-nsx-vlab-2-0/

Anthony Burke wrote a blog post about how to use Locale ID with a physical firewall:

https://networkinferno.net/ingress-optimisation-with-nsx-for-vsphere

Routing updates

Below, we demonstrate routing updates for Site A, but the same mechanism works for Site B. The Core router connected to FW_A in Site A peers with FW_A via eBGP.

The Core advertises a 0/0 default route.

FW_A performs eBGP peering with both E1-Green-A and E2-Green-A. FW_A forwards the 0/0 default route to the Green ESGs and receives the Site A Green prefixes from them. The Green ESGs E1-Green-A and E2-Green-A peer via eBGP with the UDLR Control VM.

The UDLR and the ESGs work in ECMP mode; as a result the UDLR gets 0/0 from both ESGs. The UDLR redistributes its connected interfaces (LIFs) to both Green ESGs.

We can work with iBGP, eBGP, or a mix of both along the path UDLR -> ESG -> physical routers.

To reduce the eBGP convergence time when the active UDLR Control VM fails, we configure a floating static route on the Green side that points to the UDLR forwarding address for the internal LIFs.

Routing filters are applied on all ESGs to prevent unwanted prefix advertisements and to keep the ESGs from becoming transit gateways.

PIC11

Failure of One Green ESG in Site A

The Green ESGs E1-Green-A and E2-Green-A work in ECMP mode. From the UDLR and FW_A point of view, both ESGs work in Active/Active mode.

As long as we have at least one active Green ESG in Site A, the Green UDLR and the Core router will always prefer to work with the Site A Green ESGs.

Let’s assume we have an active traffic flow from the Green Web VM in Site A to an external client behind the Core router, and that this traffic initially passes through E1-Green-A. In the event of a failure of the E1-Green-A ESG, the UDLR will reroute the traffic via E2-Green-A, because this ESG has a better weight than the Green ESGs in Site B (E3-Green-B and E4-Green-B).

FW_A still advertises a better AS path to the ‘ULS_Web_Green_A’ prefixes than FW_B (remember that FW_B always prepends the Site A prefix).

We use low BGP timer settings (hello=1 sec, hold down=3 sec) to improve BGP convergence.

 

PIC12

Complete Edge cluster failure in Site A

In this scenario we face a failure of the entire Edge cluster in Site A (Green ESGs and Blue ESGs). This issue might also include the failure of FW_A.

The Core router will not receive any BGP updates from Site A, so it will prefer to go through FW_B in order to reach any Site A prefix.

From the UDLR’s point of view there aren’t any working Green ESGs in Site A, so the UDLR will work with the remaining Green ESGs in Site B (E3-Green-B, E4-Green-B).

Traffic initiated from the external client will be rerouted via the mirrored Green ESGs (E3-Green-B and E4-Green-B) to the Green ULS in Site B. The reroute happens very quickly, based on the BGP timer settings (hello=1 sec, hold down=3 sec).

This solution is much faster than the other options mentioned before.

The same recovery mechanism exists for a failure in the Site B datacenter.

PIC13

Note: The Green UDLR control VM was deployed to the payload cluster and isn’t affected by this failure.

 

Complete Site A failure:

In this catastrophic scenario all components in Site A have failed, including the management infrastructure (vCenter, NSX Manager, controllers, ESGs and UDLR Control VM). Green workloads will face an outage until they are recovered in Site B; the Blue workloads continue to work without any interruption.

The recovery procedure for this event covers both the infrastructure management/control plane components and the workloads themselves.

Recovering the management/control plane:

  • Log in to the secondary NSX Manager and promote it to primary using: Assign Primary Role.
  • Deploy a new Universal Controller Cluster and synchronize all objects.
  • The Universal Controller Cluster configuration is pushed to the ESXi hosts managed by the secondary.
  • Redeploy the UDLR Control VM.

The recovery procedure for the workloads is to run the “Recovery plan” from the SRM located in Site B.

PIC14

 

Summary:

In this blog post we demonstrated the great power of NSX to create an Active/Active datacenter with the ability to recover very quickly from many failure scenarios.

  • We showed how NSX simplifies the Disaster Recovery process.
  • NSX and SRM integration is the reasonable approach to DR where we can’t use a stretched vSphere cluster.
  • NSX works in Cross-vCenter mode. Dual vCenters and NSX Managers improve our availability. Even in the event of a complete site failure we were able to continue working immediately in our management layer (the secondary NSX Manager and vCenter are up and running).
  • In this design, half of our environment (Blue segments) wasn’t affected by a complete site failure. SRM recovered our failed Green workloads without the need to change our Layer 2/Layer 3 network topology.
  • We did not use any specific hardware to achieve our BCDR and we were 100% decoupled from the physical layer.
  • With SRM and vRO we were able to protect any deployed VM from Day 0.

 

I would like to thank:

Daniel Bakshi, who helped me a lot by reviewing this blog post.

Thanks also to Boris Kovalev and Tal Moran, who helped with the vRA/vRO demo vPOD.

 

 

 

NSX Edge and DRS Rules

The NSX Edge Cluster connects the logical and physical worlds and usually hosts the NSX Edge Services Gateways and the DLR Control VMs.

There are deployments where the Edge Cluster may contain the NSX Controllers as well.

In this section we discuss how to design an Edge Cluster to survive the failure of an ESXi host or an entire physical chassis and reduce the outage time.

In the figure below we deploy NSX Edges, E1 and E2, in ECMP mode, where they run active/active from the perspective of both the control and data planes. The DLR Control VMs run active/passive, while both E1 and E2 run a dynamic routing protocol with the active DLR Control VM.

When the DLR learns a new route from E1 or E2, it pushes this information to the NSX Controller cluster. The NSX Controller then updates the routing tables in the kernel of each ESXi host that is running this DLR instance.

 

1

 

In the scenario where the ESXi host that contains Edge E1 fails:

  • The active DLR will update the NSX Controller to remove E1 as a next hop. The NSX Controller will update the ESXi hosts, and as a result the “Web” VM traffic will be routed to Edge E2.
    The time it takes to re-route the traffic depends on the convergence time of the dynamic routing protocol.

2

In the specific scenario where the failed ESXi or Chassis contained both the Edge E1 and the active DLR, we would instead face a longer outage in the forwarded traffic.

The reason for this is that the active DLR is down and cannot detect the failure of the Edge E1 and accordingly update the Controller. The ESXi will continue to forward traffic to Edge E1 until the passive DLR becomes active, learns that the Edge E1 is down and updates the NSX Controller.

3

The Golden Rule is:

We must ensure that when the Edge Services Gateway and the DLR Control VM belong to the same tenant, they do not reside on the same ESXi host. It is better to distribute them across ESXi hosts and reduce the number of affected functions.
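The Golden Rule is easy to check against a VM placement list. The Python sketch below is only a model for reasoning about placement, not a vSphere API call. The VM names, tenants and placement are hypothetical; the host names are borrowed from the Edge cluster used in this section.

```python
from collections import defaultdict

# Hypothetical placement: (VM name, tenant, role, ESXi host).
PLACEMENT = [
    ("Edge-Green", "green", "edge", "esxcomp-01a"),
    ("DLR-Green", "green", "dlr-control", "esxcomp-01b"),
    ("Edge-Blue", "blue", "edge", "esxcomp-02b"),
    ("DLR-Blue", "blue", "dlr-control", "esxcomp-02b"),
]


def golden_rule_violations(placement):
    """Return (tenant, host) pairs where an Edge and a DLR Control VM share a host."""
    roles_on_host = defaultdict(set)
    for _vm, tenant, role, host in placement:
        roles_on_host[(tenant, host)].add(role)
    return sorted(
        key
        for key, roles in roles_on_host.items()
        if {"edge", "dlr-control"} <= roles
    )
```

With this placement, `golden_rule_violations(PLACEMENT)` flags the Blue tenant on esxcomp-02b, because its Edge and its DLR Control VM sit on the same host.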

By default, when we deploy an NSX Edge or DLR in active/passive mode, the system takes care of creating a DRS anti-affinity rule, which prevents the active/passive VMs from running on the same ESXi host.

DRS anti affinity rules


We need to build new DRS rules as these default rules will not prevent us from getting to the previous dual failure scenario.

The figure below describes the network logical view for our specific example. This topology is built from two different tenants where each tenant is being represented with a different color and has its own Edge and DLR.

Note: connectivity to the physical world is not displayed in the figure below in order to simplify the diagram.

multi tenants

My physical Edge Cluster has four ESXi hosts which are distributed over two physical chassis:

Chassis A: esxcomp-01a, esxcomp-02a

Chassis B: esxcomp-01b, esxcomp-02b

4

Create DRS Host Group for each Chassis

We start by creating a container for all the ESXi hosts in Chassis A. This container is configured as a DRS Host Group.

Edge Cluster -> Manage -> Settings -> DRS Groups

Click on the Add button and call this group “Chassis A”.

The container type needs to be “Host DRS Group”. Add the ESXi hosts running in Chassis A (esxcomp-01a and esxcomp-02a).

5

Create another DRS group called Chassis B that contains esxcomp-01b and esxcomp-02b:

6

 

VM’s DRS Group for Chassis A:

We need to create a container for VMs that will run in Chassis A. At this point we just name it as Chassis A, but we are not actually putting the VMs in Chassis A.

This Container type is “VM DRS Group”:

7

VM DRS Group for Chassis B:

8

 

At this point we have four DRS groups:

9

DRS Rules:

Now we need to take the DRS objects we created before, “Chassis A” and “VM to Chassis A”, and tie them together. The next step is to do the same for “Chassis B” and “VM to Chassis B”.

* This configuration needs to be part of “DRS Rules”.

Edge Cluster -> Manage -> Settings -> DRS Rules

Click on the Add button in DRS Rules, in the name enter something like: “VM’s Should Run on Chassis A”

In the Type select “Virtual Machine to Hosts” because we want to bind the VM’s group to the Hosts Group.

In the VM group name choose “VM to Chassis A” object.

Below the VM group selection we need to select the group & hosts binding enforcement type.

We have two different options:

“Should run on hosts in group” or “Must run on hosts in group”

If we choose the “Must” option, then in the event of a failure of all the ESXi hosts in this group (for example, if Chassis A had a critical power outage), the other ESXi hosts in the cluster (Chassis B) would not be considered by vSphere HA as a viable option for recovering the VMs. The “Should” option allows the other ESXi hosts to be used as a recovery option.
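The difference between the two enforcement types can be sketched as a small Python function. This is my own simplified model of the behavior described above, not how vSphere HA is implemented; the function name and its parameters are hypothetical.

```python
# Hosts per chassis, as listed earlier in this post.
CHASSIS_A = {"esxcomp-01a", "esxcomp-02a"}
CHASSIS_B = {"esxcomp-01b", "esxcomp-02b"}


def ha_restart_candidates(rule, group_hosts, cluster_hosts, failed_hosts):
    """Hosts where HA could restart a VM that is tied to group_hosts by a VM-to-Hosts rule."""
    alive = cluster_hosts - failed_hosts
    preferred = alive & group_hosts
    if rule == "must":
        return preferred            # hard rule: never leave the host group
    if rule == "should":
        return preferred or alive   # soft rule: fall back to any surviving host
    raise ValueError(f"unknown rule type: {rule}")
```

If all of Chassis A fails, the "must" variant returns an empty set, so the VMs stay down, while the "should" variant returns the Chassis B hosts.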

10

 

Same for Chassis B:

11

Now the problem with the current DRS rules and the VM placement in this Edge cluster is that the Edge and DLR Control VM are actually running in the same ESXi host.  We need to create anti-affinity DRS rules.

Anti-Affinity Edge and DLR:

An Edge and DLR that belong to the same tenant should not run in the same ESXi host.

For Green Tenant:

12

For Blue Tenant:

13

The Final Result:

In the case of a failure of one of the ESXi hosts, we no longer face the problem where the Edge and the DLR are on the same ESXi host, even in the catastrophic event of a Chassis A or Chassis B failure.

15

 

Note:

The Control VM can be moved to the compute cluster, which avoids this design consideration.

Thanks to Max Ardica and  Tiran Efrat for reviewing this post.

 

NSX L2 Bridging

Overview

The following overview of L2 Bridging is taken from the great work of Max Ardica and Nimish Desai in the official NSX Design Guide:

There are several circumstances where it may be required to establish L2 communication between virtual and physical workloads. Some typical scenarios are (not exhaustive list):

  • Deployment of multi-tier applications: in some cases, the Web, Application and Database tiers can be deployed as part of the same IP subnet. Web and Application tiers are typically leveraging virtual workloads, but that is not the case for the Database tier where bare-metal servers are commonly deployed. As a consequence, it may then be required to establish intra-subnet (intra-L2 domain) communication between the Application and the Database tiers.
  • Physical to virtual (P-to-V) migration: many customers are virtualizing applications running on bare metal servers and during this P-to-V migration it is required to support a mix of virtual and physical nodes on the same IP subnet.
  • Leveraging external physical devices as default gateway: in such scenarios, a physical network device may be deployed to function as default gateway for the virtual workloads connected to a logical switch and a L2 gateway function is required to establish connectivity to that gateway.
  • Deployment of physical appliances (firewalls, load balancers, etc.).

To fulfill the specific requirements listed above, it is possible to deploy devices performing a “bridging” functionality that enables communication between the “virtual world” (logical switches) and the “physical world” (non virtualized workloads and network devices connected to traditional VLANs).

NSX offers this functionality in software through the deployment of NSX L2 Bridging allowing VMs to be connected at layer 2 to a physical network (VXLAN to VLAN ID mapping), even if the hypervisor running the VM is not physically connected to that L2 physical network.

L2 Bridge topology

 

Figure above shows an example of L2 bridging, where a VM connected in logical space to the VXLAN segment 5001 needs to communicate with a physical device deployed in the same IP subnet but connected to a physical network infrastructure (in VLAN 100). In the current NSX-v implementation, the VXLAN-VLAN bridging configuration is part of the distributed router configuration; the specific ESXi hosts performing the L2 bridging functionality is hence the one where the control VM for that distributed router is running. In case of failure of that ESXi host, the ESXi hosting the standby Control VM (which gets activated once it detects the failure of the Active one) would take the L2 bridging function.

Independently from the specific implementation details, below are some important deployment considerations for the NSX L2 bridging functionality:

  • The VXLAN-VLAN mapping is always performed in 1:1 fashion. This means traffic for a given VXLAN can only be bridged to a specific VLAN, and vice versa.
  • A given bridge instance (for a specific VXLAN-VLAN pair) is always active only on a specific ESXi host.
  • However, through configuration it is possible to create multiple bridges instances (for different VXLAN-VLAN pairs) and ensure they are spread across separate ESXi hosts. This improves the overall scalability of the L2 bridging function.
  • The NSX Layer 2 bridging data path is entirely performed in the ESXi kernel, and not in user space. Once again, the Control VM is only used to determine the ESXi host where a given bridging instance is active, and not to perform the bridging function.
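The 1:1 mapping rule from the considerations above can be expressed as a small validation function. This is a Python sketch for checking a planned bridge list, not an NSX API. The tuple layout and the function name are my own assumptions.

```python
def validate_bridges(bridges):
    """Check a list of (bridge name, VXLAN VNI, VLAN ID) tuples against the 1:1 rule."""
    errors = []
    seen_vni, seen_vlan = {}, {}
    for name, vni, vlan in bridges:
        if vni in seen_vni:
            errors.append(f"VXLAN {vni} already bridged by {seen_vni[vni]}")
        if vlan in seen_vlan:
            errors.append(f"VLAN {vlan} already bridged by {seen_vlan[vlan]}")
        # Remember the first bridge that claimed each VNI and VLAN.
        seen_vni.setdefault(vni, name)
        seen_vlan.setdefault(vlan, name)
    return errors
```

A plan that bridges VXLAN 5002 to two different VLANs, or two VXLANs to VLAN 100, returns an error message, while distinct VXLAN-VLAN pairs pass with an empty list.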

 

 

Configure L2 Bridge

In this scenario we would like to bridge between an App VM connected to VXLAN 5002 and a virtual machine connected to VLAN 100.

Create Bridge 1

My current Logical Switch configuration:

Logical Switch table

We have pre-configured a VLAN-backed port group for VLAN 100:

Port group

Bridging configuration is done at the DLR level. In this specific example, the DLR name is Distributed-Router:

Double Click on the edge-1:

DLR1

 

Click on Bridging and then on the green + button:

DLR2

Type the bridge name, logical switch ID and port group name:

DLR3

 

Click OK and Publish:

DLR4

 

Now a VM on Logical Switch App-Tier-01 can communicate with a physical or virtual machine on VLAN 100.

 

Design Consideration

Currently in NSX-V 6.1 we can’t enable routing on the VXLAN logical switch that is bridged to a VLAN.

In other words, the default gateway for devices connected to the VLAN can’t be configured on the distributed logical router:

Non-working L2 Bridge Topology

So how can a VM in VXLAN 5002 communicate with VXLAN 5001?

The big difference is that VXLAN 5002 is no longer connected to a DLR LIF; instead it is connected to the NSX Edge.

Working Bridge Topology

Redundancy

The DLR Control VM can work in high availability mode. If the active DLR Control VM fails, the standby Control VM takes over, which means the bridge instance moves to a new ESXi host.

HA

 

Bridge Troubleshooting:

The most common issue I ran into was that the bridged VLAN was missing on the trunk interface configured on the physical switch.

In the figure below:

  • The physical server is connected to VLAN 100; the App VM is connected to VXLAN 5002 on esx-01b.
  • The active DLR Control VM is located on esx-02a, so the bridging function will be active on this ESXi host.
  • Both ESXi hosts have two physical NICs: vmnic2 and vmnic3.
  • The transport VLAN carries all VNI (VXLAN) traffic and is forwarded on the physical switch in VLAN 20.
  • On physical switch-2, port E1/1 must be configured as a trunk port that allows both VLAN 100 and VLAN 20.

Bridge and Trunk configuration

Note: Port E1/1 will carry both VXLAN and VLAN traffic. 

 

 

 

Find Where Bridge is Active:

We need to know where the active DLR Control VM is located (if we have HA). Inside this ESXi host the bridging happens in kernel space. The easy way to find it is to look at the “Configuration” section in the “Manage” tab.

Note: When we power off the DLR Control VM (if HA is not enabled), the bridging function on this ESXi host stops in order to prevent a loop.

DLR5

We can see that the Control VM is located on esx-02a.corp.local.

SSH to this ESXi host and find the Vdr Name of the DLR Control VM:

xxx-xxx -I -l

VDR Instance Information :
---------------------------

Vdr Name: default+edge-1
Vdr Id: 1460487509
Number of Lifs: 4
Number of Routes: 5
State: Enabled
Controller IP: 192.168.110.201
Control Plane IP: 192.168.110.52
Control Plane Active: Yes
Num unique nexthops: 1
Generation Number: 0
Edge Active: Yes

Now we know that “default+edge-1” is the VDR name.
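If we want to pick the VDR name out of this output in a script, a small parser is enough. The Python sketch below assumes the output has already been captured as text (the sample is copied from the output above) and that every useful line has the form "Key: Value". The function name is my own.

```python
SAMPLE_OUTPUT = """\
VDR Instance Information :
---------------------------

Vdr Name: default+edge-1
Vdr Id: 1460487509
Number of Lifs: 4
Number of Routes: 5
State: Enabled
Controller IP: 192.168.110.201
Control Plane IP: 192.168.110.52
Control Plane Active: Yes
Num unique nexthops: 1
Generation Number: 0
Edge Active: Yes
"""


def parse_vdr_info(text):
    """Turn 'Key: Value' lines into a dict; the header and ruler lines are skipped."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


vdr = parse_vdr_info(SAMPLE_OUTPUT)
```

After parsing, `vdr["Vdr Name"]` holds the name we need for the next command, and `vdr["Control Plane Active"]` is a quick health check.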

 

xxx-xxx -b --mac default+edge-1

###################################################################################################

~ # xxx-xxx -b --mac default+edge-1

VDR 'default+edge-1' bridge 'Bridge_App_VLAN100' mac address tables :
Network 'vxlan-5002-type-bridging' MAC address table:
total number of MAC addresses: 0
number of MAC addresses returned: 0
Destination Address  Address Type  VLAN ID  VXLAN ID  Destination Port  Age
-------------------  ------------  -------  --------  ----------------  ---
Network 'vlan-100-type-bridging' MAC address table:
total number of MAC addresses: 0
number of MAC addresses returned: 0
Destination Address  Address Type  VLAN ID  VXLAN ID  Destination Port  Age
-------------------  ------------  -------  --------  ----------------  ---

###################################################################################################

From this output we can see that no MAC addresses have been learned yet.

After connecting a VM to Logical Switch App-Tier-01 and pinging a VM in VLAN 100, we can see MAC addresses from both VXLAN 5002 and VLAN 100:

Bridge TSHOOT

 

 

 

 

NSX – Distributed Logical Router Deep Dive

Overview

In today’s modern Datacenter, the physical router is essential for building a workable network design. As in the physical infrastructure, we need to provide similar functionality in virtual networking. Routing between IP subnets can be performed in a logical space without traffic going out to the physical router. This routing is performed in the hypervisor kernel with a minimal CPU and memory overhead. This functionality provides an optimal data-path for routing traffic within the virtual infrastructure. Distributed routing capability in the NSX-v platform provides an optimized and scalable way of handling East – West traffic within a data center. East – West traffic is the communication between virtual machines within the datacenter. The amount of East – West traffic in the datacenter is growing. The new collaborative, distributed, and service oriented application architecture demands a higher bandwidth for server-to-server communication.

If these servers are virtual machines running on a hypervisor, and they are connected to different subnets, the communication between these servers has to go through a router. Also, if a physical router is used to provide routing services the virtual machine communication has to go out to the physical router and get back in to the server after the routing decisions have been made. This is obviously not an optimal traffic flow and is sometimes referred to as “hair pinning”.

The distributed routing on the NSX-v platform prevents the “hair-pinning” by providing hypervisor level routing functionality. Each hypervisor has a routing kernel module that performs routing between the Logical Interfaces (LIFs) defined on that distributed router instance.

The distributed logical router owns and manages the logical interfaces (LIFs). The LIF concept is similar to a VLAN interface on a physical router. A LIF connects to a logical switch or a distributed port group. A single distributed logical router can have a maximum of 1,000 LIFs.
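To illustrate how LIFs let the hypervisor route locally, here is a minimal Python sketch. It only models the lookup "which LIF owns this address" and is not how the DLR kernel module is implemented. The LIF names and subnets are hypothetical.

```python
import ipaddress

# Hypothetical LIFs on one distributed router instance (name -> gateway address/prefix).
LIFS = {
    "Web-LIF": ipaddress.ip_interface("172.16.10.1/24"),
    "App-LIF": ipaddress.ip_interface("172.16.20.1/24"),
}


def find_lif(ip):
    """Return the name of the LIF whose subnet contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, lif in LIFS.items():
        if addr in lif.network:
            return name
    return None


def routed_in_kernel(src_ip, dst_ip):
    """True when both ends sit behind different LIFs, so no hair-pinning is needed."""
    src, dst = find_lif(src_ip), find_lif(dst_ip)
    return src is not None and dst is not None and src != dst
```

Traffic from 172.16.10.11 to 172.16.20.11 is routed between Web-LIF and App-LIF inside the hypervisor, while traffic to an address that no LIF owns has to leave through the uplink.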

DLR Overview

DLR Interface Types

With the DLR we have three types of interfaces. These are called Uplink, LIFs and Management.

Uplink: This is used by the DLR Control VM to connect to the upstream router. In most of the documentation it is also referred to as “transit”, because this interface is the transit interface between the logical space and the physical space. The DLR supports both OSPF and BGP on its Uplink interface, but cannot run both at the same time. OSPF can be enabled on only a single Uplink interface.

LIFs: LIFs exist on the ESXi host at the kernel level. They are the Layer 3 interfaces that act as the default gateway for all VMs connected to logical switches.

Management: The DLR management interface can be used for different purposes. The first is remote management access to the DLR Control VM, such as SSH. Another use case is High Availability. The last is sending syslog information to a syslog server. The management interface is part of the routing table of the Control VM; there is no separate routing table. When we configure an IP address for the management interface, only devices on the same subnet as the management subnet will be able to reach the DLR Control VM management IP; devices on remote subnets will not be able to contact this IP.

DLR Interface Type

Note: If we just need an IP address to manage the DLR remotely, we can SSH to the DLR “Protocol Address” (explained later in this post); there is no need to configure a new IP address for the management interface.

Logical interfaces, virtual MAC and physical MAC:

Each Logical Interface (LIF), including its IP address, lives in the DLR kernel module inside the ESXi host. Each LIF has an associated MAC address called the virtual MAC (vMAC). The vMAC is the same across all ESXi hosts and is never seen by the physical network, only by virtual machines, which use it as their default gateway MAC address. The physical MAC (pMAC) is the MAC address of the uplink through which traffic flows to the physical network; when the DLR needs to route traffic outside of the ESXi host, it is the pMAC address that is used.

In the following figure, the ESXi host esxcomp-01a runs the DLR kernel module, and this DLR instance has two LIFs. Each LIF is associated with a logical switch: VXLAN 5001 and VXLAN 5002. From the perspective of VM1, the default gateway is LIF1 with IP address 172.16.10.1; VM2 has LIF2, 172.16.20.1, as its default gateway. The vMAC is the same MAC address for both LIFs.

The LIF IP addresses and the vMAC are the same across all NSX-v hosts for the same DLR instance.

DLR and vMotion

When VM2 is vMotioned from esxcomp-01a to esxcomp-01b, it keeps the same default gateway (LIF2) with the same vMAC, so from the perspective of VM2 nothing has changed.

 

DLR Kernel module and ARP table

The DLR does not communicate with the NSX-v Controller to figure out the MAC addresses of VMs. Instead, it sends an ARP request to all ESXi host VTEP members of that logical switch. The VTEPs that receive this ARP request forward it to all VMs on that logical switch.

In the following figure, if VM1 needs to communicate with VM2, this traffic is routed inside the DLR kernel module on esxcomp-01a, so this DLR needs to know the MAC addresses of VM1 and VM2. The DLR sends an ARP request to all VTEP members on VXLAN 5002 to learn the MAC address of VM2. The DLR then keeps the ARP table entry for 600 seconds, which is called its aging time.

DLR Kernel module and ARP table

Note: The DLR instance may have different ARP entries between different ESXi hosts. Each DLR Kernel module maintains its own ARP table.
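
The aging behaviour described above can be pictured with a small per-host cache. This is only an illustrative sketch, not NSX code: the class name, the injectable clock and the sample addresses are assumptions made for the example, and treating an entry as expired at exactly 600 seconds is a modelling choice. Only the 600-second aging time comes from the text.

```python
import time

ARP_AGING_SECONDS = 600  # aging time quoted in the text


class ArpTable:
    """Toy per-host ARP cache; each DLR kernel module keeps its own table."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable clock, useful for testing
        self._entries = {}       # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac):
        """Store (or refresh) an entry, e.g. after an ARP reply."""
        self._entries[ip] = (mac, self._clock())

    def lookup(self, ip):
        """Return the MAC, or None if unknown or aged out (ARP needed again)."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if self._clock() - learned_at >= ARP_AGING_SECONDS:
            del self._entries[ip]   # the entry has aged out
            return None
        return mac
```

Two hosts would simply hold two independent `ArpTable` objects, which is why the same DLR instance can show different ARP entries on different ESXi hosts.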

DLR and local routing

Since the DLR instance is distributed, each ESXi host has a route-instance that can route traffic. When VM1 needs to send traffic to VM2, theoretically the DLR in either esxcomp-01a or esxcomp-01b could route the traffic, as in the following figure. In NSX-v the DLR always performs local routing for VM traffic!

When VM1 sends a packet to VM2, the DLR in esxcomp-01a will route the traffic from VXLAN 5001 to VXLAN 5002 because VM1 has initiated the traffic.

DLR Local Routing

The following illustration shows that when VM2 replies to VM1, the DLR on esxcomp-01b routes the traffic, because VM2 is local to the DLR instance on esxcomp-01b.

DLR Local Routing

Note: the actual traffic between the ESXi hosts flows via the VTEPs.
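
The local-routing rule can be written down in a few lines. The sketch below is a toy model, not NSX code: the dictionaries and the function name are invented for illustration, while the VM placement and VXLAN numbers follow the figures described above. It shows one consequence worth noticing: because the sender's host always routes, the request and the reply cross between hosts on different VXLANs.

```python
# Where each VM runs and which logical switch it is attached to.
VM_HOST = {"VM1": "esxcomp-01a", "VM2": "esxcomp-01b"}
VM_VXLAN = {"VM1": 5001, "VM2": 5002}


def route_locally(src, dst):
    """Describe how a packet from src to dst is handled by the DLR."""
    return {
        "routed_on": VM_HOST[src],          # always the sender's own host
        "from_vxlan": VM_VXLAN[src],
        "carried_on_vxlan": VM_VXLAN[dst],  # already routed when it leaves the host
        "delivered_on": VM_HOST[dst],
    }
```

For example, `route_locally("VM1", "VM2")` is routed on esxcomp-01a and travels on VXLAN 5002, while the reply from VM2 is routed on esxcomp-01b and travels on VXLAN 5001.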

Multiple Route Instances

The Distributed Logical Router (DLR) has two components: the first is the DLR Control VM, which is a virtual machine, and the second is the DLR kernel module that runs in every ESXi hypervisor. This DLR kernel module, which is called a route-instance, holds the same copy of information on each ESXi host. The route-instance works at the kernel level. Each DLR has its own unique route-instance inside the ESXi host, and a host is not limited to just one route-instance.

The following figure shows two DLR Control VMs, with DLR Control VM1 on the right and DLR Control VM2 on the left. Each Control VM has its own route-instance in the ESXi hosts. In esxcomp-01a we have route-instance 1, which is managed by DLR Control VM1, and route-instance 2, which is managed by DLR Control VM2; the same applies to esxcomp-01b. Each DLR instance has its own range of LIFs that it manages. DLR Control VM1 manages the LIFs in VXLAN 5001 and 5002. DLR Control VM2 manages the LIFs in VXLAN 5003 and 5004.

Multiple Route Instances

Logical Router Port

Regardless of the number of route-instances we have inside the ESXi host, we will have one special port called the “Logical Router Port” or “vdr port”.

This port works like a “router on a stick”: all routed traffic passes through this port. We can think of a route-instance as being like VRF-lite, because each route-instance has its own LIFs and routing table, and the LIF IP addresses can even overlap with those of other instances.

In the following figure we have an example of an ESXi host with two route-instances, where route-instance-1 has the same IP address as route-instance-2, but on a different VXLAN.

Note: Different DLRs cannot share the same VXLAN.
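
The VRF-lite analogy and the note above can be captured in a short model. This is a sketch under stated assumptions, not how the kernel module is implemented: the class name, method names, the `/24` prefix length and the sample address are all hypothetical. It only encodes the two rules from the text: LIF IP addresses may overlap between route-instances, and a VXLAN can belong to one DLR only.

```python
import ipaddress


class EsxiHost:
    """Toy model of the route-instances living on one ESXi host."""

    def __init__(self):
        self.instances = {}    # instance name -> {vxlan id: LIF interface}
        self.vxlan_owner = {}  # vxlan id -> instance name that owns it

    def add_lif(self, instance, vxlan, cidr):
        """Attach a LIF to a route-instance; reject a VXLAN owned by another DLR."""
        owner = self.vxlan_owner.get(vxlan)
        if owner is not None and owner != instance:
            raise ValueError(f"VXLAN {vxlan} already belongs to {owner}")
        lifs = self.instances.setdefault(instance, {})
        lifs[vxlan] = ipaddress.ip_interface(cidr)
        self.vxlan_owner[vxlan] = instance
```

Adding `172.16.10.1/24` to route-instance-1 on VXLAN 5001 and again to route-instance-2 on VXLAN 5003 succeeds, while trying to attach VXLAN 5001 to route-instance-2 raises `ValueError`.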

DLR vdr port

Routing Information Control Plane Update Flow

We need to understand how a route is configured and pushed from the DLR control VM to the ESXi hosts. Let’s look at the following figure to understand the flow.

Step 1: An end user configures a new DLR Control VM. This DLR will have LIFs (Logical Interfaces) and either static routing or a dynamic routing protocol peering with the NSX-v Edge Services Gateway (ESG).

Step 2: The DLR LIF configuration is pushed to all ESXi hosts in the cluster that have been prepared by the NSX-v platform. If more than one route-instance exists, the LIF information is sent only to the instance it belongs to.

At this point, VMs in different VXLANs (East-West traffic) can communicate with each other.

Step 3: The NSX-v Edge Services Gateway (ESG) updates the DLR Control VM about new routes.

Step 4: The DLR Control VM updates the NSX-v Controller (via the UWA) with its Routing Information Base (RIB).

Step 5: The NSX-v Controller then pushes the RIB to all ESXi hosts that have been prepared by the NSX-v platform. If more than one route-instance exists, the RIB information is sent only to the instance it belongs to.

Step 6: The route-instance on the ESXi host creates the Forwarding Information Base (FIB) and handles the data path traffic.
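
To make Step 6 more concrete, here is a minimal sketch of turning a RIB into a FIB and using it for lookups. It uses the standard longest-prefix-match technique, which the steps above do not spell out, so treat it as a generic illustration rather than the NSX implementation. The function names, the sample routes and the next hop 192.168.10.1 are hypothetical.

```python
import ipaddress


def build_fib(rib):
    """Turn a RIB, a list of (prefix, next_hop) pairs, into a lookup-ready FIB.

    Entries are sorted longest prefix first so the first match is the best one.
    """
    fib = [(ipaddress.ip_network(prefix), next_hop) for prefix, next_hop in rib]
    fib.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return fib


def lookup(fib, destination):
    """Return the next hop for a destination IP, or None if no route matches."""
    address = ipaddress.ip_address(destination)
    for network, next_hop in fib:
        if address in network:
            return next_hop
    return None
```

With a RIB holding a default route via 192.168.10.1 and a connected route for 172.16.10.0/24, a lookup for 172.16.10.11 hits the connected route and everything else follows the default route.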

Routing Information Control Plane Update Flow

DLR Control VM communications

The DLR Control VM is a virtual machine that is typically deployed in the Management or Edge cluster. When the ESXi host is prepared by the NSX-v platform, one of the VIBs creates the control plane channel between the ESXi host and the NSX-v Controllers. The service daemon inside the ESXi host that is responsible for this channel is called netcpad, and is more commonly referred to as the User World Agent (UWA).

netcpad is responsible for communication between the NSX-v Controller and the ESXi host: it learns MAC/IP/VTEP address information and supports VXLAN communications. The communication is secured and uses SSL to talk to the NSX-v Controller on the control plane. The UWA can also connect to multiple NSX-v Controller instances, and it keeps its logs at /var/log/netcpa.log.

Another service daemon, called vShield-Stateful-Firewall, is responsible for interacting with the NSX-v Manager. This service daemon receives configuration information from the NSX-v Manager to create (or delete) the DLR Control VM and to create (or delete) the ESG. Besides that, this daemon also performs NSX-v firewall tasks: it retrieves the DFW policy rules, gathers the DFW statistics and sends them to the NSX-v Manager, and sends audit logs and information to the NSX-v Manager. As part of host preparation, it also processes SSL-related tasks from the NSX-v Manager.

The DLR Control VM runs two VMCI sockets to the user world agents (UWA) on the ESXi host it resides on. The first VMCI socket goes to the vShield-Stateful-Firewall service daemon on the host, for receiving configuration updates from the NSX-v Manager for the DLR Control VM itself, and the second goes to netcpad for control plane access to the Controllers.

The VMCI socket provides the local communication whereby the guest virtual machines can communicate to the hypervisor where they reside but cannot communicate to the other ESXi hosts.

On this basis the routing update happens in the following manner:

  • Step (1) The DLR Control VM learns new route information (from dynamic routing, for example) and needs to update the NSX-v Controller.
  • Step (2) The DLR uses the internal channel inside the ESXi01 host called the “Virtual Machine Communication Interface” (VMCI). VMCI opens a socket to transfer the learned routes, as Routing Information Base (RIB) information, to the netcpa service daemon.
  • Step (3) The netcpa service daemon sends the RIB information to the NSX-v Controller. The routing information flows through the management VMkernel interface of the ESXi host, which means that the NSX-v Controllers do not need a new interface to communicate with the DLR Control VM. The protocol and port used for this communication is TCP/1234.
  • Step (4) The NSX Controller forwards the DLR RIB to the netcpa service daemons on all the ESXi hosts.
  • Step (5) netcpa forwards the FIB entries to the DLR route-instance.
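
When this flow does not seem to work, one basic sanity check is whether the controller's TCP port can be reached at all from the management network. The helper below is a generic sketch that uses only the Python standard library. The function name is made up, and any controller address you pass in is your own; only port 1234 comes from the steps above. A successful TCP connect says nothing about the SSL session or the state of netcpa itself.

```python
import socket

CONTROLLER_PORT = 1234  # TCP port quoted in the text


def controller_reachable(host, port=CONTROLLER_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

It would be called as `controller_reachable("<controller-ip>")` from a machine that sits on the same management network as the ESXi hosts.
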
DLR Control VM communications

DLR High Availability

High Availability (HA) for the DLR Control VM provides redundancy at the VM level. The HA mode is Active/Passive: the active DLR Control VM holds the IP address, and if it fails, the passive DLR Control VM takes ownership of the IP address (a flip event). The DLR route-instance, with its LIFs and their IP addresses, exists on the ESXi host as a kernel module and is not part of this Active/Passive flip event.

The active DLR Control VM syncs its forwarding table to the secondary DLR Control VM. If the active unit fails, the forwarding table continues to be used on the secondary unit until the secondary DLR renews the adjacency with the upstream router.

The HA heartbeat messages are sent out through the DLR management interface, so we must have L2 connectivity between the active and the secondary DLR Control VM. The IP addresses of the Active/Passive pair are assigned automatically from a /30 when we deploy HA. The default failover detection time is 15 seconds, but it can be lowered to 6 seconds. The heartbeat uses UDP port 694 for its communication.
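
The failover detection timer can be illustrated with a tiny function. This is a simplified model of dead-timer logic in general, not the actual HA implementation: the function and constant names are hypothetical, and only the 15-second default and the 6-second minimum are taken from the text.

```python
DEFAULT_DEAD_TIME = 15  # seconds, default quoted in the text
MIN_DEAD_TIME = 6       # seconds, lowest value quoted in the text


def peer_is_dead(last_heartbeat, now, dead_time=DEFAULT_DEAD_TIME):
    """Return True once no heartbeat has been seen for dead_time seconds."""
    if dead_time < MIN_DEAD_TIME:
        raise ValueError(f"dead time cannot be lower than {MIN_DEAD_TIME} seconds")
    return now - last_heartbeat >= dead_time
```

With the default, a peer last heard from at t=0 is still considered alive at t=14 and declared dead at t=15; with `dead_time=6` the passive unit would take over nine seconds sooner.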

DLR High Availability

You can also verify the HA status by running the following commands:

DLR HA verification command:

$ show service highavailability

$ show service highavailability connection-sync

$ show service highavailability link

Protocol Address and Forwarding Address

The Protocol Address is the IP address of the DLR Control VM. The Control VM is the control plane component that actually establishes the OSPF or BGP peering with the ESGs. The following figure shows OSPF as an example:

Protocol Address and Forwarding Address

The following figure shows that the DLR Forwarding Address is the IP address that the ESGs use as the next hop.

Protocol Address and Forwarding Address

DLR Control VM Firewall

The DLR Control VM can protect its Management or Uplink interfaces with its built-in firewall. Any device that needs to communicate with the DLR Control VM itself needs a firewall rule that permits it.

For example, SSH to the DLR Control VM, or even OSPF adjacencies with the upstream router, need a firewall rule. We can enable or disable the DLR Control VM firewall globally.

Note: Do not confuse the DLR Control VM firewall rules with the NSX-v distributed firewall rules. The following image shows the firewall rules for the DLR Control VM.

DLR Control VM Firewall

Creating DLR

The first step is to create the DLR Control VM.

Go to Network and Security -> NSX Edges and click the green + button.

Here we need to select Logical (Distributed) Router:

 

Creating DLR

Specify the user name and password. We can also enable SSH access here:

DLR CLI Credentials

We need to specify where we want to place the DLR Control VM:

Place the DLR Control VM

We need to specify the Management interface and the Logical Interfaces (LIFs).

The Management interface is used for SSH access to the Control VM.

The LIFs are configured in the second table, under “Configure Interfaces of this NSX Edge”.

Configure Interfaces of this DLR

A LIF is configured by connecting the interface to a logical switch:

Connected LIF to DLR

Configure the Uplink (Transit) LIF:

Configure Uplink LIF

Configure the Web LIF:

Configure the Web LIF

Configure the App LIF:

Configure the App LIF

Configure the DB LIF:

Configure the DB LIF

Summary of all the DLR LIFs:

Summary of all DLR LIFs

The DLR Control VM can work in High Availability mode. In our lab we will not enable HA:

DLR High Availability

Summary of DLR configuration:

Summary of DLR configuration

 

DLR Intermediate step

After deploying the DLR, we have created four different LIFs:

Transit-Network-01, Web-Tier-01, App-Tier-01, DB-Tier-01

All these LIFs span all of our ESXi clusters.

So, for example, a virtual machine connected to the logical switch called “App-Tier-01” will have a default gateway of 172.16.20.1 regardless of where this VM is located in the DC.
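
The point that the gateway does not depend on the VM's location can be shown with a short sketch. The table and function names are hypothetical and the `/24` prefix length is an assumption (the mask is not stated here); the App-Tier-01 gateway address is the one given above.

```python
import ipaddress

# LIF address per logical switch; the /24 prefix length is assumed.
LIFS = {
    "App-Tier-01": ipaddress.ip_interface("172.16.20.1/24"),
}


def default_gateway(logical_switch, esxi_host):
    """Return the gateway IP for a VM; esxi_host is deliberately ignored."""
    return str(LIFS[logical_switch].ip)


def on_lif_subnet(vm_ip, logical_switch):
    """Check that a VM address belongs to the subnet served by the LIF."""
    return ipaddress.ip_address(vm_ip) in LIFS[logical_switch].network
```

Calling `default_gateway("App-Tier-01", "esxcomp-01a")` and `default_gateway("App-Tier-01", "esxcomp-01b")` returns the same address, which is why a vMotion between hosts is transparent to the VM.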

DLR Intermediate step

 

DLR Routing verification

We can verify that the NSX Controller has received the DLR LIF IP addresses for each VXLAN logical switch.

From the NSX Controller, run this command: show control-cluster logical-routers instance all

DLR Routing verification

The LR-Id “1460487505” is the internal ID of the DLR Control VM.

To verify all the DLR LIF interfaces, run this command, replacing LR-Id with the ID shown above: show control-cluster logical-routers interface-summary LR-Id

In our lab:

show control-cluster logical-routers interface-summary 1460487505

DLR Routing verification

 

Configure OSPF on DLR

In the NSX Edges list, click the DLR, which is shown with type Logical Router:

Configure OSPF on DLR

Go to Manage -> Routing -> OSPF and click “Edit”:

Configure OSPF on DLR

Type in the Protocol Address and the Forwarding Address.

Do not check the “Enable OSPF” check box yet!

Protocol Address and Forwarding Address

The Protocol Address is the IP address of the DLR Control VM; the Control VM is the control plane component that actually establishes the OSPF peering with the NSX Edge.

The Forwarding Address is the IP address that the NSX Edge uses as the next hop to forward packets to the DLR:

DLR Forwarding Address

Click on “Publish Changes”:

Publish Changes

The results will look like this:

DLR

Go to “Global Configuration”:

Global Configuration

Type the Default Gateway for DLR (Next hop NSX Edge):

Default Gateway

Enable OSPF:

Enable OSPF

Then click “Publish Changes”.

Go back to “OSPF”, and under “Area to Interface Mapping” add the Transit-Uplink to Area 51:

Area to Interface Mapping

Click “Publish Changes”.

Go to Route Redistribution and make sure OSPF is enabled:

Route Redistribution

Deploy NSX Edge

In our lab we will use an NSX Edge as the next hop for the DLR, but it could also be a physical router.

The NSX Edge is a virtual appliance that offers L2, L3, perimeter firewall, load balancing and other services such as SSL VPN, DHCP, etc.

We will use this Edge for dynamic routing.

Go to “NSX Edges” and click the green plus button.

Select “Edge Services Gateway” and fill in the name and hostname for this Edge.

If we would like to use a redundant Edge, we need to check “Enable High Availability”:

NSX Edge

Enter your username and password:

Username and password

Select the Size of the NSX Edge:

NSX Edge size

Select where to install the Edge:

Configure the Network Interfaces:

Configure the Network Interfaces

Configure the Mgmt interface:

Configure the Mgmt interface

Configure the Transit interface (toward the DLR):

Configure the Transit interface

Configure Default Gateway:

Edge Default Gateway

 

Set the firewall default policy to permit all traffic:

Firewall default policy set to permit all traffic

Summary of Edge Configuration:

Summary of Edge Configuration

Configure OSPF at NSX Edge:

Configure OSPF at NSX Edge

Enable OSPF at “Global Configuration”:

Enable OSPF at "Global Configuration"

In the “Dynamic Routing Configuration” section, click “Edit”.

For the “Router ID”, select the interface that you want to use as the OSPF Router ID.

Check “Enable OSPF”:

 

Enable OSPF

Publish, then go to “OSPF” and add the Transit network to Area 51 in the interface mapping section:

Map Interface to OSPF Area

 

Click “Publish”.

Make sure the OSPF status shows “Enabled” and that the red button on the right reads “Disable”.
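
Before checking the adjacency, it can help to compare the OSPF settings on both ends of the transit link, since OSPF neighbors must agree on the area and on the hello and dead intervals. The sketch below is a generic pre-check, not an NSX API call: the class and function names are hypothetical, and the 10/40 second timers are assumed defaults that should be verified in your own setup. Area 51 is the value used in this lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OspfInterface:
    device: str
    area: int
    hello_interval: int = 10  # seconds; assumed default
    dead_interval: int = 40   # seconds; assumed default


def adjacency_blockers(a, b):
    """List the settings that would stop two OSPF interfaces becoming neighbors."""
    blockers = []
    if a.area != b.area:
        blockers.append(f"area mismatch: {a.device}={a.area}, {b.device}={b.area}")
    if a.hello_interval != b.hello_interval:
        blockers.append("hello interval mismatch")
    if a.dead_interval != b.dead_interval:
        blockers.append("dead interval mismatch")
    return blockers
```

With both the DLR uplink and the Edge transit interface mapped to Area 51, the function returns an empty list; mapping one side to a different area returns an area mismatch.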

Getting the full picture

 

Getting the full picture

 

Dynamic OSPF Routing Verification

Open the Edge CLI.

The Edge has an OSPF neighbor adjacency with 192.168.10.3, which is the DLR Control VM IP address (the Protocol Address).

Edge OSPF verification

The NSX Edge received OSPF routes from the DLR.

From the Edge's perspective, the next hop toward the DLR is the Forwarding Address, 192.168.10.2.
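
The relationship between the two addresses can be sanity-checked with a few lines of code. This is a sketch under assumptions: the function name is hypothetical and the `/24` transit prefix is a guess, since the mask is not shown here. It encodes the expectation that the Protocol Address and the Forwarding Address are two different IPs on the same transit subnet.

```python
import ipaddress


def check_dlr_uplink(protocol_addr, forwarding_addr, transit_cidr):
    """Return a list of problems with a DLR uplink address pair (empty if none)."""
    network = ipaddress.ip_network(transit_cidr)
    protocol = ipaddress.ip_address(protocol_addr)
    forwarding = ipaddress.ip_address(forwarding_addr)
    problems = []
    if protocol == forwarding:
        problems.append("protocol and forwarding address must differ")
    for label, address in (("protocol", protocol), ("forwarding", forwarding)):
        if address not in network:
            problems.append(f"{label} address {address} is outside {network}")
    return problems
```

For the lab values, `check_dlr_uplink("192.168.10.3", "192.168.10.2", "192.168.10.0/24")` returns an empty list.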

Edge OSPF Routing Verification

 

Related Posts:

NSX Manager

NSX Controller

Host Preparation

Logical Switch

Distributed Logical Router

 

Thanks to:

Shachar Bobrovskye, Michael Haines and Prasenjit Sarkar for contributing to this post.

Offer Nissim for reviewing this post.

 

To find out more about distributed dynamic routing, I recommend reading the blogs of two colleagues of mine:

Brad Hedlund

http://bradhedlund.com/2013/11/20/distributed-virtual-and-physical-routing-in-vmware-nsx-for-vsphere/

Anthony Burke

http://networkinferno.net/nsx-compendium