NSX Dual Active/Active Datacenters BCDR


The modern data center design requires better redundancy and demands the ability to have Business Continuity (BC) and Disaster Recovery (DR) in case of catastrophic failure in our datacenter. Planning a new data center with BCDR requires meeting certain fundamental design guidelines.

In this blog post I will describe the Active/Active datacenter with VMware Full SDDC product suite.

The NSX running in Cross-vCenter mode, this ability introduced in VMware NSX release 6.2.x. In this blog post we will focus on network and security.

An introduction and overview blog post can be found in this link:


The goals that we are trying to achieve in this post are:

  1. Having the ability to deploy workloads with vRA on both of the datacenters.
  2. Provide Business Continuity in case of a partial of a full site failure.
  3. Having the ability to perform planned or unplanned migration of workloads from one datacenter to another.

To demonstrate the functionality of this design I’ve created demo ‘vPOD’ in VMware internal cloud with the following products in each datacenter:

  • vCenter 6.0 with ESXi host 6.0
  • NSX 6.2.1
  • vRA 6.2.3
  • vSphere Replication 6.1
  • SRM 6.1
  • Cloud Client 3.4.1

In this blog post I will not cover the recovery part of the vRA/vRO components, but this could be achieved with a separated SRM instance for the management infrastructure.

Environment overview

I’m adding short video to introduce the environment.

NSX Manager

The NSX manager in Site A will have the IP address of and will be configured as primary.

The NSX Manager in site B will be configured with the IP and is set as secondary.

Each NSX manager pairs with its own vCenter and learns its local inventory. Any configuration change related to the cross site deployment will run at the primary NSX manager and will be replicated automatically to the remote site.


Universal Logical Switch (ULS)

Creating logical switches (L2) between sites with VxLAN is not new to NSX, however starting from version 6.2.X we’ve introduced the ability of stretching the L2 between NSX managers paired to different vCenters. This new logical switch is known as a ‘Universal Logical Switch’ or ‘ULS’. Any new ULS we will create in the Primary NSX Manger will be synced to the secondary.

I’ve created the following ULS in my Demo vPOD:

Universal Logical Switch (ULS)

Universal Distributed Logical Router (UDLR)

The concept of a Distributed Logical Router is still the same as it was before NSX 6.2.x. The new functionally that was added to this release allows us to configure Universal Distributed Logical Router (UDLR).  When we deploy a UDLR it will show up in all NSX Managers Universal Transport Zone.

The following UDLR created was created:

Universal Distributed Logical Router (UDLR)

Universal Security Policy with Distributed Firewall (UDFW)

With version 6.2.x we’ve introduced the universal security group and universal IP-Set.

Any firewall rule configured in the Universal Section must be IP-SET or Security Group that contain IP-SET.

When we are configuring or changing Universal policy, automatically there is a sync process that runs from the primary to the secondary NSX manager.

The recommended way to work with an ipset is to add it to a universal security group.

The following Universal security policy is an example to allow communicating to 3-Tier application. The security policy is built from universal security groups. Each group contain IP-SET with the relevant IP address for each tier.

Universal Security Policy with Distributed Firewall (UDFW)


At the automation side we’re creating two unique machine blueprints per site. The MBP are based on Classic CentOS image that allows us to perform some connectivity tests.

The MBP named “Center-Site_A” will be deployed by vRA to Site A into the green ULS named: ULS_Green_Web-A.

The IP address pool configured for this ULS is

The MBP named “Center-Site_B” will be deployed by vRA to Site B into the blue ULS named: ULS_Blue_Web-B.

The IP address pool configured for this ULS is

vRA Catalog

Cloud Client:

To quote from VMware Official documentation:

“Typically, a vSphere hosted VM managed by vRA belongs to a reservation, which belongs to a compute resource (cluster), which in turn belongs to a vSphere Endpoint. The VMs reservation in vRA needs to be accurate in order for vRA to know which vSphere proxy agent to utilize to manage that VM in the underlying vSphere infrastructure. This is all well and good and causes few (if any) problems in a single site setup, as the VM will not normally move from the vSphere endpoint it is originally located on.

With a multi-site deployment utilizing Site Recovery Manager all this changes as part of the site to site fail over process involves moving VMs from one vCenter to another. This has the effect in vRA of moving the VM to a different endpoint, but the reservation becomes stale. As a result it becomes no longer possible to perform day 2 operation on the VMs until the reservation is updated.”

When we failover VMs from Site A to Site B cloud client will run the following action behind the science to solve this challenge.

Process Flow for Planned Failover:

Process Flow for Planned Failover

The Conceptual Routing Design with Active/Active Datacenter

The main key point for this design is to run Active/Active for workloads in both datacenters.

The workloads will reside on both Site A and Site B. In the modern datacenter the entry point is protected with perimeter firewall.

In our design each site has its on perimeter firewall run independently FW_A located in Site A and FW_B Located in Site B.
Site A (Shown in Green color) run its own ESGs (Edge Security Gateways), Universal DLR (UDLR) and Universal Logical Switch (ULS).

Site B site (shown in Blue color) have different ESGs, Universal DLR (UDLR) and Universal Logical Switch (ULS).

The main reason for the different ESG, UDLR and ULS per site is to force single ingress/egress point for workload traffic per site.

Without this ingress/egress deterministic traffic flow, we may face asymmetric routing between the two sites, that means that ingress traffic will be via Site A to FW_A and egress via Site B to FW_B, this asymmetric traffic will dropped by the FW_B.

Note: The ESGs in this blog run in ECMP mode, As a consequence we turned off the firewall service on the ESGs.

The Green network will always will be advertise via FW_A.  For an example The Control VM (IP shown in the figure below need to access the Green Web VM connected to the ULS_Web_Green_A , the traffic  from the client will be routed via Core router and to FW_A, from there to one of the ESG working in ECMP mode, then to the Green UDLR and finally to the Green Web VM itself.

Now Assume the same client would like to access the Blue Web VM connected to ULS_Web_Blue_B, this traffic will be routed via the Core router to FW_B, from there to one of the Blue ESG working in ECMP mode, to the Blue ULDR and at the end to the Blue VM itself.

Routing Design with Active/Active Datacenter

What is the issue with this design?

What will happen if we will face a complete failure in one of our Edge Clusters or FW_A?

For our scenario I’ve combined failures of the Green Edge cluster and FW_A in the image below.

In that case we will lose all our N-S traffic to all of our ULS behind this Green Edge Cluster.

As a result, all clients outside the SDDC will lose connectivity immediately to all of the green Green ULS.

Please note: forwarding traffic to the Blue ULS will continue to work in this event regardless of the failure in Site A.



If we’ll have a stretched vSphere Edge cluster between Site A and Site B, then we will able to leverage vSphere HA to restart the failed Green ESGs in the remote Blue site (This is not the case here, in our design each site has its own local cluster and storage), but even if we had vSphere HA, the restart process can take few minutes. Another way to recover from this failure is to manually deploy Green ESGs in Site B, and connect them to Site B FW_B. The recovery time of this solution could take few minutes. Both of these options are not suitable for modern datacenter design.

In the next paragraph I will introduce a new way to design the ESGs in Active/Active datacenter architecture.

This design will be much faster and will work in a more efficient way to recover from such an event in Site A (or Site B).

Active/Active Datacenter with mirrored ESGs

In this design architecture we will be deploying mirrored Green ESGs in Site B, and blue mirrored ESGs into Site A. Under normal datacenter operation the mirrored ESGs will be up and running but will not forward traffic. Site-A green ULS traffic from external clients will always enter via Site A ESGs (E1-Green-A , E2-Green-A) for all of Site A Prefix and leave through the same point.

Adding the mirrored ESGs add some complexity in the single Ingres/Egress design, but improves the converge time of any failure.

PIC8How Ingress Traffic flow works in this design?

Now we will explain how the Ingress traffic flow works in this architecture with mirrored ESGs. In order to simplify the explanation, we will be focusing only on the green flow in both of the datacenters and remove the blue components from the diagrams but the same explanation works for the Blue Site B network as well.

Site A Green UDLR control VM runs eBGP protocol with all Green ESGs (E1-Green-A to E4-Green-B). The UDLR Redistributes all connected interfaces as Site A prefix via eBGP. Note: “Site A prefix” represent any Green Segments part of the green ULS.

The Green ESGs (E1-Green-A  to E4-Green-B) sends out via BGP Site-A’s prefix to both physical firewalls: FW_A located in Site A and FW_B located Site B.

FW_B in Site B will add BGP AS prepending for Site A prefix.

From the Core router point of view, we’ll have two different paths to reach Site A Prefix: one via FW_A (Site A) and the second via FW_B (Site B). Under normal operation, this traffic will flow only through Site A because of the fact that Site B prepending for prefix A.


Egress Traffic

Egress traffic is handled by UDLR control VM with different BGP Weigh values.

Site A ESGs: E1-Green-A and E2-Green-A has mirrors ESGs: E3-Green-B and E4-Green-B located at Site B. The mirrors ESGs provide availability. Under normal operation The UDLR Control VM will always prefer to route the traffic via higher BGP Wight value of E1-Green-A and E2-Green-A.  E3-Green-B and E4-Green-B will not forward any traffic and will wait for E1-E2 to fail.

In the figure below, we can see Web workload running on Site A ULS_Green_A initiate traffic to the Core. This egress traffic pass through DLR Kernel module, trough E1-Green-A ESG and then forward to Site A FW_A.


There are other options for ingress/egress within NSX 6.2:

Great new feature called ‘Local-ID’. Hany Michael wrote a blog post to cover this option.

In Hany’s blog we don’t have a firewall like in my design so please pay attention to few minor differences.


Anthony Burke wrote a blog post about how to use local-id with physical firewall


Routing updates

Below, we’re demonstrating routing updates for Site-A, but the same mechanism works for Site B. The Core router connected to FW_A in Site A will peer with the FW_A via eBGP.

The core will send out 0/0 Default gateway.

FW_A will perform eBGP peering with both E1-Green-A and E2-Green-A. FW_A will forward the 0/0 default gateway to Green ESGs and will receive Site A green Prefix’s from Green ESGs. The Green ESGs E1-Green-A and E2-Green-A peers in eBGP with UDLR control VM.

The UDLR and the ESGs will work in ECMP mode, as results the UDLR will get 0/0 from both ESGs. The UDLR will redistribute connected interfaces (LIFs) to both green ESGs.

We can work with iBGP or eBGP  or mix from the UDLR – > ESG ->  physical routers.

In order to reduce the eBGP converge time of Active UDLR control VM failure, we will configure flowing static route in all of the Green side to point to UDLR forwarding address for the internal LIF’s.

Routing filters will apply on all ESGs to prevent unwanted prefixes advertisement and EGSs becoming transit gateways.


Failure of One Green ESG in Site A

The Green ESGs: E1-Green-A and E2-Green-A working in ECMP mode. From UDLR and FW_A point of view both of the ESG work in Active/Active mode.

As long as we have at least one active Green ESG in Site A, The Green UDLR and the Core router will always prefer to work with Site A Green ESGs.

Let’s assume we have active flow of traffic from the Green WEB VM in site A to the external client behind the core router, and this traffic initially passing through via E1-Green-A. In and event of failure of E1-Green-A ESG, the UDLR will reroute the traffic via E2-Green-ESG because this ESG has better weight then Green ESGs on site B (E3-Green-B and E4-Green-B).

FW_A is still advertising a better as-path to ‘ULS_Web_Green_A’ prefixes than FW_B (remember FW_B always prepending Site_A prefix).

We’ll use low BGP time interval settings (hello=1 sec, hold down=3 sec) to improve BGP converge routing.



Complete Edges cluster failure in site A

In this scenario we face a failure of all Edge cluster in Site A (Green ESGs and Blue ESGs), this issue might include the failure of FW_A.

Core router we will not be receiving any BGP updates from the Site A, so the core will prefer to go to FW_B in order to reach any Site A prefix.

From the UDLR point of view there arn’t any working Green ESGs in Site A, so the UDLR will work with the remaining green ESGs in site B (E3-Green-B, E4-Green-B).

The traffic initiated from the external client will be reroute via the mirrored green ESGs (E3-Green-B and E4-GreenB) to the green ULS in site B. The reroute action will work very fast based on the BGP converge routing time interval settings (hello=1 sec, hold down=3 sec).

This solution is much faster than other options mentioned before.

Same recovery mechanism exists for failure in Site B datacenter.


Note: The Green UDLR control VM was deployed to the payload cluster and isn’t affected by this failure.


Complete Site A failure:

In this catastrophic scenario all components in site A were failed. Including the management infrastructure (vCenter, NSX Manager, controller, ESGs and UDLR control VM). Green workloads will face an outage until they are recovered in Site B, the Blue workloads continues to work without any interference.

The recovery procedure for this event will be made for the infrastructure management/control plan component and for the workloads them self.

Recovery the Management/control plan:

  • Log in to secondary NSX Manager and then Promote Secondary NSX Manager to Primary by: Assign Primary Role.
  • Deploy new Universal Controller Cluster and synchronize all objects
  • Universal CC configuration pushed to ESXi Hosts managed by Secondary
  • Redeploying the UDLR Control VM.

The recovery procedure for the workloads will run the “Recovery plan” from SRM located in site B.




In this blog post we are demonstrating the great power of NSX to create Active/Active datacenter with the ability to recover very fast from many failure scenarios.

  • We showed how NSX simplifies Disaster Recovery process.
  • NSX and SRM Integration is the reasonable approach to DR where we can’t use stretch vSphere cluster.
  • NSX works in Cross vCenter mode. Dual vCenters and NSX managers improving our availability. Even in the event of a complete site failure we were able to continue working immediately in our management layer (Seconday NSX manager and vCenter are Up and running).
  • In this design, half of our environment (Blue segments) wasn’t affected by a complete site failure. SRM recovered our failed Green workloads without need to change our Layer 2/ Layer 3 networks topology.
  • We did not use any specific hardware to achieve our BCDR and we were 100% decupled from the physical layer.
  • With SRM and vRO we were able to protect any deployed VM from Day 0.


I would like to thanks to:

Daniel Bakshi that help me a lots to review this blog post.

Also Thanks Boris Kovalev and Tal Moran that help to with the vRA/vRO demo vPOD.




Asymmetric routing with ECMP and Edge Firewall Enabled

What is Asymmetric Routing?

In Asymmetric routing, a packet traverses from a source to a destination in one path and takes a different path when it returns to the source.

Start from version 6.1 NSX Edge can work with ECMP – Equal Cost Multipath, ECMP traffic involved Asymmetric routing between Edges and DLR or between Edge and physical routers.

ECMP Consideration with Asymmetric Routing

ECMP with  Asymmetric routing is not a problem by itself, but will cause problems when more than one NSX Edge in place  and stateful services inserted in the path of the traffic.

Stateful services like firewall, Load Balanced  Network Address Translation (NAT) can’t work with asymmetric routing.

Explain the problem:

User from outside try to access Web VM inside the Data Center. the traffic will pass through E1 Edge.

From E1 the traffic will go to DLR transverse NSX distributed firewall and get to Web VM.

When Web VM respond back the traffic will hit the DLR default gateway. DLR have two option to route the traffic E1 or E2.

If DLR choose E2 the traffic will get the E2 and will Dropped !!!

The reason for this is E2 does not aware the state of session started at E1, replay packet from Red VM arrived to E2 are not match any existing session at E2.
From E2 perspective this is new session need to validate, any new TCP session should start with SYN, since this is not the begin of the session E2 will drop it!!!

Asymmetric Routing with Edge Firewall Enabled

Asymmetric Routing with Edge Firewall Enabled

Note: NSX Distributed firewall is not part of this problem, NSX Distributed firewall implement at the vNic level, all traffic get in/out same vNic.

there is no Asymmetric route in the vNic level, btw this is the reason when we vMotion VM, the Firewall Rule, Connection state is move with the VM itself.

ECMP and Edge Firewall NSX

Starting from version 6.1 when we enable ECMP  on NSX Edge get message:

Enable ECMP in 6.1 version

The firewall service disabled by default:

Enable ECMP in 6.1 version Firewall turnoff

Even if you try to enable it you will get warning message:

Firewall Service in 6.1 with ECMP

In version 6.1.2 when we enable ECMP we get same message:

Enable ECMP in 6.1 version

But the BIG difference is Firewall Service  is Not disable by default. (you need to turn it off)

Even if you have “Any, Any” rule with “Accept” action we still be subject for DROP packet subject of the Asymmetric routing problem!!!

Firewall Service Enable in 6.1.2

Even in Syslog or LogInSight you will not see this DROP packet !!!

The end users expirese for will be some of the session’s are working just fine (this sessions are not asymmetric) other session will drop (asymmetric sessions)

The place i found we can learn packet are drops because state of the session is with the command: show tech-support:

show tech-support
vShield Edge Firewall Packet Counters:
~~~~~~~~~~~~~~~ snip ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
rid    pkts bytes target     prot opt in     out     source               destination         
0        20  2388 ACCEPT     all  --  *      lo             
0        12   720 DROP       all  --  *      *              state INVALID
0        51  7108 block_out  all  --  *      *             
0         0     0 ACCEPT     all  --  *      *              PHYSDEV match --physdev-in tap0 --physdev-out vNic_+
0         0     0 ACCEPT     all  --  *      *              PHYSDEV match --physdev-in vNic_+ --physdev-out tap0
0         0     0 ACCEPT     all  --  *      *              PHYSDEV match --physdev-in na+ --physdev-out vNic_+
0         0     0 ACCEPT     all  --  *      *              PHYSDEV match --physdev-in vNic_+ --physdev-out na+
0         0     0 ACCEPT     all  --  *      *              state RELATED,ESTABLISHED
0        51  7108 usr_rules  all  --  *      *             
0         0     0 DROP       all  --  *      *  

From line 7 we can see DROP packet because of INVALID state.


When you enable ECMP and you have more then one NSX Edge in you topology, go to Firewall service and disable it by yourself otherwise you will spend lots of troubleshooting hours 🙁

NSX Edge and DRS Rules

The NSX Edge Cluster Connects the Logical and Physical worlds and usually hosts the NSX Edge Services Gateways and the DLR Control VMs.

There are deployments where the Edge Cluster may contain the NSX Controllers as well.

In this section we discuss how to design an Edge Cluster to survive a failure of an ESXi host or an Physical entire chassis and lower the time of outage.

In the figure below we deploy NSX Edges, E1 and E2, in ECMP mode where they run active/active both from the perspective of the control and data planes. The DLR Control VMs run active/passive while both E1 and E2 running a dynamic routing protocol with the active DLR Control VM.

When the DLR learns a new route from E1 or E2, it will push this information to the NSX Controller cluster. The NSX Controller will update the routing tables in the kernel of each ESXi hosts, which are running this DLR instance.




In the scenario where the ESXi host, which contains the Edge E1, failed:

  • The active DLR will update the NSX Controller to remove E1 as next hop, the NSX Controller will update the ESXi host and as a result the “Web” VM traffic will be routed to Edge E2.
    The time it takes to re-route the traffic depends on the dynamic protocol converge time.


In the specific scenario where the failed ESXi or Chassis contained both the Edge E1 and the active DLR, we would instead face a longer outage in the forwarded traffic.

The reason for this is that the active DLR is down and cannot detect the failure of the Edge E1 and accordingly update the Controller. The ESXi will continue to forward traffic to Edge E1 until the passive DLR becomes active, learns that the Edge E1 is down and updates the NSX Controller.


The Golden Rule is:

We must ensure that when the Edge Services Gateway and the DLR Control VM belong to the same tenant they will not reside in the same ESXi host. It is better to distribute them between ESXi hosts and reduce the affected functions.

By default when we deploy a NSX Edge or DLR in active/passive mode, the system takes care of creating a DRS anti-affinity rule and this prevents the active/passive VMs from running in the same ESXi host.

DRS anti affinity rules

DRS anti affinity rules

We need to build new DRS rules as these default rules will not prevent us from getting to the previous dual failure scenario.

The figure below describes the network logical view for our specific example. This topology is built from two different tenants where each tenant is being represented with a different color and has its own Edge and DLR.

Note connectivity to the physical world is not displayed in the figure below in order to simplify the diagram.

multi tenants

My physical Edge Cluster has four ESXi hosts which are distributed over two physical chassis:

Chassis A: esxcomp-01a, esxcomp-02a

Chassis B: esxcomp-01b, esxcomp-02b


Create DRS Host Group for each Chassis

We start with creating a container for all the ESXi hosts in Chassis A, this container group configured is in DRS Host Group.

Edge Cluster -> Manage -> Settings -> DRS Groups

Click on Create Add button and call this group “Chassis A”.

Container type need to be “Host DRS Group” and Add ESXi host running on Chassis A (esxcomp-01a and esxcomp-02a).


Create another DRS group called Chassis B that contains esxcomp-01b and esxcomp-02b:



VM’s DRS Group for Chassis A:

We need to create a container for VMs that will run in Chassis A. At this point we just name it as Chassis A, but we are not actually putting the VMs in Chassis A.

This Container type is “VM DRS Group”:


VM DRS Group for Chassis B:



At this point we have four DRS groups:


DRS Rules:

Now we need to take the DRS object we created before: “Chassis A” and “VM to Chassis A “ and tie them together. The next step is to do the same for “Chassis B” and “VM to Chassis B“

* This configuration needs to be part of “DRS Rules”.

Edge Cluster -> Manage -> Settings -> DRS Rules

Click on the Add button in DRS Rules, in the name enter something like: “VM’s Should Run on Chassis A”

In the Type select “Virtual Machine to Hosts” because we want to bind the VM’s group to the Hosts Group.

In the VM group name choose “VM to Chassis A” object.

Below the VM group selection we need to select the group & hosts binding enforcement type.

We have two different options:

“Should run on hosts in group” or “Must run on hosts in group”

If we choose “Must” option, in the event of the failure of all the ESXi hosts in this group (for example if Chassis A had a critical power outage), the other ESXi hosts in the cluster (Chassis B) would not be considered by vSphere HA as a viable option for the recovery of the VMs. “Should” option will take other ESXi hosts as recovery option.



Same for Chassis B:


Now the problem with the current DRS rules and the VM placement in this Edge cluster is that the Edge and DLR Control VM are actually running in the same ESXi host.  We need to create anti-affinity DRS rules.

Anti-Affinity Edge and DLR:

An Edge and DLR that belong to the same tenant should not run in the same ESXi host.

For Green Tenant:


For Blue Tenant:


The Final Result:

In the case of a failure of one of the ESXi hosts we don’t face the problem where Edge and DLR are on the same ESXi host, even if we have a catastrophic event of a chassis A or B failure.




Control VM location can move to compute cluster and we can avoid this design consideration.

Thanks to Max Ardica and  Tiran Efrat for reviewing this post.



Thanks to Francis Guillier Max Ardica and  Tiran Efrat of the overview and feedback.

One of the most important NSX Edge features is NAT.
With NAT (Network Address Translation) we can change the Source or Destination IP addresses and TCP/UDP port. Combined NAT and Firewall rules can lead to confusion when we try to determine the correct IP address to which apply the firewall rule.
To create the correct rule we need to understand the packet flow inside the NSX Edge in details. In NSX Edge we have two different type of NAT: Source Nat (SNAT) and Destination NAT (DNAT).



Allows translating an internal IP address (for example private IP described in RFC 1918) to a public External IP address.
In figure below, the IP address for any VM in VXLAN 5001 that needs outside connectivity to the WAN can be translated to an external IP address (this mapping is configured on the Edge). For example, VM1 with IP address needs to communicate with WAN Internet, so the NSX Edge can translate it to a IP address configured on the Edge external interface.
Users in the external network are not aware of the internal Private IP address.




Allow to access internal private IP addresses from the outside world.
In the example in figure below, users from the WAN need to communicate with the Server
NSX Edge DNAT mapping configuration is created so that the users from outside connect to and NSX Edge translates this IP address to


Below is the outline of the Packet flow process inside the Edge. The important parts are where the SNAT/DNAT Action and firewall decision action are being taken.

packet flow

We can see from this process that the ingress packet will evaluate against FW rules before SNAT/DNAT translation.

Note: the actual packet flow details are more complicated with more action/decisions in Edge flow, but the emphasis here is on the NAT and FW functionalities only.

Note:  NAT function will work only if firewall service is enabled.

Enable Firewall Service



Firewall rules and SNAT

Because of this packet flow the firewall rule for SNAT need to be applied on the internal IP address object and not on the IP address translated by the SNAT function. For example, when a VM1 needs to communicate with the WAN, the firewall rule needs to be:

fw and SNAT

 Firewall rules and DNAT

Because of this packet flow the firewall rules for DNAT need to be applied on the public IP address object and not on the Private IP address after the DNAT translation. When a user from the WAN sends traffic to, this packet will be checked against this FW rule and then the NAT will change the destination IP address to

fw and DNAT

DNAT Configuration

Users from outside need to access an internal web server connecting to its public IP address.
The server internal IP address is, the NAT IP address is



The first step is creating the External IP on the Edge, this IP is secondary because this edge already has a main IP address configured in the IP subnet.

Note: the main IP address is marked with a black Ddot (

For this example the DNAT IP address is


Create a DNAT Rule in the Edge:


Now pay attention to the firewall rules one the Edge: a user coming from the outside will try to access the internal server by connecting to the public IP address This implies that the fw rule needs to allow this access.



DNAT Verification:

There are several ways to verify NAT is functioning as originally planned. In our example, users from any source address access the public IP address, and after the NAT translation the packet destination IP address is changed to

The output of the command:

show nat

show nat

The output of the command:

show firewall flow

We can see that packet is received by the Edge and destined to the address, the return traffic is instead originated from the different IP address (the private IP address).
That means DNAT translation is happening here.

show flow

We can capture the traffic and see the actual packet:
Capture Edge traffic on its outside interface vNic_0, in this example user source IP address is and destination is

The command for capture is:
debug packet display interface vNic_0 port_80_and_src_192.168.110.10

Debug packet display interface vNic_0 port_80_and_src_192.168.110.10

debug packet 1

Capture edge on internal interface vNic_1 we can see destination IP address has changed to because of DNAT translation:

debug packet 2

SNAT configuration

All the servers part of VXLAN segment 5001 (associated to the IP subnet need to leverage SNAT translation (in this example to IP address on the outside interface of the Edge to be able to communicate with the external network.


SNAT config

SNAT Configuration:

snat config 2

Edge Firewall Rules:

Allow to to go out

SNAT config fw rule



The output of the command

Show nat

show nat verfication

DNAT with L4 Address Translation (PAT)

DNAT with L4 Address Translation allows changing Layer4 TCP/UDP port.
For example we would like to mask our internal SSH server port for all users from outside.
The new port will be TCP/222 instead of regular SSH TCP/22 port.

The user originates a connection to the Web Server on destination port TCP/222 but the NSX Edge will change it to TCP/22.


From the command line the show nat command:

PAT show nat

NAT Order

In this specific scenario, we want to create the two following SNAT rules.

  • SNAT Rule 1:
    The IP addresses for the devices part of VXLAN 5001 (associated to the IP subnet need to be translated to the Edge outside interface address
  • SNAT Rule 2:
    Web-SRV-01a on VXLAN 5001 needs its IP address to be translated to the Edge outside address

nat order

In the configuration example above, traffic will never hit rule number 4 because is part of subnet, so its IP address will be translated to (and not the desired

Order for SNAT rules is important!
We need to re-order the SNAT rules and put the more specific one on top, so that rule 3 will be hit for traffic originated from the IP address, whereas rule 4 will apply to all the other devices part of IP subnet

nat reorder

After re-order:

nat after reorer


another useful command

show configuration nat


NSX Load Balancing

This next overview of Load Balancing  was taken from great work of Max Ardica and Nimish Desai in the official NSX Design Guide:


Load Balancing is another network service available within NSX that can be natively enabled on the NSX Edge device. The two main drivers for deploying a load balancer are scaling out an application (through distribution of workload across multiple servers), as well as improving its high-availability characteristics

NSX Load Balancing

NSX Load Balancing

The NSX load balancing service is specially designed for cloud with the following characteristics:

  • Fully programmable via API
  • Same single central point of management/monitoring as other NSX network services

The load balancing services natively offered by the NSX Edge satisfies the needs of the majority of the application deployments. This is because the NSX Edge provides a large set of functionalities:

  • Support any TCP applications, including, but not limited to, LDAP, FTP, HTTP, HTTPS
  • Support UDP application starting from NSX SW release 6.1.
  • Multiple load balancing distribution algorithms available: round-robin, least connections, source IP hash, URI
  • Multiple health checks: TCP, HTTP, HTTPS including content inspection
  • Persistence: Source IP, MSRDP, cookie, ssl session-id
  • Connection throttling: max connections and connections/sec
  • L7 manipulation, including, but not limited to, URL block, URL rewrite, content rewrite
  • Optimization through support of SSL offload

Note: the NSX platform can also integrate load-balancing services offered by 3rd party vendors. This integration is out of the scope for this paper.

In terms of deployment, the NSX Edge offers support for two types of models:

  • One-arm mode (called proxy mode): this scenario is highlighted in Figure below and consists in deploying an NSX Edge directly connected to the logical network it provides load-balancing services for.
One-Arm Mode Load Balancing Services

One-Arm Mode Load Balancing Services

The one-armed load balancer functionality is shown above:

  1. The external client sends traffic to the Virtual IP address (VIP) exposed by the load balancer.
  2. The load balancer performs two address translations on the original packets received from the client: Destination NAT (D-NAT) to replace the VIP with the IP address of one of the servers deployed in the server farm and Source NAT (S-NAT) to replace the client IP address with the IP address identifying the load-balancer itself. S-NAT is required to force through the LB the return traffic from the server farm to the client.
  3. The server in the server farm replies by sending the traffic to the LB (because of the S-NAT function previously discussed).

The LB performs again a Source and Destination NAT service to send traffic to the external client leveraging its VIP as source IP address.

The advantage of this model is that it is simpler to deploy and flexible as it allows deploying LB services (NSX Edge appliances) directly on the logical segments where they are needed without requiring any modification on the centralized NSX Edge providing routing communication to the physical network. On the downside, this option requires provisioning more NSX Edge instances and mandates the deployment of Source NAT that does not allow the servers in the DC to have visibility into the original client IP address.

Note: the LB can insert the original IP address of the client into the HTTP header before performing S-NAT (a function named “Insert X-Forwarded-For HTTP header”). This provides the servers visibility into the client IP address but it is obviously limited to HTTP traffic.

Inline mode (called transparent mode) requires instead deploying the NSX Edge inline to the traffic destined to the server farm. The way this works is shown in Figure below.

Two-Arms Mode Load Balancing Services

Two-Arms Mode Load Balancing Services

    1. The external client sends traffic to the Virtual IP address (VIP) exposed by the load balancer.
    2. The load balancer (centralized NSX Edge) performs only Destination NAT (D-NAT) to replace the VIP with the IP address of one of the servers deployed in the server farm.
    3. The server in the server farm replies to the original client IP address and the traffic is received again by the LB since it is deployed inline (and usually as the default gateway for the server farm).
    4. The LB performs Source NAT to send traffic to the external client leveraging its VIP as source IP address.

    This deployment model is also quite simple and allows the servers to have full visibility into the original client IP address. At the same time, it is less flexible from a design perspective as it usually forces using the LB as default gateway for the logical segments where the server farms are deployed and this implies that only centralized (and not distributed) routing must be adopted for those segments. It is also important to notice that in this case LB is another logical service added to the NSX Edge already providing routing services between the logical and the physical networks. As a consequence, it is recommended to increase the form factor of the NSX Edge to X-Large before enabling load-balancing services.


    In terms of scalability and throughput figures, the NSX load balancing services offered by each single NSX Edge can scale up to (best case scenario):

    • Throughput: 9 Gbps
    • Concurrent connections: 1 million
    • New connections per sec: 131k


    In below are some deployment examples of tenants with different applications and different load balancing needs. Notice how each of these applications is hosted on the same Cloud with the network services offered by NSX.

Deployment Examples of NSX Load Balancing

Deployment Examples of NSX Load Balancing

Two final important points to highlight:

  • The load balancing service can be fully distributed across This brings multiple benefits:
  • Each tenant has its own load balancer.
  • Each tenant configuration change does not impact other tenants.
  • Load increase on one tenant load-balancer does not impact other tenants load-balancers scale.
  • Each tenant load balancing service can scale up to the limits mentioned above.

Other network services are still fully available

  • The same tenant can mix its load balancing service with other network services such as routing, firewalling, VPN.


One Arm Load Balance Lab Topology

In this One Arm Load Balance Lab Topology we have a 3-tiers application built from:

Web servers: web-sv-01a (, web-sv-02a (

App: app-sv-01a (

DB: db-sv-01a (

We will add to this lab NSX Edge service gateway (ESG) for load balancer function.

The ESG (highlighted with the red line) is deployed in one-arm mode and exposes the VIP to load-balance traffic to the Web-Tier-01 segment.

One-Armed Lab topology


Configure One Arm Load Balance

Create NSX Edge gateway:


Select Edge Service Gateway (ESG):

Set the Admin password, enable SSH and Auto rule:


Install the ESG in Management Cluster:


In our lab appliance size is Compact, but we should choose the right size according to amount of traffic expected:


Configure the Edge interface and IP address; since this is one-arm mode we have only one interface:


Create default gateway


Configure default accept fw rule:


Complete the installation:


Verify ESG is deployed::


Enable Load Balance in the ESG, go to Load Balance and click Edit:


Check mark “Enable Load Balancer”


Create the application profile:


Add a name, in the Type select HTTPS and Enable SSL Passthrough:


Create the pool:


In the Algorithm select ROUND-ROBIN, monitor is default https, and add two servers member to monitor:


To add Members click on the + icon, the port we monitor is 443:


We need then to create the VIP:


In this step we glue all the configuration parts, tie the application profile to pool and give it the Virtual IP address:


Now we can check that the load balancer is actually working by connecting to the VIP address with a client web browser.

In the web browser, we point to the VIP address

The results is to hit web-sv-01a:


When we try to refresh our web browser client we see we hit web-sv-02a :


Troubleshooting One Arm Load Balance

General Loadbalancer troubleshooting workflow

Review the configuration through UI

Check the pool member status through UI

Do online troubleshooting via CLI:

  • Check LB engine status (L4/L7)
  • Check LB objects statistics (vips, pools, members)
  • Check Service Monitor status (OK, WARNING, CRITICAL)
  • Check system log message (# show log)
  • Check LB L4/L7 session table
  • Check LB L7 sticky-table status


Check the configuration through UI





  1. Check the pool member status through UI:



Possible errors discovered:

  1. 80/443 port might be used by other services (e.g. sslvpn);
  2. Member port and monitor port are miss configured hence health check failed.
  3. Member in WARNING state should be treated as DOWN.
  4. L4 LB is used when:
    a) TCP/HTTP protocol;
    b) no persistence settings and L7 settings;
    c) accelerateEnable is true;
  5. Pool is in transparent mode but Edge doesn’t sit in the return pat

Do online troubleshooting via CLI:

Check LB engine status (L4/L7)

# show service loadbalancer

Check LB objects statistics (vips, pools, members)

# show service loadbalancer virtual [vip-name]

# show service loadbalancer pool [poo-name]

Check Service Monitor status (OK, WARNING, CRITICAL)

# show service loadbalancer monitor

Check system log message

# show log

Check LB session table

# show service loadbalancer session

Check LB L7 sticky-table status

# show service loadbalancer table



One-Arm-LB-0> show service loadbalancer
error Show loadbalancer Latest Errors information.
monitor Show loadbalancer HealthMonitor information.
pool Show loadbalancer pool information.
session Show loadbalancer Session information.
table Show loadbalancer Sticky-Table information.
virtual Show loadbalancer virtualserver information.


One-Arm-LB-0> show service loadbalancer
Loadbalancer Services Status:

L7 Loadbalancer : running
Health Monitor : running


One-Arm-LB-0> show service loadbalancer monitor
Loadbalancer HealthMonitor Statistics:

POOL                               MEMBER                                  HEALTH STATUS
Web-Servers-Pool-01  web-sv-02a_172.16.10.12   default_https_monitor:OK
Web-Servers-Pool-01  web-sv-01a_172.16.10.11   default_https_monitor:OK


One-Arm-LB-0> show service loadbalancer virtual
Loadbalancer VirtualServer Statistics:

| ADDRESS []:443
| SESSION (cur, max, total) = (0, 3, 35)
| RATE (cur, max, limit) = (0, 6, 0)
| BYTES in = (17483), out = (73029)
+->POOL Web-Servers-Pool-01
| LB METHOD round-robin
| Transparent disabled
| SESSION (cur, max, total) = (0, 3, 35)
| BYTES in = (17483), out = (73029)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-01a_172.16.10.11, STATUS: UP
| | STATUS = UP, MONITOR STATUS = default_https_monitor:OK
| | SESSION (cur, max, total) = (0, 2, 8)
| | BYTES in = (8882), out = (43709)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-02a_172.16.10.12, STATUS: UP
| | STATUS = UP, MONITOR STATUS = default_https_monitor:OK
| | SESSION (cur, max, total) = (0, 1, 7)
| | BYTES in = (7233), out = (29320)

One-Arm-LB-0> show service loadbalancer pool
Loadbalancer Pool Statistics:

POOL Web-Servers-Pool-01
| LB METHOD round-robin
| Transparent disabled
| SESSION (cur, max, total) = (0, 3, 35)
| BYTES in = (17483), out = (73029)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-01a_172.16.10.11, STATUS: UP
| | STATUS = UP, MONITOR STATUS = default_https_monitor:OK
| | SESSION (cur, max, total) = (0, 2, 8)
| | BYTES in = (8882), out = (43709)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-02a_172.16.10.12, STATUS: UP
| | STATUS = UP, MONITOR STATUS = default_https_monitor:OK
| | SESSION (cur, max, total) = (0, 1, 7)
| | BYTES in = (7233), out = (29320)


One-Arm-LB-0> show service loadbalancer session
L7 Loadbalancer Current Sessions:

0x5fe50a2b230: proto=tcpv4 src= fe=Web-Servers-VIP be=Web-Servers-Pool-01 srv=web-sv-01a_172.16.10.11 ts=08 age=8s calls=3 rq[f=808202h,i=0,an=00h,rx=4m53s,wx=,ax=] rp[f=008202h,i=0,an=00h,rx=4m53s,wx=,ax=] s0=[7,8h,fd=13,ex=] s1=[7,8h,fd=14,ex=] exp=4m52s
0x5fe50a22960: proto=unix_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=09 age=0s calls=2 rq[f=c08200h,i=0,an=00h,rx=20s,wx=,ax=] rp[f=008002h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=1,ex=] s1=[7,0h,fd=-1,ex=] exp=20s


Disconnect web-sv-01a_172.16.10.11 from the network




From the GUI we can see the effect in members pool status:



One-Arm-LB-0> show service loadbalancer virtual
Loadbalancer VirtualServer Statistics:

| ADDRESS []:443
| SESSION (cur, max, total) = (0, 3, 35)
| RATE (cur, max, limit) = (0, 6, 0)
| BYTES in = (17483), out = (73029)
+->POOL Web-Servers-Pool-01
| LB METHOD round-robin
| Transparent disabled
| SESSION (cur, max, total) = (0, 3, 35)
| BYTES in = (17483), out = (73029)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-01a_172.16.10.11, STATUS: DOWN
| | STATUS = DOWN, MONITOR STATUS = default_https_monitor:CRITICAL
| | SESSION (cur, max, total) = (0, 2, 8)
| | BYTES in = (8882), out = (43709)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-02a_172.16.10.12, STATUS: UP
| | STATUS = UP, MONITOR STATUS = default_https_monitor:OK
| | SESSION (cur, max, total) = (0, 1, 7)
| | BYTES in = (7233), out = (29320)

VMware NSX Edge Scale Out with Equal-Cost Multi-Path Routing

This post was written by Roie Ben Haim and Max Ardica, with a special thanks to Jerome Catrouillet, Michael Haines, Tiran Efrat and Ofir Nissim for their valuable input

The modern data center design is changing, following a shift in the habits of consumers using mobile devices, the number of new applications that appear every day and the rate of end-user browsing which has grown exponentially. Planning a new data center requires meeting certain fundamental design guidelines. The principal goals in data center design are: Scalability, Redundancy and High-bandwidth.

In this blog we will describe the Equal Cost Multi-Path functionality (ECMP) introduced in VMware NSX release 6.1 and discuss how it addresses the requirements of scalability, redundancy and high bandwidth. ECMP has the potential to offer substantial increases in bandwidth by load-balancing traffic over multiple paths as well as providing fault tolerance for failed paths. This is a feature which is available on physical networks but we are now introducing this capability for virtual networking as well. ECMP uses a dynamic routing protocol to learn the next-hop towards a final destination and to converge in case of failures. For a great demo of how this works, you can start by watching this video, which walks you through these capabilities in VMware NSX.




Scalability and Redundancy and ECMP

To keep pace with the growing demand for bandwidth, the data center must meet scale out requirements, which provide the capability for a business or technology to accept increased volume without redesign of the overall infrastructure. The ultimate goal is avoiding the “rip and replace” of the existing physical infrastructure in order to keep up with the growing demands of the applications. Data centers running business critical applications need to achieve near 100 percent uptime. In order to achieve this goal, we need the ability to quickly recover from failures affecting the main core components. Recovery from catastrophic events needs to be transparent to end user experiences.

ECMP with VMware NSX 6.1 allows you to use upto a maximum of 8 ECMP Paths simultaneously. In a specific VMware NSX deployment, those scalability and resilience improvements are applied to the “on-ramp/off-ramp” routing function offered by the Edge Services Gateway (ESG) functional component, which allows communication between the logical networks and the external physical infrastructure.

ECMP Topology

ECMP Topology


External user’s traffic arriving from the physical core routers can use up to 8 different paths (E1-E8) to reach the virtual servers (Web, App, DB).

In the same way, traffic returning from the virtual server’s hit the Distributed Logical Router (DLR), which can choose up to 8 different paths to get to the core network.

How the Path is Determined

NSX for vSphere Edge Services Gateway device:

When a traffic flow needs to be routed, the round robin algorithm is used to pick up one of the links as the path for all traffic of this flow. The algorithm ensures to keep in order all the packets related to this flow by sending them through the same path. Once the next-hop is selected for a particular Source IP and Destination IP pair, the route cache stores this. Once a path has been chosen, all packets related to this flow will follow the same path.

There is a default IPv4 route cache timeout, which is 300 seconds. If an entry is inactive for this period of time, it is then eligible to be removed from route cache. Note that these settings can be tuned for your environment.

Distributed Logical Router (DLR):

The DLR will choose a path based on a Hashing algorithm of Source IP and Destination IP.


What happens in case of a failure on one of Edge Devices?

In order to work with ECMP the requirement is to use a dynamic routing protocol: OSPF or BGP. If we take OSPF for example, the main factor influencing the traffic outage experience is the tuning of the

OSPF timers.

OSPF will send hello messages between neighbors, the OSPF “Hello” protocol is used and determines the Interval as to how often an OSPF Hello is sent.

Another OSPF timer called “Dead” Interval is used, which is how long to wait before we consider an OSPF neighbor as “down”. The OSPF Dead Interval is the main factor that influences the convergence time. Dead Interval is usually 4 times the Hello Interval but the OSPF (and BGP) timers can be set as low as 1 second (for Hello interval) and 3 seconds (for Dead interval) to speed up the traffic recovery.


ECMP failed Edge

ECMP failed Edge


In the example above, the E1 NSX Edge has a failure; the physical routers and DLR detect E1 as Dead at the expiration of the Dead timer and remove their OSPF neighborship with him. As a consequence, the DLR and the physical router remove the routing table entries that originally pointed to the specific next-hop IP address of the failed ESG.

As a result, all corresponding flows on the affected path are re-hashed through the remaining active units. It’s important to emphasize that network traffic that was forwarded across the non-affected paths remains unaffected.


Troubleshooting and visibility

With ECMP it’s important to have introspection and visibility tools in order to troubleshoot optional point of failure. Let’s look at the following topology.



A user outside our Data Center would like to access the Web Server service inside the Data Center. The user IP address is and the web server IP address is

This User traffic will hit the Physical Router (R1), which has established OSPF adjacencies with E1 and E2 (the Edge devices). As a result R1 will learn how to get to the Web server from both E1 and E2 and will get two different active paths towards R1 will pick one of the paths to forward the traffic to reach the Web server and will advertise the user network subnet to both E1 and E2 with OSPF.

E1 and E2 are NSX for vSphere Edge devices that also establish OSPF adjacencies with the DLR. E1 and E2 will learn how to get to the Web server via OSPF control plane communication with the DLR.

From the DLR perspective, it acts as a default gateway for the Web server. This DLR will form an OSPF adjacency with E1 and E2 and have 2 different OSPF routes to reach the user network.
From the DLR we can verify OSPF adjacency with E1, E2.

We can use the command: “show ip ospf neighbor”

show ip ospf neighbor

show ip ospf neighbor

From this output we can see that the DLR has two Edge neighbors: and next step will be to verify that ECMP is actually working.

We can use the command: “show ip route”

show ip route

show ip route

The output from this command shows that the DLR learned the user network via two different paths, one via E1 = and the other via E2 =

Now we want to display all the packets which were captured by an NSX for vSphere Edge interface.

In the example below and in order to display the traffic passing through interface vNic_1, and which is not OSPF protocol control packets, we need to type this command:
“debug packet display interface vNic_1 not_ip_proto_ospf”

We can see an example with a ping running from host to host

Capture traffic

Capture traffic

If we would like to display the captured traffic to a specific ip address, the command capture would look like: “debug packet display interface vNic_1 dst_172.16.10.10”

debug packet display interface vNic_1 dst

debug packet display interface vNic_1 dst

* Note: When using the command “debug packter display interface” we need to add underscore between the expressions after the interface name.

Useful CLI for Debugging ECMP

To check which ECMP path is chosen for a flow

  • debug packet display interface IFNAME

To check the ECMP configuration

  • show configuration routing-global

To check the routing table

  • show ip route

To check the forwarding table

  • show ip forwarding


Useful CLI for Dynamic Routing

  • show ip ospf neighbor
  • show ip ospf database
  • show ip ospf interface
  • show ip bgp neighbors
  • show ip bgp

ECMP Deployment Consideration

ECMP currently implies stateless behavior. This means that there is no support for stateful services such as the Firewall, Load Balancing or NAT on the NSX Edge Services Gateway.

Starting from 6.1.2 Edge Firewall not disabled automatic on ESG when ECMP is enabled, turn off Firewall when enable ECMP.

In the current NSX 6.1 release, the Edge Firewall and ECMP cannot be turned on at the same time on NSX edge device. Note however, that the Distributed Firewall (DFW) is unaffected by this.


About the authors:

Roie Ben Haim

Roie works as a professional services consultant at VMware, focusing on design and implementation of VMware’s software-defined data center products.  Roie has more than 12 years in data center architecture, with a focus on network and security solutions for global enterprises. An enthusiastic M.Sc. graduate, Roie holds a wide range of industry leading certifications including Cisco CCIE x2 # 22755 (Data Center, CCIE Security), Juniper Networks JNCIE – Service Provider #849, and VMware vExpert 2014, VCP-NV, VCP-DCV.

Max Ardica

Max Ardica is a senior technical product manager in VMware’s networking and security business unit (NSBU). Certified as VCDX #171, his primary task is helping to drive the evolution of the VMware NSX platform, building the VMware NSX architecture and providing validated design guidance for the software-defined data center, specifically focusing on network virtualization. Prior to joining VMware, Max worked for almost 15 years at Cisco, covering different roles, from software development to product management. Max owns also a CCIE certification (#13808).