NSX Dual Active/Active Datacenters BCDR

Overview

Modern data center design requires strong redundancy and demands Business Continuity (BC) and Disaster Recovery (DR) capabilities in case of a catastrophic failure in the datacenter. Planning a new data center with BCDR requires meeting certain fundamental design guidelines.

In this blog post I will describe an Active/Active datacenter design built with the full VMware SDDC product suite.

NSX runs in Cross-vCenter mode, a capability introduced in VMware NSX release 6.2.x. In this blog post we will focus on networking and security.

An introduction and overview blog post can be found at this link:

http://blogs.vmware.com/consulting/2015/11/how-nsx-simplifies-and-enables-true-disaster-recovery-with-site-recovery-manager.html

The goals that we are trying to achieve in this post are:

  1. Having the ability to deploy workloads with vRA to both datacenters.
  2. Providing Business Continuity in case of a partial or a full site failure.
  3. Having the ability to perform planned or unplanned migration of workloads from one datacenter to the other.

To demonstrate the functionality of this design I’ve created a demo ‘vPOD’ in VMware’s internal cloud with the following products in each datacenter:

  • vCenter 6.0 with ESXi host 6.0
  • NSX 6.2.1
  • vRA 6.2.3
  • vSphere Replication 6.1
  • SRM 6.1
  • Cloud Client 3.4.1

In this blog post I will not cover the recovery of the vRA/vRO components themselves, but this could be achieved with a separate SRM instance for the management infrastructure.

Environment overview

I’m adding a short video to introduce the environment.

NSX Manager

The NSX manager in Site A will have the IP address of 192.168.110.15 and will be configured as primary.

The NSX Manager in site B will be configured with the IP 192.168.210.15 and is set as secondary.

Each NSX Manager pairs with its own vCenter and learns its local inventory. Any configuration change related to the cross-site deployment is made at the primary NSX Manager and replicated automatically to the remote site.

 

Universal Logical Switch (ULS)

Creating logical switches (L2) between sites with VXLAN is not new to NSX; however, starting from version 6.2.x we’ve introduced the ability to stretch the L2 between NSX Managers paired to different vCenters. This new logical switch is known as a ‘Universal Logical Switch’ or ‘ULS’. Any new ULS we create on the primary NSX Manager will be synced to the secondary.

I’ve created the following ULS in my Demo vPOD:

Universal Logical Switch (ULS)

Universal Distributed Logical Router (UDLR)

The concept of a Distributed Logical Router is still the same as it was before NSX 6.2.x. The new functionality added in this release allows us to configure a Universal Distributed Logical Router (UDLR). When we deploy a UDLR, it shows up in the Universal Transport Zone of all NSX Managers.

The following UDLR was created:

Universal Distributed Logical Router (UDLR)

Universal Security Policy with Distributed Firewall (UDFW)

With version 6.2.x we’ve introduced the universal security group and the universal IP set.

Any firewall rule configured in the universal section must use an IP set, or a security group that contains IP sets.

When we configure or change a universal policy, a sync process automatically runs from the primary to the secondary NSX Manager.

The recommended way to work with an IP set is to add it to a universal security group.

The following universal security policy is an example that allows communication to a 3-tier application. The security policy is built from universal security groups; each group contains an IP set with the relevant IP addresses for its tier.
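For example, a universal IP set can also be created through the REST API of the primary NSX Manager (a sketch based on the NSX 6.2 API; the credentials and member addresses are placeholders):

curl -k -u admin:password -H "Content-Type: application/xml" \
  -X POST https://192.168.110.15/api/2.0/services/ipset/universalroot-0 \
  -d '<ipset><name>Universal-IPSET-Web</name><value>172.16.10.11,172.16.10.12</value></ipset>'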

Universal Security Policy with Distributed Firewall (UDFW)

vRA

On the automation side we’re creating two unique machine blueprints (MBPs), one per site. The MBPs are based on a classic CentOS image that allows us to perform some connectivity tests.

The MBP named “Center-Site_A” will be deployed by vRA to Site A into the green ULS named: ULS_Green_Web-A.

The IP address pool configured for this ULS is 172.16.10.0/24.

The MBP named “Center-Site_B” will be deployed by vRA to Site B into the blue ULS named: ULS_Blue_Web-B.

The IP address pool configured for this ULS is 172.17.10.0/24.

vRA Catalog

Cloud Client:

To quote from the official VMware documentation:

“Typically, a vSphere hosted VM managed by vRA belongs to a reservation, which belongs to a compute resource (cluster), which in turn belongs to a vSphere Endpoint. The VMs reservation in vRA needs to be accurate in order for vRA to know which vSphere proxy agent to utilize to manage that VM in the underlying vSphere infrastructure. This is all well and good and causes few (if any) problems in a single site setup, as the VM will not normally move from the vSphere endpoint it is originally located on.

With a multi-site deployment utilizing Site Recovery Manager all this changes as part of the site to site fail over process involves moving VMs from one vCenter to another. This has the effect in vRA of moving the VM to a different endpoint, but the reservation becomes stale. As a result it becomes no longer possible to perform day 2 operation on the VMs until the reservation is updated.”

When we fail over VMs from Site A to Site B, Cloud Client runs the following actions behind the scenes to solve this challenge.

Process Flow for Planned Failover:

Process Flow for Planned Failover

The Conceptual Routing Design with Active/Active Datacenter

The key point of this design is to run workloads Active/Active in both datacenters.

The workloads will reside in both Site A and Site B. In the modern datacenter, the entry point is protected by a perimeter firewall.

In our design each site has its own perimeter firewall running independently: FW_A located in Site A and FW_B located in Site B.
Site A (shown in green) runs its own ESGs (Edge Services Gateways), Universal DLR (UDLR) and Universal Logical Switches (ULS).

Site B (shown in blue) has different ESGs, a Universal DLR (UDLR) and Universal Logical Switches (ULS).

The main reason for separate ESGs, UDLR and ULS per site is to force a single ingress/egress point for workload traffic per site.

Without this deterministic ingress/egress traffic flow we may face asymmetric routing between the two sites: ingress traffic would enter via Site A through FW_A and egress via Site B through FW_B, and this asymmetric traffic would be dropped by FW_B.

Note: The ESGs in this blog run in ECMP mode; as a consequence, we turned off the firewall service on the ESGs.

The green networks will always be advertised via FW_A. For example, the control VM (IP 192.168.110.10) shown in the figure below needs to access the green Web VM connected to ULS_Web_Green_A. The traffic from the client is routed via the core router to FW_A, from there to one of the green ESGs working in ECMP mode, then to the green UDLR, and finally to the green Web VM itself.

Now assume the same client would like to access the blue Web VM connected to ULS_Web_Blue_B. This traffic is routed via the core router to FW_B, from there to one of the blue ESGs working in ECMP mode, then to the blue UDLR, and finally to the blue VM itself.

Routing Design with Active/Active Datacenter

What is the issue with this design?

What happens if we face a complete failure of one of our Edge clusters or of FW_A?

For our scenario I’ve combined failures of the green Edge cluster and FW_A in the image below.

In that case we will lose all north-south traffic to all of the ULS behind the green Edge cluster.

As a result, all clients outside the SDDC will immediately lose connectivity to all of the green ULS.

Please note: forwarding traffic to the Blue ULS will continue to work in this event regardless of the failure in Site A.

 

PIC7

If we had a stretched vSphere Edge cluster between Site A and Site B, we would be able to leverage vSphere HA to restart the failed green ESGs in the remote blue site (this is not the case here; in our design each site has its own local cluster and storage). But even with vSphere HA, the restart process can take a few minutes. Another way to recover from this failure is to manually deploy green ESGs in Site B and connect them to FW_B; the recovery time of this option could also be a few minutes. Neither option is suitable for a modern datacenter design.

In the next paragraph I will introduce a new way to design the ESGs in an Active/Active datacenter architecture.

This design converges much faster and recovers more efficiently from such an event in Site A (or Site B).

Active/Active Datacenter with mirrored ESGs

In this design architecture we deploy mirrored green ESGs in Site B, and mirrored blue ESGs in Site A. Under normal datacenter operation the mirrored ESGs are up and running but do not forward traffic. Traffic from external clients to the Site A green ULS will always enter via the Site A ESGs (E1-Green-A, E2-Green-A) for all of Site A’s prefixes, and leave through the same point.

Adding the mirrored ESGs adds some complexity to the single ingress/egress design, but improves the convergence time after a failure.

PIC8

How does ingress traffic flow work in this design?

Now we will explain how the ingress traffic flow works in this architecture with mirrored ESGs. To simplify the explanation we will focus only on the green flow in both datacenters and remove the blue components from the diagrams, but the same explanation applies to the blue Site B network as well.

The Site A green UDLR control VM runs eBGP with all green ESGs (E1-Green-A to E4-Green-B). The UDLR redistributes all connected interfaces as Site A prefixes via eBGP. Note: “Site A prefixes” represents any green segments that are part of the green ULS.

The green ESGs (E1-Green-A to E4-Green-B) advertise Site A’s prefixes via BGP to both physical firewalls: FW_A located in Site A and FW_B located in Site B.

FW_B in Site B will add BGP AS-path prepending for Site A prefixes.

From the core router’s point of view, we have two different paths to reach Site A prefixes: one via FW_A (Site A) and the second via FW_B (Site B). Under normal operation this traffic will flow only through Site A, because Site B prepends the AS path for Site A prefixes.
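As an illustration, the prepending on FW_B could look like the following Cisco-style sketch (the AS numbers, neighbor address and prefix here are hypothetical; a real firewall would use its vendor's own syntax):

! Match the Site A green prefixes
ip prefix-list SITE-A-GREEN seq 5 permit 172.16.10.0/24
!
! Prepend FW_B's AS when advertising Site A prefixes toward the core
route-map PREPEND-SITE-A permit 10
 match ip address prefix-list SITE-A-GREEN
 set as-path prepend 65002 65002 65002
route-map PREPEND-SITE-A permit 20
!
router bgp 65002
 neighbor 10.99.0.1 remote-as 65000
 neighbor 10.99.0.1 route-map PREPEND-SITE-A out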

PIC9

Egress Traffic

Egress traffic is handled by the UDLR control VM with different BGP weight values.

The Site A ESGs E1-Green-A and E2-Green-A have mirrored ESGs E3-Green-B and E4-Green-B located in Site B. The mirrored ESGs provide availability. Under normal operation the UDLR control VM will always prefer to route traffic via E1-Green-A and E2-Green-A, which have the higher BGP weight value. E3-Green-B and E4-Green-B will not forward any traffic and wait for E1/E2 to fail.
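Conceptually, the UDLR's preference is expressed with per-neighbor BGP weight, which NSX exposes in the UDLR BGP neighbor configuration (shown below in Cisco-style notation for readability; the AS number, uplink IPs and weight values are hypothetical):

router bgp 65001
 ! Local Site A ESGs: higher weight, preferred for egress
 neighbor 192.168.10.1 weight 60   ! E1-Green-A
 neighbor 192.168.10.2 weight 60   ! E2-Green-A
 ! Mirrored Site B ESGs: lower weight, used only if E1/E2 fail
 neighbor 192.168.10.3 weight 30   ! E3-Green-B
 neighbor 192.168.10.4 weight 30   ! E4-Green-B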

In the figure below, we can see a Web workload running on ULS_Green_A in Site A initiating traffic to the core. This egress traffic passes through the DLR kernel module, through the E1-Green-A ESG, and is then forwarded to FW_A in Site A.

PIC10

There are other options for ingress/egress within NSX 6.2:

A great new feature called ‘Locale ID’ (local egress). Hany Michael wrote a blog post covering this option.

Hany’s design doesn’t include a firewall like mine does, so pay attention to a few minor differences.

http://www.networkskyx.com/2016/01/06/introducing-the-vmware-nsx-vlab-2-0/

Anthony Burke wrote a blog post about how to use Locale ID with a physical firewall:

https://networkinferno.net/ingress-optimisation-with-nsx-for-vsphere

Routing updates

Below we demonstrate the routing updates for Site A; the same mechanism works for Site B. The core router connected to FW_A in Site A peers with FW_A via eBGP.

The core will advertise a 0/0 default route.

FW_A will establish eBGP peerings with both E1-Green-A and E2-Green-A. FW_A will forward the 0/0 default route to the green ESGs and will receive Site A’s green prefixes from them. The green ESGs E1-Green-A and E2-Green-A peer via eBGP with the UDLR control VM.

The UDLR and the ESGs work in ECMP mode; as a result the UDLR will receive 0/0 from both ESGs. The UDLR will redistribute its connected interfaces (LIFs) to both green ESGs.

We can work with iBGP, eBGP, or a mix between the UDLR, the ESGs and the physical routers.

In order to reduce the eBGP convergence time upon an active UDLR control VM failure, we will configure a static route on all of the green ESGs pointing to the UDLR forwarding address for the internal LIFs’ prefixes.

Routing filters will be applied on all ESGs to prevent unwanted prefix advertisement and to prevent the ESGs from becoming transit gateways.
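The two steps above map to configuration along these lines (a Cisco-style conceptual sketch; on the ESGs both items are actually configured through the NSX UI or API, and the prefixes and addresses are hypothetical):

! Static route for the internal LIF prefixes, pointing at the UDLR forwarding
! address, so traffic keeps flowing during a control VM failover
ip route 172.16.10.0/24 192.168.10.5
!
! Outbound filter: advertise only the green Site A prefixes, so the ESG never
! becomes a transit gateway for routes learned from other peers
ip prefix-list GREEN-A seq 5 permit 172.16.10.0/24
route-map TO-FW permit 10
 match ip address prefix-list GREEN-A
router bgp 65001
 neighbor 10.99.0.2 route-map TO-FW out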

PIC11

Failure of One Green ESG in Site A

The green ESGs E1-Green-A and E2-Green-A work in ECMP mode. From the UDLR and FW_A point of view, both ESGs work in Active/Active mode.

As long as we have at least one active green ESG in Site A, the green UDLR and the core router will always prefer to work with Site A’s green ESGs.

Let’s assume we have an active traffic flow from the green Web VM in Site A to the external client behind the core router, and this traffic initially passes through E1-Green-A. In the event of a failure of the E1-Green-A ESG, the UDLR will reroute the traffic via E2-Green-A, because this ESG has a better weight than the green ESGs in Site B (E3-Green-B and E4-Green-B).

FW_A still advertises a better AS path to the ‘ULS_Web_Green_A’ prefixes than FW_B (remember, FW_B always prepends Site A prefixes).

We’ll use aggressive BGP timer settings (keepalive=1 sec, hold down=3 sec) to improve BGP routing convergence.
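In Cisco-style notation the equivalent per-neighbor timers would be as follows (NSX exposes the same keepalive/hold-down values in the ESG and UDLR BGP configuration; the neighbor address is hypothetical):

router bgp 65001
 neighbor 10.99.0.2 timers 1 3   ! keepalive 1 sec, hold down 3 sec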

 

PIC12

Complete Edge cluster failure in Site A

In this scenario we face a failure of the entire Edge cluster in Site A (green ESGs and blue ESGs); the failure might also include FW_A.

The core router will not receive any BGP updates from Site A, so the core will prefer the path through FW_B to reach any Site A prefix.

From the UDLR point of view there aren’t any working green ESGs in Site A, so the UDLR will work with the remaining green ESGs in Site B (E3-Green-B, E4-Green-B).

Traffic initiated from the external client will be rerouted via the mirrored green ESGs (E3-Green-B and E4-Green-B) to the green ULS in Site B. The reroute happens very quickly thanks to the aggressive BGP timer settings (keepalive=1 sec, hold down=3 sec).

This solution is much faster than the other options mentioned before.

The same recovery mechanism exists for a failure in the Site B datacenter.

PIC13

Note: The green UDLR control VM was deployed to the payload cluster and isn’t affected by this failure.

 

Complete Site A failure:

In this catastrophic scenario all components in Site A have failed, including the management infrastructure (vCenter, NSX Manager, controllers, ESGs and the UDLR control VM). Green workloads will face an outage until they are recovered in Site B; the blue workloads continue to work without any interruption.

The recovery procedure for this event covers both the management/control plane components and the workloads themselves.

Recovering the management/control plane:

  • Log in to the secondary NSX Manager and promote it to primary by assigning it the primary role.
  • Deploy a new Universal Controller Cluster and synchronize all objects.
  • The Universal Controller Cluster configuration is pushed to the ESXi hosts managed by the secondary NSX Manager.
  • Redeploy the UDLR control VM.

To recover the workloads, we run the recovery plan from the SRM instance located in Site B.

PIC14

 

Summary:

In this blog post we demonstrated the power of NSX to create an Active/Active datacenter that can recover very quickly from many failure scenarios.

  • We showed how NSX simplifies the Disaster Recovery process.
  • NSX and SRM integration is a reasonable approach to DR where we can’t use a stretched vSphere cluster.
  • NSX works in Cross-vCenter mode. Dual vCenters and NSX Managers improve our availability: even in the event of a complete site failure we were able to continue working immediately in our management layer (the secondary NSX Manager and vCenter were up and running).
  • In this design, half of our environment (the blue segments) wasn’t affected by a complete site failure. SRM recovered our failed green workloads without any need to change our Layer 2/Layer 3 network topology.
  • We did not use any specific hardware to achieve our BCDR, and we were 100% decoupled from the physical layer.
  • With SRM and vRO we were able to protect any deployed VM from day 0.

 

I would like to thank:

Daniel Bakshi, who helped me a lot by reviewing this blog post.

Thanks also to Boris Kovalev and Tal Moran, who helped with the vRA/vRO demo vPOD.

 

 

 

NSX-v Host Preparation

The information in this post is based on my NSX professional experience in the field and on a lecture by Kevin Barrass, an NSX solution architect.

Thanks to Tiran Efrat for reviewing this post.

Host preparation overview

Host preparation is the process in which the NSX Manager installs the NSX kernel modules on the hosts of a vSphere cluster and builds the NSX control plane fabric.

Before the host preparation process we need to complete:

  1. Register the NSX Manager in the vCenter. This process was covered in NSX-V Troubleshooting registration to vCenter.
  2. Deploy the NSX Controllers, covered in deploying-nsx-v-controller-disappear-from-vsphere-client

Three components are involved during NSX host preparation:
vCenter, NSX Manager, and EAM (ESX Agent Manager).

Host Preperation1

vCenter Server:
Management of vSphere compute infrastructure.

NSX Manager:
Provides the single point of configuration and REST API entry-points in a vSphere environment for NSX.

EAM (ESX Agent Manager):
The middleware component between the NSX Manager and vCenter. EAM is part of vCenter and is responsible for installing the VIBs (vSphere Installation Bundles), which are software packages prepared for installation on an ESXi host.

Host Preparation process

Host preparation begins when we click “Install” in the vCenter GUI.

host preparation

host preparation

This process is done at the vSphere cluster level, not per ESXi host. EAM creates an agent to track the VIB installation process for each host. The VIBs are copied from the NSX Manager and cached in EAM.
If the VIBs are not present on the ESXi host, EAM installs them (no ESXi host reboot is needed).
EAM will also remove older-version VIBs, but in that case an ESXi host reboot is needed.

VIBs installed during host preparation:
esx-dvfilter-switch-security
esx-vsip
esx-vxlan
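To check whether these VIBs are present on a host, list the installed VIBs (the version strings below are illustrative):

~ # esxcli software vib list | grep esx-
esx-dvfilter-switch-security  5.5.0-0.0.2318233  VMware  VMwareCertified
esx-vsip                      5.5.0-0.0.2318233  VMware  VMwareCertified
esx-vxlan                     5.5.0-0.0.2318233  VMware  VMwareCertified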

After host preparation completes successfully, the ESXi host has a fully working control plane.

Two control plane channels are created:

  • RabbitMQ message bus: provides communication between the vsfwd process on the ESXi hypervisor and the NSX Manager over TCP/5671.
  • User World Agent (UWA) process (netcpa on the ESXi hypervisor): establishes TCP/1234 over SSL communication channels to the Controller Cluster nodes.

Host Preperation2

Troubleshooting Host Preparation

DNS:

EAM fails to deploy VIBs due to misconfigured DNS or no DNS configuration on the host.
We may get a status of “Not Ready”:

Not Ready

This indicates “Agent VIB module not installed” on one or more hosts.

We can check the vSphere ESX Agent Manager for errors:

“vCenter home > vCenter Solutions Manager > vSphere ESX Agent Manager”

On “vSphere ESX Agent Manager”, check the status of the “Agencies” prefixed with “_VCNS_153”. If any of the agencies has a bad status, select the agency and view its issues:

EAM

We need to check the associated log /var/log/esxupdate.log (on the ESXi host) for more details on host preparation issues.
Log in to the host on which you have the issue and run “tail /var/log/esxupdate.log” to view the log:

esxupdate error1

Solution:
Configure the DNS settings on the ESXi host so that NSX host preparation can succeed.
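This can be done from the vSphere Client or directly from the ESXi CLI, for example (the DNS server address and search domain are placeholders for your environment):

~ # esxcli network ip dns server add --server=192.168.110.10
~ # esxcli network ip dns search add --domain=corp.local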

 

TCP/80 from ESXi to vCenter is blocked:

The ESXi host is unable to connect to the vCenter EAM service on TCP/80.

This could be caused by a firewall blocking the port. From the ESXi host’s /var/log/esxupdate.log file:

esxupdate: esxupdate: ERROR: MetadataDownloadError: (‘http://VC_IP_Address:80/eam/vib?id=xxx-xxx-xxx-xxx), None, “( http://VC_IP_Address:80/eam/vib?id=xxx-xxx-xxx-xxx), ‘/tmp/tmp_TKl58’, ‘[Errno 4] IOError: <urlopen error [Errno 111] Connection refused>’)”)

Solution:
NSX-v has a list of ports that need to be open for host preparation to succeed.
The complete list can be found at:
https://communities.vmware.com/docs/DOC-28142

 

Older VIB versions:

If old-version VIBs exist on the ESXi host, EAM will remove them, but host preparation will not continue automatically.

Solution:
Reboot the ESXi host to complete the process.

 

ESXi Bootbank Space issue:

If you upgrade ESXi 5.1u1 to ESXi 5.5 and then start NSX host preparation, you may hit an issue where the /var/log/esxupdate.log file shows a message like:
“Installationerror: the pending transaction required 240MB free space, however the maximum size is 239 MB”
I faced this issue with a customer ISO on an IBM blade, but it may appear with other vendors as well.

Solution:
Install a fresh ESXi 5.5 custom ISO (the version I upgraded to).

 

vCenter on Windows, EAM TCP/80 taken by other application:

If vCenter runs on a Windows machine, other installed applications may use port 80, causing a conflict with the EAM port TCP/80.

For example, by default an IIS server uses TCP/80.

Solution:
Use a different port for EAM:

Change the EAM port in eam.properties, located in \Program Files\VMware\Infrastructure\tomcat\webapps\eam\WEB-INF\

 

UWA Agent Issues:

In rare cases the VIB installation succeeds but, for some reason, one or both of the user world agents do not function correctly. This could manifest itself as:
the firewall showing a bad status, or the control plane between the hypervisor(s) and the controllers being down.
UWA error

If Message bus service is active on NSX Manager:

Check the message bus user world agent status by running the command /etc/init.d/vShield-Stateful-Firewall status on the ESXi hosts.

vShield-Stateful-Firewall

Check Message bus userworld logs on hosts at /var/log/vsfwd.log

esxcfg-advcfg -l | grep Rmq

Run this command on the ESXi hosts to show all the Rmq variables; there should be 16 variables in total.

esxcfg-advcfg -g /UserVars/RmqIpAddress

Run this command on the ESXi hosts; it should display the NSX Manager IP address.

RmqIpAddress

Run this command on the ESXi hosts to check for an active message bus connection:

esxcli network ip connection list | grep 5671 (Message bus TCP connection)

network connection

 

 

The NSX Manager has a direct link to download the VIBs as a zip file:

https://$nsxmgr/bin/vdn/vibs/5.5/vxlan.zip

 

Reverting an NSX-prepared ESXi host:

Remove the host from the vSphere cluster:

Put the ESXi host in maintenance mode and remove it from the cluster. This will automatically uninstall the NSX VIBs.

Note: the ESXi host must be rebooted to complete the operation.

 

Manually uninstall the VIBs:

esxcli software vib remove -n esx-vxlan

esxcli software vib remove -n esx-vsip

esxcli software vib remove -n esx-dvfilter-switch-security

Note: ESXi host must be rebooted to complete the operation

NSX Edge and DRS Rules

The NSX Edge cluster connects the logical and physical worlds and usually hosts the NSX Edge Services Gateways and the DLR control VMs.

There are deployments where the Edge Cluster may contain the NSX Controllers as well.

In this section we discuss how to design an Edge cluster to survive the failure of an ESXi host or of an entire physical chassis, and how to lower the outage time.

In the figure below we deploy two NSX Edges, E1 and E2, in ECMP mode, where they run active/active from the perspective of both the control and data planes. The DLR control VMs run active/passive, while both E1 and E2 run a dynamic routing protocol with the active DLR control VM.

When the DLR learns a new route from E1 or E2, it pushes this information to the NSX Controller cluster. The NSX Controller then updates the routing tables in the kernel of each ESXi host running this DLR instance.

 

1

 

In the scenario where the ESXi host that contains Edge E1 fails:

  • The active DLR will update the NSX Controller to remove E1 as a next hop; the NSX Controller will update the ESXi hosts, and as a result the “Web” VM traffic will be routed to Edge E2.
    The time it takes to reroute the traffic depends on the dynamic routing protocol’s convergence time.

2

In the specific scenario where the failed ESXi host or chassis contained both Edge E1 and the active DLR, we would instead face a longer outage of the forwarded traffic.

The reason is that the active DLR is down and cannot detect the failure of Edge E1 and update the Controller accordingly. The ESXi hosts will continue to forward traffic to Edge E1 until the passive DLR becomes active, learns that Edge E1 is down, and updates the NSX Controller.

3

The Golden Rule is:

We must ensure that when the Edge Services Gateway and the DLR control VM belong to the same tenant, they do not reside on the same ESXi host. It is better to distribute them between ESXi hosts and reduce the affected functions.

By default, when we deploy an NSX Edge or DLR in active/passive mode, the system takes care of creating a DRS anti-affinity rule; this prevents the active/passive VMs from running on the same ESXi host.

DRS anti affinity rules

DRS anti affinity rules

We need to build new DRS rules, as these default rules will not prevent us from reaching the dual-failure scenario described above.

The figure below describes the network logical view for our specific example. This topology is built from two different tenants, where each tenant is represented by a different color and has its own Edge and DLR.

Note: connectivity to the physical world is not displayed in the figure below, in order to simplify the diagram.

multi tenants

My physical Edge Cluster has four ESXi hosts which are distributed over two physical chassis:

Chassis A: esxcomp-01a, esxcomp-02a

Chassis B: esxcomp-01b, esxcomp-02b

4

Create DRS Host Group for each Chassis

We start by creating a container for all the ESXi hosts in Chassis A; this container is configured as a DRS host group.

Edge Cluster -> Manage -> Settings -> DRS Groups

Click the Add button and name this group “Chassis A”.

The container type needs to be “Host DRS Group”; add the ESXi hosts running in Chassis A (esxcomp-01a and esxcomp-02a).

5

Create another DRS group called Chassis B that contains esxcomp-01b and esxcomp-02b:

6

 

VM DRS group for Chassis A:

We need to create a container for the VMs that will run in Chassis A. At this point we just name it after Chassis A; we are not actually placing the VMs in Chassis A yet.

This container type is “VM DRS Group”:

7

VM DRS Group for Chassis B:

8

 

At this point we have four DRS groups:

9

DRS Rules:

Now we need to take the DRS objects we created before, “Chassis A” and “VM to Chassis A”, and tie them together. The next step is to do the same for “Chassis B” and “VM to Chassis B”.

* This configuration needs to be done under “DRS Rules”.

Edge Cluster -> Manage -> Settings -> DRS Rules

Click the Add button in DRS Rules and enter a name like “VMs Should Run on Chassis A”.

For the type, select “Virtual Machines to Hosts”, because we want to bind the VM group to the host group.

In the VM group name choose “VM to Chassis A” object.

Below the VM group selection we need to select the group & hosts binding enforcement type.

We have two different options:

“Should run on hosts in group” or “Must run on hosts in group”

If we choose the “Must” option, then in the event of a failure of all the ESXi hosts in this group (for example, if Chassis A had a critical power outage), the other ESXi hosts in the cluster (Chassis B) would not be considered by vSphere HA as a viable option for recovering the VMs. The “Should” option does allow other ESXi hosts to be used for recovery.
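For reference, the host groups, VM groups and VM-to-host rules can also be scripted with PowerCLI (a sketch assuming a PowerCLI release that includes the DRS cmdlets; the cluster and VM names are hypothetical):

$cluster = Get-Cluster "Edge-Cluster"

# Host DRS groups, one per chassis
New-DrsClusterGroup -Name "Chassis A" -Cluster $cluster -VMHost (Get-VMHost esxcomp-01a,esxcomp-02a)
New-DrsClusterGroup -Name "Chassis B" -Cluster $cluster -VMHost (Get-VMHost esxcomp-01b,esxcomp-02b)

# VM DRS groups (the VM names are placeholders)
New-DrsClusterGroup -Name "VM to Chassis A" -Cluster $cluster -VM (Get-VM "E1-Green","E1-Blue")
New-DrsClusterGroup -Name "VM to Chassis B" -Cluster $cluster -VM (Get-VM "E2-Green","E2-Blue")

# "Should" rules binding each VM group to its host group
New-DrsVMHostRule -Name "VMs Should Run on Chassis A" -Cluster $cluster -VMGroup "VM to Chassis A" -VMHostGroup "Chassis A" -Type ShouldRunOn
New-DrsVMHostRule -Name "VMs Should Run on Chassis B" -Cluster $cluster -VMGroup "VM to Chassis B" -VMHostGroup "Chassis B" -Type ShouldRunOn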

10

 

Same for Chassis B:

11

Now the problem with the current DRS rules and the VM placement in this Edge cluster is that the Edge and the DLR control VM are actually running on the same ESXi host. We need to create anti-affinity DRS rules.

Anti-Affinity Edge and DLR:

An Edge and a DLR that belong to the same tenant should not run on the same ESXi host.

For Green Tenant:

12

For Blue Tenant:

13
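The same two rules can be scripted as well (a PowerCLI sketch reusing $cluster from the previous snippet; the Edge and DLR control VM names are hypothetical placeholders):

# Keep each tenant's Edge and DLR control VM on different hosts
New-DrsRule -Name "Separate Green Edge and DLR" -Cluster $cluster -KeepTogether $false -VM (Get-VM "Edge-Green","DLR-Control-Green")
New-DrsRule -Name "Separate Blue Edge and DLR" -Cluster $cluster -KeepTogether $false -VM (Get-VM "Edge-Blue","DLR-Control-Blue")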

The Final Result:

In the event of a failure of one of the ESXi hosts, we no longer face the problem where an Edge and its DLR are on the same ESXi host, even in the catastrophic event of a Chassis A or B failure.

15

 

Note:

The control VM can instead be placed in the compute cluster, which avoids this design consideration.

Thanks to Max Ardica and  Tiran Efrat for reviewing this post.

 

NSX-v Troubleshooting L2 Connectivity

In this blog post we describe a methodology to troubleshoot L2 connectivity within the same logical switch L2 segment.

Some of the steps here can and should be done via the NSX GUI, vRealize Operations Manager 6.0 and vRealize Log Insight, so treat this as an educational post.

There are lots of CLI commands in this post :-). To view the full output of a CLI command you can scroll right.

 

A high-level approach to solving L2 problems:

1. Understand the problem.

2. Know your network topology.

3. Figure out if it is a configuration issue.

4. Check if the problem is within the physical space or the logical space.

5. Verify the NSX control plane from the ESXi hosts and the NSX Controllers.

6. Move the VM to a different ESXi host.

7. Start capturing traffic in the right spots.

 

Understand the Problem

VMs on the same logical switch (VXLAN 5001) are unable to communicate.

Demonstrating the problem:

web-sv-01a:~ # ping 172.16.10.12
PING 172.16.10.12 (172.16.10.12) 56(84) bytes of data.
^C
--- 172.16.10.12 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3023ms

 

Know your network topology:

TSHOT1

The VMs web-sv-01a and web-sv-02a reside on different compute hosts, esxcomp-01a and esxcomp-01b respectively.

web-sv-01a: IP: 172.16.10.11,  MAC: 00:50:56:a6:7a:a2

web-sv-02a: IP:172.16.10.12, MAC: 00:50:56:a6:a1:e3

 

Validate network topology

I know it sounds trivial, but let’s make sure the VMs actually reside on the right ESXi hosts and are connected to the right VXLAN.

Verify that VM “web-sv-01a” actually resides on “esxcomp-01a”:

From esxcomp-01a run the command esxtop then press “n” (Network):

esxcomp-01a # esxtop
   PORT-ID              USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s    PKTRX/s  MbRX/s %DRPTX %DRPRX
  33554433           Management        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  50331649           Management        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331650               vmnic0          - DvsPortset-0          8.41    0.02     437.81    3.17   0.00   0.00
  50331651     Shadow of vmnic0        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331652                 vmk0     vmnic0 DvsPortset-0          5.87    0.01       1.76    0.00   0.00   0.00
  50331653                 vmk1     vmnic0 DvsPortset-0          0.59    0.01       0.98    0.00   0.00   0.00
  50331654                 vmk2     vmnic0 DvsPortset-0          0.00    0.00       0.39    0.00   0.00   0.00
  50331655                 vmk3     vmnic0 DvsPortset-0          0.20    0.00       0.39    0.00   0.00   0.00
  50331656 35669:db-sv-01a.eth0     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331657 35888:web-sv-01a.eth     vmnic0 DvsPortset-0          4.89    0.01       3.72    0.01   0.00   0.00
  50331658          vdr-vdrPort     vmnic0 DvsPortset-0          2.15    0.00       0.00    0.00   0.00   0.00

In line 12 we can see “web-sv-01a.eth0”; another important piece of information is its “Port-ID”.

The “Port-ID” is a unique identifier for each virtual switch port; in our example web-sv-01a.eth0 has Port-ID “50331657”.

Find the vDS name:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan list
VDS ID                                           VDS Name      MTU  Segment ID     Gateway IP     Gateway MAC        Network Count  Vmknic Count
-----------------------------------------------  -----------  ----  -------------  -------------  -----------------  -------------  ------------
3b bf 0e 50 73 dc 49 d8-2e b0 df 20 91 e4 0b bd  Compute_VDS  1600  192.168.250.0  192.168.250.2  00:50:56:09:46:07              4             1

From line 4, the vDS name is “Compute_VDS”.

Verify “web-sv-01a.eth0” connects to VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
      50331657  68                   0
      50331658  vdrPort              0

Line 4 shows a VM connected to VXLAN 5001 on port ID 50331657; this is the same port ID as VM web-sv-01a.eth0.

Verification in esxcomp-01b:

esxcomp-01b # esxtop
  PORT-ID              USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s    PKTRX/s  MbRX/s %DRPTX %DRPRX
  33554433           Management        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  50331649           Management        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331650               vmnic0          - DvsPortset-0          6.54    0.01     528.31    4.06   0.00   0.00
  50331651     Shadow of vmnic0        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331652                 vmk0     vmnic0 DvsPortset-0          2.77    0.00       1.19    0.00   0.00   0.00
  50331653                 vmk1     vmnic0 DvsPortset-0          0.59    0.00       0.40    0.00   0.00   0.00
  50331654                 vmk2     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331655                 vmk3     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331656 35663:web-sv-02a.eth     vmnic0 DvsPortset-0          3.96    0.01       3.57    0.01   0.00   0.00
  50331657          vdr-vdrPort     vmnic0 DvsPortset-0          2.18    0.00       0.00    0.00   0.00   0.00

From line 11 we can see that “web-sv-02a.eth0” has Port-ID “50331656”.

Verify “web-sv-02a.eth0” connects to VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
      50331656  69                   0
      50331657  vdrPort              0

Line 4 shows a VM connected to VXLAN 5001 on port ID 50331656.

At this point we have verified the VMs are located as drawn in the topology. Now we can start the actual troubleshooting steps.

Is the problem in the physical network?

Our first step is to find out whether the problem is in the physical space or the logical space.

TSHOT2

The easy way to find out is to ping from the VTEP in esxcomp-01a to the VTEP in esxcomp-01b; before pinging, let’s find the VTEP IP addresses.

esxcomp-01a # esxcfg-vmknic -l
Interface  Port Group/DVPort   IP Family IP Address                              Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type         
vmk0       16                  IPv4      192.168.210.51                          255.255.255.0   192.168.210.255 00:50:56:09:08:3e 1500    65535     true    STATIC       
vmk1       26                  IPv4      10.20.20.51                             255.255.255.0   10.20.20.255    00:50:56:69:80:0f 1500    65535     true    STATIC       
vmk2       35                  IPv4      10.20.30.51                             255.255.255.0   10.20.30.255    00:50:56:64:70:9f 1500    65535     true    STATIC       
vmk3       44                  IPv4      192.168.250.51                          255.255.255.0   192.168.250.255 00:50:56:66:e2:ef 1600    65535     true    STATIC

From line 6 we can tell that the VTEP IP address (vmk3, MTU 1600) is 192.168.250.51.

Another command to find VTEP IP address is:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
Vmknic Name  Switch Port ID  VDS Port ID  Endpoint ID  VLAN ID  IP              Netmask        IP Acquire Timeout  Multicast Group Count  Segment ID
-----------  --------------  -----------  -----------  -------  --------------  -------------  ------------------  ---------------------  -------------
vmk3               50331655  44                     0        0  192.168.250.51  255.255.255.0                   0                      0  192.168.250.0

Same commands in esxcomp-01b:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
Vmknic Name  Switch Port ID  VDS Port ID  Endpoint ID  VLAN ID  IP              Netmask        IP Acquire Timeout  Multicast Group Count  Segment ID
-----------  --------------  -----------  -----------  -------  --------------  -------------  ------------------  ---------------------  -------------
vmk3               50331655  46                     0        0  192.168.250.53  255.255.255.0                   0                      0  192.168.250.0

The VTEP IP for esxcomp-01b is 192.168.250.53. Now let’s add this info to our topology.

 

TSHOT3

Checks for VXLAN routing:

NSX uses a separate IP stack for VXLAN traffic, so we need to verify that the default gateway is configured correctly for the VXLAN netstack.

From esxcomp-01a:

esxcomp-01a # esxcli network ip route ipv4 list -N vxlan
Network        Netmask        Gateway        Interface  Source
-------------  -------------  -------------  ---------  ------
default        0.0.0.0        192.168.250.2  vmk3       MANUAL
192.168.250.0  255.255.255.0  0.0.0.0        vmk3       MANUAL

From esxcomp-01b:

esxcomp-01b # esxcli network ip route ipv4 list -N vxlan
Network        Netmask        Gateway        Interface  Source
-------------  -------------  -------------  ---------  ------
default        0.0.0.0        192.168.250.2  vmk3       MANUAL
192.168.250.0  255.255.255.0  0.0.0.0        vmk3       MANUAL

In this lab the two ESXi hosts’ VTEPs are on the same L2 segment, and both VTEPs have the same default gateway.

Ping from the VTEP in esxcomp-01a to the VTEP located in esxcomp-01b.

The ping is sourced from the VXLAN IP stack with a packet size of 1570 and the don’t-fragment bit set.

esxcomp-01a #  ping ++netstack=vxlan 192.168.250.53 -s 1570 -d
PING 192.168.250.53 (192.168.250.53): 1570 data bytes
1578 bytes from 192.168.250.53: icmp_seq=0 ttl=64 time=0.585 ms
1578 bytes from 192.168.250.53: icmp_seq=1 ttl=64 time=0.936 ms
1578 bytes from 192.168.250.53: icmp_seq=2 ttl=64 time=0.831 ms

--- 192.168.250.53 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.585/0.784/0.936 ms

The ping is successful.

If the ping fails with “-d” but works without it, it’s an MTU problem; check the MTU on the physical switches.

Because the VTEPs in this example are on the same L2 segment, we can view the ARP entries for the other VTEPs:

From esxcomp-01a:

esxcomp-01a # esxcli network ip neighbor list -N vxlan
Neighbor        Mac Address        Vmknic    Expiry  State  Type
--------------  -----------------  ------  --------  -----  -----------
192.168.250.52  00:50:56:64:f4:25  vmk3    1173 sec         Unknown
192.168.250.53  00:50:56:67:d9:91  vmk3    1171 sec         Unknown
192.168.250.2   00:50:56:09:46:07  vmk3    1187 sec         Autorefresh

It looks like the physical layer is not the issue.

 

Verify NSX control plane

During NSX host preparation the NSX Manager installs VIBs that include the User World Agent (UWA) on the ESXi hosts.

The process responsible for communicating with the NSX Controllers is called netcpad.

The ESXi host uses its management VMkernel interface to create this secure channel over TCP/1234; the traffic is encrypted with SSL.

Part of the information netcpad sends to the NSX Controllers is:

VMs: MAC, IP.

VTEPs: MAC, IP.

VXLAN: the VXLAN IDs.

Routing: routes learned from the DLR control VM (explained in the next post).

TSHOT4

Based on this information the Controller learns the network state and builds its directory services.

To learn how the Controller cluster works and how to fix problems in the cluster itself, see NSX Controller Cluster Troubleshooting.

For two VMs to be able to talk to each other we need a working control plane. In this lab we have three NSX Controllers.

Verification commands need to be run from both the ESXi and the Controller side.

NSX Controller IP addresses: 192.168.110.201, 192.168.110.202, 192.168.110.203

Control Plane verification from ESXi point of view:

Verify esxcomp-01a has ESTABLISHED connections to the NSX Controllers (grep 1234 shows only TCP port 1234):

esxcomp-01a # esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.51:54153  192.168.110.202:1234  ESTABLISHED     35185  newreno  netcpa-worker
tcp         0       0  192.168.210.51:34656  192.168.110.203:1234  ESTABLISHED     34519  newreno  netcpa-worker
tcp         0       0  192.168.210.51:41342  192.168.110.201:1234  ESTABLISHED     34519  newreno  netcpa-worker

Verify esxcomp-01b has ESTABLISHED connections to the NSX Controllers:

esxcomp-01b # esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.56:16580  192.168.110.202:1234  ESTABLISHED     34517  newreno  netcpa-worker
tcp         0       0  192.168.210.56:49434  192.168.110.203:1234  ESTABLISHED     34678  newreno  netcpa-worker
tcp         0       0  192.168.210.56:12358  192.168.110.201:1234  ESTABLISHED     34516  newreno  netcpa-worker

Example of a problem with communication from the ESXi host to the NSX Controllers:

esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.51:54153  192.168.110.202:1234  TIME_WAIT           0
tcp         0       0  192.168.210.51:34656  192.168.110.203:1234  FIN_WAIT_2      34519  newreno
tcp         0       0  192.168.210.51:41342  192.168.110.201:1234  TIME_WAIT           0

If we can’t see ESTABLISHED connections, check:

1. IP connectivity from the ESXi host to all NSX Controllers.

2. If you have a firewall between the ESXi host and the NSX Controllers, TCP/1234 needs to be open.

3. Whether netcpad is running on the ESXi host:

/etc/init.d/netcpad status
netCP agent service is not running


If netcpad is not running, start it with the command:

esxcomp-01a #/etc/init.d/netcpad start
Memory reservation set for netcpa
netCP agent service starts

Verify again:

esxcomp-01a # /etc/init.d/netcpad status
netCP agent service is running

 

Verify on esxcomp-01a that the control plane is enabled and the connection is in the up state for VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
    5003  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            2                0                0
    5001  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            2                3                0
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                3                0
    5002  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)            1                2                0

Verify on esxcomp-01b that the control plane is enabled and the connection is in the up state for VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
    5001  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            2                3                0
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                0                0
    5002  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)            1                2                0
    5003  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                0                0

Check that esxcomp-01a has learned the ARP entries of remote VMs on VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
IP            MAC                Flags
------------  -----------------  --------
172.16.10.12  00:50:56:a6:a1:e3  00001101

From this output we can see that esxcomp-01a learned the ARP info of web-sv-02a.

Check that esxcomp-01b has learned the ARP entries of remote VMs on VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
IP            MAC                Flags
------------  -----------------  --------
172.16.10.11  00:50:56:a6:7a:a2  00010001

From this output we can see that esxcomp-01b learned the ARP info of web-sv-01a.

What can we tell at this point?

esxcomp-01a:

Knows that web-sv-01a is a VM running on VXLAN 5001, with IP 172.16.10.11 and MAC address 00:50:56:a6:7a:a2.

Its communication to the Controller cluster is UP for VXLAN 5001.

esxcomp-01b:

Knows that web-sv-02a is a VM running on VXLAN 5001, with IP 172.16.10.12 and MAC address 00:50:56:a6:a1:e3.

Its communication to the Controller cluster is UP for VXLAN 5001.

So why can’t web-sv-01a talk to web-sv-02a?

The answer to this question is another question: what does the NSX Controller know?

Control Plane verification from NSX Controller point of view:

We have three active Controllers; one of them is elected to manage VXLAN 5001. Remember slicing?

To find out which Controller manages VXLAN 5001, SSH to one of the NSX Controllers, for example 192.168.110.202:

nsx-controller # show control-cluster logical-switches vni 5001
VNI      Controller      BUM-Replication ARP-Proxy Connections VTEPs
5001     192.168.110.201 Enabled         Enabled   0           0

Line 3 says that 192.168.110.201 manages VXLAN 5001, so the next commands will run from 192.168.110.201:

nsx-controller # show control-cluster logical-switches vni 5001
VNI      Controller      BUM-Replication ARP-Proxy Connections VTEPs
5001     192.168.110.201 Enabled         Enabled   6           4

From this output we learn that VXLAN 5001 has 4 VTEPs connected to it and a total of 6 active connections.

At this point I would like to point you to an excellent blogger with lots of information about what happens under the hood in NSX.

His name is Dmitri Kalintsev. Link to his post: NSX for vSphere: Controller “Connections” and “VTEPs”

From Dmitri’s post:

“ESXi host joins a VNI in two cases:

  1. When a VM running on that host connects to VNI’s dvPg and its vNIC transitions into “Link Up” state; and
  2. When DLR kernel module on that host needs to route traffic to a VM on that VNI that’s running on a different host.”

We are not routing traffic between the VMs, so the DLR is not part of the game here.

Find the VTEP IP addresses connected to VXLAN 5001:

nsx-controller # show control-cluster logical-switches vtep-table 5001
VNI      IP              Segment         MAC               Connection-ID
5001     192.168.250.53  192.168.250.0   00:50:56:67:d9:91 5
5001     192.168.250.52  192.168.250.0   00:50:56:64:f4:25 3
5001     192.168.250.51  192.168.250.0   00:50:56:66:e2:ef 4
5001     192.168.150.51  192.168.150.0   00:50:56:60:bc:e9 6

From this output we can see that both VTEPs, esxcomp-01a (line 5) and esxcomp-01b (line 3), are seen by the NSX Controller on VXLAN 5001.

The MAC addresses in this output are the VTEPs’ MACs.

Verify that the VMs’ MAC addresses have been learned by the NSX Controller:

nsx-controller # show control-cluster logical-switches mac-table 5001
VNI      MAC               VTEP-IP         Connection-ID
5001     00:50:56:a6:7a:a2 192.168.250.51  4
5001     00:50:56:a6:a1:e3 192.168.250.53  5
5001     00:50:56:8e:45:33 192.168.150.51  6

Line 3 shows the MAC of web-sv-01a; line 4 shows the MAC of web-sv-02a.

Verify that the VMs’ ARP entries have been learned by the NSX Controller:

 

nsx-controller # show control-cluster logical-switches arp-table 5001
VNI      IP              MAC               Connection-ID
5001     172.16.10.11    00:50:56:a6:7a:a2 4
5001     172.16.10.12    00:50:56:a6:a1:e3 5
5001     172.16.10.10    00:50:56:8e:45:33 6

Lines 3 and 4 show the exact IP/MAC of web-sv-01a and web-sv-02a.

To understand how the Controller learned this info, read my post NSX-V IP Discovery.

Sometimes restarting the netcpad process can fix problems between an ESXi host and the NSX Controllers:

esxcomp-01a # /etc/init.d/netcpad restart
watchdog-netcpa: Terminating watchdog process with PID 4273913
Memory reservation released for netcpa
netCP agent service is stopped
Memory reservation set for netcpa
netCP agent service starts

Summary of the Controller verification:

The NSX Controller knows where the VMs are located, along with their IP and MAC addresses. It seems like the control plane works just fine.

 

Move VM to different ESXi host

In NSX-v each ESXi host has its own UWA service daemon as part of the management and control plane. Sometimes, when the UWA is not working as expected, VMs on that ESXi host will have connectivity issues.

The fast way to check is to vMotion the non-working VMs to a different ESXi host; if the VMs start to work, we need to focus on the control plane of the non-working ESXi host.

In this scenario, even after I vMotioned my VM to a different ESXi host, the problem didn’t go away.

 

Capture in the right spots:

The pktcap-uw command allows capturing traffic at many points in an NSX environment.

Before starting to capture all over the place, let’s think about where the problem might be.

When a VM connects to a logical switch there are a few security services that a packet traverses; each service is represented by a different slot ID.

TSHOT5

SLOT 0: implements the vDS access list.

SLOT 1: the switch security module (swsec) captures DHCP ACK and ARP messages; this info is then forwarded to the NSX Controller.

SLOT 2: NSX Distributed Firewall.

We need to check whether the VM traffic successfully passes the NSX Distributed Firewall, meaning at slot 2.

The capture command will need the slot 2 filter name for web-sv-01a.

From esxcomp-01a:

esxcomp-01a # summarize-dvfilter
~~~snip~~~~
world 35888 vmm0:web-sv-01a vcUuid:'50 26 c7 cd b6 f3 f4 bc-e5 33 3d 4b 25 5c 62 77'
 port 50331657 web-sv-01a.eth0
  vNic slot 2
   name: nic-35888-eth0-vmware-sfw.2
   agentName: vmware-sfw
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Dynamic Filter Creation
  vNic slot 1
   name: nic-35888-eth0-dvfilter-generic-vmware-swsec.1
   agentName: dvfilter-generic-vmware-swsec
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Alternate Opaque Channel

We can see in line 4 that the VM name is web-sv-01a, in line 5 that the filter is applied at slot 2, and in line 6 the filter name: nic-35888-eth0-vmware-sfw.2

The pktcap-uw -A output lists the supported capture points:

esxcomp-01a # pktcap-uw -A
Supported capture points:
        1: Dynamic -- The dynamic inserted runtime capture point.
        2: UplinkRcv -- The function that receives packets from uplink dev
        3: UplinkSnd -- Function to Tx packets on uplink
        4: Vmxnet3Tx -- Function in vnic backend to Tx packets from guest
        5: Vmxnet3Rx -- Function in vnic backend to Rx packets to guest
        6: PortInput -- Port_Input function of any given port
        7: IOChain -- The virtual switch port iochain capture point.
        8: EtherswitchDispath -- Function that receives packets for switch
        9: EtherswitchOutput -- Function that sends out packets, from switch
        10: PortOutput -- Port_Output function of any given port
        11: TcpipDispatch -- Tcpip Dispatch function
        12: PreDVFilter -- The DVFIlter capture point
        13: PostDVFilter -- The DVFilter capture point
        14: Drop -- Dropped Packets capture point
        15: VdrRxLeaf -- The Leaf Rx IOChain for VDR
        16: VdrTxLeaf -- The Leaf Tx IOChain for VDR
        17: VdrRxTerminal -- Terminal Rx IOChain for VDR
        18: VdrTxTerminal -- Terminal Tx IOChain for VDR
        19: PktFree -- Packets freeing point

The capture command supports sniffing traffic at many interesting points; with PreDVFilter and PostDVFilter (lines 14 and 15) we can sniff traffic before or after the filtering action.

Capture after the slot 2 filter:

pktcap-uw --capture PostDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_after.pcap
The session capture point is PostDVFilter
The name of the dvfilter is nic-35888-eth0-vmware-sfw.2
The session filter IP protocol is 0x1
The output file is web-sv-01a_after.pcap
No server port specifed, select 784 as the port
Local CID 2
Listen on port 784
Accept...Vsock connection from port 1049 cid 2
Destroying session 25

Dumped 0 packet to file web-sv-01a_after.pcap, dropped 0 packets.

--capture PostDVFilter = capture after the named filter.

--proto=0x1 = capture only ICMP packets.

--dvfilter = the filter name as shown by the summarize-dvfilter command.

-o = the file where the captured traffic is written.

From the output of this command (line 12) we can tell that ICMP packets do not pass this filter, because we have 0 dumped packets.

We found our smoking gun 🙂

Now capture before the slot 2 filter:

pktcap-uw --capture PreDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_before.pcap
The session capture point is PreDVFilter
The name of the dvfilter is nic-35888-eth0-vmware-sfw.2
The session filter IP protocol is 0x1
The output file is web-sv-01a_before.pcap
No server port specifed, select 5782 as the port
Local CID 2
Listen on port 5782
Accept...Vsock connection from port 1050 cid 2
Dump: 6, broken : 0, drop: 0, file err: 0Destroying session 26

Dumped 6 packet to file web-sv-01a_before.pcap, dropped 0 packets.

Now we can see that packets were dumped; we can open the captured file web-sv-01a_before.pcap:

esxcomp-01a # tcpdump-uw -r web-sv-01a_before.pcap
reading from file web-sv-01a_before.pcap, link-type EN10MB (Ethernet)
20:15:31.389158 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18628, length 64
20:15:32.397225 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18629, length 64
20:15:33.405253 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18630, length 64
20:15:34.413356 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18631, length 64
20:15:35.421284 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18632, length 64
20:15:36.429219 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18633, length 64

Voila, the NSX DFW is blocking the traffic.

And now from NSX GUI:

TSHOT6

Looking back at this article, step 3 (“figure out if it is a configuration issue”) was intentionally skipped.

Had we checked the configuration settings, we would have noticed this problem immediately.

 

 

Summary of all CLI Commands for this post:

ESXi commands:

esxtop
esxcfg-vmknic -l
esxcli network vswitch dvs vmware vxlan list
esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
esxcli network ip route ipv4 list -N vxlan
esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
esxcli network ip connection list | grep 1234
ping ++netstack=vxlan 192.168.250.53 -s 1570 -d
/etc/init.d/netcpad (status|start|)
pktcap-uw --capture PostDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_after.pcap

 

NSX Controller Commands:

show control-cluster logical-switches vni 5001
show control-cluster logical-switches vtep-table 5001
show control-cluster logical-switches mac-table 5001
show control-cluster logical-switches arp-table 5001

 

NSX-V IP Discovery

Thanks to Dimitri Desmidt for his feedback.

IP discovery allows NSX to suppress ARP broadcasts on the logical switch.

To understand why we need IP discovery and how it works, we need some background regarding ARP.

IP Discovery

What is ARP (Address Resolution Protocol)?

When VM1 needs to communicate with VM2, it needs to know VM2's MAC address. The way to learn MAC2 is to send a broadcast message (ARP request) to all VMs in the same L2 segment (same VLAN, or in the example above, same VXLAN 5001).

All VMs on this logical switch will see this message, including VM3, since it is a broadcast, but only VM2 will respond. The response is a unicast message sent from VM2 directly to VM1's MAC address, with VM2's MAC address (MAC2) in the response body.

 

topology

VM1 will cache VM2's IP-to-MAC mapping in its ARP table. The entry is kept from a few seconds to a few minutes, depending on the operating system.

Windows 7, for example:

http://support.microsoft.com/kb/949589

If VM1 and VM2 do not talk again within this cache time window, VM1 will clear its ARP table entry for MAC2. The next time VM1 needs to talk to VM2, VM1's OS will send another ARP request to relearn VM2's MAC2.

Note: In the unlikely event that the NSX Controller doesn't know the MAC address for VM2's IP, the ARP request message is flooded, but only to the ESXi hosts that have VMs on logical switch 5001.
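
As a quick illustration of the cache behavior described above, the ARP table can be inspected and cleared from a Windows VM's console with the standard arp utility (on some Windows versions flushing all entries requires arp -d *):

arp -a     (list the current ARP cache entries)
arp -d     (clear the cache; the next packet triggers a new ARP request)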

 

How IP Discovery works:

VMware NSX leverages the NSX Controller to achieve IP Discovery.

Inside an ESXi host running the NSX software there is a process called the User World Agent (UWA). This process communicates with the NSX Controller and updates the controller's directory of MAC, IP and VTEP tables for the VMs residing on this ESXi host.

NSX Arch

When a VM connects to a logical switch, there are a few security services that a packet traverses; each service is represented by a different slot ID.

Filter slots

SLOT 0: implements the vDS Access List.

SLOT 1: the Switch Security module (swsec) captures DHCP Ack and ARP messages; this information is then forwarded to the NSX Controller.

SLOT 2: NSX Distributed Firewall.

From the figure above we now understand that slot 1 is the service responsible for implementing IP Discovery.

When VM1 powers up, even if its IP address is static, the VM sends out an ARP request to discover the MAC address of the default gateway. When the swsec module sees this ARP message it forwards it to the NSX Controller; that way the NSX Controller learns VM1's MAC1 address. In the same way the Controller learns VM2's MAC2 address.

Now when VM1 wants to talk to VM2 and MAC2 is not known to VM1, an ARP request is sent out on VXLAN 5001.

The UWA sends a query to the NSX Controller asking whether it knows MAC2. Since the Controller already knows it, a unicast message is sent back to VM1 with MAC2, and the ARP broadcast is not sent out to all the VMs in VXLAN 5001.

Note: There is a 3-minute timer in the NSX Controller for each ARP query. If a host sends the same query again within this time frame, the Controller ignores the request and the broadcast message is sent out to all VMs on the logical switch.

 

IP Discovery Verification:

The easiest way to know if IP Discovery actually works is to run Wireshark in VM3 and clear the ARP table in VM1 with the command: arp -d.

Now ping from VM1 to VM2; the ARP broadcast message from VM1 should not be seen in VM3.
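
We can also check the suppression from the controller side, reusing the controller command listed earlier in this post (VNI 5001 is this lab's logical switch):

show control-cluster logical-switches arp-table 5001

If VM1's and VM2's IP/MAC pairs appear in this table, the controller can answer ARP queries and no broadcast should reach VM3.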

I would like to point out a great post by Dmitri Kalintsev that explains in a deep dive how IP Discovery works:

NSX-v under the hood: VXLAN ARP suppression

 

NSX-V Edge NAT

Thanks to Francis Guillier, Max Ardica and Tiran Efrat for the overview and feedback.

One of the most important NSX Edge features is NAT.
With NAT (Network Address Translation) we can change the source or destination IP addresses and TCP/UDP ports. Combining NAT and firewall rules can lead to confusion when we try to determine the correct IP address to which a firewall rule applies.
To create the correct rule we need to understand the packet flow inside the NSX Edge in detail. In the NSX Edge we have two different types of NAT: Source NAT (SNAT) and Destination NAT (DNAT).

 

SNAT

SNAT allows translating an internal IP address (for example a private IP as described in RFC 1918) to a public external IP address.
In the figure below, the IP address of any VM in VXLAN 5001 that needs outside connectivity to the WAN can be translated to an external IP address (this mapping is configured on the Edge). For example, VM1 with IP address 172.16.10.11 needs to communicate with the WAN/Internet, so the NSX Edge can translate it to the 192.168.100.50 IP address configured on the Edge external interface.
Users in the external network are not aware of the internal private IP address.

 

SNAT

DNAT

DNAT allows access to internal private IP addresses from the outside world.
In the example in the figure below, users from the WAN need to communicate with the server 172.16.10.11.
An NSX Edge DNAT mapping is configured so that users from outside connect to 192.168.100.51, and the NSX Edge translates this IP address to 172.16.10.11.

DNAT

Below is an outline of the packet flow process inside the Edge. The important parts are where the SNAT/DNAT action and the firewall decision are taken.

packet flow

We can see from this process that an ingress packet is evaluated against the firewall rules before the SNAT/DNAT translation.

Note: the actual packet flow details are more complicated with more action/decisions in Edge flow, but the emphasis here is on the NAT and FW functionalities only.

Note: the NAT function works only if the firewall service is enabled.

Enable Firewall Service

 

 

Firewall rules and SNAT

Because of this packet flow, the firewall rule for SNAT needs to be applied to the internal IP address object and not to the IP address translated by the SNAT function. For example, when VM1 172.16.10.11 needs to communicate with the WAN, the firewall rule needs to be:

fw and SNAT

 Firewall rules and DNAT

Because of this packet flow, the firewall rules for DNAT need to be applied to the public IP address object and not to the private IP address after the DNAT translation. When a user from the WAN sends traffic to 192.168.100.51, the packet is first checked against this firewall rule, and then NAT changes the destination IP address to 172.16.10.11.
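
Sketched in plain text, the Edge firewall rule for this DNAT example would look like the following (field names follow the Edge firewall UI; the rule name is hypothetical):

Name:        Allow-Web-DNAT
Source:      any
Destination: 192.168.100.51   (the public IP, before NAT)
Service:     HTTP
Action:      Accept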

fw and DNAT

DNAT Configuration

Users from outside need to access an internal web server connecting to its public IP address.
The server's internal IP address is 172.16.10.11 and the NAT IP address is 192.168.100.6.

 

DNAT

The first step is creating the external IP on the Edge. This IP is a secondary address, because this Edge already has a main IP address configured in the 192.168.100.0/24 subnet.

Note: the main IP address is marked with a black dot (192.168.100.3).

For this example the DNAT IP address is 192.168.100.6.

DNAT1

Create a DNAT Rule in the Edge:

DNAT2

Now pay attention to the firewall rules on the Edge: a user coming from the outside will try to access the internal server by connecting to the public IP address 192.168.100.6. This implies that the firewall rule needs to allow this access.


DNAT3

DNAT Verification:

There are several ways to verify NAT is functioning as planned. In our example, users from any source address access the public IP address 192.168.100.6, and after the NAT translation the packet's destination IP address is changed to 172.16.10.11.

The output of the command:

show nat

show nat

The output of the command:

show firewall flow

We can see that the packet received by the Edge is destined to the 192.168.100.6 address, while the return traffic originates from a different IP address, 172.16.10.11 (the private IP address).
That means the DNAT translation is happening here.

show flow

We can capture the traffic and see the actual packet:
Capture Edge traffic on its outside interface vNic_0; in this example the user source IP address is 192.168.110.10 and the destination is 192.168.100.6.

The command for capture is:
debug packet display interface vNic_0 port_80_and_src_192.168.110.10

Debug packet display interface vNic_0 port_80_and_src_192.168.110.10

debug packet 1

Capturing on the Edge internal interface vNic_1, we can see that the destination IP address has been changed to 172.16.10.11 because of the DNAT translation:
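
The internal-side command is not shown in the screenshot, but the same filter expression can be reused on vNic_1, since DNAT does not change the source address (a sketch following the syntax shown above):

debug packet display interface vNic_1 port_80_and_src_192.168.110.10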

debug packet 2

SNAT configuration

All the servers in VXLAN segment 5001 (associated with the IP subnet 172.16.10.0/24) need to leverage SNAT translation (in this example to IP address 192.168.100.3) on the outside interface of the Edge to be able to communicate with the external network.

 

SNAT config

SNAT Configuration:

snat config 2

Edge Firewall Rules:

Allow 172.16.10.0/24 to go out:

SNAT config fw rule

 

Verification:

The output of the command:

show nat

show nat verification

DNAT with L4 Address Translation (PAT)

DNAT with L4 address translation allows changing the Layer 4 TCP/UDP port.
For example, we would like to mask our internal SSH server port for all users from outside.
The new port will be TCP/222 instead of the regular SSH TCP/22 port.

The user originates a connection to the server on destination port TCP/222, but the NSX Edge will change it to TCP/22.
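
As a sketch, the DNAT rule fields for this example would be (addresses assumed to be the same as in the DNAT example above):

Original IP:     192.168.100.6
Protocol:        TCP
Original Port:   222
Translated IP:   172.16.10.11
Translated Port: 22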

PAT

From the command line, the output of the show nat command:

PAT show nat

NAT Order

In this specific scenario, we want to create the two following SNAT rules.

  • SNAT Rule 1:
    The IP addresses for the devices part of VXLAN 5001 (associated to the IP subnet 172.16.10.0/24) need to be translated to the Edge outside interface address 192.168.100.3.
  • SNAT Rule 2:
    Web-SRV-01a on VXLAN 5001 needs its IP address 172.16.10.4 to be translated to the Edge outside address 192.168.100.4.

nat order

In the configuration example above, traffic will never hit rule number 4 because 172.16.10.4 is part of subnet 172.16.10.0/24, so its IP address will be translated to 192.168.100.3 (and not the desired 192.168.100.4).

Order for SNAT rules is important!
We need to re-order the SNAT rules and put the more specific one on top, so that rule 3 will be hit for traffic originating from the IP address 172.16.10.4, whereas rule 4 will apply to all the other devices in the IP subnet 172.16.10.0/24.

nat reorder

After re-ordering:

nat after reorder
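
In plain text, the desired order is:

Rule 3 (specific host): SNAT 172.16.10.4    -> 192.168.100.4
Rule 4 (whole subnet):  SNAT 172.16.10.0/24 -> 192.168.100.3

The Edge evaluates SNAT rules top-down and stops at the first match, which is why the host-specific rule must precede the subnet-wide rule.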

 

Another useful command:

show configuration nat

 

NSX Load Balancing

This next overview of load balancing was taken from the great work of Max Ardica and Nimish Desai in the official NSX Design Guide:

Overview

Load balancing is another network service available within NSX that can be natively enabled on the NSX Edge device. The two main drivers for deploying a load balancer are scaling out an application (through distribution of workload across multiple servers) and improving its high-availability characteristics.

NSX Load Balancing

NSX Load Balancing

The NSX load balancing service is specifically designed for cloud environments with the following characteristics:

  • Fully programmable via API
  • Same single central point of management/monitoring as other NSX network services

The load balancing services natively offered by the NSX Edge satisfy the needs of the majority of application deployments. This is because the NSX Edge provides a large set of functionalities:

  • Support for any TCP application, including, but not limited to, LDAP, FTP, HTTP, HTTPS
  • Support for UDP applications starting from NSX SW release 6.1
  • Multiple load balancing distribution algorithms: round-robin, least connections, source IP hash, URI
  • Multiple health checks: TCP, HTTP, HTTPS, including content inspection
  • Persistence: source IP, MSRDP, cookie, SSL session-id
  • Connection throttling: max connections and connections/sec
  • L7 manipulation, including, but not limited to, URL block, URL rewrite, content rewrite
  • Optimization through support of SSL offload

Note: the NSX platform can also integrate load-balancing services offered by 3rd-party vendors. This integration is out of scope for this paper.

In terms of deployment, the NSX Edge offers support for two types of models:

  • One-arm mode (called proxy mode): this scenario is highlighted in the figure below and consists of deploying an NSX Edge directly connected to the logical network for which it provides load-balancing services.
One-Arm Mode Load Balancing Services

One-Arm Mode Load Balancing Services

The one-armed load balancer functionality is shown above:

  1. The external client sends traffic to the Virtual IP address (VIP) exposed by the load balancer.
  2. The load balancer performs two address translations on the original packets received from the client: Destination NAT (D-NAT) to replace the VIP with the IP address of one of the servers deployed in the server farm, and Source NAT (S-NAT) to replace the client IP address with the IP address identifying the load balancer itself. S-NAT is required to force the return traffic from the server farm to the client through the LB.
  3. The server in the server farm replies by sending the traffic to the LB (because of the S-NAT function previously discussed).

  4. The LB again performs a Source and Destination NAT service to send traffic to the external client, leveraging its VIP as the source IP address.

The advantage of this model is that it is simpler to deploy and flexible, as it allows deploying LB services (NSX Edge appliances) directly on the logical segments where they are needed without requiring any modification to the centralized NSX Edge that provides routing to the physical network. On the downside, this option requires provisioning more NSX Edge instances and mandates the deployment of Source NAT, which means the servers in the DC have no visibility into the original client IP address.

Note: the LB can insert the original IP address of the client into the HTTP header before performing S-NAT (a function named “Insert X-Forwarded-For HTTP header”). This provides the servers visibility into the client IP address but it is obviously limited to HTTP traffic.

Inline mode (called transparent mode) requires instead deploying the NSX Edge inline with the traffic destined to the server farm. The way this works is shown in the figure below.

Two-Arms Mode Load Balancing Services

Two-Arms Mode Load Balancing Services

    1. The external client sends traffic to the Virtual IP address (VIP) exposed by the load balancer.
    2. The load balancer (centralized NSX Edge) performs only Destination NAT (D-NAT) to replace the VIP with the IP address of one of the servers deployed in the server farm.
    3. The server in the server farm replies to the original client IP address and the traffic is received again by the LB since it is deployed inline (and usually as the default gateway for the server farm).
    4. The LB performs Source NAT to send traffic to the external client leveraging its VIP as source IP address.

    This deployment model is also quite simple and allows the servers to have full visibility into the original client IP address. At the same time, it is less flexible from a design perspective, as it usually forces using the LB as the default gateway for the logical segments where the server farms are deployed, which implies that only centralized (and not distributed) routing can be adopted for those segments. It is also important to notice that in this case the LB is another logical service added to an NSX Edge that is already providing routing services between the logical and the physical networks. As a consequence, it is recommended to increase the form factor of the NSX Edge to X-Large before enabling load-balancing services.

     

    In terms of scalability and throughput figures, the NSX load balancing services offered by each single NSX Edge can scale up to (best case scenario):

    • Throughput: 9 Gbps
    • Concurrent connections: 1 million
    • New connections per sec: 131k

     

    Below are some deployment examples of tenants with different applications and different load balancing needs. Notice how each of these applications is hosted on the same cloud with the network services offered by NSX.

Deployment Examples of NSX Load Balancing

Deployment Examples of NSX Load Balancing

Two final important points to highlight:

  • The load balancing service can be fully distributed, with a dedicated load balancer per tenant. This brings multiple benefits:
  • Each tenant has its own load balancer.
  • Each tenant configuration change does not impact other tenants.
  • Load increase on one tenant load-balancer does not impact other tenants load-balancers scale.
  • Each tenant load balancing service can scale up to the limits mentioned above.

Other network services are still fully available:

  • The same tenant can mix its load balancing service with other network services such as routing, firewalling, VPN.

 

One-Arm Load Balancer Lab Topology

In this one-arm load balancer lab topology we have a 3-tier application built from:

Web servers: web-sv-01a (172.16.10.11), web-sv-02a (172.16.10.12)

App: app-sv-01a (172.16.20.11)

DB: db-sv-01a (172.16.30.11)

We will add to this lab an NSX Edge Services Gateway (ESG) for the load balancer function.

The ESG (highlighted with the red line) is deployed in one-arm mode and exposes the VIP 172.16.10.10 to load-balance traffic to the Web-Tier-01 segment.

One-Armed Lab topology

 

Configure the One-Arm Load Balancer

Create NSX Edge gateway:

One-Arem-1

Select Edge Service Gateway (ESG):
One-Arem-2

Set the admin password, enable SSH and enable auto rule generation:

One-Arem-3

Install the ESG in Management Cluster:

One-Arem-4

In our lab the appliance size is Compact, but we should choose the right size according to the amount of traffic expected:

One-Arem-5

Configure the Edge interface and IP address; since this is one-arm mode we have only one interface:

One-Arem-6

Create the default gateway:

One-Arem-8

Configure the default accept firewall rule:

One-Arem-9

Complete the installation:

One-Arem-10

Verify the ESG is deployed:

One-Arem-11

Enable the load balancer in the ESG: go to Load Balancer and click Edit:

One-Arem-12

Check “Enable Load Balancer”:

One-Arem-13

Create the application profile:

One-Arem-14

Add a name, select HTTPS as the Type, and enable SSL Passthrough:

One-Arem-15

Create the pool:

One-Arem-16

For the Algorithm select ROUND-ROBIN, keep the default https monitor, and add the two server members:

One-Arem-16h

To add members click on the + icon; the port we monitor is 443:

One-Arem-17

We need then to create the VIP:

One-Arem-18

In this step we glue all the configuration parts together: tie the application profile to the pool and assign the virtual IP address:

One-Arem-19

Now we can check that the load balancer is actually working by connecting to the VIP address with a client web browser.

In the web browser, we point to the VIP address 172.16.10.10.

The result is that we hit web-sv-01a (172.16.10.11):

One-Arem-verification-1

When we refresh the web browser we see that we hit web-sv-02a (172.16.10.12):

One-Arem-verification-2
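
A browser may cache the page and mask the round-robin behavior, so a quick alternative check is to issue repeated requests from any client with curl (a sketch; -k skips certificate validation, which this lab needs since SSL passthrough presents the pool members' own certificates):

curl -k https://172.16.10.10/
curl -k https://172.16.10.10/

With no persistence configured, consecutive requests should alternate between web-sv-01a and web-sv-02a.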

Troubleshooting the One-Arm Load Balancer

General load balancer troubleshooting workflow:

Review the configuration through the UI.

Check the pool member status through the UI.

Do online troubleshooting via the CLI:

  • Check LB engine status (L4/L7)
  • Check LB objects statistics (vips, pools, members)
  • Check Service Monitor status (OK, WARNING, CRITICAL)
  • Check system log message (# show log)
  • Check LB L4/L7 session table
  • Check LB L7 sticky-table status

 

Check the configuration through the UI:

One-Arem-TSHOT-1

 

Check the pool member status through the UI:

 

One-Arem-TSHOT-2

Possible errors discovered:

  1. Ports 80/443 might be used by other services (e.g. SSL VPN).
  2. The member port and monitor port are misconfigured, so the health check fails.
  3. A member in WARNING state should be treated as DOWN.
  4. The L4 LB engine is used when:
    a) the protocol is TCP/HTTP;
    b) there are no persistence or L7 settings;
    c) accelerateEnable is true.
  5. The pool is in transparent mode but the Edge doesn't sit in the return path.

Do online troubleshooting via CLI:

Check LB engine status (L4/L7)

# show service loadbalancer

Check LB objects statistics (vips, pools, members)

# show service loadbalancer virtual [vip-name]

# show service loadbalancer pool [pool-name]

Check Service Monitor status (OK, WARNING, CRITICAL)

# show service loadbalancer monitor

Check system log message

# show log

Check LB session table

# show service loadbalancer session

Check LB L7 sticky-table status

# show service loadbalancer table

 

 

One-Arm-LB-0> show service loadbalancer
<cr>
error Show loadbalancer Latest Errors information.
monitor Show loadbalancer HealthMonitor information.
pool Show loadbalancer pool information.
session Show loadbalancer Session information.
table Show loadbalancer Sticky-Table information.
virtual Show loadbalancer virtualserver information.

#########################################################

One-Arm-LB-0> show service loadbalancer
———————————————————————–
Loadbalancer Services Status:

L7 Loadbalancer : running
Health Monitor : running

#########################################################

One-Arm-LB-0> show service loadbalancer monitor
———————————————————————–
Loadbalancer HealthMonitor Statistics:

POOL                               MEMBER                                  HEALTH STATUS
Web-Servers-Pool-01  web-sv-02a_172.16.10.12   default_https_monitor:OK
Web-Servers-Pool-01  web-sv-01a_172.16.10.11   default_https_monitor:OK
One-Arm-LB-0>

##########################################################

One-Arm-LB-0> show service loadbalancer virtual
———————————————————————–
Loadbalancer VirtualServer Statistics:

VIRTUAL Web-Servers-VIP
| ADDRESS [172.16.10.10]:443
| SESSION (cur, max, total) = (0, 3, 35)
| RATE (cur, max, limit) = (0, 6, 0)
| BYTES in = (17483), out = (73029)
+->POOL Web-Servers-Pool-01
| LB METHOD round-robin
| LB PROTOCOL L7
| Transparent disabled
| SESSION (cur, max, total) = (0, 3, 35)
| BYTES in = (17483), out = (73029)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-01a_172.16.10.11, STATUS: UP
| | STATUS = UP, MONITOR STATUS = default_https_monitor:OK
| | SESSION (cur, max, total) = (0, 2, 8)
| | BYTES in = (8882), out = (43709)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-02a_172.16.10.12, STATUS: UP
| | STATUS = UP, MONITOR STATUS = default_https_monitor:OK
| | SESSION (cur, max, total) = (0, 1, 7)
| | BYTES in = (7233), out = (29320)

####################################################################
One-Arm-LB-0> show service loadbalancer pool
———————————————————————–
Loadbalancer Pool Statistics:

POOL Web-Servers-Pool-01
| LB METHOD round-robin
| LB PROTOCOL L7
| Transparent disabled
| SESSION (cur, max, total) = (0, 3, 35)
| BYTES in = (17483), out = (73029)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-01a_172.16.10.11, STATUS: UP
| | STATUS = UP, MONITOR STATUS = default_https_monitor:OK
| | SESSION (cur, max, total) = (0, 2, 8)
| | BYTES in = (8882), out = (43709)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-02a_172.16.10.12, STATUS: UP
| | STATUS = UP, MONITOR STATUS = default_https_monitor:OK
| | SESSION (cur, max, total) = (0, 1, 7)
| | BYTES in = (7233), out = (29320)

##########################################################################

One-Arm-LB-0> show service loadbalancer session
———————————————————————–
L7 Loadbalancer Current Sessions:

0x5fe50a2b230: proto=tcpv4 src=192.168.110.10:49392 fe=Web-Servers-VIP be=Web-Servers-Pool-01 srv=web-sv-01a_172.16.10.11 ts=08 age=8s calls=3 rq[f=808202h,i=0,an=00h,rx=4m53s,wx=,ax=] rp[f=008202h,i=0,an=00h,rx=4m53s,wx=,ax=] s0=[7,8h,fd=13,ex=] s1=[7,8h,fd=14,ex=] exp=4m52s
0x5fe50a22960: proto=unix_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=09 age=0s calls=2 rq[f=c08200h,i=0,an=00h,rx=20s,wx=,ax=] rp[f=008002h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=1,ex=] s1=[7,0h,fd=-1,ex=] exp=20s
———————————————————————–

 

Disconnect web-sv-01a_172.16.10.11 from the network

 

 

One-Arem-TSHOT-3

From the GUI we can see the effect on the pool member status:

One-Arem-TSHOT-4

 

One-Arm-LB-0> show service loadbalancer virtual
———————————————————————–
Loadbalancer VirtualServer Statistics:

VIRTUAL Web-Servers-VIP
| ADDRESS [172.16.10.10]:443
| SESSION (cur, max, total) = (0, 3, 35)
| RATE (cur, max, limit) = (0, 6, 0)
| BYTES in = (17483), out = (73029)
+->POOL Web-Servers-Pool-01
| LB METHOD round-robin
| LB PROTOCOL L7
| Transparent disabled
| SESSION (cur, max, total) = (0, 3, 35)
| BYTES in = (17483), out = (73029)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-01a_172.16.10.11, STATUS: DOWN
| | STATUS = DOWN, MONITOR STATUS = default_https_monitor:CRITICAL
| | SESSION (cur, max, total) = (0, 2, 8)
| | BYTES in = (8882), out = (43709)
+->POOL MEMBER: Web-Servers-Pool-01/web-sv-02a_172.16.10.12, STATUS: UP
| | STATUS = UP, MONITOR STATUS = default_https_monitor:OK
| | SESSION (cur, max, total) = (0, 1, 7)
| | BYTES in = (7233), out = (29320)

NSX L2 Bridging

Overview

This next overview of L2 bridging was taken from the great work of Max Ardica and Nimish Desai in the official NSX Design Guide:

There are several circumstances where it may be required to establish L2 communication between virtual and physical workloads. Some typical scenarios are (not an exhaustive list):

  • Deployment of multi-tier applications: in some cases, the Web, Application and Database tiers can be deployed as part of the same IP subnet. Web and Application tiers are typically leveraging virtual workloads, but that is not the case for the Database tier where bare-metal servers are commonly deployed. As a consequence, it may then be required to establish intra-subnet (intra-L2 domain) communication between the Application and the Database tiers.
  • Physical to virtual (P-to-V) migration: many customers are virtualizing applications running on bare metal servers and during this P-to-V migration it is required to support a mix of virtual and physical nodes on the same IP subnet.
  • Leveraging external physical devices as default gateway: in such scenarios, a physical network device may be deployed to function as default gateway for the virtual workloads connected to a logical switch and a L2 gateway function is required to establish connectivity to that gateway.
  • Deployment of physical appliances (firewalls, load balancers, etc.).

To fulfill the specific requirements listed above, it is possible to deploy devices performing a “bridging” functionality that enables communication between the “virtual world” (logical switches) and the “physical world” (non virtualized workloads and network devices connected to traditional VLANs).

NSX offers this functionality in software through the deployment of NSX L2 Bridging allowing VMs to be connected at layer 2 to a physical network (VXLAN to VLAN ID mapping), even if the hypervisor running the VM is not physically connected to that L2 physical network.

L2 Bridge topology

 

The figure above shows an example of L2 bridging, where a VM connected in the logical space to VXLAN segment 5001 needs to communicate with a physical device deployed in the same IP subnet but connected to a physical network infrastructure (in VLAN 100). In the current NSX-v implementation, the VXLAN-VLAN bridging configuration is part of the distributed router configuration; the specific ESXi host performing the L2 bridging functionality is hence the one where the Control VM for that distributed router is running. In case of failure of that ESXi host, the ESXi host running the standby Control VM (which gets activated once it detects the failure of the active one) takes over the L2 bridging function.

Independently from the specific implementation details, below are some important deployment considerations for the NSX L2 bridging functionality:

  • The VXLAN-VLAN mapping is always performed in 1:1 fashion. This means traffic for a given VXLAN can only be bridged to a specific VLAN, and vice versa.
  • A given bridge instance (for a specific VXLAN-VLAN pair) is always active only on a specific ESXi host.
  • However, through configuration it is possible to create multiple bridge instances (for different VXLAN-VLAN pairs) and ensure they are spread across separate ESXi hosts. This improves the overall scalability of the L2 bridging function.
  • The NSX Layer 2 bridging data path is entirely performed in the ESXi kernel, and not in user space. Once again, the Control VM is only used to determine the ESXi host where a given bridging instance is active, and not to perform the bridging function.

 

 

Configure L2 Bridge

In this scenario we would like to bridge between an App VM connected to VXLAN 5002 and a virtual machine connected to VLAN 100.

Create Bridge 1

My current Logical Switch configuration:

Logical Switch table

We have pre-configured a VLAN-backed port group for VLAN 100:

Port group

Bridging configuration is done at the DLR level. In this specific example, the DLR name is Distributed-Router:

Double-click on edge-1:

DLR1

 

Click on Bridging and then the green + button:

DLR2

Type the bridge name, logical switch ID and port group name:

DLR3

 

Click OK and Publish:

DLR4

 

Now a VM on logical switch App-Tier-01 can communicate with a physical or virtual machine on VLAN 100.

 

Design Consideration

Currently in NSX-V 6.1 we can’t enable routing on the VXLAN logical switch that is bridged to a VLAN.

In other words, the default gateway for devices connected to the VLAN can’t be configured on the distributed logical router:

Non-working L2 Bridge Topology

Non-working L2 Bridge Topology

So how can a VM in VXLAN 5002 communicate with VXLAN 5001?

The big difference is that VXLAN 5002 is no longer connected to a DLR LIF; it is connected instead to the NSX Edge.

Working Bridge Topology

Redundancy

The DLR Control VM can work in high-availability mode: if the active DLR Control VM fails, the standby Control VM takes over, which means the bridge instance will move to a new ESXi host.

HA

 

Bridge Troubleshooting:

Most issues I ran into were caused by the bridged VLAN missing from the trunk interface configured on the physical switch.

In the figure below:

  • The physical server is connected to VLAN 100; the App VM connected to VXLAN 5002 runs on esx-01b.
  • The active DLR Control VM is located on esx-02a, so the bridging function will be active on this ESXi host.
  • Both ESXi hosts have two physical NICs: vmnic2 and vmnic3.
  • The transport VLAN carries all VNI (VXLAN) traffic and is forwarded on the physical switch in VLAN 20.
  • On physical switch-2 port E1/1 we must configure a trunk port and allow both VLAN 100 and VLAN 20 (see the sketch after the note below).

Bridge and Trunk configuration

Note: Port E1/1 will carry both VXLAN and VLAN traffic. 
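
As a sketch, the matching trunk configuration on physical switch-2 port E1/1 would look like this on a Cisco-style switch (the syntax is an assumption; use your switch vendor's equivalent):

interface Ethernet1/1
  switchport mode trunk
  switchport trunk allowed vlan 20,100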


Find Where Bridge is Active:

We need to know where the active DLR Control VM is located (if we have HA); inside that ESXi host the bridging happens in kernel space. The easy way to find it is to look at the “Configuration” section in the “Manage” tab.

Note: When the DLR Control VM is powered off (and HA is not enabled), the bridging function on this ESXi host stops, to prevent a loop.

DLR5

We can see that the Control VM is located on esx-02a.corp.local.

SSH to this ESXi host and find the VDR name of the DLR Control VM:

xxx-xxx -I -l

VDR Instance Information :
—————————

Vdr Name: default+edge-1
Vdr Id: 1460487509
Number of Lifs: 4
Number of Routes: 5
State: Enabled
Controller IP: 192.168.110.201
Control Plane IP: 192.168.110.52
Control Plane Active: Yes
Num unique nexthops: 1
Generation Number: 0
Edge Active: Yes

Now we know that “default+edge-1” is the VDR name.

 

xxx-xxx -b --mac default+edge-1

###################################################################################################

~ # xxx-xxx -b --mac default+edge-1

VDR ‘default+edge-1’ bridge ‘Bridge_App_VLAN100’ mac address tables :
Network ‘vxlan-5002-type-bridging’ MAC address table:
total number of MAC addresses: 0
number of MAC addresses returned: 0
Destination Address Address Type VLAN ID VXLAN ID Destination Port Age
——————- ———— ——- ——– —————- —
Network ‘vlan-100-type-bridging’ MAC address table:
total number of MAC addresses: 0
number of MAC addresses returned: 0
Destination Address Address Type VLAN ID VXLAN ID Destination Port Age
——————- ———— ——- ——– —————- —

###################################################################################################

From this output we can see that there is no MAC address learning yet.

After connecting a VM to logical switch App-Tier-01 and pinging a VM in VLAN 100, we can see MAC addresses from both VXLAN 5002 and VLAN 100:

Bridge TSHOOT


NSX Role Based Access Control

One of the most challenging problems in managing large networks is the complexity of security administration.

“Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within an enterprise. In this context, access is the ability of an individual user to perform a specific task, such as view, create, or modify a file. Roles are defined according to job competency, authority, and responsibility within the enterprise”

Within NSX we have four built-in roles, and we can map a user or a group to one of them. Instead of assigning roles to individual users, the preferred way is to assign a role to a group.

Organizations create user groups for proper user management. After integration with SSO, NSX Manager can get the details of the groups to which a user belongs.

NSX Roles

Within NSX Manager we have four pre-built RBAC roles covering different NSX permissions and areas of the NSX environment.

The four NSX built-in roles are: Auditor, Security Administrator, NSX Administrator and Enterprise Administrator:

NSX RBAC Diagram

NSX RBAC Diagram

Configure the Lookup Service in NSX Manager

Whenever we assign a role in NSX, we can assign it to an SSO user or group. When the Lookup Service is not configured, group-based role assignment does not work, i.e. users from that group are not able to log in to NSX.

The reason is that we cannot fetch any group information from the SSO server; the group-based authentication provider is only available when the Lookup Service is configured. Logins where the user is explicitly assigned a role in NSX are not affected. This means that without the Lookup Service the customer has to assign roles to users individually and cannot take advantage of SSO groups.

For NSX, the vCenter SSO server is one of the identity providers for authentication. The prerequisite for authentication in NSX is that the user/group has been assigned a role in NSX.

NSX Manager Lookup Service

NSX Manager Lookup Service

Note: NTP/DNS must be configured on the NSX Manager for the Lookup Service to work.

Note: The domain account must have AD read permission for all objects in the domain tree.

Configure Active Directory Groups

In this blog I will use Microsoft Active Directory as the user identity source. In “Active Directory Users and Computers” I created four different groups. The groups have the same names as the NSX roles, to make life easier:

Auditor, Security Administrator, NSX Administrator, Enterprise Administrator.

AD Groups

AD Groups

We create four A/D users and add each user to a different A/D group, for example the nsxadmin user:

The user nsxadmin is associated with the group NSX Administrator; the association is done via the Add button:

AD user

AD user

In the same way I associate the other users with A/D groups:

username  ->  group

auditor1  ->  Auditor
secadmin  ->  Security Administrator
nsxadmin  ->  NSX Administrator
entadmin  ->  Enterprise Administrator

Connect Directory Domain to NSX Manager.

Go to the “Network & Security” tab and double-click on “NSX Manager”:

map ad to nsx manager role 1

map ad to nsx manager role 1

Double-click on the “192.168.110.42” icon:

map ad to nsx manager role 2

Note: configuring a domain is not needed for RBAC; it is only required if we want to use identity firewall rules based on users or groups.

Go to “Manage” -> “Domains” and click on the green plus button:

map ad to nsx manager role 8

Fill in the Name and NetBIOS name fields with your domain name and NetBIOS name.

In my example the domain name is corp.local:

map ad to nsx manager role 9

Enter the LDAP (i.e. AD) IP address or hostname and a domain account (username and password):

map ad to nsx manager role 10

Configuring the LDAP options can also be done via a direct API call (to bypass the Event Log Access described in the next steps).

Click Next.

Event Log Access:

If we need to create NSX firewall rules with user identity based on AD groups, we must allow the NSX Manager to read the Active Directory “Security Event Log”. This log contains the Active Directory users' logon/logoff events for the domain. We use this information to bind an AD user to an IP address.

NSX needs access to the Event Log to provide the dFW with user identity in one of two cases:

  1. The user logs on to a VM that is not running VMware Tools.
  2. The user logs on to the domain from a PC located in the physical environment.

By the way, for users logging in to a VM with VMware Tools up and running, we do not need the “Security Event Log” to bind the user to an IP.

Permissions for the user to read logon/logoff events:

Windows 2008 or later domain servers:

Add the account to the Event Log Readers group. If you are using the on-device User-ID agent, the account must also be a member of the Distributed COM Users group.

Windows 2003 domain servers:

Assign the Manage Auditing and Security Logs permission through group policy.

In both cases NSX needs to access the AD with read permission for the security event logs; the protocols used to read this information are CIFS or WMI.

During this process NSX collects the following Microsoft event IDs:

For Windows 2008/2012 – Event ID 4624

For Windows 2003 – Event ID 540
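
One way to confirm these events are actually being generated on the domain controller is the built-in wevtutil tool (a sketch for Windows 2008/2012, run on the DC itself; it prints the five most recent logon events that NSX Manager would read):

wevtutil qe Security /q:"*[System[(EventID=4624)]]" /c:5 /f:text /rd:true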

NSX “copies” these event log entries from the A/D and parses the data inside the NSX Manager appliance.

map ad to nsx manager role 11

Click Next and Finish:

map ad to nsx manager role 12

Mapping Active Directory Groups to NSX Manager Roles

Note: This step is a must for NSX RBAC to work.

Now we can map Active Directory groups to pre-built NSX Manager roles.

Go to “Manage” -> “Users” and click on the green plus button:

map ad to nsx manager role 3

Here we can select whether to map a specific A/D user to an NSX role, or an A/D group to a role.

map ad to nsx manager role 4

In this blog I will use an A/D group. We created an A/D group called Auditor, and the format to input here is:

“group_name”@domain.name. Let's start with the Auditor group, which has “read only” permission:

map ad to nsx manager role 5

Select one of the NSX roles; for the Auditor A/D group we choose Auditor:

map ad to nsx manager role 6

We can limit the scope of objects this group can work on inside NSX Manager; in this example there is no limit:

map ad to nsx manager role 7

In the same way, map all the other A/D groups to NSX roles:

Auditor@corp.local                   -> Auditor
Security Administrator@corp.local    -> Security Administrator
NSX Administrator@corp.local         -> NSX Administrator
Enterprise Administrator@corp.local  -> Enterprise Administrator

Let's try our first login with the user auditor1:

Login1

The login is successful, but where has the “Network & Security” tab gone?

Login2

So far we have configured the NSX Manager side, but we didn't take care of the vCenter permissions for the group. Confused?

vCenter has its own role for each group. We need to configure a vCenter role for each A/D group we created; these settings determine what the user can do in the vCenter environment.

Configure vCenter Roles:

Let's start by configuring the role for the Auditor A/D group. We know this group is “read only” in NSX Manager, so it makes sense to give this group “Read Only” access to the rest of the vCenter environment as well.

Go to vCenter -> Manage -> Permissions and click the green plus button:

vCenter Roles 1

We need to choose a role from the Assigned Role list. If we select No Access the group will not be able to log in to vCenter, so we need to choose something from “Read-Only” up to “Administrator”.

For the Auditor role, “Read Only” is the minimum.

Select “Read Only” from the Assigned Role drop-down list and click on the “Add” button under “Users and Groups”:

vCenter Roles 2

From the Domain drop-down select your domain name (in our lab the domain is “CORP”), choose your Active Directory group from the list (Auditor in this example) and click the “Add” button:

vCenter Roles 3

Click OK, and OK again in the next step:

vCenter Roles 4

In the same way we need to configure the roles for all the other groups:

vCenter Roles 5

Now we can try to log in with the auditor1 user:

auditor1

As we can see, auditor1 has the “Read Only” role:

auditor2

We can verify that auditor1 can't change any vCenter configuration:

auditor3

Testing the secadmin user, mapped to the Security Administrator role: this user cannot perform any NSX infrastructure-related task, like adding a new NSX Controller node:

secadmin1

But secadmin can create a new firewall rule:

secadmin2

When logging in with the nsxadmin user, mapped to the NSX Administrator role, we can see that the user can add a new Controller node:

nsxadmin1

But the nsxadmin user cannot see or change any configured firewall rules:

nsxadmin2

What if the user is a member of two A/D groups?

The user gains the combined permissions of both groups.

For example, if the user is a member of both the “Auditor” and the “Security Administrator” groups, the result is that the user has read-only permission on all NSX infrastructure and also gains access to all security-related areas in NSX.

Summary

In this post we demonstrated the different NSX Manager roles. We configured Microsoft Active Directory as the external source for user identity.

Deploying NSX-V Controller Fails and the Controller Disappears from the vSphere Client

Any of the following issues, hit during the deployment of the NSX-v Controller cluster, may cause the deployment to fail and the instantiated Controller nodes to be deleted after a few minutes.

  1. A firewall blocking Controller communication with the NSX Manager.
  2. Network connectivity issues between the NSX Manager and the Controllers.
  3. DNS/NTP misconfiguration between the NSX Manager/vCenter/ESXi hosts.
  4. Lack of available resources, like disk space, in the datastore used for deploying the Controllers.

The first area to investigate is the “Task Console” in vCenter. From an analysis of the entries displayed in the console, it is clear that the Controller virtual machine is first “powered on”, but then it gets powered off and deleted. But why?

 

View vCenter Tasks

View vCenter Tasks

 

Troubleshooting step:

  • Download the NSX Manager logs.
  • In the upper right corner of the NSX Manager GUI, choose “Download Tech Support Log”.
Download NSX Manager Logs

Download NSX Manager Logs

 

The tech support file can be a very large text file, so finding an issue is as challenging as looking for a needle in a haystack. What should we look for?

My best advice is to start with something we know: the name of the Controller node that was first instantiated and then deleted. This name was assigned to the Controller node after the completion of the deployment wizard.

In my specific example it was “controller-2”.

Open the text file and search for this name:

Search in Tech Support File

Search in Tech Support File

 

When you find the name, use the arrow-down key and start reading:

NSX Tech Support file

NSX Tech Support file

 

From this error we can learn that we have connectivity issues; it appears that if the Controller node can't connect to the NSX Manager during the deployment process, it gets automatically deleted.

The next question is: why do I have connectivity issues? In my case the NSX Controller and the NSX Manager run in the same IP subnet.

The answer is found in the static IP pool object that was created for the Controller cluster.

In this lab I work with a class B subnet, 255.255.0.0 = prefix length 16, but in the IP pool I mistakenly assigned a prefix length of 24. With a /24 mask the Controller treats only its own /24 as local, so addresses elsewhere in the /16 can become unreachable, breaking communication with the NSX Manager.

 

Wrong IP Pool

Wrong IP Pool

 

This was just an example of how to troubleshoot an NSX-v Controller node deployment, but there may be other reasons that can cause a similar problem:

  • A firewall blocking the Controller from talking to the NSX Manager.
  • Network connectivity issues between the NSX Manager and the Controllers.
  • Missing DNS/NTP configuration on the NSX Manager/vCenter/ESXi hosts.
  • Lack of available resources, like disk space, in the datastore where the Controllers are deployed.