NSX Edge and DRS Rules

The NSX Edge cluster connects the logical and physical worlds and usually hosts the NSX Edge Services Gateways and the DLR Control VMs.

There are deployments where the Edge cluster may contain the NSX Controllers as well.

In this section we discuss how to design an Edge cluster to survive the failure of an ESXi host or of an entire physical chassis, and to minimize the outage time.

In the figure below we deploy NSX Edges E1 and E2 in ECMP mode, where they run active/active from the perspective of both the control and data planes. The DLR Control VMs run active/passive, while both E1 and E2 run a dynamic routing protocol with the active DLR Control VM.

When the DLR learns a new route from E1 or E2, it pushes this information to the NSX Controller cluster. The NSX Controller then updates the routing table in the kernel of each ESXi host running this DLR instance.

 

1

 

In the scenario where the ESXi host containing Edge E1 fails:

  • The active DLR updates the NSX Controller to remove E1 as a next hop, the NSX Controller updates the ESXi hosts, and as a result the “Web” VM traffic is routed to Edge E2.
    The time it takes to re-route the traffic depends on the dynamic routing protocol convergence time.

2

In the specific scenario where the failed ESXi host or chassis contained both Edge E1 and the active DLR Control VM, we instead face a longer outage in the forwarded traffic.

The reason for this is that the active DLR is down and cannot detect the failure of Edge E1 and update the Controller accordingly. The ESXi hosts continue to forward traffic to Edge E1 until the passive DLR becomes active, learns that Edge E1 is down and updates the NSX Controller.

3

The Golden Rule is:

We must ensure that when the Edge Services Gateway and the DLR Control VM belong to the same tenant they do not reside on the same ESXi host. It is better to distribute them across ESXi hosts and reduce the functions affected by a single failure.

By default, when we deploy an NSX Edge or DLR in active/passive mode, the system takes care of creating a DRS anti-affinity rule that prevents the active and passive VMs from running on the same ESXi host.

DRS anti-affinity rules

We need to build additional DRS rules, as these default rules will not prevent us from getting into the dual-failure scenario described above.

The figure below describes the logical network view for our specific example. This topology is built from two different tenants, where each tenant is represented by a different color and has its own Edge and DLR.

Note: connectivity to the physical world is not shown in the figure below, to simplify the diagram.

multi tenants

My physical Edge Cluster has four ESXi hosts which are distributed over two physical chassis:

Chassis A: esxcomp-01a, esxcomp-02a

Chassis B: esxcomp-01b, esxcomp-02b

4

Create DRS Host Group for each Chassis

We start by creating a container for all the ESXi hosts in Chassis A; this container is configured as a DRS host group.

Edge Cluster -> Manage -> Settings -> DRS Groups

Click the Add button and call this group “Chassis A”.

The container type needs to be “Host DRS Group”; add the ESXi hosts running in Chassis A (esxcomp-01a and esxcomp-02a).

5

Create another DRS group called Chassis B that contains esxcomp-01b and esxcomp-02b:

6

 

VM DRS Group for Chassis A:

We need to create a container for the VMs that will run in Chassis A. At this point we just name it after Chassis A; we are not actually placing the VMs on Chassis A yet.

This Container type is “VM DRS Group”:

7

VM DRS Group for Chassis B:

8

 

At this point we have four DRS groups:

9

DRS Rules:

Now we need to take the DRS objects we created before, “Chassis A” and “VM to Chassis A”, and tie them together. The next step is to do the same for “Chassis B” and “VM to Chassis B”.

* This configuration needs to be part of “DRS Rules”.

Edge Cluster -> Manage -> Settings -> DRS Rules

Click on the Add button in DRS Rules and, in the Name field, enter something like: “VM’s Should Run on Chassis A”.

In Type, select “Virtual Machines to Hosts” because we want to bind the VM group to the host group.

In the VM group name, choose the “VM to Chassis A” object.

Below the VM group selection we need to select the group-to-hosts binding enforcement type.

We have two different options:

“Should run on hosts in group” or “Must run on hosts in group”

If we choose the “Must” option, then in the event of the failure of all the ESXi hosts in this group (for example, if Chassis A had a critical power outage), the other ESXi hosts in the cluster (Chassis B) would not be considered by vSphere HA as a viable option for recovering the VMs. With the “Should” option, vSphere HA can use the other ESXi hosts for recovery.

10

 

Same for Chassis B:

11
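For completeness, the same host group, VM group, and “should run on hosts in group” rule can also be created programmatically. Below is a minimal pyVmomi sketch under assumed lab values (the vCenter address, credentials, and VM names are hypothetical; the group and rule names match the ones used above); treat it as an illustration rather than the exact procedure of this post.

# Minimal pyVmomi sketch: DRS host group, VM group and a "should run" VM-to-host rule.
# Assumptions: vCenter address/credentials and the VM names are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.corp.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

def find(vimtype, name):
    # Locate a managed object by name using a container view
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

cluster = find(vim.ClusterComputeResource, "Edge Cluster")
chassis_a_hosts = [find(vim.HostSystem, h) for h in ("esxcomp-01a", "esxcomp-02a")]
chassis_a_vms = [find(vim.VirtualMachine, v) for v in ("edge-green", "edge-blue")]  # hypothetical VM names

spec = vim.cluster.ConfigSpecEx(
    groupSpec=[
        vim.cluster.GroupSpec(operation="add",
            info=vim.cluster.HostGroup(name="Chassis A", host=chassis_a_hosts)),
        vim.cluster.GroupSpec(operation="add",
            info=vim.cluster.VmGroup(name="VM to Chassis A", vm=chassis_a_vms)),
    ],
    rulesSpec=[
        vim.cluster.RuleSpec(operation="add",
            info=vim.cluster.VmHostRuleInfo(
                name="VM's Should Run on Chassis A",
                vmGroupName="VM to Chassis A",
                affineHostGroupName="Chassis A",
                enabled=True,
                mandatory=False)),  # False = "should run", True = "must run"
    ])
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)

The same calls, with the Chassis B host and VM names, create the second pair of groups and its rule.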

Now the problem with the current DRS rules and the VM placement in this Edge cluster is that the Edge and the DLR Control VM are actually running on the same ESXi host. We need to create anti-affinity DRS rules.

Anti-Affinity Edge and DLR:

An Edge and a DLR that belong to the same tenant should not run on the same ESXi host.

For Green Tenant:

12

For Blue Tenant:

13
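Continuing the pyVmomi sketch above (same connection, find() helper and cluster object), a per-tenant anti-affinity rule could be added like this; the VM names are again hypothetical.

# Anti-affinity rule keeping a tenant's Edge and DLR Control VM on different hosts.
green_edge = find(vim.VirtualMachine, "edge-green")        # hypothetical Edge VM name
green_dlr = find(vim.VirtualMachine, "dlr-green-control")  # hypothetical DLR Control VM name

anti_rule = vim.cluster.RuleSpec(
    operation="add",
    info=vim.cluster.AntiAffinityRuleSpec(
        name="Separate Green Edge and DLR",
        enabled=True,
        vm=[green_edge, green_dlr]))

cluster.ReconfigureComputeResource_Task(
    spec=vim.cluster.ConfigSpecEx(rulesSpec=[anti_rule]), modify=True)

A second rule of the same form covers the blue tenant.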

The Final Result:

In the case of a failure of one of the ESXi hosts, or even a catastrophic failure of chassis A or B, we no longer face the problem where a tenant's Edge and DLR Control VM are on the same ESXi host.

15

 

Note:

The DLR Control VM can also be placed in the compute cluster, which avoids this design consideration.

Thanks to Max Ardica and  Tiran Efrat for reviewing this post.

 

NSX-v Troubleshooting L2 Connectivity

In this blog post we describe the methodology to troubleshoot L2 connectivity within the same logical switch L2 segment.

Some of the steps here can and should be done via the NSX GUI, vRealize Operations Manager 6.0 and vRealize Log Insight, so treat this as an educational post.

There are lots of CLI commands in this post :-). To view the full output of a CLI command you can scroll right.

 

High level approach to solve L2 problems:

1. Understand the problem.

2. Know your network topology.

3. Figure out if it is a configuration issue.

4. Check if the problem is within the physical space or the logical space.

5. Verify the NSX control plane from the ESXi hosts and the NSX Controllers.

6. Move the VM to a different ESXi host.

7. Capture traffic in the right spots.

 

Understand the Problem

VMs on the same logical switch (VXLAN 5001) are unable to communicate.

Demonstrating the problem:

web-sv-01a:~ # ping 172.16.10.12
PING 172.16.10.12 (172.16.10.12) 56(84) bytes of data.
^C
--- 172.16.10.12 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3023ms

 

Know your network topology:

TSHOT1

The VMs web-sv-01a and web-sv-02a reside on different compute hosts, esxcomp-01a and esxcomp-01b respectively.

web-sv-01a: IP: 172.16.10.11, MAC: 00:50:56:a6:7a:a2

web-sv-02a: IP: 172.16.10.12, MAC: 00:50:56:a6:a1:e3

 

Validate network topology

I know it sounds obvious, but let's make sure the VMs actually reside on the right ESXi hosts and are connected to the right VXLAN.

Verify that VM “web-sv-01a” actually resides on “esxcomp-01a”:

From esxcomp-01a run the command esxtop and then press “n” (Network):

esxcomp-01a # esxtop
   PORT-ID              USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s    PKTRX/s  MbRX/s %DRPTX %DRPRX
  33554433           Management        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  50331649           Management        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331650               vmnic0          - DvsPortset-0          8.41    0.02     437.81    3.17   0.00   0.00
  50331651     Shadow of vmnic0        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331652                 vmk0     vmnic0 DvsPortset-0          5.87    0.01       1.76    0.00   0.00   0.00
  50331653                 vmk1     vmnic0 DvsPortset-0          0.59    0.01       0.98    0.00   0.00   0.00
  50331654                 vmk2     vmnic0 DvsPortset-0          0.00    0.00       0.39    0.00   0.00   0.00
  50331655                 vmk3     vmnic0 DvsPortset-0          0.20    0.00       0.39    0.00   0.00   0.00
  50331656 35669:db-sv-01a.eth0     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331657 35888:web-sv-01a.eth     vmnic0 DvsPortset-0          4.89    0.01       3.72    0.01   0.00   0.00
  50331658          vdr-vdrPort     vmnic0 DvsPortset-0          2.15    0.00       0.00    0.00   0.00   0.00

In the output we can see “web-sv-01a.eth0”; another important piece of information is its “Port-ID”.

The “Port-ID” is a unique identifier for each virtual switch port; in our example web-sv-01a.eth0 has Port-ID “50331657”.

Find the vDS name:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan list
VDS ID                                           VDS Name      MTU  Segment ID     Gateway IP     Gateway MAC        Network Count  Vmknic Count
-----------------------------------------------  -----------  ----  -------------  -------------  -----------------  -------------  ------------
3b bf 0e 50 73 dc 49 d8-2e b0 df 20 91 e4 0b bd  Compute_VDS  1600  192.168.250.0  192.168.250.2  00:50:56:09:46:07              4             1

From the output, the vDS name is “Compute_VDS”.

Verify “web-sv-01a.eth0” is connected to VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
      50331657  68                   0
      50331658  vdrPort              0

From the output we see a VM connected to VXLAN 5001 with port ID 50331657; this is the same port ID as VM web-sv-01a.eth0.

Verification on esxcomp-01b:

esxcomp-01b esxtop
  PORT-ID              USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s    PKTRX/s  MbRX/s %DRPTX %DRPRX
  33554433           Management        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  50331649           Management        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331650               vmnic0          - DvsPortset-0          6.54    0.01     528.31    4.06   0.00   0.00
  50331651     Shadow of vmnic0        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331652                 vmk0     vmnic0 DvsPortset-0          2.77    0.00       1.19    0.00   0.00   0.00
  50331653                 vmk1     vmnic0 DvsPortset-0          0.59    0.00       0.40    0.00   0.00   0.00
  50331654                 vmk2     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331655                 vmk3     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331656 35663:web-sv-02a.eth     vmnic0 DvsPortset-0          3.96    0.01       3.57    0.01   0.00   0.00
  50331657          vdr-vdrPort     vmnic0 DvsPortset-0          2.18    0.00       0.00    0.00   0.00   0.00

In the output we can see that “web-sv-02a.eth0” has Port-ID “50331656”.

Verify “web-sv-02a.eth0” is connected to VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
      50331656  69                   0
      50331657  vdrPort              0

From the output we see the VM connected to VXLAN 5001 with port ID 50331656.

At this point we have verified that the VMs are located as drawn in the topology. Now we can start the actual troubleshooting steps.

Is the problem in the physical network?

Our first step will be to find out whether the problem is in the physical space or the logical space.

TSHOT2

The easy way to find out is to ping from the VTEP in esxcomp-01a to the VTEP in esxcomp-01b; before pinging, let's find out the VTEP IP addresses.

esxcomp-01a # esxcfg-vmknic -l
Interface  Port Group/DVPort   IP Family IP Address                              Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type         
vmk0       16                  IPv4      192.168.210.51                          255.255.255.0   192.168.210.255 00:50:56:09:08:3e 1500    65535     true    STATIC       
vmk1       26                  IPv4      10.20.20.51                             255.255.255.0   10.20.20.255    00:50:56:69:80:0f 1500    65535     true    STATIC       
vmk2       35                  IPv4      10.20.30.51                             255.255.255.0   10.20.30.255    00:50:56:64:70:9f 1500    65535     true    STATIC       
vmk3       44                  IPv4      192.168.250.51                          255.255.255.0   192.168.250.255 00:50:56:66:e2:ef 1600    65535     true    STATIC

From the output we can tell that the VTEP IP address (vmk3, the vmknic with MTU 1600) is 192.168.250.51.

Another command to find the VTEP IP address is:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
Vmknic Name  Switch Port ID  VDS Port ID  Endpoint ID  VLAN ID  IP              Netmask        IP Acquire Timeout  Multicast Group Count  Segment ID
-----------  --------------  -----------  -----------  -------  --------------  -------------  ------------------  ---------------------  -------------
vmk3               50331655  44                     0        0  192.168.250.51  255.255.255.0                   0                      0  192.168.250.0

The same command on esxcomp-01b:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
Vmknic Name  Switch Port ID  VDS Port ID  Endpoint ID  VLAN ID  IP              Netmask        IP Acquire Timeout  Multicast Group Count  Segment ID
-----------  --------------  -----------  -----------  -------  --------------  -------------  ------------------  ---------------------  -------------
vmk3               50331655  46                     0        0  192.168.250.53  255.255.255.0                   0                      0  192.168.250.0

The VTEP IP for esxcomp-01b is 192.168.250.53. Now let's add this information to our topology.

 

TSHOT3

Checks for VXLAN Routing:

NSX uses a separate IP stack for VXLAN traffic, so we need to verify that the default gateway is configured correctly for the VXLAN netstack.

From esxcomp-01a:

esxcomp-01a # esxcli network ip route ipv4 list -N vxlan
Network        Netmask        Gateway        Interface  Source
-------------  -------------  -------------  ---------  ------
default        0.0.0.0        192.168.250.2  vmk3       MANUAL
192.168.250.0  255.255.255.0  0.0.0.0        vmk3       MANUAL

From esxcomp-01b:

esxcomp-01b # esxcli network ip route ipv4 list -N vxlan
Network        Netmask        Gateway        Interface  Source
-------------  -------------  -------------  ---------  ------
default        0.0.0.0        192.168.250.2  vmk3       MANUAL
192.168.250.0  255.255.255.0  0.0.0.0        vmk3       MANUAL

In this lab the VTEPs of my two ESXi hosts are in the same L2 segment of the VTEP IP address space, and both VTEPs have the same default gateway.

Ping from the VTEP in esxcomp-01a to the VTEP located in esxcomp-01b.

The ping is sourced from the VXLAN IP stack with a payload size of 1570 bytes and the don't-fragment bit set (1570 bytes of ICMP data plus 8 bytes of ICMP header and 20 bytes of IP header give a 1598-byte packet, just under the 1600-byte MTU configured on the VTEP vmknic).

esxcomp-01a #  ping ++netstack=vxlan 192.168.250.53 -s 1570 -d
PING 192.168.250.53 (192.168.250.53): 1570 data bytes
1578 bytes from 192.168.250.53: icmp_seq=0 ttl=64 time=0.585 ms
1578 bytes from 192.168.250.53: icmp_seq=1 ttl=64 time=0.936 ms
1578 bytes from 192.168.250.53: icmp_seq=2 ttl=64 time=0.831 ms

--- 192.168.250.53 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.585/0.784/0.936 ms

The ping is successful.

If the ping fails with “-d” but works without it, it is an MTU problem; check the MTU on the physical switches.

Because the VXLAN transport network in this example is in the same L2 segment, we can view the ARP entries for the other VTEPs:

From esxcomp-01a:

esxcomp-01a # esxcli network ip neighbor list -N vxlan
Neighbor        Mac Address        Vmknic    Expiry  State  Type
--------------  -----------------  ------  --------  -----  -----------
192.168.250.52  00:50:56:64:f4:25  vmk3    1173 sec         Unknown
192.168.250.53  00:50:56:67:d9:91  vmk3    1171 sec         Unknown
192.168.250.2   00:50:56:09:46:07  vmk3    1187 sec         Autorefresh

It looks like our physical layer is not the issue.

 

Verify NSX control plane

During NSX host preparation, NSX Manager installs VIB agents, called the User World Agent (UWA), on the ESXi hosts.

The process responsible for communicating with the NSX Controllers is called netcpad.

The ESXi host uses its management VMkernel interface to create this secure channel over TCP/1234; the traffic is encrypted with SSL.

Part of the information netcpad sends to the NSX Controller is:

VMs: MAC and IP addresses.

VTEPs: MAC and IP addresses.

VXLAN: the VXLAN IDs.

Routing: routes learned from the DLR Control VM (explained in the next post).

TSHOT4

Based on this information the Controller learns the network state and builds its directory services.

To learn how the Controller cluster works and how to fix problems in the cluster itself, see NSX Controller Cluster Troubleshooting.

For two VMs to be able to talk to each other we need a working control plane. In this lab we have 3 NSX Controllers.

Verification commands need to be run from both the ESXi and the Controller side.

NSX Controller IP addresses: 192.168.110.201, 192.168.110.202, 192.168.110.203

Control Plane verification from ESXi point of view:

Verify esxcomp-01a has ESTABLISHED connections to the NSX Controllers (grep 1234 shows only TCP port 1234):

esxcomp-01a # esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.51:54153  192.168.110.202:1234  ESTABLISHED     35185  newreno  netcpa-worker
tcp         0       0  192.168.210.51:34656  192.168.110.203:1234  ESTABLISHED     34519  newreno  netcpa-worker
tcp         0       0  192.168.210.51:41342  192.168.110.201:1234  ESTABLISHED     34519  newreno  netcpa-worker

Verify esxcomp-01b has ESTABLISHED connections to the NSX Controllers:

esxcomp-01b # esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.56:16580  192.168.110.202:1234  ESTABLISHED     34517  newreno  netcpa-worker
tcp         0       0  192.168.210.56:49434  192.168.110.203:1234  ESTABLISHED     34678  newreno  netcpa-worker
tcp         0       0  192.168.210.56:12358  192.168.110.201:1234  ESTABLISHED     34516  newreno  netcpa-worker

Example of a problem with communication from an ESXi host to the NSX Controllers:

esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.51:54153  192.168.110.202:1234  TIME_WAIT           0
tcp         0       0  192.168.210.51:34656  192.168.110.203:1234  FIN_WAIT_2      34519  newreno
tcp         0       0  192.168.210.51:41342  192.168.110.201:1234  TIME_WAIT           0

If we can't see ESTABLISHED connections, check:

1. IP connectivity from the ESXi host to all NSX Controllers.

2. If you have a firewall between the ESXi host and the NSX Controllers, TCP/1234 needs to be open.

3. Whether netcpad is running on the ESXi host:

/etc/init.d/netcpad status
netCP agent service is not running


If netcpad is not running, start it with the command:

esxcomp-01a #/etc/init.d/netcpad start
Memory reservation set for netcpa
netCP agent service starts

Verify again:

esxcomp-01a # /etc/init.d/netcpad status
netCP agent service is running

 

Verify on esxcomp-01a that the control plane is enabled and the controller connection is up for VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
    5003  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            2                0                0
    5001  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            2                3                0
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                3                0
    5002  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)            1                2                0

Verify on esxcomp-01b that the control plane is enabled and the controller connection is up for VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
    5001  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            2                3                0
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                0                0
    5002  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)            1                2                0
    5003  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                0                0

Check that esxcomp-01a has learned the ARP entries of remote VMs on VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
IP            MAC                Flags
------------  -----------------  --------
172.16.10.12  00:50:56:a6:a1:e3  00001101

From this output we can see that esxcomp-01a has learned the ARP information of web-sv-02a.

Check that esxcomp-01b has learned the ARP entries of remote VMs on VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
IP            MAC                Flags
------------  -----------------  --------
172.16.10.11  00:50:56:a6:7a:a2  00010001

From this output we can see that esxcomp-01b has learned the ARP information of web-sv-01a.

What can we tell at this point?

esxcomp-01a:

Knows that web-sv-01a is a VM running on VXLAN 5001, with IP 172.16.10.11 and MAC address 00:50:56:a6:7a:a2.

Its communication to the Controller cluster is up for VXLAN 5001.

esxcomp-01b:

Knows that web-sv-02a is a VM running on VXLAN 5001, with IP 172.16.10.12 and MAC address 00:50:56:a6:a1:e3.

Its communication to the Controller cluster is up for VXLAN 5001.

So why can't web-sv-01a talk to web-sv-02a?

The answer to this question is another question: what does the NSX Controller know?

Control Plane verification from NSX Controller point of view:

We have 3 active Controllers; one of them is elected to manage VXLAN 5001. Remember slicing?

To find out who manages VXLAN 5001, SSH to one of the NSX Controllers, for example 192.168.110.202:

nsx-controller # show control-cluster logical-switches vni 5001
VNI      Controller      BUM-Replication ARP-Proxy Connections VTEPs
5001     192.168.110.201 Enabled         Enabled   0           0

The output says that 192.168.110.201 manages VXLAN 5001, so the next command will be run from 192.168.110.201:

nsx-controller # show control-cluster logical-switches vni 5001
VNI      Controller      BUM-Replication ARP-Proxy Connections VTEPs
5001     192.168.110.201 Enabled         Enabled   6           4

From this output we learn that VXLAN 5001 has 4 VTEPs connected to it and a total of 6 active connections.

At this point I would like to point you to an excellent blogger with lots of information about what happens under the hood in NSX.

His name is Dmitri Kalintsev; link to his post: NSX for vSphere: Controller “Connections” and “VTEPs”

From Dmitri's post:

“ESXi host joins a VNI in two cases:

  1. When a VM running on that host connects to VNI’s dvPg and its vNIC transitions into “Link Up” state; and
  2. When DLR kernel module on that host needs to route traffic to a VM on that VNI that’s running on a different host.”

We are not routing traffic between the VMs, so the DLR is not part of the game here.

Find out the VTEP IP addresses connected to VXLAN 5001:

nsx-controller # show control-cluster logical-switches vtep-table 5001
VNI      IP              Segment         MAC               Connection-ID
5001     192.168.250.53  192.168.250.0   00:50:56:67:d9:91 5
5001     192.168.250.52  192.168.250.0   00:50:56:64:f4:25 3
5001     192.168.250.51  192.168.250.0   00:50:56:66:e2:ef 4
5001     192.168.150.51  192.168.150.0   00:50:56:60:bc:e9 6

From this output we can see that both VTEPs, esxcomp-01a (192.168.250.51) and esxcomp-01b (192.168.250.53), are seen by the NSX Controller on VXLAN 5001.

The MAC addresses in this output are the VTEP MACs.

Find out which VM MAC addresses have been learned by the NSX Controller:

nsx-controller # show control-cluster logical-switches mac-table 5001
VNI      MAC               VTEP-IP         Connection-ID
5001     00:50:56:a6:7a:a2 192.168.250.51  4
5001     00:50:56:a6:a1:e3 192.168.250.53  5
5001     00:50:56:8e:45:33 192.168.150.51  6

The entry 00:50:56:a6:7a:a2 (behind VTEP 192.168.250.51) is the MAC of web-sv-01a, and 00:50:56:a6:a1:e3 (behind VTEP 192.168.250.53) is the MAC of web-sv-02a.

Find out which ARP entries of the VMs have been learned by the NSX Controller:

 

nsx-controller # show control-cluster logical-switches arp-table 5001
VNI      IP              MAC               Connection-ID
5001     172.16.10.11    00:50:56:a6:7a:a2 4
5001     172.16.10.12    00:50:56:a6:a1:e3 5
5001     172.16.10.10    00:50:56:8e:45:33 6

The entries for 172.16.10.11 and 172.16.10.12 show the exact IP/MAC of web-sv-01a and web-sv-02a.

To understand how the Controller has learned this information, read my post NSX-V IP Discovery.

Sometimes restarting the netcpad process can fix problems between an ESXi host and the NSX Controllers:

esxcomp-01a # /etc/init.d/netcpad restart
watchdog-netcpa: Terminating watchdog process with PID 4273913
Memory reservation released for netcpa
netCP agent service is stopped
Memory reservation set for netcpa
netCP agent service starts

Summary of controller verification:

The NSX Controller knows where the VMs are located, their IP addresses and their MAC addresses. It seems like the control plane is working just fine.

 

Move VM to different ESXi host

In NSX-v each ESXi host has its own UWA service daemon as part of the management and control plane; sometimes, when the UWA is not working as expected, VMs on that ESXi host will have connectivity issues.

A fast way to check this is to vMotion the non-working VMs from one ESXi host to another; if the VMs start to work, we need to focus on the control plane of the non-working ESXi host.

In this scenario, even after I vMotioned my VM to a different ESXi host, the problem didn't go away.

 

Capture in the right spots:

The pktcap-uw command allows capturing traffic at many different points in an NSX environment.

Before starting to capture all over the place, let's think about where the problem might be.

When a VM connects to a logical switch there are a few security services that a packet traverses; each service is represented by a different slot ID.

TSHOT5

SLOT 0: implements the vDS access list.

SLOT 1: the switch security module (swsec) captures DHCP Ack and ARP messages; this information is then forwarded to the NSX Controller.

SLOT 2: NSX Distributed Firewall.

We need to check whether VM traffic successfully passes the NSX Distributed Firewall, that is, after slot 2.

The capture command will need the slot 2 filter name for web-sv-01a.

From esxcomp-01a:

esxcomp-01a # summarize-dvfilter
~~~snip~~~~
world 35888 vmm0:web-sv-01a vcUuid:'50 26 c7 cd b6 f3 f4 bc-e5 33 3d 4b 25 5c 62 77'
 port 50331657 web-sv-01a.eth0
  vNic slot 2
   name: nic-35888-eth0-vmware-sfw.2
   agentName: vmware-sfw
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Dynamic Filter Creation
  vNic slot 1
   name: nic-35888-eth0-dvfilter-generic-vmware-swsec.1
   agentName: dvfilter-generic-vmware-swsec
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Alternate Opaque Channel

In the output we can see the VM name web-sv-01a, that a filter is applied at vNic slot 2, and the filter name: nic-35888-eth0-vmware-sfw.2

The pktcap-uw -A option lists the supported capture points:

esxcomp-01a # pktcap-uw -A
Supported capture points:
        1: Dynamic -- The dynamic inserted runtime capture point.
        2: UplinkRcv -- The function that receives packets from uplink dev
        3: UplinkSnd -- Function to Tx packets on uplink
        4: Vmxnet3Tx -- Function in vnic backend to Tx packets from guest
        5: Vmxnet3Rx -- Function in vnic backend to Rx packets to guest
        6: PortInput -- Port_Input function of any given port
        7: IOChain -- The virtual switch port iochain capture point.
        8: EtherswitchDispath -- Function that receives packets for switch
        9: EtherswitchOutput -- Function that sends out packets, from switch
        10: PortOutput -- Port_Output function of any given port
        11: TcpipDispatch -- Tcpip Dispatch function
        12: PreDVFilter -- The DVFIlter capture point
        13: PostDVFilter -- The DVFilter capture point
        14: Drop -- Dropped Packets capture point
        15: VdrRxLeaf -- The Leaf Rx IOChain for VDR
        16: VdrTxLeaf -- The Leaf Tx IOChain for VDR
        17: VdrRxTerminal -- Terminal Rx IOChain for VDR
        18: VdrTxTerminal -- Terminal Tx IOChain for VDR
        19: PktFree -- Packets freeing point

The capture command supports sniffing traffic at these interesting points; with PreDVFilter and PostDVFilter we can sniff traffic before or after the filtering action.

Capture after SLOT 2 filter:

pktcap-uw --capture PostDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_after.pcap
The session capture point is PostDVFilter
The name of the dvfilter is nic-35888-eth0-vmware-sfw.2
The session filter IP protocol is 0x1
The output file is web-sv-01a_after.pcap
No server port specifed, select 784 as the port
Local CID 2
Listen on port 784
Accept...Vsock connection from port 1049 cid 2
Destroying session 25

Dumped 0 packet to file web-sv-01a_after.pcap, dropped 0 packets.

PostDVFilter = capture after the named filter.

--proto=0x1 = capture only ICMP packets.

--dvfilter = the filter name as shown by the summarize-dvfilter command.

-o = the file where the captured traffic is written.

From the output of this command (“Dumped 0 packet”) we can tell that ICMP packets are not passing this filter.

We found our smoking gun 🙂

Now capture before the slot 2 filter:

pktcap-uw --capture PreDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_before.pcap
The session capture point is PreDVFilter
The name of the dvfilter is nic-35888-eth0-vmware-sfw.2
The session filter IP protocol is 0x1
The output file is web-sv-01a_before.pcap
No server port specifed, select 5782 as the port
Local CID 2
Listen on port 5782
Accept...Vsock connection from port 1050 cid 2
Dump: 6, broken : 0, drop: 0, file err: 0Destroying session 26

Dumped 6 packet to file web-sv-01a_before.pcap, dropped 0 packets.

Now we can see that packets were dumped (“Dumped 6 packet”). We can open the web-sv-01a_before.pcap capture file:

esxcomp-01a # tcpdump-uw -r web-sv-01a_before.pcap
reading from file web-sv-01a_before.pcap, link-type EN10MB (Ethernet)
20:15:31.389158 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18628, length 64
20:15:32.397225 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18629, length 64
20:15:33.405253 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18630, length 64
20:15:34.413356 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18631, length 64
20:15:35.421284 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18632, length 64
20:15:36.429219 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18633, length 64

Voila, the NSX DFW is blocking the traffic.

And now from the NSX GUI:

TSHOT6

Looking back on this article, step 3 (“Figure out if it is a configuration issue”) was intentionally skipped.

If we had checked the configuration settings, we would have noticed this problem immediately.

 

 

Summary of all CLI Commands for this post:

ESXi Commands:

esxtop
esxcfg-vmknic -l
esxcli network vswitch dvs vmware vxlan list
esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
esxcli network ip route ipv4 list -N vxlan
esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
esxcli network ip connection list | grep 1234
ping ++netstack=vxlan 192.168.250.53 -s 1570 -d
/etc/init.d/netcpad (status|start|)
pktcap-uw --capture PostDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_after.pcap

 

NSX Controller Commands:

show control-cluster logical-switches vni 5001
show control-cluster logical-switches vtep-table 5001
show control-cluster logical-switches mac-table 5001
show control-cluster logical-switches arp-table 5001

 

Working with NSXv API

The NSX RESTful API allows us to interact with NSX Manager, read the state of NSX and change its configuration. Some NSX tasks can only be done via the API and not the GUI.

The protocol used to communicate with NSX Manager is HTTPS; over this secure channel we send request and response data in XML format.

In this blog post we will use the Mozilla Firefox browser as a REST client.

 

Install Mozilla REST Client:

The REST Client add-on can be downloaded from:

https://addons.mozilla.org/en-US/firefox/addon/restclient/

To install the REST Client, click on the “Open Menu” button and then click on the “Add-ons” button:

API3

Click on “Install add-on from file”:

API4

Import the REST Client file from the folder you downloaded it to and click “Install”.

API5

Note: Mozilla browser will restart

To open the REST Client click on the Red button:

API6

One time configuration of REST Client:

Choose “Basic authentication”

API75

Type in the NSX Manager credentials and click Okay.

API65

Select “Headers” and then “Custom Header” from the menu.

API7

In Name, type “Content-Type”.

In Value, type “application/xml”.

Click “Save to favorite” and Okay.

API8

Now we can start to work with the REST Client.

There are a few different methods to read and configure the NSX Manager.

API111

The most frequent methods we will use with the REST Client are:

GET: to read information from NSX Manager.

POST: to change configuration in NSX Manager.

DELETE: to delete configuration in NSX Manager.

 

In the URL we need to type the NSX Manager URL:

https://nsxmgr-l-01a.corp.local/

Our first exercise will be to configure a logical switch.

Create Logical Switch with API

To do this we need to know which transport zone this logical switch will use.

In my lab we have two transport zones: Global-Transport-Zone and DMZ-Transport-Zone.

API16

NSX has a different scope ID representing each transport zone.

Assuming we would like to configure the logical switch in DMZ-Transport-Zone, we need to find out its transport zone scope ID.

Finding this scope can be done with the API call:

https://nsxmgr-l-01a.corp.local/api/2.0/vdn/scopes

Click on the red “SEND” button.

The status code 200 means we successfully retrieved the API information.

API17

After clicking on the Response Body we get the results: for DMZ-Transport-Zone the scope is vdnscope-2.

API18

For Global-Transport-Zone we get vdnscope-1.

API19

Since we want to configure this logical switch in DMZ-Transport-Zone, we will use vdnscope-2 in the POST API call.
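The same lookup can also be scripted. Below is a minimal Python sketch using the requests library; the NSX Manager hostname is the one from this lab, the credentials are placeholders, and verify=False is used only because the lab NSX Manager runs a self-signed certificate.

# Minimal sketch: list the transport zone scopes (vdnscope IDs) from NSX Manager.
import requests

NSX_MGR = "https://nsxmgr-l-01a.corp.local"
AUTH = ("admin", "VMware1!")                  # placeholder credentials
HEADERS = {"Content-Type": "application/xml"}

resp = requests.get(NSX_MGR + "/api/2.0/vdn/scopes",
                    auth=AUTH, headers=HEADERS, verify=False)
print(resp.status_code)   # 200 means the request succeeded
print(resp.text)          # XML listing each scope (vdnscope-1, vdnscope-2, ...) and its name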

Create the logical switch in DMZ-Transport-Zone:

The Method: POST

URL: https://nsxmgr-l-01a.corp.local/api/2.0/vdn/scopes/vdnscope-2/virtualwires

The Body is the payload of the actual configuration.

Body:

<virtualWireCreateSpec>
<name>DMZ-Logical-Switch-01</name>
<description>Created via REST API</description>
<tenantId>virtual wire tenant</tenantId>
<controlPlaneMode>UNICAST_MODE</controlPlaneMode>
</virtualWireCreateSpec>

 

After clicking SEND, the first attempt returns “415 Unsupported Media Type”.

The reason for this is a typo in the “Content-Type” header.

API20

My typo was “Con: application/xml” instead of “Content-Type: application/xml”.

API13

Another issue that can lead to the same error:

Ensure that the SSL certificate used by NSX Manager is accepted by the client (browser or fat client) before making the call. By default, NSX Manager uses a self-signed certificate that has to be accepted by the user. In such cases, ensure that the NSX Manager certificate is accepted before making the REST API call:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2094596

 

The REST API uses HTTP response status codes to inform clients of the result of their requests (as defined in RFC 2616). Five categories are defined:

1xx: Informational - communicates transfer protocol-level information.

2xx: Success - indicates that the client's request was accepted successfully.

3xx: Redirection - indicates that the client must take some additional action in order to complete the request.

4xx: Client Error - this category of error status codes points the finger at the client.

5xx: Server Error - the server takes responsibility for these error status codes.

 

After fixing this typo and clicking “SEND” we get status 201.

API14

Each logical switch has a unique virtual wire ID:

API21

From the NSX GUI, Logical Switches tab, we can see we successfully created the logical switch.

API15
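For completeness, the same create call can be scripted as well; this is a minimal Python sketch with the same assumptions as the earlier one (lab NSX Manager hostname, placeholder credentials, self-signed certificate).

# Minimal sketch: create the logical switch in DMZ-Transport-Zone (scope vdnscope-2).
import requests

NSX_MGR = "https://nsxmgr-l-01a.corp.local"
AUTH = ("admin", "VMware1!")                  # placeholder credentials
HEADERS = {"Content-Type": "application/xml"}

payload = """<virtualWireCreateSpec>
<name>DMZ-Logical-Switch-01</name>
<description>Created via REST API</description>
<tenantId>virtual wire tenant</tenantId>
<controlPlaneMode>UNICAST_MODE</controlPlaneMode>
</virtualWireCreateSpec>"""

resp = requests.post(NSX_MGR + "/api/2.0/vdn/scopes/vdnscope-2/virtualwires",
                     data=payload, auth=AUTH, headers=HEADERS, verify=False)
print(resp.status_code)   # 201 on success
print(resp.text)          # the new virtual wire ID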

Link to the official API guide:

http://pubs.vmware.com/NSX-61/topic/com.vmware.ICbase/PDF/nsx_61_api.pdf

 

 

NSX-V IP Discovery

Thanks to Dimitri Desmidt for the feedback.

IP Discovery allows NSX to suppress ARP messages over the logical switch.

To understand why we need IP Discovery and how it works, we need some background regarding ARP.

IP Discovery

What is ARP (Address Resolution Protocol)?

When VM1 needs to communicate with VM2 it needs to know VM2's MAC address; the way to learn MAC2 is to send a broadcast message (an ARP request) to all VMs in the same L2 segment (the same VLAN, or in the example above the same VXLAN 5001).

All VMs on this logical switch will see this message, including VM3, since it is a broadcast, but only VM2 will respond. The response is a unicast message from VM2 directly to VM1's MAC address, with VM2's MAC address (MAC2) in the response body.

 

topology

VM1 will cache VM2's IP-to-MAC mapping in its ARP table. The entry is kept for between a few seconds and a few minutes, depending on the operating system.

Windows 7, for example:

http://support.microsoft.com/kb/949589

If VM1 and VM2 do not talk again within this cache time window, VM1 will clear its ARP table entry for MAC2; the next time VM1 needs to talk to VM2, the VM1 OS will send an ARP message again to relearn VM2's MAC2.

Note: In the unlikely event that the NSX Controller doesn't know the MAC address for VM2's IP, the ARP request message is flooded, but only to the ESXi hosts that have VMs in logical switch 5001.

 

How IP Discovery works:

VMware NSX leverages the NSX Controller to achieve IP Discovery.

Inside an ESXi host running the NSX software there is a process called the User World Agent (UWA); this process communicates with the NSX Controller and updates the Controller's directory of MAC, IP and VTEP tables for the VMs residing on this ESXi host.

NSX Arch

When a VM connects to a logical switch there are a few security services that a packet traverses; each service is represented by a different slot ID.

Filter slots

SLOT 0: implements the vDS access list.

SLOT 1: the switch security module (swsec) captures DHCP Ack and ARP messages; this information is then forwarded to the NSX Controller.

SLOT 2: NSX Distributed Firewall.

From the figure above we now understand that slot 1 is the service responsible for implementing IP Discovery.

When VM1 powers up, even if its IP address is static, the VM will send out an ARP message to discover the MAC address of the default gateway; when the swsec module sees this ARP message it forwards it to the NSX Controller. That way the NSX Controller learns VM1's MAC1 address, and in the same way the Controller learns VM2's MAC2 address.

Now, when VM1 wants to talk to VM2 and MAC2 is not known to VM1, an ARP message is sent out to VXLAN 5001.

The UWA will send a query to the NSX Controller and ask if it knows MAC2; since the Controller already knows it, a unicast message is sent back to VM1 with MAC2, and the ARP broadcast message is not sent out to all VMs in VXLAN 5001.

Note: There is a 3-minute timer in the NSX Controller for ARP queries; if a host sends the same query within this time frame the Controller ignores the request and the broadcast message is sent out to all VMs in the logical switch.

 

IP Discovery Verification:

The easiest way to know if IP Discovery actually works is to run Wireshark in VM3 and clear the ARP table in VM1 with the command: arp -d.

Now ping from VM1 to VM2; the ARP broadcast message from VM1 should not be seen in VM3.

I would like to point out a great post by Dmitri Kalintsev that explains in a deep dive how IP Discovery works:

NSX-v under the hood: VXLAN ARP suppression

 

How to pass VMware VCIX-NV Certification

A few hours ago I received this email:

VCIX Email

I don't need to tell you how a man's face can change so fast after opening this kind of email :-)

I would like to share with you my impressions of this exam.

The total time for this exam is 4 sweating hours, built from 18 tasks.

It is a real lab covering most of the features you can find in NSX.

  1. The lab runs over a nested environment.
  2. The lab is very slow; it can drive you crazy during the exam.
  3. Some tasks depend on other tasks.
  4. It includes troubleshooting steps, some of them very difficult.
  5. You need to wait a few days to get your results (in my case it took 24 hours).
  6. During the exam you can write comments and give feedback.

 

The skills needed to pass this exam cover different roles in IT: network, security and virtualization.

I have done a few 8-hour labs in my life :-), and I think VMware is on the right track with the NV certification track.

My recommendations for people preparing for this exam:

Read my VCIX-NV page; I will update it from time to time.

Run the HOL labs over and over.

Try to break and fix the HOL labs to be prepared for the troubleshooting in the VCIX-NV exam.

You can find other blog posts about VCIX-NV:

http://lostdomain.org/tag/vcix-nv/

http://www.wynner.eu/computing/my-vcix-nv-exam-experience/

https://fojta.wordpress.com/2014/12/02/vcix-nv-exam-experience/

 

 

 

NSX-V Edge NAT

Thanks to Francis Guillier, Max Ardica and Tiran Efrat for the overview and feedback.

One of the most important NSX Edge features is NAT.
With NAT (Network Address Translation) we can change the source or destination IP addresses and TCP/UDP ports. Combining NAT and firewall rules can lead to confusion when we try to determine the correct IP address to which a firewall rule should apply.
To create the correct rule we need to understand the packet flow inside the NSX Edge in detail. In NSX Edge we have two different types of NAT: Source NAT (SNAT) and Destination NAT (DNAT).

 

SNAT

SNAT allows translating an internal IP address (for example a private IP address as described in RFC 1918) to a public external IP address.
In the figure below, the IP address of any VM in VXLAN 5001 that needs outside connectivity to the WAN can be translated to an external IP address (this mapping is configured on the Edge). For example, VM1 with IP address 172.16.10.11 needs to communicate with the WAN/Internet, so the NSX Edge can translate it to the 192.168.100.50 IP address configured on the Edge external interface.
Users in the external network are not aware of the internal private IP address.

 

SANT

DNAT

DNAT allows access to internal private IP addresses from the outside world.
In the example in the figure below, users from the WAN need to communicate with the server 172.16.10.11.
The NSX Edge DNAT mapping is configured so that users from outside connect to 192.168.100.51 and the NSX Edge translates this IP address to 172.16.10.11.

DNAT

Below is an outline of the packet flow process inside the Edge. The important parts are where the SNAT/DNAT action and the firewall decision are taken.

packet flow

We can see from this process that an ingress packet is evaluated against the firewall rules before the SNAT/DNAT translation.

Note: the actual packet flow details are more complicated with more action/decisions in Edge flow, but the emphasis here is on the NAT and FW functionalities only.

Note: the NAT function works only if the firewall service is enabled.

Enable Firewall Service

 

 

Firewall rules and SNAT

Because of this packet flow, the firewall rule for SNAT traffic needs to be applied to the internal IP address object and not to the IP address translated by the SNAT function. For example, when VM1 (172.16.10.11) needs to communicate with the WAN, the firewall rule needs to be:

fw and SNAT

 Firewall rules and DNAT

Because of this packet flow, the firewall rules for DNAT traffic need to be applied to the public IP address object and not to the private IP address used after the DNAT translation. When a user from the WAN sends traffic to 192.168.100.51, the packet is checked against this firewall rule first, and then NAT changes the destination IP address to 172.16.10.11.

fw and DNAT

DNAT Configuration

Users from outside need to access an internal web server by connecting to its public IP address.
The server's internal IP address is 172.16.10.11; the NAT IP address is 192.168.100.6.

 

DNAT

The first step is creating the external IP on the Edge; this IP is a secondary address because this Edge already has a main IP address configured in the 192.168.100.0/24 subnet.

Note: the main IP address is marked with a black dot (192.168.100.3).

For this example the DNAT IP address is 192.168.100.6.

DNAT1

Create a DNAT Rule in the Edge:

DNAT2

Now pay attention to the firewall rules on the Edge: a user coming from the outside will try to access the internal server by connecting to the public IP address 192.168.100.6. This implies that the firewall rule needs to allow this access.

DNAT3

DNAT Verification:

There are several ways to verify NAT is functioning as originally planned. In our example, users from any source address access the public IP address 192.168.100.6, and after the NAT translation the packet destination IP address is changed to 172.16.10.11.

The output of the command:

show nat

show nat

The output of the command:

show firewall flow

We can see that the packet received by the Edge is destined to the 192.168.100.6 address, while the return traffic is originated from a different IP address, 172.16.10.11 (the private IP address).
That means the DNAT translation is happening here.

show flow

We can capture the traffic and see the actual packets.
Capture Edge traffic on its outside interface vNic_0; in this example the user source IP address is 192.168.110.10 and the destination is 192.168.100.6.

The command for the capture is:
debug packet display interface vNic_0 port_80_and_src_192.168.110.10

debug packet 1

Capturing on the Edge internal interface vNic_1, we can see that the destination IP address has changed to 172.16.10.11 because of the DNAT translation:

debug packet 2

SNAT configuration

All the servers that are part of VXLAN segment 5001 (associated with the IP subnet 172.16.10.0/24) need to leverage SNAT translation (in this example to IP address 192.168.100.3) on the outside interface of the Edge to be able to communicate with the external network.

 

SNAT config

SNAT Configuration:

snat config 2

Edge Firewall Rules:

Allow 172.16.10.0/24 to go out:

SNAT config fw rule

 

Verification:

The output of the command:

show nat

show nat verfication

DNAT with L4 Address Translation (PAT)

DNAT with L4 address translation allows changing the Layer 4 TCP/UDP port.
For example, we would like to mask our internal SSH server port for all users from outside.
The new port will be TCP/222 instead of the regular SSH port TCP/22.

The user originates a connection to the web server on destination port TCP/222, but the NSX Edge will change it to TCP/22.

PAT

From the command line, the show nat command:

PAT show nat

NAT Order

In this specific scenario, we want to create the two following SNAT rules.

  • SNAT Rule 1:
    The IP addresses for the devices part of VXLAN 5001 (associated to the IP subnet 172.16.10.0/24) need to be translated to the Edge outside interface address 192.168.100.3.
  • SNAT Rule 2:
    Web-SRV-01a on VXLAN 5001 needs its IP address 172.16.10.4 to be translated to the Edge outside address 192.168.100.4.

nat order

In the configuration example above, traffic will never hit rule number 4 because 172.16.10.4 is part of subnet 172.16.10.0/24, so its IP address will be translated to 192.168.100.3 (and not the desired 192.168.100.4).

Order for SNAT rules is important!
We need to re-order the SNAT rules and put the more specific one on top, so that rule 3 will be hit for traffic originated from the IP address 172.16.10.4, whereas rule 4 will apply to all the other devices part of IP subnet 172.16.10.0/24.

nat reorder

After re-ordering:

nat after reorer

 

Another useful command:

show configuration nat