NSX-v Troubleshooting L2 Connectivity

In this blog post we describe the methodology to troubleshoot L2 connectivity within the same Logical switch L2 segment.

Some of the steps here can, and should, be done via the NSX GUI, vRealize Operations Manager 6.0, and vRealize Log Insight, so treat this as an educational post.

There are lots of CLI commands in this post :-). To view the full output of a CLI command, scroll right.

 

High level approach to solve L2 problems:

1. Understand the problem.

2. Know your network topology.

3. Figure out whether it is a configuration issue.

4. Check whether the problem is in the physical space or the logical space.

5. Verify the NSX control plane from the ESXi hosts and the NSX Controllers.

6. Move the VM to a different ESXi host.

7. Start capturing traffic in the right spots.

 

Understand the Problem

VMs on the same logical switch (VXLAN 5001) are unable to communicate.

Demonstrate the problem:

web-sv-01a:~ # ping 172.16.10.12
PING 172.16.10.12 (172.16.10.12) 56(84) bytes of data.
^C
--- 172.16.10.12 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3023ms

 

Know your network topology:

TSHOT1

The VMs web-sv-01a and web-sv-02a reside on different compute hosts, esxcomp-01a and esxcomp-01b respectively.

web-sv-01a: IP: 172.16.10.11,  MAC: 00:50:56:a6:7a:a2

web-sv-02a: IP:172.16.10.12, MAC: 00:50:56:a6:a1:e3

 

Validate network topology

It may sound trivial, but let’s make sure the VMs actually reside on the expected ESXi hosts and are connected to the right VXLAN.

Verify that VM “web-sv-01a” actually resides on “esxcomp-01a”:

From esxcomp-01a, run esxtop and then press “n” (Network):

esxcomp-01a # esxtop
   PORT-ID              USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s    PKTRX/s  MbRX/s %DRPTX %DRPRX
  33554433           Management        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  50331649           Management        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331650               vmnic0          - DvsPortset-0          8.41    0.02     437.81    3.17   0.00   0.00
  50331651     Shadow of vmnic0        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331652                 vmk0     vmnic0 DvsPortset-0          5.87    0.01       1.76    0.00   0.00   0.00
  50331653                 vmk1     vmnic0 DvsPortset-0          0.59    0.01       0.98    0.00   0.00   0.00
  50331654                 vmk2     vmnic0 DvsPortset-0          0.00    0.00       0.39    0.00   0.00   0.00
  50331655                 vmk3     vmnic0 DvsPortset-0          0.20    0.00       0.39    0.00   0.00   0.00
  50331656 35669:db-sv-01a.eth0     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331657 35888:web-sv-01a.eth     vmnic0 DvsPortset-0          4.89    0.01       3.72    0.01   0.00   0.00
  50331658          vdr-vdrPort     vmnic0 DvsPortset-0          2.15    0.00       0.00    0.00   0.00   0.00

The row for “web-sv-01a.eth0” (displayed truncated as “web-sv-01a.eth”) confirms that the VM runs on this host. Another important piece of information in that row is the “Port-ID”.

The “Port-ID” is a unique identifier for each virtual switch port. In our example, web-sv-01a.eth0 has Port-ID “50331657”.

Find the vDS name:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan list
VDS ID                                           VDS Name      MTU  Segment ID     Gateway IP     Gateway MAC        Network Count  Vmknic Count
-----------------------------------------------  -----------  ----  -------------  -------------  -----------------  -------------  ------------
3b bf 0e 50 73 dc 49 d8-2e b0 df 20 91 e4 0b bd  Compute_VDS  1600  192.168.250.0  192.168.250.2  00:50:56:09:46:07              4             1

The output shows that the vDS name is “Compute_VDS”.

Verify that “web-sv-01a.eth0” is connected to VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
      50331657  68                   0
      50331658  vdrPort              0

The first row shows a VM connected to VXLAN 5001 on switch port ID 50331657, which is the same Port-ID that esxtop reported for web-sv-01a.eth0.
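
If you want to script this cross-check (the Port-ID from esxtop must appear in the VXLAN port list), here is a minimal Python sketch. It assumes you have pasted the text of the port list command into a string; the function name `ports_on_vxlan` is hypothetical, and the sample is copied from the output above.

```python
def ports_on_vxlan(port_list_output: str) -> dict:
    """Map switch port ID -> VDS port ID from 'vxlan network port list' text."""
    ports = {}
    for line in port_list_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric switch port ID; header and dashes do not.
        if len(fields) >= 3 and fields[0].isdigit():
            ports[fields[0]] = fields[1]
    return ports


# Sample copied from the esxcomp-01a output for VXLAN 5001.
SAMPLE = """\
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
      50331657  68                   0
      50331658  vdrPort              0
"""

vm_port_id = "50331657"  # Port-ID of web-sv-01a.eth0 as reported by esxtop
if vm_port_id in ports_on_vxlan(SAMPLE):
    print("VM port is attached to VXLAN 5001")
else:
    print("VM port is NOT attached to VXLAN 5001")
```

The same function works for esxcomp-01b with Port-ID 50331656.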

Verification on esxcomp-01b:

esxcomp-01b # esxtop
  PORT-ID              USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s    PKTRX/s  MbRX/s %DRPTX %DRPRX
  33554433           Management        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  50331649           Management        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331650               vmnic0          - DvsPortset-0          6.54    0.01     528.31    4.06   0.00   0.00
  50331651     Shadow of vmnic0        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331652                 vmk0     vmnic0 DvsPortset-0          2.77    0.00       1.19    0.00   0.00   0.00
  50331653                 vmk1     vmnic0 DvsPortset-0          0.59    0.00       0.40    0.00   0.00   0.00
  50331654                 vmk2     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331655                 vmk3     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331656 35663:web-sv-02a.eth     vmnic0 DvsPortset-0          3.96    0.01       3.57    0.01   0.00   0.00
  50331657          vdr-vdrPort     vmnic0 DvsPortset-0          2.18    0.00       0.00    0.00   0.00   0.00

The output shows that “web-sv-02a.eth0” has Port-ID “50331656”.

Verify that “web-sv-02a.eth0” is connected to VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
      50331656  69                   0
      50331657  vdrPort              0

The first row shows a VM connected to VXLAN 5001 on switch port ID 50331656, which matches web-sv-02a.eth0.

At this point we have verified that the VMs are located as drawn in the topology. Now we can start the actual troubleshooting steps.

Is the problem in the physical network?

Our first step is to find out whether the problem is in the physical space or the logical space.

TSHOT2

The easy way to find out is to ping from the VTEP on esxcomp-01a to the VTEP on esxcomp-01b. Before we ping, let’s find the VTEP IP addresses.

esxcomp-01a # esxcfg-vmknic -l
Interface  Port Group/DVPort   IP Family IP Address                              Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type         
vmk0       16                  IPv4      192.168.210.51                          255.255.255.0   192.168.210.255 00:50:56:09:08:3e 1500    65535     true    STATIC       
vmk1       26                  IPv4      10.20.20.51                             255.255.255.0   10.20.20.255    00:50:56:69:80:0f 1500    65535     true    STATIC       
vmk2       35                  IPv4      10.20.30.51                             255.255.255.0   10.20.30.255    00:50:56:64:70:9f 1500    65535     true    STATIC       
vmk3       44                  IPv4      192.168.250.51                          255.255.255.0   192.168.250.255 00:50:56:66:e2:ef 1600    65535     true    STATIC

The vmk3 row (the only interface with an MTU of 1600) shows that the VTEP IP address is 192.168.250.51.

Another command to find VTEP IP address is:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
Vmknic Name  Switch Port ID  VDS Port ID  Endpoint ID  VLAN ID  IP              Netmask        IP Acquire Timeout  Multicast Group Count  Segment ID
-----------  --------------  -----------  -----------  -------  --------------  -------------  ------------------  ---------------------  -------------
vmk3               50331655  44                     0        0  192.168.250.51  255.255.255.0                   0                      0  192.168.250.0

Same commands in esxcomp-01b:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
Vmknic Name  Switch Port ID  VDS Port ID  Endpoint ID  VLAN ID  IP              Netmask        IP Acquire Timeout  Multicast Group Count  Segment ID
-----------  --------------  -----------  -----------  -------  --------------  -------------  ------------------  ---------------------  -------------
vmk3               50331655  46                     0        0  192.168.250.53  255.255.255.0                   0                      0  192.168.250.0

The VTEP IP for esxcomp-01b is 192.168.250.53. Now let’s add this information to our topology.

 

TSHOT3

Checks for VXLAN Routing:

NSX uses a separate IP stack for VXLAN traffic, so we need to verify that the default gateway is configured correctly for that stack.

From esxcomp-01a:

esxcomp-01a # esxcli network ip route ipv4 list -N vxlan
Network        Netmask        Gateway        Interface  Source
-------------  -------------  -------------  ---------  ------
default        0.0.0.0        192.168.250.2  vmk3       MANUAL
192.168.250.0  255.255.255.0  0.0.0.0        vmk3       MANUAL

From esxcomp-01b:

esxcomp-01b # esxcli network ip route ipv4 list -N vxlan
Network        Netmask        Gateway        Interface  Source
-------------  -------------  -------------  ---------  ------
default        0.0.0.0        192.168.250.2  vmk3       MANUAL
192.168.250.0  255.255.255.0  0.0.0.0        vmk3       MANUAL

In this lab, the VTEPs of my two ESXi hosts are in the same L2 segment, and both VTEPs have the same default gateway.
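
With more than two hosts, comparing route tables by eye gets tedious. As a rough sketch, assuming you have saved the text of the route list command from each host, the following Python extracts the default route of the VXLAN stack so the hosts can be compared. The function name `vxlan_default_route` is hypothetical; the samples are copied from the outputs above.

```python
def vxlan_default_route(route_output: str):
    """Return (gateway, interface) of the default route, or None if absent."""
    for line in route_output.splitlines():
        fields = line.split()
        # Columns: Network, Netmask, Gateway, Interface, Source
        if len(fields) >= 4 and fields[0] == "default":
            return fields[2], fields[3]
    return None


ESXCOMP_01A = """\
Network        Netmask        Gateway        Interface  Source
-------------  -------------  -------------  ---------  ------
default        0.0.0.0        192.168.250.2  vmk3       MANUAL
192.168.250.0  255.255.255.0  0.0.0.0        vmk3       MANUAL
"""

# In this lab esxcomp-01b returns the same table.
ESXCOMP_01B = ESXCOMP_01A

route_a = vxlan_default_route(ESXCOMP_01A)
route_b = vxlan_default_route(ESXCOMP_01B)
print("default route on 01a:", route_a)
print("same default route on both hosts:", route_a == route_b)
```

A result of None for a host means the VXLAN stack has no default route there, which only matters when the VTEPs are in different subnets.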

Ping from the VTEP on esxcomp-01a to the VTEP on esxcomp-01b.

The ping is sourced from the VXLAN IP stack, with a packet size of 1570 bytes and the don’t-fragment bit set to 1.

esxcomp-01a #  ping ++netstack=vxlan 192.168.250.53 -s 1570 -d
PING 192.168.250.53 (192.168.250.53): 1570 data bytes
1578 bytes from 192.168.250.53: icmp_seq=0 ttl=64 time=0.585 ms
1578 bytes from 192.168.250.53: icmp_seq=1 ttl=64 time=0.936 ms
1578 bytes from 192.168.250.53: icmp_seq=2 ttl=64 time=0.831 ms

--- 192.168.250.53 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.585/0.784/0.936 ms

The ping is successful.

If the ping fails with “-d” but works without it, you have an MTU problem. Check the MTU on the physical switches.
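
Why 1570 bytes? The arithmetic is worth spelling out. The sketch below is a worked example in Python, assuming standard IPv4 headers without options; the function names are hypothetical. It shows that a 1570-byte payload fits a 1600-byte MTU but not a 1500-byte one, and why a 1500-byte guest packet needs roughly 1550 bytes on the transport network.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_packet_size(payload: int) -> int:
    """Size of the IP packet produced by a ping with the given -s payload."""
    return payload + ICMP_HEADER + IPV4_HEADER


def fits(payload: int, mtu: int) -> bool:
    """True if the ping fits in one packet, which is required when -d is set."""
    return ping_packet_size(payload) <= mtu


# VXLAN overhead added on top of the guest IP packet:
# inner Ethernet (14) + VXLAN (8) + UDP (8) + outer IPv4 (20)
VXLAN_OVERHEAD = 14 + 8 + 8 + 20

print(ping_packet_size(1570))  # 1598
print(fits(1570, 1600))        # True:  passes when the path MTU is 1600
print(fits(1570, 1500))        # False: dropped when a switch is still at 1500
print(1500 + VXLAN_OVERHEAD)   # 1550:  transport MTU needed for a 1500-byte guest packet
```

The “1578 bytes from” lines in the ping output above are the 1570-byte payload plus the 8-byte ICMP header.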

Because the VTEPs in this example are in the same L2 segment, we can view the ARP entries for the other VTEPs:

From esxcomp-01a:

esxcomp-01a # esxcli network ip neighbor list -N vxlan
Neighbor        Mac Address        Vmknic    Expiry  State  Type
--------------  -----------------  ------  --------  -----  -----------
192.168.250.52  00:50:56:64:f4:25  vmk3    1173 sec         Unknown
192.168.250.53  00:50:56:67:d9:91  vmk3    1171 sec         Unknown
192.168.250.2   00:50:56:09:46:07  vmk3    1187 sec         Autorefresh

It looks like our physical layer is not the issue.

 

Verify NSX control plane

During NSX host preparation, NSX Manager installs VIBs on the ESXi hosts, including an agent called the User World Agent (UWA).

The process responsible for communicating with the NSX Controllers is called netcpad.

The ESXi host uses its VMkernel management interface to create this secure channel over TCP/1234; the traffic is encrypted with SSL.

Part of the information netcpad sends to the NSX Controller is:

VMs: MAC, IP.

VTEPs: MAC, IP.

VXLAN: the VXLAN IDs.

Routing: routes learned from the DLR Control VM (explained in the next post).

TSHOT4

Based on this information, the Controller learns the network state and builds its directory services.

To learn how the Controller cluster works and how to fix problems in the cluster itself, see NSX Controller Cluster Troubleshooting.

For two VMs to be able to talk to each other, we need a working control plane. In this lab we have three NSX Controllers.

The verification commands need to be run from both the ESXi side and the Controller side.

NSX controllers IP address: 192.168.110.201, 192.168.110.202, 192.168.110.203

Control Plane verification from ESXi point of view:

Verify that esxcomp-01a has ESTABLISHED connections to the NSX Controllers (grep 1234 shows only TCP port 1234).

esxcomp-01a # esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.51:54153  192.168.110.202:1234  ESTABLISHED     35185  newreno  netcpa-worker
tcp         0       0  192.168.210.51:34656  192.168.110.203:1234  ESTABLISHED     34519  newreno  netcpa-worker
tcp         0       0  192.168.210.51:41342  192.168.110.201:1234  ESTABLISHED     34519  newreno  netcpa-worker

Verify that esxcomp-01b has ESTABLISHED connections to the NSX Controllers:

esxcomp-01b # esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.56:16580  192.168.110.202:1234  ESTABLISHED     34517  newreno  netcpa-worker
tcp         0       0  192.168.210.56:49434  192.168.110.203:1234  ESTABLISHED     34678  newreno  netcpa-worker
tcp         0       0  192.168.210.56:12358  192.168.110.201:1234  ESTABLISHED     34516  newreno  netcpa-worker

Example of a communication problem between an ESXi host and the NSX Controllers:

esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.51:54153  192.168.110.202:1234  TIME_WAIT           0
tcp         0       0  192.168.210.51:34656  192.168.110.203:1234  FIN_WAIT_2      34519  newreno
tcp         0       0  192.168.210.51:41342  192.168.110.201:1234  TIME_WAIT           0

If we can’t see ESTABLISHED connections, check:

1. IP connectivity from the ESXi host to all NSX Controllers.

2. If there is a firewall between the ESXi host and the NSX Controllers, TCP/1234 needs to be open.

3. Whether netcpad is running on the ESXi host:

/etc/init.d/netcpad status
netCP agent service is not running

When the agent is running, the status looks like this:

esxcomp-01a # /etc/init.d/netcpad status
netCP agent service is running

If netcpad is not running, start it with this command:

esxcomp-01a # /etc/init.d/netcpad start
Memory reservation set for netcpa
netCP agent service starts

Verify again:

esxcomp-01a # /etc/init.d/netcpad status
netCP agent service is running
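
The ESTABLISHED check on TCP/1234 is easy to automate across many hosts. Here is a minimal Python sketch, assuming you feed it the text of the connection list command; the function names are hypothetical, and the samples are the healthy and broken outputs shown earlier.

```python
def controller_states(conn_output: str, port: int = 1234) -> dict:
    """Map controller IP -> TCP state for connections to the given remote port."""
    states = {}
    suffix = ":" + str(port)
    for line in conn_output.splitlines():
        fields = line.split()
        # Columns: proto, recv-q, send-q, local address, foreign address, state, ...
        if len(fields) >= 6 and fields[0] == "tcp" and fields[4].endswith(suffix):
            states[fields[4].rsplit(":", 1)[0]] = fields[5]
    return states


def missing_controllers(conn_output: str, controllers: list) -> list:
    """Controllers without an ESTABLISHED connection from this host."""
    states = controller_states(conn_output)
    return [ip for ip in controllers if states.get(ip) != "ESTABLISHED"]


CONTROLLERS = ["192.168.110.201", "192.168.110.202", "192.168.110.203"]

HEALTHY = """\
tcp         0       0  192.168.210.51:54153  192.168.110.202:1234  ESTABLISHED     35185  newreno  netcpa-worker
tcp         0       0  192.168.210.51:34656  192.168.110.203:1234  ESTABLISHED     34519  newreno  netcpa-worker
tcp         0       0  192.168.210.51:41342  192.168.110.201:1234  ESTABLISHED     34519  newreno  netcpa-worker
"""

BROKEN = """\
tcp         0       0  192.168.210.51:54153  192.168.110.202:1234  TIME_WAIT           0
tcp         0       0  192.168.210.51:34656  192.168.110.203:1234  FIN_WAIT_2      34519  newreno
tcp         0       0  192.168.210.51:41342  192.168.110.201:1234  TIME_WAIT           0
"""

print("healthy host, missing:", missing_controllers(HEALTHY, CONTROLLERS))
print("broken host, missing:", missing_controllers(BROKEN, CONTROLLERS))
```

An empty list means every controller has an ESTABLISHED session; anything else points you back to the three checks above.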

 

On esxcomp-01a, verify that the control plane is enabled and the controller connection is up for VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
    5003  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            2                0                0
    5001  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            2                3                0
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                3                0
    5002  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)            1                2                0

On esxcomp-01b, verify that the control plane is enabled and the controller connection is up for VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
    5001  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            2                3                0
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                0                0
    5002  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)            1                2                0
    5003  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                0                0
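
To pull the control plane state for one VNI out of this table, a small parser is enough. The following Python sketch assumes the columns are separated by two or more spaces, as in the output above; the function name `vni_status` and the returned keys are hypothetical.

```python
import re


def vni_status(network_list_output: str, vni: int):
    """Return control plane details for one VNI, or None if the VNI is not listed."""
    for line in network_list_output.splitlines():
        # Columns are separated by two or more spaces; values contain single spaces.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) >= 4 and cols[0] == str(vni):
            match = re.match(r"(\S+) \((\w+)\)", cols[3])
            return {
                "control_plane_enabled": cols[2].startswith("Enabled"),
                "controller": match.group(1) if match else None,
                "connection_up": bool(match) and match.group(2) == "up",
            }
    return None


# Sample copied from the esxcomp-01a output.
SAMPLE = """\
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
    5003  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            2                0                0
    5001  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            2                3                0
"""

print(vni_status(SAMPLE, 5001))
```

For VXLAN 5001 this reports controller 192.168.110.201 with the connection up, matching what we read from the table.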

Check that esxcomp-01a has learned the ARP entries of remote VMs on VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
IP            MAC                Flags
------------  -----------------  --------
172.16.10.12  00:50:56:a6:a1:e3  00001101

From this output we can see that esxcomp-01a has learned the ARP information of web-sv-02a.

Check that esxcomp-01b has learned the ARP entries of remote VMs on VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
IP            MAC                Flags
------------  -----------------  --------
172.16.10.11  00:50:56:a6:7a:a2  00010001

From this output we can see that esxcomp-01b has learned the ARP information of web-sv-01a.

What we can tell at this point:

esxcomp-01a:

Knows that web-sv-01a is a VM running on VXLAN 5001, with IP 172.16.10.11 and MAC address 00:50:56:a6:7a:a2.

Its communication with the Controller cluster is up for VXLAN 5001.

esxcomp-01b:

Knows that web-sv-02a is a VM running on VXLAN 5001, with IP 172.16.10.12 and MAC address 00:50:56:a6:a1:e3.

Its communication with the Controller cluster is up for VXLAN 5001.

So why can’t web-sv-01a talk to web-sv-02a?

The answer to this question is another question: what does the NSX Controller know?

Control Plane verification from NSX Controller point of view:

We have three active Controllers, and one of them is elected to manage VXLAN 5001. Remember slicing?

To find out which one manages VXLAN 5001, SSH to one of the NSX Controllers, for example 192.168.110.202:

nsx-controller # show control-cluster logical-switches vni 5001
VNI      Controller      BUM-Replication ARP-Proxy Connections VTEPs
5001     192.168.110.201 Enabled         Enabled   0           0

The Controller column shows that 192.168.110.201 manages VXLAN 5001, so the next commands will be run from 192.168.110.201:

nsx-controller # show control-cluster logical-switches vni 5001
VNI      Controller      BUM-Replication ARP-Proxy Connections VTEPs
5001     192.168.110.201 Enabled         Enabled   6           4

From this output we learn that VXLAN 5001 has 4 VTEPs connected to it and a total of 6 active connections.

At this point I would like to point you to an excellent blogger who shares lots of information about what happens under the hood in NSX.

His name is Dmitri Kalintsev. His post on this topic: NSX for vSphere: Controller “Connections” and “VTEPs”

From Dmitri’s post:

“ESXi host joins a VNI in two cases:

  1. When a VM running on that host connects to VNI’s dvPg and its vNIC transitions into “Link Up” state; and
  2. When DLR kernel module on that host needs to route traffic to a VM on that VNI that’s running on a different host.”

We are not routing traffic between the VMs, so the DLR is not part of the game here.

Find the VTEP IP addresses connected to VXLAN 5001:

nsx-controller # show control-cluster logical-switches vtep-table 5001
VNI      IP              Segment         MAC               Connection-ID
5001     192.168.250.53  192.168.250.0   00:50:56:67:d9:91 5
5001     192.168.250.52  192.168.250.0   00:50:56:64:f4:25 3
5001     192.168.250.51  192.168.250.0   00:50:56:66:e2:ef 4
5001     192.168.150.51  192.168.150.0   00:50:56:60:bc:e9 6

From this output we can see that the NSX Controller sees the VTEPs of both esxcomp-01a (192.168.250.51) and esxcomp-01b (192.168.250.53) on VXLAN 5001.

The MAC addresses in this output are the VTEPs’ MAC addresses.

Verify that the NSX Controller has learned the MAC addresses of the VMs:

nsx-controller # show control-cluster logical-switches mac-table 5001
VNI      MAC               VTEP-IP         Connection-ID
5001     00:50:56:a6:7a:a2 192.168.250.51  4
5001     00:50:56:a6:a1:e3 192.168.250.53  5
5001     00:50:56:8e:45:33 192.168.150.51  6

The first row shows the MAC of web-sv-01a, and the second row shows the MAC of web-sv-02a.

Verify that the NSX Controller has learned the ARP entries of the VMs:

 

nsx-controller # show control-cluster logical-switches arp-table 5001
VNI      IP              MAC               Connection-ID
5001     172.16.10.11    00:50:56:a6:7a:a2 4
5001     172.16.10.12    00:50:56:a6:a1:e3 5
5001     172.16.10.10    00:50:56:8e:45:33 6

The first two rows show the exact IP/MAC pairs of web-sv-01a and web-sv-02a.
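
The controller tables can be joined on the MAC address to answer the real question: behind which VTEP does the Controller think each VM IP lives? Here is a minimal Python sketch, assuming the tables are whitespace-separated with a header row as shown above; the function names are hypothetical.

```python
def parse_table(text: str) -> list:
    """Parse a whitespace-separated controller table into a list of row dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def locate_vms(arp_table: str, mac_table: str) -> dict:
    """Map VM IP -> (VM MAC, VTEP IP) by joining the ARP and MAC tables on MAC."""
    mac_to_vtep = {row["MAC"]: row["VTEP-IP"] for row in parse_table(mac_table)}
    return {
        row["IP"]: (row["MAC"], mac_to_vtep.get(row["MAC"]))
        for row in parse_table(arp_table)
    }


# Samples copied from the controller outputs for VNI 5001.
MAC_TABLE = """\
VNI      MAC               VTEP-IP         Connection-ID
5001     00:50:56:a6:7a:a2 192.168.250.51  4
5001     00:50:56:a6:a1:e3 192.168.250.53  5
5001     00:50:56:8e:45:33 192.168.150.51  6
"""

ARP_TABLE = """\
VNI      IP              MAC               Connection-ID
5001     172.16.10.11    00:50:56:a6:7a:a2 4
5001     172.16.10.12    00:50:56:a6:a1:e3 5
5001     172.16.10.10    00:50:56:8e:45:33 6
"""

for ip, (mac, vtep) in locate_vms(ARP_TABLE, MAC_TABLE).items():
    print(ip, mac, "behind VTEP", vtep)
```

A VM IP whose VTEP comes back as None would mean the Controller has an ARP entry but no MAC entry for it, which is worth investigating.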

To understand how the Controller learns this information, read my post NSX-V IP Discovery.

Sometimes restarting the netcpad process can fix problems between an ESXi host and the NSX Controllers.

esxcomp-01a # /etc/init.d/netcpad restart
watchdog-netcpa: Terminating watchdog process with PID 4273913
Memory reservation released for netcpa
netCP agent service is stopped
Memory reservation set for netcpa
netCP agent service starts

Summary of controller verification:

The NSX Controller knows where the VMs are located, along with their IP and MAC addresses. It seems the control plane works just fine.

 

Move VM to different ESXi host

In NSX-v, each ESXi host has its own UWA service daemon as part of the management and control plane. Sometimes, when the UWA is not working as expected, VMs on that ESXi host will have connectivity issues.

A fast way to check this is to vMotion the non-working VMs from one ESXi host to another. If the VMs start to work, we need to focus on the control plane of the non-working ESXi host.

In this scenario, even after I vMotioned my VM to a different ESXi host, the problem didn’t go away.

 

Capture in the right spots:

The pktcap-uw command can capture traffic at many points in an NSX environment.

Before we start capturing all over the place, let’s think about where the problem is likely to be.

When a VM connects to a logical switch, there are a few security services that a packet traverses. Each service is represented by a different slot ID.

TSHOT5

SLOT 0: implements the vDS access list.

SLOT 1: the Switch Security module (swsec) captures DHCP Ack and ARP messages; this information is then forwarded to the NSX Controller.

SLOT 2: NSX Distributed Firewall.

We need to check whether the VM traffic successfully passes the NSX Distributed Firewall, which means slot 2.

The capture command needs the slot 2 filter name for web-sv-01a.

From esxcomp-01a:

esxcomp-01a # summarize-dvfilter
~~~snip~~~~
world 35888 vmm0:web-sv-01a vcUuid:'50 26 c7 cd b6 f3 f4 bc-e5 33 3d 4b 25 5c 62 77'
 port 50331657 web-sv-01a.eth0
  vNic slot 2
   name: nic-35888-eth0-vmware-sfw.2
   agentName: vmware-sfw
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Dynamic Filter Creation
  vNic slot 1
   name: nic-35888-eth0-dvfilter-generic-vmware-swsec.1
   agentName: dvfilter-generic-vmware-swsec
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Alternate Opaque Channel

In the output we can see the vNIC web-sv-01a.eth0 on its port line, followed by the filter applied at slot 2 and that filter’s name: nic-35888-eth0-vmware-sfw.2
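
On a host with many VMs the summarize-dvfilter output gets long, so it can help to extract the filter name programmatically. The following Python sketch assumes the output layout shown above (a “port” line, then “vNic slot N” sections each containing a “name:” line); the function name `slot_filter_name` is hypothetical.

```python
from typing import Optional


def slot_filter_name(output: str, vnic: str, slot: int) -> Optional[str]:
    """Return the dvfilter name attached to the given vNIC at the given slot."""
    current_vnic = None
    current_slot = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("port "):
            # Example: "port 50331657 web-sv-01a.eth0"
            parts = line.split()
            current_vnic = parts[2] if len(parts) >= 3 else None
            current_slot = None
        elif line.startswith("vNic slot "):
            current_slot = int(line.split()[-1])
        elif line.startswith("name: ") and current_vnic == vnic and current_slot == slot:
            return line[len("name: "):]
    return None


# Sample trimmed from the esxcomp-01a output.
SAMPLE = """\
world 35888 vmm0:web-sv-01a vcUuid:'50 26 c7 cd b6 f3 f4 bc-e5 33 3d 4b 25 5c 62 77'
 port 50331657 web-sv-01a.eth0
  vNic slot 2
   name: nic-35888-eth0-vmware-sfw.2
   agentName: vmware-sfw
  vNic slot 1
   name: nic-35888-eth0-dvfilter-generic-vmware-swsec.1
   agentName: dvfilter-generic-vmware-swsec
"""

print(slot_filter_name(SAMPLE, "web-sv-01a.eth0", 2))
```

The returned name is the value to pass to the --dvfilter option of pktcap-uw.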

The pktcap-uw -A option lists the supported capture points:

esxcomp-01a # pktcap-uw -A
Supported capture points:
        1: Dynamic -- The dynamic inserted runtime capture point.
        2: UplinkRcv -- The function that receives packets from uplink dev
        3: UplinkSnd -- Function to Tx packets on uplink
        4: Vmxnet3Tx -- Function in vnic backend to Tx packets from guest
        5: Vmxnet3Rx -- Function in vnic backend to Rx packets to guest
        6: PortInput -- Port_Input function of any given port
        7: IOChain -- The virtual switch port iochain capture point.
        8: EtherswitchDispath -- Function that receives packets for switch
        9: EtherswitchOutput -- Function that sends out packets, from switch
        10: PortOutput -- Port_Output function of any given port
        11: TcpipDispatch -- Tcpip Dispatch function
        12: PreDVFilter -- The DVFIlter capture point
        13: PostDVFilter -- The DVFilter capture point
        14: Drop -- Dropped Packets capture point
        15: VdrRxLeaf -- The Leaf Rx IOChain for VDR
        16: VdrTxLeaf -- The Leaf Tx IOChain for VDR
        17: VdrRxTerminal -- Terminal Rx IOChain for VDR
        18: VdrTxTerminal -- Terminal Tx IOChain for VDR
        19: PktFree -- Packets freeing point

pktcap-uw can sniff traffic at many interesting points. With the PreDVFilter and PostDVFilter capture points we can sniff traffic before or after the filtering action.

Capture after the SLOT 2 filter:

pktcap-uw --capture PostDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_after.pcap
The session capture point is PostDVFilter
The name of the dvfilter is nic-35888-eth0-vmware-sfw.2
The session filter IP protocol is 0x1
The output file is web-sv-01a_after.pcap
No server port specifed, select 784 as the port
Local CID 2
Listen on port 784
Accept...Vsock connection from port 1049 cid 2
Destroying session 25

Dumped 0 packet to file web-sv-01a_after.pcap, dropped 0 packets.

PostDVFilter = capture after the named filter.

--proto=0x1 = capture only ICMP packets.

--dvfilter = the filter name as shown by the summarize-dvfilter command.

-o = the file to write the captured traffic to.

The final line of the output (“Dumped 0 packet”) tells us that the ICMP packets are not passing this filter.

We found our smoking gun 🙂

Now capture before SLOT 2 filter.

pktcap-uw --capture PreDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_before.pcap
The session capture point is PreDVFilter
The name of the dvfilter is nic-35888-eth0-vmware-sfw.2
The session filter IP protocol is 0x1
The output file is web-sv-01a_before.pcap
No server port specifed, select 5782 as the port
Local CID 2
Listen on port 5782
Accept...Vsock connection from port 1050 cid 2
Dump: 6, broken : 0, drop: 0, file err: 0Destroying session 26

Dumped 6 packet to file web-sv-01a_before.pcap, dropped 0 packets.

Now we can see that 6 packets were dumped. We can open the captured file web-sv-01a_before.pcap:

esxcomp-01a # tcpdump-uw -r web-sv-01a_before.pcap
reading from file web-sv-01a_before.pcap, link-type EN10MB (Ethernet)
20:15:31.389158 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18628, length 64
20:15:32.397225 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18629, length 64
20:15:33.405253 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18630, length 64
20:15:34.413356 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18631, length 64
20:15:35.421284 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18632, length 64
20:15:36.429219 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18633, length 64

Voilà, the NSX DFW is blocking the traffic.
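
The reasoning behind the two captures boils down to comparing packet counts before and after the filter. Here is a minimal Python sketch of that comparison, assuming you have the “Dumped N packet” summary line from each pktcap-uw run; the function names and verdict strings are hypothetical.

```python
import re


def dumped_packets(pktcap_output: str) -> int:
    """Extract N from the 'Dumped N packet to file ...' summary line."""
    match = re.search(r"Dumped (\d+) packet", pktcap_output)
    if match is None:
        raise ValueError("no 'Dumped N packet' summary line found")
    return int(match.group(1))


def filter_verdict(pre_output: str, post_output: str) -> str:
    """Compare the PreDVFilter and PostDVFilter captures for one filter."""
    pre = dumped_packets(pre_output)
    post = dumped_packets(post_output)
    if pre == 0:
        return "no matching traffic reached the filter"
    if post == 0:
        return "traffic is dropped by the filter"
    return "traffic passes the filter"


# Summary lines copied from the two captures above.
PRE = "Dumped 6 packet to file web-sv-01a_before.pcap, dropped 0 packets."
POST = "Dumped 0 packet to file web-sv-01a_after.pcap, dropped 0 packets."

print(filter_verdict(PRE, POST))
```

Keep in mind that the two captures in this post were taken one after the other, so the counts are only comparable while the ping keeps running.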

And now from NSX GUI:

TSHOT6

Looking back, this article intentionally skipped step 3, “Configuration issue”.

If we had checked the configuration settings, we would have noticed this problem immediately.

 

 

Summary of all CLI Commands for this post:

ESXi Commands:

esxtop
esxcfg-vmknic -l
esxcli network vswitch dvs vmware vxlan list
esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
esxcli network ip route ipv4 list -N vxlan
esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
esxcli network ip connection list | grep 1234
ping ++netstack=vxlan 192.168.250.53 -s 1570 -d
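esxcli network ip neighbor list -N vxlan
/etc/init.d/netcpad (status|start|restart)
summarize-dvfilter
pktcap-uw -A
pktcap-uw --capture PostDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_after.pcap
pktcap-uw --capture PreDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_before.pcap
tcpdump-uw -r web-sv-01a_before.pcap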

 

NSX Controller Commands:

show control-cluster logical-switches vni 5001
show control-cluster logical-switches vtep-table 5001
show control-cluster logical-switches mac-table 5001
show control-cluster logical-switches arp-table 5001

 

13 comments on “NSX-v Troubleshooting L2 Connectivity”
  1. Javel1n says:

    Hallo Roie

    I really love your blog, keep up the good work.

    This article is literraly solved my problem. I check everything, and basicly my controller are down, and i need to restart process again 😀

    oh, 1 request, do you know how to connecting a physical vrf network connecting to non vrf logical network?

    Thanks

    • roie9876@gmail.com says:

      Thanks for kind words.
      Not sure I understand the question, Connect physical vrf to non vrf logical network, do you mean connect physical to NSX Logical Switch?

  2. Javel1n says:

    I managed a public cloud. Right now we try to improve our old technology.

    So, every tenant are using VRF to seperate every routing table. But We only managed only to Distributed ISP Device. So we dont touch routing at ISP core.

    The topology right this ( South to North )
    VM > N5K ( Distribution / TOR Switch ) > N7K ( GW for VM + VRF “Tagging” ) > ASR 9K ( Our Distributed ISP ) > ISP Core > Internet

    What i still not understand is, how do i use or going to do to mix a internal NSX, who have not vrf “tag”, and to mix with our physical vrf?

    My plan is, i try to switch our gateway at N7K to NSX Distributed Router

    This is what i plan to do, i try to use Spine and Leaf topology :

    VM > NSX Distributed Router ( GW ) > N5K > N7K ( As dataplane, L2 only ) > NSX Edge Router > ASR 9K ( Dedicated for L3 Routing ) > ISP Core > Internet

    So, where do i “tag” for this VRF?

    Thanks

    • roie9876@gmail.com says:

      NSX Edge is not (yet) vrf tagging enabled device.
      You can see the Edge as CE device in ISP topology, no vrf tagging.
      Think of the Edge as normal CE router uplink with VLAN connectivity to PE device.
      Does it make sense ?

  3. Kohei says:

    HI.

    What I believe is that there’s no “tag” in VRF.
    VRF is configured on each interface/vlan so no add-on information inside packets.

    (sorry if i misunderstood your comments)

  4. csaif7 says:

    One of the best blog I read that explains things in detail with examples.

    Great Work!

  5. Anonymous says:

    Hi,

    In my setup, the VM’s on 2 different hosts is not able to ping when attached to a similar VXLAN wire. Your commands are helpful.

    For me the following command doesnt work.

    ping ++netstack=vxlan 192.168.250.53 -s 1570 -d

    when I change to -s 1450 or less than 1500, it works. What could be wrong ?
    Your help is appreciated.

    • roie9876@gmail.com says:

      Then you have an MTU issue in your physical switch. Find a way to increase your physical switch MTU to 1600.

  6. Simon Reynolds says:

    Hi Roie,

    Thanks for the great information you provide in your blog. It has really helped me understand NSX better.

    I’m trying to get to grips with some cli commands to help troubleshoot NSX — in particular regarding how esxtop and pktcap-uw can be used to see vxlan traffic. I have a couple of questions…

    By the way, I’m using a lab identical to the one used on VMware’s NSX 6.1 ICM course.

    1) If I ping from one vm to another where both vms are connected to the same logical switch (say, vni 5001) but each is running on a different esxi host, then esxtop does not show any traffic through the vmknics associated with the VTEPs of the logical switch. I was expecting to see the vxlan encapsulated frame traffic through the vmknics of the VTEPs on the source and destination esxi hosts.

    I initially thought that that might be because esxtop was not seeing traffic of the vxlan TCP/IP stack but I don’t think that can be true because a ping from one esxi to another (like the following, where the IP 192.168.250.52 is the VTEP IP on another esxi host):

    # ping ++netstack=vxlan 192.168.250.52

    does show the traffic through the VTEP’s vmk in esxtop.

    2) Using pktcap-uw

    If I ping between the two vms as above I see that the following pktcap-uw command captures the full vxlan frame at just before (stage 0) the physical vmnic. In the command below the IP 192.168.250.51 is the VTEP IP on the esxi host where the command is run, so pktcap-uw is capturing the vxlan frames carrying the ping reply:

    # pktcap-uw --uplink vmnic0 --stage 0 --dstip 192.168.250.51

    But I can’t get pktcap-uw to show any capture of the vxlan frames further up the TCP/IP stack, say at the vmk level. For example,

    # pktcap-uw --vmk vmk3

    shows no output (where vmk3 is the VTEP’s vmknic).

    Any advice would be gratefully received.

    Cheers

    Simon

    • roie9876@gmail.com says:

      Hello Simon
      I think you need a double dash (“--”) before each long option, so instead of a single dash try:
      “pktcap-uw --uplink vmnic0 --stage 0 --dstip 192.168.250.51”

      BTW you can also capture with the “capture” switch:
      pktcap-uw --capture UplinkRcv --uplink vmnic0
      or
      pktcap-uw --capture UplinkSnd --uplink vmnic0

      And to capture on vmk3 you need:
      pktcap-uw --vmk vmk3

      Cheers

  7. Simon Reynolds says:

    Hi Roie,

    Thanks for your reply.

    I’m ok with the syntax of the command — I understand the need for “double -“, for example before the uplink flag.

    My question was regarding the fact that if vmk3 is the vmknic associated with the VTEP on an esxi host, then

    # pktcap-uw --vmk vmk3

    works (it captures an occasional arp) but it doesn’t seem to capture any vxlan encapsulated frames through the VTEP’s vmknic (say, for example, the vxlan frames wrapping the pings between vms on the same logical switch)

    Also,

    # esxtop

    shows no traffic through vmk3

    I’m not clear why these commands don’t show vxlan traffic

    Cheers

    Simon

  8. Matin says:

    Hi

    When i run the command “esxcli network vswitch dvs vmware vxlan network list --vds-name=Compute_VDS”

    The Controller connection is down

    Do i need to restart the controllers ? or any command to perform this task to make the connection (UP)

    Keep up the good work, appreciated.

    Thanks
