NSX-v Troubleshooting L2 Connectivity

In this blog post we describe a methodology for troubleshooting L2 connectivity between VMs on the same logical switch (L2 segment).

Some of the steps here can and should be done via the NSX GUI, vRealize Operations Manager 6.0 and vRealize Log Insight, so treat this as an educational post.

There are lots of CLI commands in this post :-). To view the full output of a CLI command you can scroll right.

 

High level approach to solve L2 problems:

1. Understand the problem.

2. Know your network topology.

3. Figure out if it is a configuration issue.

4. Check whether the problem is in the physical space or the logical space.

5. Verify the NSX control plane from both the ESXi hosts and the NSX Controllers.

6. Move the VM to a different ESXi host.

7. Capture traffic in the right spots.

 

Understand the Problem

VMs on the same logical switch (VXLAN 5001) are unable to communicate.

Demonstrating the problem:

web-sv-01a:~ # ping 172.16.10.12
PING 172.16.10.12 (172.16.10.12) 56(84) bytes of data.
^C
--- 172.16.10.12 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3023ms

 

Know your network topology:

TSHOT1

The VMs web-sv-01a and web-sv-02a reside on different compute hosts, esxcomp-01a and esxcomp-01b respectively.

web-sv-01a: IP: 172.16.10.11, MAC: 00:50:56:a6:7a:a2

web-sv-02a: IP: 172.16.10.12, MAC: 00:50:56:a6:a1:e3

 

Validate network topology

I know it sounds obvious, but let's make sure the VMs actually reside on the right ESXi hosts and are connected to the right VXLAN.

Verify that VM “web-sv-01a” actually resides on “esxcomp-01a”:

From esxcomp-01a run esxtop, then press “n” (Network):

esxcomp-01a # esxtop
   PORT-ID              USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s    PKTRX/s  MbRX/s %DRPTX %DRPRX
  33554433           Management        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  50331649           Management        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331650               vmnic0          - DvsPortset-0          8.41    0.02     437.81    3.17   0.00   0.00
  50331651     Shadow of vmnic0        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331652                 vmk0     vmnic0 DvsPortset-0          5.87    0.01       1.76    0.00   0.00   0.00
  50331653                 vmk1     vmnic0 DvsPortset-0          0.59    0.01       0.98    0.00   0.00   0.00
  50331654                 vmk2     vmnic0 DvsPortset-0          0.00    0.00       0.39    0.00   0.00   0.00
  50331655                 vmk3     vmnic0 DvsPortset-0          0.20    0.00       0.39    0.00   0.00   0.00
  50331656 35669:db-sv-01a.eth0     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331657 35888:web-sv-01a.eth     vmnic0 DvsPortset-0          4.89    0.01       3.72    0.01   0.00   0.00
  50331658          vdr-vdrPort     vmnic0 DvsPortset-0          2.15    0.00       0.00    0.00   0.00   0.00

In the output we can see the entry “web-sv-01a.eth0”; another important piece of information is its “Port-ID”.

The “Port-ID” is a unique identifier for each virtual switch port; in our example web-sv-01a.eth0 has Port-ID “50331657”.
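As a side note, if you prefer a non-interactive alternative to esxtop, the same port information can usually be pulled with esxcli (a quick sketch; the world ID 35888 is taken from the esxtop output above):

esxcomp-01a # esxcli network vm list
esxcomp-01a # esxcli network vm port list -w 35888

The first command lists the VM networking worlds with their world IDs; the second lists the switch ports of that world, including the Port ID and MAC address.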

Find the vDS name:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan list
VDS ID                                           VDS Name      MTU  Segment ID     Gateway IP     Gateway MAC        Network Count  Vmknic Count
-----------------------------------------------  -----------  ----  -------------  -------------  -----------------  -------------  ------------
3b bf 0e 50 73 dc 49 d8-2e b0 df 20 91 e4 0b bd  Compute_VDS  1600  192.168.250.0  192.168.250.2  00:50:56:09:46:07              4             1

From the output the vDS name is “Compute_VDS”.
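If you only need the list of distributed switches the host participates in (without the VXLAN details), this more general command works as well:

esxcomp-01a # esxcli network vswitch dvs vmware list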

Verify “web-sv-01a.eth0” is connected to VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
      50331657  68                   0
      50331658  vdrPort              0

The output shows a VM connected to VXLAN 5001 with switch port ID 50331657, the same port ID we found for web-sv-01a.eth0.

Verification in esxcomp-01b:

esxcomp-01b # esxtop
  PORT-ID              USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s    PKTRX/s  MbRX/s %DRPTX %DRPRX
  33554433           Management        n/a vSwitch0              0.00    0.00       0.00    0.00   0.00   0.00
  50331649           Management        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331650               vmnic0          - DvsPortset-0          6.54    0.01     528.31    4.06   0.00   0.00
  50331651     Shadow of vmnic0        n/a DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331652                 vmk0     vmnic0 DvsPortset-0          2.77    0.00       1.19    0.00   0.00   0.00
  50331653                 vmk1     vmnic0 DvsPortset-0          0.59    0.00       0.40    0.00   0.00   0.00
  50331654                 vmk2     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331655                 vmk3     vmnic0 DvsPortset-0          0.00    0.00       0.00    0.00   0.00   0.00
  50331656 35663:web-sv-02a.eth     vmnic0 DvsPortset-0          3.96    0.01       3.57    0.01   0.00   0.00
  50331657          vdr-vdrPort     vmnic0 DvsPortset-0          2.18    0.00       0.00    0.00   0.00   0.00

From the output we can see that “web-sv-02a.eth0” has Port-ID “50331656”.

Verify “web-sv-02a.eth0” is connected to VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
      50331656  69                   0
      50331657  vdrPort              0

The output shows the VM connected to VXLAN 5001 with switch port ID 50331656, matching web-sv-02a.eth0.

At this point we have verified that the VMs are located as drawn in the topology. Now let's start the actual troubleshooting steps.

Is the problem in the physical network?

Our first step is to find out whether the problem is in the physical space or the logical space.

TSHOT2

The easy way to find out is to ping from the VTEP on esxcomp-01a to the VTEP on esxcomp-01b. Before pinging, let's find the VTEP IP addresses.

esxcomp-01a # esxcfg-vmknic -l
Interface  Port Group/DVPort   IP Family IP Address                              Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type         
vmk0       16                  IPv4      192.168.210.51                          255.255.255.0   192.168.210.255 00:50:56:09:08:3e 1500    65535     true    STATIC       
vmk1       26                  IPv4      10.20.20.51                             255.255.255.0   10.20.20.255    00:50:56:69:80:0f 1500    65535     true    STATIC       
vmk2       35                  IPv4      10.20.30.51                             255.255.255.0   10.20.30.255    00:50:56:64:70:9f 1500    65535     true    STATIC       
vmk3       44                  IPv4      192.168.250.51                          255.255.255.0   192.168.250.255 00:50:56:66:e2:ef 1600    65535     true    STATIC

From the output we can tell that the VTEP IP address is 192.168.250.51 on vmk3 (the vmknic with MTU 1600).

Another command to find the VTEP IP address:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
Vmknic Name  Switch Port ID  VDS Port ID  Endpoint ID  VLAN ID  IP              Netmask        IP Acquire Timeout  Multicast Group Count  Segment ID
-----------  --------------  -----------  -----------  -------  --------------  -------------  ------------------  ---------------------  -------------
vmk3               50331655  44                     0        0  192.168.250.51  255.255.255.0                   0                      0  192.168.250.0

The same command on esxcomp-01b:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
Vmknic Name  Switch Port ID  VDS Port ID  Endpoint ID  VLAN ID  IP              Netmask        IP Acquire Timeout  Multicast Group Count  Segment ID
-----------  --------------  -----------  -----------  -------  --------------  -------------  ------------------  ---------------------  -------------
vmk3               50331655  46                     0        0  192.168.250.53  255.255.255.0                   0                      0  192.168.250.0

The VTEP IP for esxcomp-01b is 192.168.250.53. Now let's add this info to our topology.

 

TSHOT3

Check VXLAN routing:

NSX uses a separate IP stack for VXLAN traffic, so we need to verify that the default gateway is configured correctly for the VXLAN netstack.

From esxcomp-01a:

esxcomp-01a # esxcli network ip route ipv4 list -N vxlan
Network        Netmask        Gateway        Interface  Source
-------------  -------------  -------------  ---------  ------
default        0.0.0.0        192.168.250.2  vmk3       MANUAL
192.168.250.0  255.255.255.0  0.0.0.0        vmk3       MANUAL

From esxcomp-01b:

esxcomp-01b # esxcli network ip route ipv4 list -N vxlan
Network        Netmask        Gateway        Interface  Source
-------------  -------------  -------------  ---------  ------
default        0.0.0.0        192.168.250.2  vmk3       MANUAL
192.168.250.0  255.255.255.0  0.0.0.0        vmk3       MANUAL

In this lab both ESXi hosts' VTEPs are on the same L2 segment, and both VTEPs have the same default gateway.

Ping from the VTEP on esxcomp-01a to the VTEP on esxcomp-01b.

The ping is sourced from the VXLAN IP stack with a payload size of 1570 bytes and the don't-fragment bit set (1570 bytes of ICMP payload plus 8 bytes of ICMP header and 20 bytes of IP header is 1598 bytes, which only fits if the transport network carries the 1600-byte MTU required for VXLAN).

esxcomp-01a #  ping ++netstack=vxlan 192.168.250.53 -s 1570 -d
PING 192.168.250.53 (192.168.250.53): 1570 data bytes
1578 bytes from 192.168.250.53: icmp_seq=0 ttl=64 time=0.585 ms
1578 bytes from 192.168.250.53: icmp_seq=1 ttl=64 time=0.936 ms
1578 bytes from 192.168.250.53: icmp_seq=2 ttl=64 time=0.831 ms

--- 192.168.250.53 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.585/0.784/0.936 ms

The ping is successful.

If the ping fails with “-d” but works without it, it is an MTU problem; check the MTU on the physical switches.

Because the VXLAN transport network in this example is a single L2 segment, we can view the ARP entries of the other VTEPs:

From esxcomp-01a:

esxcomp-01a # esxcli network ip neighbor list -N vxlan
Neighbor        Mac Address        Vmknic    Expiry  State  Type
--------------  -----------------  ------  --------  -----  -----------
192.168.250.52  00:50:56:64:f4:25  vmk3    1173 sec         Unknown
192.168.250.53  00:50:56:67:d9:91  vmk3    1171 sec         Unknown
192.168.250.2   00:50:56:09:46:07  vmk3    1187 sec         Autorefresh

Looks like the physical layer is not the issue.

 

Verify NSX control plane

During NSX host preparation, NSX Manager installs VIBs on the ESXi hosts, including the User World Agent (UWA).

The process responsible for communicating with the NSX Controllers is called netcpad.

The ESXi host uses its management VMkernel interface to create this secure channel over TCP/1234; the traffic is encrypted with SSL.

Part of the information netcpad sends to the NSX Controllers is:

VMs: MAC and IP addresses.

VTEPs: MAC and IP addresses.

VXLAN: the VXLAN IDs.

Routing: routes learned from the DLR Control VM (explained in the next post).

TSHOT4

Based on this information the Controller learns the network state and builds its directory services.

To learn how the Controller cluster works and how to fix problems in the cluster itself, see NSX Controller Cluster Troubleshooting.

For two VMs to be able to talk to each other we need a working control plane. In this lab we have three NSX Controllers.

Verification commands need to be run from both the ESXi host and the Controller side.

NSX controllers IP address: 192.168.110.201, 192.168.110.202, 192.168.110.203

Control Plane verification from ESXi point of view:

Verify esxcomp-01a has ESTABLISHED connections to the NSX Controllers (grep 1234 to show only TCP port 1234):

esxcomp-01a # esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.51:54153  192.168.110.202:1234  ESTABLISHED     35185  newreno  netcpa-worker
tcp         0       0  192.168.210.51:34656  192.168.110.203:1234  ESTABLISHED     34519  newreno  netcpa-worker
tcp         0       0  192.168.210.51:41342  192.168.110.201:1234  ESTABLISHED     34519  newreno  netcpa-worker

Verify esxcomp-01b has ESTABLISHED connections to the NSX Controllers:

esxcomp-01b # esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.56:16580  192.168.110.202:1234  ESTABLISHED     34517  newreno  netcpa-worker
tcp         0       0  192.168.210.56:49434  192.168.110.203:1234  ESTABLISHED     34678  newreno  netcpa-worker
tcp         0       0  192.168.210.56:12358  192.168.110.201:1234  ESTABLISHED     34516  newreno  netcpa-worker

Example of a problem with communication from an ESXi host to the NSX Controllers:

esxcli network ip  connection list | grep 1234
tcp         0       0  192.168.210.51:54153  192.168.110.202:1234  TIME_WAIT           0
tcp         0       0  192.168.210.51:34656  192.168.110.203:1234  FIN_WAIT_2      34519  newreno
tcp         0       0  192.168.210.51:41342  192.168.110.201:1234  TIME_WAIT           0

If we can't see ESTABLISHED connections, check:

1. IP connectivity from the ESXi host to all NSX Controllers.

2. If you have a firewall between the ESXi hosts and the NSX Controllers, TCP/1234 needs to be open.

3. Whether netcpad is running on the ESXi host:

/etc/init.d/netcpad status
netCP agent service is not running

If netcpad is not running, start it with:

esxcomp-01a # /etc/init.d/netcpad start
Memory reservation set for netcpa
netCP agent service starts

Verify again:

esxcomp-01a # /etc/init.d/netcpad status
netCP agent service is running
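If netcpad is running but the sessions to the controllers still do not come up, it can also help to confirm which controller IP addresses NSX Manager pushed down to the host. On NSX-v hosts this list is normally kept in the netcpa configuration file (path shown as commonly documented; verify on your build):

esxcomp-01a # cat /etc/vmware/netcpa/config-by-vsm.xml

The controller IPs in this file should match the controllers you expect (192.168.110.201-203 in this lab).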

 

Verify on esxcomp-01a that the control plane is enabled and the controller connection is up for VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
    5003  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            2                0                0
    5001  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            2                3                0
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                3                0
    5002  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)            1                2                0

Verify on esxcomp-01b that the control plane is enabled and the controller connection is up for VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
    5001  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            2                3                0
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                0                0
    5002  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)            1                2                0
    5003  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.202 (up)            1                0                0
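Besides the per-VNI control plane status, each host also keeps a table of the remote VTEPs it knows about for a logical switch. A quick sketch using the same esxcli namespace (verify the exact sub-command on your NSX build):

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network vtep list --vds-name Compute_VDS --vxlan-id=5001

On esxcomp-01a this should list the VTEP of esxcomp-01b (192.168.250.53) for VXLAN 5001.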

Check that esxcomp-01a has learned the ARP entries of remote VMs on VXLAN 5001:

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
IP            MAC                Flags
------------  -----------------  --------
172.16.10.12  00:50:56:a6:a1:e3  00001101

From this output we can see that esxcomp-01a has learned the ARP info of web-sv-02a.

Check that esxcomp-01b has learned the ARP entries of remote VMs on VXLAN 5001:

esxcomp-01b # esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
IP            MAC                Flags
------------  -----------------  --------
172.16.10.11  00:50:56:a6:7a:a2  00010001

From this output we can see that esxcomp-01b has learned the ARP info of web-sv-01a.
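The host-level MAC table for the VNI can be checked the same way; the MAC of the remote VM should point at the remote VTEP (again a sketch, using the same esxcli namespace as above):

esxcomp-01a # esxcli network vswitch dvs vmware vxlan network mac list --vds-name Compute_VDS --vxlan-id=5001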

What we can tell at this point:

esxcomp-01a:

Knows web-sv-01a is a VM running on VXLAN 5001, with IP 172.16.10.11 and MAC address 00:50:56:a6:7a:a2.

Its communication to the Controller cluster is UP for VXLAN 5001.

esxcomp-01b:

Knows web-sv-02a is a VM running on VXLAN 5001, with IP 172.16.10.12 and MAC address 00:50:56:a6:a1:e3.

Its communication to the Controller cluster is UP for VXLAN 5001.

So why can't web-sv-01a talk to web-sv-02a?

The answer to this question is another question: what does the NSX Controller know?

Control Plane verification from NSX Controller point of view:

We have three active Controllers; one of them is elected to manage VXLAN 5001. Remember slicing?

To find out which Controller manages VXLAN 5001, SSH to one of the NSX Controllers, for example 192.168.110.202:

nsx-controller # show control-cluster logical-switches vni 5001
VNI      Controller      BUM-Replication ARP-Proxy Connections VTEPs
5001     192.168.110.201 Enabled         Enabled   0           0

The output says that 192.168.110.201 manages VXLAN 5001, so the next command will be run from 192.168.110.201:

nsx-controller # show control-cluster logical-switches vni 5001
VNI      Controller      BUM-Replication ARP-Proxy Connections VTEPs
5001     192.168.110.201 Enabled         Enabled   6           4

From this output we learn that VXLAN 5001 has 4 VTEPs connected to it and a total of 6 active connections.

At this point I would like to point you to an excellent blogger with lots of information on what happens under the hood in NSX.

His name is Dmitri Kalintsev; link to his post: NSX for vSphere: Controller “Connections” and “VTEPs”.

From Dmitri's post:

“ESXi host joins a VNI in two cases:

  1. When a VM running on that host connects to VNI’s dvPg and its vNIC transitions into “Link Up” state; and
  2. When DLR kernel module on that host needs to route traffic to a VM on that VNI that’s running on a different host.”

We are not routing traffic between the VMs, so the DLR is not part of the game here.

Find out which VTEP IP addresses are connected to VXLAN 5001:

nsx-controller # show control-cluster logical-switches vtep-table 5001
VNI      IP              Segment         MAC               Connection-ID
5001     192.168.250.53  192.168.250.0   00:50:56:67:d9:91 5
5001     192.168.250.52  192.168.250.0   00:50:56:64:f4:25 3
5001     192.168.250.51  192.168.250.0   00:50:56:66:e2:ef 4
5001     192.168.150.51  192.168.150.0   00:50:56:60:bc:e9 6

From this output we can learn that the VTEPs of both esxcomp-01a (192.168.250.51) and esxcomp-01b (192.168.250.53) are seen by the NSX Controller on VXLAN 5001.

The MAC addresses in this output are the VTEPs' MACs.

Find out whether the MAC addresses of the VMs have been learned by the NSX Controller:

nsx-controller # show control-cluster logical-switches mac-table 5001
VNI      MAC               VTEP-IP         Connection-ID
5001     00:50:56:a6:7a:a2 192.168.250.51  4
5001     00:50:56:a6:a1:e3 192.168.250.53  5
5001     00:50:56:8e:45:33 192.168.150.51  6

The first entry shows the MAC of web-sv-01a and the second shows the MAC of web-sv-02a.

Find out whether the ARP entries of the VMs have been learned by the NSX Controller:

 

nsx-controller # show control-cluster logical-switches arp-table 5001
VNI      IP              MAC               Connection-ID
5001     172.16.10.11    00:50:56:a6:7a:a2 4
5001     172.16.10.12    00:50:56:a6:a1:e3 5
5001     172.16.10.10    00:50:56:8e:45:33 6

The first two entries show the exact IP/MAC of web-sv-01a and web-sv-02a.

To understand how the Controller has learned this info, read my post NSX-V IP Discovery.

Sometimes restarting the netcpad process can fix problems between an ESXi host and the NSX Controllers.

esxcomp-01a # /etc/init.d/netcpad restart
watchdog-netcpa: Terminating watchdog process with PID 4273913
Memory reservation released for netcpa
netCP agent service is stopped
Memory reservation set for netcpa
netCP agent service starts

Summary of controller verification:

The NSX Controller knows where the VMs are located, along with their IP and MAC addresses. It seems like the control plane works just fine.

 

Move VM to different ESXi host

In NSX-v each ESXi host runs its own UWA service daemons as part of the management and control plane; sometimes, when the UWA is not working as expected, VMs on that ESXi host will have connectivity issues.

A fast way to check this is to vMotion the non-working VM from one ESXi host to another; if the VM starts to work, we need to focus on the control plane of the non-working ESXi host.

In this scenario, even after I vMotioned my VM to a different ESXi host, the problem didn't go away.
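If vMotion does point at a specific host, it is worth checking both UWA daemons on that host. netcpad was shown above; vsfwd is the second one (the init script name below is the one commonly documented for NSX-v; verify on your build):

esxcomp-01a # /etc/init.d/netcpad status
esxcomp-01a # /etc/init.d/vShield-Stateful-Firewall status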

 

Capture in the right spots:

The pktcap-uw command allows capturing traffic at many different points in an NSX environment.

Before starting to capture all over the place, let's think about where the problem is likely to be.

When a VM connects to a logical switch there are a few security services that a packet traverses; each service is represented by a different slot ID.

TSHOT5

SLOT 0: implements the vDS access list.

SLOT 1: the switch security module (swsec) captures DHCP ACK and ARP messages; this info is then forwarded to the NSX Controller.

SLOT 2: NSX Distributed Firewall.

We need to check whether VM traffic successfully passes the NSX Distributed Firewall, i.e. at slot 2.

The capture command needs the SLOT 2 filter name for web-sv-01a.

From esxcomp-01a:

esxcomp-01a # summarize-dvfilter
~~~snip~~~~
world 35888 vmm0:web-sv-01a vcUuid:'50 26 c7 cd b6 f3 f4 bc-e5 33 3d 4b 25 5c 62 77'
 port 50331657 web-sv-01a.eth0
  vNic slot 2
   name: nic-35888-eth0-vmware-sfw.2
   agentName: vmware-sfw
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Dynamic Filter Creation
  vNic slot 1
   name: nic-35888-eth0-dvfilter-generic-vmware-swsec.1
   agentName: dvfilter-generic-vmware-swsec
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Alternate Opaque Channel

We can see in the output that the VM name is web-sv-01a, that a filter is applied at vNic slot 2 and that the filter name is nic-35888-eth0-vmware-sfw.2.

pktcap-uw capture points (the -A output):

esxcomp-01a # pktcap-uw -A
Supported capture points:
        1: Dynamic -- The dynamic inserted runtime capture point.
        2: UplinkRcv -- The function that receives packets from uplink dev
        3: UplinkSnd -- Function to Tx packets on uplink
        4: Vmxnet3Tx -- Function in vnic backend to Tx packets from guest
        5: Vmxnet3Rx -- Function in vnic backend to Rx packets to guest
        6: PortInput -- Port_Input function of any given port
        7: IOChain -- The virtual switch port iochain capture point.
        8: EtherswitchDispath -- Function that receives packets for switch
        9: EtherswitchOutput -- Function that sends out packets, from switch
        10: PortOutput -- Port_Output function of any given port
        11: TcpipDispatch -- Tcpip Dispatch function
        12: PreDVFilter -- The DVFIlter capture point
        13: PostDVFilter -- The DVFilter capture point
        14: Drop -- Dropped Packets capture point
        15: VdrRxLeaf -- The Leaf Rx IOChain for VDR
        16: VdrTxLeaf -- The Leaf Tx IOChain for VDR
        17: VdrRxTerminal -- Terminal Rx IOChain for VDR
        18: VdrTxTerminal -- Terminal Tx IOChain for VDR
        19: PktFree -- Packets freeing point

The capture command supports sniffing traffic at several interesting points; with PreDVFilter and PostDVFilter we can sniff traffic before or after the dvfilter action.

Capture after the SLOT 2 filter:

pktcap-uw --capture PostDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_after.pcap
The session capture point is PostDVFilter
The name of the dvfilter is nic-35888-eth0-vmware-sfw.2
The session filter IP protocol is 0x1
The output file is web-sv-01a_after.pcap
No server port specifed, select 784 as the port
Local CID 2
Listen on port 784
Accept...Vsock connection from port 1049 cid 2
Destroying session 25

Dumped 0 packet to file web-sv-01a_after.pcap, dropped 0 packets.

PostDVFilter = capture after the named filter.

--proto=0x1 = capture only ICMP packets.

--dvfilter = the filter name as shown by the summarize-dvfilter command.

-o = the file where the captured traffic is written.

From the output of this command (“Dumped 0 packet”) we can tell the ICMP packets do not pass this filter.

We found our smoking gun 🙂
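As an optional cross-check, the Drop capture point listed in the pktcap-uw -A output above can be used to confirm that the host is actually discarding the ICMP packets (a sketch, same filter options as the captures in this section):

pktcap-uw --capture Drop --proto=0x1 -o web-sv-01a_drop.pcap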

Now capture before the SLOT 2 filter:

pktcap-uw --capture PreDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_before.pcap
The session capture point is PreDVFilter
The name of the dvfilter is nic-35888-eth0-vmware-sfw.2
The session filter IP protocol is 0x1
The output file is web-sv-01a_before.pcap
No server port specifed, select 5782 as the port
Local CID 2
Listen on port 5782
Accept...Vsock connection from port 1050 cid 2
Dump: 6, broken : 0, drop: 0, file err: 0Destroying session 26

Dumped 6 packet to file web-sv-01a_before.pcap, dropped 0 packets.

Now we can see that packets were dumped (“Dumped 6 packet”). We can open the web-sv-01a_before.pcap capture file:

esxcomp-01a # tcpdump-uw -r web-sv-01a_before.pcap
reading from file web-sv-01a_before.pcap, link-type EN10MB (Ethernet)
20:15:31.389158 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18628, length 64
20:15:32.397225 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18629, length 64
20:15:33.405253 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18630, length 64
20:15:34.413356 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18631, length 64
20:15:35.421284 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18632, length 64
20:15:36.429219 IP 172.16.10.11 > 172.16.10.12: ICMP echo request, id 3144, seq 18633, length 64

Voilà, the NSX DFW is blocking the traffic.
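Before opening the GUI, the rules actually enforced on this vNIC can also be listed directly on the host with vsipioctl, using the filter name found earlier with summarize-dvfilter (a sketch; check the exact options on your build):

esxcomp-01a # vsipioctl getfilters
esxcomp-01a # vsipioctl getrules -f nic-35888-eth0-vmware-sfw.2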

And now from the NSX GUI:

TSHOT6

Looking back at this article, step 3 (“figure out if it is a configuration issue”) was intentionally skipped.

If we had checked the configuration settings, we would have noticed this problem immediately.

 

 

Summary of all CLI Commands for this post:

ESXi Commands:

esxtop
esxcfg-vmknic -l
esxcli network vswitch dvs vmware vxlan list
esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5001
esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=Compute_VDS
esxcli network ip route ipv4 list -N vxlan
esxcli network ip neighbor list -N vxlan
esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5001
esxcli network ip connection list | grep 1234
ping ++netstack=vxlan 192.168.250.53 -s 1570 -d
/etc/init.d/netcpad (status|start|restart)
summarize-dvfilter
pktcap-uw -A
pktcap-uw --capture PostDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_after.pcap
pktcap-uw --capture PreDVFilter --dvfilter nic-35888-eth0-vmware-sfw.2 --proto=0x1 -o web-sv-01a_before.pcap
tcpdump-uw -r web-sv-01a_before.pcap

 

NSX Controller Commands:

show control-cluster logical-switches vni 5001
show control-cluster logical-switches vtep-table 5001
show control-cluster logical-switches mac-table 5001
show control-cluster logical-switches arp-table 5001

 

Troubleshooting NSX-V Controller

Overview

The Controller cluster in the NSX platform is the control plane component responsible for managing the switching and routing modules in the hypervisors.

The use of the Controller cluster to manage VXLAN-based logical switches eliminates the need for multicast in the physical network.

1

Each Controller Node is assigned a set of roles that define the type of tasks the node can implement. By default, each Controller Node is assigned all roles.

NSX controller roles:

API provider: Handles HTTP web service requests from external clients (NSX Manager) and initiates processing by other Controller Node tasks.

Persistence Server: Stores data from the NVP API and vDS devices that must be persisted across all Controller Nodes in case of node failures or shutdowns.

Logical manager: Monitors when end hosts arrive or leave vDS devices and configures the vDS forwarding states to implement logical connectivity and policies.

Switch manager: Maintains management connections for one or more vDS devices.

Directory server: Manages the VXLAN and distributed logical routing directory information.

Any multi-node HA mechanism has the potential for a “split brain” scenario in which a cluster is partitioned into two or more groups, and those groups are not able to communicate. In this scenario, each group might assume control of all tasks under the assumption that the other nodes have failed. NSX uses leader election to solve this split-brain problem. One of the Controller Nodes is elected as a leader for each role, which requires a majority vote of all active and inactive nodes in the cluster.

2

The leader for each role is responsible for allocating tasks to individual Controller Nodes and determining when a node has failed. Since election requires a majority of all nodes, it is not possible for two leaders to exist simultaneously within a cluster, preventing a split brain scenario. The leader election mechanism requires a majority of all cluster nodes to be functional at all times.

Note: currently NSX-V 6.1 supports a maximum of 3 Controllers.

Here is an example of 3 NSX Controllers and the role election per node member.

3

Node 1 master for roles: API Provider and Logical Manager.

Node 2 master for roles: Persistence Server and Directory Server.

Node 3 master for roles: Switch Manager.

The required majority number depends on the number of Controller Cluster nodes. It is evident how deploying 2 nodes (traditionally considered an example of a redundant system) would increase the scalability of the Controller Cluster (since at steady state two nodes would work in parallel) without providing any additional resiliency. This is because with 2 nodes the majority number is 2, which means that if one of the two nodes were to fail, or they lost communication with each other (dual-active scenario), neither of them would be able to keep functioning (accepting API calls, etc.). The same considerations apply to a deployment with 4 nodes, which cannot provide more resiliency than a cluster with 3 elements (even if providing better performance).

 

Troubleshooting NSX Controllers

The next part, troubleshooting the NSX Controller, is based on the VMware NSX MH 4.1 User Guide:

https://my.vmware.com/web/vmware/details?productId=418&downloadGroup=NSX-MH-412-DOC

The NSX Controller node IP addresses for the next screenshots are:

Node 1: 192.168.110.201, Node 2: 192.168.110.202, Node 3: 192.168.110.203

Verify NSX Controller installation

Ensure that the Controllers are installed on systems that meet the minimum requirements.
On each Controller:

The CLI command “request system compatibility-report” provides informational details that determine whether a Controller system is compatible with the Controller requirements.

# request system compatibility-report

4

 

Check controller status in NSX Manager

The NSX Manager continually checks whether all Controller Clusters are accessible. If a Controller Cluster is currently in disconnected status, your diagnostic efforts and log review should be focused on the time immediately after the Controller Cluster was last seen as connected.

Here is an example of a “Disconnected” Controller in NSX Manager:

5

This NSX “Controller nodes status” screenshot shows the status between NSX Manager and the Controllers, not the overall Controller cluster status.

So even if all Controllers are in “Normal” state, as in the figure below, that doesn't mean the overall cluster status is OK.

11

Checking the Controller Cluster Status from CLI

The current status of the Controller Cluster can be determined by running show control-cluster status:

 

# show control-cluster status

6

Join status: verifies that this node has completed the cluster join process.

Majority status: checks whether this node is part of the cluster majority.

Cluster ID: all node members need to have the same cluster ID.

The current status of the Controller Node’s intra-cluster communication connections can be determined by running

show control-cluster connections

7

If a Controller node is a Controller Cluster majority leader, it will be listening on port 2878 (as indicated by the Y in the “listening” column).

The other Controller nodes will have a dash (-) in the “listening” column.

The next step is to check whether the Controller Cluster majority leader has any open connections as indicated by the number in the “open conns” column. On a properly functioning Controller, the open connections should be the same as the number of other Controller nodes in the Controller Cluster (e.g. In a three-node Controller Cluster, the Controller Cluster majority leader should show two open connections).

The command show control-cluster history will allow you to see a history of Controller Cluster-related events on this node including restarts, upgrades, Controller Cluster errors and loss of majority.

controller # show control-cluster history

8

Joining a Controller Node to Controller Cluster

This section covers issues that may be encountered when attempting to join a new Controller Node to an existing Controller Cluster. An explanation of why the issue occurs and instructions on how to resolve the issue are also provided.

Symptom: joining a new Controller node to a Controller Cluster may fail when all of the existing Controllers are disconnected.

Example of this situation:

As we can see, controller-1 and controller-2 are disconnected from the NSX Manager.

5

When we try to add a new Controller node to the cluster we get this error message:

10

Explanation:

If n nodes have joined the NSX Controller Cluster, then a majority (strictly greater than 50%) of those n nodes must be alive and connected to each other before any new data can be written to the system. This means that if you have a Controller Cluster of 3 nodes, 2 of them must be alive and connected in order for new data to be written in NSX.

In our case, to add a new Controller node to the cluster we need at least one member of the existing cluster to be in “Normal” state.

17

 

Resolution: Start the Disconnected Controller. If the Controller is disconnected due to a permanent failure, remove the Controller from the Controller Cluster.

Symptom: the join control-cluster CLI command hangs without ever completing the join operation.

Explanation:

The IP address passed into the join control-cluster command was incorrect, and/or does not refer to a currently live Controller node.

For example, the user typed the command:

join control-cluster 192.168.110.201

Make sure that 192.168.110.201 is part of an existing Controller cluster.

Resolution:

Use the IP address of a properly configured Controller that is reachable across the network.

Symptom:

The join control-cluster CLI command fails.

Explanation: If you have a Controller configured as part of a Controller Cluster, that Controller has been disconnected from the Controller Cluster for a long period of time (perhaps it was taken offline or shut down), and during that time, the other Controllers in that Controller Cluster were removed from the Controller Cluster and formed into a new Controller Cluster, then the long-disconnected Controller will not be allowed to rejoin the Controller Cluster that it left, because that original Controller Cluster is gone.

The following event log message in the new Controller Cluster indicates that something like this has happened:

Node b567a47f-9a61-43b3-8d53-36b3d1fd0675 tried to join with incorrect cluster ID

Resolution:

You must issue the join control-cluster command with the force option on the old Controller to force it to clear its state and join the new Controller Cluster with a fresh start.

Note: the forced join command deletes a previously joined node with the same IP.

nvp-controller # join control-cluster 192.168.110.201 force

18

Recovering a node disconnected from the cluster

When a Controller cluster majority issue arises, it is very difficult to spot it from the NSX Manager GUI.

For example, the current state of the Controllers from the NSX Manager point of view is that all the members are in “Normal” state.

11

But in fact the current status in my cluster is:

12

Node 1 and Node 2 have formed a cluster and share the roles between them; for some reason Node 3 is disconnected from the majority of the cluster.

Output example from controller Node 3:

13

 

Node 3 thinks it is alone and owns all of the roles.

From Node 1's perspective it is the leader (it has the Y in the listening column) and it has one open connection from Node 2, as shown:

14

 

To recover from this scenario, Node 3 needs to join the majority of the cluster; the IP address to join must be Node 1's, because it is the leader of the majority.

join control-cluster 192.168.110.201 force

Recovering from losing all Controller nodes

In this scenario all NSX Controller nodes have failed or been deleted. Do we need to start from scratch? 🙁

The assumption is that our environment already has NSX Edge and DLR deployed, with logical switches connected to VMs, and we would like to preserve them.

The recovering process:

 Step 1:

Migrate existing logical switch to Multicast mode.

15

Step 2:

Deploy 3 new NSX Controllers.

Step 3:

Sync the newly deployed NSX Controllers with the current state of our NSX environment by switching the logical switches back to Unicast mode.

16

Other useful commands:

Checking Controller Processes

Even if the “join-cluster” command on a node appears to have been successful, the node might not have come up completely for a variety of reasons. The way this error tends to manifest itself most visibly is that the controller process isn’t listening on all the ports it’s supposed to be, and no API requests or switch connections are happening.

# show network connections of-type tcp

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address      Foreign Address     State       PID/Program
tcp        0      0 172.29.1.20:6633   0.0.0.0:*           LISTEN      14038/domain
tcp        0      0 172.29.1.20:7000   0.0.0.0:*           LISTEN      14072/java
tcp        0      0 0.0.0.0:443        0.0.0.0:*           LISTEN      14067/domain
tcp        0      0 172.29.1.20:7777   0.0.0.0:*           LISTEN      14038/domain
tcp        0      0 172.29.1.20:6632   0.0.0.0:*           LISTEN      14038/domain
tcp        0      0 172.29.1.20:9160   0.0.0.0:*           LISTEN      14072/java
tcp        0      0 172.29.1.20:2888   0.0.0.0:*           LISTEN      14072/java
tcp        0      0 172.29.1.20:2888   172.29.1.20:55622   ESTABLISHED 14072/java
tcp        0      0 172.29.1.20:9160   172.29.1.20:52567   ESTABLISHED 14072/java
tcp        0      0 172.29.1.20:52566  172.29.1.20:9160    ESTABLISHED 14038/domain
tcp        0      0 172.29.1.20:443    172.17.21.9:46438   ESTABLISHED 14067/domain

 

The show network connection output shown in the preceding block is an example from a healthy Controller. If you find some of these missing, it’s likely that NSX didn’t get past its install phase.  Here are some misconfigurations that can cause this:

Bad management address or listen IP

You’ve set an incorrect IP as the management-address, or as the listen-ip for one of the roles (like switch_manager or api_provider).

NSX attempts to bind to the specified address, and fails early if it cannot do so.  You’ll see log messages in cloudnet_cpp.log.ERROR like:

E0506 01:20:17.099596  7188 dso-deployer.cc:516] Controller component installation of rpc-broker failed: Unable to bind a RPC port $tags:tracing:3ef7d1f519ffb7fb^

E0506 01:20:17.100162  7188 main.cc:271] RPC deployment subsystem not installed; exiting. $tags:tracing:3ef7d1f519ffb7fb^

Or in cloudnet_cpp.log.WARNING:

W0506 01:22:27.721777  7694 ssl-socket.cc:530] SSLSocket failed to bind to 172.1.1.1:6632: Cannot assign requested address

Note that if you are using DHCP for the IP addresses of your controller nodes (not recommended or supported), the IP address could have changed since the last time you configured it.

Verify that the IP addresses for switch_manager and api_provider are what they are supposed to be by performing the CLI command:

<switch_manager|api_provider>  listen-ip

 

Bad first node address

You’ve provided the wrong IP address for the first node in the Controller Cluster. Run:

show control-cluster startup-nodes

to determine whether the IPs listed correspond to the IPs of the Controllers in the Controller Cluster.

 

Out of disk space

The Controller may be out of disk space. Use the command:

show status

to see if any of the partitions have 0 bytes available.

The NSX CLI command show system statistics can be used to display resource utilization for disk space, disk I/O, memory, CPU and various other processes on the Controller Nodes. The command offers statistics with one-minute intervals for a window of one hour for various combinations. The show system statistics CLI command does auto-completion and can be used to view the list of metric data available.

show system statistics <datasource>       : for the tabular output
show system statistics graph <datasource> : for the graphical format output

 

As an example, the following output shows the RRD statistics for the datasource disk_ops:write associated with the disk sda1 on the Controller in a tabular form:

# show system statistics disk-sda1/disk_ops:write

Time   Write
12:29  0.74
12:28  0.731429
12:27  0.617143
12:26  0.665714
<snip>

 

More commands:

# show network interface
# show network default-gateway
# show network dns-servers
# show network ntp-servers
# show network ntp-status
# traceroute <ip_address or dns_name>
# ping <ip address>
# ping interface addr <alternate_src_ip> <ip_address>
# watch network interface breth0 traffic
 

Deploying the NSX-V Controller fails and it disappears from the vSphere Client

One of the following issues, hit during the deployment of the NSX-v Controller cluster, may cause the deployment to fail and the instantiated Controller nodes to be deleted after a few minutes.

  1. Firewall blocking Controller communication with NSX Manager.
  2. Network Connectivity between NSX Manager and Controllers.
  3. DNS/NTP misconfiguration between NSX Manager/vCenter/ESXi hosts.
  4. Lack of available resources, like disk space, in the Datastore utilized for the deployment of the Controllers.

The first area to investigate is the “Task Console” on vCenter. From an analysis of the entries displayed on the console, it is clear that first the Controller virtual machine is “powered on”, but then it gets powered off and deleted. But why?

 

View vCenter Tasks

 

Troubleshooting steps:

  • Download the NSX Manager logs.
  • In the upper right corner of the NSX Manager GUI, choose “Download Tech Support Log”.
Download NSX Manager Logs

 

The tech support file can be a very large text file, so finding an issue is as challenging as looking for a needle in a haystack. What should we look for?

My best advice is to start with something we know, the name of the Controller node that was first instantiated and then deleted. This name was assigned to the Controller node after the completion of the deployment wizard.

In my specific example it was “controller-2”.

Open the text file and search for this name:

Search in Tech Support File

 

When you find the name, use the down arrow key and start to read:

NSX Tech Support file

 

From this error we can learn that we have connectivity issues; it appears that if the Controller node can't connect to NSX Manager during the deployment process, it gets automatically deleted.

The next question is: why do I have connectivity issues? In my case the NSX Controller and the NSX Manager run in the same IP subnet.

The answer is found in the static IP pool object that was manually created for the Controller cluster.

In this lab I work with a class B subnet mask 255.255.0.0 (prefix length 16), but in the IP pool object I mistakenly assigned a prefix length of 24.

 

Wrong IP Pool

 

This was just an example of how to troubleshoot an NSX-v Controller node deployment, but other issues can cause a similar problem:

  • A firewall blocking the Controller from talking to NSX Manager.
  • Network connectivity problems between NSX Manager and the Controllers.
  • Make sure NSX Manager/vCenter/ESXi hosts have DNS/NTP configured.
  • Make sure you have available resources, like disk space, in the Datastore where you deploy the Controllers.

NSX Home LAB Part 2


 NSX Controller


This post was updated on 18.10.14.

NSX Controller Overview

Controller

The NSX control plane runs in the NSX controller. In a vSphere-optimized environment with VDS the controller enables multicast free VXLAN and control plane programming of elements such as Distributed Logical Routing (DLR).

In all cases the controller is purely a part of the control plane and does not have any data plane traffic passing through it. The controller nodes are also deployed in a cluster of odd members in order to enable high-availability and scale.

The Controller roles in the NSX architecture are:

  • Enables the VXLAN control plane by distributing network information.
  • Controllers are clustered in odd numbers (1, 3) for scale-out and high availability.
  • A TCP (SSL) server implements the control plane protocol.
  • An extensible framework that supports multiple applications, currently VXLAN and the Distributed Logical Router.
  • Provides a CLI interface for statistics and runtime states.
  • Clustering, data persistence/replication, and the REST API framework from NVP are leveraged by the controller.

This next overview of the NSX Controller is taken from the great work of Max Ardica and Nimish Desai in the official NSX Design Guide:

The Controller cluster in the NSX platform is the control plane component that is responsible in managing the switching and routing modules in the hypervisors. The controller cluster consists of controller nodes that manage specific logical switches. The use of controller cluster in managing VXLAN based logical switches eliminates the need for multicast support from the physical network infrastructure. Customers now don’t have to provision multicast group IP addresses and also don’t need to enable PIM routing or IGMP snooping features on physical switches or routers.

Additionally, the NSX Controller supports an ARP suppression mechanism that reduces the need to flood ARP broadcast requests across the L2 network domain where virtual machines are connected. The different VXLAN replication mode and the ARP suppression mechanism will be discussed in more detail in the “Logical Switching” section.

For resiliency and performance, production deployments must deploy a Controller Cluster with multiple nodes. The NSX Controller Cluster represents a scale-out distributed system, where each Controller Node is assigned a set of roles that define the type of tasks the node can implement.

In order to increase the scalability characteristics of the NSX architecture, a “slicing” mechanism is utilized to ensure that all the controller nodes can be active at any given time.

 Slicing Controller

The above illustrates how the roles and responsibilities are fully distributed between the different cluster nodes. This means, for example, that different logical networks (or logical routers) may be managed by different Controller nodes: each node in the Controller Cluster is identified by a unique IP address and when an ESXi host establishes a control-plane connection with one member of the cluster, a full list of IP addresses for the other members is passed down to the host, so to be able to establish communication channels with all the members of the Controller Cluster. This allows the ESXi host to know at any given time what specific node is responsible for a given logical network

In the case of failure of a Controller Node, the slices for a given role that were owned by the failed node are reassigned to the remaining members of the cluster. In order for this mechanism to be resilient and deterministic, one of the Controller Nodes is elected as a “Master” for each role. The Master is responsible for allocating slices to individual Controller Nodes and determining when a node has failed, so to be able to reallocate the slices to the other nodes using a specific algorithm. The master also informs the ESXi hosts about the failure of the cluster node, so that they can update their internal information specifying what node owns the various logical network slices.

The election of the Master for each role requires a majority vote of all active and inactive nodes in the cluster. This is the main reason why a Controller Cluster must always be deployed leveraging an odd number of nodes.

Controller Nodes Majority

Figure above highlights the different majority number scenarios depending on the number of Controller Cluster nodes. It is evident how deploying 2 nodes (traditionally considered an example of a redundant system) would increase the scalability of the Controller Cluster (since at steady state two nodes would work in parallel) without providing any additional resiliency. This is because with 2 nodes, the majority number is 2 and that means that if one of the two nodes were to fail, or they lost communication with each other (dual-active scenario), neither of them would be able to keep functioning (accepting API calls, etc.). The same considerations apply to a deployment with 4 nodes that cannot provide more resiliency than a cluster with 3 elements (even if providing better performance).

Note: NSX currently (as of software release 6.1) supports only clusters with 3 nodes. The various examples above with different numbers of nodes were given just to illustrate how the majority vote mechanism works.

NSX Controller nodes are deployed as virtual appliances from the NSX Manager UI. Each appliance is characterized by an IP address used for all control-plane interactions and by specific settings (4 vCPUs, 4GB of RAM) that cannot currently be modified (see Downsizing NSX Controller).

In order to ensure reliability to the Controller cluster, it is good practice to spread the deployment of the cluster nodes across separate ESXi hosts, to ensure that the failure of a single host would not cause the loss of majority number in the cluster. NSX does not currently provide any embedded capability to ensure this, so the recommendation is to leverage the native vSphere DRS anti-affinity rules to avoid deploying more than one controller node on the same ESXi server.

Here is an example of such a rule:

NSX Management Cluster and DRS Rules

Anti-Affinity

For more information on how to create a VM-to-VM anti-affinity rule, please refer to the following article:

http://pubs.vmware.com/vsphere-55/index.jsp#com.vmware.vsphere.resmgmt.doc/GUID-7297C302-378F-4AF2-9BD6-6EDB1E0A850A.html

Deploying NSX Controller

From the NSX Controller menu we click the green + button.

Deploying Controller

The Add Controller window pops up; we will place the Controller in the Management cluster.

Capture2

The IP Pool box is needed for automatically allocating an IP address for each Controller node.

Capture3

After clicking OK, NSX Manager will deploy the Controller node in the Management cluster.

Capture4

We will need to wait until the node status changes from Deploying to Normal.

Capture5

At this point we have one node in the NSX Controller cluster.

vCapture7

If you have problems deploying the Controller, read my post:

Deploying NSX-V controller failed

We can now SSH to the Controller and run some show commands:

show control-cluster status

nvp-controller # show control-cluster status
Type                Status                                      Since
--------------------------------------------------------------------------------
Join status:        Join complete                               04/16 02:55:19
Majority status:    Connected to cluster majority               04/16 02:55:11
Restart status:     This controller can be safely restarted     04/16 02:55:17
Cluster ID:         14d6067f-c1d2-4541-ae45-d20d1c47009f
Node UUID:          14d6067f-c1d2-4541-ae45-d20d1c47009f

Role                Configured status    Active status
--------------------------------------------------------------------------------
api_provider        enabled              activated
persistence_server  enabled              activated
switch_manager      enabled              activated
logical_manager     enabled              activated
directory_server    enabled              activated

Join status: shows when this node joined the cluster and the status of the join; in this output we get “Join complete”.

Check which nodes are up and running and their IP addresses:

show control-cluster startup-nodes

nvp-controller # show control-cluster startup-nodes
192.168.78.135

Controller Cluster and High Availability

Controller Cluster

For testing we will install 3 nodes joined as one Controller cluster.

Install Controller Cluster

From the startup-nodes output we can see that we have 3 nodes.

show control-cluster startup-nodes

nvp-controller # show control-cluster startup-nodes
192.168.78.135, 192.168.78.136, 192.168.78.137

One of the node members will be elected as Master for each role.

In order to see which node member was elected as Master, we can run the command:

show control-cluster roles

Master Election

Node 1 was chosen as Master.

Now let's check what happens if we restart Node 1.

Restart Node1

After a few seconds Node 2 is elected as Master:

Node2 Elected as Master

In order to save memory on my laptop I will keep Node 1 and delete Node 2 and Node 3.

Want to know more about how to troubleshoot NSX Controllers?

Read my post:

Troubleshooting NSX-V Controller

Summary of Part 2

We installed the NSX Controllers and saw the high-availability functionality of the NSX Controller cluster.

Lab topology:

Summary of home lab part 2

Related Post:

Troubleshooting NSX-V Controller

NSX Controller

Host Preparation

Logical Switch

Distributed Logical Router