Troubleshooting NSX-V Controller

Overview

The Controller cluster in the NSX platform is the control plane component responsible for managing the switching and routing modules in the hypervisors.

Using a Controller cluster to manage VXLAN-based logical switches eliminates the need for multicast support in the physical network.


Each Controller Node is assigned a set of roles that define the type of tasks the node can implement. By default, each Controller Node is assigned all roles.

NSX Controller roles:

API Provider: Handles HTTP web service requests from external clients (NSX Manager) and initiates processing by other Controller Node tasks.

Persistence Server: Stores data from the NVP API and vDS devices that must be persisted across all Controller Nodes in case of node failures or shutdowns.

Logical Manager: Monitors when end hosts arrive at or leave vDS devices and configures the vDS forwarding states to implement logical connectivity and policies.

Switch Manager: Maintains management connections for one or more vDS devices.

Directory Server: Manages the VXLAN and distributed logical routing directory information.

Any multi-node HA mechanism has the potential for a “split brain” scenario, in which a cluster is partitioned into two or more groups that are unable to communicate. In this scenario, each group might assume control of all tasks under the assumption that the other nodes have failed. NSX uses leader election to solve this split-brain problem: one of the Controller Nodes is elected as a leader for each role, which requires a majority vote of all active and inactive nodes in the cluster.


The leader for each role is responsible for allocating tasks to individual Controller Nodes and determining when a node has failed. Since election requires a majority of all nodes, it is not possible for two leaders to exist simultaneously within a cluster, preventing a split-brain scenario. The leader election mechanism requires a majority of all cluster nodes to be functional at all times.

Note: NSX-V 6.1 currently supports a maximum of three Controllers.

Here is an example of three NSX Controllers and the role election across the node members.


Node 1 master for roles: API Provider and Logical Manager

Node 2 master for roles: Persistence Server and Directory Server

Node 3 master for roles: Switch Manager

The majority number depends on the number of Controller Cluster nodes. Deploying 2 nodes (traditionally considered an example of a redundant system) would increase the scalability of the Controller Cluster (since at steady state two nodes work in parallel) without providing any additional resiliency. With 2 nodes the majority number is 2, which means that if one of the two nodes fails, or the nodes lose communication with each other (dual-active scenario), neither of them is able to keep functioning (accepting API calls, etc.). The same consideration applies to a deployment with 4 nodes, which cannot provide more resiliency than a cluster with 3 elements (even if it provides better performance).
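As a quick reference, the majority is ⌊n/2⌋ + 1 of the n cluster nodes, so the failure tolerance for small clusters works out as follows (NSX-V itself supports at most 3):

Nodes   Majority   Node failures tolerated
1       1          0
2       2          0
3       2          1
4       3          1
5       3          2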

 

Troubleshooting NSX Controllers

The next part of troubleshooting the NSX Controller is based on the VMware NSX MH 4.1 User Guide:

https://my.vmware.com/web/vmware/details?productId=418&downloadGroup=NSX-MH-412-DOC

The NSX Controller node IP addresses in the next screenshots are:

Node 1 192.168.110.201, Node 2 192.168.110.202, Node 3 192.168.110.203

Verify NSX Controller installation

Ensure that the Controllers are installed on systems that meet the minimum requirements.
On each Controller:

The CLI command “request system compatibility-report” provides informational details that determine whether a Controller system is compatible with the Controller requirements.

# request system compatibility-report

[Screenshot: request system compatibility-report output]

 

Check controller status in NSX Manager

The NSX Manager continually checks whether all Controller Clusters are accessible. If a Controller Cluster is currently in disconnected status, your diagnostic efforts and log review should be focused on the time immediately after the Controller Cluster was last seen as connected.

Here is an example of a “Disconnected” Controller in NSX Manager:

[Screenshot: NSX Manager showing a Controller in “Disconnected” status]

This NSX “Controller nodes status” screenshot shows the status between the NSX Manager and the Controllers, not the overall Controller cluster status.

So even if all controllers are in the “Normal” state, as in the figure below, that does not mean the overall Controller cluster status is OK.

[Screenshot: NSX Manager showing all Controller nodes in “Normal” status]

Checking the Controller Cluster Status from CLI

The current status of the Controller Cluster can be determined by running show control-cluster status:

 

# show control-cluster status

[Screenshot: show control-cluster status output]

Join status: verifies that this node has completed the join-to-cluster process.

Majority status: indicates whether this node is connected to the cluster majority.

Cluster ID: all node members need to have the same cluster ID.
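For reference, a healthy node reports something along these lines (wording approximated from the NSX MH guide; the cluster ID below is illustrative):

# show control-cluster status

Join status:     Joined cluster; local node is connected to cluster majority
Majority status: Connected to cluster majority
Cluster ID:      af2e9dec-19b9-4463-ab23-5ec36fd5cfa3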

The current status of the Controller Node’s intra-cluster communication connections can be determined by running show control-cluster connections:

[Screenshot: show control-cluster connections output]

If a Controller node is the Controller Cluster majority leader, it will be listening on port 2878 (as indicated by the Y in the “listening” column). The other Controller nodes will have a dash (-) in the “listening” column.

The next step is to check whether the Controller Cluster majority leader has any open connections, as indicated by the number in the “open conns” column. On a properly functioning Controller, the number of open connections should equal the number of other Controller nodes in the Controller Cluster (e.g., in a three-node Controller Cluster, the Controller Cluster majority leader should show two open connections).
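Putting those two checks together, on the majority leader of a healthy three-node cluster you would expect a row roughly like the following (only the columns discussed above are sketched; exact role and port names vary by version):

role                 port          listening  open conns
persistence_server   server/2878   Y          2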

The command show control-cluster history will allow you to see a history of Controller Cluster-related events on this node including restarts, upgrades, Controller Cluster errors and loss of majority.

controller # show control-cluster history

[Screenshot: show control-cluster history output]

Joining a Controller Node to Controller Cluster

This section covers issues that may be encountered when attempting to join a new Controller Node to an existing Controller Cluster. An explanation of why the issue occurs and instructions on how to resolve the issue are also provided.

Symptom: Joining a new Controller node to a Controller Cluster may fail when all of the existing Controllers are disconnected.

An example of this situation:

As we can see, controller-1 and controller-2 are disconnected from the NSX Manager:

[Screenshot: controller-1 and controller-2 shown as “Disconnected” in NSX Manager]

When we try to add a new Controller node to the cluster, we get this error message:

[Screenshot: error message when adding a new Controller node]

Explanation:

If n nodes have joined the NSX Controller Cluster, then a majority (strictly greater than 50%) of those n nodes must be alive and connected to each other before any new data can be written to the system. This means that if you have a Controller Cluster of 3 nodes, 2 of them must be alive and connected in order for new data to be written in NSX.

In our case, to add a new Controller node to the cluster, we need at least one member of the cluster to be in the “Normal” state.

[Screenshot: one cluster member in “Normal” state]

 

Resolution: Start the Disconnected Controller. If the Controller is disconnected due to a permanent failure, remove the Controller from the Controller Cluster.

Symptom: The join control-cluster CLI command hangs without ever completing the join operation.

Explanation:

The IP address passed into the join control-cluster command was incorrect, and/or does not refer to a currently live Controller node.

For example, the user types the command:

join control-cluster 192.168.110.201

Make sure that 192.168.110.201 is part of the existing Controller cluster.

Resolution:

Use the IP address of a properly configured Controller that is reachable across the network.
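Before retrying the join, it is worth confirming basic reachability to that Controller from the joining node, using the Controller CLI network tools listed at the end of this post:

# ping 192.168.110.201
# traceroute 192.168.110.201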

Symptom:

The join control-cluster CLI command fails.

Explanation: If you have a Controller that was configured as part of a Controller Cluster, but it has been disconnected from the Controller Cluster for a long period of time (perhaps it was taken offline or shut down), and during that time the other Controllers in that Controller Cluster were removed and formed into a new Controller Cluster, then the long-disconnected Controller will not be allowed to rejoin the Controller Cluster that it left, because that original Controller Cluster is gone.

The following event log message in the new Controller Cluster indicates that something like this has happened:

Node b567a47f-9a61-43b3-8d53-36b3d1fd0675 tried to join with incorrect cluster ID

Resolution:

You must issue the join control-cluster command with the force option on the old Controller to force it to clear its state and join the new Controller Cluster with a fresh start.

Note: The forced join command deletes a previously joined node with the same IP.

nvp-controller # join control-cluster 192.168.110.201 force

[Screenshot: forced join command output]

Recovering a node disconnected from the cluster

When a Controller cluster majority issue arises, it can be very difficult to spot from the NSX Manager GUI.

For example, the current state of the controllers from the NSX Manager point of view is that all the members are in the “Normal” state:

[Screenshot: NSX Manager showing all Controller nodes in “Normal” status]

But the actual status of the cluster is:

[Diagram: actual cluster state, with Node 1 and Node 2 forming the majority and Node 3 partitioned]

Node 1 and Node 2 have formed a cluster and share the roles between them; for some reason, Node 3 has disconnected from the majority of the cluster.

Output example from controller Node 3:

[Screenshot: show control-cluster status output from Node 3]

 

Node 3 thinks it is alone and owns all of the roles.

From Node 1’s perspective, it is the leader (it has the Y in the “listening” column) and has one open connection, from Node 2, as shown:

[Screenshot: show control-cluster connections output from Node 1]

 

To recover from this scenario, Node 3 needs to join the majority of the cluster. The IP address to join must be Node 1’s, because it is the leader of the majority:

join control-cluster 192.168.110.201 force

Recovering from the loss of all Controller Nodes

In this scenario, all NSX Controller nodes have failed or been deleted. Do we need to start from scratch? 🙁

The assumption is that our environment already has NSX Edge and DLR deployed, and we have logical switches connected to VMs that we would like to preserve.

The recovery process:

 Step 1:

Migrate the existing logical switches to Multicast mode.

[Screenshot: changing the logical switch replication mode to Multicast]

Step 2:

Deploy 3 new NSX Controllers.

Step 3:

Sync the newly deployed NSX Controllers to unicast mode with the current state of our NSX environment.

[Screenshot: syncing the Controllers back to unicast mode]
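Steps 1 and 3 can be performed per logical switch from the vSphere Web Client, but with many logical switches the NSX REST API is faster. The following is a hedged sketch only: it assumes the NSX-V virtualwires API, a hypothetical NSX Manager hostname nsxmgr-01a, a hypothetical logical switch ID virtualwire-10, and placeholder credentials. Fetch the switch configuration, change its controlPlaneMode, and push it back:

# Get the current logical switch (virtualwire) configuration:
curl -k -u admin:password https://nsxmgr-01a/api/2.0/vdn/virtualwires/virtualwire-10 > virtualwire-10.xml

# Edit virtualwire-10.xml so <controlPlaneMode> is MULTICAST_MODE (step 1)
# or UNICAST_MODE (step 3), then push the change back:
curl -k -u admin:password -X PUT -H "Content-Type: application/xml" -d @virtualwire-10.xml https://nsxmgr-01a/api/2.0/vdn/virtualwires/virtualwire-10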

Other useful commands:

Checking Controller Processes

Even if the “join-cluster” command on a node appears to have been successful, the node might not have come up completely for a variety of reasons. The way this error manifests most visibly is that the Controller process isn’t listening on all the ports it is supposed to, and no API requests or switch connections are happening.

# show network connections of-type tcp

Active Internet connections (servers and established)

Proto Recv-Q Send-Q Local Address      Foreign Address     State       PID/Program

tcp        0      0 172.29.1.20:6633   0.0.0.0:*           LISTEN      14038/domain

tcp        0      0 172.29.1.20:7000   0.0.0.0:*           LISTEN      14072/java

tcp        0      0 0.0.0.0:443        0.0.0.0:*           LISTEN      14067/domain

tcp        0      0 172.29.1.20:7777   0.0.0.0:*           LISTEN      14038/domain

tcp        0      0 172.29.1.20:6632   0.0.0.0:*           LISTEN      14038/domain

tcp        0      0 172.29.1.20:9160   0.0.0.0:*           LISTEN      14072/java

tcp        0      0 172.29.1.20:2888   0.0.0.0:*           LISTEN      14072/java

tcp        0      0 172.29.1.20:2888   172.29.1.20:55622   ESTABLISHED 14072/java

tcp        0      0 172.29.1.20:9160   172.29.1.20:52567   ESTABLISHED 14072/java

tcp        0      0 172.29.1.20:52566  172.29.1.20:9160    ESTABLISHED 14038/domain

tcp        0      0 172.29.1.20:443    172.17.21.9:46438   ESTABLISHED 14067/domain

 

The show network connections output shown in the preceding block is an example from a healthy Controller. If you find some of these entries missing, it’s likely that NSX didn’t get past its install phase. Here are some misconfigurations that can cause this:

Bad management address or listen IP

You’ve set an incorrect IP as the management-address, or as the listen-ip for one of the roles (like switch_manager or api_provider).

NSX attempts to bind to the specified address, and fails early if it cannot do so.  You’ll see log messages in cloudnet_cpp.log.ERROR like:

E0506 01:20:17.099596  7188 dso-deployer.cc:516] Controller component installation of rpc-broker failed: Unable to bind a RPC port $tags:tracing:3ef7d1f519ffb7fb^

E0506 01:20:17.100162  7188 main.cc:271] RPC deployment subsystem not installed; exiting. $tags:tracing:3ef7d1f519ffb7fb^

Or in cloudnet_cpp.log.WARNING:

W0506 01:22:27.721777  7694 ssl-socket.cc:530] SSLSocket failed to bind to 172.1.1.1:6632: Cannot assign requested address

Note that if you are using DHCP for the IP addresses of your controller nodes (not recommended or supported), the IP address could have changed since the last time you configured it.

Verify that the IP addresses for switch_manager and api_provider are what they are supposed to be by performing the CLI command:

<switch_manager|api_provider>  listen-ip

 

Bad first node address

You’ve provided the wrong IP address for the first node in the Controller Cluster. Run show control-cluster startup-nodes to determine whether the IPs listed correspond to the IPs of the Controllers in the Controller Cluster.
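On the example cluster used in this post, you would expect the startup nodes to be the three Controller IPs (output format is illustrative):

# show control-cluster startup-nodes
192.168.110.201, 192.168.110.202, 192.168.110.203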

 

Out of disk space

The Controller may be out of disk space. Use the show status command to see if any of the partitions have 0 bytes available.

The NSX CLI command show system statistics can be used to display resource utilization for disk space, disk I/O, memory, CPU and various other processes on the Controller Nodes. The command offers statistics with one-minute intervals for a window of one hour for various combinations. The show system statistics CLI command does auto-completion and can be used to view the list of metric data available.

show system statistics <datasource>       : for the tabular output
show system statistics graph <datasource> : for the graphical format output

 

As an example, the following output shows the RRD statistics for the datasource disk_ops:write associated with the disk sda1 on the Controller in a tabular form:

# show system statistics disk-sda1/disk_ops:write

Time  Write

12:29             0.74

12:28         0.731429

12:27         0.617143

12:26         0.665714  <snip>

 

More commands:

# show network interface
# show network default-gateway
# show network dns-servers
# show network ntp-servers
# show network ntp-status
# traceroute <ip_address or dns_name>
# ping <ip address>
# ping interface addr <alternate_src_ip> <ip_address>
# watch network interface breth0 traffic
 

8 comments on “Troubleshooting NSX-V Controller”
  1. aj203355 says:

    Looks like my controllers have run out of disk space. Have you seen this problem before? If so, how do you go about fixing it? I’ve tried all of the clear commands (except the ones that will clear the configurations on the clusters).

    • roie9876@gmail.com says:

Sorry, I have never seen this issue, but re-deploying the NSX Controllers may fix it.

      • aj203355 says:

There were a few things that I was able to do to fix the “out of disk space” errors on two different occasions. First, I performed a restart on the controller to ensure that it came up correctly. One of the times, I noticed that it was prompting me to either “fix” some file system issues [f] or to ignore [i] them while trying to boot up the controller. I’m not sure how I missed this prompt and am assuming that after a certain amount of time, the controller continued the boot-up process instead of hanging on the prompt. The second way I was able to fix another “out of disk space” error was to run the commands to clear all disk space. Not “all” disk space, but the cached data on the disks except “essential files and configuration data”.

        Here are the commands:

        ***USE AT YOUR OWN RISK AND USE “?” WITH THE COMMANDS IF YOU ARE UNSURE***

# show status

        Shows current disk space utilization

        # restart system

        # clear disk-space fully

        Free up as much disk space as possible by eliminating non-essential files

        Free up disk space by eliminating non-essential files such as old log
        files, core files, etc. This information can be preserved by creating
        a status report using the ‘save status-report’ command and copying it
        to a remote location using the ‘copy file’ command.

  2. Rahul says:

Can I have a 1-Node Control Cluster? Initially, I had a 3-Node controller cluster, but 2 Controller Nodes crashed (server issues). It seems like the NSX Manager now doesn’t detect the cluster. To get it working, can I have just a 1-Node Controller till I resolve the other 2 Nodes?

    • Rahul says:

Another way to ask the question: how do I disable the existing cluster config so that I can get the setup working with a single-Node Controller till I get additional servers to make it 3-Node?

      • roie9876@gmail.com says:

If you create a cluster with 3 controllers and then lose two, you will lose your cluster majority, meaning you will not have a functional cluster.
To resolve this you will need to add two working controllers to the remaining existing controller.

I did not try it, but if you want to change your NSX Controller cluster from 3 to 1 after losing 2 controllers, you will need to delete the non-working controllers from the NSX Manager GUI, then SSH to the remaining controller and do
join control-cluster x.x.x.x force (where x.x.x.x is the IP address of the controller itself).

