NSX-V Troubleshooting registration to vCenter

In the current NSX software release, the NSX Manager is tightly coupled to the vCenter Server in a 1:1 relationship.

During the process of coupling the NSX Manager to vCenter there are two separate initial configuration steps: the “Lookup Service” and the “vCenter Server”.


Lookup Service:

The Lookup Service allows NSX roles to be bound to SSO users or groups. In other words, it enables the “Role Based Access Control” authentication functionality in NSX, and its configuration is optional. Note that without the Lookup Service configuration the functionality of NSX is not affected at all.

 

vCenter Server:

This is a mandatory configuration. Registering the NSX Manager with vCenter injects a plugin into the vSphere Web Client for consumption of NSX functionalities within the Web management platform.

While registering to vCenter or configuring the Lookup Service you might see this error:

“nested exception is java.net.UnknownHostException: vc-l-01a.corp.local( vc-l-01a.corp.local )”


Or when trying to set up the Lookup Service:

“nested exception is java.net.UnknownHostException: vc-l-01a.corp.local( vc-l-01a.corp.local )”


Or an error similar to this one:

“NSX Management Service operation failed.( Initialization of Admin Registration Service Provider failed. Root Cause: Error occurred while registration of lookup service, com.vmware.vim.sso.admin.exception.InternalError: General failure. )”

 

Most problems registering the NSX Manager to vCenter or configuring the SSO Lookup Service come down to one of the following:

  1. Connectivity problems between the NSX Manager and vCenter.
  2. A firewall blocking the connection.
  3. DNS not configured properly on the NSX Manager or vCenter.
  4. Time not being synced between the NSX Manager and vCenter.
  5. The user authenticated via SSO not having administrative rights.
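Before walking through each of these causes, it can help to check the current registration status directly from the NSX Manager REST API. A minimal sketch, assuming the NSX-V vcconfig endpoints and the lab hostname used in this post (replace the credentials and FQDN with your own):

# query the vCenter registration and its connectivity status (hypothetical FQDN, placeholder password)
curl -k -u admin:<password> https://nsxmgr-l-01a.corp.local/api/2.0/services/vcconfig
curl -k -u admin:<password> https://nsxmgr-l-01a.corp.local/api/2.0/services/vcconfig/status

An error or a disconnected status in the response usually maps to one of the five causes above.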

 

TSHOT steps

Connectivity issue:

Verify connectivity from the NSX Manager to vCenter. Ping from the NSX Manager to vCenter using both the IP address and the Fully Qualified Domain Name (FQDN). Check the routing table for static routes and for the presence of a default route on the NSX Manager:

nsxmgr-l-01a# show ip route

Codes: K – kernel route, C – connected, S – static,

> – selected route, * – FIB route

S>* 0.0.0.0/0 [1/0] via 192.168.110.2, mgmt

C>* 192.168.110.0/24 is directly connected, mgmt
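For a quick reachability check, ping the vCenter IP address directly from the NSX Manager CLI (the address below is the lab vCenter used in these examples):

nsxmgr-l-01a# ping 192.168.110.22

If the ping by IP address works but the ping by FQDN fails, jump straight to the DNS checks below.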

 

DNS Issue:

Verify NSX Manager can successfully resolve the vCenter DNS name. Ping from NSX Manager to vCenter with FQDN:

nsxmgr-l-01a# ping vc-l-01a.corp.local

PING vc-l-01a.corp.local (192.168.110.22): 56 data bytes

64 bytes from 192.168.110.22: icmp_seq=0 ttl=64 time=0.576 ms

If this does not work, verify the DNS configuration on the NSX Manager.

Go to Manage -> Network -> DNS Servers:

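As a cross-check, resolve the same record from any management workstation that points to the same DNS server. A small sketch; the FQDN is the lab example and <dns-server-ip> is a placeholder for the DNS server configured on the NSX Manager:

nslookup vc-l-01a.corp.local <dns-server-ip>

If the name does not resolve here either, fix the DNS record first; if it does resolve, re-check the DNS servers configured on the NSX Manager itself.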

Firewall Issue:

If you have a firewall between the NSX Manager and vCenter, verify that it allows SSL communication on TCP/443 (and also allow ping for connectivity checks).
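A quick way to confirm that TCP/443 is open end to end is to attempt an SSL handshake toward vCenter from a machine in the same network segment as the NSX Manager. A sketch, reusing the lab FQDN from above:

openssl s_client -connect vc-l-01a.corp.local:443

If the command prints the vCenter certificate chain, the port is open; a timeout or “connection refused” points to a firewall or routing problem rather than to NSX itself.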

A complete list of the communication ports and protocols used for VMware NSX for vSphere is available at the links below:

kb.vmware.com/kb/2079386

or

https://communities.vmware.com/docs/DOC-28142

 

NTP issue:

Verify that the time is synced between vCenter and the NSX Manager.


From NSX Manager CLI:

nsxmgr-l-01a# show clock
Tue Nov 18 06:51:34 UTC 2014

 

From vCenter CLI:

vc-l-01a:~ # date
Tue Nov 18 06:51:31 UTC 2014

Note: After changing the time settings, the appliance needs to be restarted.
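If the clocks keep drifting, check that both appliances are actually syncing with NTP. On the Linux-based vCenter appliance, assuming it syncs via ntpd, the standard ntpq tool shows the sync state (this is a generic Linux check, not an NSX-specific command):

vc-l-01a:~ # ntpq -p

Look for an asterisk next to one of the listed servers, which marks the peer the appliance is currently synchronized to; on the NSX Manager, the NTP server is set in the same time settings page mentioned above.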

 

User permission issue:

The user registered to vCenter or the Lookup Service must have administrative rights.
Try working with the default administrator user: administrator@vsphere.local

The official KB, published on 21/1/15:

KB-2102041

Upgrade NSX-V, the Right Way

During November I had the opportunity to take an NSX advanced bootcamp with one of the brilliant PSO architects in the NSX field, Kevin Barrass.
This blog is based on Kevin’s lecture; I added screenshots and my own experience.

Upgrading NSX can be very easy if planned right, or very frustrating if we try to take shortcuts in the process. In this blog I will try to document all the steps needed for a complete NSX-V upgrade.

High level upgrade flow:

Upgrade NSX Process

Before starting the upgrade procedure, the following pre-upgrade steps must be taken into consideration:

  1. Read the NSX release notes.
  2. Check the MD5 of the upgrade file.
  3. Verify the state of the NSX and vSphere infrastructure.
  4. Preserve the NSX infrastructure.

 

Read the NSX release notes:

How many times have you faced an issue during an upgrade, wasted hours on troubleshooting while sure you worked exactly as guided, opened a support ticket and gotten the answer: you are hitting a known upgrade issue and the workaround is written in the release notes. RTFM, feeling dumb…? 🙂

This line is written in blood, do not skip this step!!!! Read the release notes:

 

Compare the MD5

Download any of your favorite MD5 tools; I’m using the free winMd5Sum.


Compare the MD5 sum you calculate against the MD5 published on the official VMware download site.

The link to software:

http://www.nullriver.com/
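If you prefer the command line over a GUI tool, the stock hash utilities do the same job. A sketch, using the 6.1.0 bundle name that appears later in this post:

md5sum VMware-NSX-Manager-upgrade-bundle-6.1.0-2107742.tar.gz

or, on Windows, with the built-in CertUtil:

CertUtil -hashfile VMware-NSX-Manager-upgrade-bundle-6.1.0-2107742.tar.gz MD5

Compare the printed hash against the MD5 value listed next to the download on the VMware site; if they differ, download the bundle again.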

Verify NSX working state

Again, this one comes from the field. The scenario: you complete the upgrade process and now face an issue. How do you know the issue wasn’t there before you started the upgrade?

Do not assume everything is working before you start to touch the infrastructure. Check it!!!

  1. Note the current versions of NSX Manager, vCenter, ESXi and Edges, and verify you can log into:
  • the NSX Manager Web UI
  • vCenter, where the NSX Manager plugin should be visible
  • the ESG and DLR control VMs
  2. Validate that VXLAN is functional (example commands follow this list):
  • Ping between two VMs on the same logical switch (on different hosts): ping -l 1472 -f <dest VM>
  • Ping between two VTEPs (on different hosts): ping ++netstack=vxlan -d -s 1572 <dest VTEP IP>
  3. Validate north-south connectivity by pinging out from a VM.
  4. Do a visual inspection of Host Prep, Logical Network Prep and Edges (check for all green).
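In practice the two MTU-sized pings from the list above look like this; a sketch based on the commands in the list, assuming a Windows guest for the VM-to-VM test and the ESXi shell for the VTEP-to-VTEP test (replace the destination addresses with your own).

From a VM on the logical switch (1472 bytes plus don’t-fragment fills a standard 1500-byte MTU):

ping -l 1472 -f <dest VM IP>

From the ESXi shell, sourced from the VXLAN netstack (1572 bytes exercises the 1600-byte VXLAN MTU):

~ # ping ++netstack=vxlan -d -s 1572 <dest VTEP IP>

If the large VTEP ping fails while a small one succeeds, suspect an MTU problem on the transport network.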

Verify vSphere working state

Check DRS is enabled on clusters

Validate vMotion functions correctly

Check host connection state with vCenter

 

Check that you have a minimum of 3 ESXi hosts in each NSX cluster.

During an NSX upgrade, an NSX cluster with 2 hosts or fewer can in some situations cause issues with DRS, Admission Control or Anti-Affinity rules. My recommendation for a successful upgrade process: try to work with 3 hosts in each NSX cluster you plan to upgrade.

Preserve the NSX infrastructure

Do the upgrade during a maintenance window

Backup the NSX Manager:

Create a current backup of the NSX Manager, and check that you know the backup password 🙂


 

Backup Firewall Policy:

Export the Distributed Firewall rules and the Service Composer configuration:


Export Service Composer

Upgrade NSX manager

Verify that the NSX Manager upgrade bundle file name ends with tar.gz.

Some browsers may strip part of the extension. If the file looks like:

VMware-NSX-Manager-upgrade-bundle-6.1.0-X.X.gz

Change it to:

VMware-NSX-Manager-upgrade-bundle-6.1.0-2107742.tar.gz

Otherwise you will get an error after the upload of the bundle to the NSX Manager completes:

“Invalid upgrade bundle file VMware-NSX-Manager-upgrade-bundle-6.0.x-xxxxx,gz, upgrade file name has extension tar.gz”
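If the extension was stripped, simply rename the file before uploading it. A quick sketch from a Linux shell, using the 6.1.0 bundle name from above (on Windows, just rename the file in Explorer):

mv VMware-NSX-Manager-upgrade-bundle-6.1.0-2107742.gz VMware-NSX-Manager-upgrade-bundle-6.1.0-2107742.tar.gz

tar -tzf VMware-NSX-Manager-upgrade-bundle-6.1.0-2107742.tar.gz | head

The second command is an optional sanity check: a healthy bundle lists its contents without errors.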


To upgrade the NSX Manager, open the NSX Manager web interface and click on Upgrade:


Click on the Upgrade button:


Click “Browse” and open the upgrade file, click Continue:


 

Note: The NSX Manager will reboot during the upgrade process. The forwarding path of VM workloads will not be affected during this step unless:

we are using user identity with the distributed firewall and a new user logs in while the NSX Manager is down.

 

The upgrade process consists of two steps: validating the tar.gz image and then starting the actual upgrade process:


When the NSX Manager finishes the validation, the upgrade process starts:


After the Manager upgrade completes, confirm the version from the Summary tab of the NSX Manager Web UI:


Upgrade the NSX controllers

During the controller node upgrade, the upgrade file is downloaded to each node; the process upgrades node 1 first, then node 2 and finally node 3.

To start the upgrade process click on “Upgrade Available”:


During the NSX controller upgrade we will pass through this state:

Node 1: upgrade to 6.1 complete

Node 2: rebooting

Node 3: in Normal state but still at version 6.0.0

Result: we have one node active on 6.1 and, as a consequence, the controller cluster loses majority due to the version mismatch.


What does it mean? -> Impact on the control plane.

In a live virtual environment with DRS enabled, vMotion of VMs can happen: a VM may change its current ESXi host location and, as a result, we may face forwarding issues because the other VTEPs will not receive this update.

Another issue may occur if dynamic routing receives a topology update, for example a new route being added or removed. To avoid this issue we need to keep routing unchanged.

To limit the exposure window for forwarding issues with workload VMs, my recommendation is to change the DRS setting to Manual; this will limit VM vMotion in the NSX clusters during the controller update!!


Note: After the controller upgrade completes, change DRS back to its previous configuration.

If we are sure that VMs will not move and dynamic routes will not change, then there is no impact on the data plane.

When controller node 2 completes its reboot, we have two controllers upgraded and on the same version. At that point we regain cluster majority; controller node 3 still needs to finish its upgrade and reboot.


When all three controller nodes have completed their reboot, the cluster upgrade is done.


Upgrade Clusters

During the NSX cluster upgrade, the ESXi hosts require a reboot. There will be no data plane impact for VMs because they will be moved automatically by DRS.

If DRS is disabled, the vSphere admin will need to move the VMs manually and reboot each ESXi host.

This is the reason admission control with 2 hosts may prevent an automatic host upgrade. My recommendation is to avoid 2-host clusters, or to manually evacuate a host and put it into maintenance mode.

If you have created anti-affinity rules for the Controllers, a 3-host cluster will prevent the upgrade.


Disable the anti-affinity rules by unchecking “Enable rule” to allow the automatic host upgrade, and re-enable them after the upgrade completes.


With the default anti-affinity rules for Edges/DLR, a 2-host cluster will prevent the upgrade. Uncheck “Enable rule” on the Edge anti-affinity rules to allow the automatic host upgrade, and re-enable it after the upgrade completes.


Click Cluster Host “Update”

If an upgrade is available for the cluster, an “Update” link is shown in the NSX UI. When the upgrade is initiated, the NSX Manager updates the NSX VIBs on each host.

Click on “Update” to upgrade the cluster:


The VIBs are updated on the hosts:
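To confirm that a host really received the new VIBs, you can list them from the ESXi shell. A sketch, assuming the usual NSX-V VIB names (esx-vxlan, esx-vsip, esx-dvfilter-switch-security); the exact set can vary by version:

esxcli software vib list | grep -E 'esx-vxlan|esx-vsip|esx-dvfilter-switch-security'

The version column should match the new NSX release on every upgraded host.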


Hosts reboot during the upgrade:


The Tasks view reveals what happens while the upgrade process runs:


Once all hosts are rebooted, the host update is completed.


Upgrade DLRs and ESGs

During the upgrade process a new ESG VM is deployed alongside the existing one. When the new ESG is ready, the old ESG vNICs are disconnected and the new ESG vNICs are connected. The new ESG then sends GARP.

This process can affect the forwarding plane; we can minimize the impact by running the Edges in ECMP mode.

Go to NSX Edges and upgrade each one:


Each ESG/DLR will then be upgraded

Check that the status is Deployed and the version is correct:


Upgrade Guest Introspection / Data Security if required


If an upgrade is available for Guest Introspection / Data Security, an upgrade link is shown in the NSX UI.


Click on upgrade if available.

Follow the NSX installation guide for specific details on upgrading Guest Introspection / Data Security.

Once the upgrade is successful, create a new NSX Manager backup.

The previous NSX Manager backup is only valid for the previous release.


Don’t forget to verify the NSX working state.

Troubleshooting NSX-V Controller

Overview

The Controller cluster in the NSX platform is the control plane component that is responsible for managing the switching and routing modules in the hypervisors.

The use of a controller cluster to manage VXLAN-based logical switches eliminates the need for multicast.

 

Each Controller Node is assigned a set of roles that define the type of tasks the node can implement. By default, each Controller Node is assigned all roles.

NSX controller roles:

API provider: Handles HTTP web service requests from external clients (NSX Manager) and initiates processing by other Controller Node tasks.

Persistence Server: Stores data from the NVP API and vDS devices that must be persisted across all Controller Nodes in case of node failures or shutdowns.

Logical manager: Monitors when endhosts arrive or leave vDS devices and configures the vDS forwarding states to implement logical connectivity and policies.

Switch manager: Maintains management connections for one or more vDS devices.

Directory server: Manages the VXLAN and distributed logical routing directory information.

Any multi-node HA mechanism has the potential for a “split brain” scenario in which a cluster is partitioned into two or more groups, and those groups are not able to communicate. In this scenario, each group might assume control of all tasks under the assumption that the other nodes have failed. NSX uses leader election to solve this split-brain problem. One of the Controller Nodes is elected as a leader for each role, which requires a majority vote of all active and inactive nodes in the cluster.

 

The leader for each role is responsible for allocating tasks to individual Controller Nodes and determining when a node has failed. Since election requires a majority of all nodes, it is not possible for two leaders to exist simultaneously within a cluster, preventing a split-brain scenario. The leader election mechanism requires a majority of all cluster nodes to be functional at all times.

Note: Currently NSX-V 6.1 supports a maximum of 3 controllers.

Here is an example of 3 NSX Controllers and the role election per node member:


Node 1 master for roles:  API Provider and Logical Manager

Node 2 master for roles: Persistence Server and Directory Server

Node 3 master for roles: Switch Manager.

The majority number depends on the number of Controller Cluster nodes. It is evident that deploying 2 nodes (traditionally considered an example of a redundant system) would increase the scalability of the Controller Cluster (since at steady state two nodes would work in parallel) without providing any additional resiliency. This is because with 2 nodes the majority number is 2, which means that if one of the two nodes were to fail, or they lost communication with each other (dual-active scenario), neither of them would be able to keep functioning (accepting API calls, etc.). The same considerations apply to a deployment with 4 nodes, which cannot provide more resiliency than a cluster with 3 elements (even if providing better performance).

 

TSHOT NSX controllers

The next part of TSHOT NSX Controller is based on the VMware NSX MH 4.1 User Guide:

https://my.vmware.com/web/vmware/details?productId=418&downloadGroup=NSX-MH-412-DOC

The NSX Controller node IP addresses for the next screenshots are:

Node 1 192.168.110.201, Node 2 192.168.110.202, Node 3 192.168.110.203

Verify NSX Controller installation

Ensure that the Controllers are installed on systems that meet the minimum requirements.
On each Controller:

The CLI command “request system compatibility-report” provides informational details that determine whether a Controller system is compatible with the Controller requirements.

# request system compatibility-report

 

Check controller status in NSX Manager

The NSX Manager continually checks whether all Controller Clusters are accessible. If a Controller Cluster is currently in disconnected status, your diagnostic efforts and log review should be focused on the time immediately after the Controller Cluster was last seen as connected.

Here is an example of a “Disconnected” controller from the NSX Manager:

 

This NSX “Controller nodes status” screenshot shows the status between the NSX Manager and the Controllers, not the overall controller cluster status.

So even if all the controllers are in “Normal” state like in the figure below, that doesn’t mean the overall controller cluster status is OK.

Checking the Controller Cluster Status from CLI

The current status of the Controller Cluster can be determined by running show control-cluster status:

 

# show control-cluster status

 

Join status: verifies that this node has completed the join-cluster process.

Majority status: checks whether this node is connected to the cluster majority.

Cluster ID: all node members need to have the same cluster ID.

The current status of the Controller Node’s intra-cluster communication connections can be determined by running

show control-cluster connections

 

If a Controller node is a Controller Cluster majority leader, it will be listening on port 2878 (as indicated by the Y in the “listening” column).

The other Controller nodes will have a dash (-) in the “listening” column.

The next step is to check whether the Controller Cluster majority leader has any open connections as indicated by the number in the “open conns” column. On a properly functioning Controller, the open connections should be the same as the number of other Controller nodes in the Controller Cluster (e.g. In a three-node Controller Cluster, the Controller Cluster majority leader should show two open connections).

The command show control-cluster history will allow you to see a history of Controller Cluster-related events on this node including restarts, upgrades, Controller Cluster errors and loss of majority.

controller # show control-cluster history


Joining a Controller Node to Controller Cluster

This section covers issues that may be encountered when attempting to join a new Controller Node to an existing Controller Cluster. An explanation of why the issue occurs and instructions on how to resolve the issue are also provided.

Symptom: Joining a new Controller node to a Controller Cluster may fail when all of the existing Controllers are disconnected.

An example of this situation:

As we can see, controller-1 and controller-2 are disconnected from the NSX Manager.

When we try to add a new controller node we get this error message:

 

 

Explanation:

If n nodes have joined the NSX Controller Cluster, then a majority (strictly greater than 50%) of those n nodes must be alive and connected to each other before any new data can be written to the system. This means that if you have a Controller Cluster of 3 nodes, 2 of them must be alive and connected in order for new data to be written in NSX.

In our case, to add a new controller node to the cluster we need at least one member of the cluster to be in the “Normal” state.

 

Resolution: Start the Disconnected Controller. If the Controller is disconnected due to a permanent failure, remove the Controller from the Controller Cluster.

Symptom: the join control-cluster CLI command hangs without ever completing the join operation.

Explanation:

The IP address passed into the join control-cluster command was incorrect, and/or does not refer to a currently live Controller node.

For example, the user typed the command:

join control-cluster 192.168.110.201

Make sure that 192.168.110.201 is part of an existing controller cluster.

Resolution:

Use the IP address of a properly configured Controller that is reachable across the network.
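Before pointing a new node at it, it is also worth confirming from the target controller itself that it is healthy and really is a member of the cluster, using the commands already covered above:

show control-cluster status

show control-cluster startup-nodes

The join and majority status should show that this node has joined and is connected to the cluster majority, and the startup-nodes list should contain the IP you are about to use in the join command.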

Symptom:

The join control-cluster CLI command fails.

Explanation: If a Controller that was configured as part of a Controller Cluster has been disconnected from the Controller Cluster for a long period of time (perhaps it was taken offline or shut down), and during that time the other Controllers in that Controller Cluster were removed and formed into a new Controller Cluster, then the long-disconnected Controller will not be allowed to rejoin the Controller Cluster that it left, because that original Controller Cluster is gone.

The following event log message in the new Controller Cluster indicates that something like this has happened:

Node b567a47f-9a61-43b3-8d53-36b3d1fd0675 tried to join with incorrect cluster ID

Resolution:

You must issue the join control-cluster command with the force option on the old Controller to force it to clear its state and join the new Controller Cluster with a fresh start.

Note: The forced join command deletes a previously joined node with the same IP.

nvp-controller # join control-cluster 192.168.110.201 force


Recovering a node disconnected from the cluster

When a controller cluster majority issue arises, it is very difficult to spot it from the NSX Manager GUI.

For example, the current state of the controllers from the NSX Manager point of view is that all the members are in “Normal” state.

 

But in fact the current status in my cluster is:


Node 1 and Node 2 form a cluster and share the roles between them; for some reason Node 3 is disconnected from the majority of the cluster:

Output example from controller Node 3:

 

 

Node 3 thinks it is alone and owns all of the roles.

From Node 1’s perspective it is the leader (it has the Y in the listening column) and has one open connection from Node 2, as shown:

 

 

To recover from this scenario, Node 3 needs to join the majority of the cluster; the IP address to join must be Node 1’s, because it is the leader of the majority.

join control-cluster 192.168.110.201 force

Recovering from losing all Controller Nodes

In this scenario all NSX Controller nodes have failed or been deleted. Do we need to start from scratch? 🙁

The assumption is that our environment already has NSX Edges and DLRs deployed, and we have logical switches connected to VMs that we would like to preserve.

The recovery process:

Step 1:

Migrate the existing logical switches to Multicast mode.


Step 2:

Deploy 3 new NSX controllers.

Step 3:

Sync the newly deployed NSX controllers with the current state of our NSX by moving back to Unicast mode.


Other useful commands:

Checking Controller Processes

Even if the “join-cluster” command on a node appears to have been successful, the node might not have come up completely for a variety of reasons. The way this error tends to manifest itself most visibly is that the controller process isn’t listening on all the ports it’s supposed to be, and no API requests or switch connections are happening.

# show network connections of-type tcp

Active Internet connections (servers and established)

Proto Recv-Q Send-Q Local Address      Foreign Address     State       PID/Program

tcp        0      0 172.29.1.20:6633   0.0.0.0:*           LISTEN      14038/domain

tcp        0      0 172.29.1.20:7000   0.0.0.0:*           LISTEN      14072/java

tcp        0      0 0.0.0.0:443        0.0.0.0:*           LISTEN      14067/domain

tcp        0      0 172.29.1.20:7777   0.0.0.0:*           LISTEN      14038/domain

tcp        0      0 172.29.1.20:6632   0.0.0.0:*           LISTEN      14038/domain

tcp        0      0 172.29.1.20:9160   0.0.0.0:*           LISTEN      14072/java

tcp        0      0 172.29.1.20:2888   0.0.0.0:*           LISTEN      14072/java

tcp        0      0 172.29.1.20:2888   172.29.1.20:55622   ESTABLISHED 14072/java

tcp        0      0 172.29.1.20:9160   172.29.1.20:52567   ESTABLISHED 14072/java

tcp        0      0 172.29.1.20:52566  172.29.1.20:9160    ESTABLISHED 14038/domain

tcp        0      0 172.29.1.20:443    172.17.21.9:46438   ESTABLISHED 14067/domain

 

The show network connections output shown in the preceding block is an example from a healthy Controller. If you find some of these missing, it’s likely that NSX didn’t get past its install phase. Here are some misconfigurations that can cause this:

Bad management address or listen IP

You’ve set an incorrect IP as the management-address, or as the listen-ip for one of the roles (like switch_manager or api_provider).

NSX attempts to bind to the specified address, and fails early if it cannot do so.  You’ll see log messages in cloudnet_cpp.log.ERROR like:

E0506 01:20:17.099596  7188 dso-deployer.cc:516] Controller component installation of rpc-broker failed: Unable to bind a RPC port $tags:tracing:3ef7d1f519ffb7fb^

E0506 01:20:17.100162  7188 main.cc:271] RPC deployment subsystem not installed; exiting. $tags:tracing:3ef7d1f519ffb7fb^

Or in cloudnet_cpp.log.WARNING:

W0506 01:22:27.721777  7694 ssl-socket.cc:530] SSLSocket failed to bind to 172.1.1.1:6632: Cannot assign requested address

Note that if you are using DHCP for the IP addresses of your controller nodes (not recommended or supported), the IP address could have changed since the last time you configured it.

Verify that the IP addresses for switch_manager and api_provider are what they are supposed to be by performing the CLI command:

<switch_manager|api_provider>  listen-ip

 

Bad first node address

You’ve provided the wrong IP address for the first node in the Controller Cluster. Run

show control-cluster startup-nodes

to determine whether the IPs listed correspond to the IPs of the Controllers in the Controller Cluster.

 

Out of disk space

The Controller may be out of disk space. Use the

show status

command to see if any of the partitions have 0 bytes available.

The NSX CLI command show system statistics can be used to display resource utilization for disk space, disk I/O, memory, CPU and various other processes on the Controller Nodes. The command offers statistics with one-minute intervals for a window of one hour for various combinations. The show system statistics CLI command does auto-completion and can be used to view the list of metric data available.

show system statistics <datasource>       : for the tabular output
show system statistics graph <datasource> : for the graphical format output

 

As an example, the following output shows the RRD statistics for the datasource disk_ops:write associated with the disk sda1 on the Controller in a tabular form:

# show system statistics disk-sda1/disk_ops:write

Time  Write

12:29             0.74

12:28         0.731429

12:27         0.617143

12:26         0.665714  <snip>

 

More commands:

# show network interface
# show network default-gateway
# show network dns-servers
# show network ntp-servers
# show network ntp-status
# traceroute <ip_address or dns_name>
# ping <ip address>
# ping interface addr <alternate_src_ip> <ip_address>
# watch network interface breth0 traffic