NSX-v Host Preparation

The information in this post is based on my professional NSX experience in the field and on a lecture by Kevin Barrass, an NSX solution architect.

Thanks to Tiran Efrat for reviewing this post.

Host preparation overview

Host preparation is the process in which the NSX Manager installs the NSX kernel modules on the ESXi hosts of a vSphere cluster and builds the NSX control plane fabric.

Before the host preparation process we need to complete:

  1. Register the NSX Manager in the vCenter. This process was covered in NSX-V Troubleshooting registration to vCenter.
  2. Deploy the NSX Controllers, covered in deploying-nsx-v-controller-disappear-from-vsphere-client

Three components are involved in NSX host preparation:
vCenter, NSX Manager, and EAM (ESX Agent Manager).

Host Preparation 1

vCenter Server:
Management of vSphere compute infrastructure.

NSX Manager:
Provides the single point of configuration and REST API entry-points in a vSphere environment for NSX.

EAM (ESX Agent Manager):
The middleware component between the NSX Manager and the vCenter. The EAM is part of the vCenter and is responsible for installing the VIBs (vSphere Installation Bundles), which are software packages prepared to be installed on an ESXi host.

Host Preparation process

Host preparation begins when we click “Install” in the vCenter GUI.

host preparation


This process is done at the vSphere cluster level, not per ESXi host. The EAM creates an agent to track the VIB installation process for each host. The VIBs are copied from the NSX Manager and cached in EAM.
If the VIBs are not present on the ESXi host, the EAM installs them (no ESXi host reboot is needed).
If an older version of the VIBs is installed, the EAM removes it, but an ESXi host reboot is then needed.

VIBs installed during host preparation:
esx-dvfilter-switch-security
esx-vsip
esx-vxlan
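
A quick way to confirm that the three VIBs listed above are present is to look at the output of “esxcli software vib list” on the host. The snippet below is only a minimal sketch: the function name and the sample output are made up for illustration, and it assumes the VIB name is the first column of each line.

```python
# The three VIB names come from the list above; everything else is illustrative.
EXPECTED_VIBS = ("esx-dvfilter-switch-security", "esx-vsip", "esx-vxlan")


def missing_vibs(vib_list_output, expected=EXPECTED_VIBS):
    """Return the expected VIB names that do not appear in the first column."""
    installed = set()
    for line in vib_list_output.splitlines():
        fields = line.split()
        if fields:
            installed.add(fields[0])
    return [name for name in expected if name not in installed]


# Hypothetical sample output, not captured from a real host.
SAMPLE = """\
Name       Version  Vendor  Acceptance Level  Install Date
---------  -------  ------  ----------------  ------------
esx-vsip   5.5.0-0  VMware  VMwareCertified   2014-10-01
esx-vxlan  5.5.0-0  VMware  VMwareCertified   2014-10-01
"""

print(missing_vibs(SAMPLE))  # ['esx-dvfilter-switch-security']
```

An empty list means all three VIBs were found in the text that was passed in.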

The ESXi host has a fully working control plane after host preparation completes successfully.

Two control plane channels will be created:

  • RabbitMQ message bus: provides communication between the vsfwd process on the ESXi hypervisor and the NSX Manager over TCP/5671.
  • User World Agent (UWA) process (netcpa on the ESXi hypervisor): establishes SSL communication channels over TCP/1234 to the Controller Cluster nodes.

Host Preparation 2

Troubleshooting Host Preparation

DNS:

EAM fails to deploy the VIBs because DNS is misconfigured or not configured at all on the host.
We may get a status of “Not Ready”:

Not Ready

This indicates “Agent VIB module not installed” on one or more hosts.

We can check the vSphere ESX Agent Manager for errors:

“vCenter home > vCenter Solutions Manager > vSphere ESX Agent Manager”

On “vSphere ESX Agent Manager”, check the status of the “Agencies” prefixed with “_VCNS_153”. If any of the agencies has a bad status, select the agency and view its issues:

EAM

For more details on host preparation issues, check the associated log, /var/log/esxupdate.log, on the ESXi host.
Log in to the host that has the issue and run “tail /var/log/esxupdate.log” to view the log.

esxupdate error1

Solution:
Configure the DNS settings on the ESXi host so that NSX host preparation can succeed.
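
Before retrying the installation, it can be worth confirming that the names involved actually resolve. The following is a minimal sketch, not an official check: the function name is made up, and the host names in the usage comment are placeholders for your own vCenter and NSX Manager FQDNs.

```python
import socket


def check_resolution(names, resolver=socket.gethostbyname):
    """Map each name to its resolved IPv4 address, or None if the lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


# Example usage (hypothetical names, replace with your own):
#   for name, ip in check_resolution(["vcenter.example", "nsxmgr.example"]).items():
#       print(name, "->", ip or "NOT RESOLVED")
```

The resolver is passed in as a parameter only so that the function can be exercised without a live DNS server.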

 

TCP/80 from ESXi to vCenter is blocked:

The ESXi host is unable to connect to the vCenter EAM on TCP/80.

This could be caused by a firewall blocking this port. From the /var/log/esxupdate.log file on the ESXi host:

esxupdate: esxupdate: ERROR: MetadataDownloadError: (‘http://VC_IP_Address:80/eam/vib?id=xxx-xxx-xxx-xxx), None, “( http://VC_IP_Address:80/eam/vib?id=xxx-xxx-xxx-xxx), ‘/tmp/tmp_TKl58’, ‘[Errno 4] IOError: <urlopen error [Errno 111] Connection refused>’)”)

Solution:
NSX-v has a list of ports that need to be open for host preparation to succeed.
The complete list can be found in:
https://communities.vmware.com/docs/DOC-28142
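
When going through a long esxupdate.log, pulling the URL and the innermost error out of the MetadataDownloadError line makes the cause easier to spot. This is a rough sketch under some assumptions: the function and pattern names are made up, and the sample line is the one shown above with its quote characters normalized to plain ASCII.

```python
import re

# Stop the URL at whitespace, quotes (plain or typographic), ')' or ','.
URL_RE = re.compile(r"https?://[^\s'\"‘’“”),]+")
# Capture every "[Errno N] text" pair; the text ends at brackets or quotes.
ERRNO_RE = re.compile(r"\[Errno (\d+)\] ([^<>\[\]'‘’]+)")


def summarize_download_error(line):
    """Return (url, errno, reason) for a MetadataDownloadError line, else None."""
    if "MetadataDownloadError" not in line:
        return None
    url = URL_RE.search(line)
    errors = ERRNO_RE.findall(line)
    if not url or not errors:
        return None
    code, reason = errors[-1]  # the innermost error is the most specific one
    return url.group(0), int(code), reason.strip()


LINE = (
    "esxupdate: esxupdate: ERROR: MetadataDownloadError: "
    "('http://VC_IP_Address:80/eam/vib?id=xxx-xxx-xxx-xxx', None, "
    "\"('http://VC_IP_Address:80/eam/vib?id=xxx-xxx-xxx-xxx', '/tmp/tmp_TKl58', "
    "'[Errno 4] IOError: <urlopen error [Errno 111] Connection refused>')\")"
)

print(summarize_download_error(LINE))
# ('http://VC_IP_Address:80/eam/vib?id=xxx-xxx-xxx-xxx', 111, 'Connection refused')
```

Errno 111 (connection refused) on the EAM URL points at the port being blocked or nothing listening on it, which matches the case described here.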

 

Older VIB version:

If an older version of the VIBs exists on the ESXi host, EAM will remove the old VIBs, but host preparation will not continue automatically.

Solution:
Reboot the ESXi host to complete the process.

 

ESXi Bootbank Space issue:

If you upgrade ESXi 5.1u1 to ESXi 5.5 and then start NSX host preparation, you may hit an issue. In the /var/log/esxupdate.log file you will see a message like:
“Installationerror: the pending transaction required 240MB free space, however the maximum size is 239 MB”
I faced this issue with a custom ISO for IBM blades, but it may appear with other vendors.

Solution:
Install a fresh ESXi 5.5 from the custom ISO (this is the version I upgraded to).

 

vCenter on Windows, EAM TCP/80 taken by other application:

If the vCenter runs on a Windows machine, other installed applications may use port 80, causing a conflict with the EAM port TCP/80.

For example, by default the IIS server uses TCP/80.

Solution:
Use a different port for EAM:

Change the port from 80 to an unused port in eam.properties in \ProgramFiles\VMware\Infrastructure\tomcat\webapps\eam\WEB-INF\
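
To see whether a given port is already taken on the vCenter machine, one simple test is to try binding to it. The sketch below is illustrative only: the function name is made up, and on systems where low ports need elevated privileges a failed bind may mean “not permitted” rather than “in use”.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if we can bind a TCP socket to (host, port), else False."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Already in use, or binding is not permitted for this user.
            return False
    return True


# Example usage: check the default EAM port and a candidate replacement.
#   print(port_is_free(80), port_is_free(8080))
```

Port 8080 in the usage comment is just an example candidate, not a recommended value.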

 

UWA Agent Issues:

In rare cases the installation of the VIBs succeeds, but one or both of the userworld agents do not function correctly. This could manifest itself as:
The firewall showing a bad status, or the control plane between the hypervisor(s) and the controllers being down.
UWA error

If the message bus service is active on the NSX Manager:

Check the status of the message bus userworld agent by running the command /etc/init.d/vShield-Stateful-Firewall status on the ESXi hosts.

vShield-Stateful-Firewall

Check the message bus userworld logs on the hosts at /var/log/vsfwd.log.

esxcfg-advcfg -l | grep Rmq

Run this command on the ESXi hosts to show all Rmq variables; there should be 16 variables in total.

esxcfg-advcfg -g /UserVars/RmqIpAddress

Run this command on the ESXi hosts; it should display the NSX Manager IP address.

RmqIpAddress

Run this command on the ESXi hosts to check for an active message bus connection:

esxcli network ip connection list | grep 5671 (Message bus TCP connection)

network connection
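
If the connection list is long, the same check can be done on saved output for both control plane ports (TCP/5671 to the NSX Manager and TCP/1234 to the controllers). This is a sketch under assumptions: the function name is made up, the sample rows are invented, and it assumes the foreign address is the fifth column and the state the sixth, as in the usual “esxcli network ip connection list” layout.

```python
def established_to_port(connection_list_output, port):
    """Return foreign addresses of ESTABLISHED TCP connections to `port`."""
    peers = []
    for line in connection_list_output.splitlines():
        fields = line.split()
        # Skip the header, the separator line and non-TCP rows.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        if state == "ESTABLISHED" and foreign.rsplit(":", 1)[-1] == str(port):
            peers.append(foreign)
    return peers


# Hypothetical sample rows, not captured from a real host.
SAMPLE = """\
Proto  Recv Q  Send Q  Local Address         Foreign Address       State        World ID  CC Algo  World Name
tcp    0       0       192.168.110.51:23251  192.168.110.42:5671   ESTABLISHED  35185     newreno  vsfwd
tcp    0       0       192.168.110.51:56132  192.168.110.31:1234   ESTABLISHED  36754     newreno  netcpa-worker
tcp    0       0       192.168.110.51:22     192.168.110.10:51000  ESTABLISHED  34001     newreno  sshd
"""

print(established_to_port(SAMPLE, 5671))  # ['192.168.110.42:5671']
print(established_to_port(SAMPLE, 1234))  # ['192.168.110.31:1234']
```

An empty list for 5671 suggests the vsfwd channel to the NSX Manager is down; an empty list for 1234 suggests netcpa has no session to any controller.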

 

 

The NSX Manager has a direct link to download the VIBs as a zip file:

https://$nsxmgr/bin/vdn/vibs/5.5/vxlan.zip

 

Reverting an NSX prepared ESXi host:

Remove the host from the vSphere cluster:

Put the ESXi host in maintenance mode and remove it from the cluster. This will automatically uninstall the NSX VIBs.

Note: ESXi host must be rebooted to complete the operation.

 

Manually uninstall the VIBs:

esxcli software vib remove -n esx-vxlan

esxcli software vib remove -n esx-vsip

esxcli software vib remove -n dvfilter-switch-security

Note: ESXi host must be rebooted to complete the operation

Asymmetric routing with ECMP and Edge Firewall Enabled

What is Asymmetric Routing?

In Asymmetric routing, a packet traverses from a source to a destination in one path and takes a different path when it returns to the source.

Starting from version 6.1, NSX Edge can work with ECMP (Equal Cost Multipath). ECMP traffic involves asymmetric routing between the Edges and the DLR, or between the Edges and the physical routers.

ECMP Consideration with Asymmetric Routing

ECMP with asymmetric routing is not a problem by itself, but it causes problems when more than one NSX Edge is in place and stateful services are inserted in the path of the traffic.

Stateful services like firewall, load balancing, and Network Address Translation (NAT) can’t work with asymmetric routing.

The problem explained:

A user from outside tries to access the Web VM inside the data center. The traffic passes through the E1 Edge.

From E1 the traffic goes to the DLR, traverses the NSX distributed firewall, and gets to the Web VM.

When the Web VM responds, the traffic hits its default gateway on the DLR. The DLR has two options to route the traffic: E1 or E2.

If the DLR chooses E2, the traffic will reach E2 and will be dropped!

The reason is that E2 is not aware of the state of the session that started at E1, so reply packets from the Web VM arriving at E2 do not match any existing session on E2.
From E2's perspective this is a new session that needs to be validated. Any new TCP session should start with a SYN; since this is not the beginning of the session, E2 will drop it!

Asymmetric Routing with Edge Firewall Enabled
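
The drop described above can be reproduced with a toy model. This is only an illustration of the stateful check, not how the Edge firewall is implemented: the class, the flag strings and the addresses are all made up.

```python
class StatefulEdge:
    """Toy stateful firewall: accepts a flow only if it saw the flow start."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()

    def process(self, src, dst, flags):
        flow = frozenset((src, dst))  # same key for both directions
        if flow in self.sessions:
            return "ACCEPT"  # packet belongs to a known session
        if flags == "SYN":
            self.sessions.add(flow)  # a new session must start with a SYN
            return "ACCEPT"
        return "DROP"  # no session and not a SYN: invalid state


e1, e2 = StatefulEdge("E1"), StatefulEdge("E2")
client, web = "198.51.100.10:40000", "10.0.1.10:80"  # illustrative addresses

print(e1.process(client, web, "SYN"))      # ACCEPT: session created on E1
print(e2.process(web, client, "SYN-ACK"))  # DROP: E2 never saw the SYN
print(e1.process(web, client, "SYN-ACK"))  # ACCEPT: symmetric return path
```

The reply is accepted only when it returns through the Edge that holds the session, which is exactly what ECMP does not guarantee.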


Note: The NSX distributed firewall is not part of this problem. It is implemented at the vNIC level, and all traffic goes in and out through the same vNIC.

There is no asymmetric routing at the vNIC level. By the way, this is the reason that when we vMotion a VM, the firewall rules and connection state move with the VM itself.

ECMP and Edge Firewall NSX

Starting from version 6.1, when we enable ECMP on an NSX Edge we get this message:

Enable ECMP in 6.1 version

The firewall service is disabled by default:

Enable ECMP in 6.1 version Firewall turnoff

Even if you try to enable it, you will get a warning message:

Firewall Service in 6.1 with ECMP

In version 6.1.2, when we enable ECMP we get the same message:

Enable ECMP in 6.1 version

But the BIG difference is that the firewall service is NOT disabled by default (you need to turn it off yourself).

Even if you have an “Any, Any” rule with an “Accept” action, packets will still be dropped because of the asymmetric routing problem!

Firewall Service Enable in 6.1.2

You will not see these dropped packets even in Syslog or Log Insight!

The end-user experience will be that some sessions work just fine (these sessions are not asymmetric) while other sessions are dropped (the asymmetric sessions).

The one place I found where we can learn that packets are dropped because of the session state is the output of the command show tech-support:

show tech-support
vShield Edge Firewall Packet Counters:
~~~~~~~~~~~~~~~ snip ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
rid    pkts bytes target     prot opt in     out     source               destination         
0        20  2388 ACCEPT     all  --  *      lo      0.0.0.0/0            0.0.0.0/0           
0        12   720 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            state INVALID
0        51  7108 block_out  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
0         0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            PHYSDEV match --physdev-in tap0 --physdev-out vNic_+
0         0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            PHYSDEV match --physdev-in vNic_+ --physdev-out tap0
0         0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            PHYSDEV match --physdev-in na+ --physdev-out vNic_+
0         0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            PHYSDEV match --physdev-in vNic_+ --physdev-out na+
0         0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED
0        51  7108 usr_rules  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
0         0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0

In line 7 of this output we can see packets being dropped because of the INVALID state (12 packets, 720 bytes).
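
Since these drops do not show up in the logs, the counter itself is the thing to watch. The sketch below sums the packet counters of DROP rules that match state INVALID in saved show tech-support output. The function name is made up, and it assumes plain integer counters in the second column, as in the output above.

```python
def invalid_state_drops(counter_output):
    """Sum the pkts column of DROP rules that match 'state INVALID'."""
    total = 0
    for line in counter_output.splitlines():
        fields = line.split()
        # Columns: rid pkts bytes target prot opt in out source destination ...
        if len(fields) < 4 or fields[3] != "DROP":
            continue
        if line.rstrip().endswith("state INVALID") and fields[1].isdigit():
            total += int(fields[1])
    return total


# The two DROP rules from the output above.
SAMPLE = """\
0        12   720 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            state INVALID
0         0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0
"""

print(invalid_state_drops(SAMPLE))  # 12
```

Taking two captures a few minutes apart and comparing the totals shows whether the Edge is still dropping asymmetric sessions.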

Conclusion:

When you enable ECMP and you have more than one NSX Edge in your topology, go to the firewall service and disable it yourself; otherwise you will spend lots of hours troubleshooting 🙁