NSX – Distributed Logical Router Deep Dive

Overview

In today’s datacenter, the physical router is an essential building block of a workable network design, and we need to provide similar functionality in virtual networking. Routing between IP subnets can be performed in the logical space without traffic leaving the virtual infrastructure for a physical router. This routing is performed in the hypervisor kernel with minimal CPU and memory overhead, providing an optimal data path for routing traffic within the virtual infrastructure. The distributed routing capability in the NSX-v platform provides an optimized and scalable way of handling East-West traffic within a data center. East-West traffic is the communication between virtual machines within the datacenter, and the amount of this traffic is growing: modern collaborative, distributed, and service-oriented application architectures demand higher bandwidth for server-to-server communication.

If these servers are virtual machines running on hypervisors and are connected to different subnets, the communication between them has to go through a router. If a physical router provides that routing, the virtual machine traffic has to leave the host, reach the physical router, and come back in after the routing decision has been made. This is obviously not an optimal traffic flow and is sometimes referred to as “hair pinning”.

The distributed routing on the NSX-v platform prevents this “hair-pinning” by providing hypervisor-level routing functionality. Each hypervisor has a routing kernel module that performs routing between the Logical Interfaces (LIFs) defined on that distributed router instance.

The distributed logical router owns and manages the logical interfaces (LIFs). The LIF concept is similar to VLAN interfaces on a physical router, but on the distributed logical router these interfaces are called LIFs. A LIF connects to a logical switch or to a distributed port group. A single distributed logical router can have a maximum of 1,000 LIFs.

DLR Overview

DLR Interface Types

The DLR has three types of interfaces: Uplink, Internal (LIF), and Management.

Uplink: This is used by the DLR Control VM to connect to the upstream router. In most documentation it is also referred to as “transit”, because this interface is the transit between the logical space and the physical space. The DLR supports both OSPF and BGP on its Uplink interface, but cannot run both at the same time, and OSPF can be enabled on only a single Uplink interface.

LIFs: LIFs exist on the ESXi host at the kernel level. LIFs are the Layer 3 interfaces that act as the default gateway for all VMs connected to logical switches.

Management: The DLR management interface can be used for several purposes. The first is remote access to the DLR Control VM, for example via SSH. Another use case is High Availability, and the last is sending syslog information to a syslog server. The management interface is part of the Control VM’s routing table; there is no separate routing table for it. When we configure an IP address on the management interface, only devices on the same subnet as the management interface can reach the DLR Control VM management IP; remote devices will not be able to reach it.

DLR Interface Type

Note: If we just need an IP address to manage the DLR remotely, we can SSH to the DLR “Protocol Address” (explained later in this chapter); there is no need to configure a new IP address on the management interface.

Logical Interfaces, virtual MACs and the physical MAC:

Logical Interfaces (LIFs), including their IP addresses, are part of the DLR kernel module inside the ESXi host. Each LIF has an associated MAC address called the virtual MAC (vMAC). The vMAC is the MAC address of the LIF; it is the same across all ESXi hosts and is never seen by the physical network, only by virtual machines, which use it as their default gateway MAC address. The physical MAC (pMAC) is the MAC address of the uplink through which traffic flows to the physical network; when the DLR needs to route traffic outside of the ESXi host, it is the pMAC address that is used.
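To see how a LIF, its IP address, and the shared vMAC appear on a given host, the DLR kernel module can be queried from the ESXi CLI. This is a minimal sketch, assuming an NSX-v prepared host; “default+edge-1” is only an example instance name (the real name can be taken from the instance list shown later in the Multiple Route Instances section), and the exact output fields vary between NSX-v versions:

net-vdr --lif -l default+edge-1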

In the following figure, inside esxcomp-01a (an ESXi host) we have the DLR kernel module, and this DLR instance has two LIFs. Each LIF is associated with a logical switch, VXLAN 5001 and VXLAN 5002. From the perspective of VM1, the default gateway is LIF1 with IP address 172.16.10.1; VM2’s default gateway is LIF2 with IP address 172.16.20.1, and the vMAC is the same MAC address for both LIFs.

The LIF IP addresses and the vMAC are the same across all NSX-v prepared hosts for the same DLR instance.

DLR and vMotion

When VM2 is vMotioned from esxcomp-01a to esxcomp-01b, VM2 keeps the same default gateway (LIF2), which is associated with the same vMAC, so from the perspective of VM2 nothing has changed.

 

DLR Kernel module and ARP table

The DLR does not communicate with the NSX-v Controller to figure out the MAC addresses of VMs. Instead, it sends an ARP request to all of the ESXi host VTEP members on that logical switch. The VTEPs that receive this ARP request forward it to the VMs on that logical switch.

In the following figure, if VM1 needs to communicate with VM2, this traffic is routed inside the DLR kernel module at esxcomp-01a, and this DLR needs to know the MAC addresses of VM1 and VM2. The DLR therefore sends an ARP request to all VTEP members on VXLAN 5002 to learn the MAC address of VM2. The DLR keeps the ARP entry for 600 seconds, which is its aging time.

DLR Kernel module and ARP table

Note: The DLR instance may have different ARP entries between different ESXi hosts. Each DLR Kernel module maintains its own ARP table.
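The per-host ARP table of a DLR instance can also be dumped from the ESXi CLI. A hedged sketch, reusing the same example instance name as before; consult the NSX-v troubleshooting documentation for the exact options in your version:

net-vdr --nbr -l default+edge-1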

DLR and local routing

Since the DLR is distributed, each ESXi host has a route-instance that can route traffic. When VM1 needs to send traffic to VM2, theoretically both the DLR instance in esxcomp-01a and the one in esxcomp-01b could route the traffic, as shown in the following figure. In NSX-v, the DLR always routes the traffic locally, on the host where the source VM resides!

When VM1 sends a packet to VM2, the DLR in esxcomp-01a will route the traffic from VXLAN 5001 to VXLAN 5002 because VM1 has initiated the traffic.

DLR Local Routing

The following illustration shows that when VM2 replies to VM1, the DLR at esxcomp-01b routes the traffic, because VM2 is local to the DLR instance at esxcomp-01b.

Note: the actual traffic between the ESXi hosts will flow via VTEP’s.

DLR Local Routing


Multiple Route Instances

The Distributed Logical Router (DLR) has two components: the DLR Control VM, which is a virtual machine, and the DLR kernel module, which runs in every prepared ESXi hypervisor. This DLR kernel module, called a route-instance, holds the same copy of information on every ESXi host and works at the kernel level. There is one route-instance per DLR inside an ESXi host, but a host is not limited to a single route-instance; it runs one for each DLR it participates in.

The following figure shows two DLR Control VMs, DLR Control VM1 on the right and DLR Control VM2 on the left. Each Control VM has its own route-instance in the ESXi hosts. In esxcomp-01a we have route-instance 1, which is managed by DLR Control VM1, and route-instance 2, which is managed by Control VM2; the same applies to esxcomp-01b. Each DLR instance manages its own set of LIFs: DLR Control VM1 manages the LIFs in VXLAN 5001 and 5002, and DLR Control VM2 manages the LIFs in VXLAN 5003 and 5004.

Multiple Route Instances
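To see which route-instances exist on a particular host, the instance list can be dumped from the ESXi CLI. A minimal sketch (the output format depends on the NSX-v version):

net-vdr --instance -l

Each DLR deployed in NSX-v should appear here as a separate instance with its own LIFs and routing table.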

Logical Router Port

Regardless of the number of route-instances we have inside an ESXi host, there is one special port called the “Logical Router Port” or “vdr Port”.

This port works like the “router on a stick” concept: all routed traffic passes through it. We can think of a route-instance as being like VRF Lite, because each route-instance has its own LIFs and routing table; LIF IP addresses can even overlap between route-instances.

In the following figure we have an example of an ESXi host with two route-instances, where route-instance-1 has the same IP addresses as route-instance-2 but on different VXLANs.

Note: Different DLRs cannot share the same VXLAN

DLR vdr port

Routing Information Control Plane Update Flow

We need to understand how a route is configured and pushed from the DLR control VM to the ESXi hosts. Let’s look at the following figure to understand the flow.

Step 1: An end user configures a new DLR Control VM. This DLR will have LIFs (Logical Interfaces) and either static routes or a dynamic routing protocol peering with the NSX-v Edge Services Gateway device.

Step 2: The DLR LIF configuration information is pushed to all ESXi hosts in the clusters that have been prepared by the NSX-v platform. If more than one route-instance exists on a host, the LIF information is sent only to the relevant instance.

At this point, VMs in different VXLANs (East-West traffic) can communicate with each other.

Step 3: The NSX-v Edge Services gateway (ESG) will update the DLR control VM about new routes.

Step 4: The DLR Control VM will update the NSX-v Controller (via the UWA) with Routing Information Base (RIB) updates.

Step 5: The NSX-v Controller then pushes the RIB to all ESXi hosts that have been prepared by the NSX-v platform. If more than one route-instance exists, the RIB information is sent only to the relevant instance.

Step 6: The route-instance on each ESXi host builds its Forwarding Information Base (FIB) and handles the data-path traffic.

Routing Information Control Plane Update Flow
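Once step 6 has completed, the pushed routes should be visible inside the route-instance on every prepared host. A hedged verification sketch from the ESXi CLI, again using the example instance name from earlier:

net-vdr --route -l default+edge-1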

DLR Control VM communications

The DLR Control VM is a virtual machine that is typically deployed in the Management or Edge cluster. When an ESXi host is prepared by the NSX-v platform, one of the VIBs creates the control plane channel between the ESXi host and the NSX-v Controllers. The service daemon inside the ESXi host responsible for this channel is called netcpad, and it is also commonly referred to as the User World Agent (UWA).

netcpad handles the communication between the NSX-v Controller and the ESXi host; over this channel the host learns MAC/IP/VTEP address information used for VXLAN communication. The communication with the NSX-v Controller on the control plane is secured with SSL. The UWA can connect to multiple NSX-v Controller instances and maintains its logs at /var/log/netcpa.log
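A quick way to confirm that the UWA is running and logging on a host is to use the standard ESXi service and log commands (paths as described above):

/etc/init.d/netcpad status
tail /var/log/netcpa.log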

 

Another service daemon, called the vShield-Stateful-Firewall, is responsible for interacting with the NSX-v Manager. This daemon receives configuration information from the NSX-v Manager to create (or delete) the DLR Control VM and to create (or delete) the ESG. Besides that, it also performs NSX-v firewall tasks: it retrieves the DFW policy rules, gathers DFW statistics and sends them to the NSX-v Manager, and sends audit logs and information to the NSX-v Manager. As part of the host preparation process it also handles SSL-related tasks from the NSX-v Manager.

The DLR Control VM maintains two VMCI sockets to the user world agents (UWA) on the ESXi host it resides on. The first VMCI socket connects to the vShield-Stateful-Firewall service daemon on the host and is used to receive configuration updates from the NSX-v Manager for the DLR Control VM itself; the second connects to netcpad for control plane access to the Controllers.

VMCI provides local communication whereby a guest virtual machine can communicate with the hypervisor it resides on, but not with other ESXi hosts.

On this basis the routing update happens in the following manner:

  • Step (1): The DLR Control VM learns new route information (from a dynamic routing protocol, for example) that needs to be sent to the NSX-v Controller.
  • Step (2): The DLR Control VM uses the internal channel inside the ESXi host called the “Virtual Machine Communication Interface” (VMCI). VMCI opens a socket to transfer the learned routes, as Routing Information Base (RIB) updates, to the netcpad service daemon.
  • Step (3): The netcpad service daemon sends the RIB information to the NSX-v Controller. The routing information flows through the management VMkernel interface of the ESXi host, which means the NSX-v Controllers do not need a new interface to communicate with the DLR Control VM. The protocol and port used for this communication is TCP/1234 (a verification sketch follows the figure below).
  • Step (4): The NSX-v Controller forwards the DLR RIB to the netcpad service daemons on all ESXi hosts.
  • Step (5): netcpad programs the resulting FIB into the DLR route-instance.
DLR Control VM communications
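As mentioned in step (3), the control plane sessions toward the Controllers use TCP/1234 and leave the host through the management VMkernel interface. A minimal sketch to confirm these sessions from the ESXi CLI:

esxcli network ip connection list | grep 1234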

DLR High Availability

The DLR Control VM High Availability (HA) feature provides redundancy at the VM level. The HA mode is active/passive: the active DLR Control VM holds the IP address, and if it fails the passive DLR Control VM takes ownership of the IP address (a flip event). The DLR route-instance, with its LIFs and IP addresses, exists on the ESXi hosts as a kernel module and is not part of this active/passive flip event.

The active DLR Control VM syncs its forwarding table to the secondary DLR Control VM. If the active fails, forwarding continues on the secondary unit until the secondary DLR re-establishes the adjacency with the upstream router.

The HA heartbeat message is sent through the DLR management interface, so we must have L2 connectivity between the active and secondary DLR Control VMs. The active/passive IP addresses are assigned automatically from a /30 subnet when we deploy HA. The default failover detection time is 15 seconds, but it can be lowered to 6 seconds. The heartbeat uses UDP port 694.

DLR High Availability

You can also verify the HA status by running the following command:

DLR HA verification commands:

$ show service highavailability

$ show service highavailability connection-sync

$ show service highavailability link

Protocol Address and Forwarding Address

The protocol address is the IP address of the DLR Control VM. This control plane address is what actually establishes the OSPF or BGP peering with the ESGs. The following figure shows OSPF as an example:

Protocol Address and Forwarding Address

The following figure shows that the DLR forwarding address is the IP address used as the next hop by the ESGs.

Protocol Address and Forwarding Address

DLR Control VM Firewall

The DLR Control VM can protect its Management and Uplink interfaces with its built-in firewall. Any device that needs to communicate with the DLR Control VM itself requires a firewall rule to permit the traffic.

For example, SSH to the DLR Control VM, or even OSPF adjacencies with the upstream router, will need a firewall rule. We can enable or disable the DLR Control VM firewall globally.

Note: do not confuse DLR Control VM firewall rules with NSX-v Distributed Firewall rules. The following image shows the firewall rules for the DLR Control VM.

DLR Control VM Firewall

Creating DLR

The first step is to create the DLR Control VM.

We need to go to Network and Security -> NSX Edges and click on the green + button.

Here we need to select “Logical (Distributed) Router”.

 

Creating DLR

Specify the username and password; we can also enable SSH access:

DLR CLI Credentials

We need to specify where we want to place the DLR Control VM:

place the DLR Control VM

We need to specify the management interface and the logical interfaces (LIFs).

The management interface is used for SSH access to the Control VM.

The LIFs are configured in the second table, below “Configure Interfaces of this NSX Edge”.

Configure Interfaces of this DLR

Configuring a LIF is done by connecting the interface to a “Logical Switch”:

Connected Lif to DLR

Configure the Up-Link Transit Lif:

Configure Up-Link Lif

Configure the Web Lif:

Configure the web Lif

Configure the App Lif:

Configure the App Lif

Configure the DB Lif:

Configure the DB Lif

Summary of all DLR Lif’s:

Summary of all DLR Lif’s

The DLR Control VM can work in High Availability mode; in our lab we will not enable HA:

DLR High Availability

Summary of DLR configuration:

Summary of DLR configuration

 

DLR Intermediate step

After completing the DLR deployment, we have created four different LIFs:

Transit-Network-01, Web-Tier-01, App-Tier-01, DB-Tier-01

All these LIFs span all of our ESXi clusters.

So, for example, a virtual machine connected to the logical switch “App-Tier-01” will have a default gateway of 172.16.20.1 regardless of where this VM is located in the datacenter.

DLR Intermediate step
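A simple sanity check at this stage, assuming a Linux VM connected to App-Tier-01 (the VM itself is hypothetical): it should be able to ping its own LIF default gateway and, thanks to distributed routing, the Web LIF on another subnet as well:

ping -c 3 172.16.20.1
ping -c 3 172.16.10.1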

 

DLR Routing verification

We can verify that the NSX Controller has received the DLR LIF IP addresses for each VXLAN logical switch.

From the NSX Controller, run this command: show control-cluster logical-routers instance all

DLR Routing verification

The LR-Id “1460487505” is the internal ID of the DLR Control VM.

To verify all the DLR LIF interfaces, run this command: show control-cluster logical-routers interface-summary <LR-Id>

In our lab:

show control-cluster logical-routers interface-summary 1460487505

DLR Routing verification
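Later in this post, once OSPF is up, the Controller should also hold the routes it has learned for this DLR instance. The following is a hedged sketch using the LR-Id from our lab; the exact sub-command may differ between controller builds:

show control-cluster logical-routers routes 1460487505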

 

Configure OSPF on DLR

Under NSX Edges, click on the DLR (Type: Logical Router).

Configure OSPF on DLR

Go to Manage -> Routing -> OSPF and click “Edit”.

Configure OSPF on DLR

Type in the Protocol Address and Forwarding Address.

Do not check the “Enable OSPF” checkbox yet!

Protocol Address and Forwarding Address

The protocol address is the IP address of the DLR Control VM; this control plane address is what actually establishes the OSPF peering with the NSX Edge.

The forwarding address is the IP address that the NSX Edge uses as the next hop to forward packets to the DLR:

DLR Forwarding Address

Click on “Publish Changes”:

Publish Changes

The results will look like this:

DLR

Go to “Global Configuration”:

Global Configuration

Type the Default Gateway for DLR (Next hop NSX Edge):

Default Gateway

Enable the OSPF:

Enable the OSPF

Then click on “Publish Changes”.

Go back to “OSPF”, to “Area to Interface Mapping”, and add the Transit-Uplink interface to Area 51:

Area to Interface Mapping

Click on “Publish Changes”.

Go to Route Redistribution and make sure OSPF is enabled:

Route Redistribution

Deploy NSX Edge

In our lab we will use an NSX Edge as the next hop for the DLR, but it could also be a physical router.

The NSX Edge is a virtual appliance that offers L2 and L3 services, a perimeter firewall, load balancing, and other services such as SSL VPN, DHCP, etc.

We will use this Edge for Dynamic Routing.

 

Go to “NSX Edges” and click on the green plus button.

Select “Edge Services Gateway” and fill in the name and hostname for this Edge.

If we would like to use a redundant Edge, we need to check “Enable High Availability”.

NSX Edge

Put your username and password:

username and password

Select the Size of the NSX Edge:

NSX Edge size

Select where to install the Edge:

Configure the Network Interfaces:

Configure the Network Interfaces

Configure the Mgmt interface:

Configure the Mgmt interface

Configure the Transit interface:

Configure the Transit interface (toward DLR)

Configure Default Gateway:

Edge Default Gateway

 

Set Firewall Default policy to permit all traffic:

Firewall Default policy to permit all traffic

Summary of Edge Configuration:

Summary of Edge Configuration

Configure OSPF at NSX Edge:

Configure OSPF at NSX Edge

Enable OSPF at “Global Configuration”:

Enable OSPF at “Global Configuration”

In the “Dynamic Routing Configuration” section, click “Edit”.

For the “Router ID”, select the interface whose IP address will be used as the OSPF Router ID.

Check “Enable OSPF”:

 

Enable OSPF

Publish the changes, then go to “OSPF” and add the Transit network to Area 51 in the interface mapping section:

Map Interface to OSPF Area

 

Click “Publish”

Make sure the OSPF status is “Enabled” and that the red button on the right shows “Disable”.

Getting the full picture

 

Getting the full picture

 

Dynamic OSPF Routing Verification

Open the Edge CLI

The Edge has an OSPF neighbor adjacency with 192.168.10.3; this is the DLR Control VM IP address (the protocol address).

Edge OSPF verification
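The adjacency in the figure can be reproduced from the Edge CLI with the standard command below; the neighbor address shown should be the DLR protocol address (192.168.10.3 in this lab):

show ip ospf neighbor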

The NSX Edge received OSPF routes from the DLR.

From the Edge’s perspective, the next hop to the DLR is the forwarding address, 192.168.10.2.

Edge OSPF Routing Verification
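The routing table behind this figure can be displayed with the standard Edge CLI command below; the OSPF routes for the Web, App, and DB subnets should all point at the DLR forwarding address 192.168.10.2:

show ip route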

 

Related Posts:

NSX Manager

NSX Controller

Host Preparation

Logical Switch

Distributed Logical Router

 

Thanks to:

Shachar Bobrovskye, Michael Haines, and Prasenjit Sarkar for contributing to this post.

Offer Nissim for reviewing this post

 

To find out more about distributed dynamic routing, I recommend reading the blogs of two colleagues of mine:

Brad Hedlund

http://bradhedlund.com/2013/11/20/distributed-virtual-and-physical-routing-in-vmware-nsx-for-vsphere/

Antony Burke

http://networkinferno.net/nsx-compendium
