Introduction to NSX and Kubernetes
The Evolution of Applications
Application architectures are ever changing. Over the years, we moved from monolithic architectures to multi-tier applications. Today the next generation of application architectures is centered around the concept of microservices.
This change is rooted in companies who are under pressure to be fast and agile in their development process. In the days of monolithic architectures, the application components like UI, APP, DB and Storage all resided in the same server.
From the development cycle point of view, all components needed to be packaged and tested as a single unit. If, for example, the UI team needed to make a small change in the code, it would force the entire stack to be redeployed.
With the rise of SOA and comparable architectures, the industry moved to multi-tier applications, where each layer of the application is broken down into separate tiers. For example: Web, App and DB tiers. This simplified the process of application development and helped with scaling. But it still kept a lot of dependencies between the tiers, forcing a coordinated redeployment of multiple tiers in ‘upgrade windows’.
In Micro-services, we separate the application functionality into even smaller parts. Each part of the code is running independently from each other and may be developed by different teams with different language. Micro-services communicate with each other using language-agnostic APIs (e.g. REST). The host for those Micro-services could be a VM, but containers evolved as the ideal packaging unit to deploy Micro-services because of their small footprint, quick startup times and easy creation process. Combining the concepts of micro services with the benefits of containers enables companies to easily deploy, upgrade and decommission parts of their application quickly and independently.
But it is important to note that a microservice is not the same thing as a container.
So, what is a Container?
Introduction to Containers
Containers are isolated processes running in user-space that share the same OS kernel. Containers help us to abstract the application layer by packaging the application and file dependency into a single image. Each container isn’t aware of the existence of other containers.
The Container Runtime is at the heart of container technology. It provides the ability to run processes in isolated environments. The Container Runtime is sometimes referred to as the “Container Engine”; it allows us to start and stop containers, manage resource allocations such as CPU shares and RAM for containers, and provides a RESTful API for external interactions.
Each container sees a complete file system that contains the binaries and set of libraries that allow the application to run. These files are completely isolated from other containers’ file systems.
In the industry, we have different container runtimes, the most popular being Docker:
But there is also rkt, pronounced “rocket” and developed by CoreOS:
Another popular choice is LXD, founded and led by Canonical Ltd.
A Container can run on top of a Linux OS or a Windows OS. The Guest OS itself can be virtualized with Hypervisor technology like VMware vSphere / KVM, or run directly on a physical server (bare metal).
In the following figure, we can see a Hypervisor running on top of a physical server.
On top of the hypervisor layer, we can deploy any guest OS as a VM, and we can run any type of software, including a container runtime that creates, starts and stops isolated processes – that is, containers.
Each Container Host VM is usually referred to as a Node in a lot of Container as a Service (CaaS) systems.
This figure shows just one physical server running separate OSs with containers in them, but what if the compute resources such as CPU and memory are constrained? In a scenario where we need to deploy more container nodes, we will have to get another physical server with the same structure of layers: hypervisor, guest OS, container runtime, etc.
This brings us to the next topic, known as “container orchestration and clustering”.
Running containers on a single Container host is sufficient if you only want to explore the technology. But moving from the exploration phase to a production environment will force you to consider clustering mechanisms in order to have better availability, performance and manageability of your container environment.
In the figure below we show four container nodes, each running as a VM. To scale up we can deploy more nodes by adding more node VMs.
If the physical server is experiencing high resource utilization we can add additional physical servers, and then deploy more node VMs on top of the Hypervisors.
Regardless of whether it’s a single server or multiple physical servers running nodes, the first question that we should ask is: “Do we manage each container node separately, or do we have some central component that sees the ‘big picture’ of the container environment?”
We can take it one step further, and ask what happens if one of the nodes fails? Do we have a mechanism to start the failed containers on a different node?
To solve these challenges the industry started to develop container orchestration and clustering technologies.
The idea behind a container cluster is to have a central place to manage our container environment. With this central management, we can see the “big picture” of our environment. This helps us to improve availability; e.g. in the scenario of a failed node, we can start the failed containers on a different node. When we deploy new containers, we can make a smart decision about where to place them based on free compute resources like CPU and memory.
We can also create a deployment rule that mandates that a group of containers should never be deployed on the same node, or just the opposite, that certain containers must be placed onto the same node.
If you come from a VMware background, the best analogy for a container cluster is vCenter Server with HA and DRS rules.
With cluster deployment and orchestration, we can scale up by adding more nodes, or scale down by removing nodes from the cluster.
In the industry, there are a few clustering technologies; the market leader is Kubernetes.
Kubernetes is an Open source solution originally developed by Google:
The second is SWARM, developed by Docker company:
The last is Mesos, open-source software originally developed at the University of California, Berkeley, and commercialized by Mesosphere:
In this introduction guide, we will focus on Kubernetes cluster technology.
Introduction to Kubernetes
Kubernetes is an open-source platform for deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure. With Kubernetes, you can:
- Deploy your applications quickly and predictably
- Scale your applications on the fly
- Seamlessly roll out new features
- Optimize use of your hardware by using only the resources you need
Role: Kubernetes is considered a Container as a Service (CaaS) or container orchestration layer
Kubernetes: or “K8s” for short, is the ancient Greek word for helmsman
K8s roots: Kubernetes was championed by Google and is now backed by major enterprise IT vendors and users (including VMware)
The short name of Kubernetes is K8s, because there are 8 letters between the “K” and the final “s”: K-ubernete-s = K8s.
Kubernetes Building block
A K8s cluster has one or more masters and one or more nodes. The master is responsible for the management and control plane tasks of the cluster. The nodes’ job is to run the containerized applications. Users can’t interact directly with the nodes; commands are allowed only via the master and can be executed via the CLI or an API call.
The deployment of containers in K8s is done via the set of CLI or API commands executed by the user. Usually the resources that we want to create in K8s are defined in a YAML-structured file that tells K8s what the desired state of our application is. For example, the YAML file could contain instructions to run a containerized application with 4 replicas. We call this the desired state of the application. The K8s master will take this YAML file as an input and deploy 4 instances of the containerized application on different nodes.
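As a minimal sketch (the application name and image here are hypothetical, not from the original post), such a desired state of 4 replicas could be declared in a Deployment manifest like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical application name
spec:
  replicas: 4                  # desired state: always run 4 instances
  selector:
    matchLabels:
      app: my-app
  template:                    # POD template used for every replica
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx:1.15      # hypothetical container image
```

Submitting this file to the API server (e.g. with the K8s CLI) hands the desired state to the master, which schedules the 4 instances across the available nodes.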
The K8s master and nodes usually run on top of some flavor of Linux; they don’t care whether the Linux OS runs on a physical server, in a virtual machine or in the cloud.
The Kubernetes master is built from a set of components, these components can run on a single master node, or can be replicated in order to support high-availability clusters.
The API server is the front-end to the control plane of the K8s cluster. The API server exposes an HTTP REST service that allows users to create, read, update and delete (CRUD) K8s resources. The API server is the target for all operations on the K8s data model. External API clients include the K8s CLI client, the dashboard web service, as well as various external and internal components that interact with the API server by ‘watching’ and ‘setting’ resources.
The Scheduler monitors container (POD) resources on the API server, and assigns nodes to run the PODs. We will talk about PODs later in more detail. The assignment of a POD can be based on different aspects such as free resources on the node, affinity and anti-affinity rule constraints, etc. Again, we can compare the scheduler to vSphere DRS technology.
Within the master, we have a few controllers running different tasks. The controllers managed by the controller manager are:
- Node Controller: Responsible for noticing and responding when nodes go down.
- Replication Controller: Responsible for maintaining the correct number of PODs for every replication controller object in the system.
- Endpoints Controller: Populates the Endpoints object (that is, joins Services & Pods).
- Service Account & Token Controllers: Create default accounts and API access tokens for new namespaces.
Etcd – Distributed Key-Value Store
Etcd is an open-source distributed key-value store used as the database of K8s; etcd is the source of truth for the cluster. All the cluster data is kept inside etcd; we can think of it as a special cluster DB. Etcd allows us to watch any resource inside the DB via the API server; when the value of a watched resource changes, etcd will report back.
The K8s nodes run the application containers and are managed from the master. Each node that is part of a K8s cluster has the following components:
The kubelet agent runs on the Linux OS. This is the main K8s agent on the nodes. Kubelet first registers itself with the master, and then starts watching for ‘PodSpecs’ to determine what it is supposed to run. It is a common misconception to think that the scheduler running inside the master instructs the kubelet to run the PODs. Instead, the scheduler assigns a POD to a node as a key/value in the PodSpec on the API server. Kubelet then notices this change, because it is also watching for changes to PodSpecs. It sees that there is a new POD that needs to be created and that it is the node that was selected to run this POD. Finally, when kubelet needs to run a POD, it instructs the container runtime to run containers through the container runtime API interface.
In a scenario where a POD fails to start or crashes, kubelet will report this event back to the master, and the scheduler then decides which node will be assigned to the new POD. Another way to see it is that the kubelet is not responsible for restarting the POD inside the node. The scheduler sees the big picture of the cluster and, based on the affinity rules and resource constraints, it decides where to place the new POD.
Docker is the most used container runtime in K8s. However, K8s is ‘runtime agnostic’, and the goal is to support any runtime through a standard interface (the Container Runtime Interface, CRI).
Rkt: Besides Docker, rkt by CoreOS is the most visible alternative, and CoreOS drives a lot of standards such as CNI and CRI-O.
A POD is a group of one or more containers. K8s always runs containers inside a POD. If we need to scale up an application, we add more PODs. All the containers running inside a POD share the same kernel namespaces for networking and storage volumes.
It is a common best practice to group containers into a single POD when they need to run as tightly coupled functional units. A good example of this is a web server and a logging process related to this web server that need to share a storage volume to write, read, process and export logs.
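A sketch of such a two-container POD could look like the manifest below (the POD name, images and paths are hypothetical, chosen only to illustrate the shared-volume pattern):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-logger          # hypothetical POD name
spec:
  volumes:
  - name: logs
    emptyDir: {}                 # scratch volume shared by both containers
  containers:
  - name: web
    image: nginx:1.15            # web server writes its logs into the shared volume
    volumeMounts:
    - name: logs
      mountPath: /var/log/nginx
  - name: log-processor
    image: busybox               # sidecar reads and processes the same log files
    command: ["sh", "-c", "tail -F /logs/access.log"]
    volumeMounts:
    - name: logs
      mountPath: /logs
```

Both containers are scheduled together on the same node and share the volume (and, as described below, the same network namespace).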
From a networking point of view, containers within a POD share the same IP address and port space, and can find each other via localhost. The containers can also communicate with each other using standard inter-process communication (IPC) mechanisms like System V semaphores or POSIX shared memory.
The connectivity of the POD to the external network is provided via a special container called the pause container. Its sole purpose is to own the network stack (Linux network namespace) and build the ‘low level network plumbing’ inside the POD. Only the pause container is started with an IP interface; the other containers do not get their own IPs.
Another example of grouping containers into a single POD is shown below:
The POD can have different states based on its lifecycle. The POD phase status can be: ContainerCreating, Pending, Running, Succeeded or Failed. In the ‘ContainerCreating’ state, the POD has been accepted by the Kubernetes system, but one or more of the containers has not been started successfully yet.
Kube-Proxy and Kubernetes Services
We already talked about microservices. With microservices, the application is broken down into distinct parts, e.g. a web frontend, a DB layer, various gateways to external data sources, application layers calculating results, etc. Each microservice offers its service through a REST API. The question now is, how do we group and discover those functional units (microservices) in the container orchestration cluster? The K8s answer to this is the K8s service. In the figure shown below, we have the Web Front-End POD that needs to talk to a set of Redis database PODs. As you already understand by now, PODs don’t live forever; they can come and go, and they can be scaled up or down at any time. A good example of this kind of behavior is scaling up Redis PODs for a short period of time, and then reducing the number of PODs by destroying them. In this kind of scenario, the question is: how does the Front-End POD learn the IP addresses of the new Redis slave PODs, and when K8s destroys them, how is the Front-End told to stop talking to a POD that was just destroyed?
To simplify this task, K8s created the concept of the service, implemented by the kube-proxy. Instead of dealing with individual POD IPs, we create a service to abstract a logical set of PODs. The service represents a group of PODs with an assigned DNS entry and a virtual IP address (VIP) reachable within the cluster.
In the figure below, when a Front-End POD needs to talk to a Redis POD, it sends the traffic to the service (the virtual IP returned by DNS), and the kube-proxy then load-balances the traffic between the Redis POD members.
When a new POD member joins the Redis cluster, the Front-End POD doesn’t need to know the IP address of the new POD. The same applies when a POD is destroyed.
PODs participate in the service based on the label values they get in their initial deployment.
Keep in mind that the K8s service IP address and DNS record live inside all the nodes in the cluster. We can think of this as an East-West distributed load balancer. The virtual IP (VIP) of the service is not exposed to the outside world.
The K8s service is implemented using Linux iptables. The DNS entry name, e.g. ’redis-slave.cluster.local’, is created within Kubernetes using a dynamic DNS service (CoreDNS) or through environment variable injection.
If we need to provide external access, a K8s service can also be made externally reachable through all nodes’ IP interfaces using a ‘NodePort’, exposing the service through a specific TCP/UDP port.
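A NodePort service for the Redis example above might be sketched as follows (the service name and port number are hypothetical illustrations):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-slave-external   # hypothetical service name
spec:
  type: NodePort               # exposes the service on every node's IP
  selector:
    app: redis-slave           # PODs carrying this label become endpoints
  ports:
  - port: 80                   # cluster-internal service (VIP) port
    nodePort: 30080            # reachable externally on <any-node-IP>:30080
                               # (default NodePort range is 30000-32767)
```

Clients outside the cluster can then reach the service via any node's IP address on the chosen port.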
Replication Controller (RC) and Replica Set (RS)
When a K8s node goes down, all the PODs running on that node go down with it.
By default, the failed PODs won’t be rescheduled to a different available node. To solve this problem K8s invented the Replication Controller, or RC for short. With an RC we can enforce the ‘desired’ state of a collection of PODs, e.g. make sure that 4 PODs are always running in the cluster. If there are too many PODs, it will kill some; if there are too few, the Replication Controller will start more. Unlike manually created PODs, the PODs maintained by a Replication Controller are automatically replaced if they fail, get deleted, or are terminated.
The Replica Set is the next-generation Replication Controller. The only difference between a Replica Set and a Replication Controller right now is the selector support: Replica Sets support the newer set-based selectors, while Replication Controllers only support equality-based selector requirements.
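To illustrate the selector difference, a Replica Set for the Redis example could use a set-based selector like the one below (the object name and image are hypothetical):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: redis-slave-rs         # hypothetical name
spec:
  replicas: 4                  # desired number of PODs
  selector:
    matchExpressions:          # set-based selector; a Replication Controller
    - key: app                 # could only express the equality app = redis-slave
      operator: In
      values: [redis-slave]
  template:
    metadata:
      labels:
        app: redis-slave       # template labels must satisfy the selector
    spec:
      containers:
      - name: redis
        image: redis:4.0       # hypothetical image
```

With `operator: In` (or `NotIn`, `Exists`), a single selector can match a set of label values rather than one exact value.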
K8s Label and Selectors
With K8s we have the fundamental concept of labels to “mark” objects. A label is a key/value pair. Labels can be attached to various K8s objects such as PODs, nodes, etc. Labels can be used to organize, and to select subsets of, objects. Labels can be attached to objects at creation time and subsequently added and modified at any time. Each object can have a set of key/value labels defined, and each key must be unique for a given object.
In the example shown below, the first label key is “key1” and the value of this key is “value1”:
“key1” : “value1”,
“key2” : “value2”
We can use labels and selectors to manage objects as groups. One good example is the service definition shown below:
apiVersion: v1
kind: Service
metadata:
  name: redis-cluster
  labels:
    app: redis-slave
spec:
  ports:
  # the port that this service should serve on
  - port: 80
  selector:
    app: redis-slave

The label key under metadata is “app” and its value is “redis-slave”. The selector is defined with the same key “app” and value “redis-slave”. Every time we need to scale up a new POD, we can attach the label value “redis-slave” to it.
The service selector value is redis-slave, meaning the service is “actively” looking for new PODs with this selector value. When the service finds a new POD with this value, the POD is added as an endpoint member of the service.
K8s Daemon Set
A DaemonSet ensures that all (or some) nodes run a copy of a POD. As nodes are added to the cluster, PODs are added to them. As nodes are removed from the cluster, those PODs are garbage collected. Deleting a DaemonSet will clean up the PODs it created.
This is essential for use cases that require running a POD on every node. A DaemonSet can be used for user-specified system services, cluster-level applications with strong node ties, and Kubernetes node services.
One of the main features of a DaemonSet is that when a new node member is added to the cluster, the DaemonSet POD is automatically deployed on the new node. The same concept applies when scaling down.
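A sketch of a DaemonSet manifest for a node-level agent could look like this (the agent name and image are hypothetical):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-log-agent           # hypothetical per-node agent
spec:
  selector:
    matchLabels:
      app: node-log-agent
  template:                      # one copy of this POD runs on every node
    metadata:
      labels:
        app: node-log-agent
    spec:
      containers:
      - name: agent
        image: fluentd:v1.2      # hypothetical log-collection image
```

There is no `replicas` field: the number of PODs follows the number of nodes automatically.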
K8s Namespaces
A namespace is a mechanism to partition resources created by users into a logically named group. Within one physical K8s cluster we can work with different user groups, where each group is isolated from the others.
Isolation is provided at the level of:
- resources (pods, services, replication controllers, etc.)
- policies (who can or cannot perform actions in their community)
- constraints (this community is allowed this much quota, etc.)
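The namespace itself and an associated quota can be sketched as follows (the quota values are hypothetical illustrations of the “constraints” point above):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: foo                    # the namespace used in the example below
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: foo-quota              # hypothetical quota object
  namespace: foo
spec:
  hard:
    pods: "20"                 # this community may run at most 20 PODs
    requests.cpu: "4"          # and request at most 4 CPU cores in total
```

All objects subsequently created in namespace foo are counted against this quota.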
In the example below we have two different namespaces: foo and bar. Each namespace has its own objects, described under a unique URI path.
Network Policy allows network segregation through firewall policies attached to namespaces.
Authorization: per-namespace access control to objects in specific namespaces.
CoreDNS (aka SkyDNS)
SkyDNS, now called CoreDNS, is a distributed service for announcement and discovery of services built on top of etcd. It utilizes DNS queries to discover available services. This is done by leveraging SRV records in DNS, with special meaning given to subdomains, priorities and weights.
CoreDNS runs as a K8s POD (container) on the K8s cluster.
VMware NSX-T and K8s Integration
Now we can start to describe how NSX and K8s work together.
This integration is based on NSX-T, to learn more about NSX-T, please watch:
NSX-T simplifies the implementation of network and security tasks around K8s. With this integration, developers can focus on writing code and deploying their applications, while NSX-T hides the complexity of container connectivity, dynamic routing and security implementation behind the scenes.
NSX-T dynamically builds a separate network topology per K8s namespace: every K8s namespace gets one or more logical switches and one Tier-1 router.
Compared to native K8s, the K8s nodes are not doing IP routing. Every POD has its own logical port on an NSX logical switch. Every node can have PODs from different namespaces with different IP subnets / topologies.
NSX-T uses overlay technologies with an encapsulation header called GENEVE. The encapsulation / de-capsulation occurs at the hypervisor and includes hardware offload to achieve better performance.
NSX-T provides high-performance East/West and North/South traffic forwarding, including dynamic routing to the physical network. The platform also has built-in IPAM that provides IP address management by supplying subnets from an IP block to namespaces, and individual IPs and MACs to PODs.
With NSX-T we get a central management tool for K8s networking. NSX-T can enforce security policy between PODs leveraging NSX’s distributed firewall, and we can view log events showing whether traffic was passed or dropped by an NSX policy. The logs can be viewed with VMware Log Insight.
Managing and operating a container environment can be a nightmare for IT teams. With the power of NSX-T and its K8s integration we can simplify operations: NSX-T brings an enterprise-grade platform offering monitoring and troubleshooting tools.
NSX-T provides very strong built-in operational tools that work at the K8s POD level:
- TX/RX Counters per Container
- IPFIX – traffic flow records
- SPAN – redirect a copy of traffic to a monitoring device
- Traceflow – trace network & host failures
NSX Container Plugin
The core component that provides the integration between K8s and the NSX Manager is called the NSX Container Plugin (NCP). The NCP itself runs as a container inside a K8s POD. The NCP monitors and watches for changes of relevant objects on the K8s API server, like namespaces, PODs, etc. Developers run tasks on the K8s side, and NCP sees those changes and reacts by creating the related NSX objects, like logical switches, logical routers and firewall objects, using a collection of API calls towards the NSX Manager.
Creating K8s Namespace
The environment separation in the K8s cluster is achieved using K8s namespaces. With NSX-T we leverage this construct to build network and security objects related to the K8s namespace. For each K8s namespace, the NCP will create NSX-T entities such as:
- Logical switch.
- Unique IP address segment per logical switch.
- Logical Tier-1 router.
- L3 IP interface on the Tier-1 router.
- Connection of the Tier-1 router to a pre-created Tier-0 logical router.
In the illustration below we can see two K8s namespaces: foo and bar. For each namespace we have a dedicated logical switch, Tier-1 router and IP segment.
After creating a K8s namespace, we can deploy PODs. As soon as new PODs get deployed, the K8s API server notifies the NCP. As a result, the NCP performs a collection of API calls towards the NSX Manager.
The full POD deployment process:
- NCP creates a “watch” on K8s API for any Pod events
- A user creates a new K8s Pod
- The K8s API Server notifies NCP of the change (addition) of Pods
- NCP creates a logical port:
- Request a MAC from the container MAC pool in NSX
- Assigns a VLAN for the Pod
- Creates a logical port (Sub-VIF / CIF) on the Namespace LS and assigns the IP, MAC and VLAN to the logical port
- Adds all K8s Pod Labels as Tags to the logical port
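For the last step above, a POD manifest like the following sketch (name, labels and image are hypothetical) shows the labels that NCP would copy as tags onto the POD's NSX logical port:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-01            # hypothetical POD name
  labels:                 # each key/value below becomes a tag on the
    app: web              # logical port created by NCP in NSX
    tier: frontend
spec:
  containers:
  - name: web
    image: nginx:1.15     # hypothetical image
```

Carrying the K8s labels into NSX as tags is what later allows NSX security policies to be written in terms of the same labels developers already use.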
K8s NSX Service
The K8s kube-proxy (K8s service) provides application abstraction by presenting a group of PODs with a DNS entry and a virtual IP address (VIP). The kube-proxy implements East-West load balancing in a distributed fashion, using Linux iptables to provide this functionality.
The NSX kube-proxy has the same goal as the K8s kube-proxy, but instead of using Linux iptables, the NSX service cluster virtual IPs are implemented using Open vSwitch (OVS). The NSX kube-proxy daemon watches the K8s API for new services and programs flow entries and server groups into OVS.
East/West Load-Balancing is implemented in the OVS fast path in a fully distributed way.
K8s NSX Ingress
The K8s Ingress controller provides external load-balancing functionality, so that users can access services from the ‘outside world’. This is in contrast to the K8s service, which is usually an ‘intra-cluster’ construct.
With NSX-T 2.0 one can use the popular NGINX Ingress controller running as a POD inside the cluster. NCP is able to map an external IP (aka floating IP) to the Ingress controller to make it reachable from the outside world if the namespace is running in NAT mode.
Also, in a joint effort with F5, the interoperability of F5’s K8s Ingress solution was tested with NSX-T 2.0.
K8s NSX Firewall
Implementing the NSX firewall in a K8s environment can be done with two different approaches. The first method uses pre-defined security policies in NSX-T. The second method leverages the K8s Network Policy. Regardless of the methodology you decide to use, the enforcement of the policy is done by the NSX distributed firewall; the implementation of the firewall occurs at the virtual interface (VIF) of the POD, in the kernel of the hypervisor.
With the NSX firewall we can block or permit POD-to-POD traffic within a namespace or between different namespaces. The traffic can be blocked at ingress but also at egress.
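For the second approach, a K8s Network Policy like the sketch below (names and port are hypothetical, reusing the Redis example) could express the same POD-to-POD intent, which NSX-T would then enforce in the distributed firewall:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-redis  # hypothetical policy name
  namespace: foo
spec:
  podSelector:
    matchLabels:
      app: redis-slave           # the policy protects the Redis PODs
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend          # only frontend PODs may connect
    ports:
    - protocol: TCP
      port: 6379                 # and only on the Redis port
```

Any ingress traffic to the selected PODs that does not match the `from` clause is dropped at the POD's VIF.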
One of the benefits of using the NSX firewall is the ability to view the permitted and blocked traffic using VMware Log Insight. The log hit counts are based on the POD IP address.
Each K8s node is a virtual machine running a Linux OS. Inside the node we run Open vSwitch (OVS), which connects each POD with a separate VLAN tag. The POD’s network interface with its associated VLAN tag is mapped as a Container Interface (CIF) on the NSX logical switch. The NSX-T DFW runs in front of the CIF of each POD.
In the figure illustrated below we can see two PODs, each connecting via a unique VLAN (IDs 10 and 11) into the NSX-T logical switch. Each POD is represented with a dedicated CIF on the logical switch and gets its individual DFW policy. The configuration of OVS is done with the NSX CNI plugin.
You can learn more about the integration from the VMworld 2017 session presented by Yves Fauser and Yasen Simeonov:
Source and Credits:
The materials used to create this blog post are based on the work of Yves Fauser, NSBU TPM.
I would like to thank him for reviewing this blog post.
Another special thanks to my friend Gilles Chekroun, Senior NSX Specialist SE within the VMware NSBU; Gilles helped me a lot reviewing this blog post.