TKG Cluster Deployment Gotchas with Node Health Check in CSE 4.2

25-02-202425-02-2024 Manish Jha

Recently, I upgraded Container Service Extension to 4.2.0 in my lab and was trying to deploy a TKG 2.4.0 cluster with node health check enabled. The deployment got stuck after deploying one control plane and worker node, and the cluster went into an error state.

Clicking on the Events tab showed the following error:

I checked the CSE log file and the capvcd logs on the ephemeral vm (before it got deleted) and found no error that would make sense to me.

I contacted CSE Engineering to discuss this issue and opened a bug for further analysis of the logs.

Root Cause

CSE Engineering debugged the logs and found that it was a bug in the product version. Here is the summary of the analysis done by Engineering.

This bug is caused by UI plugin 4.2.0 using the wrong API version in the MachineHealthCheck capi yaml object. The UI plugin was using API version v1beta2 is used instead of v1beta1, which is the correct value.

How to Force Delete a Stale Virtual Service in NSX ALB

24-02-202424-02-2024 Manish Jha

Recently, I ran into an interesting problem in my lab where I couldn’t get rid of an unused Virtual Service in NSX ALB. The attempt to delete was failing with an error: “VS cannot be deleted! ‘It is being referred to by SystemConfiguration object”

I tried deleting the VS via API and it returned the same error

DELETE https://siteb-alb.sddc.lab/api/virtualservice/virtualservice-68679e6f-f10b-45ee-ab7d-d714ff8766aa

Response:

{
    "error": "siteb-local-dns-vs cannot be deleted! 'siteb-local-dns-vs' is being referred to by SystemConfiguration object"
}

DELETE https://siteb-alb.sddc.lab/api/virtualservice/virtualservice-68679e6f-f10b-45ee-ab7d-d714ff8766aa

Response:

{

"error": "siteb-local-dns-vs cannot be deleted! 'siteb-local-dns-vs' is being referred to by SystemConfiguration object"

}

To figure out where this VS is being referenced, I looked through the pool members and other settings in NSX ALB, but I couldn’t discover anything particular. Internet searches were also not very helpful.

I then checked this issue in internal tools and got a hint that I needed to remove the VS reference from the system configuration through API first. … Read More

How to Force Delete a Stale Logical Segment in NSX 3.x

24-02-202424-02-2024 Manish Jha

I ran into a problem recently when disabling NSX in my lab where I couldn’t remove a logical segment. This logical segment was previously attached to NSX Edge virtual machines. The logical segments still had a stale port, even after the Edge virtual machines were deleted.

Any attempt to delete the segment through UI/API resulted in the error “Segment has 1 VMs or VIFs attached. Disconnect all VMs and VIFs before deleting a segment.”

GUI Error

API Error

So how do I delete the stale segments in this case? The answer is through API, but before deleting the segments, you must delete the stale VIFs.

Follow the procedure given below to delete stale VIFs.

1: List Logical Ports associated with the Stale Segment

Method: GET

URL: https://{{nsx_mgr}}/api/v1/logical-ports

Method: GET

URL: https://{{nsx_mgr}}/api/v1/logical-ports

This API call will list all segments. You can pass the segment uuid in the above command to list a specific segment (stale one).… Read More

How to Integrate TMC Self-Managed 1.0 with VCD

23-02-202403-04-2024 Manish Jha

Introduction

VMware Tanzu Mission Control is a centralized hub for simplified, multi-cloud, multi-cluster Kubernetes management. It helps platform teams take control of their Kubernetes clusters with visibility across environments by allowing users to group clusters and perform operations, such as applying policies, on these groupings.

VMware launched Tanzu Mission Control Self-Managed last year for customers running their Kubernetes (Tanzu) platform in an air-gapped environment. TMC Self-Managed is designed to support organizations that prefer to maintain complete control over their multi-cluster management hub for Kubernetes to take full advantage of advanced capabilities for cluster configuration, policy management, data protection, etc.

The first couple of releases of TMC Self-Managed only supported TKG clusters that were running on vSphere. Last month, VMware announced the release of the VMware Cloud Director Extension for Tanzu Mission Control, which allows installing TMC Self-Managed in a VCD environment to manage TKG clusters deployed through the VCD Container Service Extension (CSE).… Read More

Install Container Service Extension 4.2 in an Airgap Environment

16-02-2024 Manish Jha

Introduction

VMware Container Service (CSE) is an extension of VMware Cloud Director that enables cloud providers to offer Kubernetes as a service to their tenants. CSE helps tenants quickly deploy the Tanzu Kubernetes Grid clusters in their virtual data centers with just a few clicks directly from the tenant portal. Tenants can manage their clusters utilizing Tanzu products and services, such as Tanzu Mission Control, in conjunction with the VMware Cloud Director Container Service Extension.

Until CSE 4.0, the deployment of TKG clusters depended on internet connectivity to get the necessary installation binaries from the VMware public image registry. There was no support for the airgap environment.

With CSE 4.1, VMware introduced support for deploying CSE in an Airgap environment. Before diving into the nitty-gritty of configuring CSE, let’s look at the CSE airgap architecture.

CSE Airgap Architecture

The image below is from the CSE product documentation and depicts the high-level architecture and service provider workflow of CSE in an airgap setup.… Read More

VCD (10.5) Service Crashing Continuously in CSE Environment

14-02-202414-02-2024 Manish Jha

After updating my lab’s Container Service Extension to version 4.2.0, I observed that the VMware VCD service was frequently crashing. Restarting the cell service did not help much, as the VCD user interface (UI) died again after five minutes. The cell.log was throwing below exception

java.lang.OutOfMemoryError: Java heap space
Dumping heap to /opt/vmware/vcloud-director/logs/java_pid11530.hprof ...

1 2	java.lang.OutOfMemoryError: Java heap space Dumping heap to /opt/vmware/vcloud-director/logs/java_pid11530.hprof ...

You will find similar log entries in the cell-runtime.log file.

2024-02-12 15:25:07,873 | FATAL    | pc-activity-pool-13       | UncaughtExceptionHandlerStartupAction | Uncaught Exception. Originating thread: Thread[pc-activity-pool-13,5,main]. 

Message: Java heap space | activity=(com.vmware.vcloud.vimproxy.internal.impl.PCEventProcessingActivity,urn:uuid:5d072602-9517-4ce9-b00c-9e12a46e03bd)
java.lang.OutOfMemoryError: Java heap space

2024-02-12 15:25:07,873 | FATAL | pc-activity-pool-13 | UncaughtExceptionHandlerStartupAction | Uncaught Exception.

Quick Tip: Configure NSX Manager GUI Idle Timeout

12-02-202414-02-2024 Manish Jha

The NSX Manager Web-UI has a default timeout of 1800 seconds. i.e., the NSX Manager UI will time out after 30 minutes of inactivity. This timeout looks reasonable for a production deployment, but since security is not an issue in lab settings, you might want to change it to a little bit higher. On top of that, it is annoying to get kicked out of the UI after 30 minutes of idle session.

In this post, I will demonstrate how you can change the timeout value.

Run the command get service http to see the currently configured value:

compute-nsxt01> get service http

Service name:                     http
Service state:                    running
Logging level:                    info
Session timeout:                  1800
Connection timeout:               30
Client API rate limit:            100 requests/sec
Client API concurrency limit:     40 connections
Global API concurrency limit:     199 connections
Redirect host:                    (not configured)
Basic authentication:             enabled
Cookie-based authentication:      enabled

compute-nsxt01> get service http

Service name: http

Service state: running

Logging level: info

Session timeout: 1800

Connection timeout: 30

Client API rate limit: 100 requests/sec

Client API concurrency limit: 40 connections

Global API concurrency limit: 199 connections

Redirect host: (not configured)

Basic authentication: enabled

Cookie-based authentication: enabled

Run the command set service http session-timeout 0 to fully remove the session timeout.… Read More

Securing TKG Workloads with Antrea and NSX-Part 2: Enable Antrea Integration with NSX

13-09-2023 Manish Jha

In the first part of this series of blog posts, I talked about how VMware Container Networking with Antrea addresses current business problems associated with a Kubernetes CNI deployment. I also discussed the benefits that VMware NSX offers when Antrea is integrated with NSX.

In this post, I will discuss how to enable the integration between Antrea and NSX.

Antrea-NSX Integration Architecture

The below diagram is taken from VMware blogs and shows the high-level architecture of Antrea and NSX integration.

The following excerpt from vmware blogs summarizes the above architecture pretty well.

Antrea NSX Adapter is a new component introduced to the standard Antrea cluster to make the integration possible. This component communicates with K8s API and Antrea Controller and connects to the NSX-T APIs. When a NSX-T admin defines a new policy via NSX APIs or UI, the policies are replicated to all the clusters as applicable. These policies will be received by the adapter which in turn will create appropriate CRDs using K8s APIs.

Securing TKG Workloads with Antrea and NSX-Part 1: Introduction

12-09-202313-09-2023 Manish Jha

What is a Container Network Interface

Container Network Interface (CNI) is a framework for dynamically configuring networking resources in a Kubernetes cluster. CNI can integrate smoothly with the kubelet to enable the use of an overlay or underlay network to automatically configure the network between pods. Kubernetes uses CNI as an interface between network providers and Kubernetes pod networking.

There exists a wide variety of CNIs (Antrea, Calico, etc.) that can be used in a Kubernetes cluster. For more information on the supported CNIs, please read this article.

Business Challenges with Current K8s Networking Solutions

The top business challenges associated with current CNI solutions can be categorized as below:

Community support lacks predefined SLAs: Enterprises benefit from collaborative engineering and receive the latest innovations from open-source projects. However, it is a challenge for any enterprise to rely solely on community support to run its operations because community support is a best effort and cannot provide a predefined service-level agreement (SLA).

Deploying TKG 2.0 Clusters in TKGS on vSphere 8 to Trust Private CA Certificates

01-09-2023 Manish Jha

In a vSphere with Tanzu environment, when you enable Workload Management, the Supervisor cluster that gets deployed operates as the management cluster. After supervisor cluster is deployed, you can provision two types of workload clusters

Tanzu Kubernetes clusters.
Clusters based on a ClusterClass (aka Classy Cluster).

TKG on vSphere 8 provides different sets of APIs to deploy a TKC or a Classy cluster:

v1alpha3 API
v1beta1 API

When you deploy a cluster using v1beta1 API, you get a Classy Cluster or TKG 2.0 cluster which is based on a default ClusterClass definition.

By default workload cluster don’t trust any self-signed certificates. Prior to TKG 2.0 clusters, the easiest way to make Tanzu Kubernetes Clusters (TKCs) to trust any self-signed CA certificate was to edit tkgserviceconfigurations and define your Trusted CAs there. This setting was then enforced on any newly deployed TKCs.

For TKG 2.0 clusters, the process has changed a bit and in this post I will walk through configuring the same.… Read More

TKG Cluster Deployment Gotchas with Node Health Check in CSE 4.2

Like this:

How to Force Delete a Stale Virtual Service in NSX ALB

Like this:

How to Force Delete a Stale Logical Segment in NSX 3.x

Like this:

How to Integrate TMC Self-Managed 1.0 with VCD

Introduction

Like this:

Install Container Service Extension 4.2 in an Airgap Environment

Introduction

CSE Airgap Architecture

Like this:

VCD (10.5) Service Crashing Continuously in CSE Environment

Like this:

Quick Tip: Configure NSX Manager GUI Idle Timeout

Like this:

Securing TKG Workloads with Antrea and NSX-Part 2: Enable Antrea Integration with NSX

Antrea-NSX Integration Architecture

Like this:

Securing TKG Workloads with Antrea and NSX-Part 1: Introduction

What is a Container Network Interface

Business Challenges with Current K8s Networking Solutions

Like this:

Deploying TKG 2.0 Clusters in TKGS on vSphere 8 to Trust Private CA Certificates

Like this:

Spread the Love

Like this:

Spread the Love

Like this:

Spread the Love

Like this:

Introduction

Spread the Love

Like this:

Introduction

CSE Airgap Architecture

Spread the Love

Like this:

Spread the Love

Like this:

Spread the Love

Like this:

Antrea-NSX Integration Architecture

Spread the Love

Like this:

What is a Container Network Interface

Business Challenges with Current K8s Networking Solutions

Spread the Love

Like this:

Spread the Love

Like this: