Configuring TKGI Cluster Discovery

This topic describes how to configure Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster discovery in Healthwatch.

Overview of TKGI Cluster Discovery

In the TKGI Cluster Discovery Configuration pane of the Healthwatch tile, you can enable the Prometheus instance in the Healthwatch tile to detect on-demand Kubernetes clusters created through the TKGI API and create scrape jobs for them. You only need to configure this pane if you have Ops Manager foundations with TKGI installed.

The Prometheus instance detects and scrapes TKGI clusters by connecting to the Kubernetes API through the TKGI API using a UAA client. To enable this process, you must configure the Healthwatch tile, the Prometheus instance in the Healthwatch tile, the UAA client that the Prometheus instance uses to connect to the TKGI API, and the TKGI tile.
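
As a rough illustration of the connection flow described above, the following Python sketch builds the two HTTP requests such a client might make: one to obtain a token from UAA using a client-credentials grant, and one to list clusters through the TKGI API. It is not the Healthwatch implementation. The ports (8443 for UAA, 9021 for the TKGI API), the /oauth/token and /v1/clusters paths, and the client name and secret are assumptions that this topic does not state, so verify them against your own foundation. The code only constructs the requests and does not send them.

```python
import base64
import urllib.parse
import urllib.request

# Assumptions (not stated in this topic): UAA for TKGI listens on port 8443 and
# the TKGI API listens on port 9021 of the TKGI API hostname.
UAA_PORT = 8443
TKGI_API_PORT = 9021


def build_token_request(api_host, client_id, client_secret):
    """Build (but do not send) an OAuth 2.0 client-credentials token request."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return urllib.request.Request(
        url=f"https://{api_host}:{UAA_PORT}/oauth/token",
        data=body,
        method="POST",
        headers={
            "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
    )


def build_list_clusters_request(api_host, access_token):
    """Build (but do not send) a request that lists TKGI clusters."""
    return urllib.request.Request(
        url=f"https://{api_host}:{TKGI_API_PORT}/v1/clusters",
        method="GET",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    # Hypothetical hostname, client name, and secret, for illustration only.
    request = build_token_request("api.tkgi.example.com", "example-client", "example-secret")
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before any credentials leave the machine.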

To configure TKGI cluster discovery:

  1. Configure the TKGI Cluster Discovery Configuration pane in the Healthwatch tile. For more information, see Configure TKGI Cluster Discovery in Healthwatch below.

  2. Configure TKGI to allow the Prometheus instance to scrape metrics from TKGI clusters. For more information, see Configure TKGI below.

If TKGI cluster discovery fails after you have completed both parts of the procedure in this topic, see Troubleshooting TKGI Cluster Discovery Failure below.

Note: To collect additional BOSH system metrics related to TKGI and view them in the Grafana UI, you must install and configure the Healthwatch Exporter for TKGI on your foundations with TKGI installed. To install the Healthwatch Exporter for TKGI tile, see Installing a Tile Manually. To configure the Healthwatch Exporter for TKGI tile, see Configuring Healthwatch Exporter for TKGI.

Configure TKGI Cluster Discovery in Healthwatch

In the TKGI Cluster Discovery Configuration pane of the Healthwatch tile, you enable and configure TKGI cluster discovery, including the UAA client that the Prometheus instance uses to connect to the Kubernetes API through the TKGI API.

To configure the TKGI Cluster Discovery Configuration pane:

  1. Navigate to the Ops Manager Installation Dashboard.

  2. Click the Healthwatch tile.

  3. Select TKGI Cluster Discovery Configuration.

  4. Under Enable TKGI Cluster Discovery, select one of the following options:

    • Disabled: This option disables TKGI cluster discovery. TKGI cluster discovery is disabled by default.
    • Enabled: This option enables TKGI cluster discovery and reveals the configuration fields described in the steps below.
  5. For Scrape Port, enter the port on which the Healthwatch tile exposes the endpoint that the Prometheus instance scrapes for metrics about the health of the TKGI cluster discovery process. These metrics appear in the Healthwatch - Exporter Troubleshooting dashboard in the Grafana UI.

  6. For TKGI API Address, enter the TKGI API domain you configured in the API Hostname (FQDN) field in the TKGI API pane of the TKGI tile. For example, api.tkgi.example.com. For more information, see Routing to the TKGI API VM in TKGI API Authentication in the TKGI documentation.

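  7. For TKGI UAA Client, enter the username of the UAA client that the Prometheus instance uses to connect to the TKGI API. This is either the TKGI management admin client or a separate UAA client that you create with access to the TKGI API.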

  8. For TKGI UAA Client Secret, enter one of the following options for the TKGI UAA client secret:

    • Enter the TKGI management admin client secret:
      1. Return to the Ops Manager Installation Dashboard.
      2. Click the Tanzu Kubernetes Grid Integrated Edition tile.
      3. Select the Credentials tab.
      4. In the Pks Uaa Management Admin Client row, click Link to Credential.
      5. Record the value of secret.
      6. Return to the Ops Manager Installation Dashboard.
      7. Click the Healthwatch tile.
      8. Select TKGI Cluster Discovery Configuration.
      9. For TKGI UAA Client Secret, enter the secret you recorded from the Pks Uaa Management Admin Client row.
    • Create a separate UAA client with access to the TKGI API and enter the secret you specify for that UAA client. For more information, see Grant Tanzu Kubernetes Grid Integrated Edition Access to a Client in Managing Tanzu Kubernetes Grid Integrated Edition Users with UAA in the TKGI documentation.
  9. For TKGI UAA Admin Password, enter the TKGI UAA admin password only if you configured UAA as the OIDC provider for TKGI in the UAA pane of the TKGI tile. Otherwise, leave this field blank. For more information, see Grant Tanzu Kubernetes Grid Integrated Edition Access to a Client in Managing Tanzu Kubernetes Grid Integrated Edition Users with UAA in the TKGI documentation.

  10. For Test Frequency in Seconds, enter how frequently, in seconds, you want the TKGI service level indicator (SLI) test to run. The TKGI SLI test monitors the health of the TKGI API by logging in to the TKGI API server, listing all TKGI clusters, and logging out of the TKGI API server.

  11. (Optional) To enable TLS communication between the Prometheus instance and the TKGI API, configure one of the following options:

    • To configure the Prometheus instance to use a self-signed CA or a certificate that is signed by a self-signed CA when communicating with the TKGI API over TLS, provide the CA in TKGI API Certificate Authority. If you provide a self-signed CA, it must be the same CA that signs the certificate in the TKGI API.
    • If you do not provide a self-signed CA or a certificate that is signed by a self-signed CA in the TKGI API Certificate Authority field, you can enable the TKGI API Skip SSL Validation checkbox to enable the Prometheus instance to skip SSL validation when connecting to the TKGI API. VMware does not recommend skipping SSL validation in a production environment.
  12. Click Save.
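
To make the two TLS options in step 11 concrete, the following Python sketch shows what each choice means for a generic HTTPS client. It uses Python's standard ssl module and is only an analogy, not how the Prometheus instance is implemented. The function name and its parameters are hypothetical.

```python
import ssl


def build_tls_context(ca_pem=None, skip_validation=False):
    """Return an SSLContext that mirrors the two TLS options (illustrative only).

    ca_pem: PEM text of the CA that signed the server certificate, if any.
    skip_validation: disable certificate and hostname checks (not for production).
    """
    if ca_pem:
        # Trust only the supplied CA; certificate and hostname checks stay on.
        return ssl.create_default_context(cadata=ca_pem)

    # No CA supplied: start from the system trust store.
    context = ssl.create_default_context()
    if skip_validation:
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

In this sketch a supplied CA takes precedence over skipping validation, which matches the order in which the two options are presented above.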

Configure TKGI

After you enable and configure TKGI cluster discovery in the Healthwatch tile, you must configure TKGI to allow the Prometheus instance to scrape metrics from TKGI clusters.

To configure TKGI:

  1. Return to the Ops Manager Installation Dashboard.

  2. Click the Tanzu Kubernetes Grid Integrated Edition tile.

  3. Select Host Monitoring.

  4. Under Enable Telegraf Outputs?, select Yes.

  5. Enable the Include etcd metrics checkbox to send etcd server and debugging metrics to Healthwatch.

  6. For Setup Telegraf Outputs, provide the following TOML configuration file:

    [[outputs.prometheus_client]]
      listen = ":10200"

    You must use 10200 as the listening port to enable the Prometheus instance to scrape Telegraf metrics. For more information about creating a configuration file in TKGI, see Create a Configuration File in Configuring Telegraf in TKGI in the TKGI documentation.

  7. Click Save.

  8. For each plan you want to monitor:

    1. Select the plan you want to monitor. For example, Plan 2.
    2. For (Optional) Add-ons - Use with caution, enter the following YAML snippet to create the roles required to enable the Prometheus instance to scrape metrics from your TKGI clusters:

    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: healthwatch
    rules:
    - resources:
        - pods/proxy
        - pods
        - nodes
        - nodes/proxy
        - namespace/pods
        - endpoints
        - services
      verbs:
        - get
        - watch
        - list
      apiGroups:
        - ""
    - nonResourceURLs: ["/metrics"]
      verbs: ["get"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: healthwatch
    roleRef:
      apiGroup: ""
      kind: ClusterRole
      name: healthwatch
    subjects:
    - apiGroup: ""
      kind: User
      name: healthwatch
    

    If (Optional) Add-ons - Use with caution already contains other API resource definitions, append the above YAML snippet to the end of the existing resource definitions, followed by a newline character.

  9. Click Save.

  10. Select Errands.

  11. Ensure that the Upgrade all clusters errand is enabled. Running this errand applies the roles you defined in the (Optional) Add-ons - Use with caution field of each plan you monitor to your TKGI clusters.

  12. Click Save.
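
To see what the healthwatch ClusterRole from step 8 grants, the following Python sketch transcribes its rules into plain dictionaries and checks a few requests against them. This is a deliberately simplified model of Kubernetes RBAC matching: it only does exact string comparison and ignores wildcards, resource names, and aggregation. The helper function names are hypothetical.

```python
# Rules transcribed from the healthwatch ClusterRole shown in this topic.
HEALTHWATCH_RULES = [
    {
        "apiGroups": [""],
        "resources": [
            "pods/proxy", "pods", "nodes", "nodes/proxy",
            "namespace/pods", "endpoints", "services",
        ],
        "verbs": ["get", "watch", "list"],
    },
    {"nonResourceURLs": ["/metrics"], "verbs": ["get"]},
]


def allows_resource(rules, verb, resource, api_group=""):
    """Return True if any rule grants `verb` on `resource` in `api_group`."""
    return any(
        verb in rule.get("verbs", [])
        and resource in rule.get("resources", [])
        and api_group in rule.get("apiGroups", [])
        for rule in rules
    )


def allows_url(rules, verb, path):
    """Return True if any rule grants `verb` on the non-resource URL `path`."""
    return any(
        verb in rule.get("verbs", []) and path in rule.get("nonResourceURLs", [])
        for rule in rules
    )


if __name__ == "__main__":
    print(allows_resource(HEALTHWATCH_RULES, "get", "nodes/proxy"))  # read access
    print(allows_resource(HEALTHWATCH_RULES, "delete", "pods"))      # write access
    print(allows_url(HEALTHWATCH_RULES, "get", "/metrics"))          # metrics URL
```

The role is read-only: it grants get, watch, and list on the listed resources and get on /metrics, and nothing that modifies cluster state.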

Troubleshooting TKGI Cluster Discovery Failure

TKGI cluster discovery can fail if the Prometheus instance fails to scrape metrics from your TKGI clusters. To troubleshoot TKGI cluster discovery failure, see Troubleshooting Missing TKGI Cluster Metrics in Troubleshooting Healthwatch.
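
As a first check before working through that topic, you might confirm that the Telegraf listening port configured above (10200) accepts TCP connections from a machine that can route to your cluster VMs. The following Python sketch is one way to do that. It is an assumption about a useful first step, not part of the documented troubleshooting procedure, and the host address in the usage example is a placeholder.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with the IP address of a TKGI cluster VM.
    print(port_is_open("10.0.0.10", 10200))
```

A successful connection only shows that something is listening on the port. It does not confirm that the endpoint serves metrics or that the Prometheus instance is authorized to scrape it.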