You’ve just deployed your Tanzu Kubernetes Grid clusters and set up enterprise services such as Cert-Manager, Fluent Bit, ExternalDNS and Harbor from the Tanzu Standard Repository.
How do you know if everything is working?
- Are certificates being generated? When are their expiration dates? Any failures?
- How about automated management of DNS entries? Are the desired records from within the cluster reconciled with the enterprise DNS server?
- Are all your logs getting to the enterprise log management system? When are bursts occurring, and why?
- Are all the components of your enterprise container registry up and running? How much headroom do you have on your storage quota?
Wouldn’t it be great to have a dashboard to visualize the health of these enterprise services?
Kubernetes-level monitoring is very important for ongoing platform operations, but this level of monitoring alone cannot answer these service-specific questions. Fortunately, many mature OSS services provide service-specific metrics. It is up to the deployer to configure the services to expose these metrics and then integrate them into an enterprise observability platform like Tanzu Observability.
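To make "service-specific metrics" concrete, here is a minimal sketch of answering the certificate expiration question from Prometheus-format metrics text. It assumes cert-manager's certmanager_certificate_expiration_timestamp_seconds metric with a name label; the function names, certificate names, and sample values are made up for illustration, and in practice your observability platform would do this work for you.

```shell
# Sketch: print "<certificate> <days until expiry>" for each certificate,
# reading Prometheus-format metrics text on stdin.
# ASSUMPTION: the metric carries a name="..." label, as cert-manager emits it.
days_until_expiry() {
  # $1: current time in Unix epoch seconds (defaults to now)
  expiry_now="${1:-$(date +%s)}"
  awk -v now="$expiry_now" '
    index($0, "certmanager_certificate_expiration_timestamp_seconds{") == 1 {
      cert = "unknown"
      # Find the name label (but not issuer_name or namespace)
      if (match($0, /[{,]name="[^"]*"/))
        cert = substr($0, RSTART + 7, RLENGTH - 8)
      # The last field is the expiry time as a Unix timestamp
      printf "%s %d\n", cert, int(($NF - now) / 86400)
    }'
}

# Hypothetical sample standing in for a real scrape of the metrics endpoint
sample_metrics() {
  cat <<'EOF'
# TYPE certmanager_certificate_expiration_timestamp_seconds gauge
certmanager_certificate_expiration_timestamp_seconds{name="harbor-tls",namespace="tanzu-system-registry"} 1.7000864e+09
certmanager_certificate_expiration_timestamp_seconds{name="contour-tls",namespace="tanzu-system-ingress"} 1.702592e+09
EOF
}

# With "now" pinned to 1700000000 this prints:
#   harbor-tls 1
#   contour-tls 30
sample_metrics | days_until_expiry 1700000000
```

A negative number of days would mean the certificate has already expired, which is exactly the kind of signal a dashboard or alert should surface.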
Consider the following use cases. As a platform operator, you need to:
1. Add observability to your Kubernetes clusters
2. Seek service-specific metrics options, and ensure they are configured
3. Create meaningful dashboards to visualize the health of your services
4. Brainstorm failure scenarios and create alerts
5. Leverage observability during troubleshooting
6. Use observability data during root cause analysis, then refactor dashboards and alerts accordingly
Number 1 (add observability to your Kubernetes clusters) is as easy as can be with Tanzu Mission Control’s integration with Tanzu Observability. Simply add your Tanzu Observability credentials in the Administration section. Then add the Tanzu Observability integration to your cluster group or cluster.
I tackled number 2 (seek service-specific metrics options, and ensure they are configured) by reviewing the documentation for each open source project. In all cases, the project documentation described how to enable metrics and explained what each metric means. The next step was to review the configuration options exposed by the Tanzu Standard Repository packaging for those projects. For example, this command prints the values schema for the Fluent Bit package:
$ tanzu package available get fluent-bit.tanzu.vmware.com/1.7.5+vmware.2-tkg.1 -n tanzu-package-repo-global --values-schema
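Once the schema shows where the Fluent Bit service configuration lives, enabling metrics comes down to a values file. The following is a sketch only: it assumes the package accepts the Fluent Bit [Service] section under a fluent_bit.config.service key, and the file name and port are illustrative. Check the key names against the --values-schema output above, and merge the result with your existing values (outputs, filters, and so on) rather than replacing them.

```shell
# Sketch: write a values file that turns on the Fluent Bit built-in HTTP
# server, which serves Prometheus-format metrics at
# /api/v1/metrics/prometheus.
# ASSUMPTION: the package reads the [Service] section from
# fluent_bit.config.service; confirm this with the --values-schema output.
values_file="${VALUES_FILE:-fluent-bit-data-values.yaml}"

cat > "$values_file" <<'EOF'
fluent_bit:
  config:
    service: |
      [Service]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
EOF

echo "wrote $values_file"
```

You would then apply the file to the installed package, for example with tanzu package installed update and its --values-file flag, assuming the package is already installed in your cluster.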
Number 3 (create meaningful dashboards to visualize the health of your services) was a lot of fun. I worked through the wealth of resources available to users of Tanzu Observability, including videos, guides, and examples.
I’ve captured my work for numbers 2 and 3 in the tanzu-standard-addon-monitoring Git repository. It provides detailed instructions for configuring cert-manager, external-dns, fluent-bit and harbor for monitoring, along with the JSON to create your Tanzu Observability dashboard.
This is a great start, but it doesn’t end here. The remaining use cases, numbers 4–6, describe the iterative, ongoing process that your platform/SRE team should follow as you introduce application workloads onto your Kubernetes clusters and optimize day 2 operations.
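As one illustration of turning a failure scenario into an alert condition, the sketch below flags Harbor projects that are running out of storage quota. It assumes the Harbor exporter metrics harbor_project_quota_usage_byte and harbor_project_quota_byte with a project_name label, and that a quota of -1 means unlimited; the function names, project names, and numbers are hypothetical. A real alert would be defined in your observability platform, but the logic is the same.

```shell
# Sketch: list Harbor projects whose quota usage is at or above a threshold,
# reading Prometheus-format metrics text on stdin.
# ASSUMPTION: metric and label names follow the Harbor exporter conventions.
quota_alerts() {
  # $1: threshold percentage (defaults to 80)
  awk -v threshold="${1:-80}" '
    # Extract the project_name label value from a sample line
    function project(line) {
      if (match(line, /project_name="[^"]*"/))
        return substr(line, RSTART + 14, RLENGTH - 15)
      return "unknown"
    }
    index($0, "harbor_project_quota_usage_byte{") == 1 { used[project($0)] = $NF + 0 }
    index($0, "harbor_project_quota_byte{") == 1       { quota[project($0)] = $NF + 0 }
    END {
      for (p in quota) {
        if (quota[p] <= 0) continue   # skip projects without a quota
        pct = 100 * used[p] / quota[p]
        if (pct >= threshold) printf "%s %.0f%%\n", p, pct
      }
    }' | sort
}

# Hypothetical sample standing in for a real scrape of the Harbor exporter
sample_harbor_metrics() {
  cat <<'EOF'
harbor_project_quota_byte{project_name="library"} 1.073741824e+10
harbor_project_quota_usage_byte{project_name="library"} 9.663676416e+09
harbor_project_quota_byte{project_name="apps"} 1.073741824e+10
harbor_project_quota_usage_byte{project_name="apps"} 1.073741824e+09
harbor_project_quota_byte{project_name="sandbox"} -1
harbor_project_quota_usage_byte{project_name="sandbox"} 5e+08
EOF
}

# Prints: library 90%
sample_harbor_metrics | quota_alerts 80
```

The threshold is the part to revisit after each incident review: if an alert fired too late or too often, adjust it and update the dashboard to match.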
In summary, in order to answer service-specific health questions, it is best to have service-specific metrics and context. Whenever you deploy services to your Kubernetes clusters, whether they are packaged open source software or custom developed software, seek out service-specific metrics and create and maintain dashboards and alerts.
Post-release note: At VMware Explore 2022, it was announced that Tanzu Observability is joining the VMware Aria platform with a new name, VMware Aria Operations for Applications.
Special thanks to the community contributors of the Grafana dashboards for these open source projects, which served as inspiration for the Tanzu Observability custom dashboard.