
Prometheus

0. Required Metrics

Container and pod metrics are used to determine recommendations for individual workloads. Node and cluster metrics are used to determine cost and overall cluster health.

If you have custom metric names, please contact us for further assistance.

The following table lists every query that Flightcrew runs (filtered by cluster).
Metric Type | Query
CPU Allocatable | sum (kube_node_status_allocatable{resource="cpu"})
CPU Allocatable | sum by (node) (kube_node_status_allocatable{resource="cpu"})
CPU Capacity | sum (kube_node_status_capacity{resource="cpu"})
CPU Capacity | sum by (node) (kube_node_status_capacity{resource="cpu"})
CPU Limit | sum (kube_pod_container_resource_limits{resource="cpu"})
CPU Limit | sum by (container, namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})
CPU Limit | sum by (node) (kube_pod_container_resource_limits{resource="cpu"})
CPU Request | sum (kube_pod_container_resource_requests{resource="cpu"})
CPU Request | sum by (container, namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
CPU Request | sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
CPU Usage | sum (rate(container_cpu_usage_seconds_total{container!=""}[1m]) * on(pod) group_left(node) kube_pod_info)
CPU Usage | sum by (container, namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[1m]))
CPU Usage | sum by (node) (rate(container_cpu_usage_seconds_total{container!=""}[1m]) * on(pod) group_left(node) kube_pod_info)
Container Restart Count | sum by (container, namespace, pod) (increase(kube_pod_container_status_restarts_total[1m]))
Disk Capacity | sum by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes)
Disk Capacity | sum by (node) ((sum by (persistentvolumeclaim, namespace) (kubelet_volume_stats_capacity_bytes)) * on (persistentvolumeclaim, namespace) group_left(node) max by (persistentvolumeclaim, namespace, node) (kube_persistentvolumeclaim_info))
Disk Usage | sum by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes)
Disk Usage | sum by (node) ((sum by (persistentvolumeclaim, namespace) (kubelet_volume_stats_used_bytes)) * on (persistentvolumeclaim, namespace) group_left(node) max by (persistentvolumeclaim, namespace, node) (kube_persistentvolumeclaim_info))
Memory Allocatable | sum (kube_node_status_allocatable{resource="memory"})
Memory Allocatable | sum by (node) (kube_node_status_allocatable{resource="memory"})
Memory Capacity | sum (kube_node_status_capacity{resource="memory"})
Memory Capacity | sum by (node) (kube_node_status_capacity{resource="memory"})
Memory Limit | sum (kube_pod_container_resource_limits{resource="memory"})
Memory Limit | sum by (container, namespace, pod) (kube_pod_container_resource_limits{resource="memory"})
Memory Limit | sum by (node) (kube_pod_container_resource_limits{resource="memory"})
Memory Request | sum (kube_pod_container_resource_requests{resource="memory"})
Memory Request | sum by (container, namespace, pod) (kube_pod_container_resource_requests{resource="memory"})
Memory Request | sum by (node) (kube_pod_container_resource_requests{resource="memory"})
Memory Usage | sum (last_over_time(container_memory_working_set_bytes{container!=""}[1m]) * on(pod) group_left(node) kube_pod_info)
Memory Usage | sum by (container, namespace, pod) (last_over_time(container_memory_working_set_bytes{container!=""}[1m]))
Memory Usage | sum by (node) (last_over_time(container_memory_working_set_bytes{container!=""}[1m]) * on(pod) group_left(node) kube_pod_info)
Pod Readiness | (min by (pod, namespace) (kube_pod_info{created_by_kind!="Job"} * on (pod, namespace) ((min_over_time(kube_pod_status_ready{pod!="", condition="true"}[1m]) > 0) / (min_over_time(kube_pod_status_ready{pod!="", condition="true"}[1m]) > 0) or (min_over_time(kube_pod_status_ready{pod!="", condition!="true"}[1m]) > 0) * 0)))*100
Pod Readiness | (min by (pod, namespace) ((min_over_time(kube_pod_status_phase{pod!="", phase=~"Running|Succeeded"}[1m]) > 0) / (min_over_time(kube_pod_status_phase{pod!="", phase=~"Running|Succeeded"}[1m]) > 0) or ((min_over_time(kube_pod_status_phase{pod!="", phase!~"Running|Succeeded"}[1m]) > 0) * 0)) * on (pod, namespace) kube_pod_info{created_by_kind="Job"})*100
Status Phase | sum by (namespace, pod, phase) (kube_pod_status_phase) > 0
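
To spot-check any of these queries by hand, you can send it to the Prometheus HTTP API. The example below is a minimal sketch that assumes Prometheus is reachable on localhost:9090 (for example via the port-forward in step 3); --data-urlencode takes care of the braces and brackets in the query:

# Run one of the queries above against the Prometheus HTTP API.
curl --silent --get http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (container, namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[1m]))'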

1. Set up kube-state-metrics

Skip this step if: kube-state-metrics is already installed on your cluster.

Actions: Install kube-state-metrics.

If Prometheus has been installed via Helm, add the following lines to values.yaml:

kube-state-metrics:
  enabled: true

If Prometheus has been installed another way, kube-state-metrics can be installed as a standalone Helm chart with the following commands:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-state-metrics prometheus-community/kube-state-metrics -n kube-system
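
Verify: Check that kube-state-metrics is running and serving metrics. A quick check, assuming the release name and namespace used above and the chart's default metrics port of 8080 (adjust if yours differ):

kubectl get pods -n kube-system -l app.kubernetes.io/name=kube-state-metrics
kubectl port-forward service/kube-state-metrics 8080:8080 -n kube-system
# In another terminal, fetch a sample of the exported metrics:
curl --silent http://localhost:8080/metrics | head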

2. Set up scrape configs

Skip this step if: Prometheus was installed via the Helm chart. The chart already includes the relevant scrape configs, so you can move on to the next step.

Actions: Add the scrape configs.

  1. Copy the Prometheus Helm chart's scrape configs into your Prometheus server
  2. Restart Prometheus (kubectl delete pods ...) to enact the new configuration.

Note: if you've configured any scrape configs of your own in values.yaml, they can overwrite these defaults, so move your overrides into extraScrapeConfigs instead (see the sketch below). Enact the new configuration using helm uninstall and helm install.
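
For example, a custom job that was previously defined alongside the default scrape configs could be moved into extraScrapeConfigs like this. This is a sketch assuming the prometheus-community/prometheus chart, where extraScrapeConfigs is appended to the generated scrape_configs; the job name and target are placeholders:

extraScrapeConfigs: |
  - job_name: my-custom-app                  # placeholder job name
    static_configs:
      - targets:
          - my-custom-app.default.svc:8080   # placeholder scrape target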

Verify: Examine the scrape configs with the following commands:

kubectl get configmaps -n <namespace-of-prometheus>
kubectl get configmap <configmap-name> -n <namespace-of-prometheus> -o yaml

The scrape_configs should look something like this if correctly configured:

- job_name: 'kubernetes-pods'

  kubernetes_sd_configs:
    - role: pod

  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
  ...
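
This job only scrapes pods that opt in through annotations. For reference, a pod that should be picked up would carry annotations along these lines. The name, image, path, and port below are illustrative, and the prometheus.io/port rule is part of the full default config even though it is elided above:

apiVersion: v1
kind: Pod
metadata:
  name: example-app                # illustrative name
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule above
    prometheus.io/path: "/metrics" # copied into __metrics_path__ by the replace rule
    prometheus.io/port: "8080"     # honored by the full default config
spec:
  containers:
    - name: app
      image: example-app:latest    # illustrative image
      ports:
        - containerPort: 8080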

3. Verify API Access

Actions: Curl Prometheus directly and make sure the response has the data Flightcrew is looking for.

First, port-forward your Prometheus instance:

# Replace "prometheus-service" and "monitoring" namespace
# if they are named differently in your cluster.
export PROMETHEUS_CONTAINER_PORT=$(kubectl get service prometheus-service --namespace=monitoring -ojsonpath="{.spec.ports[].port}")
kubectl port-forward "service/prometheus-service" --namespace monitoring "9090:${PROMETHEUS_CONTAINER_PORT}"

Then, in another terminal window, paste the following curl commands to query Prometheus:

# Some example metrics that Flightcrew reads from:
curl --silent http://localhost:9090/api/v1/query?query=kube_pod_info | head --bytes=150
curl --silent http://localhost:9090/api/v1/query?query=kube_node_status_allocatable | head --bytes=150
curl --silent http://localhost:9090/api/v1/query?query=container_cpu_usage_seconds_total | head --bytes=150

See a complete list of metrics and their usage in Required Metrics.

Verify: The output should contain data in the "result" section. For example:

{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kube_pod_info","app_kubernetes_io_component":"metrics", ... }}]}}

If the result list is empty and the response just shows "result":[], one of the steps above may not have been completed correctly; see the Troubleshooting steps below.
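
If you have jq installed, a quick way to check whether a query returned any series is to count the entries in the result array (0 means empty):

curl --silent http://localhost:9090/api/v1/query?query=kube_pod_info | jq '.data.result | length'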

Troubleshooting

Common problems we've seen while configuring Prometheus:

  1. Prometheus was not restarted - Ensure Prometheus has been restarted for the config changes to take effect; see the example below.
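
     For example, if Prometheus runs as a Deployment (as in the snippet under item 2), a rollout restart or deleting the pods forces it to pick up the new configuration. The resource names and label are placeholders taken from the examples in this guide; adjust them for your cluster:

     kubectl rollout restart deployment/prometheus-deployment -n monitoring
     # Or delete the pods so they are recreated with the new configuration:
     kubectl delete pods -l app=prometheus-server -n monitoring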

  2. The Service is pointing at the wrong port - Ensure spec.ports.targetPort on the Service matches spec.template.spec.containers.ports.containerPort on the Deployment. See the snippet below for an example where the ports are correctly aligned, and the commands after it for a quick way to compare the two values:

    apiVersion: v1
    kind: Service
    metadata:
      name: ...
      namespace: ...
      annotations: ...
    spec:
      selector:
        app: ...
      type: ClusterIP # ClusterIP gives internal DNS
      ports:
        - port: 9090       # The service listens on this port
          targetPort: 9090 # The service communicates with pods on this port
          protocol: TCP

    ---

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus-deployment
      namespace: monitoring
      labels:
        app: prometheus-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: prometheus-server
      template:
        metadata:
          labels:
            app: prometheus-server
        spec:
          containers:
            - ...
              ports:
                - containerPort: 9090 # should match targetPort in the service
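
     A quick way to compare the two values without opening the manifests, assuming the Service and Deployment names used elsewhere in this guide (adjust for your cluster):

     kubectl get service prometheus-service -n monitoring -o jsonpath='{.spec.ports[0].targetPort}'
     kubectl get deployment prometheus-deployment -n monitoring -o jsonpath='{.spec.template.spec.containers[0].ports[0].containerPort}'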