
Prometheus

0. Required Metrics

Container and pod metrics are used to determine recommendations for individual workloads. Node and cluster metrics are used to determine cost and overall cluster health.

If you have custom metric names, please contact us for further assistance.

The following table lists every query that Flightcrew runs (filtered by cluster).
Metric Type | Query
CPU Allocatable | sum (kube_node_status_allocatable{resource="cpu"})
CPU Allocatable | sum by (node) (kube_node_status_allocatable{resource="cpu"})
CPU Capacity | sum (kube_node_status_capacity{resource="cpu"})
CPU Capacity | sum by (node) (kube_node_status_capacity{resource="cpu"})
CPU Limit | sum (kube_pod_container_resource_limits{resource="cpu"})
CPU Limit | sum by (container, namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})
CPU Limit | sum by (node) (kube_pod_container_resource_limits{resource="cpu"})
CPU Request | sum (kube_pod_container_resource_requests{resource="cpu"})
CPU Request | sum by (container, namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
CPU Request | sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
CPU Usage | sum (rate(container_cpu_usage_seconds_total{container!=""}[1m]) * on(pod) group_left(node) kube_pod_info)
CPU Usage | sum by (container, namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[1m]))
CPU Usage | sum by (node) (rate(container_cpu_usage_seconds_total{container!=""}[1m]) * on(pod) group_left(node) kube_pod_info)
Container Restart Count | sum by (container, namespace, pod) (increase(kube_pod_container_status_restarts_total[1m]))
Disk Capacity | sum by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes)
Disk Capacity | sum by (node) ((sum by (persistentvolumeclaim, namespace) (kubelet_volume_stats_capacity_bytes)) * on (persistentvolumeclaim, namespace) group_left(node) max by (persistentvolumeclaim, namespace, node) (kube_persistentvolumeclaim_info))
Disk Usage | sum by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes)
Disk Usage | sum by (node) ((sum by (persistentvolumeclaim, namespace) (kubelet_volume_stats_used_bytes)) * on (persistentvolumeclaim, namespace) group_left(node) max by (persistentvolumeclaim, namespace, node) (kube_persistentvolumeclaim_info))
Memory Allocatable | sum (kube_node_status_allocatable{resource="memory"})
Memory Allocatable | sum by (node) (kube_node_status_allocatable{resource="memory"})
Memory Capacity | sum (kube_node_status_capacity{resource="memory"})
Memory Capacity | sum by (node) (kube_node_status_capacity{resource="memory"})
Memory Limit | sum (kube_pod_container_resource_limits{resource="memory"})
Memory Limit | sum by (container, namespace, pod) (kube_pod_container_resource_limits{resource="memory"})
Memory Limit | sum by (node) (kube_pod_container_resource_limits{resource="memory"})
Memory Request | sum (kube_pod_container_resource_requests{resource="memory"})
Memory Request | sum by (container, namespace, pod) (kube_pod_container_resource_requests{resource="memory"})
Memory Request | sum by (node) (kube_pod_container_resource_requests{resource="memory"})
Memory Usage | sum (last_over_time(container_memory_working_set_bytes{container!=""}[1m]) * on(pod) group_left(node) kube_pod_info)
Memory Usage | sum by (container, namespace, pod) (last_over_time(container_memory_working_set_bytes{container!=""}[1m]))
Memory Usage | sum by (node) (last_over_time(container_memory_working_set_bytes{container!=""}[1m]) * on(pod) group_left(node) kube_pod_info)
Pod Readiness | (min by (pod, namespace) (kube_pod_info{created_by_kind!="Job"} * on (pod, namespace) ((min_over_time(kube_pod_status_ready{pod!="", condition="true"}[1m]) > 0) / (min_over_time(kube_pod_status_ready{pod!="", condition="true"}[1m]) > 0) or (min_over_time(kube_pod_status_ready{pod!="", condition!="true"}[1m]) > 0) * 0)))*100
Pod Readiness | (min by (pod, namespace) ((min_over_time(kube_pod_status_phase{pod!="", phase=~"Running|Succeeded"}[1m]) > 0) / (min_over_time(kube_pod_status_phase{pod!="", phase=~"Running|Succeeded"}[1m]) > 0) or ((min_over_time(kube_pod_status_phase{pod!="", phase!~"Running|Succeeded"}[1m]) > 0) * 0)) * on (pod, namespace) kube_pod_info{created_by_kind="Job"})*100
Status Phase | sum by (namespace, pod, phase) (kube_pod_status_phase) > 0
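
To spot-check any of these queries by hand, you can send it to the Prometheus HTTP API. The example below is a minimal sketch that assumes Prometheus is reachable on localhost:9090 (for example via the port-forward in step 3); --data-urlencode takes care of the braces and brackets in the query:

# Run one of the queries above against the Prometheus HTTP API.
curl --silent --get http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (container, namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[1m]))'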

1. Set up kube-state-metrics

Skip this step if: kube-state-metrics is already installed on your cluster.

Actions: Install kube-state-metrics.

If Prometheus has been installed via Helm, add the following lines to values.yaml:

kube-state-metrics:
  enabled: true

If Prometheus has been installed another way, kube-state-metrics can be installed as a standalone Helm chart with the following commands:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-state-metrics prometheus-community/kube-state-metrics -n kube-system
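
Verify: Check that kube-state-metrics is running and serving metrics. A quick check, assuming the release name and namespace used above and the chart's default metrics port of 8080 (adjust if yours differ):

kubectl get pods -n kube-system -l app.kubernetes.io/name=kube-state-metrics
kubectl port-forward service/kube-state-metrics 8080:8080 -n kube-system
# In another terminal, fetch a sample of the exported metrics:
curl --silent http://localhost:8080/metrics | head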

2. Set up scrape configs

Skip this step if: Prometheus was installed via the Helm chart. The chart already includes the relevant scrape configs, so you can move on to the next step.

Actions: Add the scrape configs.

  1. Copy the Prometheus Helm chart's scrape configs into your Prometheus server
  2. Restart Prometheus (kubectl delete pods ...) to enact the new configuration.

Note: if you've configured any scrape configs of your own in values.yaml, they can overwrite these defaults, so move your overrides into extraScrapeConfigs instead (see the sketch below). Enact the new configuration using helm uninstall and helm install.
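
For example, a custom job that was previously defined alongside the default scrape configs could be moved into extraScrapeConfigs like this. This is a sketch assuming the prometheus-community/prometheus chart, where extraScrapeConfigs is appended to the generated scrape_configs; the job name and target are placeholders:

extraScrapeConfigs: |
  - job_name: my-custom-app                  # placeholder job name
    static_configs:
      - targets:
          - my-custom-app.default.svc:8080   # placeholder scrape target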

Verify: Examine the scrape configs with the following commands:

kubectl get configmaps -n <namespace-of-prometheus>
kubectl get configmap <configmap-name> -n <namespace-of-prometheus> -o yaml

The scrape_configs should look something like this if correctly configured:

- job_name: 'kubernetes-pods'

  kubernetes_sd_configs:
    - role: pod

  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
  ...
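
This job only scrapes pods that opt in through annotations. For reference, a pod that should be picked up would carry annotations along these lines. The name, image, path, and port below are illustrative, and the prometheus.io/port rule is part of the full default config even though it is elided above:

apiVersion: v1
kind: Pod
metadata:
  name: example-app                # illustrative name
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule above
    prometheus.io/path: "/metrics" # copied into __metrics_path__ by the replace rule
    prometheus.io/port: "8080"     # honored by the full default config
spec:
  containers:
    - name: app
      image: example-app:latest    # illustrative image
      ports:
        - containerPort: 8080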

3. Verify API Access

Actions: Curl Prometheus directly and make sure the response has the data Flightcrew is looking for.

First, port-forward your Prometheus instance:

# Replace "prometheus-service" and "monitoring" namespace
# if they are named differently in your cluster.
export PROMETHEUS_CONTAINER_PORT=$(kubectl get service prometheus-service --namespace=monitoring -ojsonpath="{.spec.ports[].port}")
kubectl port-forward "service/prometheus-service" --namespace monitoring "9090:${PROMETHEUS_CONTAINER_PORT}"

Then, in another terminal window, paste the following curl commands to query Prometheus:

# Some example metrics that Flightcrew reads from:
curl --silent http://localhost:9090/api/v1/query?query=kube_pod_info | head --bytes=150
curl --silent http://localhost:9090/api/v1/query?query=kube_node_status_allocatable | head --bytes=150
curl --silent http://localhost:9090/api/v1/query?query=container_cpu_usage_seconds_total | head --bytes=150

See a complete list of metrics and their usage in Required Metrics.

Verify: The output should contain data in the "result" section. For example:

{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kube_pod_info","app_kubernetes_io_component":"metrics", ... }}]}}

If the result list is empty and the response just shows "result":[], one of the steps above may not have been completed correctly; see the Troubleshooting steps below.
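
If you have jq installed, a quick way to check whether a query returned any series is to count the entries in the result array (0 means empty):

curl --silent http://localhost:9090/api/v1/query?query=kube_pod_info | jq '.data.result | length'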

Troubleshooting

Common problems we've seen while configuring Prometheus:

  1. Prometheus was not restarted - Ensure Prometheus has been restarted for the config changes to take effect; see the example below.
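
     For example, if Prometheus runs as a Deployment (as in the snippet under item 2), a rollout restart or deleting the pods forces it to pick up the new configuration. The resource names and label are placeholders taken from the examples in this guide; adjust them for your cluster:

     kubectl rollout restart deployment/prometheus-deployment -n monitoring
     # Or delete the pods so they are recreated with the new configuration:
     kubectl delete pods -l app=prometheus-server -n monitoring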

  2. The Service is pointing at the wrong port - Ensure spec.ports.targetPort on the Service matches spec.template.spec.containers.ports.containerPort on the Deployment. See the snippet below for an example where the ports are correctly aligned, and the commands after it for a quick way to compare the two values:

    apiVersion: v1
    kind: Service
    metadata:
      name: ...
      namespace: ...
      annotations: ...
    spec:
      selector:
        app: ...
      type: ClusterIP # ClusterIP gives internal DNS
      ports:
        - port: 9090       # The service listens on this port
          targetPort: 9090 # The service communicates with pods on this port
          protocol: TCP

    ---

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus-deployment
      namespace: monitoring
      labels:
        app: prometheus-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: prometheus-server
      template:
        metadata:
          labels:
            app: prometheus-server
        spec:
          containers:
            - ...
              ports:
                - containerPort: 9090 # should match targetPort in the service
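
     A quick way to compare the two values without opening the manifests, assuming the Service and Deployment names used elsewhere in this guide (adjust for your cluster):

     kubectl get service prometheus-service -n monitoring -o jsonpath='{.spec.ports[0].targetPort}'
     kubectl get deployment prometheus-deployment -n monitoring -o jsonpath='{.spec.template.spec.containers[0].ports[0].containerPort}'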