Target Allocator

If you've enabled Target Allocator service discovery on the OpenTelemetry Operator and the Target Allocator is failing to discover scrape targets, there are a few troubleshooting steps that you can take to help you understand what's going on and get things back on track.

Troubleshooting steps

Did you deploy all of your resources to Kubernetes?

First, make sure that you have deployed all relevant resources to your Kubernetes cluster.
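
A quick way to confirm this is to list the relevant custom resources. This is a minimal check, assuming the opentelemetry namespace used throughout this page:

kubectl get opentelemetrycollectors,servicemonitors,podmonitors -n opentelemetry

If any of these come back empty, deploy the missing resource before moving on.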

Do you know if metrics are actually being scraped?

After deploying all of your resources to Kubernetes, make sure that the Target Allocator is discovering scrape targets from your ServiceMonitor(s) or PodMonitor(s).

Suppose that you have this ServiceMonitor definition:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sm-example
  namespace: opentelemetry
  labels:
    app.kubernetes.io/name: py-prometheus-app
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - opentelemetry
  endpoints:
    - port: prom
      path: /metrics
    - port: py-client-port
      interval: 15s
    - port: py-server-port

This Service definition:

apiVersion: v1
kind: Service
metadata:
  name: py-prometheus-app
  namespace: opentelemetry
  labels:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
spec:
  selector:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
  ports:
    - name: prom
      port: 8080

And this OpenTelemetryCollector definition:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otelcol
  namespace: opentelemetry
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
    serviceAccount: opentelemetry-targetallocator-sa
    prometheusCR:
      enabled: true
      podMonitorSelector: {}
      serviceMonitorSelector: {}
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 10s
              static_configs:
                - targets: ['0.0.0.0:8888']
    exporters:
      debug:
        verbosity: detailed

    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]
        metrics:
          receivers: [otlp, prometheus]
          exporters: [debug]
        logs:
          receivers: [otlp]
          exporters: [debug]
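
If you haven't applied these resources yet, you can do so with kubectl. The file names below are placeholders for wherever you saved the manifests above:

kubectl apply -f servicemonitor.yaml -f service.yaml -f otelcol.yaml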

First, set up a port-forward in Kubernetes, so that you can expose the Target Allocator service:

kubectl port-forward svc/otelcol-targetallocator -n opentelemetry 8080:80

Where otelcol-targetallocator maps to the value of the metadata.name of your OpenTelemetryCollector CR followed by the -targetallocator suffix, and opentelemetry maps to the namespace into which the OpenTelemetryCollector CR is deployed.

Next, get a list of jobs registered with the Target Allocator:

curl localhost:8080/jobs | jq

Your sample output should look like this:

{
  "serviceMonitor/opentelemetry/sm-example/1": {
    "_link": "/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F1/targets"
  },
  "serviceMonitor/opentelemetry/sm-example/2": {
    "_link": "/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F2/targets"
  },
  "otel-collector": {
    "_link": "/jobs/otel-collector/targets"
  },
  "serviceMonitor/opentelemetry/sm-example/0": {
    "_link": "/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F0/targets"
  },
  "podMonitor/opentelemetry/pm-example/0": {
    "_link": "/jobs/podMonitor%2Fopentelemetry%2Fpm-example%2F0/targets"
  }
}

Where serviceMonitor/opentelemetry/sm-example/0 represents one of the Service ports that the ServiceMonitor picked up:

  • opentelemetry is the namespace in which the ServiceMonitor resource resides.
  • sm-example is the name of the ServiceMonitor.
  • 0 is one of the port endpoints matched between the ServiceMonitor and the Service.

Similarly, the PodMonitor shows up as podMonitor/opentelemetry/pm-example/0 in the curl output.

This is good news, because it tells us that the scrape config discovery is working!

You might also be wondering about the otel-collector entry. It shows up because the prometheus receiver under spec.config.receivers of the OpenTelemetryCollector resource defines a scrape job named otel-collector, which makes the Collector scrape its own metrics:

prometheus:
  config:
    scrape_configs:
      - job_name: 'otel-collector'
        scrape_interval: 10s
        static_configs:
          - targets: ['0.0.0.0:8888']
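
Since otel-collector appears in the jobs list like any other job, you can inspect its targets the same way, using the _link value shown earlier (this assumes the port-forward from above is still active):

curl localhost:8080/jobs/otel-collector/targets | jq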

We can take a deeper look into serviceMonitor/opentelemetry/sm-example/0 to see which targets are being scraped, by running curl against the value of the _link output above:

curl localhost:8080/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F0/targets | jq

Sample output:

{
  "otelcol-collector-0": {
    "_link": "/jobs/serviceMonitor%2Fopentelemetry%2Fsm-example%2F0/targets?collector_id=otelcol-collector-0",
    "targets": [
      {
        "targets": ["10.244.0.11:8080"],
        "labels": {
          "__meta_kubernetes_endpointslice_port_name": "prom",
          "__meta_kubernetes_pod_labelpresent_app_kubernetes_io_name": "true",
          "__meta_kubernetes_endpointslice_port_protocol": "TCP",
          "__meta_kubernetes_endpointslice_address_target_name": "py-prometheus-app-575cfdd46-nfttj",
          "__meta_kubernetes_endpointslice_annotation_endpoints_kubernetes_io_last_change_trigger_time": "2024-06-21T20:01:37Z",
          "__meta_kubernetes_endpointslice_labelpresent_app_kubernetes_io_name": "true",
          "__meta_kubernetes_pod_name": "py-prometheus-app-575cfdd46-nfttj",
          "__meta_kubernetes_pod_controller_name": "py-prometheus-app-575cfdd46",
          "__meta_kubernetes_pod_label_app_kubernetes_io_name": "py-prometheus-app",
          "__meta_kubernetes_endpointslice_address_target_kind": "Pod",
          "__meta_kubernetes_pod_node_name": "otel-target-allocator-talk-control-plane",
          "__meta_kubernetes_pod_labelpresent_pod_template_hash": "true",
          "__meta_kubernetes_endpointslice_label_kubernetes_io_service_name": "py-prometheus-app",
          "__meta_kubernetes_endpointslice_annotationpresent_endpoints_kubernetes_io_last_change_trigger_time": "true",
          "__meta_kubernetes_service_name": "py-prometheus-app",
          "__meta_kubernetes_pod_ready": "true",
          "__meta_kubernetes_pod_labelpresent_app": "true",
          "__meta_kubernetes_pod_controller_kind": "ReplicaSet",
          "__meta_kubernetes_endpointslice_labelpresent_app": "true",
          "__meta_kubernetes_pod_container_image": "otel-target-allocator-talk:0.1.0-py-prometheus-app",
          "__address__": "10.244.0.11:8080",
          "__meta_kubernetes_service_label_app_kubernetes_io_name": "py-prometheus-app",
          "__meta_kubernetes_pod_uid": "495d47ee-9a0e-49df-9b41-fe9e6f70090b",
          "__meta_kubernetes_endpointslice_port": "8080",
          "__meta_kubernetes_endpointslice_label_endpointslice_kubernetes_io_managed_by": "endpointslice-controller.k8s.io",
          "__meta_kubernetes_endpointslice_label_app": "my-app",
          "__meta_kubernetes_service_labelpresent_app_kubernetes_io_name": "true",
          "__meta_kubernetes_pod_host_ip": "172.24.0.2",
          "__meta_kubernetes_namespace": "opentelemetry",
          "__meta_kubernetes_endpointslice_endpoint_conditions_serving": "true",
          "__meta_kubernetes_endpointslice_labelpresent_kubernetes_io_service_name": "true",
          "__meta_kubernetes_endpointslice_endpoint_conditions_ready": "true",
          "__meta_kubernetes_service_annotation_kubectl_kubernetes_io_last_applied_configuration": "{\"apiVersion\":\"v1\",\"kind\":\"Service\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"my-app\",\"app.kubernetes.io/name\":\"py-prometheus-app\"},\"name\":\"py-prometheus-app\",\"namespace\":\"opentelemetry\"},\"spec\":{\"ports\":[{\"name\":\"prom\",\"port\":8080}],\"selector\":{\"app\":\"my-app\",\"app.kubernetes.io/name\":\"py-prometheus-app\"}}}\n",
          "__meta_kubernetes_endpointslice_endpoint_conditions_terminating": "false",
          "__meta_kubernetes_pod_container_port_protocol": "TCP",
          "__meta_kubernetes_pod_phase": "Running",
          "__meta_kubernetes_pod_container_name": "my-app",
          "__meta_kubernetes_pod_container_port_name": "prom",
          "__meta_kubernetes_pod_ip": "10.244.0.11",
          "__meta_kubernetes_service_annotationpresent_kubectl_kubernetes_io_last_applied_configuration": "true",
          "__meta_kubernetes_service_labelpresent_app": "true",
          "__meta_kubernetes_endpointslice_address_type": "IPv4",
          "__meta_kubernetes_service_label_app": "my-app",
          "__meta_kubernetes_pod_label_app": "my-app",
          "__meta_kubernetes_pod_container_port_number": "8080",
          "__meta_kubernetes_endpointslice_name": "py-prometheus-app-bwbvn",
          "__meta_kubernetes_pod_label_pod_template_hash": "575cfdd46",
          "__meta_kubernetes_endpointslice_endpoint_node_name": "otel-target-allocator-talk-control-plane",
          "__meta_kubernetes_endpointslice_labelpresent_endpointslice_kubernetes_io_managed_by": "true",
          "__meta_kubernetes_endpointslice_label_app_kubernetes_io_name": "py-prometheus-app"
        }
      }
    ]
  }
}

The collector_id query parameter in the _link field of the preceding output states that these targets pertain to otelcol-collector-0 (the name of the StatefulSet created for the OpenTelemetryCollector resource).
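
If you also want to see the full Prometheus scrape configuration that the Target Allocator serves for each job, you can query its /scrape_configs endpoint (again assuming the port-forward is still active):

curl localhost:8080/scrape_configs | jq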

Is the Target Allocator enabled? Is Prometheus service discovery enabled?

If the preceding curl commands don't show your expected list of ServiceMonitors and PodMonitors, you need to check whether the features that populate those values are turned on.

One thing to remember is that just because you include the targetAllocator section in the OpenTelemetryCollector CR doesn't mean that it's enabled. You need to enable it explicitly. In addition, if you want to use Prometheus service discovery, you must enable it explicitly:

  • Set spec.targetAllocator.enabled to true
  • Set spec.targetAllocator.prometheusCR.enabled to true

So that your OpenTelemetryCollector resource looks like this:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otelcol
  namespace: opentelemetry
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
    serviceAccount: opentelemetry-targetallocator-sa
    prometheusCR:
      enabled: true

See the full OpenTelemetryCollector resource definition in "Do you know if metrics are actually being scraped?".
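
One way to verify what is actually set on the deployed resource is to read the targetAllocator section back from the cluster. This is a quick check, assuming the resource name and namespace from the examples above:

kubectl get opentelemetrycollector otelcol -n opentelemetry -o jsonpath='{.spec.targetAllocator}'

Both enabled fields should show up as true in the output.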

Did you configure a ServiceMonitor (or PodMonitor) selector?

If you configured a ServiceMonitor selector, it means that the Target Allocator only looks for ServiceMonitors whose metadata.labels match the value of serviceMonitorSelector.

Suppose that you configured a serviceMonitorSelector for your Target Allocator, like in the following example:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otelcol
  namespace: opentelemetry
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
    serviceAccount: opentelemetry-targetallocator-sa
    prometheusCR:
      enabled: true
      serviceMonitorSelector:
        matchLabels:
          app: my-app

By setting the value of spec.targetAllocator.prometheusCR.serviceMonitorSelector.matchLabels to app: my-app, your ServiceMonitor resource must also have that same value in metadata.labels:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sm-example
  labels:
    app: my-app
    release: prometheus
spec:

See the full ServiceMonitor resource definition in "Do you know if metrics are actually being scraped?".

In this case, the OpenTelemetryCollector resource's prometheusCR.serviceMonitorSelector.matchLabels is only looking for ServiceMonitors with the label app: my-app, as seen in the previous example.

If your ServiceMonitor resource is missing that label, the Target Allocator will fail to discover scrape targets from that ServiceMonitor.
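
To see which labels your ServiceMonitor actually carries, you can ask Kubernetes directly (assuming the sm-example ServiceMonitor from the examples above):

kubectl get servicemonitor sm-example -n opentelemetry --show-labels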

Did you leave out the serviceMonitorSelector and/or podMonitorSelector configuration altogether?

As mentioned in "Did you configure a ServiceMonitor (or PodMonitor) selector?", setting mismatched values for serviceMonitorSelector and podMonitorSelector results in the Target Allocator failing to discover scrape targets from your ServiceMonitors and PodMonitors, respectively.

Similarly, in the v1beta1 version of the OpenTelemetryCollector CR, leaving out this configuration altogether also results in the Target Allocator failing to discover scrape targets from your ServiceMonitors and PodMonitors.

As of v1beta1 of the OpenTelemetry Operator, a serviceMonitorSelector and a podMonitorSelector must be included, even if you don't intend to use them, like this:

prometheusCR:
  enabled: true
  podMonitorSelector: {}
  serviceMonitorSelector: {}

This configuration means that it will match all PodMonitor and ServiceMonitor resources. See the full OpenTelemetryCollector definition in "Do you know if metrics are actually being scraped?".

Do your labels, namespaces, and ports match for your ServiceMonitor and your Service (or your PodMonitor and your Pod)?

The ServiceMonitor is configured to pick up Kubernetes Services that match on:

  • Labels
  • Namespaces (optional)
  • Ports (endpoints)

Suppose that you have this ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sm-example
  labels:
    app: my-app
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - opentelemetry
  endpoints:
    - port: prom
      path: /metrics
    - port: py-client-port
      interval: 15s
    - port: py-server-port

The preceding ServiceMonitor is looking for any Services that:

  • Have the label app: my-app
  • Reside in a namespace called opentelemetry
  • Have a port named prom, py-client-port, or py-server-port

For example, the following Service resource would get picked up by the ServiceMonitor, because it matches the preceding criteria:

apiVersion: v1
kind: Service
metadata:
  name: py-prometheus-app
  namespace: opentelemetry
  labels:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
spec:
  selector:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
  ports:
    - name: prom
      port: 8080
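
To double-check which Services the ServiceMonitor's label selector matches, you can query by label. This is a quick check, assuming the opentelemetry namespace and the app: my-app label from the examples above:

kubectl get services -n opentelemetry -l app=my-app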

The following Service resource would not get picked up, because the ServiceMonitor is looking for ports named prom, py-client-port, or py-server-port, but this Service's port is named bleh:

apiVersion: v1
kind: Service
metadata:
  name: py-prometheus-app
  namespace: opentelemetry
  labels:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
spec:
  selector:
    app: my-app
    app.kubernetes.io/name: py-prometheus-app
  ports:
    - name: bleh
      port: 8080
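
To verify the port names that a given Service actually exposes, and catch a mismatch like the one above, you can extract them with jsonpath. This is a sketch assuming the Service name and namespace from the examples on this page:

kubectl get service py-prometheus-app -n opentelemetry -o jsonpath='{.spec.ports[*].name}'

If none of the names in the output appear in the ServiceMonitor's endpoints list, the ServiceMonitor won't pick up the Service.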