Management

How to manage your OpenTelemetry Collector deployment at scale

This document describes how you can manage your OpenTelemetry Collector deployment at scale.

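To get the most out of this page, you should know how to install and configure the Collector. These topics are covered elsewhere:

  • Quick Start, to understand how to install the OpenTelemetry Collector.
  • Configuration, to learn how to configure the OpenTelemetry Collector and set up telemetry pipelines.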

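Basics

Telemetry collection at scale requires a structured approach to managing agents. Typical agent management tasks include:

  1. Querying agent information and configuration. Agent information can include the agent's version, operating-system-related information, or capabilities. The agent's configuration refers to its telemetry collection setup, for example, the OpenTelemetry Collector configuration.
  2. Upgrading or downgrading agents and managing agent-specific packages, including the base agent functionality and plugins.
  3. Applying new configurations to agents. This may be required because of changes in the environment or because of policy changes.
  4. Monitoring agent health and performance, typically CPU and memory usage as well as agent-specific metrics such as the rate of processing or backpressure-related information.
  5. Managing the connection between the control plane and the agent, for example handling TLS certificates (revocation and rotation).

Not every use case requires support for all of the agent management tasks above. In the context of OpenTelemetry, task 4 (health and performance monitoring) is ideally done using OpenTelemetry itself.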

OpAMP

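Observability vendors and cloud providers offer proprietary solutions for agent management. In the open source observability space, there is an emerging standard that you can use for agent management: the Open Agent Management Protocol (OpAMP).

The OpAMP specification defines how to manage a fleet of telemetry data agents. These agents can be OpenTelemetry Collectors, Fluent Bit, or other agents in any arbitrary combination.

Note: The term "agent" is used here as a catch-all term for OpenTelemetry components that respond to OpAMP. This could be the Collector, but also SDK components.

OpAMP is a client/server protocol that supports communication over HTTP and over WebSockets:

  • The OpAMP server is part of the control plane and acts as the orchestrator, managing a fleet of telemetry agents.
  • The OpAMP client is part of the data plane. The client side of OpAMP can be implemented in-process, as is the case for the OpAMP support in the OpenTelemetry Collector. It can also be implemented out-of-process. In that case, you can use a supervisor that takes care of the OpAMP-specific communication with the OpAMP server and at the same time controls the telemetry agent, for example to apply a configuration or to upgrade it. Note that the communication between the supervisor and the telemetry agent is not part of OpAMP.

Let's have a look at a concrete setup.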

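[Figure: OpAMP example setup]

  1. The OpenTelemetry Collector, configured with pipelines to:
    • (A) receive signals from downstream sources
    • (B) export signals to upstream destinations, potentially including telemetry about the Collector itself (represented by the OpAMP own_xxx connection settings).
  2. The bidirectional OpAMP control flow between the control plane, which implements the server side of OpAMP, and the Collector (or a supervisor controlling the Collector), which implements the client side of OpAMP.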

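Try it out

You can try out a simple OpAMP setup yourself by using the OpAMP protocol implementation in Go. For the following walkthrough, you need Go 1.22 or later.

We will set up a simple OpAMP control plane consisting of an example OpAMP server, and let an OpenTelemetry Collector connect to it through the OpAMP Supervisor.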
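Before starting, it can help to confirm that the installed Go toolchain meets the 1.22+ requirement. The following is a minimal sketch that parses the line printed by `go version`. The function name go_at_least_1_22 is illustrative and not part of any OpenTelemetry tooling.

```shell
#!/bin/sh
# go_at_least_1_22 VERSION_LINE
# Succeeds if VERSION_LINE (the output of `go version`, for example
# "go version go1.22.3 linux/amd64") reports Go 1.22 or newer.
go_at_least_1_22() {
  # Extract "MAJOR MINOR" from the version line; empty if it does not match.
  ver=$(printf '%s\n' "$1" |
    sed -n 's/^go version go\([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
  [ -n "$ver" ] || return 1
  # Split into positional parameters: $1 = major, $2 = minor.
  set -- $ver
  [ "$1" -gt 1 ] || { [ "$1" -eq 1 ] && [ "$2" -ge 22 ]; }
}

# Usage (requires Go on the PATH):
#   go_at_least_1_22 "$(go version)" || echo "Go 1.22+ is required" >&2
```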

Step 1 - Start the OpAMP server

Clone the open-telemetry/opamp-go repository:

git clone https://github.com/open-telemetry/opamp-go.git

In the ./opamp-go/internal/examples/server directory, launch the OpAMP server:

$ go run .
2025/04/20 15:10:35.307207 [MAIN] OpAMP Server starting...
2025/04/20 15:10:35.308201 [MAIN] OpAMP Server running...

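Step 2 - Install the OpenTelemetry Collector

We need an OpenTelemetry Collector binary that the OpAMP Supervisor can manage. For that, install the OpenTelemetry Collector Contrib distribution. The path where you installed the Collector binary is referred to as $OTEL_COLLECTOR_BINARY in the following configuration.

Step 3 - Install the OpAMP Supervisor

The opampsupervisor binary is available as a downloadable asset from the OpenTelemetry Collector releases with cmd/opampsupervisor tags. The assets are named by operating system and chipset, so download the one that fits your setup.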

Linux (amd64):

curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
"https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_linux_amd64"
chmod +x opampsupervisor

Linux (arm64):

curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
"https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_linux_arm64"
chmod +x opampsupervisor

Linux (ppc64le):

curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
"https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_linux_ppc64le"
chmod +x opampsupervisor

macOS (amd64):

curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
"https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_darwin_amd64"
chmod +x opampsupervisor

macOS (arm64):

curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
"https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_darwin_arm64"
chmod +x opampsupervisor

Windows (amd64), PowerShell:

Invoke-WebRequest -Uri "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_windows_amd64.exe" -OutFile "opampsupervisor.exe"
Unblock-File -Path "opampsupervisor.exe"
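The download commands above differ only in the operating system and architecture suffix of the asset name. As a rough sketch for Linux and macOS, the suffix can be derived from the values that `uname -s` and `uname -m` typically report. The function names supervisor_asset and supervisor_url are illustrative, and the mapping covers only the combinations listed in this walkthrough.

```shell
#!/bin/sh
# Release version used in this walkthrough.
VERSION="0.142.0"

# supervisor_asset OS ARCH
# Prints the asset file name for the given `uname -s` / `uname -m` values,
# or fails for combinations not listed in this walkthrough.
supervisor_asset() {
  case "$1" in
    Linux)  os=linux ;;
    Darwin) os=darwin ;;
    *) echo "unsupported OS: $1" >&2; return 1 ;;
  esac
  case "$2" in
    x86_64|amd64)  arch=amd64 ;;
    aarch64|arm64) arch=arm64 ;;
    ppc64le)       arch=ppc64le ;;
    *) echo "unsupported architecture: $2" >&2; return 1 ;;
  esac
  # Only the Linux build is listed above for ppc64le.
  if [ "$os" = darwin ] && [ "$arch" = ppc64le ]; then
    echo "unsupported combination: $1 $2" >&2
    return 1
  fi
  echo "opampsupervisor_${VERSION}_${os}_${arch}"
}

# supervisor_url OS ARCH
# Prints the full download URL, following the pattern of the URLs above.
supervisor_url() {
  asset=$(supervisor_asset "$1" "$2") || return 1
  echo "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv${VERSION}/${asset}"
}

# Usage (performs a network download, so it is left commented out):
#   curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
#     "$(supervisor_url "$(uname -s)" "$(uname -m)")" && chmod +x opampsupervisor
```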

Step 4 - Create an OpAMP Supervisor configuration file

Create a file named supervisor.yaml with the following content:

server:
  endpoint: wss://127.0.0.1:4320/v1/opamp
  tls:
    insecure_skip_verify: true

capabilities:
  accepts_remote_config: true
  reports_effective_config: true
  reports_own_metrics: false
  reports_own_logs: true
  reports_own_traces: false
  reports_health: true
  reports_remote_config: true

agent:
  executable: $OTEL_COLLECTOR_BINARY

storage:
  directory: ./storage
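In the configuration above, $OTEL_COLLECTOR_BINARY stands for the path of your Collector binary. One way to fill it in, sketched here under the assumption that you saved the configuration with the placeholder still in it, is to substitute the real path with sed. The function name render_supervisor_config, the template file name supervisor.yaml.tmpl, and the path /usr/local/bin/otelcol-contrib are all hypothetical.

```shell
#!/bin/sh
# render_supervisor_config TEMPLATE BINARY_PATH
# Prints TEMPLATE with the literal text $OTEL_COLLECTOR_BINARY replaced by
# BINARY_PATH. The template file itself is left unchanged.
render_supervisor_config() {
  template=$1
  binary=$2
  # Escape characters that are special in a sed replacement:
  # backslash, ampersand, and the | delimiter used below.
  escaped=$(printf '%s' "$binary" | sed 's/[\\&|]/\\&/g')
  # \$ makes sed match a literal dollar sign rather than end-of-line.
  sed "s|\\\$OTEL_COLLECTOR_BINARY|${escaped}|" "$template"
}

# Usage (hypothetical file name and install path):
#   render_supervisor_config supervisor.yaml.tmpl /usr/local/bin/otelcol-contrib > supervisor.yaml
```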

Step 5 - Run the OpAMP Supervisor

Now it's time to launch the supervisor, which in turn launches your OpenTelemetry Collector:

$ ./opampsupervisor --config=./supervisor.yaml
{"level":"info","ts":1745154644.746028,"logger":"supervisor","caller":"supervisor/supervisor.go:340","msg":"Supervisor starting","id":"01965352-9958-72da-905c-e40329c32c64"}
{"level":"info","ts":1745154644.74608,"logger":"supervisor","caller":"supervisor/supervisor.go:1086","msg":"No last received remote config found"}

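If everything worked out, you should now be able to go to https://:4321/ (port 4321 on the host running the example server from Step 1) and access the OpAMP server UI. You should see your Collector listed among the agents managed by the Supervisor:

[Figure: OpAMP example setup]

Step 6 - Configure the OpenTelemetry Collector remotely

Click the Collector in the server UI and paste the following content into the Additional Configuration box: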

receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:

exporters:
  # NOTE: Prior to v0.86.0 use `logging` instead of `debug`.
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [debug]

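Click Save and Send to Agent:

[Figure: OpAMP additional configuration]

Reload the page and verify that the agent status shows Up: true:

[Figure: OpAMP agent]

You can query the Collector for the exported metrics (note the label values):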

$ curl localhost:8888/metrics
# HELP otelcol_exporter_send_failed_metric_points Number of metric points in failed attempts to send to destination. [alpha]
# TYPE otelcol_exporter_send_failed_metric_points counter
otelcol_exporter_send_failed_metric_points{exporter="debug",service_instance_id="01965352-9958-72da-905c-e40329c32c64",service_name="otelcol-contrib",service_version="0.124.1"} 0
# HELP otelcol_exporter_sent_metric_points Number of metric points successfully sent to destination. [alpha]
# TYPE otelcol_exporter_sent_metric_points counter
otelcol_exporter_sent_metric_points{exporter="debug",service_instance_id="01965352-9958-72da-905c-e40329c32c64",service_name="otelcol-contrib",service_version="0.124.1"} 132
# HELP otelcol_process_cpu_seconds Total CPU user and system time in seconds [alpha]
# TYPE otelcol_process_cpu_seconds counter
otelcol_process_cpu_seconds{service_instance_id="01965352-9958-72da-905c-e40329c32c64",service_name="otelcol-contrib",service_version="0.124.1"} 0.127965
...
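To pull a single value out of this output, for instance to check that the debug exporter is sending data points, a small awk filter is enough. This is a minimal sketch that assumes the sample lines carry no trailing timestamp, as in the output above. The function name metric_value is illustrative.

```shell
#!/bin/sh
# metric_value NAME
# Reads Prometheus text exposition format from stdin and prints the value
# of the first sample whose metric name equals NAME.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                 # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)     # drop the label set, keep the bare name
      if (metric == name) {
        print $NF                 # last field is the sample value
        exit
      }
    }'
}

# Usage (requires the Collector from this walkthrough to be running):
#   curl -s localhost:8888/metrics | metric_value otelcol_exporter_sent_metric_points
```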

You can also inspect the Collector's logs:

$ cat ./storage/agent.log
{"level":"info","ts":"2025-04-20T15:11:12.996+0200","caller":"service@v0.124.0/service.go:199","msg":"Setting up own telemetry..."}
{"level":"info","ts":"2025-04-20T15:11:12.996+0200","caller":"builders/builders.go:26","msg":"Development component. May change in the future."}
{"level":"info","ts":"2025-04-20T15:11:12.997+0200","caller":"service@v0.124.0/service.go:266","msg":"Starting otelcol-contrib...","Version":"0.124.1","NumCPU":11}
{"level":"info","ts":"2025-04-20T15:11:12.997+0200","caller":"extensions/extensions.go:41","msg":"Starting extensions..."}
{"level":"info","ts":"2025-04-20T15:11:12.997+0200","caller":"extensions/extensions.go:45","msg":"Extension is starting..."}
{"level":"info","ts":"2025-04-20T15:11:13.022+0200","caller":"extensions/extensions.go:62","msg":"Extension started."}
{"level":"info","ts":"2025-04-20T15:11:13.022+0200","caller":"extensions/extensions.go:45","msg":"Extension is starting..."}
{"level":"info","ts":"2025-04-20T15:11:13.022+0200","caller":"healthcheckextension@v0.124.1/healthcheckextension.go:32","msg":"Starting health_check extension","config":{"Endpoint":"localhost:58760","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
{"level":"info","ts":"2025-04-20T15:11:13.022+0200","caller":"extensions/extensions.go:62","msg":"Extension started."}
{"level":"info","ts":"2025-04-20T15:11:13.024+0200","caller":"healthcheck/handler.go:132","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":"2025-04-20T15:11:13.024+0200","caller":"service@v0.124.0/service.go:289","msg":"Everything is ready. Begin running and processing data."}
{"level":"info","ts":"2025-04-20T15:11:14.025+0200","msg":"Metrics","resource metrics":1,"metrics":1,"data points":44}

Other information