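Management
This document describes how you can manage your OpenTelemetry Collector deployment at scale.
To get the most out of this page, you should know how to install and configure the Collector. These topics are covered elsewhere.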
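Basics
Telemetry collection at scale requires a structured approach to managing agents. Typical agent management tasks include:
- Querying agent information and configuration. Agent information can include its version, operating-system-related information, or capabilities. Agent configuration refers to its telemetry collection setup, for example, the OpenTelemetry Collector configuration.
- Upgrading or downgrading agents and managing agent-specific packages, including the base agent functionality and plugins.
- Applying new configurations to agents. This might be necessary because of changes in the environment or changes in policy.
- Health and performance monitoring of the agents, typically CPU and memory usage, as well as agent-specific metrics such as the processing rate or backpressure-related information.
- Connection management between the control plane and the agent, such as handling TLS certificates (revocation and rotation).
Not every use case requires support for all of the agent management tasks above. In the context of OpenTelemetry, the fourth task, health and performance monitoring, is best done using OpenTelemetry itself.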
OpAMP
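Observability vendors and cloud providers offer proprietary solutions for agent management. In the open source observability space, there is an emerging standard you can use for agent management: Open Agent Management Protocol (OpAMP).
The OpAMP specification defines how to manage a fleet of telemetry data agents. These agents can be OpenTelemetry Collectors, Fluent Bit, or other agents in any combination.
Note: The term "agent" is used here as a catch-all term for OpenTelemetry components that respond to OpAMP. This could be the Collector, but also SDK components.
OpAMP is a client/server protocol that supports communication over HTTP and over WebSockets:
- The OpAMP server is part of the control plane and acts as the orchestrator, managing a fleet of telemetry agents.
- The OpAMP client is part of the data plane. The client side of OpAMP can be implemented in-process, as in the case of OpAMP support in the OpenTelemetry Collector. It can also be implemented out-of-process. For the latter option, you can use a supervisor that handles the OpAMP-specific communication with the OpAMP server and at the same time controls the telemetry agent, for example to apply a configuration or to perform an upgrade. Note that the communication between the supervisor and the telemetry agent is not part of OpAMP.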
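To make the division of labor in the out-of-process option more concrete, here is a minimal Go sketch of a supervisor's role. Everything in it is hypothetical: `Agent`, `Supervisor`, `OnRemoteConfig`, and `fakeAgent` are illustrative names, not the opamp-go or opampsupervisor API, and the OpAMP server is simulated by direct method calls. It illustrates one plausible design choice: the supervisor treats a configuration as effective only once the agent has accepted it.

```go
package main

import (
	"errors"
	"fmt"
)

// Agent is a hypothetical stand-in for the telemetry agent (for example, a
// Collector process) that the supervisor controls. How a real supervisor
// talks to its agent is not part of OpAMP.
type Agent interface {
	ApplyConfig(cfg string) error
}

// Supervisor is a hypothetical sketch of an out-of-process OpAMP client: it
// receives remote configuration from the control plane and applies it to
// the agent it controls.
type Supervisor struct {
	agent     Agent
	effective string // last configuration the agent accepted
}

// OnRemoteConfig is called when the (simulated) OpAMP server sends a new
// configuration. The effective configuration only changes if the agent
// accepts the new one.
func (s *Supervisor) OnRemoteConfig(cfg string) error {
	if err := s.agent.ApplyConfig(cfg); err != nil {
		return fmt.Errorf("apply remote config: %w", err)
	}
	s.effective = cfg
	return nil
}

// EffectiveConfig returns the configuration the supervisor would report back
// to the server as the agent's effective configuration.
func (s *Supervisor) EffectiveConfig() string { return s.effective }

// fakeAgent simulates an agent that rejects empty configurations.
type fakeAgent struct{ running string }

func (a *fakeAgent) ApplyConfig(cfg string) error {
	if cfg == "" {
		return errors.New("empty configuration")
	}
	a.running = cfg
	return nil
}

func main() {
	s := &Supervisor{agent: &fakeAgent{}}
	if err := s.OnRemoteConfig("receivers: ..."); err != nil {
		fmt.Println("unexpected:", err)
	}
	// A rejected configuration leaves the effective configuration unchanged.
	fmt.Println(s.OnRemoteConfig(""))
	fmt.Println("effective:", s.EffectiveConfig())
}
```

Keeping the agent behind a small interface is what lets the supervisor stay agnostic of whether it drives a Collector or some other agent.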
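Let's look at a concrete setup:
- The OpenTelemetry Collector, configured with pipelines to
  - (A) receive signals from downstream sources
  - (B) export signals to upstream destinations, potentially including telemetry about the Collector itself (represented by the OpAMP own_xxx connection settings).
- The bidirectional OpAMP control flow between the control plane, which implements the server side of OpAMP, and the Collector (or a supervisor controlling the Collector), which implements the client side of OpAMP.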
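Try it out
You can try out a simple OpAMP setup using the Go implementation of the OpAMP protocol. For the following walkthrough, you need Go 1.22 or later.
We will set up a simple OpAMP control plane consisting of an example OpAMP server, and have the OpenTelemetry Collector connect to it through the OpAMP Supervisor.
Step 1 - Start the OpAMP server
Clone the open-telemetry/opamp-go repository: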
git clone https://github.com/open-telemetry/opamp-go.git
In the ./opamp-go/internal/examples/server directory, start the OpAMP server:
$ go run .
2025/04/20 15:10:35.307207 [MAIN] OpAMP Server starting...
2025/04/20 15:10:35.308201 [MAIN] OpAMP Server running...
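Step 2 - Install the OpenTelemetry Collector
We need an OpenTelemetry Collector binary that the OpAMP Supervisor can manage. For that, install the OpenTelemetry Collector Contrib distribution. The path where you installed the Collector binary is referred to as $OTEL_COLLECTOR_BINARY in the configuration below.
Step 3 - Install the OpAMP Supervisor
The opampsupervisor binary is available as a downloadable asset from the OpenTelemetry Collector releases tagged cmd/opampsupervisor. You will find a list of assets named by operating system and chipset; download the one that matches your setup: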
# Linux (AMD64)
curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
  "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_linux_amd64"
chmod +x opampsupervisor

# Linux (ARM64)
curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
  "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_linux_arm64"
chmod +x opampsupervisor

# Linux (ppc64le)
curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
  "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_linux_ppc64le"
chmod +x opampsupervisor

# macOS (AMD64)
curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
  "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_darwin_amd64"
chmod +x opampsupervisor

# macOS (ARM64)
curl --proto '=https' --tlsv1.2 -fL -o opampsupervisor \
  "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_darwin_arm64"
chmod +x opampsupervisor

# Windows (AMD64), PowerShell
Invoke-WebRequest -Uri "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fopampsupervisor%2Fv0.142.0/opampsupervisor_0.142.0_windows_amd64.exe" -OutFile "opampsupervisor.exe"
Unblock-File -Path "opampsupervisor.exe"
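If you script the download, the asset URL can be derived from the release version and your platform. This Go sketch reproduces the naming pattern visible in the URLs above. It assumes Go's runtime.GOOS and runtime.GOARCH values (linux, darwin, windows; amd64, arm64, ppc64le) match the asset suffixes, which holds for the six assets listed but is not guaranteed for other platforms; version 0.142.0 is simply the one used above.

```go
package main

import (
	"fmt"
	"runtime"
)

// releaseBase is the download prefix shared by all the URLs above.
const releaseBase = "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download"

// supervisorAssetURL builds the opampsupervisor download URL for a release
// version and a platform, following the pattern of the URLs above.
func supervisorAssetURL(version, goos, goarch string) string {
	name := fmt.Sprintf("opampsupervisor_%s_%s_%s", version, goos, goarch)
	if goos == "windows" {
		name += ".exe" // the Windows asset carries an .exe suffix
	}
	// The release tag cmd/opampsupervisor/v<version> appears with each '/' encoded as %2F.
	return fmt.Sprintf("%s/cmd%%2Fopampsupervisor%%2Fv%s/%s", releaseBase, version, name)
}

func main() {
	// Prints the URL for the platform this program runs on.
	fmt.Println(supervisorAssetURL("0.142.0", runtime.GOOS, runtime.GOARCH))
}
```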
Step 4 - Create an OpAMP Supervisor configuration file
Create a file named supervisor.yaml with the following content:
server:
  endpoint: wss://127.0.0.1:4320/v1/opamp
  tls:
    insecure_skip_verify: true

capabilities:
  accepts_remote_config: true
  reports_effective_config: true
  reports_own_metrics: false
  reports_own_logs: true
  reports_own_traces: false
  reports_health: true
  reports_remote_config: true

agent:
  executable: $OTEL_COLLECTOR_BINARY

storage:
  directory: ./storage
Make sure to replace $OTEL_COLLECTOR_BINARY with the actual file path. For example, on Linux or macOS, if you installed the Collector in /usr/local/bin/, replace $OTEL_COLLECTOR_BINARY with /usr/local/bin/otelcol.
Step 5 - Run the OpAMP Supervisor
Now it's time to start the supervisor, which in turn starts your OpenTelemetry Collector:
$ ./opampsupervisor --config=./supervisor.yaml
{"level":"info","ts":1745154644.746028,"logger":"supervisor","caller":"supervisor/supervisor.go:340","msg":"Supervisor starting","id":"01965352-9958-72da-905c-e40329c32c64"}
{"level":"info","ts":1745154644.74608,"logger":"supervisor","caller":"supervisor/supervisor.go:1086","msg":"No last received remote config found"}
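If everything worked out, you should now be able to go to http://localhost:4321/ and access the OpAMP server UI. You should see your Collector in the list of agents managed by the supervisor.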

Step 6 - Configure the OpenTelemetry Collector remotely
Click the Collector in the server UI and paste the following content into the Additional Configuration box:
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:

exporters:
  # NOTE: Prior to v0.86.0 use `logging` instead of `debug`.
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [debug]
Click Save and Send to Agent.

Reload the page and verify that the agent status shows Up: true.

You can query the Collector's own metrics (note the label values):
$ curl localhost:8888/metrics
# HELP otelcol_exporter_send_failed_metric_points Number of metric points in failed attempts to send to destination. [alpha]
# TYPE otelcol_exporter_send_failed_metric_points counter
otelcol_exporter_send_failed_metric_points{exporter="debug",service_instance_id="01965352-9958-72da-905c-e40329c32c64",service_name="otelcol-contrib",service_version="0.124.1"} 0
# HELP otelcol_exporter_sent_metric_points Number of metric points successfully sent to destination. [alpha]
# TYPE otelcol_exporter_sent_metric_points counter
otelcol_exporter_sent_metric_points{exporter="debug",service_instance_id="01965352-9958-72da-905c-e40329c32c64",service_name="otelcol-contrib",service_version="0.124.1"} 132
# HELP otelcol_process_cpu_seconds Total CPU user and system time in seconds [alpha]
# TYPE otelcol_process_cpu_seconds counter
otelcol_process_cpu_seconds{service_instance_id="01965352-9958-72da-905c-e40329c32c64",service_name="otelcol-contrib",service_version="0.124.1"} 0.127965
...
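In this setup, port 8888 serves the Collector's own metrics in the Prometheus text format. If you want to check a value programmatically, for example that otelcol_exporter_sent_metric_points keeps increasing, a small Go sketch can sum a metric's samples across all label sets. This is a hand-rolled, simplified parser written for illustration; it assumes well-formed input, and a dedicated Prometheus parsing library would be more robust for anything beyond a quick check.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"strconv"
	"strings"
)

// sumMetric adds up the sample values of one metric in Prometheus text
// exposition output, across all label combinations. HELP/TYPE comment
// lines are skipped; input is assumed to be well formed.
func sumMetric(exposition, name string) (float64, error) {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or whitespace (value).
		end := strings.IndexAny(line, "{ \t")
		if end <= 0 || line[:end] != name {
			continue
		}
		rest := line[end:]
		if rest[0] == '{' {
			rb := strings.LastIndex(rest, "}")
			if rb < 0 {
				return 0, fmt.Errorf("unterminated label set: %q", line)
			}
			rest = rest[rb+1:]
		}
		// The value is the first field after the labels; an optional timestamp may follow.
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, fmt.Errorf("missing value: %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("parse value in %q: %w", line, err)
		}
		total += v
	}
	return total, sc.Err()
}

func main() {
	resp, err := http.Get("http://localhost:8888/metrics")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	body, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	sent, err := sumMetric(string(body), "otelcol_exporter_sent_metric_points")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("metric points sent:", sent)
}
```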
You can also check the Collector's logs:
$ cat ./storage/agent.log
{"level":"info","ts":"2025-04-20T15:11:12.996+0200","caller":"service@v0.124.0/service.go:199","msg":"Setting up own telemetry..."}
{"level":"info","ts":"2025-04-20T15:11:12.996+0200","caller":"builders/builders.go:26","msg":"Development component. May change in the future."}
{"level":"info","ts":"2025-04-20T15:11:12.997+0200","caller":"service@v0.124.0/service.go:266","msg":"Starting otelcol-contrib...","Version":"0.124.1","NumCPU":11}
{"level":"info","ts":"2025-04-20T15:11:12.997+0200","caller":"extensions/extensions.go:41","msg":"Starting extensions..."}
{"level":"info","ts":"2025-04-20T15:11:12.997+0200","caller":"extensions/extensions.go:45","msg":"Extension is starting..."}
{"level":"info","ts":"2025-04-20T15:11:13.022+0200","caller":"extensions/extensions.go:62","msg":"Extension started."}
{"level":"info","ts":"2025-04-20T15:11:13.022+0200","caller":"extensions/extensions.go:45","msg":"Extension is starting..."}
{"level":"info","ts":"2025-04-20T15:11:13.022+0200","caller":"healthcheckextension@v0.124.1/healthcheckextension.go:32","msg":"Starting health_check extension","config":{"Endpoint":"localhost:58760","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
{"level":"info","ts":"2025-04-20T15:11:13.022+0200","caller":"extensions/extensions.go:62","msg":"Extension started."}
{"level":"info","ts":"2025-04-20T15:11:13.024+0200","caller":"healthcheck/handler.go:132","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":"2025-04-20T15:11:13.024+0200","caller":"service@v0.124.0/service.go:289","msg":"Everything is ready. Begin running and processing data."}
{"level":"info","ts":"2025-04-20T15:11:14.025+0200","msg":"Metrics","resource metrics":1,"metrics":1,"data points":44}
Other information
- Blog post
- YouTube video