This topic describes how to modify the Prometheus monitoring configuration to scrape Metrics from virtual nodes.
Background
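In the virtual-node architecture of 天翼云 (CTyun) Serverless clusters, multiple virtual nodes in the same Serverless cluster share a single node IP. Prometheus usually scrapes the Metrics of all nodes through the Kubelet Service, so a scrape aimed at one virtual node returns the full data set for every virtual node, and Metrics end up duplicated. To solve this, Serverless clusters can return the Metrics of a specified virtual node only: the original endpoint <nodeIP>:10250/metrics/cadvisor is retained, and the data is additionally filtered by the specified nodeName so that nothing is scraped twice.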
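For illustration only, the conventional pattern that leads to the duplication described above might look like the following sketch. It is not taken from the product; the job name is hypothetical, and it uses node discovery rather than a Kubelet Service. Each discovered node becomes a target at <nodeIP>:10250, so virtual nodes that share one node IP all resolve to the same endpoint and return the same full data set.

```yaml
# Sketch of a conventional direct-kubelet scrape job (assumed, for comparison).
scrape_configs:
  - job_name: kubelet-cadvisor-direct   # hypothetical job name
    scheme: https
    metrics_path: /metrics/cadvisor
    kubernetes_sd_configs:
      - role: node                      # target address defaults to <nodeIP>:<kubelet port>
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
```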
Prerequisites
Make sure that you have created a Serverless cluster. For details, see Create a Serverless cluster.
Modify the Prometheus monitoring configuration
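You can modify the monitoring configuration to scrape the Metrics of specified virtual nodes. Serverless clusters support this both with the ccse-monitor plugin and with open-source Prometheus.
After the ccse-monitor plugin is installed in a Serverless cluster, its default configuration already scrapes virtual-node Metrics, and no further action is required. For open-source Prometheus, use the following configuration as a reference: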
scrape_configs:
  # ...
  - job_name: cadvisor
    honor_timestamps: true
    scrape_interval: 40s
    scrape_timeout: 10s
    # Overridden per target by the __metrics_path__ relabel rule below.
    metrics_path: /metrics
    scheme: https
    kubernetes_sd_configs:
      - role: node
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: false
    relabel_configs:
      # Keep only nodes that carry the collector scrape label with the value true.
      - source_labels: [ __meta_kubernetes_node_label_kubernetes_poseidon_ctyun_cn_collector_scrape ]
        regex: true
        action: keep
      # Labels that start with __ are dropped after relabeling; the labelmap action copies them to new names.
      # Note: "$$" appears to be the escaped form of "$" needed when this configuration is loaded
      # through an OpenTelemetry Collector. In a standalone prometheus.yml, write $1 and ${1} instead.
      - separator: ;
        regex: __meta_kubernetes_node_label_(.+)
        replacement: $$1
        action: labelmap
      # Send every scrape to the API server instead of to the shared node IP.
      - separator: ;
        regex: (.*)
        target_label: __address__
        replacement: kubernetes.default.svc:443
        action: replace
      # Use the API server's node proxy path so that each target returns only its own node's data.
      - source_labels: [ __meta_kubernetes_node_name ]
        separator: ;
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
        action: replace
      # The {{ ... }} values below are Helm template placeholders; replace them with the
      # actual values for your cluster when writing a static configuration.
      - replacement: {{.Values.otelCollector.clusterName}}
        target_label: cluster_name
        action: replace
      - replacement: {{.Values.otelCollector.regionCode}}
        target_label: region_code
        action: replace
      - replacement: {{.Values.otelCollector.tenantCode}}
        target_label: tenant_code
        action: replace
      - replacement: {{.Values.otelCollector.tenantId}}
        target_label: tenant_id
        action: replace
      - replacement: {{.Values.otelCollector.tenantName}}
        target_label: tenant_name
        action: replace
      - replacement: {{.Values.otelCollector.instanceId}}
        target_label: instance_id
        action: replace
      - replacement: CCSE
        target_label: carms_obj_type
        action: replace
      - replacement: {{.Values.otelCollector.instanceId}}
        target_label: carms_obj_id
        action: replace
      - replacement: {{.Values.otelCollector.clusterName}}
        target_label: carms_obj_name
        action: replace
      - replacement: {{.Values.otelCollector.regionName}}
        target_label: region_name
        action: replace
    metric_relabel_configs:
      # Keep only the listed container and machine metrics (prefix match).
      - source_labels: [ __name__ ]
        regex: (container_memory_failures_total|container_memory_rss|container_spec_memory_limit_bytes|container_memory_failcnt|container_memory_cache|container_memory_swap|container_memory_usage_bytes|container_memory_max_usage_bytes|container_cpu_load_average_10s|container_fs_reads_total|container_fs_writes_total|container_network_transmit_errors_total|container_network_transmit_packets_total|container_network_receive_errors_total|container_network_receive_bytes_total|container_memory_working_set_bytes|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_writes_bytes_total|container_spec_cpu_quota|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_cfs_throttled_seconds_total|container_fs_inodes_free|container_fs_io_time_seconds_total|container_fs_io_time_weighted_seconds_total|container_fs_limit_bytes|container_tasks_state|container_fs_read_seconds_total|container_fs_write_seconds_total|container_fs_usage_bytes|container_fs_inodes_total|container_fs_io_current|machine_cpu_cores|machine_memory_bytes|container_network_transmit_bytes_total).*
        action: keep
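Because the job above discovers nodes and then scrapes through the API server path /api/v1/nodes/<nodeName>/proxy/metrics/cadvisor, the ServiceAccount whose token Prometheus uses needs permission to list nodes and to read the nodes/proxy subresource. The following is a minimal sketch of such permissions, not a configuration taken from the product. The ServiceAccount name prometheus, the namespace monitoring, and the role name prometheus-node-proxy are all assumptions; adjust them to match your deployment.

```yaml
# Assumed names: ServiceAccount "prometheus" in namespace "monitoring".
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-node-proxy        # hypothetical role name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]  # needed for kubernetes_sd_configs with role: node
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]                   # needed for /api/v1/nodes/<nodeName>/proxy/...
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-node-proxy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-node-proxy
subjects:
  - kind: ServiceAccount
    name: prometheus                 # assumed ServiceAccount name
    namespace: monitoring            # assumed namespace
```

If scrapes return HTTP 403 after the configuration is applied, missing nodes/proxy permission is one likely cause to check first.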