1. Introduction
In the previous section we showed how to use the kubemark tool to simulate a large-scale Kubernetes cluster with a great many nodes (say, 5000). With that cluster in place, how do we actually run tests against it? For this we use clusterloader2.
Note: the older Kubernetes performance tests, i.e. the e2e test scripts, live in the Kubernetes source tree and are fairly simple to run, roughly as follows:
make WHAT="test/e2e/e2e.test"
./e2e.test --kube-master=192.168.0.16 --host=https://192.168.0.16:6443 --ginkgo.focus="\[Performance\]" --provider=local --kubeconfig=kubemark.kubeconfig --num-nodes=10 --v=3 --ginkgo.failFast --e2e-output-dir=. --report-dir=.
Building clusterloader2
Pull the perf-tests project from GitHub; it contains clusterloader2. Clone the perf-tests source into $GOPATH/src/k8s.io/perf-tests and check out the branch matching the Kubernetes cluster under test, then build from the clusterloader2 directory.
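For example (a sketch; the release-1.19 branch here is only an illustration — check out whichever branch matches your cluster's version):
mkdir -p $GOPATH/src/k8s.io && cd $GOPATH/src/k8s.io
git clone https://github.com/kubernetes/perf-tests.git
cd perf-tests
git checkout release-1.19   # assumed example; match the cluster under test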
cd $GOPATH/src/k8s.io/perf-tests/clusterloader2
go build -o clusterloader './cmd/'
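A quick way to confirm the build produced a working binary is to print its flags:
./clusterloader --help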
2. Running the test cases
Clusterloader2 ships two main test cases:
(1) Density test: measures performance indicators at a given node and pod scale. The rough idea: in a cluster of N nodes, create 30*N pods back to back, then delete them all, tracking throughout whether the three SLOs introduced earlier are met. On a 5000-node kubemark cluster, for example, that means 150000 saturation pods.
(2) Load test: issues a large volume of creates, deletes, LISTs and other operations against many resource types, again tracking whether the three SLOs hold.
Taking the density test as an example: before running it, adjust the variable parameters in the test configuration to match your scenario. Two files are involved, config.yaml and deployment.yaml, both found under perf-tests/clusterloader2/testing/density.
The config file can be modified along the following lines:
# ASSUMPTIONS:
# - Underlying cluster should have 100+ nodes.
# - Number of nodes should be divisible by NODES_PER_NAMESPACE (default 100).
#Constants
{{$DENSITY_RESOURCE_CONSTRAINTS_FILE := DefaultParam .DENSITY_RESOURCE_CONSTRAINTS_FILE ""}}
# Cater for the case where the number of nodes is less than nodes per namespace. See https://github.com/kubernetes/perf-tests/issues/887
# 100 nodes per namespace with 30 pods per node gives 3000 pods per namespace.
{{$NODES_PER_NAMESPACE := MinInt .Nodes (DefaultParam .NODES_PER_NAMESPACE 100)}}
{{$PODS_PER_NODE := DefaultParam .PODS_PER_NODE 30}}
{{$DENSITY_TEST_THROUGHPUT := DefaultParam .DENSITY_TEST_THROUGHPUT 20}}
{{$SCHEDULER_THROUGHPUT_THRESHOLD := DefaultParam .CL2_SCHEDULER_THROUGHPUT_THRESHOLD 0}}
# LATENCY_POD_MEMORY and LATENCY_POD_CPU are calculated for 1-core 4GB node.
# Increasing allocation of both memory and cpu by 10%
# decreases the value of priority function in scheduler by one point.
# This results in decreased probability of choosing the same node again.
{{$LATENCY_POD_CPU := DefaultParam .LATENCY_POD_CPU 100}}
{{$LATENCY_POD_MEMORY := DefaultParam .LATENCY_POD_MEMORY 350}}
{{$MIN_LATENCY_PODS := DefaultParam .MIN_LATENCY_PODS 500}}
{{$MIN_SATURATION_PODS_TIMEOUT := 180}}
{{$ENABLE_CHAOSMONKEY := DefaultParam .ENABLE_CHAOSMONKEY false}}
{{$ENABLE_SYSTEM_POD_METRICS:= DefaultParam .ENABLE_SYSTEM_POD_METRICS true}}
{{$ENABLE_CLUSTER_OOMS_TRACKER := DefaultParam .CL2_ENABLE_CLUSTER_OOMS_TRACKER true}}
{{$CLUSTER_OOMS_IGNORED_PROCESSES := DefaultParam .CL2_CLUSTER_OOMS_IGNORED_PROCESSES ""}}
{{$USE_SIMPLE_LATENCY_QUERY := DefaultParam .USE_SIMPLE_LATENCY_QUERY false}}
{{$ENABLE_RESTART_COUNT_CHECK := DefaultParam .ENABLE_RESTART_COUNT_CHECK true}}
{{$RESTART_COUNT_THRESHOLD_OVERRIDES:= DefaultParam .RESTART_COUNT_THRESHOLD_OVERRIDES ""}}
{{$ALLOWED_SLOW_API_CALLS := DefaultParam .CL2_ALLOWED_SLOW_API_CALLS 0}}
{{$ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT := DefaultParam .CL2_ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT true}}
#Variables
{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}
{{$podsPerNamespace := MultiplyInt $PODS_PER_NODE $NODES_PER_NAMESPACE}}
{{$totalPods := MultiplyInt $podsPerNamespace $namespaces}}
{{$latencyReplicas := DivideInt (MaxInt $MIN_LATENCY_PODS .Nodes) $namespaces}}
{{$totalLatencyPods := MultiplyInt $namespaces $latencyReplicas}}
{{$saturationDeploymentTimeout := DivideFloat $totalPods $DENSITY_TEST_THROUGHPUT | AddInt $MIN_SATURATION_PODS_TIMEOUT}}
# saturationDeploymentHardTimeout must be at least 20m to make sure that ~10m node
# failure won't fail the test. See https://github.com/kubernetes/kubernetes/issues/73461#issuecomment-467338711
# Empirically the scheduler sustains roughly 20 pods/s. With 5000 nodes there are
# 150000 pods, which takes about 7500s to schedule, so raise this to 7500+.
{{$saturationDeploymentHardTimeout := MaxInt $saturationDeploymentTimeout 12000}}
{{$saturationDeploymentSpec := DefaultParam .SATURATION_DEPLOYMENT_SPEC "deployment.yaml"}}
{{$latencyDeploymentSpec := DefaultParam .LATENCY_DEPLOYMENT_SPEC "deployment.yaml"}}
# Probe measurements shared parameter
{{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT := DefaultParam .CL2_PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT "15m"}}
name: density
namespace:
  number: {{$namespaces}}
tuningSets:
- name: Uniform5qps
  qpsLoad:
    # Create 5 objects per second; in this test an object is a Deployment.
    qps: 5
# ENABLE_CHAOSMONKEY defaults to false above, i.e. node failures are not simulated.
{{if $ENABLE_CHAOSMONKEY}}
chaosMonkey:
  nodeFailure:
    failureRate: 0.01
    interval: 1m
    jitterFactor: 10.0
    simulatedDowntime: 10m
{{end}}
steps:
- name: Starting measurements
  # Start monitoring API calls.
  measurements:
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  # TODO(oxddr): figure out how many probers to run in function of cluster
  # Per the clusterloader2 source, kubemark clusters do not support
  # InClusterNetworkLatency or DnsLookupLatency, so both are commented out.
  # - Identifier: InClusterNetworkLatency
  #   Method: InClusterNetworkLatency
  #   Params:
  #     action: start
  #     checkProbesReadyTimeout: {{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT}}
  #     replicasPerProbe: {{AddInt 2 (DivideInt .Nodes 100)}}
  # - Identifier: DnsLookupLatency
  #   Method: DnsLookupLatency
  #   Params:
  #     action: start
  #     checkProbesReadyTimeout: {{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT}}
  #     replicasPerProbe: {{AddInt 2 (DivideInt .Nodes 100)}}
  # It is not yet clear what TestMetrics does, so it is commented out for now.
  # - Identifier: TestMetrics
  #   Method: TestMetrics
  #   Params:
  #     action: start
  #     resourceConstraints: {{$DENSITY_RESOURCE_CONSTRAINTS_FILE}}
  #     systemPodMetricsEnabled: {{$ENABLE_SYSTEM_POD_METRICS}}
  #     clusterOOMsTrackerEnabled: {{$ENABLE_CLUSTER_OOMS_TRACKER}}
  #     clusterOOMsIgnoredProcesses: {{$CLUSTER_OOMS_IGNORED_PROCESSES}}
  #     restartCountThresholdOverrides: {{YamlQuote $RESTART_COUNT_THRESHOLD_OVERRIDES 4}}
  #     enableRestartCountCheck: {{$ENABLE_RESTART_COUNT_CHECK}}
- name: Starting saturation pod measurements
  # Start monitoring pod startup latency.
  measurements:
  - Identifier: SaturationPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = saturation
      threshold: {{$saturationDeploymentTimeout}}s
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = saturation
      operationTimeout: {{$saturationDeploymentHardTimeout}}s
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: start
      labelSelector: group = saturation
# Start creating the saturation pods, i.e. 30*N pods where N is the node count.
- name: Creating saturation pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    # Number of objects (here: Deployments) created per namespace.
    replicasPerNamespace: 1
    tuningSet: Uniform5qps
    objectBundle:
    - basename: saturation-deployment
      objectTemplatePath: {{$saturationDeploymentSpec}}
      # These parameters fill in the variables in deployment.yaml. With the
      # variables above, podsPerNamespace is 3000, i.e. each namespace gets
      # one Deployment with 3000 pods.
      templateFillMap:
        Replicas: {{$podsPerNamespace}}
        Group: saturation
        CpuRequest: 1m
        MemoryRequest: 10M
# Wait until all saturation pods are Running.
- name: Waiting for saturation pods to be running
  measurements:
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
- name: Collecting saturation pod measurements
  measurements:
  # Gather the saturation pods' startup latency.
  - Identifier: SaturationPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
  # Gather the saturation pods' scheduling throughput (pods scheduled per second).
  # If it falls below threshold, this measurement fails; the threshold defaults
  # to 0 above, so it cannot fail here.
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: gather
      enableViolations: {{$ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT}}
      threshold: {{$SCHEDULER_THROUGHPUT_THRESHOLD}}
# After the 30*N saturation pods, create a further batch of latency pods (at least
# MIN_LATENCY_PODS=500; the exact count is derived from the parameters above) to
# check whether pods can still be scheduled normally once the cluster is "saturated".
# Start monitoring the latency pods' startup latency.
- name: Starting latency pod measurements
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = latency
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = latency
      operationTimeout: 15m
# Create the latency pods.
- name: Creating latency pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: {{$latencyReplicas}}
    tuningSet: Uniform5qps
    objectBundle:
    - basename: latency-deployment
      objectTemplatePath: {{$latencyDeploymentSpec}}
      templateFillMap:
        Replicas: 1
        Group: latency
        CpuRequest: {{$LATENCY_POD_CPU}}m
        MemoryRequest: {{$LATENCY_POD_MEMORY}}M
# Wait until the latency pods are Running.
- name: Waiting for latency pods to be running
  measurements:
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
# Delete the latency pods.
- name: Deleting latency pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 0
    tuningSet: Uniform5qps
    objectBundle:
    - basename: latency-deployment
      objectTemplatePath: {{$latencyDeploymentSpec}}
# Wait until latency pod deletion completes.
- name: Waiting for latency pods to be deleted
  measurements:
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
# Collect the latency pods' startup latency.
- name: Collecting pod startup latency
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
# Delete the saturation pods.
- name: Deleting saturation pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 0
    tuningSet: Uniform5qps
    objectBundle:
    - basename: saturation-deployment
      objectTemplatePath: {{$saturationDeploymentSpec}}
# Wait until saturation pod deletion completes.
- name: Waiting for saturation pods to be deleted
  measurements:
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
- name: Collecting measurements
  measurements:
  # APIResponsivenessPrometheusSimple measures API call latency from
  # histogram-type metrics.
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
      enableViolations: true
      useSimpleLatencyQuery: true
      summaryName: APIResponsivenessPrometheus_simple
      allowedSlowCalls: {{$ALLOWED_SLOW_API_CALLS}}
  # APIResponsivenessPrometheus measures API call latency from summary-type
  # metrics, which are more accurate and normally taken as authoritative.
{{if not $USE_SIMPLE_LATENCY_QUERY}}
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
      allowedSlowCalls: {{$ALLOWED_SLOW_API_CALLS}}
{{end}}
  # These three are likewise commented out (see the note above).
  # - Identifier: InClusterNetworkLatency
  #   Method: InClusterNetworkLatency
  #   Params:
  #     action: gather
  # - Identifier: DnsLookupLatency
  #   Method: DnsLookupLatency
  #   Params:
  #     action: gather
  # - Identifier: TestMetrics
  #   Method: TestMetrics
  #   Params:
  #     action: gather
  #     systemPodMetricsEnabled: {{$ENABLE_SYSTEM_POD_METRICS}}
  #     clusterOOMsTrackerEnabled: {{$ENABLE_CLUSTER_OOMS_TRACKER}}
  #     restartCountThresholdOverrides: {{YamlQuote $RESTART_COUNT_THRESHOLD_OVERRIDES 4}}
  #     enableRestartCountCheck: {{$ENABLE_RESTART_COUNT_CHECK}}
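As a sanity check on these overrides, the sizing arithmetic encoded in the Constants/Variables sections can be reproduced in shell (a sketch; the values assume a 5000-node kubemark cluster with the defaults above):
#!/bin/bash
# Reproduce the config's sizing math (assumed: 5000 nodes, default parameters).
NODES=5000
PODS_PER_NODE=30
NODES_PER_NAMESPACE=100
DENSITY_TEST_THROUGHPUT=20
MIN_SATURATION_PODS_TIMEOUT=180
NAMESPACES=$((NODES / NODES_PER_NAMESPACE))                     # 50
PODS_PER_NS=$((PODS_PER_NODE * NODES_PER_NAMESPACE))            # 3000
TOTAL_PODS=$((PODS_PER_NS * NAMESPACES))                        # 150000
TIMEOUT=$((TOTAL_PODS / DENSITY_TEST_THROUGHPUT + MIN_SATURATION_PODS_TIMEOUT))  # 7680
echo "namespaces=$NAMESPACES totalPods=$TOTAL_PODS saturationTimeout=${TIMEOUT}s"
The 7680s result is why, per the comment above, the hard timeout is pinned at MaxInt(..., 12000) instead of a smaller default.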
Modifying deployment.yaml is comparatively simple: only the image used by the test pods needs to be changed, as sketched below. Once both files are updated, the test can be run.
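For reference, a minimal sketch of what such a template looks like (a hypothetical reconstruction: the parameter names match the templateFillMap keys in config.yaml above, but the stock file in your checked-out branch may differ; the image: line is the one to repoint at a registry the cluster under test can actually pull from):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{.Name}}
  labels:
    group: {{.Group}}
spec:
  replicas: {{.Replicas}}
  selector:
    matchLabels:
      name: {{.Name}}
  template:
    metadata:
      labels:
        name: {{.Name}}
        group: {{.Group}}
    spec:
      containers:
      - name: {{.Name}}
        image: k8s.gcr.io/pause:3.1   # replace with an image reachable from the cluster under test
        resources:
          requests:
            cpu: {{.CpuRequest}}
            memory: {{.MemoryRequest}}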
Copy the kubeconfig of the kubemark cluster (the environment under test) from /root/.kube/config on its master into the current working directory. Note that passwordless SSH must be set up from the machine running the test to the master of the cluster under test.
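For example (a sketch; the master address reuses the one from the e2e example above, and the root user is an assumption):
scp root@192.168.0.16:/root/.kube/config ./kubemark-kubeconfig   # kubeconfig of the cluster under test
ssh-copy-id root@192.168.0.16                                    # passwordless SSH to the master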
The test case is then run as follows. Change into the directory containing the clusterloader binary (the config file, config.yaml, also needs to be copied there):
cd $GOPATH/src/k8s.io/perf-tests/clusterloader2
./clusterloader --testconfig=config.yaml --provider=kubemark --provider-configs=ROOT_KUBECONFIG=./kubemark-kubeconfig \
  --kubeconfig=./kubemark-kubeconfig --v=2 --enable-exec-service=false \
  --enable-prometheus-server=true --tear-down-prometheus-server=false \
  --prometheus-manifest-path /home/docker/clusterloader2/prometheus/manifest --nodes=500 2>&1 | tee output.txt
3. Analyzing the results
The command above produces output like the logs below; a few points deserve attention. When the run completes, detailed data for every metric is printed. Searching for the keyword SchedulingThroughput turns up the scheduling throughput shown below (it corresponds to the measurement with Identifier SchedulingThroughput in config.yaml):
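For example, the relevant blocks can be pulled straight out of the tee'd output.txt (a sketch):
grep -A 6 'SchedulingThroughput: {' output.txt
grep -B 8 '"Metric": "pod_startup"' output.txt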
I0816 20:26:20.733126 14229 simple_test_executor.go:83] SchedulingThroughput: {
"perc50": 20,
"perc90": 20,
"perc99": 20.2,
"max": 24
}
Searching for the keyword pod_startup locates the startup latencies reported as PodStartupLatency_SaturationPodStartupLatency and as StatelessPodStartupLatency_SaturationPodStartupLatency. Since every pod created here is stateless, those two sets of numbers are identical. A StatefulPodStartupLatency_SaturationPodStartupLatency entry also appears, but as no stateful pods exist its values are all 0. These data correspond to the measurement with Identifier SaturationPodStartupLatency in config.yaml:
I0816 20:26:20.733129 14229 simple_test_executor.go:83] PodStartupLatency_SaturationPodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1323.250478,
"Perc90": 1878.221624,
"Perc99": 2184.178124
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
I0816 20:26:20.733137 14229 simple_test_executor.go:83] StatelessPodStartupLatency_SaturationPodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1323.250478,
"Perc90": 1878.221624,
"Perc99": 2184.178124
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
I0816 20:26:20.733142 14229 simple_test_executor.go:83] StatefulPodStartupLatency_SaturationPodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 0,
"Perc90": 0,
"Perc99": 0
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
The latency pods' startup latency corresponds to the measurement with Identifier PodStartupLatency in config.yaml:
I0816 20:26:20.733148 14229 simple_test_executor.go:83] PodStartupLatency_PodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1350.344608,
"Perc90": 1943.066452,
"Perc99": 2169.727106
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
}
...
I0816 20:26:20.733152 14229 simple_test_executor.go:83] StatelessPodStartupLatency_PodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1350.344608,
"Perc90": 1943.066452,
"Perc99": 2169.727106
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
I0816 20:26:20.733156 14229 simple_test_executor.go:83] StatefulPodStartupLatency_PodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 0,
"Perc90": 0,
"Perc99": 0
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
The API call latency results contain one entry per (resource, verb) pair; they correspond to the measurement with Identifier APIResponsivenessPrometheus in config.yaml:
I0816 20:26:20.733160 14229 simple_test_executor.go:83] APIResponsivenessPrometheus: {
"version": "v1",
"dataItems": [
{
"data": {
"Perc50": 500,
"Perc90": 580,
"Perc99": 598
},
"unit": "ms",
"labels": {
"Count": "3",
"Resource": "events",
"Scope": "cluster",
"SlowCount": "0",
"Subresource": "",
"Verb": "LIST"
}
},
{
"data": {
"Perc50": 26.017594,
"Perc90": 46.83167,
"Perc99": 167.899999
},
"unit": "ms",
"labels": {
"Count": "1008",
"Resource": "pods",
"Scope": "namespace",
"SlowCount": "0",
"Subresource": "",
"Verb": "LIST"
}
},
...