1. Introduction
In the previous section we showed how to use the kubemark tool to simulate a large-scale Kubernetes cluster with a great many nodes (say, 5000). With that cluster in place, how do we actually run tests against it? For this we use clusterloader2.
Note: the older Kubernetes performance tests, i.e. the e2e test scripts, live in the Kubernetes source tree and are fairly simple to run, roughly as follows:
make WHAT="test/e2e/e2e.test"
./e2e.test --kube-master=192.168.0.16 --host=https://192.168.0.16:6443 --ginkgo.focus="\[Performance\]" --provider=local --kubeconfig=kubemark.kubeconfig --num-nodes=10 --v=3 --ginkgo.failFast --e2e-output-dir=. --report-dir=.
Building clusterloader2
Pull the perf-tests project from GitHub; it contains clusterloader2. Clone the perf-tests source into $GOPATH/src/k8s.io/perf-tests and check out the branch matching the Kubernetes cluster under test, then build from the clusterloader2 directory.
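For example (a sketch; the release-1.19 branch here is only an illustration — check out whichever branch matches your cluster's version):
mkdir -p $GOPATH/src/k8s.io && cd $GOPATH/src/k8s.io
git clone https://github.com/kubernetes/perf-tests.git
cd perf-tests
git checkout release-1.19   # assumed example; match the cluster under test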
cd $GOPATH/src/k8s.io/perf-tests/clusterloader2
go build -o clusterloader './cmd/'
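A quick way to confirm the build produced a working binary is to print its flags:
./clusterloader --help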
2. Running the test cases
Clusterloader2 ships two main test cases:
(1) Density test: measures performance indicators at a given node and pod scale. The rough idea: in a cluster of N nodes, create 30*N pods back to back, then delete them all, tracking throughout whether the three SLOs introduced earlier are met. On a 5000-node kubemark cluster, for example, that means 150000 saturation pods.
(2) Load test: issues a large volume of creates, deletes, LISTs and other operations against many resource types, again tracking whether the three SLOs hold.
Taking the density test as an example: before running it, adjust the variable parameters in the test configuration to match your scenario. Two files are involved, config.yaml and deployment.yaml, both found under perf-tests/clusterloader2/testing/density.
The config file can be modified along the following lines:
# ASSUMPTIONS:
# - Underlying cluster should have 100+ nodes.
# - Number of nodes should be divisible by NODES_PER_NAMESPACE (default 100).
#Constants
{{$DENSITY_RESOURCE_CONSTRAINTS_FILE := DefaultParam .DENSITY_RESOURCE_CONSTRAINTS_FILE ""}}
# Cater for the case where the number of nodes is less than nodes per namespace. See https://github.com/kubernetes/perf-tests/issues/887
# 100 nodes per namespace with 30 pods per node gives 3000 pods per namespace.
{{$NODES_PER_NAMESPACE := MinInt .Nodes (DefaultParam .NODES_PER_NAMESPACE 100)}}
{{$PODS_PER_NODE := DefaultParam .PODS_PER_NODE 30}}
{{$DENSITY_TEST_THROUGHPUT := DefaultParam .DENSITY_TEST_THROUGHPUT 20}}
{{$SCHEDULER_THROUGHPUT_THRESHOLD := DefaultParam .CL2_SCHEDULER_THROUGHPUT_THRESHOLD 0}}
# LATENCY_POD_MEMORY and LATENCY_POD_CPU are calculated for 1-core 4GB node.
# Increasing allocation of both memory and cpu by 10%
# decreases the value of priority function in scheduler by one point.
# This results in decreased probability of choosing the same node again.
{{$LATENCY_POD_CPU := DefaultParam .LATENCY_POD_CPU 100}}
{{$LATENCY_POD_MEMORY := DefaultParam .LATENCY_POD_MEMORY 350}}
{{$MIN_LATENCY_PODS := DefaultParam .MIN_LATENCY_PODS 500}}
{{$MIN_SATURATION_PODS_TIMEOUT := 180}}
{{$ENABLE_CHAOSMONKEY := DefaultParam .ENABLE_CHAOSMONKEY false}}
{{$ENABLE_SYSTEM_POD_METRICS:= DefaultParam .ENABLE_SYSTEM_POD_METRICS true}}
{{$ENABLE_CLUSTER_OOMS_TRACKER := DefaultParam .CL2_ENABLE_CLUSTER_OOMS_TRACKER true}}
{{$CLUSTER_OOMS_IGNORED_PROCESSES := DefaultParam .CL2_CLUSTER_OOMS_IGNORED_PROCESSES ""}}
{{$USE_SIMPLE_LATENCY_QUERY := DefaultParam .USE_SIMPLE_LATENCY_QUERY false}}
{{$ENABLE_RESTART_COUNT_CHECK := DefaultParam .ENABLE_RESTART_COUNT_CHECK true}}
{{$RESTART_COUNT_THRESHOLD_OVERRIDES:= DefaultParam .RESTART_COUNT_THRESHOLD_OVERRIDES ""}}
{{$ALLOWED_SLOW_API_CALLS := DefaultParam .CL2_ALLOWED_SLOW_API_CALLS 0}}
{{$ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT := DefaultParam .CL2_ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT true}}
#Variables
{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}
{{$podsPerNamespace := MultiplyInt $PODS_PER_NODE $NODES_PER_NAMESPACE}}
{{$totalPods := MultiplyInt $podsPerNamespace $namespaces}}
{{$latencyReplicas := DivideInt (MaxInt $MIN_LATENCY_PODS .Nodes) $namespaces}}
{{$totalLatencyPods := MultiplyInt $namespaces $latencyReplicas}}
{{$saturationDeploymentTimeout := DivideFloat $totalPods $DENSITY_TEST_THROUGHPUT | AddInt $MIN_SATURATION_PODS_TIMEOUT}}
# saturationDeploymentHardTimeout must be at least 20m to make sure that ~10m node
# failure won't fail the test. See https://github.com/kubernetes/kubernetes/issues/73461#issuecomment-467338711
# Empirically the scheduler sustains roughly 20 pods/s. With 5000 nodes there are
# 150000 pods, which takes about 7500s to schedule, so raise this to 7500+.
{{$saturationDeploymentHardTimeout := MaxInt $saturationDeploymentTimeout 12000}}
{{$saturationDeploymentSpec := DefaultParam .SATURATION_DEPLOYMENT_SPEC "deployment.yaml"}}
{{$latencyDeploymentSpec := DefaultParam .LATENCY_DEPLOYMENT_SPEC "deployment.yaml"}}
# Probe measurements shared parameter
{{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT := DefaultParam .CL2_PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT "15m"}}
name: density
namespace:
  number: {{$namespaces}}
tuningSets:
- name: Uniform5qps
  qpsLoad:
    # Create 5 objects per second; in this test an object is a Deployment.
    qps: 5
# ENABLE_CHAOSMONKEY defaults to false above, i.e. node failures are not simulated.
{{if $ENABLE_CHAOSMONKEY}}
chaosMonkey:
  nodeFailure:
    failureRate: 0.01
    interval: 1m
    jitterFactor: 10.0
    simulatedDowntime: 10m
{{end}}
steps:
- name: Starting measurements
  # Start monitoring API calls.
  measurements:
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  # TODO(oxddr): figure out how many probers to run in function of cluster
  # Per the clusterloader2 source, kubemark clusters do not support
  # InClusterNetworkLatency or DnsLookupLatency, so both are commented out.
  # - Identifier: InClusterNetworkLatency
  #   Method: InClusterNetworkLatency
  #   Params:
  #     action: start
  #     checkProbesReadyTimeout: {{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT}}
  #     replicasPerProbe: {{AddInt 2 (DivideInt .Nodes 100)}}
  # - Identifier: DnsLookupLatency
  #   Method: DnsLookupLatency
  #   Params:
  #     action: start
  #     checkProbesReadyTimeout: {{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT}}
  #     replicasPerProbe: {{AddInt 2 (DivideInt .Nodes 100)}}
  # It is not yet clear what TestMetrics does, so it is commented out for now.
  # - Identifier: TestMetrics
  #   Method: TestMetrics
  #   Params:
  #     action: start
  #     resourceConstraints: {{$DENSITY_RESOURCE_CONSTRAINTS_FILE}}
  #     systemPodMetricsEnabled: {{$ENABLE_SYSTEM_POD_METRICS}}
  #     clusterOOMsTrackerEnabled: {{$ENABLE_CLUSTER_OOMS_TRACKER}}
  #     clusterOOMsIgnoredProcesses: {{$CLUSTER_OOMS_IGNORED_PROCESSES}}
  #     restartCountThresholdOverrides: {{YamlQuote $RESTART_COUNT_THRESHOLD_OVERRIDES 4}}
  #     enableRestartCountCheck: {{$ENABLE_RESTART_COUNT_CHECK}}
- name: Starting saturation pod measurements
  # Start monitoring pod startup latency.
  measurements:
  - Identifier: SaturationPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = saturation
      threshold: {{$saturationDeploymentTimeout}}s
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = saturation
      operationTimeout: {{$saturationDeploymentHardTimeout}}s
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: start
      labelSelector: group = saturation
# Start creating the saturation pods, i.e. 30*N pods where N is the node count.
- name: Creating saturation pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    # Number of objects (here: Deployments) created per namespace.
    replicasPerNamespace: 1
    tuningSet: Uniform5qps
    objectBundle:
    - basename: saturation-deployment
      objectTemplatePath: {{$saturationDeploymentSpec}}
      # These parameters fill in the variables in deployment.yaml. With the
      # variables above, podsPerNamespace is 3000, i.e. each namespace gets
      # one Deployment with 3000 pods.
      templateFillMap:
        Replicas: {{$podsPerNamespace}}
        Group: saturation
        CpuRequest: 1m
        MemoryRequest: 10M
# Wait until all saturation pods are Running.
- name: Waiting for saturation pods to be running
  measurements:
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
- name: Collecting saturation pod measurements
  measurements:
  # Gather the saturation pods' startup latency.
  - Identifier: SaturationPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
  # Gather the saturation pods' scheduling throughput (pods scheduled per second).
  # If it falls below threshold, this measurement fails; the threshold defaults
  # to 0 above, so it cannot fail here.
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: gather
      enableViolations: {{$ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT}}
      threshold: {{$SCHEDULER_THROUGHPUT_THRESHOLD}}
# After the 30*N saturation pods, create a further batch of latency pods (at least
# MIN_LATENCY_PODS=500; the exact count is derived from the parameters above) to
# check whether pods can still be scheduled normally once the cluster is "saturated".
# Start monitoring the latency pods' startup latency.
- name: Starting latency pod measurements
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = latency
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = latency
      operationTimeout: 15m
# Create the latency pods.
- name: Creating latency pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: {{$latencyReplicas}}
    tuningSet: Uniform5qps
    objectBundle:
    - basename: latency-deployment
      objectTemplatePath: {{$latencyDeploymentSpec}}
      templateFillMap:
        Replicas: 1
        Group: latency
        CpuRequest: {{$LATENCY_POD_CPU}}m
        MemoryRequest: {{$LATENCY_POD_MEMORY}}M
# Wait until the latency pods are Running.
- name: Waiting for latency pods to be running
  measurements:
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
# Delete the latency pods.
- name: Deleting latency pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 0
    tuningSet: Uniform5qps
    objectBundle:
    - basename: latency-deployment
      objectTemplatePath: {{$latencyDeploymentSpec}}
# Wait until latency pod deletion completes.
- name: Waiting for latency pods to be deleted
  measurements:
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
# Collect the latency pods' startup latency.
- name: Collecting pod startup latency
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
# Delete the saturation pods.
- name: Deleting saturation pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 0
    tuningSet: Uniform5qps
    objectBundle:
    - basename: saturation-deployment
      objectTemplatePath: {{$saturationDeploymentSpec}}
# Wait until saturation pod deletion completes.
- name: Waiting for saturation pods to be deleted
  measurements:
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
- name: Collecting measurements
  measurements:
  # APIResponsivenessPrometheusSimple measures API call latency from
  # histogram-type metrics.
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
      enableViolations: true
      useSimpleLatencyQuery: true
      summaryName: APIResponsivenessPrometheus_simple
      allowedSlowCalls: {{$ALLOWED_SLOW_API_CALLS}}
  # APIResponsivenessPrometheus measures API call latency from summary-type
  # metrics, which are more accurate and normally taken as authoritative.
{{if not $USE_SIMPLE_LATENCY_QUERY}}
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
      allowedSlowCalls: {{$ALLOWED_SLOW_API_CALLS}}
{{end}}
  # These three are likewise commented out (see the note above).
  # - Identifier: InClusterNetworkLatency
  #   Method: InClusterNetworkLatency
  #   Params:
  #     action: gather
  # - Identifier: DnsLookupLatency
  #   Method: DnsLookupLatency
  #   Params:
  #     action: gather
  # - Identifier: TestMetrics
  #   Method: TestMetrics
  #   Params:
  #     action: gather
  #     systemPodMetricsEnabled: {{$ENABLE_SYSTEM_POD_METRICS}}
  #     clusterOOMsTrackerEnabled: {{$ENABLE_CLUSTER_OOMS_TRACKER}}
  #     restartCountThresholdOverrides: {{YamlQuote $RESTART_COUNT_THRESHOLD_OVERRIDES 4}}
  #     enableRestartCountCheck: {{$ENABLE_RESTART_COUNT_CHECK}}
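As a sanity check on these overrides, the sizing arithmetic encoded in the Constants/Variables sections can be reproduced in shell (a sketch; the values assume a 5000-node kubemark cluster with the defaults above):
#!/bin/bash
# Reproduce the config's sizing math (assumed: 5000 nodes, default parameters).
NODES=5000
PODS_PER_NODE=30
NODES_PER_NAMESPACE=100
DENSITY_TEST_THROUGHPUT=20
MIN_SATURATION_PODS_TIMEOUT=180
NAMESPACES=$((NODES / NODES_PER_NAMESPACE))                     # 50
PODS_PER_NS=$((PODS_PER_NODE * NODES_PER_NAMESPACE))            # 3000
TOTAL_PODS=$((PODS_PER_NS * NAMESPACES))                        # 150000
TIMEOUT=$((TOTAL_PODS / DENSITY_TEST_THROUGHPUT + MIN_SATURATION_PODS_TIMEOUT))  # 7680
echo "namespaces=$NAMESPACES totalPods=$TOTAL_PODS saturationTimeout=${TIMEOUT}s"
The 7680s result is why, per the comment above, the hard timeout is pinned at MaxInt(..., 12000) instead of a smaller default.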
Modifying deployment.yaml is comparatively simple: only the image used by the test pods needs to be changed, as sketched below. Once both files are updated, the test can be run.
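For reference, a minimal sketch of what such a template looks like (a hypothetical reconstruction: the parameter names match the templateFillMap keys in config.yaml above, but the stock file in your checked-out branch may differ; the image: line is the one to repoint at a registry the cluster under test can actually pull from):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{.Name}}
  labels:
    group: {{.Group}}
spec:
  replicas: {{.Replicas}}
  selector:
    matchLabels:
      name: {{.Name}}
  template:
    metadata:
      labels:
        name: {{.Name}}
        group: {{.Group}}
    spec:
      containers:
      - name: {{.Name}}
        image: k8s.gcr.io/pause:3.1   # replace with an image reachable from the cluster under test
        resources:
          requests:
            cpu: {{.CpuRequest}}
            memory: {{.MemoryRequest}}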
Copy the kubeconfig of the kubemark cluster (the environment under test) from /root/.kube/config on its master into the current working directory. Note that passwordless SSH must be set up from the machine running the test to the master of the cluster under test.
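For example (a sketch; the master address reuses the one from the e2e example above, and the root user is an assumption):
scp root@192.168.0.16:/root/.kube/config ./kubemark-kubeconfig   # kubeconfig of the cluster under test
ssh-copy-id root@192.168.0.16                                    # passwordless SSH to the master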
The test case is then run as follows. Change into the directory containing the clusterloader binary (the config file, config.yaml, also needs to be copied there):
cd $GOPATH/src/k8s.io/perf-tests/clusterloader2
./clusterloader --testconfig=config.yaml --provider=kubemark --provider-configs=ROOT_KUBECONFIG=./kubemark-kubeconfig \
  --kubeconfig=./kubemark-kubeconfig --v=2 --enable-exec-service=false \
  --enable-prometheus-server=true --tear-down-prometheus-server=false \
  --prometheus-manifest-path /home/docker/clusterloader2/prometheus/manifest --nodes=500 2>&1 | tee output.txt
3. Analyzing the results
The command above produces output like the logs below; a few points deserve attention. When the run completes, detailed data for every metric is printed. Searching for the keyword SchedulingThroughput turns up the scheduling throughput shown below (it corresponds to the measurement with Identifier SchedulingThroughput in config.yaml):
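For example, the relevant blocks can be pulled straight out of the tee'd output.txt (a sketch):
grep -A 6 'SchedulingThroughput: {' output.txt
grep -B 8 '"Metric": "pod_startup"' output.txt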
I0816 20:26:20.733126 14229 simple_test_executor.go:83] SchedulingThroughput: {
"perc50": 20,
"perc90": 20,
"perc99": 20.2,
"max": 24
}
Searching for the keyword pod_startup locates the startup latencies reported as PodStartupLatency_SaturationPodStartupLatency and as StatelessPodStartupLatency_SaturationPodStartupLatency. Since every pod created here is stateless, those two sets of numbers are identical. A StatefulPodStartupLatency_SaturationPodStartupLatency entry also appears, but as no stateful pods exist its values are all 0. These data correspond to the measurement with Identifier SaturationPodStartupLatency in config.yaml:
I0816 20:26:20.733129 14229 simple_test_executor.go:83] PodStartupLatency_SaturationPodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1323.250478,
"Perc90": 1878.221624,
"Perc99": 2184.178124
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
I0816 20:26:20.733137 14229 simple_test_executor.go:83] StatelessPodStartupLatency_SaturationPodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1323.250478,
"Perc90": 1878.221624,
"Perc99": 2184.178124
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
I0816 20:26:20.733142 14229 simple_test_executor.go:83] StatefulPodStartupLatency_SaturationPodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 0,
"Perc90": 0,
"Perc99": 0
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
The latency pods' startup latency corresponds to the measurement with Identifier PodStartupLatency in config.yaml:
I0816 20:26:20.733148 14229 simple_test_executor.go:83] PodStartupLatency_PodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1350.344608,
"Perc90": 1943.066452,
"Perc99": 2169.727106
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
}
...
I0816 20:26:20.733152 14229 simple_test_executor.go:83] StatelessPodStartupLatency_PodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 1350.344608,
"Perc90": 1943.066452,
"Perc99": 2169.727106
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
I0816 20:26:20.733156 14229 simple_test_executor.go:83] StatefulPodStartupLatency_PodStartupLatency: {
"version": "1.0",
"dataItems": [
...
{
"data": {
"Perc50": 0,
"Perc90": 0,
"Perc99": 0
},
"unit": "ms",
"labels": {
"Metric": "pod_startup"
}
},
...
The API call latency results contain one entry per (resource, verb) pair; they correspond to the measurement with Identifier APIResponsivenessPrometheus in config.yaml:
I0816 20:26:20.733160 14229 simple_test_executor.go:83] APIResponsivenessPrometheus: {
"version": "v1",
"dataItems": [
{
"data": {
"Perc50": 500,
"Perc90": 580,
"Perc99": 598
},
"unit": "ms",
"labels": {
"Count": "3",
"Resource": "events",
"Scope": "cluster",
"SlowCount": "0",
"Subresource": "",
"Verb": "LIST"
}
},
{
"data": {
"Perc50": 26.017594,
"Perc90": 46.83167,
"Perc99": 167.899999
},
"unit": "ms",
"labels": {
"Count": "1008",
"Resource": "pods",
"Scope": "namespace",
"SlowCount": "0",
"Subresource": "",
"Verb": "LIST"
}
},
...