Kubemark performance testing (part 2): running test cases with clusterloader2

2023-06-28 00:50:45

1. Introduction

In the previous article we showed how to use the kubemark tool to simulate a large Kubernetes cluster with many nodes (for example 5000). With that large cluster in place, how do we actually run performance tests against it? This is where clusterloader2 comes in.

 

Note: in older Kubernetes versions the performance tests (the e2e test suite) lived in the Kubernetes source tree and were relatively simple to run, roughly like this:

make WHAT="test/e2e/e2e.test"

./e2e.test --kube-master=192.168.0.16 --host=https://192.168.0.16:6443 --ginkgo.focus="\[Performance\]" --provider=local --kubeconfig=kubemark.kubeconfig --num-nodes=10 --v=3 --ginkgo.failFast --e2e-output-dir=. --report-dir=.

The e2e performance cases have since been moved out of the main repository (https://github.com/kubernetes/kubernetes/pull/83322), so on releases cut after October 2019 the commands above can no longer run the performance tests.

Building clusterloader2

Pull the perf-tests project, which contains clusterloader2, from GitHub and clone the source into $GOPATH/src/k8s.io/perf-tests.

Check out the version that matches the Kubernetes cluster under test, then change into the clusterloader2 directory and build:
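For example, fetching the source could look like the following; the release-1.23 branch name is only illustrative and should be replaced with whatever tag or branch matches your cluster:

git clone https://github.com/kubernetes/perf-tests.git $GOPATH/src/k8s.io/perf-tests
cd $GOPATH/src/k8s.io/perf-tests
git checkout release-1.23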

cd $GOPATH/src/k8s.io/perf-tests/clusterloader2
go build -o clusterloader './cmd/'
 

2. Running the test cases

ClusterLoader2 mainly provides two test cases:

(1) Density test: measures performance at a given node and pod scale. The rough idea is: in a cluster with N nodes, continuously create 30*N pods, then delete them all, and track whether the three SLOs discussed previously are met throughout (the worked numbers after item (2) make the scale concrete).

(2) Load test: issues a large volume of creates, deletes, LISTs, and other operations for various resource types against the Kubernetes API, and again tracks whether the three SLOs are met throughout.
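As a worked example of the density-test scale, using the defaults from the config below: with 500 simulated nodes, 30 pods per node, and 100 nodes per namespace, the test creates 5 namespaces with 3,000 saturation pods each (15,000 in total) plus 500 latency pods (100 per namespace), and at the assumed throughput of about 20 scheduled pods per second the saturation phase alone takes on the order of 15000 / 20 = 750 seconds.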

Taking the density test as an example: before running the command, adjust the variable parameters in the test configuration files to match your scenario. The files involved are config.yaml and deployment.yaml, both located under perf-tests/clusterloader2/testing/density.
The config file can be modified along the following lines:

# ASSUMPTIONS:
# - Underlying cluster should have 100+ nodes.
# - Number of nodes should be divisible by NODES_PER_NAMESPACE (default 100).

#Constants
{{$DENSITY_RESOURCE_CONSTRAINTS_FILE := DefaultParam .DENSITY_RESOURCE_CONSTRAINTS_FILE ""}}
# Cater for the case where the number of nodes is less than nodes per namespace. See https://github.com/kubernetes/perf-tests/issues/887
# 100 nodes per namespace and 30 pods per node, i.e. 3000 pods per namespace
{{$NODES_PER_NAMESPACE := MinInt .Nodes (DefaultParam .NODES_PER_NAMESPACE 100)}}
{{$PODS_PER_NODE := DefaultParam .PODS_PER_NODE 30}}
{{$DENSITY_TEST_THROUGHPUT := DefaultParam .DENSITY_TEST_THROUGHPUT 20}}
{{$SCHEDULER_THROUGHPUT_THRESHOLD := DefaultParam .CL2_SCHEDULER_THROUGHPUT_THRESHOLD 0}}
# LATENCY_POD_MEMORY and LATENCY_POD_CPU are calculated for 1-core 4GB node.
# Increasing allocation of both memory and cpu by 10%
# decreases the value of priority function in scheduler by one point.
# This results in decreased probability of choosing the same node again.
{{$LATENCY_POD_CPU := DefaultParam .LATENCY_POD_CPU 100}}
{{$LATENCY_POD_MEMORY := DefaultParam .LATENCY_POD_MEMORY 350}}
{{$MIN_LATENCY_PODS := DefaultParam .MIN_LATENCY_PODS 500}}
{{$MIN_SATURATION_PODS_TIMEOUT := 180}}
{{$ENABLE_CHAOSMONKEY := DefaultParam .ENABLE_CHAOSMONKEY false}}
{{$ENABLE_SYSTEM_POD_METRICS:= DefaultParam .ENABLE_SYSTEM_POD_METRICS true}}
{{$ENABLE_CLUSTER_OOMS_TRACKER := DefaultParam .CL2_ENABLE_CLUSTER_OOMS_TRACKER true}}
{{$CLUSTER_OOMS_IGNORED_PROCESSES := DefaultParam .CL2_CLUSTER_OOMS_IGNORED_PROCESSES ""}}
{{$USE_SIMPLE_LATENCY_QUERY := DefaultParam .USE_SIMPLE_LATENCY_QUERY false}}
{{$ENABLE_RESTART_COUNT_CHECK := DefaultParam .ENABLE_RESTART_COUNT_CHECK true}}
{{$RESTART_COUNT_THRESHOLD_OVERRIDES:= DefaultParam .RESTART_COUNT_THRESHOLD_OVERRIDES ""}}
{{$ALLOWED_SLOW_API_CALLS := DefaultParam .CL2_ALLOWED_SLOW_API_CALLS 0}}
{{$ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT := DefaultParam .CL2_ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT true}}
#Variables
{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}
{{$podsPerNamespace := MultiplyInt $PODS_PER_NODE $NODES_PER_NAMESPACE}}
{{$totalPods := MultiplyInt $podsPerNamespace $namespaces}}
{{$latencyReplicas := DivideInt (MaxInt $MIN_LATENCY_PODS .Nodes) $namespaces}}
{{$totalLatencyPods := MultiplyInt $namespaces $latencyReplicas}}
{{$saturationDeploymentTimeout := DivideFloat $totalPods $DENSITY_TEST_THROUGHPUT | AddInt $MIN_SATURATION_PODS_TIMEOUT}}
# saturationDeploymentHardTimeout must be at least 20m to make sure that ~10m node
# failure won't fail the test. See https://github.com/kubernetes/kubernetes/issues/73461#issuecomment-467338711
# Empirically about 20 pods can be scheduled per second; with 5000 nodes there are 150000 pods, which takes 7500s to schedule, so raise this to 7500+
{{$saturationDeploymentHardTimeout := MaxInt $saturationDeploymentTimeout 12000}}

{{$saturationDeploymentSpec := DefaultParam .SATURATION_DEPLOYMENT_SPEC "deployment.yaml"}}
{{$latencyDeploymentSpec := DefaultParam .LATENCY_DEPLOYMENT_SPEC "deployment.yaml"}}

# Probe measurements shared parameter
{{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT := DefaultParam .CL2_PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT "15m"}}

name: density
namespace:
  number: {{$namespaces}}
tuningSets:
- name: Uniform5qps
  qpsLoad:
    # Create 5 objects per second; in this article an object is a Deployment
    qps: 5
# ENABLE_CHAOSMONKEY is false above, so node failures are not simulated
{{if $ENABLE_CHAOSMONKEY}}
chaosMonkey:
  nodeFailure:
    failureRate: 0.01
    interval: 1m
    jitterFactor: 10.0
    simulatedDowntime: 10m
{{end}}
steps:
- name: Starting measurements
  # Start monitoring API calls
  measurements:
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  # TODO(oxddr): figure out how many probers to run in function of cluster
  # According to the source code, kubemark clusters do not support InClusterNetworkLatency and DnsLookupLatency, so comment them out
  # - Identifier: InClusterNetworkLatency
  #   Method: InClusterNetworkLatency
  #   Params:
  #     action: start
  #     checkProbesReadyTimeout: {{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT}}
  #     replicasPerProbe: {{AddInt 2 (DivideInt .Nodes 100)}}
  # - Identifier: DnsLookupLatency
  #   Method: DnsLookupLatency
  #   Params:
  #     action: start
  #     checkProbesReadyTimeout: {{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT}}
  #     replicasPerProbe: {{AddInt 2 (DivideInt .Nodes 100)}}
       
  # It is not yet clear what TestMetrics is for, so comment it out for now
  # - Identifier: TestMetrics
  #   Method: TestMetrics
  #   Params:
  #     action: start
  #     resourceConstraints: {{$DENSITY_RESOURCE_CONSTRAINTS_FILE}}
  #     systemPodMetricsEnabled: {{$ENABLE_SYSTEM_POD_METRICS}}
  #     clusterOOMsTrackerEnabled: {{$ENABLE_CLUSTER_OOMS_TRACKER}}
  #     clusterOOMsIgnoredProcesses: {{$CLUSTER_OOMS_IGNORED_PROCESSES}}
  #     restartCountThresholdOverrides: {{YamlQuote $RESTART_COUNT_THRESHOLD_OVERRIDES 4}}
  #     enableRestartCountCheck: {{$ENABLE_RESTART_COUNT_CHECK}}

- name: Starting saturation pod measurements
  # Start monitoring pod startup latency
  measurements:
  - Identifier: SaturationPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = saturation
      threshold: {{$saturationDeploymentTimeout}}s
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = saturation
      operationTimeout: {{$saturationDeploymentHardTimeout}}s
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: start
      labelSelector: group = saturation

# Start creating the saturation pods, i.e. 30*N pods where N is the number of nodes
- name: Creating saturation pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    # How many objects (Deployments) to create per namespace
    replicasPerNamespace: 1
    tuningSet: Uniform5qps
    objectBundle:
    - basename: saturation-deployment
      objectTemplatePath: {{$saturationDeploymentSpec}}
      # The parameters below fill in the variables in deployment.yaml; per the variables above, podsPerNamespace is 3000, i.e. each namespace gets one Deployment with 3000 pods
      templateFillMap:
        Replicas: {{$podsPerNamespace}}
        Group: saturation
        CpuRequest: 1m
        MemoryRequest: 10M

# Wait until all saturation pods are Running
- name: Waiting for saturation pods to be running
  measurements:
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather

- name: Collecting saturation pod measurements
  measurements:
  # Collect the saturation pod startup latency
  - Identifier: SaturationPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
  # Collect the saturation pod scheduling throughput (pods scheduled per second); if it falls below threshold the measurement fails. The default threshold above is 0, so it never fails
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: gather
      enableViolations: {{$ENABLE_VIOLATIONS_FOR_SCHEDULING_THROUGHPUT}}
      threshold: {{$SCHEDULER_THROUGHPUT_THRESHOLD}}

# After the 30*N pods are created, create another 500 latency pods (the count is set by the parameters above) to check whether pods can still be scheduled once the cluster is "saturated"
# Start monitoring latency pod startup latency
- name: Starting latency pod measurements
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = latency
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = latency
      operationTimeout: 15m

# Create the latency pods
- name: Creating latency pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: {{$latencyReplicas}}
    tuningSet: Uniform5qps
    objectBundle:
    - basename: latency-deployment
      objectTemplatePath: {{$latencyDeploymentSpec}}
      templateFillMap:
        Replicas: 1
        Group: latency
        CpuRequest: {{$LATENCY_POD_CPU}}m
        MemoryRequest: {{$LATENCY_POD_MEMORY}}M

# Wait until the latency pods are Running
- name: Waiting for latency pods to be running
  measurements:
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather

# Delete the latency pods
- name: Deleting latency pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 0
    tuningSet: Uniform5qps
    objectBundle:
    - basename: latency-deployment
      objectTemplatePath: {{$latencyDeploymentSpec}}

# Wait until the latency pods are deleted
- name: Waiting for latency pods to be deleted
  measurements:
  - Identifier: WaitForRunningLatencyDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather

# Collect the latency pod startup latency
- name: Collecting pod startup latency
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather

# Delete the saturation pods
- name: Deleting saturation pods
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: 0
    tuningSet: Uniform5qps
    objectBundle:
    - basename: saturation-deployment
      objectTemplatePath: {{$saturationDeploymentSpec}}

# Wait until the saturation pods are deleted
- name: Waiting for saturation pods to be deleted
  measurements:
  - Identifier: WaitForRunningSaturationDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather

- name: Collecting measurements
  measurements:
  # APIResponsivenessPrometheusSimple measures API call latency using histogram metrics
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
      enableViolations: true
      useSimpleLatencyQuery: true
      summaryName: APIResponsivenessPrometheus_simple
      allowedSlowCalls: {{$ALLOWED_SLOW_API_CALLS}}
  # APIResponsivenessPrometheus measures API call latency using summary metrics, which are more accurate and usually taken as the reference
  {{if not $USE_SIMPLE_LATENCY_QUERY}}
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: gather
      allowedSlowCalls: {{$ALLOWED_SLOW_API_CALLS}}
  {{end}}
  # Comment out these three as well
  # - Identifier: InClusterNetworkLatency
  #   Method: InClusterNetworkLatency
  #   Params:
  #     action: gather
  # - Identifier: DnsLookupLatency
  #   Method: DnsLookupLatency
  #   Params:
  #     action: gather
  # - Identifier: TestMetrics
  #   Method: TestMetrics
  #   Params:
  #     action: gather
  #     systemPodMetricsEnabled: {{$ENABLE_SYSTEM_POD_METRICS}}
  #     clusterOOMsTrackerEnabled: {{$ENABLE_CLUSTER_OOMS_TRACKER}}
  #     restartCountThresholdOverrides: {{YamlQuote $RESTART_COUNT_THRESHOLD_OVERRIDES 4}}
  #     enableRestartCountCheck: {{$ENABLE_RESTART_COUNT_CHECK}}

The changes to deployment.yaml are much simpler: only the container image used by the test needs to be changed (see the sketch below). Once both files have been modified, the test can be run.
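For orientation, the deployment template has roughly the following shape; the field names mirror the templateFillMap entries in config.yaml above, but the authoritative file ships with perf-tests, so treat this as an illustrative sketch and simply point the image at something reachable from your environment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{.Name}}
  labels:
    group: {{.Group}}
spec:
  replicas: {{.Replicas}}
  selector:
    matchLabels:
      name: {{.Name}}
  template:
    metadata:
      labels:
        name: {{.Name}}
        group: {{.Group}}
    spec:
      containers:
      - name: {{.Name}}
        image: registry.example.com/pause:3.1   # change this to an image reachable from your environment
        resources:
          requests:
            cpu: {{.CpuRequest}}
            memory: {{.MemoryRequest}}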

Copy the kubeconfig of the kubemark cluster (the environment under test, typically /root/.kube/config on its master) into the current working directory. Note that passwordless SSH must be configured from the machine running the test to the master node of the environment under test.
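Assuming the master's address is 192.168.0.16 (as in the e2e example earlier) and the kubeconfig is saved locally as kubemark-kubeconfig, the preparation might look like this:

scp root@192.168.0.16:/root/.kube/config ./kubemark-kubeconfig
ssh-copy-id root@192.168.0.16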

The command to run the test case is as follows:

# Change into the directory containing the clusterloader binary; config.yaml also needs to be copied there
cd $GOPATH/src/k8s.io/perf-tests/clusterloader2
./clusterloader --testconfig=config.yaml --provider=kubemark --provider-configs=ROOT_KUBECONFIG=./kubemark-kubeconfig \
  --kubeconfig=./kubemark-kubeconfig --v=2 --enable-exec-service=false \
  --enable-prometheus-server=true --tear-down-prometheus-server=false \
  --prometheus-manifest-path /home/docker/clusterloader2/prometheus/manifest --nodes=500 2>&1 | tee output.txt
Prometheus monitoring is enabled here. If it is not needed, set --enable-prometheus-server=false; measurements that depend on Prometheus will then simply not be executed.
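For instance, a minimal sketch of a run with the Prometheus-backed measurements disabled, keeping the other flags from the full command above, would be:

./clusterloader --testconfig=config.yaml --provider=kubemark --provider-configs=ROOT_KUBECONFIG=./kubemark-kubeconfig \
  --kubeconfig=./kubemark-kubeconfig --v=2 --enable-exec-service=false \
  --enable-prometheus-server=false --nodes=500 2>&1 | tee output.txt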
 

3. Analyzing the results

The command above produces logs like the ones below, and a few points in them are worth noting. Once the run finishes, detailed metric data is printed for each measurement. Searching for the keyword SchedulingThroughput locates the scheduling throughput shown below, which corresponds to the measurement with Identifier SchedulingThroughput in config.yaml:

I0816 20:26:20.733126   14229 simple_test_executor.go:83] SchedulingThroughput: {
  "perc50": 20,
  "perc90": 20,
  "perc99": 20.2,
  "max": 24
}


Searching for the keyword pod_startup locates the startup latency of PodStartupLatency_SaturationPodStartupLatency as well as StatelessPodStartupLatency_SaturationPodStartupLatency. Since all pods created here are stateless, these two sets of numbers are identical. There is also StatefulPodStartupLatency_SaturationPodStartupLatency, which is all zeros because no stateful pods were created. These entries correspond to the measurement with Identifier SaturationPodStartupLatency in config.yaml:

I0816 20:26:20.733129   14229 simple_test_executor.go:83] PodStartupLatency_SaturationPodStartupLatency: {
  "version": "1.0",
  "dataItems": [
  ...
    {
      "data": {
        "Perc50": 1323.250478,
        "Perc90": 1878.221624,
        "Perc99": 2184.178124
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
...
I0816 20:26:20.733137   14229 simple_test_executor.go:83] StatelessPodStartupLatency_SaturationPodStartupLatency: {
  "version": "1.0",
  "dataItems": [
...
   {
      "data": {
        "Perc50": 1323.250478,
        "Perc90": 1878.221624,
        "Perc99": 2184.178124
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
...
I0816 20:26:20.733142   14229 simple_test_executor.go:83] StatefulPodStartupLatency_SaturationPodStartupLatency: {
  "version": "1.0",
  "dataItems": [
...
    {
      "data": {
        "Perc50": 0,
        "Perc90": 0,
        "Perc99": 0
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
...

 

 

The latency pod startup latency corresponds to the measurement with Identifier PodStartupLatency in config.yaml:

I0816 20:26:20.733148   14229 simple_test_executor.go:83] PodStartupLatency_PodStartupLatency: {
  "version": "1.0",
  "dataItems": [
...
{ 
      "data": {
        "Perc50": 1350.344608,
        "Perc90": 1943.066452,
        "Perc99": 2169.727106
      },
      "unit": "ms",
      "labels": { 
        "Metric": "pod_startup"
      }
    }
...
I0816 20:26:20.733152   14229 simple_test_executor.go:83] StatelessPodStartupLatency_PodStartupLatency: {
  "version": "1.0",
  "dataItems": [
...
    {
      "data": {
        "Perc50": 1350.344608,
        "Perc90": 1943.066452,
        "Perc99": 2169.727106
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
...
I0816 20:26:20.733156   14229 simple_test_executor.go:83] StatefulPodStartupLatency_PodStartupLatency: {
  "version": "1.0",
  "dataItems": [
...
{
      "data": {
        "Perc50": 0,
        "Perc90": 0,
        "Perc99": 0
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
...

The API call latency results contain one entry per (resource, verb) pair and correspond to the measurement with Identifier APIResponsivenessPrometheus in config.yaml:

I0816 20:26:20.733160   14229 simple_test_executor.go:83] APIResponsivenessPrometheus: {
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "Perc50": 500,
        "Perc90": 580,
        "Perc99": 598
      },
      "unit": "ms",
      "labels": {
        "Count": "3",
        "Resource": "events",
        "Scope": "cluster",
        "SlowCount": "0",
        "Subresource": "",
        "Verb": "LIST"
      }
    },
    {
      "data": {
        "Perc50": 26.017594,
        "Perc90": 46.83167,
        "Perc99": 167.899999
      },
      "unit": "ms",
      "labels": {
        "Count": "1008",
        "Resource": "pods",
        "Scope": "namespace",
        "SlowCount": "0",
        "Subresource": "",
        "Verb": "LIST"
      }
    },
...