Akka Cluster heartbeat delays on Kubernetes


Question

Our Scala application (Kubernetes deployment) constantly experiences Akka Cluster heartbeat delays of ≈3s.

Once we even had a 200s delay, which also manifested itself in the following graph:

Can someone suggest things to investigate further?

• Kubernetes 1.12.5 (requests.cpu = 16, limits.cpu not set)
• Scala 2.12.7
• Java 11.0.4+11
• JVM flags:

      -XX:+UseG1GC
      -XX:MaxGCPauseMillis=200
      -XX:+AlwaysPreTouch
      -Xlog:gc*,safepoint,gc+ergo*=trace,gc+age=trace:file=/data/gc.log:time,level,tags:filecount=4,filesize=256M
      -XX:+PerfDisableSharedMem

• Akka Cluster 2.5.25
• Some examples:

      timestamp    delay_ms
      06:24:55.743 2693
      06:30:01.424 3390
      07:31:07.495 2487
      07:36:12.775 3758

There were 4 suspicious time points where lots of Java Thread Park events were registered simultaneously for Akka threads (actors & remoting), and all of them correlate with heartbeat issues:

Around 07:05:39 there were no "heartbeat was delayed" logs, but there was this one:

      07:05:39,673 WARN PhiAccrualFailureDetector heartbeat interval is growing too large for address SOME_IP: 3664 millis
      

No correlation with halt events or blocked threads was found during the Java Flight Recording session; only two Safepoint Begin events were in proximity to the delays:
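Since the `-Xlog` flag above already writes safepoint summaries to /data/gc.log, one quick cross-check is to pull the stop-the-world pause durations out of that file and see whether any line up with the 2-4s delays. A minimal sketch, assuming JDK 11's unified-logging message format ("Total time for which application threads were stopped: N seconds"):

```scala
// Sketch: extract stop-the-world pause durations from a JDK 11 unified
// log (the file written by the -Xlog:...safepoint... flag above).
// Assumes safepoint summary lines of the form:
//   "[...][info][safepoint] Total time for which application threads
//    were stopped: 0.0034567 seconds, Stopping threads took: ..."
import scala.io.Source

object SafepointPauses {
  private val Stopped = raw"stopped: ([0-9.]+) seconds".r.unanchored

  // Pause durations in seconds, one per safepoint summary line.
  def pausesSeconds(lines: Iterator[String]): Seq[Double] =
    lines.collect { case Stopped(s) => s.toDouble }.toSeq

  def main(args: Array[String]): Unit = {
    val src = Source.fromFile(args.headOption.getOrElse("/data/gc.log"))
    try {
      // Longest pauses first; anything near 2-4s would be a smoking gun.
      pausesSeconds(src.getLines()).sortBy(-_).take(20).foreach(println)
    } finally src.close()
  }
}
```

If the longest pauses here are all sub-100ms, the JVM's own safepoints are unlikely to explain multi-second heartbeat delays, which points back at the node (CPU scheduling, CFS throttling) instead.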

The application's CPU usage is low, so we thought it could be related to how K8s schedules our application node for CPU. But turning off CPU limits hasn't improved things much, though the kubernetes.cpu.cfs.throttled.second metric disappeared.

Using a separate dispatcher seems to be unnecessary, since delays happen even when there is no load. We also built a stripped-down application, similar to our own, which does nothing but heartbeats, and it still experiences these delays.
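A heartbeats-only reproducer like the one described needs little more than cluster configuration. A minimal application.conf sketch for Akka 2.5 with classic remoting (the system name `hb`, hostname, and port are placeholders, not values from the original setup):

```hocon
# Minimal config for a cluster node that does nothing but heartbeat.
# Akka 2.5 classic remoting assumed; hostname/port are placeholders.
akka {
  actor.provider = cluster
  remote.netty.tcp {
    hostname = "127.0.0.1"
    port = 2551
  }
  cluster {
    seed-nodes = ["akka.tcp://hb@127.0.0.1:2551"]
  }
}
```

Starting an ActorSystem named `hb` with this config joins the cluster and begins heartbeating with no application load at all, so any delays observed come from the JVM or the node, not the application.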

From our observations, it happens far more frequently on a couple of K8s nodes in a large K8s cluster shared with many other apps, when our application isn't under much load.

A separate dedicated K8s cluster, where our app is load tested, has almost no issues with heartbeat delays.

Answer

Have you been able to rule out garbage collection? In my experience, that's the most common cause of delayed heartbeats in JVM distributed systems (and the CFS quota in a Kubernetes/Mesos environment can make non-stop-the-world GCs effectively stop-the-world, especially if you're not using a fairly recent version of OpenJDK, later than release 212 of JDK 8).

Every thread parking before "Safepoint Begin" does lead me to believe that GC is in fact the culprit. Certain GC operations (e.g. rearranging the heap) require every thread to be at a safepoint, so every so often, when not blocked, threads check whether the JVM wants them to safepoint; if so, the threads park themselves in order to get to a safepoint.

If you've ruled out GC, are you running in a cloud environment (or on VMs where you can't be sure the CPU or network isn't oversubscribed)? The akka-cluster documentation suggests increasing the akka.cluster.failure-detector.threshold value, which defaults to a value suitable for a more controlled LAN/bare-metal environment: 12.0 is recommended for cloud environments. This won't prevent delayed heartbeats, but it will decrease the chance of a spurious downing event caused by a single long heartbeat (and will also delay responses to genuine node-loss events). If you want to tolerate a spike in heartbeat inter-arrival times from 1s to 200s, though, you'll need a really high threshold.
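To see why the threshold trades false positives against detection latency, here is a sketch of the phi calculation, modelled on the logistic approximation to the normal CDF that Akka's PhiAccrualFailureDetector uses (a sketch, not the library's exact code). Phi grows as the silence since the last heartbeat stretches past the learned mean inter-arrival time, and the node is marked unreachable once phi crosses the threshold:

```scala
// Sketch of the phi accrual suspicion level (modelled on Akka's
// PhiAccrualFailureDetector; not its exact code). Heartbeat
// inter-arrival times are modelled as Normal(mean, stdDev), and
// phi = -log10(P(the next heartbeat is still on its way)).
object PhiSketch {
  def phi(timeSinceLastMs: Double, meanMs: Double, stdDevMs: Double): Double = {
    val y = (timeSinceLastMs - meanMs) / stdDevMs
    // Logistic approximation to the normal CDF tail.
    val e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    if (timeSinceLastMs > meanMs) -math.log10(e / (1.0 + e))
    else -math.log10(1.0 - 1.0 / (1.0 + e))
  }
}
```

With a ~1s mean and small variance, a 3s pause pushes phi far past the default threshold of 8.0; raising `akka.cluster.failure-detector.threshold` to 12.0 only moves that crossing point out, which is why a single multi-second heartbeat still logs warnings even when no node is downed.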

