NPM test getting mysteriously stuck when run in Jenkins


Question


We have two TypeScript apps, both created through CRA, and a CI pipeline which runs a series of npm commands to run tests/lint and build the apps for later stages:

time npm install --no-optional --unsafe-perm
npm test -- --coverage

npm run tsc
npm run lint

export REACT_APP_VERSION=$VERSION
export REACT_APP_COMMIT=$GIT_COMMIT

npm run build
npm run build-storybook

Our CI pipeline runs in Jenkins, and we're using the kubernetes plugin in order to get executors on-demand. The script is run in parallel for app1 and app2 via the following logic in our Jenkinsfile:

stage('Frontend - App1') {
    agent {
        kubernetes {
            label 'node'
            defaultContainer 'jnlp'
            yamlFile 'infrastructure/scripts/ci/pod-templates/node.yaml'
            idleMinutes 30
        }
    }
    environment {
        CI = 'true'
        NPMRC_SECRET_FILE_PATH = credentials('verdaccio-npmrc')
    }
    steps {
        dir('frontend/app1') {
            container('node') {
                sh 'cp $NPMRC_SECRET_FILE_PATH ~/.npmrc'
                sh 'chmod u+rw ~/.npmrc'
                sh '../../infrastructure/scripts/ci/build-frontend.sh'
            }
            publishHTML(target: [
                    allowMissing         : false,
                    alwaysLinkToLastBuild: false,
                    keepAll              : true,
                    reportDir            : 'coverage',
                    reportFiles          : 'index.html',
                    reportName           : "Coverage Report (app1)"
            ])
            junit 'testing/junit.xml'
            stash includes: 'build/**/*', name: 'app1-build'
            stash includes: 'storybook-static/**/*', name: 'app1-storybook-build'
        }
    }
}

So, onto what we're seeing. Repeatedly yesterday we saw the same symptoms: the frontend stage for app1 would complete (the smaller of the two), whilst app2 would mysteriously stop in the middle of running tests (the last line of logging in Jenkins was always PASS src/x/y/file.test.ts, but not always the same test). It would remain in this state for a full hour before getting killed by our pipeline timeout (or a bored developer).

We ran kubectl exec -it node-blah sh to get onto the pod that was running the stuck stage and get some diagnostics. Running ps aux | cat gives us this:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
node           1  0.0  0.0   4396   720 ?        Ss+  08:51   0:00 cat
node          17  0.0  0.0      0     0 ?        Z    08:51   0:00 [sh] <defunct>
node          32  0.0  0.0      0     0 ?        Z    08:51   0:00 [sh] <defunct>
node          47  0.0  0.0      0     0 ?        Z    08:51   0:00 [sh] <defunct>
node         664  0.0  0.0      0     0 ?        Z    09:04   0:00 [sh] <defunct>
.
.
.
node        6760  0.0  0.0   4340   108 ?        S    10:36   0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node        6761  0.0  0.0   4340  1060 ?        S    10:36   0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node        6762  0.0  0.0   4340   812 ?        S    10:36   0:00 sh -xe /home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh
node        6764  0.0  0.0  20096  2900 ?        S    10:36   0:00 /bin/bash ../../infrastructure/scripts/ci/build-frontend.sh
node        6804  0.0  0.5 984620 38552 ?        Sl   10:37   0:00 npm                                       
node        6816  0.0  0.0   4356   836 ?        S    10:37   0:00 sh -c react-app-rewired test --reporters default --reporters jest-junit "--coverage"
node        6817  0.0  0.4 877704 30220 ?        Sl   10:37   0:00 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/.bin/react-app-rewired test --reporters default --reporters jest-junit --coverage
node        6823  0.4  1.3 1006148 97108 ?       Sl   10:37   0:06 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/react-app-rewired/scripts/test.js --reporters default --reporters jest-junit --coverage
node        6881  2.8  2.6 1065992 194076 ?      Sl   10:37   0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6886  2.8  2.6 1067004 195748 ?      Sl   10:37   0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6898  2.9  2.5 1058872 187360 ?      Sl   10:37   0:43 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6905  2.8  2.4 1054256 183492 ?      Sl   10:37   0:42 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6910  2.8  2.6 1067812 196344 ?      Sl   10:37   0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6911  2.7  2.6 1063680 191088 ?      Sl   10:37   0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6950  0.8  1.9 1018536 145396 ?      Sl   10:38   0:11 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        7833  0.0  0.0   4340   804 ?        Ss   10:59   0:00 sh
node        7918  0.0  0.0   4240   652 ?        S    11:01   0:00 sleep 3
node        7919  0.0  0.0  17508  2048 ?        R+   11:01   0:00 ps aux
node        7920  0.0  0.0   4396   716 ?        S+   11:01   0:00 cat

From the manual on ps:

S    interruptible sleep (waiting for an event to complete)
l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)

So I think what this shows is that the tests have started running fine, spawned child processes to run them in parallel, and then for whatever reason after 40 seconds or so those processes have all gone to sleep and are no longer doing anything.

We're pretty stumped on how to investigate this further, particularly as we can't easily install extra tooling into the pod to dig deeper (no easy root access)... but any suggested theories / next steps would be welcomed!

** EDIT **

It seems idleMinutes wasn't the culprit, as several times today we've seen the issue happen again since reverting it. I've been able to verify that the script was running in a brand new node in kubernetes which hadn't been used by any other builds previously. So now I have no idea what's even changed recently to make this start happening :(

Solution

Having banged my head against this some more, I'm pretty confident that the root cause was the tests using excessive memory in the pod. We got lucky that for a few builds yesterday we saw an ENOMEM error printed out amongst the logging, before it got stuck in an identical way. I can't explain why we weren't always seeing this (we went back and checked previous examples and it wasn't there), but that's what put us onto the right track.

Doing some more digging around, I happened to run kubectl top pods in time to catch one of the node pods going crazy: node-thk0r-5vpzk was using 3131Mi at that particular moment, and we'd set the limit on the pod to 3Gi.
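
For anyone chasing something similar, the same checks are easy to repeat by hand; a minimal sketch (the pod name below is just the one from our example, so substitute your own stuck agent pod):

# Live CPU/memory usage for the agent pods (needs metrics-server in the cluster)
kubectl top pods

# Confirm what memory limit is actually configured on a given pod
kubectl describe pod node-thk0r-5vpzk | grep -A 3 'Limits'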

Looking back at the corresponding build in Jenkins, I saw that it was now in the stuck state but with no ENOMEM logging. Subsequent kubectl top pods commands showed the memory had now decreased to a reasonable level in node-thk0r-5vpzk, but clearly the damage was already done as we now had all the child processes in the weird sleep state not doing anything.

This also (potentially) explains why the problem became way more common after I introduced the idleMinutes behaviour - if there's any sort of memory leak then re-using the same pod over and over for npm test will make it more and more likely to hit the memory ceiling and freak out. Our fix for now has been to limit the number of workers using the --maxWorkers setting, which keeps us well below our 3Gi limit. We're also planning to look into the memory usage a bit using --detectLeaks to see if there's something crazy in our tests we can fix to solve the rampant memory usage.
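
For illustration, the amended test step ended up looking roughly like this (the worker count of 2 is a placeholder rather than the exact value we settled on, and --detectLeaks is Jest's experimental leak-detection flag):

# Cap the Jest worker pool so the test run stays well under the pod's memory limit
npm test -- --coverage --maxWorkers=2

# Separate investigative run to look for leaking tests
npm test -- --coverage --detectLeaks

Capping --maxWorkers directly limits how many jest-worker child.js processes (the ones visible in the ps output above) get spawned, so peak memory scales down with it.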

Hoping this can help someone else if they see a similar problem. Just another day in the crazy DevOps world...
