NPM test getting mysteriously stuck when run in Jenkins
Problem description
We have two TypeScript apps, both created through CRA, and a CI pipeline which runs a series of npm
commands to run tests/lint and build the apps for later stages:
time npm install --no-optional --unsafe-perm
npm test -- --coverage
npm run tsc
npm run lint
export REACT_APP_VERSION=$VERSION
export REACT_APP_COMMIT=$GIT_COMMIT
npm run build
npm run build-storybook
Our CI pipeline runs in Jenkins, and we're using the kubernetes plugin in order to get executors on-demand.
The script is run in parallel for app1
and app2
via the following logic in our Jenkinsfile
:
stage('Frontend - App1') {
agent {
kubernetes {
label 'node'
defaultContainer 'jnlp'
yamlFile 'infrastructure/scripts/ci/pod-templates/node.yaml'
idleMinutes 30
}
}
environment {
CI = 'true'
NPMRC_SECRET_FILE_PATH = credentials('verdaccio-npmrc')
}
steps {
dir('frontend/app1') {
container('node') {
sh 'cp $NPMRC_SECRET_FILE_PATH ~/.npmrc'
sh 'chmod u+rw ~/.npmrc'
sh '../../infrastructure/scripts/ci/build-frontend.sh'
}
publishHTML(target: [
allowMissing : false,
alwaysLinkToLastBuild: false,
keepAll : true,
reportDir : 'coverage',
reportFiles : 'index.html',
reportName : "Coverage Report (app1)"
])
junit 'testing/junit.xml'
stash includes: 'build/**/*', name: 'app1-build'
stash includes: 'storybook-static/**/*', name: 'app1-storybook-build'
}
}
}
So, onto what we're seeing. Repeatedly yesterday we saw the same symptoms: the frontend stage for app1
would complete (the smaller of the two), whilst app2
would mysteriously stop in the middle of running tests (the last line of logging in Jenkins was always PASS src/x/y/file.test.ts
, but not always the same test). It would remain in this state for a full hour before getting killed by our pipeline timeout (or a bored developer).
We ran kubectl exec -it node-blah sh
to get onto the pod that was running the stuck stage and get some diagnostics. Running ps aux | cat
gives us this:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
node 1 0.0 0.0 4396 720 ? Ss+ 08:51 0:00 cat
node 17 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 32 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 47 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 664 0.0 0.0 0 0 ? Z 09:04 0:00 [sh] <defunct>
...
node 6760 0.0 0.0 4340 108 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6761 0.0 0.0 4340 1060 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6762 0.0 0.0 4340 812 ? S 10:36 0:00 sh -xe /home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh
node 6764 0.0 0.0 20096 2900 ? S 10:36 0:00 /bin/bash ../../infrastructure/scripts/ci/build-frontend.sh
node 6804 0.0 0.5 984620 38552 ? Sl 10:37 0:00 npm
node 6816 0.0 0.0 4356 836 ? S 10:37 0:00 sh -c react-app-rewired test --reporters default --reporters jest-junit "--coverage"
node 6817 0.0 0.4 877704 30220 ? Sl 10:37 0:00 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/.bin/react-app-rewired test --reporters default --reporters jest-junit --coverage
node 6823 0.4 1.3 1006148 97108 ? Sl 10:37 0:06 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/react-app-rewired/scripts/test.js --reporters default --reporters jest-junit --coverage
node 6881 2.8 2.6 1065992 194076 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6886 2.8 2.6 1067004 195748 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6898 2.9 2.5 1058872 187360 ? Sl 10:37 0:43 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6905 2.8 2.4 1054256 183492 ? Sl 10:37 0:42 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6910 2.8 2.6 1067812 196344 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6911 2.7 2.6 1063680 191088 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6950 0.8 1.9 1018536 145396 ? Sl 10:38 0:11 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 7833 0.0 0.0 4340 804 ? Ss 10:59 0:00 sh
node 7918 0.0 0.0 4240 652 ? S 11:01 0:00 sleep 3
node 7919 0.0 0.0 17508 2048 ? R+ 11:01 0:00 ps aux
node 7920 0.0 0.0 4396 716 ? S+ 11:01 0:00 cat
From the manual on ps:
S interruptible sleep (waiting for an event to complete)
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
So I think what this shows is that the tests have started running fine, spawned child processes to run them in parallel, and then for whatever reason after 40 seconds or so those processes have all gone to sleep and are no longer doing anything.
We're pretty stumped with how to investigate this further, particularly as we have the awkwardness of not easily being able to install whatever we like into the pod for further investigation (no easy root access)... but any suggested theories / next steps would be welcomed!
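Even without root, /proc offers a way to dig a little deeper into what a stuck worker is actually waiting on. A minimal sketch, using only files that are always present on Linux — here inspecting our own shell as a stand-in for one of the stuck jest-worker PIDs (e.g. 6881 from the ps output above):

```shell
# Diagnostic sketch using only /proc - no root access or extra tooling
# needed inside the pod. Substitute a real stuck PID for $$ in practice.
pid=$$
grep '^State:' "/proc/$pid/status"   # S = interruptible sleep, as in the ps STAT column
cat "/proc/$pid/wchan"; echo         # kernel function the process is blocked in, if any
```

A worker parked in something like `pipe_wait` or `futex_wait_queue_me` would at least distinguish "waiting on a pipe from a dead parent" from "waiting on a lock".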
** EDIT **
It seems idleMinutes
wasn't the culprit, as several times today we've seen the issue happen again since reverting it. I've been able to verify that the script was running in a brand new node in kubernetes which hadn't been used by any other builds previously. So now I have no idea what's even changed recently to make this start happening :(
Having banged my head against this some more, I'm pretty confident that the root cause was the tests using excessive memory in the pod. We got lucky that for a few builds yesterday we saw an ENOMEM
error printed out amongst the logging, before it got stuck in an identical way. I can't explain why we weren't always seeing this (we went back and checked previous examples and it wasn't there), but that's what put us onto the right track.
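One way to confirm the memory-pressure theory from inside the pod itself (again without root) is to read the container's cgroup accounting directly. A hedged sketch — the paths differ between cgroup v1 and v2, so this tries both:

```shell
# Read the container's memory usage and limit from the cgroup filesystem.
# Paths are the standard cgroup v2 and v1 locations; which one exists
# depends on the node's kernel configuration.
if [ -f /sys/fs/cgroup/memory.current ]; then                      # cgroup v2
    usage=$(cat /sys/fs/cgroup/memory.current)
    limit=$(cat /sys/fs/cgroup/memory.max)
elif [ -f /sys/fs/cgroup/memory/memory.usage_in_bytes ]; then      # cgroup v1
    usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
    limit=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
else
    usage=unknown; limit=unknown
fi
echo "usage=$usage limit=$limit"
```

Watching this in a loop while the tests run would show the usage climbing towards the limit without needing `kubectl top pods` timing luck.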
Doing some more digging around, I happened to run a kubectl top pods
in time to catch one of the node pods going crazy - you can see that node-thk0r-5vpzk
is using 3131Mi
at this particular moment in time, and we'd set the limit on the pod to be 3Gi.
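That 3Gi cap is enforced via Kubernetes resource limits. The actual contents of `node.yaml` aren't shown in the post, so the following is only a hypothetical sketch of where such a limit would live in the pod template:

```yaml
# Hypothetical sketch of infrastructure/scripts/ci/pod-templates/node.yaml.
# Container name matches the Jenkinsfile's container('node'); the image and
# request values are assumptions.
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: node
      image: node:lts        # assumed image
      resources:
        requests:
          memory: 2Gi        # assumed request
        limits:
          memory: 3Gi        # the cap the jest workers were hitting
```

With a memory limit set, the kernel's cgroup OOM killer can terminate individual child processes inside the container rather than the whole pod, which is consistent with the workers silently dying or stalling while the build carries on waiting.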
Looking back at the corresponding build in Jenkins, I saw that it was now in the stuck state but with no ENOMEM
logging. Subsequent kubectl top pods
commands showed the memory had now decreased to a reasonable level in node-thk0r-5vpzk
, but clearly the damage was already done as we now had all the child processes in the weird sleep state not doing anything.
This also (potentially) explains why the problem became way more common after I introduced the idleMinutes
behaviour - if there's any sort of memory leak then re-using the same pod over and over for npm test
will make it more and more likely to hit the memory ceiling and freak out. Our fix for now has been to limit the number of workers using the --maxWorkers
setting, which keeps us well below our 3Gi
limit. We're also planning to look into the memory usage a bit using --detectLeaks
to see if there's something crazy in our tests we can fix to solve the rampant memory usage.
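As a rough sketch of how the workaround might be sized: assuming (and this per-worker figure is purely an assumption, not from the post) that each Jest worker can peak at around 700Mi, the worker count for a 3Gi pod falls out of a simple division:

```shell
# Sketch: derive a --maxWorkers value from the pod's memory limit.
# per_worker_mi=700 is an assumed peak per jest-worker; tune it for
# your own tests.
limit_mi=3072        # pod limit: 3Gi in Mi
per_worker_mi=700
max_workers=$(( limit_mi / per_worker_mi ))
echo "npm test -- --coverage --maxWorkers=$max_workers"
```

The printed command (here `--maxWorkers=4`) would then replace the plain `npm test -- --coverage` line in `build-frontend.sh`; `--detectLeaks` can be appended in the same way for the leak investigation.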
Hoping this can help someone else if they see a similar problem. Just another day in the crazy DevOps world...