Random failure of creating a New Cassandra Cluster using OpsCenter


Problem description

OpsCenter version: 5.1.0 and DSE Version: 4.6.0

Creating a brand new cluster directly from OpsCenter gives us the following error. It occasionally works with the same settings, but about 95% of the time it fails with the same error. OpsCenter is running on its own box but shares the same security groups as the cluster instances. For good measure, I have opened up all TCP ports to all IPs. The following is the stack trace of the error from opscenterd.log:

2015-03-19 10:06:12+0000 [] INFO: Starting provisioning process
2015-03-19 10:06:12+0000 [] INFO: Starting installation phase of cluster provisioning

2015-03-19 10:06:13+0000 [] WARN: HTTP request http://10.x.x.x:61621/alive? failed: Connection was refused by other side: 111: Connection refused.

2015-03-19 10:06:13+0000 [] INFO: Beginning install of OpsCenter agent to 54.x.x.x

2015-03-19 10:06:26+0000 [] WARN: HTTP request http://10.x.x.x:61621/alive? failed: Connection was refused by other side: 111: Connection refused.

2015-03-19 10:06:31+0000 [] INFO: Agent for ip 10.x.x.x is version None
2015-03-19 10:06:31+0000 [] INFO: Agent for ip 10.x.x.x is version u'5.1.0'
2015-03-19 10:07:23+0000 [] INFO: Successfully installed agent and dse on node 10.x.x.x

2015-03-19 10:07:23+0000 [] INFO: Beginning "stop" phase of cluster provisioning

2015-03-19 10:07:25+0000 [] WARN: Marking request '10.x.x.x: /ops/stop' (f6708fa2-b45f-42b4-b992-90a82b460ac7) as failed: /usr/sbin/service dse stop failed

    exit status: 1
    stdout:
    log_daemon_msg is a shell function
    Cassandra 2.0 and later require Java 7 or later.

2015-03-19 10:07:25+0000 [] ERROR: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed

    exit status: 1
    stdout:
    log_daemon_msg is a shell function
    Cassandra 2.0 and later require Java 7 or later.

2015-03-19 10:07:25+0000 [] WARN: Marking request 'stop stage' (0b6fcb6b-96ba-404e-a484-b4b6b167b309) as failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed

    exit status: 1
    stdout:
    log_daemon_msg is a shell function
    Cassandra 2.0 and later require Java 7 or later.

2015-03-19 10:07:25+0000 [] ERROR: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed

    exit status: 1
    stdout:
    log_daemon_msg is a shell function
    Cassandra 2.0 and later require Java 7 or later.

2015-03-19 10:07:25+0000 [] WARN: Marking request 'provision' (daf1c15d-92e3-40b0-83ca-34d548ea835b) as failed: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed

    exit status: 1
    stdout:
    log_daemon_msg is a shell function
    Cassandra 2.0 and later require Java 7 or later.

2015-03-19 10:07:25+0000 [] ERROR:
2015-03-19 10:07:25+0000 [] ERROR: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed

    exit status: 1
    stdout:
    log_daemon_msg is a shell function
    Cassandra 2.0 and later require Java 7 or later.

2015-03-19 10:07:25+0000 [] ERROR: Failed to provision cluster: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed

    exit status: 1
    stdout:
    log_daemon_msg is a shell function
    Cassandra 2.0 and later require Java 7 or later.

2015-03-19 10:07:25+0000 [] WARN: Marking request 28c021fd-d21a-4fed-bb5c-a4fe17d362e0 as failed: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed

    exit status: 1
    stdout:
    log_daemon_msg is a shell function
    Cassandra 2.0 and later require Java 7 or later.

2015-03-19 10:07:41+0000 [] WARN: Unable to find a matching cluster for node with IP [u'fe80:0:0:0:2000:aff:feeb:31c7%2', u'10.x.x.x', u'0:0:0:0:0:0:0:1%1', u'127.0.0.1']; the message was [u'5.1.0', u'/1947480708/conf']. This usually indicates that an OpsCenter agent is still running on an old node that was decommissioned or is part of a cluster that OpsCenter is no longer monitoring.

Appreciate any help! Thanks in advance, Harsha

Recommended answer

OpsCenter developer here. I make the OpsCenter provisioning features go zoom (or splat occasionally, as you've seen). It is with sadness and shame that I must tell you that you're hitting a bug.

The DataStax AMI version 2.4 used by OpsCenter provisioning (https://github.com/riptano/ComboAMI/tree/2.4) does quite a bit of work at boot time via startup scripts. One of those tasks is to set up some gpg repository keys used to validate packages. That process can fail intermittently, breaking package installs and leading to the series of errors that you're seeing. The failure has greatly increased in frequency recently. If you check /home/ubuntu/datastax-ami/ami.log you should see the gpg key failures that begin the rest of the failure chain.
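To check whether a node hit this failure, something like the following rough sketch could be run on the instance. The log path comes from the answer above. The exact wording of the gpg failure messages is not given, so the script simply lists every line that mentions "gpg" (an assumption about how they can be recognised), and the function name is made up for illustration.

```python
from pathlib import Path

# Path taken from the answer; the search term is an assumption, since the
# exact wording of the gpg failure messages is not given.
AMI_LOG = Path("/home/ubuntu/datastax-ami/ami.log")


def find_gpg_lines(text, needle="gpg"):
    """Return (line_number, line) pairs that mention `needle`, case-insensitively."""
    needle = needle.lower()
    return [
        (number, line.rstrip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]


def main():
    if not AMI_LOG.exists():
        print(f"{AMI_LOG} not found; run this on a provisioned node")
        return
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    for number, line in find_gpg_lines(AMI_LOG.read_text(errors="replace")):
        print(f"{number}: {line}")


if __name__ == "__main__":
    main()
```

The matching lines still need to be read by eye; the script only narrows down where in ami.log to look.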

Unfortunately, this error is pretty far down the technology stack and is difficult to work around manually. If you just need to provision a single cluster you can retry until you get a good run. Otherwise your best bet is to manually launch the instances and use local provisioning to deploy dse/dsc to their private IP addresses:

  • Launch instances using ami-ada2b6c4 (assuming you're in us-east-1)
    • Make sure to add the instances to the OpsCenterSecurity group.
    • Make sure you have the private half of the keypair you use (you'll need it during local provisioning)
    • On the instance data page, hit the advanced pulldown and add the following userdata as text "--raidonly --java7"
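For anyone who prefers scripting the launch over clicking through the console, the steps above might be sketched with boto3 roughly as follows. The AMI ID, security group name and userdata text come from the list above. Everything else is an assumption for illustration: the instance type, the helper names, that boto3 is installed with credentials configured, and that the security group can be referenced by name in your account.

```python
def build_launch_request(count, key_name, instance_type="m3.large"):
    """Build keyword arguments for EC2 RunInstances from the steps in the answer."""
    return {
        "ImageId": "ami-ada2b6c4",                # from the answer; us-east-1 only
        "MinCount": count,
        "MaxCount": count,
        "KeyName": key_name,                      # keep the private half for local provisioning
        "SecurityGroups": ["OpsCenterSecurity"],  # group named in the answer
        "UserData": "--raidonly --java7",         # userdata text from the answer
        "InstanceType": instance_type,            # assumption: not specified in the answer
    }


def launch(count, key_name):
    """Launch the instances and return their IDs (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.run_instances(**build_launch_request(count, key_name))
    return [instance["InstanceId"] for instance in response["Instances"]]
```

Calling launch(3, "my-keypair") would request three instances; their private IP addresses can then be looked up and handed to local provisioning in OpsCenter.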

Not a super-simple workaround. I wish your experience with OpsCenter this time around was more awesome. The good news is I'm on this bug and it will be fixed in an upcoming point release.

No longer necessary to manually remove /etc/security/limits.d/cassandra.conf
