Service Fabric Actor or Service Becomes Inaccessible at Random after Upgrading to SDK 2.3.301


Problem Description

After upgrading from Service Fabric SDK 2.0.135 to 2.3.301, we have started encountering situations where a Service Fabric actor or service is inaccessible despite showing as healthy in Service Fabric Explorer. Once in this state, any call to the actor or service via the ActorProxy or ServiceProxy hangs for 5 minutes before finally throwing a TimeoutException. The actor or service never recovers on its own, even if left for an hour. The only fixes are to reset the node(s) on which the actor or service resides, redeploy the actor or service (exact same EXE), reset the entire cluster, or reboot all of the cluster machines.

It usually gets into this state after deploying or re-deploying an SF application.

In the last year of working with Service Fabric (since SDK v1.3), we have never had this problem. It only started after moving to 2.3.301.

It seems to happen randomly and inconsistently. Which of the 13 SF applications in our solution are affected is also random.

Does anyone have any ideas on how we might be able to resolve this? It seems like a bug in the latest version of Service Fabric, but perhaps we are doing something wrong on our end.

Thank you for your help.

Below is a lot of extra information that I hope will be useful in understanding what we're facing with this issue.

Many thanks

Steps

I don't really have steps to consistently reproduce the issue. This is simply what I observe sometimes.

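  1. I compiled and then re-deployed my SF project from Visual Studio ("Debug" -> "Start Without Debugging")
  2. Visual Studio says it deployed the project successfully
  3. Service Fabric Explorer shows all of my services as "Healthy", including DataBinding
  4. The SF project in question has 2 actors that are part of a single EXE. Service Fabric Explorer shows each actor running on a different node.
  5. Windows Task Manager shows two running copies of the EXE, which makes sense since there are two nodes running the EXE.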

Likewise, our QA experiences the issue after deploying to Azure using PowerShell directly. (He doesn't deploy from Visual Studio.)

Recap

  • Visual Studio says the deployment succeeded
  • Service Fabric Explorer shows everything as healthy
  • Task Manager shows two running copies of the EXE

When I See the Failure

I have one SF Service calling another SF Service using the ServiceProxy or ActorProxy classes. We do this throughout our solution with a combination of 13 different applications and about 25 different Services & Actors. It has worked successfully since we started working with Service Fabric SDK v1.3 in November 2015.

Now, after upgrading to 2.3.301, a random Actor or Service periodically gets into a state where it fails to respond to a method call made through ServiceProxy or ActorProxy. After 5 minutes of hanging, we receive a System.TimeoutException with the following message:

This can happen if message is dropped when service is busy or its long running operation and taking more time than configured Operation Timeout.

Note that the service is NOT busy, nor is it performing a long-running operation. As an actor, the service doesn't do any ongoing operations at all. It simply exposes public methods that other services can consume. It fails from the very first call.

In fact, tracing shows us that even the first line of the method in the actor never gets called. It's as if the Service Fabric communication infrastructure fails to deliver the message.

When It Started

In the past 12 months, we had never seen this issue.

Now, since upgrading Service Fabric last week, we are seeing this issue frequently and under a variety of conditions.

We upgraded to Service Fabric SDK 2.3.301.9590 and Service Fabric 5.3.301.9590.

At first, each developer on the team encountered the issue independently, and each thought it was a transient issue with just their own machine. Service Fabric does have some issues, so we just accepted this and moved on. But then we started to complain to each other and realized that we were all seeing it. Even our QAs are seeing it in the cloud, in the environment that is soon to be production.

Again, this only started when we upgraded to the latest version of Service Fabric last week.

Previously, we were running Service Fabric SDK 2.0.135.

We upgraded our codebase by installing SDK v2.3.301, opening each of our solutions, and allowing Visual Studio to conduct the upgrade.

Environment

I'm running a fresh install of Windows 10 Enterprise (installed less than 2 weeks ago) on an i7 with 16 GB of RAM. I have a fresh install of Visual Studio 2015 Update 3 and SF 2.3.301.9590. I installed everything clean. No upgrades.

This is also happening on all of my colleagues' machines (of varying ages, configurations and "freshnesses"). It happens sporadically to each of us.

Most critically, this is also happening on our Service Fabric VMs on Azure. These are machines that our QA created about a month ago using the standard templates for Service Fabric VMs on Azure. They had 5.3.301.9590 pre-installed. He did not manually install any updates to Service Fabric. Our SF-based application did not encounter this problem on Azure (or on our own dev machines) until after the developers upgraded to the new version.

This is not a "my machine" thing, nor is it isolated to the development environment. The only consistent change for all of us is the update of the SF version.

Cause

We have no idea what causes it.

It usually happens immediately after deploying a new SF application. Yes, we do wait for the usual 2 or 3 minutes it takes for SF to "figure itself out" after deploying. We have left it for an hour or more and it just never works.

Anecdotally, I think I've had an SF Service that was working fine and then suddenly stopped working, but this was before we realized there was an issue, so I wasn't looking for it. I can't be certain.

Workarounds

Once an SF service is in that "inaccessible" state, Service Fabric will not get itself back out of that state. The application is completely unusable. With varying degrees of success, we do the following:

  • Re-deploy the inaccessible SF application
  • Restart the nodes that host the inaccessible SF services & actors (through Service Fabric Explorer by going to the node, clicking the ellipsis button and clicking the "Restart" option)
  • Restart the entire SF cluster (Stop then Start)
  • Restart all of the machines running an SF node
  • Reset the entire cluster and re-deploy everything (last resort, but it has been necessary a few times)

Interestingly, what does not help is using Task Manager to kill the offending processes. If I kill the offending process, Service Fabric restarts it (as expected) but it still won't respond to messages.

Thus, the issue seems to be with Service Fabric itself and not with the EXEs.

Of course, these aren't "solutions" at all because they leave our entire application inaccessible until SF can restart/rebalance. Even restarting a few of the nodes knocks a bunch of stuff offline.

Essentially, this is a show-stopper for us. We can't possibly put our application into production (or even beta) with Service Fabric behaving like this.

The C# Exception when Using the Service Proxy or Actor Proxy

Here is a JSON rendering of the exception thrown by the ActorProxy or ServiceProxy:

"exception": {
    "ClassName": "System.TimeoutException",
    "Message": "This can happen if message is dropped when service is busy or its long running operation and taking more time than configured Operation Timeout.",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "   at Microsoft.ServiceFabric.Services.Communication.Client.ServicePartitionClient`1.<InvokeWithRetryAsync>d__7`1.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.ServiceFabric.Services.Remoting.Client.ServiceRemotingPartitionClient.<InvokeAsync>d__8.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.ServiceFabric.Services.Remoting.Builder.ProxyBase.<InvokeAsync>d__0.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.ServiceFabric.Services.Remoting.Builder.ProxyBase.<ContinueWithResult>d__7`1.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()\r\n   at RenderingCachingEngine.RenderingCachingEngine.<Render>d__10.MoveNext() in C:\\Code\\Ink\\Dev\\Current\\Source\\Rendering Service Fabric\\RenderingCachingEngine\\RenderingCachingEngine.cs:line 381",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\nMoveNext\nMicrosoft.ServiceFabric.Services, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35\nMicrosoft.ServiceFabric.Services.Communication.Client.ServicePartitionClient`1+<InvokeWithRetryAsync>d__7`1\nVoid MoveNext()",
    "HResult": -2146233083,
    "Source": "Microsoft.ServiceFabric.Services",
    "WatsonBuckets": null
  }

Here is a JSON rendering of the Service Fabric Info:

  "serviceFabricInfo": {
    "serviceFabricServiceName": "fabric:/Rendering/RenderingCachingEngine",
    "serviceFabricServiceTypeName": "RenderingCachingEngineType",
    "serviceFabricReplicaId": 131225099453058851,
    "serviceFabricPartitionId": "e400087d-8a08-4dab-bcdd-1f5ce82f374f",
    "serviceFabricApplicationName": "fabric:/Rendering",
    "serviceFabricApplicationTypeName": "RenderingType",
    "serviceFabricNodeName": "_Node_4"
  }

Event Viewer Logs on Re-deploy

Windows Event Viewer does show some noteworthy logs under "Applications and Services Logs -> Microsoft-Service Fabric -> Admin".

The following events were logged while I was re-deploying an updated version of my application (note that DataBinding.exe is the name of the EXE containing my two SF actors):

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:38:53 PM
Event ID:      256
Task Category: Common
Level:         Error
Keywords:      Default
User:          NETWORK SERVICE
Computer:      shayward10.ovx.local
Description:
WriteNode failed. HRESULT=-2147467259, Output=CustomOutput
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>256</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>1</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:38:53.678587200Z" />
    <EventRecordID>7620</EventRecordID>
    <Correlation />
    <Execution ProcessID="4440" ThreadID="7360" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-20" />
  </System>
  <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">XmlLiteWriter</Data>
    <Data Name="text">WriteNode failed. HRESULT=-2147467259, Output=CustomOutput</Data>
  </EventData>
</Event>

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:38:54 PM
Event ID:      23073
Task Category: Hosting
Level:         Warning
Keywords:      Default
User:          SYSTEM
Computer:      shayward10.ovx.local
Description:
ServiceHostProcess: DataBinding.exe for ApplicationId 805915c7-456c-49d3-af95-62cc44650664 terminated unexpectedly with exit code 3221225786 on node id bf865279ba277deb864a976fbf4c200e
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>23073</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>90</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:38:54.820567800Z" />
    <EventRecordID>7621</EventRecordID>
    <Correlation />
    <Execution ProcessID="6944" ThreadID="3812" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="id">bf865279ba277deb864a976fbf4c200e</Data>
    <Data Name="AppId">805915c7-456c-49d3-af95-62cc44650664</Data>
    <Data Name="ReturnCode">3221225786</Data>
    <Data Name="ProcessName">DataBinding.exe</Data>
  </EventData>
</Event>

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:38:56 PM
Event ID:      256
Task Category: Common
Level:         Error
Keywords:      Default
User:          NETWORK SERVICE
Computer:      shayward10.ovx.local
Description:
WriteNode failed. HRESULT=-2147467259, Output=CustomOutput
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>256</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>1</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:38:56.261857600Z" />
    <EventRecordID>7627</EventRecordID>
    <Correlation />
    <Execution ProcessID="4440" ThreadID="8564" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-20" />
  </System>
  <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">XmlLiteWriter</Data>
    <Data Name="text">WriteNode failed. HRESULT=-2147467259, Output=CustomOutput</Data>
  </EventData>
</Event>
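
The numeric codes in the logs above are easier to search for in hexadecimal. The following is a minimal sketch of that conversion, not part of the original post; the `CodeDecoder` class and `Hex` helper are hypothetical names used only for illustration. The symbolic names in the comments are the standard Windows/.NET names for those values and do not by themselves identify a root cause.

```csharp
using System;

static class CodeDecoder
{
    // Reinterpret a signed 32-bit HRESULT as its unsigned hexadecimal form.
    static string Hex(int hresult) => "0x" + unchecked((uint)hresult).ToString("X8");

    // Process exit codes are already logged as unsigned values.
    static string Hex(uint exitCode) => "0x" + exitCode.ToString("X8");

    static void Main()
    {
        // HRESULT from the "WriteNode failed" events (E_FAIL, unspecified failure).
        Console.WriteLine(Hex(-2147467259));   // 0x80004005

        // Exit code of DataBinding.exe (STATUS_CONTROL_C_EXIT).
        Console.WriteLine(Hex(3221225786u));   // 0xC000013A

        // HResult field of the TimeoutException JSON (COR_E_TIMEOUT).
        Console.WriteLine(Hex(-2146233083));   // 0x80131505
    }
}
```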

Event Viewer Logs on Timeout

Once the service is in an inaccessible state, trying to call it yields the following log on each request (after waiting for 5 minutes):

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:44:55 PM
Event ID:      44289
Task Category: FabricTransport
Level:         Warning
Keywords:      Default
User:          NETWORK SERVICE
Computer:      shayward10.ovx.local
Description:
Error While Sending Message : FABRIC_E_TIMEOUT
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>44289</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>173</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:44:55.349048200Z" />
    <EventRecordID>7629</EventRecordID>
    <Correlation />
    <Execution ProcessID="18600" ThreadID="8076" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-20" />
  </System>
 <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">ServiceCommunicationClient</Data>
    <Data Name="text">Error While Sending Message : FABRIC_E_TIMEOUT</Data>
  </EventData>
</Event>

Recommended Answer

This issue can happen in 2 scenarios.

  1. If your ActorService method takes longer to process than the default timeout, you need to change the OperationTimeout value. By default it is 5 minutes. You can change it by adding the assembly-level attribute FabricTransportServiceRemotingProviderAttribute to your client assembly.

https://msdn.microsoft.com/en-us/library/microsoft.servicefabric.services.remoting.fabrictransport.fabrictransportserviceremotingproviderattribute.aspx
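
As a rough sketch of the first scenario, the attribute could be applied in the client assembly as shown here. This assumes the attribute exposes an `OperationTimeoutInSeconds` property (check the linked reference for the SDK version in use); the file placement and the 600-second value are illustrative and not taken from the answer.

```csharp
// AssemblyInfo.cs (or any source file) in the client assembly that
// creates the ServiceProxy. Requires the Service Fabric Services SDK.
using Microsoft.ServiceFabric.Services.Remoting.FabricTransport;

// Assumption: OperationTimeoutInSeconds controls the remoting operation
// timeout. 600 seconds (10 minutes) is an arbitrary example value.
[assembly: FabricTransportServiceRemotingProvider(OperationTimeoutInSeconds = 600)]
```

Raising the timeout only helps when the called method is slow. It would not be expected to help in the situation described in the question, where the call never reaches the actor at all.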

  2. If the first scenario does not apply, you can try the mitigation below for a known bug.
    • Specify Port 0 in the Service Manifest for the ActorService endpoint. By default, the ActorEndpoint is listed in the ServiceManifest, but without a port.

This is how the ActorService endpoint will look after you make the change:

<Endpoint Name="Actor1ActorServiceEndpoint" Port="0" />

We are aware of the problem and a fix is on the way.
