How to measure resource usage on partition level in Service Fabric?

Question

With Service Fabric we get the tools to create custom metrics and capacities. This way we can all build our own resource models that the resource balancer uses at runtime. I would like to monitor and use physical resources such as memory, CPU and disk usage. This works fine as long as we keep using the default load.

But load is not static for a service/actor, so I would like to use the built-in dynamic load reporting. This is where I run into a problem: ReportLoad works at the partition level, but all partitions on a node live in the same process. Every method I have found for monitoring physical resources, such as PerformanceCounter, uses the process as the smallest unit of measurement. If such a value were used, there could be hundreds of partitions reporting the same load, and that load would not be representative of any single partition.

So the question is: how can resource usage be measured at the partition level?

Answer

Not only are service instances and replicas hosted in the same process, but they also share a thread pool by default in .NET! Every time you create a new service instance, the platform actually just creates an instance of your service class (the one that derives from StatefulService or StatelessService) inside the host process. This is great because it's fast and cheap, and you can pack a ton of services into a single host process, and onto each VM or machine in your cluster.

But it also means resources are shared, so how do you know how much each replica of each partition is using?

The answer is that you report load on virtual resources rather than physical resources. The idea is that you, the service author, can keep track of some measurement about your service, and you formulate metrics from that information. Here is a simple example of a virtual resource that's based on physical resources:

Suppose you have a web service. You run a load test on your web service and you determine the maximum requests per second it can handle on various hardware profiles (using Azure VM sizes and completely made-up numbers as an example):

  • A2: 500 RPS
  • D2: 1000 RPS
  • D4: 1500 RPS

Now when you create your cluster, you set your capacities accordingly based on the hardware profiles you're using. So if you have a cluster of D2s, each node would define a capacity of 1000 RPS.
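A minimal Python sketch of the capacity side of this example, using the made-up numbers from the list above. The metric name "RPS", the node names and both helper functions are hypothetical; in a real cluster the capacities are declared as part of the node type definitions rather than computed in code. The `has_room` check only illustrates, roughly, the constraint the resource balancer tries to honor: the loads reported on a node should add up to no more than that node's capacity.

```python
# Hypothetical capacities (requests per second) from the made-up load-test
# numbers above; "RPS" is an assumed metric name.
MEASURED_MAX_RPS = {"A2": 500, "D2": 1000, "D4": 1500}


def node_capacities(node_sizes: dict[str, str]) -> dict[str, int]:
    """Map each node name to its RPS capacity based on its VM size."""
    return {node: MEASURED_MAX_RPS[size] for node, size in node_sizes.items()}


def has_room(capacity: int, replica_loads: list[int], new_load: int) -> bool:
    """True if a replica reporting `new_load` still fits within `capacity`."""
    return sum(replica_loads) + new_load <= capacity


cluster = node_capacities({"node0": "D2", "node1": "D2", "node2": "D4"})
print(cluster)                                      # {'node0': 1000, 'node1': 1000, 'node2': 1500}
print(has_room(cluster["node0"], [400, 350], 200))  # True  (950 <= 1000)
print(has_room(cluster["node0"], [400, 350], 300))  # False (1050 > 1000)
```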

Then each instance (or replica if stateful) of your web service reports an average RPS value. This is a virtual resource that you can easily calculate per instance/replica. It corresponds to some hardware profile, even though you're not reporting CPU, network, memory, etc. directly. You can apply this to anything that you can measure about your services, e.g., queue length, concurrent user count, etc.
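The per-replica side can be sketched as a sliding-window request counter. This is a hypothetical Python illustration, not Service Fabric code: `report_load` stands in for the partition's ReportLoad call, and the window length and "RPS" metric name are assumptions. The average is rounded to an integer because Service Fabric load metric values are integers.

```python
import time
from collections import deque
from typing import Callable, Optional


class RpsTracker:
    """Average requests per second for ONE replica over a sliding window.

    `report_load` is a hypothetical callback (metric name, integer value)
    standing in for the platform's load-reporting call.
    """

    def __init__(self, report_load: Callable[[str, int], None],
                 window_seconds: float = 60.0, metric_name: str = "RPS"):
        self._report_load = report_load
        self._window = window_seconds
        self._metric = metric_name
        self._timestamps: deque[float] = deque()

    def record_request(self, now: Optional[float] = None) -> None:
        """Call once per request handled by this replica."""
        self._timestamps.append(time.monotonic() if now is None else now)

    def average_rps(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        # Discard requests that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= now - self._window:
            self._timestamps.popleft()
        return len(self._timestamps) / self._window

    def report(self, now: Optional[float] = None) -> int:
        """Report the windowed average as an integer load value."""
        value = round(self.average_rps(now))
        self._report_load(self._metric, value)
        return value


reported = []
tracker = RpsTracker(lambda name, value: reported.append((name, value)),
                     window_seconds=10)
for i in range(50):
    tracker.record_request(now=100.0 + i * 0.2)  # 50 requests within ~10 s
print(tracker.report(now=109.9))                 # 5
```

Each replica owns its own tracker, so the reported value reflects only the requests that replica handled, even though all replicas share one process. The same pattern applies to queue length or concurrent users: keep the count per replica and report it.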

If you don't want to define a capacity as specific as requests per second, you can take a more general approach by defining physical-ish capacities for common resources, like memory or disk usage. But what you're really doing here is defining usable memory and disk for your services rather than total available. In your services you can keep track of how much of each capacity each instance/replica uses. But it's not a total value, it's just the stuff you know about. So for example if you're keeping track of data stored in memory, it wouldn't necessarily include runtime overhead, temporary heap allocations, etc.
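A small sketch of that arithmetic, with hypothetical numbers and a hypothetical "MemoryMB" metric: the capacity you declare is the memory you are willing to hand to tracked data (total memory minus your own overhead estimate), and each replica reports only the bytes it knows it holds, rounded up to whole megabytes.

```python
import math

BYTES_PER_MB = 1024 * 1024


def usable_capacity_mb(total_mb: int, reserved_mb: int) -> int:
    """Capacity to declare for a hypothetical "MemoryMB" metric.

    `reserved_mb` is your own estimate of memory the services use but do
    NOT report: runtime overhead, temporary heap allocations, and so on.
    """
    if not 0 <= reserved_mb < total_mb:
        raise ValueError("reserved_mb must be in [0, total_mb)")
    return total_mb - reserved_mb


def replica_load_mb(tracked_bytes: int) -> int:
    """Integer load for one replica, based only on bytes it tracks."""
    return math.ceil(tracked_bytes / BYTES_PER_MB)


capacity = usable_capacity_mb(total_mb=8192, reserved_mb=3072)
loads = [replica_load_mb(b)
         for b in (1_500 * BYTES_PER_MB, 2_000 * BYTES_PER_MB + 1)]
print(capacity, loads, sum(loads) <= capacity)  # 5120 [1500, 2001] True
```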

I have an example of this approach in a Reliable Collection wrapper I wrote that reports load metrics strictly on the amount of data you store by counting bytes: https://github.com/vturecek/metric-reliable-collections. It doesn't report total memory usage, so you have to come up with a reasonable estimate of how much overhead you need and define your capacities accordingly, but at the same time by not reporting temporary heap allocations and other transient memory usage, the metrics that are reported should be much smoother and more representative of the actual data you're storing (you don't necessarily want to re-balance the cluster simply because the .NET GC hasn't run yet, for example).
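An independent Python sketch of the byte-counting idea (not the code from the linked repository, which wraps Reliable Collections): a dictionary wrapper that keeps a running total of the bytes it stores and reports that total after each change. Measuring size with `pickle`, the "DataBytes" metric name and the `report_load` callback are all assumptions.

```python
import pickle
from typing import Any, Callable, Dict, Hashable


class ByteCountingStore:
    """Dict wrapper that reports load as the bytes of data it holds."""

    def __init__(self, report_load: Callable[[str, int], None],
                 metric_name: str = "DataBytes"):
        self._data: Dict[Hashable, Any] = {}
        self._sizes: Dict[Hashable, int] = {}  # serialized size per key
        self._total = 0
        self._report_load = report_load        # hypothetical callback
        self._metric = metric_name

    @staticmethod
    def _size_of(key: Hashable, value: Any) -> int:
        # Assumption: pickled size approximates stored size.
        return len(pickle.dumps(key)) + len(pickle.dumps(value))

    def set(self, key: Hashable, value: Any) -> None:
        new_size = self._size_of(key, value)
        # On overwrite, subtract the old entry's size instead of double counting.
        self._total += new_size - self._sizes.get(key, 0)
        self._data[key] = value
        self._sizes[key] = new_size
        self._report_load(self._metric, self._total)

    def remove(self, key: Hashable) -> None:
        if key in self._data:
            del self._data[key]
            self._total -= self._sizes.pop(key)
            self._report_load(self._metric, self._total)

    def get(self, key: Hashable, default: Any = None) -> Any:
        return self._data.get(key, default)

    @property
    def total_bytes(self) -> int:
        return self._total


reports = []
store = ByteCountingStore(lambda name, value: reports.append((name, value)))
store.set("user:1", {"name": "Ada"})
store.set("user:1", {"name": "Ada Lovelace"})  # total adjusted, not doubled
store.remove("user:1")
print(store.total_bytes)  # 0
```

Because only stored keys and values are counted, garbage the GC has not collected yet never shows up in the reported load. In practice you would likely throttle how often you report, and if load values are 32-bit integers, switch to KB or MB once a replica can hold more than about 2 GB.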
