Profiling a multi-tiered, distributed, web application (server side)

Problem description

I would like to profile a complex web application from the server PoV.

According to the wikipedia link above, and the Stack Overflow profiling tag description, profiling (in one of its forms) means getting a list (or a graphical representation) of APIs/components of the application, each with the number of calls and time spent in it during run-time.

Note that unlike a traditional one-program/one-language application, a web-server application may be:

  • distributed over multiple machines
  • different components may be written in different languages
  • different components may run on top of different OSes, etc.

So the traditional "Just use a profiler" answer is not easily applicable to this problem.

I'm not looking for:

  • coarse performance stats like the ones provided by various log-analysis tools (e.g. analog), nor for
  • client-side, per-page performance stats like the ones presented by tools like Google's Pagespeed or Yahoo! Y!Slow (waterfall diagrams and browser component load times)

Instead, I'm looking for a classic profiler-style report:

  • number of calls
  • call duration

by function/API/component-name, on the server-side of the web application.

So at the end of the day, the question is:

How do you profile a multi-tier, multi-platform, distributed web application?

A free-software based solution is much preferred.

I have been searching the web for a solution for a while and couldn't find anything satisfactory to fit my needs except some pretty expensive commercial offerings. In the end, I bit the bullet, thought about the problem, and wrote my own solution which I wanted to freely share.

I'm posting my own solution since this practice is encouraged on SO.

This solution is far from perfect. For example, it works at a very high level (individual URLs), which may not be good for all use-cases. Nevertheless, it has helped me immensely in trying to understand where my web-app spends its time.

In the spirit of open source and knowledge sharing, I welcome any other, especially superior, approaches and solutions from others.

Recommended answer

Thinking of how traditional profilers work, it should be straight-forward to come up with a general free-software solution to this challenge.

Let's break the problem into two parts:

  • Collecting the data
  • Presenting the data

Assume we can break our web application into its individual constituent parts (API, functions) and measure the time it takes each of these parts to complete. Each part is called thousands of times a day, so we could collect this data over a full day or so on multiple hosts. When the day is over we would have a pretty big and relevant data-set.

Epiphany #1: substitute 'function' with 'URL', and our existing web-logs are "it". The data we need is already there:

  • Each part of the web API is identified by the request URL (possibly with some parameters)
  • The round-trip time (often in microseconds) appears on every line
  • We have a day's (week's, month's) worth of lines with this data handy

So if we have access to standard web-logs for all the distributed parts of our web application, part one of our problem (collecting the data) is solved.
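
For example, assuming an access-log format where the request line is quoted and the response time is logged in microseconds as the last field (as Apache's %D does; the exact layout below is an assumption), a minimal R sketch can pull the (URL, latency) pair out of every line:

# Minimal sketch: extract (URL, latency) pairs from an access log.
# ASSUMED line format (last field = request time in microseconds):
#   1.2.3.4 - - [10/Dec/2012:06:25:01 +0000] "GET /api/login?u=x HTTP/1.1" 200 512 32734

log.lines <- readLines('access.log')     # hypothetical log file name

# the request URL sits inside the quoted request line
urls    <- sub('^[^"]*"[A-Z]+ ([^ ]+) HTTP[^"]*".*$', '\\1', log.lines)
# the latency is the last (numeric) field on the line
latency <- as.numeric(sub('^.* ([0-9]+)$', '\\1', log.lines))

calls <- data.frame(URL=urls, Latency=latency)
head(calls)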

Now we have a big data-set, but still no real insight. How can we gain insight?

Epiphany #2: visualize our (multiple) web-server logs directly.

A picture is worth a 1000 words. Which picture can we use?

We need to condense hundreds of thousands, or millions, of lines from multiple web-server logs into a short summary that tells most of the story about our performance. In other words: the goal is to generate a profiler-like report, or even better, a graphical profiler report, directly from our web logs.

Imagine we could map:

  • call latencies to one dimension
  • the number of calls to another dimension, and
  • function identities to colors (essentially a 3rd dimension)

One such picture: a stacked-density chart of latencies by API appears below (the function names were made up for illustrative purposes).

We have a trimodal distribution which essentially represents 3 radically different "worlds" in our application:

  • The fastest responses, centered around ~300 microseconds of latency. These responses come from our varnish cache
  • The second fastest, taking a bit under 0.01 seconds on average, come from the various APIs served by our middle-tier web application (Apache/Tomcat)
  • The slowest responses, centered around 0.1 seconds, and sometimes taking several seconds to respond, involve round-trips to our SQL database

We can see how dramatic caching effects can be on an application (note that the x-axis is on a log10 scale)

We can specifically see which APIs tend to be fast vs slow, so we know what to focus on.

We can see which APIs are most often called each day. We can also see that some of them are so rarely called, it is hard to even see their color on the chart.

The first step is to pre-process the logs and extract the needed subset of the data. A trivial utility like Unix 'cut' over multiple logs may be sufficient here. You may also need to collapse multiple similar URLs into shorter strings describing the function/API, like 'registration' or 'purchase'. If you have a multi-host unified log view generated by a load-balancer, this task may be easier. We extract only the names of the APIs (URLs) and their latencies, so we end up with one big file with a pair of columns, separated by TABs:

API_Name    Latency_in_microSecs
func_01    32734
func_01    32851
func_06    598452
...
func_11    232734
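
If you prefer to stay in R rather than use Unix 'cut' for this pre-processing step, a rough equivalent looks like the sketch below. The URL-to-API patterns are made up for illustration and would have to match your own routes; 'calls' is the (URL, Latency) data.frame extracted from the logs as above:

# Sketch of the pre-processing step in R (instead of Unix 'cut').
# The route patterns below are hypothetical examples.
collapse.url <- function(url) {
    url <- sub('\\?.*$', '', url)                          # drop query strings
    url <- sub('^/user/[0-9]+/profile$', 'profile', url)   # hypothetical route
    url <- sub('^/register.*', 'registration', url)        # hypothetical route
    url <- sub('^/checkout.*', 'purchase', url)            # hypothetical route
    url
}

calls$API <- collapse.url(calls$URL)

# write the two-column, TAB-separated file shown above (no header row)
write.table(calls[, c('API', 'Latency')], file='api-lat.tsv',
            sep='\t', quote=FALSE, row.names=FALSE, col.names=FALSE)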

Now we run the R script below on the resulting data pairs to produce the wanted chart (using Hadley Wickham's wonderful ggplot2 library). Voilà!

Finally, here's the code to produce the chart from the API+Latency TSV data file:

#!/usr/bin/Rscript --vanilla
#
# Generate stacked chart of API latencies by API from a TSV data-set
#
# ariel faigon - Dec 2012
#
.libPaths(c('~/local/lib/R',
         '/usr/lib/R/library',
         '/usr/lib/R/site-library'
))

suppressPackageStartupMessages(library(ggplot2))
# grid lib needed for 'unit()':
suppressPackageStartupMessages(library(grid))

#
# Constants: width, height, resolution, font-colors and styles
# Adapt to taste
#
wh.ratio = 2
WIDTH = 8
HEIGHT = WIDTH / wh.ratio
DPI = 200
FONTSIZE = 11
MyGray = gray(0.5)

title.theme   = element_text(family="FreeSans", face="bold.italic",
                        size=FONTSIZE)
x.label.theme = element_text(family="FreeSans", face="bold.italic",
                        size=FONTSIZE-1, vjust=-0.1)
y.label.theme = element_text(family="FreeSans", face="bold.italic",
                       size=FONTSIZE-1, angle=90, vjust=0.2)
x.axis.theme  = element_text(family="FreeSans", face="bold",
                        size=FONTSIZE-1, colour=MyGray)
y.axis.theme  = element_text(family="FreeSans", face="bold",
                        size=FONTSIZE-1, colour=MyGray)

#
# Function generating well-spaced & well-labeled y-axis (count) breaks
#
yscale_breaks <- function(from.to) {
    from <- 0
    to <- from.to[2]
    # round to 10 ceiling
    to <- ceiling(to / 10.0) * 10
    # Count major breaks on 10^N boundaries, include the 0
    n.maj = 1 + ceiling(log(to) / log(10))
    # if major breaks are too few, add minor-breaks half-way between them
    n.breaks <- ifelse(n.maj < 5, max(5, n.maj*2+1), n.maj)
    breaks <- as.integer(seq(from, to, length.out=n.breaks))
    breaks
}

#
# -- main
#

# -- process the command line args:  [tsv_file [png_file]]
#    (use defaults if they aren't provided)
#
argv <- commandArgs(trailingOnly = TRUE)
if (is.null(argv) || (length(argv) < 1)) {
    argv <- c(Sys.glob('*api-lat.tsv')[1])
}
tsvfile <- argv[1]
stopifnot(! is.na(tsvfile))
pngfile <- ifelse(is.na(argv[2]), paste(tsvfile, '.png', sep=''), argv[2])

# -- Read the data from the TSV file into an internal data.frame d
d <- read.csv(tsvfile, sep='\t', head=F)

# -- Give each data column a human readable name
names(d) <- c('API', 'Latency')

#
# -- Convert microseconds Latency (our weblog resolution) to seconds
#
d <- transform(d, Latency=Latency/1e6)

#
# -- Trim the latency axis:
#       Drop the few 0.001% extreme-slowest outliers on the right
#       to prevent them from pushing the bulk of the data to the left
Max.Lat <- quantile(d$Latency, probs=0.99999)
d <- subset(d, Latency < Max.Lat)

#
# -- API factor pruning
#       Drop rows where the APIs is less than small % of total calls
#
Rare.APIs.pct <- 0.001
if (Rare.APIs.pct > 0.0) {
    d.N <- nrow(d)
    API.counts <- table(d$API)
    d <- transform(d, CallPct=100.0*API.counts[d$API]/d.N)
    d <- d[d$CallPct > Rare.APIs.pct, ]
    d.N.new <- nrow(d)
}

#
# -- Adjust legend item-height &font-size
#    to the number of distinct APIs we have
#
API.count <- nlevels(as.factor(d$API))
Legend.LineSize <- ifelse(API.count < 20, 1.0, 20.0/API.count)
Legend.FontSize <- max(6, as.integer(Legend.LineSize * (FONTSIZE - 1)))
legend.theme  = element_text(family="FreeSans", face="bold.italic",
                        size=Legend.FontSize,
                        colour=gray(0.3))


# -- set latency (X-axis) breaks and labels (s.b made more generic)
lat.breaks <- c(0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10)
lat.labels <- sprintf("%g", lat.breaks)
#
# -- Generate the chart using ggplot
#
p <- ggplot(data=d, aes(x=Latency, y=..count../1000.0, group=API, fill=API)) +
   geom_histogram(binwidth=0.01) +   # current ggplot2: geom_bar() no longer takes binwidth; geom_histogram() gives the binned, stacked bars
      scale_x_log10(breaks=lat.breaks, labels=lat.labels) +
      scale_y_continuous(breaks=yscale_breaks) +
      ggtitle('APIs Calls & Latency Distribution') +
      xlab('Latency in seconds - log(10) scale') +
      ylab('Call count (in 1000s)') +
      theme(
            plot.title=title.theme,
            axis.title.y=y.label.theme,
            axis.title.x=x.label.theme,
            axis.text.x=x.axis.theme,
            axis.text.y=y.axis.theme,
            legend.text=legend.theme,
            legend.key.height=unit(Legend.LineSize, "line")
      )

#
# -- Save the plot into the png file
#
ggsave(p, file=pngfile, width=WIDTH, height=HEIGHT, dpi=DPI)
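
For reference on how the script is driven: it takes [tsv_file [png_file]] on the command line; with no arguments it picks up the first *api-lat.tsv file in the current directory and writes the chart to <tsv_file>.png.

If, in addition to the chart, you also want a tabular, classic profiler-style summary (call count, total and mean latency per API), the same TSV can be summarized with a few more lines of R. A minimal sketch, independent of the script above:

# Minimal sketch: classic profiler-style table from the same TSV
# (API_Name <TAB> Latency_in_microSecs, no header row)
d <- read.csv('api-lat.tsv', sep='\t', header=FALSE,
              col.names=c('API', 'Latency'))
d$Latency <- d$Latency / 1e6                    # microseconds -> seconds

by.api <- aggregate(Latency ~ API, data=d,
                    FUN=function(x) c(calls=length(x),
                                      total=sum(x),
                                      mean=mean(x),
                                      p95=as.numeric(quantile(x, 0.95))))
by.api <- do.call(data.frame, by.api)           # flatten the matrix column
print(by.api[order(-by.api$Latency.total), ])   # largest total time first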
