Logfile analysis in R?


Question

I know there are other tools around like awstats or splunk, but I wonder whether there is some serious (web)server logfile analysis going on in R. I might not be the first to think of doing it in R, but still, R has nice visualization capabilities and also nice spatial packages. Do you know of any? Or is there an R package / code that handles the most common log file formats that one could build on? Or is it simply a very bad idea?

Answer

In connection with a project to build an analytics toolbox for our Network Ops guys, I built one of these about two months ago. My employer has no problem if I open source it, so if anyone is interested I can put it up on my github repo. I assume it's most useful to this group if I build an R package. I won't be able to do that straight away, though, because I need to research the docs on package building with non-R code (it might be as simple as tossing the Python bytecode files in /exec along with a suitable Python runtime, but I have no idea).

I was actually surprised that I needed to undertake a project of this sort. There are at least several excellent open-source and free log file parsers/viewers (including the excellent Webalizer and AWStats), but neither of those two parses server error logs (parsing server access logs is the primary use case for both).

If you are not familiar with error logs, or with the difference between them and access logs: in sum, Apache servers (likewise nginx and IIS) record two distinct logs and by default store them on disk next to each other in the same directory. On Mac OS X, that directory is in /var, just below root:

$> pwd
   /var/log/apache2

$> ls
   access_log   error_log

For network diagnostics, error logs are often far more useful than the access logs. They also happen to be significantly more difficult to process, because of the unstructured nature of the data in many of the fields and, more significantly, because the data file you are left with after parsing is an irregular time series: you might have multiple entries keyed to a single timestamp, then the next entry is three seconds later, and so forth.
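
To make that structure (or lack of it) concrete, here is a minimal sketch in R, assuming Apache 2.2-style error-log lines; it is not the answerer's Python parser, and the file path and regex are illustrative only:

lines <- readLines("/var/log/apache2/error_log")
# e.g. "[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] denied by rule"
pat  <- "^\\[([^]]+)\\] \\[([a-z]+)\\] (.*)$"
hits <- regmatches(lines, regexec(pat, lines))
hits <- hits[lengths(hits) == 4]            # drop lines that did not match
errs <- data.frame(
  ts    = as.POSIXct(sapply(hits, `[`, 2), format = "%a %b %d %H:%M:%S %Y"),
  level = sapply(hits, `[`, 3),
  msg   = sapply(hits, `[`, 4),
  stringsAsFactors = FALSE
)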

I wanted an app into which I could toss raw error logs (of any size, but usually several hundred MB at a time) and have something useful come out the other end, which in this case had to be some pre-packaged analytics and also a data cube available inside R for command-line analytics. Given this, I coded the raw-log parser in Python, while the processor (e.g., gridding the parser output to create a regular time series) and all the analytics and data visualization I coded in R.
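
A hedged sketch of that gridding step, snapping irregular log timestamps onto a regular one-minute grid; the toy data and column names are assumptions, not the answerer's schema:

# four entries, two sharing a timestamp, then a gap -- an irregular series
errs <- data.frame(
  ts = as.POSIXct(c("2011-05-01 10:00:01", "2011-05-01 10:00:01",
                    "2011-05-01 10:00:04", "2011-05-01 10:02:33"))
)
minute <- cut(errs$ts, breaks = "1 min")  # one factor level per minute bin
grid   <- as.data.frame(table(minute))   # regular series: count per bin,
names(grid) <- c("Minute", "ErrorCount") # empty minutes included as zeros
grid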

I have been building analytics tools for a long time, but only in the past four years have I been using R. So my first impression, immediately upon parsing a raw log file and loading the data frame into R, was what a pleasure R is to work with and how well suited it is to tasks of this sort. A few welcome surprises:

  • Serialization. To persist working data in R is a single command (save). I knew this, but I didn't know how efficient this binary format is. The actual data: for every 50 MB of raw logfiles parsed, the .RData representation was about 500 KB, roughly 100:1 compression. (Note: I pushed this down further, to about 300:1, by using the data.table library and manually setting the compression-level argument to the save function; see the sketch after this list.)

  • IO. My data warehouse relies heavily on redis, a lightweight data-structure server that resides entirely in RAM and writes to disk asynchronously. The project itself is only about two years old, yet there's already a redis client for R in CRAN (by B.W. Lewis, version 1.6.1 as of this post); a sketch follows this list as well.

  • Primary Data Analysis. The purpose of this project was to build a library for our Network Ops guys to use. My goal was a "one command = one data view" type of interface. So, for instance, I used the excellent googleVis package to create professional-looking scrollable/paginated HTML tables with sortable columns, into which I loaded a data frame of aggregated data (>5,000 lines). Just those few interactive elements, e.g., sorting a column, delivered useful descriptive analytics. Another example: I wrote a lot of thin wrappers over some basic data-juggling and table-like functions; each of these functions I would, for instance, bind to a clickable button on a tabbed web page. Again, this was a pleasure to do in R, in part because quite often the function required no wrapper; the single command with the arguments supplied was enough to generate a useful view of the data.
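
A minimal sketch of the serialization bullet, assuming a parsed data frame named parsed_df (the name and the exact compression settings are mine, not necessarily what was used):

library(data.table)
DT <- as.data.table(parsed_df)                 # parsed_df: parser output
save(DT, file = "errlog.RData",                # persist in one command;
     compress = "xz", compression_level = 9)   # higher level = smaller file
load("errlog.RData")                           # restores DT in one command

And a sketch of the IO bullet using the rredis client mentioned above; the key name is hypothetical:

library(rredis)
redisConnect()                       # defaults to localhost:6379
redisSet("errlog:cube", DT)          # store any R object under a key
DT2 <- redisGet("errlog:cube")       # round-trip it back into R
redisClose()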

A couple of examples of the last bullet:

# what are the most common issues that cause an error to be logged?

err_order = function(df){
    # contingency table: count per distinct issue description
    t0 = xtabs(~Issue_Descr, df)
    m = cbind(names(t0), t0)
    rownames(m) = NULL
    colnames(m) = c("Cause", "Count")
    # sort by count, descending
    ndx = order(as.numeric(m[,2]), decreasing=TRUE)
    m = m[ndx,]
    m1 = data.frame(Cause=m[,1], Count=as.numeric(m[,2]),
                    CountAsProp=100*as.numeric(m[,2])/nrow(df))
    # keep only causes accounting for at least 1% of logged errors
    subset(m1, CountAsProp >= 1.0)
}

# calling this function, passing in a data frame, returns something like:


                        Cause       Count    CountAsProp
1  'connect to unix://var/ failed'    200        40.0
2  'object buffered to temp file'     185        37.0
3  'connection refused'                94        18.8


The Primary Data Cube Displayed for Interactive Analysis Using googleVis:

[figure not included in this copy]

A contingency table (from an xtabs function call) displayed using googleVis:

[figure not included in this copy]

