Parse Apache Formatted URLs in Ruby


Question

How can I take in an Apache Common Log file and list all of the URLs in it in a neat histogram like this:

/favicon.ico                    ##
/manual/mod/mod_autoindex.html  #
/ruby/faq/Windows/              ##
/ruby/faq/Windows/index.html    #
/ruby/faq/Windows/RubyonRails   #
/ruby/rubymain.html             #
/robots.txt                     ########

Sample of test file:

65.54.188.137 - - [03/Sep/2006:03:50:20 -0400] "GET /~longa/geomed/ppa/doc/localg/localg.htm HTTP/1.0" 200 24834
65.54.188.137 - - [03/Sep/2006:03:50:32 -0400] "GET /~longa/geomed/modules/sv/scen1.html HTTP/1.0" 200 1919
65.54.188.137 - - [03/Sep/2006:03:53:51 -0400] "GET /~longa/xlispstat/code/statistics/introstat/axis/code/axisDens.lsp HTTP/1.0" 200 15962
65.54.188.137 - - [03/Sep/2006:04:03:03 -0400] "GET /~longa/geomed/modules/cluster/lab/nm.pop HTTP/1.0" 200 66302
65.54.188.137 - - [03/Sep/2006:04:11:15 -0400] "GET /~longa/geomed/data/france/names.txt HTTP/1.0" 200 20706
74.129.13.176 - - [03/Sep/2006:04:14:35 -0400] "GET /~jbyoder/ambiguouslyyours/ambig.rss HTTP/1.1" 304 -

This is what I have right now (but I'm not sure how to make the histogram):

...
---

$apache_line = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/
$parts = apache_line.match(file)
$p parts[:ip_address], parts[:status], parts[:method], parts[:url]

def get_url(file)
    hits = Hash.new {|h,k| h[k]=0}
    File.read(file).to_a.each do |line|
    while $p parts[:url]
        if k = k
            h[k]+=1
            puts "%-15s %s" % [k,'#'*h[k]]
        end
    end
end

...
---

Here is the full question: http://pastebin.com/GRPS6cTZ Pseudo code is fine.

Solution

  1. You can create a hash mapping each path to the number of hits. For convenience, I suggest using a Hash that sets the value to 0 when you ask for a path it hasn't seen before. For example:

    hits = Hash.new{ |h,k| h[k]=0 }
    ...
    hits["/favicon.ico"] += 1
    hits["/ruby/faq/Windows/"] += 1
    hits["/favicon.ico"] += 1
    p hits
    #=> {"/favicon.ico"=>2, "/ruby/faq/Windows/"=>1}
    

  2. In case the log file is really huge, process the lines one at a time instead of slurping the whole thing into memory. (Look through the methods of the File class; File.foreach, for example, yields one line at a time.)

  3. Because Apache log file formats don't have standard delimiters, I'd suggest using a regular expression to take each line and separate it into the chunks you want. Assuming you're using Ruby 1.9 or later, I'm going to use named captures for clean access to the captured parts later on. For example:

    apache_line = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/
    ...
    parts = apache_line.match(log_line)
    p parts[:ip_address], parts[:status], parts[:method], parts[:url]
    

  4. You might want to choose to filter these based on the status code. For example, do you want to include in your graph all the 404 hits where someone mistyped? If you're not slurping all the lines into memory, you won't be using Array#select but instead skipping over them during your loop.

  5. After you have gathered all your hits, it's time to write out the results. Some helpful tips:

    1. Hash#keys can give you all the keys of the hash (the paths) at once. You probably want to pad all the paths to the same width, so you need to figure out which is the longest. Perhaps you want to map the paths to their lengths and then get the max element, or perhaps you want to use max_by to find the longest path and then find its length.

    2. Although geeky, using sprintf or String#% is a great way to lay out formatted reports. For example:

      puts "%-15s %s" % ["Hello","####"]
      # prints: Hello           ####
      

    3. Just as you needed to find the longest name for good formatting, you might want to find the URL with the most hits, so that you can scale your longest bar of hash marks to that value. Hash#values will give you an array of all values. Alternatively, perhaps you have a requirement that one # must always represent 100 hits, or something similar.

    4. Note that String#* lets you create a string by repetition:

      p '#'*10
      #=> "##########"
      

If you have specific questions about your code, ask more questions!
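
Putting the tips above together, here is a minimal sketch rather than a definitive implementation. It reuses the regular expression from tip 3; the helper names `count_hits` and `histogram_lines`, the `skip_statuses` and `max_bar` parameters, and the sample log lines are all assumptions made up for illustration.

```ruby
require "stringio"

# Regular expression from tip 3, using named captures.
APACHE_LINE = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/

# Hypothetical helper: counts hits per URL, reading one line at a time.
# `io` is anything that responds to each_line (a File, a StringIO, ...).
# `skip_statuses` is an assumed filtering policy; adjust or empty it as needed.
def count_hits(io, skip_statuses: ["404"])
  hits = Hash.new { |h, k| h[k] = 0 }
  io.each_line do |line|
    parts = APACHE_LINE.match(line)
    next unless parts                               # ignore lines that do not parse
    next if skip_statuses.include?(parts[:status])  # filter inside the loop (tip 4)
    hits[parts[:url]] += 1
  end
  hits
end

# Hypothetical helper: returns one "path ####" row per URL.
# Bars are scaled down only when the busiest URL exceeds `max_bar` marks,
# and every URL that was hit keeps at least one mark.
def histogram_lines(hits, max_bar: 50)
  return [] if hits.empty?
  width = hits.keys.map(&:length).max               # longest path (tip 5.1)
  most  = hits.values.max                           # busiest URL (tip 5.3)
  scale = most > max_bar ? max_bar.to_f / most : 1.0
  hits.map do |url, count|
    bar = "#" * [(count * scale).round, 1].max      # String#* (tip 5.4)
    "%-#{width}s %s" % [url, bar]                   # String#% (tip 5.2)
  end
end

# Made-up sample data in the same format as the test file in the question.
sample = StringIO.new(<<~LOG)
  65.54.188.137 - - [03/Sep/2006:03:50:20 -0400] "GET /robots.txt HTTP/1.0" 200 24834
  65.54.188.137 - - [03/Sep/2006:03:50:32 -0400] "GET /favicon.ico HTTP/1.0" 200 1919
  74.129.13.176 - - [03/Sep/2006:04:14:35 -0400] "GET /robots.txt HTTP/1.1" 304 -
  74.129.13.176 - - [03/Sep/2006:04:14:40 -0400] "GET /missing.html HTTP/1.1" 404 -
LOG

puts histogram_lines(count_hits(sample))
```

For a real log you could pass an open file instead of the StringIO, for example `File.open("access.log") { |f| puts histogram_lines(count_hits(f)) }`, where `access.log` is a placeholder path. Because the counting loop only ever holds one line at a time, memory use grows with the number of distinct URLs, not with the size of the log.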
