Bash script to extract entries from log file based on dates specified in another file?

Question

I've got a pretty big comma-delimited CSV log file (>50000 rows, let's call it file1.csv) that looks something like this:

field1,field2,MM-DD-YY HH:MM:SS,field4,field5...
...
field1,field2,07-29-10 08:04:22.7,field4,field5...
field1,field2,07-29-10 08:04:24.7,field4,field5...
field1,field2,07-29-10 08:04:26.7,field4,field5...
field1,field2,07-29-10 08:04:28.7,field4,field5...
field1,field2,07-29-10 08:04:30.7,field4,field5...
...

As you can see, there is a field in the middle that is a time stamp.

I also have a file (let's call it file2.csv) that has a short list of times:

timestamp,YYYY,MM,DD,HH,MM,SS
20100729180031,2010,07,29,18,00,31
20100729180039,2010,07,29,18,00,39
20100729180048,2010,07,29,18,00,48
20100729180056,2010,07,29,18,00,56
20100729180106,2010,07,29,18,01,06
20100729180115,2010,07,29,18,01,15

What I would like to do is to extract only the lines in file1.csv that have times specified in file2.csv.

How do I do this with a bash script? Since file1.csv is quite large, efficiency would also be a concern. I've done very simple bash scripts before, but really don't know how to deal with this. Perhaps some implementation of awk? Or is there another way?

P.S. Complication 1: I manually spot checked some of the entries in both files to make sure they would match, and they do. There just needs to be a way to remove (or ignore) the extra ".7" at the end of the seconds ("SS") field in file1.csv.
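If you prefer to normalize file1.csv rather than ignore the fraction, one option is to strip the fractional seconds with sed before matching. A minimal sketch (the sample line below is illustrative, not taken from the real log):

```shell
# Strip a fractional-seconds suffix like ".7" from the timestamp field.
echo 'field1,field2,07-29-10 08:04:22.7,field4,field5' |
  sed 's/\(:[0-9][0-9]\)\.[0-9]*/\1/'
# -> field1,field2,07-29-10 08:04:22,field4,field5
```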

P.P.S. Complication 2: Turns out the entries in file1.csv are all separated by about two seconds. Sometimes the time stamps in file2.csv fall right in between two of the entries in file1.csv! Is there a way to find the closest match in this case?

Answer

If you have GNU awk (gawk), you can use this technique.

In order to match the nearest times, one approach would be to have awk print two lines for each line in file2.csv, then use that with grep -f as in John Kugelman's answer (http://stackoverflow.com/questions/3631442/bash-script-to-extract-entries-form-log-file-based-on-dates-specified-in-another/3631530#3631530). The second line will have one second added to it.

awk -F, 'NR>1 {$1=""; print strftime("%m-%d-%y %H:%M:%S", mktime($0));
               print strftime("%m-%d-%y %H:%M:%S", mktime($0) + 1)}' file2.csv > times.list
grep -f times.list file1.csv
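Since every pattern written to times.list is a fixed string, `grep -F -f` (fixed-string matching) should be noticeably faster than regex matching on a >50000-line file. A self-contained sketch with made-up stand-in data (file names follow the question):

```shell
# Build tiny stand-ins for times.list and file1.csv (contents are made up),
# then match with -F so grep treats each pattern as a literal string.
printf '%s\n' '07-29-10 18:00:31' '07-29-10 18:00:32' > times.list
printf '%s\n' \
  'f1,f2,07-29-10 18:00:31.7,f4' \
  'f1,f2,07-29-10 19:05:00.7,f4' > file1.csv
grep -F -f times.list file1.csv
# -> f1,f2,07-29-10 18:00:31.7,f4
```

Note that the trailing ".7" needs no special handling here: grep matches substrings, so "07-29-10 18:00:31" matches inside "07-29-10 18:00:31.7".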

This illustrates a couple of different techniques.


  • skip record number one to skip the header (using a match is actually better)
  • instead of dealing with each field individually, $1 is emptied and strftime creates the output in the desired format
  • mktime converts the string in the format "yyyy mm dd hh mm ss" (the -F, and the assignment to $1 removes the commas) to a number of seconds since the epoch, and we add 1 to it for the second line
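The mktime/strftime round trip described above can be seen on a single header-less file2.csv row (requires GNU awk; the sample row is taken from the question):

```shell
# Empty the timestamp field ($1), so $0 becomes "yyyy mm dd hh mm ss";
# mktime converts that to epoch seconds and strftime reformats it to
# the mm-dd-yy layout used in file1.csv.
echo '20100729180031,2010,07,29,18,00,31' |
  awk -F, '{$1=""; print strftime("%m-%d-%y %H:%M:%S", mktime($0))}'
# -> 07-29-10 18:00:31
```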
