Need a way to sort a 100 GB log file by date


Question

So, for some strange reason I ended up with a 100GB log file that is unsorted (actually it's partially sorted), while the algorithms that I'm attempting to apply require sorted data. A line in the log file looks like this:

data <date> data data more data



I have access to C# 4.0 and about 4 GB of RAM on my workstation. I would imagine that a merge sort of some kind would be best here, but short of implementing these algorithms myself, I want to ask if there's some kind of shortcut I could take.

Incidentally, parsing the date string with DateTime.Parse() is very slow and takes up a lot of CPU time; the chugging rate is a measly 10 MB/sec. Is there a faster way than the following?

    public static DateTime Parse(string data)
    {            
        int year, month, day;

        int.TryParse(data.Substring(0, 4), out year);
        int.TryParse(data.Substring(5, 2), out month);
        int.TryParse(data.Substring(8, 2), out day);

        return new DateTime(year, month, day);
    }



I wrote that to speed up DateTime.Parse(), and it actually works well, but it's still taking a bucket-load of cycles.

Note that for the log files I'm currently interested in, the hours, minutes and seconds also matter. I know that I can provide DateTime.Parse() with a format, but that doesn't seem to speed it up at all.
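
One commonly suggested way to squeeze more speed out of a fixed-layout timestamp is to skip Substring and int.TryParse entirely and compute the digits with character arithmetic. The following is only a sketch of that idea; the field offsets assume a YYYY-MM-DD hh:mm:ss layout and the ParseFast name is made up for illustration:

    // Sketch: parse a fixed-layout "YYYY-MM-DD hh:mm:ss" timestamp using
    // character arithmetic instead of Substring + int.TryParse.
    // The offsets are an assumption about the log layout.
    public static DateTime ParseFast(string s, int offset)
    {
        int year   = (s[offset]      - '0') * 1000 + (s[offset + 1]  - '0') * 100
                   + (s[offset + 2]  - '0') * 10   + (s[offset + 3]  - '0');
        int month  = (s[offset + 5]  - '0') * 10   + (s[offset + 6]  - '0');
        int day    = (s[offset + 8]  - '0') * 10   + (s[offset + 9]  - '0');
        int hour   = (s[offset + 11] - '0') * 10   + (s[offset + 12] - '0');
        int minute = (s[offset + 14] - '0') * 10   + (s[offset + 15] - '0');
        int second = (s[offset + 17] - '0') * 10   + (s[offset + 18] - '0');
        return new DateTime(year, month, day, hour, minute, second);
    }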

I'm looking for a nudge in the right direction, thanks in advance.

EDIT: Some people have suggested that I use string comparison in order to compare dates. That would work for the sorting phase, but I do need to parse the dates for the algorithms. I still have no idea how to sort a 100GB file on 4GB of free RAM without doing it manually.
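
For reference, the textbook way to do it "manually" is an external merge sort: cut the file into runs that fit in RAM, sort each run, spill it to disk, then merge the sorted runs. The sketch below only illustrates that idea and is not what was eventually used; the key offset, run size and class name are assumptions:

    // Rough sketch of an external merge sort: spill sorted runs to temp
    // files, then merge them with a simple linear minimum scan over the
    // current head line of each run. Key offset and run size are assumed.
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class ExternalSortSketch
    {
        // Assumes the "YYYY-MM-DD hh:mm:ss.msek" key starts at column 5.
        static string Key(string line) { return line.Substring(5, 23); }

        public static void Sort(string inputPath, string outputPath, int linesPerRun)
        {
            var runFiles = new List<string>();

            // Phase 1: read runs that fit in memory, sort each, spill to disk.
            using (var reader = new StreamReader(inputPath))
            {
                var run = new List<string>(linesPerRun);
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    run.Add(line);
                    if (run.Count == linesPerRun) { runFiles.Add(SpillRun(run)); run.Clear(); }
                }
                if (run.Count > 0) runFiles.Add(SpillRun(run));
            }

            // Phase 2: merge the runs, always emitting the smallest head line.
            var readers = runFiles.Select(f => new StreamReader(f)).ToArray();
            var heads = readers.Select(r => r.ReadLine()).ToArray();
            using (var writer = new StreamWriter(outputPath))
            {
                while (true)
                {
                    int min = -1;
                    for (int i = 0; i < heads.Length; i++)
                        if (heads[i] != null &&
                            (min < 0 || string.CompareOrdinal(Key(heads[i]), Key(heads[min])) < 0))
                            min = i;
                    if (min < 0) break;                    // every run is exhausted
                    writer.WriteLine(heads[min]);
                    heads[min] = readers[min].ReadLine();  // advance the run we consumed
                }
            }
            foreach (var r in readers) r.Dispose();
            foreach (var f in runFiles) File.Delete(f);
        }

        static string SpillRun(List<string> run)
        {
            run.Sort((a, b) => string.CompareOrdinal(Key(a), Key(b)));
            string path = Path.GetTempFileName();
            File.WriteAllLines(path, run);
            return path;
        }
    }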

编辑2 :好,感谢几个建议,我使用窗口排序,我发现有一个为Linux 类似的工具。基本上你叫排序,并为您解决一切。当我们讲它做的的东西的,我希望它会很快结束。我使用的命令是

EDIT 2 : Well, thanks to several suggestions that I use windows sort, I found out that there's a similar tool for Linux. Basically you call sort and it fixes everything for you. As we speak it's doing something, and I hope it'll finish soon. The command I'm using is

sort -k 2b 2008.log > 2008.sorted.log



-k specifies that I want to sort on the second field, which is a date-time string in the usual YYYY-MM-DD hh:mm:ss.msek format. I must admit that the man pages are lacking when it comes to explaining all the options, but I found a lot of examples by running info coreutils 'sort invocation'.

I'll report back with results and timings. This part of the log is about 27GB. I am thinking of sorting 2009 and 2010 separately and then merging the results into a single file with the sort -m option.

修改3 好,检查 iotop 的表明它读取在数据文件的小块,然后拼命做,以处理他们的东西。这个过程似乎是相当缓慢的。 =(

Edit 3 Well, checking iotop suggests that it's reading in small chunks of the data file and then furiously doing something in order to process them. This process seems to be quite slow. =(

sort isn't using any memory, and only a single core. When it does read data from the drive it's not processing anything. Am I doing something wrong?

修改4 在它仍然在做同样的事情,三个小时。现在我在那个阶段M,我想尝试的功能参数打,但我投入300小时... ...我会放弃在约4小时,并试图把它过夜计算与智慧的内存和空间参数...

Edit 4 Three hours in and it's still doing the same thing. Now I'm at that stage where I want to try playing with parameters of the function, but I'm three hours invested... I'll abort in in about 4 hours, and try to put it for overnight computation with smarter memory and space parameters...

修改5 之前,我回家了,我重新启动使用以下命令的过程:

Edit 5 Before I went home, I restarted the process with the following command:

sort -k 2b --buffer-size=60% -T ~/temp/ -T "/media/My Passport" 2010.log -o 2010.sorted.log

It returned this, this morning:

sort: write failed: /media/My Passport/sortQAUKdT: File too large

Wraawr! I thought I would just add as many hard drives as possible to speed this process up. Apparently adding a USB drive was the worst idea ever. At the moment I can't even tell if it's about FAT/NTFS or some such, because fdisk is telling me that the USB drive is a "wrong device"... no kidding. I'll try to give it another go later; for now let's put this project into the maybe-failed pile.

Final Notice This time it worked, with the same command as above, but without the problematic external hard drive. Thank you all for your help!

Benchmarking

Using 2 workstation grade (at least 70 MB/sec read/write IO) hard disks on the same SATA controller, it took me 162 minutes to sort a 30GB log file. I will need to sort another 52 GB file tonight; I'll post how that goes.

Answer

If a string sort will work for you, then just use the Windows SORT command. Sort the file and be done with it. It'll happily sort your 100GB file, and it's simple to use.

If you need to filter and convert the file, specifically the date field, then I would simply write a small conversion program that converts the date field into a zero-filled integer (like the number of seconds since 1970, or whatever you like), and rewrites the record. Then you can pipe (|) the output into the sort command; then you have a final, sorted file that's more readily parsed by your utility program.
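
A minimal sketch of such a pre-sort filter, assuming the timestamp starts at a fixed column and prepending a fixed-width key rather than replacing the field (the offset, format string and AddSortKey name are illustrative, not from the answer):

    // Sketch: read log lines on stdin, prepend a zero-padded
    // "seconds since 1970" key derived from the timestamp, and write to
    // stdout so the output can be piped straight into sort.
    // The timestamp offset (column 5) is an assumed log layout.
    using System;

    static class AddSortKey
    {
        static void Main()
        {
            var epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                DateTime ts = DateTime.ParseExact(
                    line.Substring(5, 19), "yyyy-MM-dd HH:mm:ss", null);
                long seconds = (long)(ts - epoch).TotalSeconds;
                // Fixed-width key so a plain string sort orders records correctly.
                Console.WriteLine("{0:D12} {1}", seconds, line);
            }
        }
    }

Something along the lines of AddSortKey < 2008.log | sort would then emit records already ordered by the numeric key.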

I think the mistake you're making is simply trying to do this all in one go. 100GB of data is a lot, and it takes some time to copy, but it doesn't take THAT long. Since you have to sort it, you already have to deal with a copy of the file at some point (i.e. you need enough free space on your machine to handle both copies at some time), even with an external sorting routine like merge sort.

Writing a simple reformatter and piping it into sort will save you a couple of trips through the file, and save space on disk, since you'll inevitably just need the two copies.

I would also tweak the formatter into pulling only the fields I'm really interested in, and do all of the "heavy" parsing at that point, so that what you end up with is essentially a formatted file that is easily handled by your reporting routines. That way you'll save time later when potentially running your reports more than once.

Use a simple CSV or, even better, a fixed-length file format for output if possible.

Make sure your date information, if you choose to use an integer, has all of the fields the same length. Otherwise the SORT utility won't sort them correctly (you end up with 1 10 2 3 instead of 1 2 3 10; you're better off with 01 02 03 10).

EDIT --

Let's approach it from a different tack.

The biggest question is "do you need all this data?". This relates to the earlier suggestion about doing the heavy parsing first. Obviously, the more you can reduce the initial set the better. For example, simply removing 10% of the data is 10GB.

Something I like to think about as a rule of thumb, especially when dealing with a lot of data: "If you have 1 million of something, then every millisecond saved is 20 minutes off the bottom line."

Normally, we really don't think in terms of milliseconds for our work; it's more "seat of the pants", "that feels faster". But the 1ms == 20min/million is a good measure to get a grasp of how much data you're dealing with, and how long stuff should/could take.

For your case, 100GB of data. With a swag of 100 bytes per record, you're talking 1 billion rows. 20,000 minutes per millisecond. -- 5 1/2 hours. gulp (It's a rule of thumb; if you do the math it doesn't quite work out to this.)

So, you can appreciate the desire to reduce the raw data if at all possible.

That was one reason I deferred to the Windows SORT command. It's a basic process, but one affected by nuance, and one that can use some optimization. The folks who wrote SORT had the time and opportunity to make it "optimal", in many ways. Whether they did or did not, I can't say. But it's a fair assumption that they put more time and attention into this process to make their SORT as good as practical, versus you, who are under a tight deadline.

There are 3rd party sorting utilities for large data sets that probably (ideally) work better for this case. But those are unavailable to you (you can get them, but I don't think you wanted to rush out and get some other utility right away). So, SORT is our best guess for now.

That said, reducing the data set will gain more than any sort utility.

How much detail do you really need? And how much information are you really tracking? For example, if it were, say, web statistics, you may have 1000 pages on your site. But even with hourly numbers for a year, 365 * 24 * 1000, that's only 8.7M "buckets" of information -- a far cry from 1B.

So, is there any preprocessing you can do that does not require sorting? Summarizing the information into a coarser granularity? You can do that without sorting, simply using memory-based hash maps. Even if you don't have "enough memory" to process all 100GB of data in one throw, you probably have enough to do it in chunks (5 chunks, 10 chunks) and write out the intermediary results.

You may also have a lot better luck splitting the data as well, into monthly or weekly file chunks. Maybe that's not easily done because the data is "mostly" sorted. But, in that case, if it's by date, the offenders (i.e. the data that's out of sort) may well be clustered within the file, with the "out of order" stuff being just mixed up on the barriers of the time periods (like around day transitions; maybe you have rows like 11:58pm, 11:59pm, 00:00am, 00:01am, 11:58pm, 00:02am). You might be able to leverage that heuristic as well.

The goal being that if you can somewhat deterministically determine the subset that's out of order, and break the file up into chunks of "in order data" and "out of order data", your sorting task may be MUCH MUCH smaller. Sort the few rows that are out of order, and then you have a merge problem (much simpler than a sorting problem).
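
If that heuristic holds, a single pass can do the split by tracking the largest key seen so far; anything that compares lower than it goes to the (hopefully small) out-of-order file. This is only a sketch of that idea, with an assumed key offset and made-up names:

    // Sketch: one pass over a "mostly sorted" log, splitting it into an
    // in-order stream and an out-of-order stream by tracking the largest
    // key seen so far. The key offset is an assumed log layout.
    using System.IO;

    static class SplitByOrder
    {
        static string Key(string line) { return line.Substring(5, 23); }

        public static void Split(string inputPath, string inOrderPath, string outOfOrderPath)
        {
            string maxKeySoFar = null;
            using (var inOrder = new StreamWriter(inOrderPath))
            using (var outOfOrder = new StreamWriter(outOfOrderPath))
            {
                foreach (var line in File.ReadLines(inputPath))
                {
                    string key = Key(line);
                    if (maxKeySoFar == null || string.CompareOrdinal(key, maxKeySoFar) >= 0)
                    {
                        inOrder.WriteLine(line);    // still in order; extend the sorted run
                        maxKeySoFar = key;
                    }
                    else
                    {
                        outOfOrder.WriteLine(line); // only these lines need real sorting
                    }
                }
            }
        }
    }

Sorting just the out-of-order file and merging it back (e.g. with sort -m, as mentioned above) would then finish the job.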

So, those are tactics you can take approaching the problem. Summarization is obviously the best one, as anything that reduces this data load in any measurable way is likely worth the trouble. Of course it all boils down to what you really want from the data; clearly the reports will drive that. This is also a nice point about "premature optimization": if they're not reporting on it, don't process it :).
