Fastest Way to Parse Large Strings (multi threaded)


Problem Description

I am about to start a project which will take blocks of text, parse a lot of data from them into some sort of object which can then be serialized, stored, and have statistics / data gleaned from it. This needs to be as fast as possible, as I have > 10,000,000 blocks of text to start on and will be getting hundreds of thousands more a day.

I am running this on a system with 12 Xeon cores + hyper-threading. I also have access to / know a bit about CUDA programming, but for string work I think it's not appropriate. From each string I need to parse a lot of data; some of it I know the exact positions of, some I don't, and for that I need to use regexes / something smart.

So consider something like this:

object[] ParseAll(string[] stringsToParse)
{
    var results = new object[stringsToParse.Length];
    Parallel.For(0, stringsToParse.Length, i =>
        results[i] = Parse(stringsToParse[i]));
    return results;
}

object Parse(string s)
{
    // try to use exact positions / Substring etc. here instead of regexes
    return null; // placeholder
}

So my questions are:


  • How much slower are regexes compared to Substring?
  • Is .NET going to be significantly slower than other languages?
  • What sort of optimizations (if any) can I do to maximize parallelism?
  • Anything else I haven't considered?

Thanks for any help! Sorry if this is long-winded.

Recommended Answer

How much slower are regexes compared to Substring?
If you are looking for an exact string, Substring will be faster. Regular expressions, however, are highly optimized. They (or at least parts of them) are compiled to IL, and you can even store these compiled versions in a separate assembly using Regex.CompileToAssembly. See http://msdn.microsoft.com/en-us/library/9ek5zak6.aspx for more information.
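As a rough sketch of that trade-off (the record layout, the `ID:` prefix, and the field width below are made-up assumptions, not from the question): when the position is known exactly, Substring slices directly; when it is not, a compiled Regex is the flexible fallback.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexVsSubstring
{
    // Hypothetical record format: "ID:" followed by a 10-character ID.
    // RegexOptions.Compiled emits the matcher as IL for repeated use.
    static readonly Regex IdPattern =
        new Regex(@"ID:(\w{10})", RegexOptions.Compiled);

    // Fast path: the offset is known exactly, so slice directly.
    static string ParseByPosition(string line)
    {
        return line.Substring(3, 10);
    }

    // Flexible path: the position is unknown, so match the pattern.
    static string ParseByRegex(string line)
    {
        return IdPattern.Match(line).Groups[1].Value;
    }

    static void Main()
    {
        string line = "ID:AB12345678 rest of the record";
        Console.WriteLine(ParseByPosition(line)); // AB12345678
        Console.WriteLine(ParseByRegex(line));    // AB12345678
    }
}
```

Both paths extract the same field; measuring them on real input (see the Stopwatch advice below in the answer's own words: measure) is what settles the cost.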

What you really need to do is perform measurements. Using something like Stopwatch is by far the easiest way to verify whether one or the other code construct works faster.
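A minimal Stopwatch harness along these lines can compare two candidate constructs; the labels, iteration count, and sample line are illustrative only.

```csharp
using System;
using System.Diagnostics;

class MeasureParsers
{
    // Runs an action many times and reports the elapsed wall-clock time.
    static long TimeIt(string label, int iterations, Action body)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            body();
        sw.Stop();
        Console.WriteLine(label + ": " + sw.ElapsedMilliseconds + " ms");
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        string line = "ID:AB12345678 rest of the record";
        // Compare the two extraction strategies head to head.
        TimeIt("Substring", 1000000, () => line.Substring(3, 10));
        TimeIt("IndexOf", 1000000, () => line.IndexOf("ID:"));
    }
}
```

In practice you would run each candidate a few times and discard the first (cold, JIT-affected) run before comparing numbers.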

What sort of optimizations (if any) can I do to maximize parallelism?
With Task.Factory.StartNew, you can schedule tasks to run on the thread pool. You may also have a look at the TPL (the Task Parallel Library, of which Task is a part). This has lots of constructs that help you parallelize work, and it allows constructs like Parallel.ForEach() to execute an iteration on multiple threads. See http://msdn.microsoft.com/en-us/library/dd460717.aspx for more information.
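A small sketch of Parallel.For partitioning the parse across the thread pool; Parse here is a stand-in (it just returns the string length) rather than real parsing logic.

```csharp
using System;
using System.Threading.Tasks;

class ParallelParsing
{
    // Stand-in for the real parse step (assumed, not from the answer).
    static int Parse(string s)
    {
        return s.Length;
    }

    static int[] ParseAll(string[] blocks)
    {
        var results = new int[blocks.Length];
        // Parallel.For partitions the index range across worker threads;
        // each index is written by exactly one thread, so no locking is needed.
        Parallel.For(0, blocks.Length, i => results[i] = Parse(blocks[i]));
        return results;
    }

    static void Main()
    {
        var blocks = new[] { "one", "three", "fifteen" };
        var lengths = ParseAll(blocks);
        Console.WriteLine(string.Join(",", lengths)); // 3,5,7
    }
}
```

Writing each result into a distinct slot of a pre-sized array is the simplest way to stay lock-free; a ConcurrentBag or similar is only needed when results cannot be indexed ahead of time.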

Anything else I haven't considered?
One of the things that will hurt you with this volume of data is memory management. A few things to take into account:


  • Limit memory allocation: try to re-use the same buffers for a single document instead of copying them when you only need a part. Say you need to work on a range starting at char 1000 and running to 2000: don't copy that range into a new buffer, but construct your code to work only within that range. This will make your code more complex, but it saves you memory allocations;
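One way to illustrate working inside a range without copying it out is the String.IndexOf overload that takes a start index and count; the document layout and key name below are invented for the example.

```csharp
using System;

class RangeParsing
{
    // Finds "key=" inside the window [start, start + count) and returns the
    // index where the value begins, without copying the slice to a new string.
    static int FindValue(string doc, string key, int start, int count)
    {
        int pos = doc.IndexOf(key + "=", start, count, StringComparison.Ordinal);
        return pos < 0 ? -1 : pos + key.Length + 1;
    }

    static void Main()
    {
        string doc = "header... name=alice; age=30 ...footer";
        // Search only the interesting window instead of Substring-ing it out.
        int valueStart = FindValue(doc, "age", 10, doc.Length - 10);
        Console.WriteLine(doc.Substring(valueStart, 2)); // 30
    }
}
```

The single small Substring at the end extracts just the value; the scan itself never allocated an intermediate copy of the window.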

StringBuilder is an important class. If you don't know it yet, have a look.
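For completeness, a tiny sketch of why StringBuilder matters here: '+' concatenation in a loop allocates a fresh string on every pass, while StringBuilder appends into one growing buffer. The field-joining helper is a made-up example, not part of the answer.

```csharp
using System;
using System.Text;

class BuilderExample
{
    // Builds one output string from many fields with a single buffer,
    // instead of allocating a new intermediate string per '+' operation.
    static string JoinFields(string[] fields)
    {
        var sb = new StringBuilder();
        foreach (var f in fields)
            sb.Append(f).Append('|');
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(JoinFields(new[] { "a", "b", "c" })); // a|b|c|
    }
}
```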
