Fastest way to parse a file; most efficient way to store the data?


Problem description


Hi Everyone,
I am trying to write a program that does a few things very fast
and with efficient use of memory...

a) I need to parse a space-delimited file that is really large,
upwards of a million lines.
b) I need to store the contents into a unique hash.
c) I need to then sort the data on a specific field.
d) I need to pull out certain fields and report them to the user.

So my questions are as follows:

o What is the most efficient way to pull fields out of a file?
It does not appear that "tokenizing" the string of each line
into an array via split is very efficient.... is there another way?
o I need to take the id of each field and ensure that it does not
already exist in my "collection". Is using the arraylist the best way
to do this? Should I take the data from each line of the file and make
it into an object or a STRUCT? Which is faster and better from a
memory standpoint?
o Which collection has the best sorting capabilities?
o I assume that a for/next type of loop is the best way to
iterate through this collection to output the data... is there a better
way?

Thanks for any help.

Recommended answer


Hi,

A million rows is a lot for anything short of an RDBMS, so it will be
intensive no matter what you do.

a) I need to parse a space-delimited file that is really large,
upwards of a million lines.
b) I need to store the contents into a unique hash.
c) I need to then sort the data on a specific field.
d) I need to pull out certain fields and report them to the user.

So my questions are as follows:

o What is the most efficient way to pull fields out of a file?
It does not appear that "tokenizing" the string of each line
into an array via split is very efficient.... is there another way?
You will end up doing it eventually, either on the fly (if you keep the
entire line) or by splitting it at reading time and keeping a separate
instance for each token.
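
For illustration, here is a minimal C# sketch of the split-at-reading-time approach; the file name "data.txt" and the assumption that the id is the first field are both invented for the example:

using System;
using System.Collections;
using System.IO;

class SplitAtReadTime
{
    static void Main()
    {
        ArrayList ids = new ArrayList();
        using (StreamReader reader = new StreamReader("data.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Split once at reading time and keep only the tokens
                // you need, instead of holding on to the whole line.
                string[] tokens = line.Split(' ');
                ids.Add(tokens[0]);
            }
        }
        Console.WriteLine("Read {0} ids.", ids.Count);
    }
}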
o I need to take the id of each field and ensure that it does not
already exist in my "collection". Is using the arraylist the best way
to do this? Should I take the data from each line of the file and make
it into an object or a STRUCT? Which is faster and better from a
memory standpoint?
I don't think that an ArrayList is the best for this (or for sorting);
you should use either a b-tree or a binary tree (IIRC a b-tree is
better).
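
The poster suggests a tree; the closest ready-made equivalent in the .NET 1.x class library is SortedList, which keeps its keys ordered on every insert and whose key lookup doubles as the uniqueness check. A minimal sketch with invented sample ids, using SortedList as a stand-in for a hand-rolled tree:

using System;
using System.Collections;

class SortedInsertDemo
{
    static void Main()
    {
        SortedList records = new SortedList();
        string[] ids = { "42", "7", "42", "19" };
        foreach (string id in ids)
        {
            // Contains doubles as the uniqueness check;
            // duplicate ids are simply skipped.
            if (!records.Contains(id))
                records.Add(id, id);   // the value could be the parsed record
        }
        // Iteration yields the entries already sorted by key.
        foreach (DictionaryEntry entry in records)
            Console.WriteLine(entry.Key);
    }
}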
o Which collection has the best sorting capabilities?
It depends; if you use a binary tree, for example, you will build it
already sorted, so all you have to do is an in-order iteration and you
will get the items sorted.

Now if the sorting criteria change on the fly, that is another story. :)
o I assume that a for/next type of loop is the best way to
iterate through this collection to output the data... is there a better
way?





Any way you go, it will be processor intensive as well as memory
intensive; 1 million is a lot of records, after all.

Cheers,

--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation



hoopsho,

What I would do is use a FileStream (perhaps with a BufferedStream on
top to reduce reads to the disk) to read the contents of the file. Since
you say the Split method is too slow, I recommend that you read character by
character, tokenizing the lines yourself.
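
A minimal sketch of that character-by-character approach, with the FileStream wrapped in a BufferedStream as suggested; "data.txt" is a placeholder for the real input file:

using System;
using System.Collections;
using System.IO;
using System.Text;

class CharTokenizer
{
    static void Main()
    {
        using (Stream stream = new BufferedStream(
                   new FileStream("data.txt", FileMode.Open, FileAccess.Read)))
        using (StreamReader reader = new StreamReader(stream))
        {
            StringBuilder token = new StringBuilder();
            ArrayList fields = new ArrayList();   // fields of the current line
            int c;
            while ((c = reader.Read()) != -1)
            {
                char ch = (char)c;
                if (ch == ' ' || ch == '\n')
                {
                    if (token.Length > 0)
                    {
                        fields.Add(token.ToString());
                        token.Length = 0;   // reset the builder for the next token
                    }
                    if (ch == '\n')
                    {
                        // One complete line has been tokenized; process
                        // "fields" here, then start the next line.
                        fields.Clear();
                    }
                }
                else if (ch != '\r')
                {
                    token.Append(ch);
                }
            }
            if (token.Length > 0)
                fields.Add(token.ToString());   // last token of an unterminated line
        }
    }
}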

If you need to make sure that every line is unique, then you would want
to use a struct. However, storing these in a Hashtable will mean that you
have to box and unbox the value every time you want to access this, and
that's extremely inefficient.

The reason you would use a structure is that it would provide you with
hash values that are equal if all of the field values are the same.

To get around this, I would recommend creating a structure that has all
the fields you will populate. Then, create a class wrapper around the
structure, which has a private field of the type of the structure. Override
the GetHashCode to return the hash code from the structure. Override Equals
as well to return true when the hash codes match.
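
A minimal sketch of that arrangement; the names RowData and Row, and the fields they contain, are invented for illustration:

using System;

// Hypothetical record layout; the real fields depend on the file format.
struct RowData
{
    public string Id;
    public string Name;

    public override int GetHashCode()
    {
        // Equal field values yield equal hash codes.
        return Id.GetHashCode() ^ Name.GetHashCode();
    }
}

// Class wrapper around the struct, so the Hashtable stores a
// reference instead of boxing and unboxing the struct itself.
class Row
{
    private RowData data;

    public Row(RowData data)
    {
        this.data = data;
    }

    public override int GetHashCode()
    {
        return data.GetHashCode();   // delegate to the struct
    }

    public override bool Equals(object obj)
    {
        // As described above: equal when the hash codes match.
        Row other = obj as Row;
        return other != null && other.GetHashCode() == GetHashCode();
    }
}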

Then, I would store that in the hashtable. When adding a new row, check
to see if there is a value in the hashtable already. If there is, then
ignore it, if not, then store it.
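
That check-then-store step, as a short sketch using the hypothetical Row wrapper above:

using System.Collections;

class Dedupe
{
    // Stores the row only if an equal row is not already present;
    // returns true when the row was new.
    static bool AddIfAbsent(Hashtable seen, Row row)
    {
        if (seen.ContainsKey(row))
            return false;   // duplicate row: ignore it
        seen.Add(row, row);
        return true;
    }
}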

Then you have the issue of sorting the data on a specific field. The
hashtable isn't going to give you anything, so then I would recommend a
DataSet, or use an ArrayList that holds references to the same items that
are in the Hashtable. Storing a million row table in a data set is going to
crush your server (unless you have a TON of memory that you can spare), and
generally isn't a good idea. The hashtable has the same drawbacks. If you
don't want to use the DataSet and you know the field(s) you want to sort on,
then I say that as you add the items to the hashtable, perform a binary
search through the ArrayList and insert the new value at the position in the
arraylist where it should go. This is an issue as well. Say you have to
insert a reference at the beginning of the list for the millionth record.
Moving 999,999 records forward is not going to be quick.
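
A sketch of that sorted insert; ArrayList.BinarySearch returns the bitwise complement of the insertion point when the value is not found, and the comparer is whatever IComparer matches your sort field:

using System.Collections;

class SortedArrayList
{
    // Inserts value into list, keeping list sorted under comparer.
    static void InsertSorted(ArrayList list, object value, IComparer comparer)
    {
        int index = list.BinarySearch(value, comparer);
        if (index < 0)
            index = ~index;          // complement of the insertion point
        list.Insert(index, value);   // the O(n) shift noted above
    }
}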

It is this kind of scenario where you might want to consider unsafe
code, allocating a buffer of memory, and moving the records around yourself.

In the end, I would recommend that you use a RDBMS to store the values
temporarily. The hashtable scheme used before can help here. If you detect
that you don't have a row with those values, then you can insert it,
otherwise, do nothing. Granted, the hashtable is going to incur overhead,
and a lot. However, it would reduce the number of operations against the
database that you have to perform, depending on the number of duplicates you
have (on average).

Finally, when you want to output the results, you can just perform a
query, selecting the fields you want, and sorting it how you wish.

You are ultimately going to make your decision based on a number of
factors. The biggest one is the amount of RAM that you have on the machine.
If you have a good amount that could support this much information in your
structure in memory, then by all means, use a DataSet. However, if you do
not, or the number of duplicate records is high, then go with the
Hashtable/DB solution, as it would be much faster than looping through all of
the items yourself, comparing fields, trying to sort, etc, etc.

Hope this helps.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

"hoopsho" <ho*****@gmail.com> wrote in message
news:11**********************@c13g2000cwb.googlegr oups.com...
大家好,
我正在努力编写一个程序,可以非常快速地完成一些事情并有效地使用内存......

a)我需要解析一个空间分隔的文件,这个文件非常大,超过一百万行。
b)我需要将内容存储到一个唯一的哈希。
c)我需要再排序关于特定领域的数据。
d)我需要提取某些字段并将其报告给用户。

所以我的问题如下:

o从文件中提取字段的最有效方法是什么?
它似乎不是令牌化的标记。通过拆分将每行的字符串转换成数组是非常有效的....还有另一种方法吗?
o我需要取每个字段的id并确保它不会
已经存在于我的集合中。是使用arraylist最好的方式来做这个吗?我应该从文件的每一行获取数据并将其作为对象或STRUCT吗?从内存的角度来看哪个更快更好?
o哪个集合具有最佳的排序功能?
o我认为for / next类型的循环是
遍历此集合以输出数据......有更好的方法吗?

感谢您的帮助。
Hi Everyone,
I am trying to write a program that does a few things very fast
and with efficient use of memory...

a) I need to parse a space-delimited file that is really large,
upwards fo a million lines.
b) I need to store the contents into a unique hash.
c) I need to then sort the data on a specific field.
d) I need to pull out certain fields and report them to the user.

So my questions are as follows:

o What is the most effiecient way to pull fields out of a file?
It does not appear that "tokenizing" the string of each line
into an array via split is very efficient.... is there another way?
o I need to take the id of each field and ensure that it does not
already exist in my "collection". Is using the arraylist the best way
to do this? Should I take the data from each line of the file and make
it into an object or a STRUCT? Which is faster and better from a
memory standpoint?
o Which collection has the best sorting capabilities?
o I assume that a for/next type of loop is the best way to
iterate through this collection to output the data... is there a better
way?

Thanks for any help.



First off, let me point out that in computing, you are usually facing a
tradeoff between memory and speed. The question, "What is the fastest,
most memory-efficient way to do this?" is like asking, "What is the
quickest, cheapest way to get to Paris?" (Assuming, of course, that
you're not already in Paris. :) Sometimes the fastest way is the
cheapest way, but more often than not you have to make a tradeoff
between speed and cost: airplanes cost more than freighters. So it is
with speed and memory: sometimes the fastest way is also the most
memory-efficient, but more often than not you have to trade one for the
other.

That said, your solution depends very much upon whether the field
you're sorting on is always also the field that you require to be unique.
If it is, then I suggest that you use some form of tree structure
(research the already-available .NET collections), which will sort your
items on the fly and give you some indication of uniqueness at the same
time. Since you have to sort anyway, you might as well do that and your
uniqueness check all at once.

However, if you could potentially be determining uniqueness on one
field and sorting on a different field, then there's no value in
determining uniqueness using anything other than a hash table. A hash
table will give you lightning-fast lookup capabilities to determine if
you've already seen a key. There's only one thing faster, which is a
very, very big B-tree, but it uses up tons of memory so I wouldn't go
that way. Hash tables are robust and fast.

As for sorting, you should either build a tree structure or use the
quicksort algorithm. Both methods are reasonably quick. I wouldn't
suggest using insertion sort, which is what Nicholas was suggesting
(sorry), because with a million records you'll _definitely_ notice a
performance difference. The Array class contains a Sort method, but it
doesn't mention which algorithm it uses, although I must suppose that
if the MS people who wrote the Framework didn't use quicksort, or
something even faster (yes, there are a few faster algorithms), then
they're not too sharp.

Finally, there's the problem of storage. Yes, you can parse each line
and blow it out into an array of strings, but then you pay for all of
those extra strings if you have to write the lines out again. As well,
if you're doing a quicksort, you have to shuffle (potentially) large
records around in memory.

Another way to solve the problem is to create a small class containing
a string, an offset, and a length. If you use short integers for the
offset and the length you can pare this down to 64 bits. When you read
in a line, and you want to represent field #15, for example, make a new
one of these objects, set the string pointer to your line you read in,
and the offset and the length to indicate where your field starts and
how long it is.
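
A minimal sketch of that small class, named MyField to match the comparer below; the members String, Offset, and Length are the ones the comparer uses, and the GetHashCode shown (which hashes a small substring) is just one simple choice:

using System;

// A reference to the original input line plus the offset and
// length of one field within it: no substring is ever copied.
class MyField
{
    public readonly string String;   // the whole input line
    public readonly short Offset;    // where the field starts
    public readonly short Length;    // how many characters it spans

    public MyField(string line, short offset, short length)
    {
        String = line;
        Offset = offset;
        Length = length;
    }

    public override int GetHashCode()
    {
        // A simple choice: hash only the field's own characters.
        return String.Substring(Offset, Length).GetHashCode();
    }

    public override bool Equals(object obj)
    {
        MyField other = obj as MyField;
        return other != null
            && other.Length == Length
            && string.Compare(String, Offset, other.String, other.Offset, Length) == 0;
    }
}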

Now if you write an IComparer for this structure:

public class FieldComparer : IComparer
{
    public int Compare(object x, object y)
    {
        MyField field1 = (MyField)x;
        MyField field2 = (MyField)y;

        // Fields of different lengths cannot be equal, so order them
        // by length alone and skip the character comparison.
        if (field1.Length != field2.Length)
        {
            return field1.Length - field2.Length;
        }
        else
        {
            // Same length: compare the characters in place, without
            // ever copying the field out into its own string.
            return String.Compare(field1.String, field1.Offset,
                field2.String, field2.Offset, field2.Length);
        }
    }
}

I'm using a class for Field rather than a struct to avoid boxing and
unboxing in the standard Compare method.

So, assuming that you have to determine uniqueness on one field and
sort on another field, here is how I would do it.

Make a Hashtable that will have Field objects as its keys. The values
stored in the Hashtable don't matter, so you might as well use the
lines you're reading from the file. This won't result in any extra
storage for the lines, because if you're storing them as strings then
the runtime will share pointers so long as the input lines themselves
never change.

Make an ArrayList that will hold the Field objects for the fields you
want to eventually sort on.

Read each line, find the field you need to sort on, and the field you
need to verify as unique. Create a Field object for each of them. Check
to see if the Field object for your unique field is already stored in
the Hashtable, and add it if it isn't. Add the Field object for your
sort field to the ArrayList using the Add method. You didn't say
whether you want the record in the output set only if the unique field
is the first occurrence, but you can determine that here, because you
already tried to put it in the hash table.
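
Putting those steps together as a rough sketch, assuming the MyField class and FieldComparer from above, that the unique id is the first space-delimited field and the sort key the second, and a placeholder file name; it ends with the single ArrayList.Sort call the next paragraph describes:

using System;
using System.Collections;
using System.IO;

class ReportBuilder
{
    static void Main()
    {
        Hashtable unique = new Hashtable();    // keys: unique-field MyFields
        ArrayList sortKeys = new ArrayList();  // MyFields for the sort field

        using (StreamReader reader = new StreamReader("data.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int firstSpace = line.IndexOf(' ');
                int secondSpace = firstSpace < 0
                    ? -1 : line.IndexOf(' ', firstSpace + 1);
                if (secondSpace < 0)
                    continue;   // malformed line: skip it

                MyField id = new MyField(line, 0, (short)firstSpace);
                MyField key = new MyField(line, (short)(firstSpace + 1),
                    (short)(secondSpace - firstSpace - 1));

                if (unique.ContainsKey(id))
                    continue;          // keep the first occurrence only
                unique.Add(id, line);  // the line itself is a handy value
                sortKeys.Add(key);
            }
        }

        // One sort at the end, using the comparer defined above.
        sortKeys.Sort(new FieldComparer());
    }
}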

When you've read all the lines, use ArrayList.Sort to sort the array
list using an instance of the IComparer class that you created above.
This will take a while, but it's faster than insertion sort or any
other sort method that you might roll yourself.

Run through the array list and feed the records one by one to some sort
of output object, which will know which fields to pick out and display
to the user. Since your Field class contains the original string
pointer for the line, you can recover the input line and scan it for
the output fields that you want.

The only extra overhead that this introduces is that you scan the input
line twice: once to get your unique / sort fields, and once to get your
output fields. However, I doubt that this will create a significant
performance hit. Not after all of that sorting.
Anyway, there's my solution! Good luck!

