Decoding strategy


Problem description

Hello everyone,

I've got a little problem choosing the best decoding strategy for a nasty problem. I have to deal with very large files which contain text encoded with various encodings. Their length makes loading the contents of a file into memory in a single run inappropriate. I solved this problem by implementing memory mapping using P/Invoke, and I load the contents of a file in chunks. Since the files' contents are in different encodings, what I really do is map a portion of the file into memory and then decode that part using System.Text.Encoding. So far, so good, but it's not difficult to imagine a serious problem with this approach. Since file processing is not, and also cannot be, sequential, and furthermore memory mapping limits the offsets at which a mapping can take place, some mapping can "tear" a character apart. How do I deal with this? I thought of implementing a decoder fallback which would check a few bytes behind the current mapping and try to substitute the unrecognized characters, but I don't know whether that is feasible. I do not know whether the decoder might accidentally mistake a broken character for some valid, but different from expected, character. I guess it depends on the encoding used. What do you think?
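The decoder-fallback idea is close to what a stateful decoder already does when chunks arrive in file order. Here is a minimal sketch using Python's codecs module for illustration (the .NET analogue is Encoding.UTF8.GetDecoder(), which likewise buffers trailing partial bytes between calls); note it assumes sequential chunks, which the question says cannot be guaranteed:

```python
import codecs

# A stateful decoder buffers trailing partial bytes between calls,
# so a multi-byte character "torn" across two chunks is reassembled
# instead of being replaced with U+FFFD.
data = "café".encode("utf-8")            # 5 bytes; 'é' is C3 A9

decoder = codecs.getincrementaldecoder("utf-8")()
out = []
chunk_size = 2                           # deliberately splits 'é'
for pos in range(0, len(data), chunk_size):
    out.append(decoder.decode(data[pos:pos + chunk_size]))
out.append(decoder.decode(b"", final=True))  # flush any leftover bytes

print("".join(out))                      # café
```

A one-shot decode of each chunk in isolation would instead turn the split 'é' into replacement characters, which is exactly the "tearing" described above.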

Recommended answer

I would use a FileStream instance to read the file. The FileStream class supports random access to files, allowing you to jump around in the file. You can read as little or as much as you want into memory when you need to.

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Shooter
http://unclechutney.blogspot.com

A man, a plan, a canal, a palindrome that has... oh, never mind.
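The random-access pattern described above, sketched in Python for illustration (FileStream's Seek and Read work analogously); the file path and contents are invented for the example:

```python
import os
import tempfile

# Create a small stand-in for a "very large file".
path = os.path.join(tempfile.mkdtemp(), "big.txt")
with open(path, "wb") as f:
    f.write(b"0123456789" * 10)

# Open once, then jump to any offset and read only what is needed,
# rather than loading the whole file in a single run.
with open(path, "rb") as f:
    f.seek(42)            # jump around in the file
    chunk = f.read(8)     # read as little or as much as you want

print(chunk)              # b'23456789'
```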






Kevin Spencer wrote:

I would use a FileStream instance to read the file. The FileStream class supports random access to files, allowing you to jump around in the file. You can read as little or as much as you want into memory when you need to.

Hello Kevin,

Thanks for the reply.
I didn't test performance with FileStream, but maybe you can confirm: does FileStream cache the contents of the file in memory? I think there is a slight speedup when using memory mapping, in that I do not have to hit the disk all the time. In my solution I simply open a mapping over the whole file and create views as needed. Anyway, let's say I did it using FileStream: I can read some bytes from it, but I still face the same problem - how do I interpret the first bytes I have read? Are they the beginning of a character, or maybe the end of the "previous" character?
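For self-synchronizing encodings such as UTF-8, that question has a local answer: a continuation byte always has the bit pattern 10xxxxxx, so any offset can be classified without decoding from the start of the file. A sketch in Python (the helper name is invented for illustration; this does not carry over to encodings like Shift-JIS, where a trail byte can also be a valid lead byte):

```python
def next_utf8_boundary(buf: bytes, offset: int) -> int:
    """Smallest index >= offset that starts a UTF-8 character.

    UTF-8 continuation bytes all match 0b10xxxxxx, so skipping
    them lands on the next lead byte (or the end of the buffer).
    """
    while offset < len(buf) and (buf[offset] & 0b1100_0000) == 0b1000_0000:
        offset += 1
    return offset

data = "złoty".encode("utf-8")        # 'ł' occupies bytes 1-2 (C5 82)
print(next_utf8_boundary(data, 2))    # 3: byte 2 is the tail of 'ł'
print(next_utf8_boundary(data, 1))    # 1: byte 1 starts 'ł'
```

A chunk that begins mid-character can thus start decoding at the first boundary and hand the leading trail bytes to whatever decoded the previous chunk.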


Hi Marcin,

I need a little clarification: do you have multiple files where each file could use a different encoding, OR multiple files where multiple encodings are used WITHIN each file?

I'm also confused by your reference to a character "tear". If you could explain that reference, I would find it helpful.

Thanks,

Kim Greenlee
--
digipede - Many legs make light work.
Grid computing for the real world.
http://www.digipede.net
http://krgreenlee.blogspot.net


