How to deal with a very large text file?

Question

I'm currently writing something that needs to handle very large text files (a few GiB at least). What's needed here (and this is fixed) is:

  • Based on CSV, following RFC 4180 (with the exception of embedded line breaks)
  • Random read access to lines, though mostly line by line and near the end
  • Appending lines at the end
  • (Changing lines.) Obviously that calls for rewriting the rest of the file; it's also rare, so not particularly important at the moment

The size of the file forbids keeping it completely in memory (which is also not desirable, since when appending the changes should be persisted as soon as possible).
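Since appends should hit the disk quickly, one option is to keep a dedicated channel open in append mode and force it after each write. A minimal sketch, assuming UTF-8 and CRLF line endings (both are assumptions, not requirements of the approach):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Append a line and flush it to disk immediately (sketch).
public final class AppendingWriter implements AutoCloseable {
    private final FileChannel channel;

    public AppendingWriter(Path file) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    public void appendLine(String line) throws IOException {
        ByteBuffer bytes = StandardCharsets.UTF_8.encode(line + "\r\n");
        while (bytes.hasRemaining()) {
            channel.write(bytes);
        }
        channel.force(false); // flush file data now; pass true to also flush metadata
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```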

I have thought of using a memory-mapped region as a window into the file which gets moved around if a line outside its range is requested. Of course, at that stage I still have no abstraction above the byte level. To actually work with the contents I have a CharsetDecoder giving me a CharBuffer. Now the problem is, I can deal with lines of text probably just fine in the CharBuffer, but I also need to know the byte offset of that line within the file (to keep a cache of line indexes and offsets so I don't have to scan the file from the beginning again to find a specific line).
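For illustration, here is a minimal sketch of that windowing idea (the window size and file name are placeholders). The decoder configuration makes the central caveat explicit: a window boundary can cut a multi-byte sequence in half, which is exactly the byte-vs-char bookkeeping problem described next.

```java
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WindowDemo {
    static final int WINDOW_SIZE = 1 << 20; // hypothetical 1 MiB window

    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(Path.of("data.csv"), StandardOpenOption.READ)) {
            // Map a window near the end, where most accesses are expected.
            long start = Math.max(0, channel.size() - WINDOW_SIZE);
            MappedByteBuffer window =
                    channel.map(FileChannel.MapMode.READ_ONLY, start, channel.size() - start);

            // REPORT makes a window that starts mid-character fail loudly
            // instead of silently producing replacement characters.
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            CharBuffer chars = decoder.decode(window);
            System.out.println("decoded " + chars.length() + " chars from byte offset " + start);
        }
    }
}
```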

Is there a way to map the offsets in a CharBuffer to offsets in the matching ByteBuffer at all? It's obviously trivial with ASCII or ISO-8859-*, less so with UTF-8 and with ISO 2022 or BOCU-1 things would get downright ugly (not that I actually expect the latter two, but UTF-8 should be the default here – and still poses problems).
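One way to get such a mapping without re-encoding everything is to decode in bounded steps and record a checkpoint after each step: CharsetDecoder.decode(in, out, endOfInput) leaves the input ByteBuffer and output CharBuffer positioned so that everything consumed corresponds exactly to everything produced. A sketch of that idea (the class and record names are mine, not from any library):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetMap {
    // A point where byte and char offsets are known to line up.
    record Checkpoint(int byteOffset, int charOffset) {}

    // Assumes `chars` has room for the fully decoded input.
    static List<Checkpoint> decodeWithCheckpoints(ByteBuffer bytes, CharBuffer chars)
            throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        List<Checkpoint> checkpoints = new ArrayList<>();
        CharBuffer step = CharBuffer.allocate(256); // decode at most 256 chars per step
        while (true) {
            step.clear();
            CoderResult result = decoder.decode(bytes, step, true);
            if (result.isError()) result.throwException();
            step.flip();
            chars.put(step);
            // All bytes before bytes.position() decoded to all chars before chars.position().
            checkpoints.add(new Checkpoint(bytes.position(), chars.position()));
            if (result.isUnderflow()) break; // input exhausted
        }
        return checkpoints;
    }
}
```

Between two checkpoints there are at most 256 chars, so mapping an arbitrary char offset back to a byte offset only ever requires re-examining a small span.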

I guess I could just convert a portion of the CharBuffer to bytes again and use the length. Either it works or I get problems with diacritics in which case I could probably mandate the use of NFC or NFD to assure that the text is always unambiguously encoded.
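That fallback is essentially a one-liner; it just pays the encoding cost twice:

```java
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Byte length of a span of chars, obtained by re-encoding it (sketch).
static int byteLength(CharSequence text, int from, int to) {
    return StandardCharsets.UTF_8.encode(CharBuffer.wrap(text, from, to)).remaining();
}
```

For UTF-8, re-encoding exactly the chars that were decoded reproduces the original bytes, so diacritics only become a problem once the text is transformed (e.g. normalized) in between; a stateful encoding like ISO 2022 offers no such guarantee.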

Still, I wonder if that is even the way to go here. Are there better options?

ETA: Some replies to common questions and suggestions here:

This is a data storage for simulation runs, intended to be a small-ish local alternative to a full-blown database. We do have database backends as well and they are used, but for cases where they are unavailable or not applicable we do want this.

I'm also only supporting a subset of CSV (without embedded line breaks), but that's ok for now. The problematic points here are pretty much that I cannot predict how long the lines are and thus need to create a rough map of the file.

As for what I outlined above: The problem I was pondering was that I can easily determine the end of a line on the character level (U+000D + U+000A), but I didn't want to assume that this looks like 0D 0A on the byte level (which already fails for UTF-16, for example, where it's either 0D 00 0A 00 or 00 0D 00 0A). My thoughts were that I could make the character encoding changeable by not hard-coding details of the encoding I currently use. But I guess I could just stick to UTF-8 and ignore everything else. Feels wrong, somehow, though.

Answer

It's very difficult to maintain a 1:1 mapping between a sequence of Java chars (which are effectively UTF-16) and bytes which could be anything depending on your file encoding. Even with UTF-8, the "obvious" mapping of 1 byte to 1 char only works for ASCII. Neither UTF-16 nor UTF-8 guarantees that a Unicode character can be stored in a single machine char or byte.
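A three-line demonstration of the point, using U+1D11E (MUSICAL SYMBOL G CLEF) as the example character:

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));             // one character
        System.out.println(clef.length());                                // 2 Java chars (a surrogate pair)
        System.out.println(clef.codePointCount(0, clef.length()));        // 1 code point
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length); // 4 UTF-8 bytes
    }
}
```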

I would maintain my window into the file as a byte buffer, not a char buffer. Then to find line endings in the byte buffer, I'd encode the Java string "\r\n" (or possibly just "\n") as a byte sequence using the same encoding as the file is in. I'd then use that byte sequence to search for line endings in the byte buffer. The position of a line ending in the buffer + the offset of the buffer from the start of the file maps exactly to the byte position in the file of the line ending.
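A sketch of that search, with the terminator encoded up front (the class and helper names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LineScanner {
    // Encode the terminator once, in the file's charset. Caution: the generic
    // "UTF-16" charset prepends a BOM when encoding; use StandardCharsets.UTF_16LE
    // or UTF_16BE to get bare code units.
    static final byte[] EOL = "\r\n".getBytes(StandardCharsets.UTF_8);

    // Returns the buffer-relative offset of the next line ending at or after
    // `from`, or -1. Add the window's file offset for the absolute position.
    static int indexOf(ByteBuffer haystack, byte[] needle, int from) {
        outer:
        for (int i = from; i <= haystack.limit() - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack.get(i + j) != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}
```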

Appending lines is just a case of seeking to the end of the file and adding your new lines. Changing lines is trickier. I think I would maintain a list or map of byte positions of changed lines and what the change is. When ready to write the changes (a sketch follows the list):

  1. Sort the list of changes by byte position
  2. Read the original file up to the next change and write it to a temporary file.
  3. Write the changed line to the temporary file.
  4. Skip the changed line in the original file.
  5. Go back to step 2 unless you have reached the end of the original file.
  6. Move the temporary file over the original one.
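A minimal sketch of those six steps, assuming the caller's bookkeeping supplies each change as (byte offset, old byte length, replacement line); the record and method names are placeholders:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.Comparator;
import java.util.List;

public class ChangeWriter {
    // Replaces `oldLength` bytes at `offset` with `newLine` (terminator included).
    record Change(long offset, int oldLength, String newLine) {}

    static void applyChanges(Path file, List<Change> changes) throws IOException {
        changes.sort(Comparator.comparingLong(Change::offset));               // step 1
        Path temp = Files.createTempFile(file.toAbsolutePath().getParent(), "csv", ".tmp");
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file));
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(temp))) {
            long pos = 0;
            for (Change c : changes) {
                pos += copy(in, out, c.offset() - pos);                       // step 2
                out.write(c.newLine().getBytes(StandardCharsets.UTF_8));      // step 3
                in.skipNBytes(c.oldLength());                                 // step 4
                pos += c.oldLength();
            }
            copy(in, out, Long.MAX_VALUE);                                    // step 5: the tail
        }
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);          // step 6
    }

    // Copy up to `count` bytes; returns how many were actually copied.
    static long copy(InputStream in, OutputStream out, long count) throws IOException {
        byte[] buf = new byte[8192];
        long copied = 0;
        while (copied < count) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, count - copied));
            if (n < 0) break;
            out.write(buf, 0, n);
            copied += n;
        }
        return copied;
    }
}
```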

