Java: reading strings from a random access file with buffered input
Question
I've never worked closely with the Java IO API before, and I'm really frustrated now. I find it hard to believe how strange and complex it is, and how hard it can be to do a simple task.
My task: I have two positions (starting byte and ending byte), pos1 and pos2. I need to read the lines between these two bytes (including the starting one, excluding the ending one) and use them as UTF-8 String objects.
For example, in most scripting languages this would be a very simple one- to three-liner like the following (in Ruby, but it would be essentially the same in Python, Perl, etc.):
f = File.open("file.txt")
f.seek(pos1)
while f.pos < pos2
  s = f.readline
  # do something with "s" here
end
It quickly becomes hell with the Java IO APIs ;) In fact, I see two ways to read lines (ending with \n) from regular local files:
- RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() does not return UTF-8 strings (or even byte arrays); it returns very strange strings with broken encoding. It also has no buffering, which probably means that every read*() call is translated into a single underlying OS read(), which is fairly slow.
- BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but it has no way to determine even the number of bytes that have already been read, let alone the current position in the file.
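The "broken encoding" in the first point has a mechanical cause: RandomAccessFile.readLine() turns each byte into one char by zero-extending it, so the result can be mapped back to the original bytes and decoded properly. A minimal sketch of that workaround follows; the class name RafLineReader and the method readLines are made up for illustration, and the reads are still unbuffered, so this addresses only the encoding, not the speed.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK.
class RafLineReader {
    /** Reads every line that starts at a byte offset in [pos1, pos2) and decodes it as UTF-8. */
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(pos1);
            while (raf.getFilePointer() < pos2) {
                // readLine() yields one char per byte (high eight bits set to zero)
                String raw = raf.readLine();
                if (raw == null) {
                    break; // end of file
                }
                // ISO-8859-1 maps chars 0..255 back to the same byte values
                byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
                lines.add(new String(bytes, StandardCharsets.UTF_8));
            }
        }
        return lines;
    }
}
```

Note that readLine() also treats \r and \r\n as line terminators, which may or may not be what is wanted here.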
I've tried to use something like:
FileInputStream fis = new FileInputStream(fileName);
FileChannel fc = fis.getChannel();
BufferedReader br = new BufferedReader(
new InputStreamReader(
fis,
CHARSET_UTF8
)
);
... and then using fc.position() to get the current file reading position and fc.position(newPosition) to set it, but it doesn't seem to work in my case: it looks like it returns the position after the buffer pre-filling done by BufferedReader, or something like that. These counters seem to be rounded up in 16K increments.
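The effect described here can be reproduced with a small experiment. The sketch below (ChannelPositionDemo and positionAfterFirstLine are hypothetical names) reads a single line through a BufferedReader and then asks the channel where it is. On a file larger than the reader's internal buffers, the reported position is expected to be well past the end of the first line, because the reader layers have already pulled a whole block from the stream; the exact block size is an implementation detail.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class ChannelPositionDemo {
    /** Returns the channel position observed right after the first readLine() call. */
    static long positionAfterFirstLine(String fileName) throws IOException {
        try (FileInputStream fis = new FileInputStream(fileName);
             BufferedReader br = new BufferedReader(
                     new InputStreamReader(fis, StandardCharsets.UTF_8))) {
            FileChannel fc = fis.getChannel();
            br.readLine(); // consumes one line logically, but reads ahead physically
            return fc.position();
        }
    }
}
```

So the channel reports how far the underlying stream has been consumed, not how far the caller has logically read.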
Do I really have to implement it all by myself, i.e. a file reading interface which would:
- allow me to get/set the position in a file
- buffer file reading operations
- allow reading UTF-8 strings (or at least allow operations like "read everything up to the next \n")
Is there a quicker way than implementing it all myself? Am I overlooking something?
Recommended answer
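import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.BoundedInputStream;

FileInputStream file = new FileInputStream(filename);
file.skip(pos1); // move to pos1; skip() returns the number of bytes actually skipped
BufferedReader br = new BufferedReader(
    // stop after pos2 - pos1 bytes and decode them as UTF-8, as the question requires
    new InputStreamReader(new BoundedInputStream(file, pos2 - pos1), StandardCharsets.UTF_8)
);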
If you didn't care about pos2, then you wouldn't need Apache Commons IO.
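If pos2 does matter but an extra dependency is unwelcome, the same idea can be sketched with the JDK alone by counting bytes by hand. The class ByteRangeLines below is hypothetical, and it makes one assumption that differs from a hard byte limit: like the Ruby loop in the question, a line that starts before pos2 is read in full even if it ends after pos2.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original answer.
class ByteRangeLines {
    /** Reads every \n-terminated line that starts at a byte offset in [pos1, pos2). */
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        List<String> lines = new ArrayList<>();
        try (FileInputStream fis = new FileInputStream(fileName)) {
            fis.getChannel().position(pos1); // absolute seek, done before buffering starts
            InputStream in = new BufferedInputStream(fis);
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            long pos = pos1; // byte-accurate position, tracked by hand
            while (pos < pos2) {
                line.reset();
                int b;
                while ((b = in.read()) != -1) {
                    pos++;
                    if (b == '\n') {
                        break;
                    }
                    line.write(b);
                }
                if (b == -1 && line.size() == 0) {
                    break; // end of file, nothing pending
                }
                // Decode only complete lines, so multi-byte characters are never split
                lines.add(new String(line.toByteArray(), StandardCharsets.UTF_8));
                if (b == -1) {
                    break; // last line had no trailing newline
                }
            }
        }
        return lines;
    }
}
```

Collecting raw bytes per line and decoding afterwards keeps the position counter exact, which is the piece BufferedReader cannot provide.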