Why can't I seem to read an entire compressed file from a URL stream?

Question

I'm trying to parse Wiktionary dumps on the fly, directly from the URL, in Java. The Wiki dumps are distributed as compressed BZIP2 files, and I am using the following approach to attempt to parse them:

String fileURL = "https://dumps.wikimedia.org/cswiktionary/20171120/cswiktionary-20171120-pages-articles-multistream.xml.bz2";
URL bz2 = new URL(fileURL);
BufferedInputStream bis = new BufferedInputStream(bz2.openStream());
CompressorInputStream input = new CompressorStreamFactory().createCompressorInputStream(bis);
BufferedReader br2 = new BufferedReader(new InputStreamReader(input));
System.out.println(br2.lines().count());

However, the reported line count is only 36, a tiny fraction of the whole file, which is over 20MB in size. When I tried printing the stream line by line, only a few lines of XML were actually printed:

String line = br2.readLine();
while(line != null) {
  System.out.println(line);
  line = br2.readLine();
}

Is there something I am missing here? I copied my implementation almost line-for-line from other chunks of code I found online, which others claimed to have worked. Why isn't the entire stream being read? Thanks in advance.

Recommended answer

So as it turns out, I was just being dumb. Wiktionary BZIP2 files are explicitly multistream (it even says so in the filename), and as a result, only one stream was being read in using the vanilla Commons Compress classes. You need a multistream reader in order to read multistream files, and from the looks of things, you have to write one yourself. I happened across the following implementation which worked for me:

https://chaosinmotion.blog/2011/07/29/and-another-curiosity-multi-stream-bzip2-files/
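
To make the idea of a multistream reader concrete, here is a minimal sketch of the general technique: decode one stream, and when it ends, start a fresh decoder on whatever bytes are left. The JDK has no bzip2 decoder, so this uses zlib streams from java.util.zip as a stand-in. The class and method names (MultiStreamDemo, readFirstStream, readAllStreams and so on) are made up for illustration and are not taken from the linked post. For bzip2 itself, it may be worth checking first whether your Commons Compress version has the two-argument constructor BZip2CompressorInputStream(InputStream, boolean decompressConcatenated), which could remove the need for a hand-written reader.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: zlib stands in for bzip2, and all names here are hypothetical.
final class MultiStreamDemo {

    // Compress one string into a single, complete zlib stream.
    static byte[] compress(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Join two compressed streams back to back, like a multistream dump file.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] joined = new byte[a.length + b.length];
        System.arraycopy(a, 0, joined, 0, a.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }

    // Decode exactly one stream starting at offset; return how many input bytes it used.
    static int inflateOne(byte[] data, int offset, ByteArrayOutputStream out) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data, offset, data.length - offset);
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
            // getRemaining() reports the input bytes left over after this stream ended.
            return (data.length - offset) - inflater.getRemaining();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt stream", e);
        } finally {
            inflater.end();
        }
    }

    // What a single-stream reader does: stop at the first end-of-stream marker.
    static String readFirstStream(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        inflateOne(data, 0, out);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // What a multistream reader does: restart the decoder until the input is used up.
    static String readAllStreams(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int offset = 0;
        while (offset < data.length) {
            offset += inflateOne(data, offset, out);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] dump = concat(compress("<page>1</page>\n"), compress("<page>2</page>\n"));
        System.out.print("first stream only:\n" + readFirstStream(dump));
        System.out.print("all streams:\n" + readAllStreams(dump));
    }
}
```

The single-stream path returns only the first chunk of text even though more compressed data follows, which matches the truncated output seen in the question. The looping version keeps going until no input bytes remain.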

Hope this helps someone in the future :)
