C# parsing of Freebase RDF dump yields only 11.5 million N-Triples instead of 1.9 billion
Problem description
I'm building a C# program to read the RDF data in the Google Freebase data dump. To start, I've written a simple loop that reads the file and counts the triples. However, instead of the 1.9 billion triples stated on the documentation page (referred above), my program counts only about 11.5 million and then exits. The relevant portion of the source code is given below (it takes about 30 seconds to run).
What am I missing here?
// Simple reading through the gz file
try
{
    using (FileStream fileToDecompress = File.Open(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz", FileMode.Open))
    {
        int tupleCount = 0;
        string readLine = "";
        using (GZipStream decompressionStream = new GZipStream(fileToDecompress, CompressionMode.Decompress))
        {
            StreamReader sr = new StreamReader(decompressionStream, detectEncodingFromByteOrderMarks: true);
            while (true)
            {
                readLine = sr.ReadLine();
                if (readLine != null)
                {
                    tupleCount++;
                    if (tupleCount % 1000000 == 0)
                    { Console.WriteLine(DateTime.Now.ToShortTimeString() + ": " + tupleCount.ToString()); }
                }
                else
                { break; }
            }
            Console.WriteLine("Tuples: " + tupleCount.ToString());
        }
    }
}
catch (Exception ex)
{ Console.WriteLine(ex.Message); }
(I tried using GZippedNTriplesParser in dotNetRdf to read the data by building on this recommendation, but that seems to be choking on an RdfParseException right at the beginning (tab delimiters? UTF-8?). So, for the moment, I'm trying to roll my own.)
Recommended answer
The Freebase RDF dumps are built by a map/reduce job that outputs 200 individual Gzip files. Those 200 files are then concatenated into one final Gzip file. According to the Gzip spec, concatenating the raw bytes from multiple Gzip files will produce a valid Gzip file. A library that adheres to the spec should produce a single file with concatenated content of each input file when uncompressing that file.
Based on the number of triples that you're seeing, I'm guessing that your code is only uncompressing the first chunk of the file and ignoring the other 199. I'm not much of a C# programmer, but from reading another Stack Overflow answer it seems that switching to DotNetZip will solve this problem.
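One way to check this guess on a given runtime is to build a small two-member gzip stream in memory and see how many lines come back. The sketch below is only an illustration: the class name MultiMemberGzipCheck and its helper methods are hypothetical, and it uses nothing beyond System.IO.Compression.GZipStream. If the decompressor stops after the first member, it should report 2 lines rather than 5.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class MultiMemberGzipCheck
{
    // Compress one string into a complete, standalone gzip member.
    public static byte[] GzipMember(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gz = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gz.Write(raw, 0, raw.Length);
            } // disposing the GZipStream flushes the data and writes the gzip trailer
            return buffer.ToArray(); // ToArray still works after the stream is closed
        }
    }

    // Count the lines that the runtime's GZipStream yields from a gzip byte array.
    public static int CountLines(byte[] gzipBytes)
    {
        int count = 0;
        using (var input = new MemoryStream(gzipBytes))
        using (var gz = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gz, Encoding.UTF8))
        {
            while (reader.ReadLine() != null)
            {
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        // Two independent gzip members: 2 lines and 3 lines.
        byte[] first = GzipMember("a\nb\n");
        byte[] second = GzipMember("c\nd\ne\n");

        // Concatenate the raw bytes, mirroring how the dump is described as being assembled.
        byte[] combined = new byte[first.Length + second.Length];
        Buffer.BlockCopy(first, 0, combined, 0, first.Length);
        Buffer.BlockCopy(second, 0, combined, first.Length, second.Length);

        int lines = CountLines(combined);
        Console.WriteLine("Lines read: " + lines + " (5 if every member was decompressed)");
    }
}
```

If the count comes back short, the same CountLines loop can be pointed at a different gzip implementation (for example the DotNetZip stream mentioned above, assuming it behaves as that answer describes) to confirm that it reads all members before running it against the full dump.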