C# parsing of Freebase RDF dump yields only 11.5 million N-Triples instead of 1.9 billion

Question

I'm building a C# program to read the RDF data in the Google Freebase data dump. To start, I've written a simple loop that reads the file and counts the triples. However, instead of the 1.9 billion triples stated on the documentation page, my program counts only about 11.5 million and then exits. The relevant portion of the source code is given below (it takes about 30 seconds to run).

What am I missing here?

// Simple reading through the gz file
try
{
    using (FileStream fileToDecompress = File.Open(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz", FileMode.Open))
    {
        int tupleCount = 0;
        string readLine = "";

        using (GZipStream decompressionStream = new GZipStream(fileToDecompress, CompressionMode.Decompress))
        {
            StreamReader sr = new StreamReader(decompressionStream, detectEncodingFromByteOrderMarks: true);

            while (true)
            {
                readLine = sr.ReadLine();
                if (readLine != null)
                {
                    tupleCount++;
                    if (tupleCount % 1000000 == 0)
                    { Console.WriteLine(DateTime.Now.ToShortTimeString() + ": " + tupleCount.ToString()); }
                }
                else
                { break; }
            }
            Console.WriteLine("Tuples: " + tupleCount.ToString());
        }
    }
}
catch (Exception ex)
{ Console.WriteLine(ex.Message); }



(I tried using GZippedNTriplesParser in dotNetRdf to read the data, building on this recommendation, but it chokes with an RdfParseException right at the beginning (tab delimiters? UTF-8?). So, for the moment, I'm trying to roll my own.)

Recommended answer

The Freebase RDF dumps are built by a map/reduce job that outputs 200 individual Gzip files. Those 200 files are then concatenated into one final Gzip file. According to the Gzip spec, concatenating the raw bytes of multiple Gzip files produces a valid Gzip file. When decompressing such a file, a library that adheres to the spec should produce a single output containing the concatenated content of every input file.
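The concatenation property can be checked on a small scale before touching the full dump. The following is a minimal sketch, not a definitive test: it builds two tiny Gzip members in memory, joins their raw bytes, and counts the lines that the built-in System.IO.Compression.GZipStream yields. The class and method names (GzipConcatDemo, Compress, CountLines) are made up for illustration. Whether it reports all four lines or only the first two depends on the runtime in use, which is exactly what it is meant to reveal.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

public static class GzipConcatDemo
{
    // Compress a string into one standalone gzip member.
    public static byte[] Compress(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gz = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gz.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the GZipStream has closed the buffer.
            return buffer.ToArray();
        }
    }

    // Count the lines the built-in GZipStream yields from the given gzip bytes.
    public static int CountLines(byte[] gzipBytes)
    {
        using (var input = new MemoryStream(gzipBytes))
        using (var gz = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gz, Encoding.UTF8))
        {
            int count = 0;
            while (reader.ReadLine() != null)
            {
                count++;
            }
            return count;
        }
    }

    public static void Main()
    {
        byte[] first = Compress("a\nb\n");   // member 1: two lines
        byte[] second = Compress("c\nd\n");  // member 2: two lines
        byte[] joined = first.Concat(second).ToArray();

        int lines = CountLines(joined);
        Console.WriteLine(lines == 4
            ? "All members were read (4 lines)."
            : "Only " + lines + " lines were read; later members were skipped.");
    }
}
```

If this reports that later members were skipped, the same decompressor will stop early on the full dump as well.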

Based on the number of triples you're seeing, I'm guessing that your code is decompressing only the first chunk of the file and ignoring the other 199. I'm not much of a C# programmer, but from reading another Stack Overflow answer it seems that switching to DotNetZip will solve this problem.
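One way to try this without rewriting the loop each time is to keep the counting separate from the decompressor, so a different Gzip reader can be swapped in. The sketch below is only an illustration under assumptions: TripleCounter and CountLines are hypothetical names, it treats each line as one triple, and the commented-out usage assumes the DotNetZip package is referenced and that its Ionic.Zlib.GZipStream reads every member, as the linked answer suggests. The counter is a long to leave headroom, since 1.9 billion is close to int.MaxValue.

```csharp
using System;
using System.IO;
using System.Text;

public static class TripleCounter
{
    // Count lines from any stream that already yields decompressed bytes.
    // Assumes one triple per line.
    public static long CountLines(Stream decompressed)
    {
        using (var reader = new StreamReader(decompressed, Encoding.UTF8))
        {
            long count = 0;
            while (reader.ReadLine() != null)
            {
                count++;
            }
            return count;
        }
    }

    public static void Main(string[] args)
    {
        // Hypothetical usage against the real dump (requires DotNetZip):
        //   using (var file = File.OpenRead(args[0]))
        //   using (var gz = new Ionic.Zlib.GZipStream(file, Ionic.Zlib.CompressionMode.Decompress))
        //       Console.WriteLine("Triples: " + CountLines(gz));

        // Self-contained check with two tab-separated sample lines.
        var sample = new MemoryStream(
            Encoding.UTF8.GetBytes("s1\tp1\to1\t.\ns2\tp2\to2\t.\n"));
        Console.WriteLine("Triples: " + CountLines(sample)); // prints "Triples: 2"
    }
}
```

If the count on the full file is still near 11.5 million, the decompressor in use is stopping after the first member.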
