How to store data like Freebase does?


Question


    I admit that this is basically a duplicate question of Use freebase data on local server? but I need more detailed answers than have already been given there

    I've fallen absolutely in love with Freebase. What I want now is to essentially create a very simple Freebase clone for storing content that may not belong on Freebase itself but can be described using the Freebase schema. Essentially what I want is a simple and elegant way to store data like Freebase itself does and be able to easily use that data in a Python (CherryPy) web application.

    Chapter 2 of the MQL reference guide states:

    The database that underlies Metaweb is fundamentally different than the relational databases that you may be familiar with. Relational databases store data in the form of tables, but the Metaweb database stores data as a graph of nodes and relationships between those nodes.

    Which I guess means that I should be using either a triplestore or a graph database such as Neo4j? Does anybody here have any experience with using one of those from a Python environment?

    (What I've actually tried so far is to create a relational database schema which would be able to easily store Freebase topics, but I'm having issues with configuring the mappings in SQLAlchemy).
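
    Before committing to Neo4j or a dedicated triplestore, the node-and-relationship model the MQL guide describes can be prototyped in a few lines of plain Python. This is only an illustrative sketch (the class and the '/type/object/name' value are made up, not any library's API); '/m/0cwtm' is the person mid mentioned in the answer below:

```python
from collections import defaultdict

# A toy subject-predicate-object store, just to make the graph data model
# concrete. A real deployment would use a triplestore or a graph database.
class TripleStore:
    def __init__(self):
        # subject -> predicate -> set of objects
        self._spo = defaultdict(lambda: defaultdict(set))

    def add(self, subject, predicate, obj):
        self._spo[subject][predicate].add(obj)

    def objects(self, subject, predicate):
        return self._spo[subject][predicate]

store = TripleStore()
store.add('/m/0cwtm', '/type/object/type', '/people/person')
store.add('/m/0cwtm', '/type/object/name', 'Some Person')   # hypothetical value
print(store.objects('/m/0cwtm', '/type/object/type'))
```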

    Things I'm looking into

    UPDATE [28/12/2011]:

    I found an article on the Freebase blog that describes the proprietary tuple store / database Freebase themselves use (graphd): http://blog.freebase.com/2008/04/09/a-brief-tour-of-graphd/

Solution

    This is what worked for me. It allows you to load all of a Freebase dump in a standard MySQL installation on less than 100GB of disk. The key is understanding the data layout in a dump and then transforming it (optimizing it for space and speed).

    Freebase notions you should understand before you attempt to use this (all taken from the documentation):

    • Topic - anything of type '/common/topic', pay attention to the different types of ids you may encounter in Freebase - 'id', 'mid', 'guid', 'webid', etc.
    • Domain
    • Type - 'is a' relationship
    • Properties - 'has a' relationship
    • Schema
    • Namespace
    • Key - human readable in the '/en' namespace

    Some other important Freebase specifics:

    • the query editor is your friend
    • understand the 'source', 'property', 'destination' and 'value' notions described here
    • everything has a mid, even things like '/', '/m', '/en', '/lang', '/m/0bnqs_5', etc.; test using the query editor: [{'id':'/','mid':null}]
    • you can't tell what any entity (i.e. row) in the data dump is on its own; you have to get to its types to do that (for instance, that is how I know '/m/0cwtm' is a human);
    • every entity has at least one type (but usually many more)
    • every entity has at least one id/key (but usually many more)
    • the ontology (i.e. metadata) is embedded in the same dump and the same format as the data (not the case with other distributions like DBPedia, etc.)
    • the 'destination' column in the dump is the confusing one; it may contain a mid or a key (see how the transforms below deal with this)
    • the domains, types, properties are namespace levels at the same time (whoever came up with this is a genius IMHO);
    • understand what is a Topic and what is not a Topic (absolutely crucial), for example this entity '/m/03lmb2f' of type '/film/performance' is NOT a Topic (I choose to think of these as the equivalent of Blank Nodes in RDF, although this may not be philosophically accurate), while '/m/04y78wb' of type '/film/director' (among others) is;
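
    To make the source/property/destination/value layout concrete, here is a rough sketch of the case analysis that the transforms below rely on: a row with a destination but no value is a link between two nodes, while a row with a value carries namespaced text. The sample rows are invented, shaped like real dump lines:

```python
# Each dump row is a tab-separated quadruple: source, property, destination, value.
# Roughly mirrors the cases from http://wiki.freebase.com/wiki/Data_dumps.
def classify(row):
    source, prop, destination, value = row.split('\t')
    if destination and not value:
        return 'link'      # a relationship between two nodes
    return 'ns'            # namespaced data, e.g. an English text value

# hypothetical rows for illustration
link_row = '/m/0cwtm\t/type/object/type\t/people/person\t'
ns_row   = '/m/0cwtm\t/type/object/name\t/lang/en\tSome Person'

print(classify(link_row))
print(classify(ns_row))
```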

    Transforms

    (see the Python code at the bottom)

    TRANSFORM 1 (from shell, split links from namespaces ignoring notable_for and non /lang/en text):

    python parse.py freebase.tsv  #end up with freebase_links.tsv and freebase_ns.tsv
    

    TRANSFORM 2 (from Python console, split freebase_ns.tsv on freebase_ns_types.tsv, freebase_ns_props.tsv plus 15 others which we ignore for now)

    import e
    e.split_external_keys( 'freebase_ns.tsv' )
    

    TRANSFORM 3 (from Python console, convert property and destination to mids)

    import e
    ns = e.get_namespaced_data( 'freebase_ns_types.tsv' )
    e.replace_property_and_destination_with_mid( 'freebase_links.tsv', ns )    #produces freebase_links_pdmids.tsv
    e.replace_property_with_mid( 'freebase_ns_props.tsv', ns ) #produces freebase_ns_props_pmids.tsv
    

    TRANSFORM 4 (from MySQL console, load freebase_links_pdmids.tsv, freebase_ns_props_pmids.tsv and freebase_ns_base_plus_types.tsv in DB):

    CREATE TABLE links(
    source      VARCHAR(20), 
    property    VARCHAR(20), 
    destination VARCHAR(20), 
    value       VARCHAR(1)
    ) ENGINE=MyISAM CHARACTER SET utf8;
    
    CREATE TABLE ns(
    source      VARCHAR(20), 
    property    VARCHAR(20), 
    destination VARCHAR(40), 
    value       VARCHAR(255)
    ) ENGINE=MyISAM CHARACTER SET utf8;
    
    CREATE TABLE types(
    source      VARCHAR(20), 
    property    VARCHAR(40), 
    destination VARCHAR(40), 
    value       VARCHAR(40)
    ) ENGINE=MyISAM CHARACTER SET utf8;
    
    LOAD DATA LOCAL INFILE "/data/freebase_links_pdmids.tsv" INTO TABLE links FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
    LOAD DATA LOCAL INFILE "/data/freebase_ns_props_pmids.tsv" INTO TABLE ns FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
    LOAD DATA LOCAL INFILE "/data/freebase_ns_base_plus_types.tsv" INTO TABLE types FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
    
    CREATE INDEX links_source            ON links (source)             USING BTREE;
    CREATE INDEX ns_source               ON ns    (source)             USING BTREE;
    CREATE INDEX ns_value                ON ns    (value)              USING BTREE;
    CREATE INDEX types_source            ON types (source)             USING BTREE;
    CREATE INDEX types_destination_value ON types (destination, value) USING BTREE;
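
    You don't need a running MySQL instance to see how the three tables fit together. The same lookup pattern (find an entity's mid by its English name via ns, then walk its outgoing links), sketched here with sqlite3 and a couple of invented rows:

```python
import sqlite3

# Same three-table layout as the MySQL schema above, in an in-memory sqlite3 DB.
db = sqlite3.connect(':memory:')
db.executescript("""
CREATE TABLE links(source TEXT, property TEXT, destination TEXT, value TEXT);
CREATE TABLE ns   (source TEXT, property TEXT, destination TEXT, value TEXT);
CREATE TABLE types(source TEXT, property TEXT, destination TEXT, value TEXT);
CREATE INDEX links_source ON links(source);
CREATE INDEX ns_value     ON ns(value);
""")

# Invented rows shaped like the transformed dump data.
db.execute("INSERT INTO ns VALUES ('/m/0cwtm', '/type/object/name', '/lang/en', 'Some Person')")
db.execute("INSERT INTO links VALUES ('/m/0cwtm', '/type/object/type', '/m/01g317', '')")

# Find the entity by its English name, then fetch its outgoing links.
(found_mid,) = db.execute("SELECT source FROM ns WHERE value = ?", ('Some Person',)).fetchone()
rows = db.execute("SELECT property, destination FROM links WHERE source = ?", (found_mid,)).fetchall()
print(found_mid, rows)
```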
    

    Code

    Save this as e.py:

    import sys
    
    #returns a dict to be used by mid(...), replace_property_and_destination_with_mid(...) below
    def get_namespaced_data( file_name ):
        f = open( file_name )
        result = {}
    
        for line in f:
            elements = line[:-1].split('\t')
    
            if len( elements ) < 4:
                print 'Skip...'
                continue
    
            result[(elements[2], elements[3])] = elements[0]
    
        return result
    
    #runs out of memory
    def load_links( file_name ):
        f = open( file_name )
        result = {}
    
        for line in f:
            if len( result ) % 1000000 == 0:
                print len(result)
            elements = line[:-1].split('\t')
            src, prop, dest = elements[0], elements[1], elements[2]
            if result.get( src, False ):
                if result[ src ].get( prop, False ):
                    result[ src ][ prop ].append( dest )
                else:
                    result[ src ][ prop ] = [dest]
            else:
                result[ src ] = dict([( prop, [dest] )])
    
        return result
    
    #same as load_links but for the namespaced data
    def load_ns( file_name ):
        f = open( file_name )
        result = {}
    
        for line in f:
            if len( result ) % 1000000 == 0:
                print len(result)
            elements = line[:-1].split('\t')
            src, prop, value = elements[0], elements[1], elements[3]
            if result.get( src, False ):
                if result[ src ].get( prop, False ):
                    result[ src ][ prop ].append( value )
                else:
                    result[ src ][ prop ] = [value]
            else:
                result[ src ] = dict([( prop, [value] )])
    
        return result
    
    def links_in_set( file_name ):
        f = open( file_name )
        result = set()
    
        for line in f:
            elements = line[:-1].split('\t')
            result.add( elements[0] )
        return result
    
    def mid( key, ns ):
        if key == '':
            return False
        elif key == '/':
            key = '/boot/root_namespace'
        parts = key.split('/')
        if len(parts) == 1:           #cover the case of something which doesn't start with '/'
            print key
            return False
        if parts[1] == 'm':           #already a mid
            return key
        namespace = '/'.join(parts[:-1])
        key = parts[-1]
        return ns.get( (namespace, key), False )
    
    def replace_property_and_destination_with_mid( file_name, ns ):
        fn = file_name.split('.')[0]
        f = open( file_name )
        f_out_mids = open(fn+'_pdmids'+'.tsv', 'w')
    
        def convert_to_mid_if_possible( value ):
            m = mid( value, ns )
            if m: return m
            else: return None
    
        counter = 0
    
        for line in f:
            elements = line[:-1].split('\t')
            md   = convert_to_mid_if_possible(elements[1])
            dest = convert_to_mid_if_possible(elements[2])
            if md and dest:
                elements[1] = md
                elements[2] = dest
                f_out_mids.write( '\t'.join(elements)+'\n' )
            else:
                counter += 1
    
        print 'Skipped: ' + str( counter )
    
    def replace_property_with_mid( file_name, ns ):
        fn = file_name.split('.')[0]
        f = open( file_name )
        f_out_mids = open(fn+'_pmids'+'.tsv', 'w')
    
        def convert_to_mid_if_possible( value ):
            m = mid( value, ns )
            if m: return m
            else: return None
    
        for line in f:
            elements = line[:-1].split('\t')
            md = convert_to_mid_if_possible(elements[1])
            if md:
                elements[1]=md
                f_out_mids.write( '\t'.join(elements)+'\n' )
            else:
                #print 'Skipping ' + elements[1]
                pass
    
    #cPickle
    #ns=e.get_namespaced_data('freebase_2.tsv')
    #import cPickle
    #cPickle.dump( ns, open('ttt.dump','wb'), protocol=2 )
    #ns=cPickle.load( open('ttt.dump','rb') )
    
    #fn='/m/0'
    #n=fn.split('/')[2]
    #dir = n[:-1]
    
    
    def is_mid( value ):
        parts = value.split('/')
        if len(parts) == 1:   #it doesn't start with '/'
            return False
        if parts[1] == 'm':
            return True
        return False
    
    def check_if_property_or_destination_are_mid( file_name ):
        f = open( file_name )
    
        for line in f:
            elements = line[:-1].split('\t')
            #if is_mid( elements[1] ) or is_mid( elements[2] ):
            if is_mid( elements[1] ):
                print line
    
    #
    def split_external_keys( file_name ):
        fn = file_name.split('.')[0]
        f = open( file_name )
        f_out_extkeys  = open(fn+'_extkeys' + '.tsv', 'w')
        f_out_intkeys  = open(fn+'_intkeys' + '.tsv', 'w')
        f_out_props    = open(fn+'_props'   + '.tsv', 'w')
        f_out_types    = open(fn+'_types'   + '.tsv', 'w')
        f_out_m        = open(fn+'_m'       + '.tsv', 'w')
        f_out_src      = open(fn+'_src'     + '.tsv', 'w')
        f_out_usr      = open(fn+'_usr'     + '.tsv', 'w')
        f_out_base     = open(fn+'_base'    + '.tsv', 'w')
        f_out_blg      = open(fn+'_blg'     + '.tsv', 'w')
        f_out_bus      = open(fn+'_bus'     + '.tsv', 'w')
        f_out_soft     = open(fn+'_soft'    + '.tsv', 'w')
        f_out_uri      = open(fn+'_uri'     + '.tsv', 'w')
        f_out_quot     = open(fn+'_quot'    + '.tsv', 'w')
        f_out_frb      = open(fn+'_frb'     + '.tsv', 'w')
        f_out_tag      = open(fn+'_tag'     + '.tsv', 'w')
        f_out_guid     = open(fn+'_guid'    + '.tsv', 'w')
        f_out_dtwrld   = open(fn+'_dtwrld'  + '.tsv', 'w')
    
        for line in f:
            elements = line[:-1].split('\t')
            parts_2 = elements[2].split('/')
            if len(parts_2) == 1:                 #the blank destination elements - '', plus the root domain ones
                if elements[1] == '/type/object/key':
                    f_out_types.write( line )
                else:
                    f_out_props.write( line )
    
            elif elements[2] == '/lang/en':
                f_out_props.write( line )
    
            elif (parts_2[1] == 'wikipedia' or parts_2[1] == 'authority') and len( parts_2 ) > 2:
                f_out_extkeys.write( line )
    
            elif parts_2[1] == 'm':
                f_out_m.write( line )
    
            elif parts_2[1] == 'en':
                f_out_intkeys.write( line )
    
            elif parts_2[1] == 'source' and len( parts_2 ) > 2:
                f_out_src.write( line )
    
            elif parts_2[1] == 'user':
                f_out_usr.write( line )
    
            elif parts_2[1] == 'base' and len( parts_2 ) > 2:
                if elements[1] == '/type/object/key':
                    f_out_types.write( line )
                else:
                    f_out_base.write( line )
    
            elif parts_2[1] == 'biology' and len( parts_2 ) > 2:
                f_out_blg.write( line )
    
            elif parts_2[1] == 'business' and len( parts_2 ) > 2:
                f_out_bus.write( line )
    
            elif parts_2[1] == 'soft' and len( parts_2 ) > 2:
                f_out_soft.write( line )
    
            elif parts_2[1] == 'uri':
                f_out_uri.write( line )
    
            elif parts_2[1] == 'quotationsbook' and len( parts_2 ) > 2:
                f_out_quot.write( line )
    
            elif parts_2[1] == 'freebase' and len( parts_2 ) > 2:
                f_out_frb.write( line )
    
            elif parts_2[1] == 'tag' and len( parts_2 ) > 2:
                f_out_tag.write( line )
    
            elif parts_2[1] == 'guid' and len( parts_2 ) > 2:
                f_out_guid.write( line )
    
            elif parts_2[1] == 'dataworld' and len( parts_2 ) > 2:
                f_out_dtwrld.write( line )
    
            else:
                f_out_types.write( line )
    

    Save this as parse.py:

    import sys
    
    def parse_freebase_quadruple_tsv_file( file_name ):
        fn = file_name.split('.')[0]
        f = open( file_name )
        f_out_links = open(fn+'_links'+'.tsv', 'w')
        f_out_ns    = open(fn+'_ns'   +'.tsv', 'w')
    
        for line in f:
            elements = line[:-1].split('\t')
    
            if len( elements ) < 4:
                print 'Skip...'
                continue
    
            #print 'Processing ' + str( elements )                                                                                                                  
    
            #cases described here http://wiki.freebase.com/wiki/Data_dumps                                                                                          
            if elements[1].endswith('/notable_for'):                               #ignore notable_for, it has JSON in it                                           
                continue
    
            elif elements[2] and not elements[3]:                                  #case 1, linked                                                                  
                f_out_links.write( line )
    
            elif not (elements[2].startswith('/lang/') and elements[2] != '/lang/en'):   #ignore languages other than English                                       
                f_out_ns.write( line )
    
    if len(sys.argv[1:]) == 0:
        print 'Pass a list of .tsv filenames'
    
    for file_name in sys.argv[1:]:
        parse_freebase_quadruple_tsv_file( file_name )
    

    Notes:

    • Depending on the machine the index creation may take anywhere from a few to 12+ hours (consider the amount of data you are dealing with though).
    • To be able to traverse the data in both directions you also need an index on links.destination, which I found too expensive time-wise and never finished building.
    • Many other optimizations are possible here. For example the 'types' table is small enough to be loaded in memory in a Python dict (see e.get_namespaced_data( 'freebase_ns_types.tsv' ))

    And the standard disclaimer here. It has been a few months since I did this. I believe it is mostly correct, but I apologize if my notes missed something. Unfortunately the project I needed it for fell through the cracks, but I hope this helps someone else. If something isn't clear, drop a comment here.
