How to extract information from ODP accurately?


Question

I am building a search engine in Python.

I have heard that Google falls back to the ODP (Open Directory Project) for a page's description when it can't work one out from the page's own metadata. I want to do something similar.

ODP is an online directory from Mozilla that holds descriptions of pages on the web, so I want to fetch the descriptions for my search results from it. How do I get the accurate description of a particular URL from ODP, and return Python's None if I can't find it (meaning ODP has no idea what page I am looking for)?

PS. There is a URL, http://dmoz.org/search?q=Your+Search+Params, but I don't know how to extract information from there.

Recommended answer

To use ODP data, you'd download the RDF data dump. RDF is an XML format; you'd index that dump to map URLs to descriptions. I'd use a SQL database for this.

Note that a URL can be present in multiple locations in the dump. Stack Overflow is listed twice, for example. Google uses the text from one entry as the site description; Bing uses the other.
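
Because one URL can appear under several categories, a table keyed on the URL alone keeps only one description per URL. The following is a minimal sketch of keeping every listing instead, so you can choose between them later. The in-memory database, the odp_entries table, its topic column, and the sample rows are all assumptions made for illustration and are not part of the original answer.

```python
import sqlite3

# Hypothetical schema: one row per (url, topic) pair, so duplicate listings survive.
conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE odp_entries
    (url TEXT, topic TEXT, description TEXT, PRIMARY KEY (url, topic))''')

# Made-up sample rows; the real dump would supply these values.
rows = [
    ('http://example.com/', 'Top/Computers/A', 'First description'),
    ('http://example.com/', 'Top/Computers/B', 'Second description'),
]
conn.executemany('INSERT OR REPLACE INTO odp_entries VALUES (?, ?, ?)', rows)


def descriptions_for(conn, url):
    """Return all descriptions stored for url, ordered by topic."""
    cursor = conn.execute(
        'SELECT description FROM odp_entries WHERE url = ? ORDER BY topic',
        (url,))
    return [row[0] for row in cursor]
```

Calling descriptions_for(conn, 'http://example.com/') returns both descriptions; which one to show is a policy decision for your search engine.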

The data dump is of course rather large. Use sensible tools such as the ElementTree iterparse() method to parse the data set iteratively as you add entries to your database. You really only need to look for the <ExternalPage> elements, taking the <d:Title> and <d:Description> entries underneath.
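
To make the element structure concrete, here is a small sketch that uses only the standard library's xml.etree.ElementTree on an in-memory sample. The sample document is made up to resemble the dump's shape (it is an assumption, not real ODP data), and iter_pages is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up sample shaped like the dump; the real file is far larger.
SAMPLE = b'''<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/">
    <d:Title>Example</d:Title>
    <d:Description>An example page.</d:Description>
  </ExternalPage>
  <ExternalPage about="http://example.org/">
    <d:Title>No description here</d:Title>
  </ExternalPage>
</RDF>'''

DMOZ = '{http://dmoz.org/rdf/}'
DC = '{http://purl.org/dc/elements/1.0/}'


def iter_pages(source):
    """Yield (url, title, description) for each ExternalPage element."""
    for event, element in ET.iterparse(source, events=('end',)):
        if element.tag != DMOZ + 'ExternalPage':
            continue
        url = element.get('about')
        # findtext returns the default when the child element is absent.
        title = element.findtext(DC + 'Title', default='')
        description = element.findtext(DC + 'Description', default='')
        yield url, title, description
        element.clear()  # free the children once the entry has been read


pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

Each item in pages is a (url, title, description) tuple, with an empty string where the entry has no description.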

Using lxml (a faster and more complete ElementTree implementation), that'd look like:

from lxml import etree as ET
import gzip
import sqlite3

conn = sqlite3.connect('/path/to/database')

# create table
with conn:
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS odp_urls 
        (url text primary key, title text, description text)''')

count = 0
nsmap = {'d': 'http://purl.org/dc/elements/1.0/'}
with gzip.open('content.rdf.u8.gz', 'rb') as content, conn:
    cursor = conn.cursor()
    for event, element in ET.iterparse(content, tag='{http://dmoz.org/rdf/}ExternalPage'):
        url = element.attrib['about']
        title = element.xpath('d:Title/text()', namespaces=nsmap)
        description = element.xpath('d:Description/text()', namespaces=nsmap)
        # xpath() returns a list; fall back to an empty string when nothing matched
        title = title[0] if title else ''
        description = description[0] if description else ''

        # no longer need this, remove from memory again, as well as any preceding siblings
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]

        cursor.execute('INSERT OR REPLACE INTO odp_urls VALUES (?, ?, ?)',
            (url, title, description))
        count += 1
        if count % 1000 == 0:
            print('Processed {} items'.format(count))
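
Once the table is populated, the lookup the question asks for can be sketched as a single query that returns None when the URL is not in the index. lookup_description is a hypothetical helper name, and the in-memory database with one made-up row is only there to make the example runnable; in practice you would connect to the database built above.

```python
import sqlite3


def lookup_description(conn, url):
    """Return the stored description for url, or None if ODP has no entry."""
    row = conn.execute(
        'SELECT description FROM odp_urls WHERE url = ?', (url,)).fetchone()
    return row[0] if row is not None else None


# Demonstration against an in-memory database with one made-up row.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE odp_urls (url text primary key, title text, description text)')
conn.execute('INSERT INTO odp_urls VALUES (?, ?, ?)',
             ('http://example.com/', 'Example', 'An example page.'))

found = lookup_description(conn, 'http://example.com/')
missing = lookup_description(conn, 'http://not-listed.example/')
```

This is an exact string match, so a URL that differs only by a trailing slash or by scheme will not be found; you may want to normalize URLs the same way when indexing and when querying.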
