How to extract a sentence containing a particular word from millions of paragraphs

Question

I scraped millions of newspaper articles using Python Scrapy. Now I want to extract every sentence containing a given word. Below is my implementation.

import nltk
from collections import defaultdict

# articles: list of scraped article strings; words: the ~1000 search words
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = defaultdict(list)  # word -> sentences containing it

for a in articles:
    article_sentence = tokenizer.tokenize(a)  # split the article into sentences
    for s in article_sentence:
        for w in words:  # scan every sentence against every search word
            if ' ' + w + ' ' in s:  # crude whole-word containment check
                sentences[w].append(s)

I have around ~1000 words. The above code is not efficient and takes a lot of time. Also, a sentence can contain the root word in a different form (e.g. past tense). How can I extract such sentences efficiently? Please help. Are there any other tools that I need?

Answer

This sounds like a perfect application for the Aho-Corasick string-matching algorithm. It searches a single text (e.g. your tokenized sentence or document) for multiple strings simultaneously. That simultaneous search will eliminate the inner loop in your initial implementation (including the expensive string concatenation in that loop).

I've only implemented Aho-Corasick in Java, but a quick Google search yields links to several existing Python implementations, e.g.:

* ahocorasick
* pyahocorasick

I have no experience with either implementation (or any of the other options), but you can probably find one that meets your needs - or implement it yourself if you feel like an enjoyable bit of coding.
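
For a concrete picture, here is a minimal sketch of that simultaneous search, assuming the pyahocorasick package (imported as ahocorasick) and reusing the words and sentences names from the question:

import ahocorasick
from collections import defaultdict

words = ['write', 'wrote']   # stand-in for the ~1000 search words
sentences = defaultdict(list)

# Build the automaton once, up front, from all search words.
A = ahocorasick.Automaton()
for w in words:
    A.add_word(w, w)  # the value stored with each pattern: the word itself
A.make_automaton()

# One pass over the text reports every occurrence of every word at once,
# replacing the per-word inner loop (and its string concatenation).
# Note: these are substring matches; pad the keys with spaces or check
# word boundaries if whole-word matching is required.
text = 'He wrote a letter to the editor.'
for end_index, w in A.iter(text):
    sentences[w].append(text)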

My recommendation would be that you include all the word forms of interest in your 'dictionary' trie (the set of matches to search for). E.g. if you're searching for 'write', insert both 'write' and 'wrote' into the trie. That will reduce the amount of preprocessing you'll need to do to input documents.
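
As a small illustration (the word_forms table here is a hypothetical stand-in for whatever stemmer or hand-built list of inflections you use):

import ahocorasick

# Hypothetical mapping: each surface form -> its root word.
word_forms = {'write': 'write', 'wrote': 'write', 'written': 'write'}

A = ahocorasick.Automaton()
for form, root in word_forms.items():
    # Store the root as the value, so a hit on any inflected form
    # is reported under the root word.
    A.add_word(form, root)
A.make_automaton()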

I'd also recommend searching texts as large as practical (perhaps a paragraph or a full document at a time, instead of one sentence at a time), to make more efficient use of each Aho-Corasick invocation.
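
A hedged sketch of that, assuming the automaton A and the sentences dict from the sketches above, plus NLTK's Punkt tokenizer, whose span_tokenize gives the character offsets of each sentence:

import bisect
import nltk

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

for a in articles:  # articles: the scraped article strings from the question
    # Sentence boundaries as (start, end) character offsets.
    spans = list(tokenizer.span_tokenize(a))
    starts = [start for start, _ in spans]
    # A single Aho-Corasick pass over the whole article.
    for end_index, root in A.iter(a):
        # Map the match position back to the sentence that contains it.
        i = bisect.bisect_right(starts, end_index) - 1
        start, end = spans[i]
        sentences[root].append(a[start:end])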
