Python connect composed keywords in texts

Question

So, I have a keyword list in lowercase. Let's say

keywords = ['machine learning', 'data science', 'artificial intelligence']

and a list of texts in lowercase. Let's say

texts = [
  'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 
  'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]

I need to convert the texts into:

[[['the', 'new',
   'machine_learning',
   'model',
   'built',
   'by',
   'google',
   'is',
   'revolutionary',
   'for',
   'the',
   'current',
   'state',
   'of',
   'artificial_intelligence'],
  ['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']],
 [['data_science',
   'and',
   'artificial_intelligence',
   'are',
   'two',
   'different',
   'fields',
   'although',
   'they',
   'are',
   'interconnected'],
  ['scientists',
   'from',
   'harvard',
   'are',
   'explaining',
   'it',
   'in',
   'a',
   'detailed',
   'presentation',
   'that',
   'could',
   'be',
   'found',
   'on',
   'our',
   'page']]]

What I do right now is check whether each keyword appears in a text and, if it does, replace it with the same keyword with its spaces turned into _. But this is of complexity m*n, and it is really slow when you have 700 long texts and 2M keywords, as in my case.
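A minimal sketch of that approach (simplified, assuming plain substring replacement):

replaced_texts = []
for text in texts:                    # n texts
    for kw in keywords:               # m keywords scanned against every text
        if kw in text:
            text = text.replace(kw, kw.replace(' ', '_'))
    replaced_texts.append(text)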

I was trying to use Phraser, but I can't manage to build one with only my keywords.

Could someone suggest a more optimized way of doing this?

Answer

The Phrases/Phraser classes of gensim are designed to use their internal, statistically-derived records of what word pairs should be promoted to phrases – not user-supplied pairings. (You could probably poke & prod a Phraser to do what you want, by synthesizing scores/thresholds, but that would be somewhat awkward & kludgey.)
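For reference, a rough sketch of how Phrases/Phraser are normally used, where the promoted pairs come from corpus statistics rather than a supplied keyword list (parameter values here are only illustrative):

from gensim.models.phrases import Phrases, Phraser

sentences = [text.split() for text in texts]
phrases_model = Phrases(sentences, min_count=1, threshold=0.1)  # pairs chosen by co-occurrence statistics
phraser = Phraser(phrases_model)   # frozen, lighter-weight form for applying the learned phrases
phraser[sentences[0]]              # tokens, with statistically-chosen pairs joined by '_'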

You could mimic their general approach: (1) operate on lists of tokens rather than raw strings; (2) learn & remember the token pairs that should be combined; & (3) perform the combination in a single pass. That should work far more efficiently than anything based on repeated search-and-replace on a string – which it sounds like you've already tried and found wanting.

For example, let's first create a dictionary, where the keys are tuples of word-pairs that should be combined, and the values are tuples that include both their designated combination-token, and a 2nd item that's just an empty-tuple. (The reason for this will become clear later.)

keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = [
    'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 
    'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]

# map each word-pair tuple -> (combined token, empty next-buffer tuple)
combinations_dict = {tuple(kwsplit): ('_'.join(kwsplit), ())
                     for kwsplit in [kwstr.split() for kwstr in keywords]}
combinations_dict

After this step, combinations_dict is:

{('machine', 'learning'): ('machine_learning', ()),
 ('data', 'science'): ('data_science', ()),
 ('artificial', 'intelligence'): ('artificial_intelligence', ())}

Now, we can use a Python generator function to create an iterable transformation of any other sequence of tokens. It consumes the original tokens one by one, but before emitting anything it adds the newest token to a buffered candidate pair. If that pair is one that should be combined, a single combined token is yielded; if not, just the 1st token is emitted, leaving the 2nd to be paired with the next incoming token.

For example:

def combining_generator(tokens, comb_dict):
    buff = ()  # start with empty buffer
    for in_tok in tokens:
        buff += (in_tok,)  # add latest to buffer
        if len(buff) < 2:  # grow buffer to 2 tokens if possible
            continue
        # lookup what to do for current pair... 
        # ...defaulting to emit-[0]-item, keep-[1]-item in new buff
        out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
        yield out_tok 
    if buff:
        yield buff[0]  # last solo token if any

Here we see the reason for the earlier () empty-tuples: that's the preferred state of the buff after a successful replacement. And driving the result & next-state this way helps us use the form of dict.get(key, default) that supplies a specific value to be used if the key isn't found.
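As a quick check, running the generator on a short token list (with the combinations_dict built earlier) shows the buffering in action:

list(combining_generator(['the', 'machine', 'learning', 'model'], combinations_dict))
# -> ['the', 'machine_learning', 'model']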

Now designated combinations can be applied via:

tokenized_texts = [text.split() for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict)) for tokens in tokenized_texts]
retokenized_texts

...which reports retokenized_texts as:

[
  ['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial', 'intelligence.', 'it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking'], 
  ['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields,', 'although', 'they', 'are', 'interconnected.', 'scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page.']
]

Note that the tokens ('artificial', 'intelligence.') aren't combined here, as the dirt-simple .split() tokenization used has left the punctuation attached, preventing an exact match to the rule.

Real projects will want to use a more sophisticated tokenization, which might strip the punctuation, retain punctuation as standalone tokens, or do other preprocessing, and as a result would properly pass 'artificial' as a token without the attached '.'. For example, a simple tokenization that just retains runs of word characters, discarding punctuation, would be:

import re
tokenized_texts = [re.findall(r'\w+', text) for text in texts]
tokenized_texts

Another that also keeps any stray non-word/non-space characters (punctuation) as standalone tokens would be:

tokenized_texts = [re.findall(r'\w+|(?:[^\w\s])', text) for text in texts]
tokenized_texts

Either of these alternatives to a simple .split() would ensure your 1st text presents the necessary ('artificial', 'intelligence') pair for combination.
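For instance, a sketch of the full pipeline on top of the regex tokenization (reusing combinations_dict and combining_generator from above):

import re

tokenized_texts = [re.findall(r'\w+', text) for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict))
                     for tokens in tokenized_texts]
# 'artificial_intelligence' now appears as a single token in the 1st text as well,
# since the trailing '.' is no longer attached to 'intelligence'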
