How to quickly check strings for correct English words? - Python


Question

I have a column in a pandas dataframe where each cell contains a rather long string of words. These strings come from an SQL database and contain a mix of English words and non-English alphanumeric ID phrases, separated by spaces. A string can be as long as the maximum character length SQL allows. The dataframe is not small either: I have several million rows.

The question is: what is the fastest way to keep only the correct English words in each cell?

Below is my initial method. Based on the speed reported by tqdm (hence the progress_apply), it would seemingly take days to complete.

import pandas as pd
from nltk.corpus import words
from tqdm import tqdm

def check_for_word(sentence):
    s = sentence.split(' ')
    for word in s:
        if word not in words.words():
            s.remove(word)
    return ' '.join(s)

tqdm.pandas(desc="Checking for Words in keywords")
df['keywords'] = df['keywords'].progress_apply(check_for_word)  

Is there a method that would be significantly faster?

Thanks for your help!

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The answer below was very helpful and took less than a second to run (a great improvement!). In the end I had to switch from nltk.corpus words to nltk.corpus wordnet, because the words list was not exhaustive enough for my purposes. The final result was:

from nltk.corpus import wordnet
from tqdm import tqdm

def check_for_word(s):
    return ' '.join(w for w in str(s).split(' ') if len(wordnet.synsets(w)) > 0)

tqdm.pandas(desc="Checking for Words in Keywords")
df['keywords'] = df['keywords'].progress_apply(check_for_word)

This took 43 seconds to run.
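
With several million rows, the same tokens probably recur many times, so caching the per-token result may reduce the run time further. The sketch below is an illustration under stated assumptions and is not part of the original answer: is_english is a hypothetical predicate standing in for len(wordnet.synsets(w)) > 0, and KNOWN is a toy vocabulary so the example runs without any NLTK data. The lookups counter is there only to show how many times the underlying check actually executes.

```python
from functools import lru_cache

# Toy vocabulary (assumption); the real check would be wordnet.synsets(w).
KNOWN = {"fast", "word", "check"}
lookups = 0  # counts how often the underlying check actually runs

@lru_cache(maxsize=None)
def is_english(token):
    """Hypothetical stand-in for an expensive dictionary lookup."""
    global lookups
    lookups += 1
    return token in KNOWN

def check_for_word(s):
    # Same shape as the function above, but each distinct token is checked once.
    return ' '.join(w for w in str(s).split(' ') if is_english(w))

rows = ["fast ab12 word", "word ab12 check", "fast fast zz9"]
cleaned = [check_for_word(r) for r in rows]
print(cleaned)
print(lookups)  # number of distinct tokens, not total tokens
```

The three rows contain nine tokens but only five distinct ones, so the underlying check runs five times. On a pandas column, the same function could be passed to apply or progress_apply unchanged; whether it helps in practice depends on how often tokens repeat in the data.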

Recommended answer

words.words() returns a list, and checking whether a word is present in a list takes O(n) time. To reduce the time complexity, create a set from this list; set membership tests take constant time on average.
The second optimization concerns remove(): removing an element from a list also takes O(n) time. You can build a separate sequence of the words you keep instead, which avoids that overhead. For more on the complexity of various operations, see https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt

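from nltk.corpus import words

# Build the set once so each membership test is O(1) on average.
set_of_words = set(words.words())

def check_for_word(sentence):
    s = sentence.split(' ')
    # Keep only the tokens found in the word set.
    return ' '.join(word for word in s if word in set_of_words)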
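
As a quick sanity check of the set-based version, here is a minimal sketch that uses a small hypothetical vocabulary (SAMPLE_VOCAB) in place of set(words.words()), so it runs without downloading the NLTK corpus. It also illustrates a side effect of the loop in the question: calling remove() on a list while iterating over it skips the element that follows each removal, so some non-words can survive. The function names are illustrative only.

```python
# Hypothetical stand-in for set(words.words()); the real set is far larger.
SAMPLE_VOCAB = {"red", "apple", "green", "pear"}

def filter_with_remove(sentence, vocab):
    """Approach from the question: mutates the list while iterating over it."""
    tokens = sentence.split(' ')
    for token in tokens:
        if token not in vocab:
            tokens.remove(token)
    return ' '.join(tokens)

def filter_with_set(sentence, vocab):
    """Set-based approach: builds a new sequence and never mutates mid-iteration."""
    return ' '.join(token for token in sentence.split(' ') if token in vocab)

text = "red id123 x9z apple"
print(filter_with_remove(text, SAMPLE_VOCAB))  # "x9z" slips through
print(filter_with_set(text, SAMPLE_VOCAB))     # only vocabulary words remain
```

In this example the two adjacent non-words are the problem: removing "id123" shifts "x9z" into the position the loop has already passed, so it is never checked. The generator-based version avoids both the O(n) removals and this skipping behavior.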
