How to speed up the sum of presence of keys in the series of documents? - Pandas, nltk


Question

I have a dataframe column with documents like


38909    Hotel is an old style Red Roof and has not bee...
38913    I will never ever stay at this Hotel again. I ...
38914    After being on a bus for -- hours and finally ...
38918    We were excited about our stay at the Blu Aqua...
38922    This hotel has a great location if you want to...
Name: Description, dtype: object

I have a bag of words like keys = ['Hotel','old','finally'], but the actual length of keys is 44312.

Currently I am using

df.apply(lambda x: sum([i in x for i in keys]))

Which gives the following output based on sample keys


38909    2
38913    2
38914    3
38918    0
38922    1
Name: Description, dtype: int64

When I apply this on actual data for just 100 rows, timeit gives

1 loop, best of 3: 5.98 s per loop

and I have 50000 rows. Is there a faster way of doing the same in nltk or pandas?
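
(At 5.98 s per 100 rows, all 50,000 rows would come to roughly 500 × 5.98 s ≈ 50 minutes.)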

In case you want to look up the documents array:

array([ 'Hotel is an old style Red Roof and has not been renovated up to the new standard, but the price was also not up to the level of the newer style Red Roofs. So, in overview it was an OK stay, and a safe',
   'I will never ever stay at this Hotel again. I stayed there a few weeks ago, and I had my doubts the second I checked in. The guy that checked me in, I think his name was Julio, and his name tag read F',
   "After being on a bus for -- hours and finally arriving at the Hotel Lawerence at - am, I bawled my eyes out when we got to the room. I realize it's suppose to be a boutique hotel but, there was nothin",
   "We were excited about our stay at the Blu Aqua. A new hotel in downtown Chicago. It's architecturally stunning and has generally good reviews on TripAdvisor. The look and feel of the place is great, t",
   'This hotel has a great location if you want to be right at Times Square and the theaters. It was an easy couple of blocks for us to go to theater, eat, shop, and see all of the activity day and night '], dtype=object)

Answer

The following code is not exactly equivalent to your (slow) version, but it demonstrates the idea:

keyset = frozenset(keys)
df.apply(lambda x: len(keyset.intersection(x.split())))
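
The speedup comes from the change in complexity: your version performs one substring scan over the whole document for each of the 44,312 keys, while keyset.intersection(x.split()) does a single hash lookup per word of the document, independent of the number of keys.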

Differences/limitations:

  1. In your version a word is counted even if it occurs only as a substring of a word in the document. For example, had your keys contained the word tyl, it would be counted due to the occurrence of "style" in your first document.
  2. My solution doesn't account for punctuation in the documents. For example, the word again in the second document comes out of split() with the full stop attached to it. That can be fixed by preprocessing the document (or postprocessing the result of split()) with a function that removes the punctuation, as in the sketch below.
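
For example, here is a minimal sketch of such preprocessing (assuming the same df and keys as above; the helper name strip_punct is mine, not from the original answer, and string.punctuation only covers ASCII punctuation):

import string

keyset = frozenset(keys)
# Translation table that maps every ASCII punctuation character to None,
# i.e. deletes it from the string
punct_table = str.maketrans('', '', string.punctuation)

def strip_punct(doc):
    # 'again.' becomes 'again', so it can match the key 'again'
    return doc.translate(punct_table)

df.apply(lambda x: len(keyset.intersection(strip_punct(x).split())))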
