Pandas NLTK tokenizing "unhashable type: 'list'"

Question

Following this example: Twitter data mining with Python and Gephi: Case synthetic biology

CSV to: df['Country', 'Responses']

'Country'
Italy
Italy
France
Germany

'Responses' 
"Loren ipsum..."
"Loren ipsum..."
"Loren ipsum..."
"Loren ipsum..."

  1. Tokenize the text in 'Responses'
  2. Remove the 100 most common words (based on brown.corpus)
  3. Identify the 100 most common remaining words

I can get through steps 1 and 2, but get an error on step 3:

TypeError: unhashable type: 'list'

I believe it's because I'm working in a dataframe and have made this (likely erroneous) modification:

Original example:

#divide to words
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(tweets)

My code:

#divide to words
tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

My complete code:

import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.corpus import brown

df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

words =  df['tokenized_sents']

#remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

Out: ['the',
 ',',
 '.',
 'of',
 'and',
...]

#keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

TypeError: unhashable type: 'list'

There are many questions on unhashable lists, but none that I understand to be quite the same. Any suggestions? Thanks.

Traceback

TypeError                                 Traceback (most recent call last)
<ipython-input-164-a0d17b850b10> in <module>()
  1 #keep only most common words
----> 2 fdist = FreqDist(words)
  3 mostcommon = fdist.most_common(100)
  4 mclist = []
  5 for i in range(len(mostcommon)):

/home/*******/anaconda3/envs/*******/lib/python3.5/site-packages/nltk/probability.py in __init__(self, samples)
    104         :type samples: Sequence
    105         """
--> 106         Counter.__init__(self, samples)
    107 
    108     def N(self):

/home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in __init__(*args, **kwds)
    521             raise TypeError('expected at most 1 arguments, got %d' % len(args))
    522         super(Counter, self).__init__()
--> 523         self.update(*args, **kwds)
    524 
    525     def __missing__(self, key):

/home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in update(*args, **kwds)
    608                     super(Counter, self).update(iterable) # fast path when counter is empty
    609             else:
--> 610                 _count_elements(self, iterable)
    611         if kwds:
    612             self.update(kwds)

TypeError: unhashable type: 'list'

Answer

The FreqDist function takes an iterable of hashable objects (intended to be strings, but it probably works with anything hashable). The error you're getting is because you pass in an iterable of lists. As you suggested, this is because of the change you made:

df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

If I understand the Pandas apply function documentation correctly, that line applies nltk.word_tokenize to each entry of the Responses series. word_tokenize returns a list of words, so the new column holds one list per row.
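To see why that trips up FreqDist, here is a minimal sketch reproducing the error (a toy DataFrame stands in for the CSV; the column names are taken from the question):

import pandas as pd
import nltk
from nltk import FreqDist

# nltk.download('punkt') may be needed once for word_tokenize

# Toy stand-in for the real CSV, just to show the shape of the data
df = pd.DataFrame({'Responses': ["Lorem ipsum dolor", "Lorem ipsum"]})

df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
print(df['tokenized_sents'].tolist())
# [['Lorem', 'ipsum', 'dolor'], ['Lorem', 'ipsum']]

# FreqDist subclasses collections.Counter, so every element it counts
# must be hashable; iterating the Series yields lists, hence the error.
try:
    FreqDist(df['tokenized_sents'])
except TypeError as e:
    print(e)  # unhashable type: 'list'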

As a solution, simply add the lists together before trying to apply FreqDist, like so:

allWords = []
for wordList in words:
    allWords += wordList
FreqDist(allWords)
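An equivalent way to flatten, if you prefer (my variation, not part of the original answer), is itertools.chain:

from itertools import chain

# Chain the per-row token lists into one flat list of words
allWords = list(chain.from_iterable(words))
FreqDist(allWords)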

A more complete revision to do what you would like. If all you need is to identify the second set of 100 words, note that mclist will hold that set the second time through.

import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.corpus import brown

df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

lists =  df['tokenized_sents']
words = []
for wordList in lists:
    words += wordList

#remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

Out: ['the',
 ',',
 '.',
 'of',
 'and',
...]

#keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
# mclist contains second-most common set of 100 words
words = [w for w in words if w in mclist]
# this will keep ALL occurrences of the words in mclist
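One optional tweak, not in the original answer: membership tests against a plain list are linear, so on a large corpus it is worth converting mclist to a set before filtering:

# set membership is O(1) per lookup, vs O(n) for a list
mcset = set(mclist)
words = [w for w in words if w in mcset]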
