How do I split a list of phrases into words so I can use counter on them?

Views: 151 | Posted: 2016/8/5 19:15:43 | Tags: python beautifulsoup counter

Problem description

My data are conversation threads from a web forum. I created a function to clean the data of stop words, punctuation, and such. Then I created a loop to clean all the posts in my csv file and put them into a list. Then I did the word count. My problem is that the list contains unicode phrases rather than individual words. How can I split up the phrases so that they are individual words I can count? Here is my code:

def post_to_words(raw_post):
    HTML_text = BeautifulSoup(raw_post).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

clean_Post_Text = post_to_words(fiance_forum["Post_Text"][0])
clean_Post_Text_split = clean_Post_Text.lower().split()
num_Post_Text = fiance_forum["Post_Text"].size
clean_posts_list = [] 

for i in range(0, num_Post_Text):
    clean_posts_list.append( post_to_words( fiance_forum["Post_Text"][i]))

from collections import Counter
counts = Counter(clean_posts_list)
print(counts)

My output looks like this: u'please follow instructions notice move receiver': 1

I want it to look like this:

please: 1

follow: 1

instructions: 1

and so on. Thanks so much!

Recommended answer

You already have a list of words, so you don't need to split anything. Drop the str.join call, i.e. " ".join(meaningful_words), create a Counter dict, and update it on each call to post_to_words. You are also doing way too much work: all you need to do is iterate over fiance_forum["Post_Text"], passing each element to the function. You also only need to create the set of stopwords once, not on every iteration:

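# Imports assumed from the names used in the question (bs4 and nltk)
import re
from collections import Counter

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def post_to_words(raw_post, st):
    # Strip the HTML, keep letters only, lowercase, and split into words
    HTML_text = BeautifulSoup(raw_post).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    # Return a generator of the words that are not stopwords
    return (w for w in words if w not in st)


cn = Counter()
st = set(stopwords.words("english"))  # build the stopword set once
for post in fiance_forum["Post_Text"]:
    cn.update(post_to_words(post, st))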

That also avoids the need to create a huge list of words by doing the counting as you go.
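
If the phrases have already been joined into strings, as in the question, a rough sketch of the difference between counting phrases and counting words might look like the following. The two sample strings are made-up stand-ins for the cleaned posts, and phrase_counts and word_counts are names chosen here for illustration; only collections.Counter and str.split are used.

```python
from collections import Counter

# Hypothetical stand-ins for the cleaned posts (not data from the question)
clean_posts_list = [
    "please follow instructions",
    "please notice move receiver",
]

# Counting the list directly treats each whole phrase as a single key
phrase_counts = Counter(clean_posts_list)

# Splitting every phrase on whitespace first gives one key per word
word_counts = Counter(
    word for phrase in clean_posts_list for word in phrase.split()
)

print(phrase_counts)
print(word_counts.most_common(3))
```

Feeding Counter a generator keeps the counting incremental, which is the same idea as calling cn.update for each post.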
