Extract English words from a string in Python


Question

I have a document in which each line is a string. A line may contain digits, non-English letters and words, and symbols (such as ! and *). I want to extract the English words from each line (English words are separated by spaces). My code is below; it is the map function of my map-reduce job. However, judging by the final result, this mapper only produces frequency counts for single letters (such as a, b, c). Can anyone help me find the bug? Thanks.

import sys
import re

for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print '%s\t%s' % (word, 1)

Recommended answer

You've actually got two problems.

First, this:

line = re.sub("[^A-Za-z]", "", line.strip())

This removes all non-letters from the line, spaces included. That means you no longer have any whitespace to split on, and therefore no way to separate the line into words.
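
A quick way to see that first problem, as a minimal Python 3 sketch (the question's code is Python 2, and the sample line here is made up for illustration):

```python
import re

# Hypothetical sample line, not taken from the original question.
line = "Hello, world 42!"

# Same substitution as in the question: every non-letter,
# including the spaces between words, is deleted.
stripped = re.sub("[^A-Za-z]", "", line.strip())

print(repr(stripped))    # 'Helloworld'
print(stripped.split())  # ['Helloworld']
```

With the spaces gone, `split()` has nothing to split on and returns the whole line as one "word".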

Next, even if you didn't do that, you do this:

words = ' '.join(line.split())

This doesn't give you a list of words; it gives you a single string, with all those words joined back together by single spaces. (Basically, the original line with every run of whitespace collapsed into a single space.)

So, in the next line, when you do this:

for word in words:

You're iterating over a string, which means each word is a single character, because that's what strings are: iterables of characters.
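
The difference between the joined string and the split list can be checked directly. This is a small Python 3 sketch with a made-up sample line:

```python
# Hypothetical sample input with irregular spacing.
line = "the  quick   brown fox"

joined = " ".join(line.split())  # a str: 'the quick brown fox'
as_list = line.split()           # a list: ['the', 'quick', 'brown', 'fox']

# Iterating over the str yields characters (spaces included) ...
print(list(joined)[:5])          # ['t', 'h', 'e', ' ', 'q']

# ... while iterating over the list yields whole words.
print(as_list)                   # ['the', 'quick', 'brown', 'fox']
```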

If you want each word (as your variable names imply), you already had them; the problem is that you joined them back into a string. Just don't do that:

words = line.split()
for word in words:

Or, if you want to strip out things besides letters and whitespace, use a regular expression that strips out everything besides letters and whitespace, not one that strips out everything besides letters, including whitespace:

line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:

However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or would you rather get the two strings 'abc' and 'def'? You probably want either this:

line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:

…or just:

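# Split on runs of non-letters and drop the empty strings re.split leaves at the edges.
words = [w for w in re.split(r"[^A-Za-z]+", line.strip()) if w]
for word in words: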
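
Putting the pieces together, a corrected mapper might look roughly like the sketch below. It is written for Python 3 (the question uses Python 2's `print` statement), and the helper names `extract_words` and `map_lines` as well as the sample lines are made up for illustration. Note that "English word" here only means a run of ASCII letters; nothing is checked against a dictionary.

```python
import re
import sys


def extract_words(line):
    """Return the lowercase runs of ASCII letters found in line."""
    # Replace each non-letter with a space, so 'abc1def' becomes
    # 'abc def' and splits into two words instead of merging into one.
    cleaned = re.sub(r"[^A-Za-z]", " ", line)
    return cleaned.lower().split()


def map_lines(lines, out=sys.stdout):
    """Write one 'word<TAB>1' record per extracted word."""
    for line in lines:
        for word in extract_words(line):
            out.write("%s\t%d\n" % (word, 1))


# Hypothetical sample input; a streaming job would pass sys.stdin instead.
sample = ["Hello, world 42!", "abc1def *** 你好 test"]
map_lines(sample)
```

Keeping the word extraction in its own function makes it easy to test without running the whole job; calling `map_lines(sys.stdin)` would turn it into the streaming mapper the question describes.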
