Extracting words from txt file using python

Question

I want to extract all the words that are between single quotation marks from a text file. The text file looks like this:

u'MMA': 10,
u'acrylic': 19,
u'acting lessons': 2,
u'aerobic': 141,
u'alto': 24,
u'art therapy': 4,
u'ballet': 939,
u'ballroom': 234,
u'banjo': 38,

And ideally, my output would look like this:

MMA,
acrylic,
acting lessons,
...

From browsing posts, it seems like I should use some combination of NLTK / regex for python to accomplish this. I've tried the following:

import re

file = open('artsplus_categories.txt', 'r').readlines()

for line in file:
    list = re.search('^''$', file)

file.close()

and got the following error:

  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

I think the error might be caused by how I'm looking for the pattern. My logic is that I search for everything inside of the '....'.

What's tripping up re.py?

Thanks!

Following Ashwini's comment:

import re

file = open('artsplus_categories.txt', 'r').readlines()

for line in file:
    list = re.search('^''$', line)

print list

#file.close()

But the output contains nothing:

Samuel-Finegolds-MacBook-Pro:~ samuelfinegold$ /var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup\ At\ Startup/artsplus_categories_clean-393952531.278.py.command ; exit;
None
logout



@Rasco: here's the error I'm getting:

File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
logout

I'm using the following code:

file2 = open('artsplus_categories.txt', 'r').readlines()
list = re.findall("'[^']*'", file2)
for x in list:
    print (x)

Recommended answer

Instead of passing each line to the regex, you passed it the whole list (file). readlines() returns a list of strings, and re.search accepts only a single string, which is why it raises "TypeError: expected string or buffer". Pass line to re.search, not file.

for line in file:
    lis = re.search('^''$', line) # line not file
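
Passing line removes the TypeError, but the pattern itself still cannot find anything, which is why the follow-up attempt printed None. A minimal sketch of both effects, written for Python 3 and using two sample lines copied from the question (the variable names here are illustrative only):

```python
import re

# Python joins adjacent string literals, so '^''$' is simply '^$'.
pattern = '^''$'
assert pattern == '^$'

lines = ["u'MMA': 10,\n", "u'acrylic': 19,\n"]

# '^$' matches only an empty line, so each of these lines yields None.
results = [re.search(pattern, line) for line in lines]
print(results)

# Passing the whole list instead of a single line raises TypeError.
try:
    re.search(pattern, lines)
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)
```

To match a literal single quote, the simplest route is to wrap the pattern in double quotes, for example "'(.*)'".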

Don't use list or file as variable names. Both are built-in names in Python 2, and assigning to them hides the built-ins for the rest of the scope.
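
A small sketch of what shadowing can lead to, assuming Python 3 (where list is still a built-in; file is not). The function name shadow_demo is made up for this illustration:

```python
def shadow_demo():
    # The local name "list" now hides the built-in list type in this function.
    list = ["MMA", "acrylic"]
    try:
        # This calls the local list object, not the built-in type.
        list("abc")
    except TypeError as exc:
        return str(exc)
    return None

message = shadow_demo()
print(message)
```

Names such as lines, matches, or words avoid the problem entirely.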

Update:

with open('artsplus_categories.txt') as f:
    for line in f:
        print re.search(r"'(.*)'", line).group(1)
...         
MMA
acrylic
acting lessons
aerobic
alto
art therapy
ballet
ballroom
banjo
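
The snippet above is Python 2, and .group(1) would raise AttributeError on any line without a quoted word, because re.search returns None there. The following is a hedged Python 3 variant, not part of the original answer: it uses re.findall on each line, which returns an empty list when nothing matches and also fixes the findall attempt from the question, where the whole readlines() list was passed instead of a string. The helper name extract_quoted and the inline sample list are assumptions for illustration.

```python
import re

def extract_quoted(lines):
    """Return the text found between single quotes on each line."""
    words = []
    for line in lines:
        # [^']* stops at the next quote, so several quoted items
        # on the same line are kept separate.
        words.extend(re.findall(r"'([^']*)'", line))
    return words

# Sample lines copied from the question's text file.
sample = [
    "u'MMA': 10,",
    "u'acrylic': 19,",
    "u'acting lessons': 2,",
]
words = extract_quoted(sample)
print(",\n".join(words))
```

To run it against the real file, an open file object can be passed directly, since iterating over it yields lines: open 'artsplus_categories.txt' in a with block and call extract_quoted(f).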
