Open a file and read sentences

Problem description

I want to open a file and get its sentences. The sentences in the file run across lines, like this:

"He said, 'I'll pay you five pounds a week if I can have it on my own
terms.'  I'm a poor woman, sir, and Mr. Warren earns little, and the
money meant much to me.  He took out a ten-pound note, and he held it
out to me then and there. 

Currently I'm using this code:

text = ' '.join(file_to_open.readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

readlines cuts through the sentences. Is there a good way to solve this so that I get only the sentences? (Without NLTK.)

Current problem:

file_to_read = 'test.txt'

with open(file_to_read) as f:
    text = f.read()

import re
word_list = ['Mrs.', 'Mr.']     

for i in word_list:
    text = re.sub(i, i[:-1], text)

What I get back (in the test case) is that Mrs. is changed to Mr, while Mr. is just Mr. I tried several other things, but they don't seem to work. The answer is probably easy, but I'm missing it.

Recommended answer

Your regex works on the text above if you read the whole file at once:

import re

with open(filename) as f:  # filename: path to your text file
    text = f.read()

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The only problem is that the regex splits on the dot in "Mr." in your text above, so you need to fix or change that.

One solution, though not perfect, is to take out every occurrence of a dot after Mr:

text = re.sub(r'(M\w{1,2})\.', r'\1', text) # no for loop needed for this, like there was before

This matches an 'M' followed by a minimum of 1 and a maximum of 2 word characters (\w{1,2}), followed by a dot. The parenthesised part of the pattern is grouped and captured, and it is referenced in the replacement as '\1' (group 1, as you could have more parenthesised groups). So essentially, the Mr. or Mrs. is matched, but only the Mr or Mrs part is captured, and the Mr. or Mrs. is then replaced by the captured part, which excludes the dot.
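As a side note on why the loop in the question misbehaved: in a regex an unescaped `.` matches any character, so the pattern `Mr.` also matches the `Mrs` left over once the dot of `Mrs.` has been removed. A minimal sketch of an alternative, assuming the list of abbreviations is fixed and known in advance (the names `ABBREVIATIONS` and `strip_abbrev_dots` are hypothetical, not from the original answer), escapes each abbreviation with `re.escape` before substituting:

```python
import re

# Hypothetical list of abbreviations whose trailing dot should be dropped.
ABBREVIATIONS = ['Mrs.', 'Mr.']

def strip_abbrev_dots(text):
    """Remove the trailing dot from each known abbreviation."""
    for abbrev in ABBREVIATIONS:
        # re.escape makes the dot literal, so 'Mr.' no longer matches 'Mrs';
        # \b keeps the abbreviation from matching inside a longer word.
        pattern = r'\b' + re.escape(abbrev)
        text = re.sub(pattern, abbrev[:-1], text)
    return text

sample = "Mrs. Warren and Mr. Warren earn little."
print(strip_abbrev_dots(sample))
```

Compared with the `M\w{1,2}` pattern, this only touches the abbreviations that are listed, so a sentence ending in a short word such as "Max." keeps its final dot.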

Then:

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

will work the way you want.
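Putting the steps together, here is a rough end-to-end sketch. It uses the two regexes from this answer unchanged; the function name `split_sentences`, the whitespace-collapsing step, and the filtering of empty strings are additions for illustration, not part of the original answer. The sample text is passed as a string so the example runs without a file.

```python
import re

def split_sentences(text):
    """Split text into sentences using the regexes from the answer."""
    # Collapse newlines and runs of spaces so wrapped lines join cleanly.
    text = ' '.join(text.split())
    # Drop the dot after 'M' plus one or two word characters (Mr., Mrs.).
    text = re.sub(r'(M\w{1,2})\.', r'\1', text)
    # Split on ., ? or !, plus any closing quotes or brackets after it.
    parts = re.split(r' *[\.\?!][\'"\)\]]* *', text)
    # re.split leaves an empty string after the final terminator.
    return [p for p in parts if p]

sample = (
    "\"He said, 'I'll pay you five pounds a week if I can have it on my own\n"
    "terms.'  I'm a poor woman, sir, and Mr. Warren earns little, and the\n"
    "money meant much to me.  He took out a ten-pound note, and he held it\n"
    "out to me then and there. \n"
)

for sentence in split_sentences(sample):
    print(sentence)
```

With a file, the same function could be called on the result of `f.read()`. Note that the sentences come back without their closing punctuation, because `re.split` discards whatever the pattern matched.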
