Sentence tokenization for texts that contain quotes

Question

Code:

from pprint import pprint
from nltk.tokenize import sent_tokenize
from unidecode import unidecode
pprint(sent_tokenize(unidecode(text)))

Output:

['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
 'Finally they pushed you out of the cold emergency room.',
 'I failed to protect you.',
 '"Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.',]

Input:

After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

The closing quote should stay with the previous sentence (... protect you.") instead of being attached to the start of the next one ("Li Na ...).

The tokenizer fails at the sequence ." (a period followed by a closing quote). How can I fix this?

Explanation of how the text is extracted:

html = open(path, "r").read()                           # read the raw HTML
article = extractor.extract(raw_html=html)              # extract the article content
text = unidecode(article.cleaned_text)                  # transliterate Unicode text to ASCII

Here, article.cleaned_text is Unicode. The idea behind using unidecode is to change curly quotation marks (“ and ”) to the plain ASCII ".
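
If the only goal is to turn curly quotation marks into straight ones, a narrower alternative to unidecode might be a plain translation table. The sketch below is an illustration in Python 3 (the thread itself uses Python 2.7); QUOTE_MAP and normalize_quotes are made-up names, and only four quote characters are covered, so it is not a replacement for a full transliteration.

```python
# Sketch only: map common curly quotation marks to their ASCII counterparts.
# QUOTE_MAP and normalize_quotes are illustrative names, not part of unidecode.
QUOTE_MAP = {
    0x201C: '"',  # left double quotation mark
    0x201D: '"',  # right double quotation mark
    0x2018: "'",  # left single quotation mark
    0x2019: "'",  # right single quotation mark
}


def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(QUOTE_MAP)


print(normalize_quotes('\u201cI failed to protect you.\u201d'))
# prints: "I failed to protect you."
```

Unlike unidecode, this leaves every other non-ASCII character untouched, which may or may not be what is wanted.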

Result with @alvas's solution (still incorrect):

['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
 'Finally they pushed you out of the cold emergency room.',
 'I failed to protect you.',
 '"',
 'Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.'
]

Edit 2: (Updated) nltk and Python versions:

python -c "import nltk; print nltk.__version__"
3.0.4
python -V
Python 2.7.9

Recommended answer

I'm not sure what the desired output is, but I think you might need some paragraph segmentation before nltk.sent_tokenize, i.e.:

>>> text = """After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
... 
... Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015."""
>>> from nltk import sent_tokenize
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     for sent in sent_tokenize(pg):
...             print sent
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
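
Text extracted from HTML may contain spaces or tabs on the "blank" line between paragraphs, in which case text.split('\n\n') would not find the break. A slightly more tolerant split could look like the sketch below. It is written for Python 3 (the session above is Python 2.7), and split_paragraphs is a hypothetical helper name, not something provided by nltk.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines, tolerating whitespace on the 'blank' line."""
    parts = re.split(r'\n\s*\n', text)
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]


sample = 'I failed to protect you."\n \t\nLi Na, 23, was looking forward to 2015.'
for paragraph in split_paragraphs(sample):
    print(paragraph)
```

Each returned paragraph can then be passed to sent_tokenize on its own, as in the loop above.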

You might also want the strings inside the double quotes; if so, you could try this:

>>> import re
>>> str_in_doublequotes = r'"([^"]*)"'
>>> re.findall(str_in_doublequotes, text)
['Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.']
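
The pattern above only sees straight double quotes. If the text has not been passed through unidecode and still contains curly quotes, one possible variant is sketched below (Python 3; QUOTED and find_quoted are illustrative names). Like the original pattern, it assumes quotes are balanced and not nested.

```python
import re

# Match text between straight quotes, or between a curly opening/closing pair.
QUOTED = re.compile(r'"([^"]*)"|\u201c([^\u201d]*)\u201d')


def find_quoted(text):
    """Return the contents of every double-quoted span, in order of appearance."""
    # findall yields (straight, curly) tuples; exactly one side is filled per match.
    return [straight or curly for straight, curly in QUOTED.findall(text)]


print(find_quoted('He wrote: "A. B." She replied: \u201cC.\u201d'))
# prints: ['A. B.', 'C.']
```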

Or you might need this:

>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: 
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

When reading from a file, try using the io package:

alvas@ubi:~$ echo -e """After Du died of suffocation, her boyfriend posted a heartbreaking message online: \"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.\"\n\nLi Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.""" > in.txt
alvas@ubi:~$ cat in.txt 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from nltk import sent_tokenize
>>> text = io.open('in.txt', 'r', encoding='utf8').read()
>>> print text
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

>>> for sent in sent_tokenize(text):
...     print sent
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

And with the paragraph and quote extraction hacks:

>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: 
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

To concatenate the pre-quote sentence with the quote (don't blink, it looks almost the same as above; the only change is the trailing comma in print sent, which suppresses the newline in Python 2):

>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent,
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online:  "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
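
Another option, not part of the original answer, is to repair the tokenizer output afterwards: if a quotation is still open and the next item starts with a quote character, that quote is moved back to the end of the previous sentence. The sketch below is Python 3 and works on any list of strings, so it does not call nltk itself; reattach_closing_quotes is a hypothetical name, and the logic assumes straight double quotes that are balanced in the original text.

```python
def reattach_closing_quotes(sentences, quote='"'):
    """Move a closing quote that drifted onto the next item back to the previous sentence."""
    fixed = []
    quote_open = False  # True while we are inside an unclosed quotation
    for sent in sentences:
        if quote_open and fixed and sent.startswith(quote):
            # This leading quote closes the open quotation: glue it to the previous sentence.
            fixed[-1] += quote
            sent = sent[len(quote):].lstrip()
            quote_open = False
        if sent.count(quote) % 2 == 1:
            # An odd number of quotes flips the open/closed state.
            quote_open = not quote_open
        if sent:
            fixed.append(sent)
    return fixed


broken = ['He wrote: "I am tired.', 'I failed.', '"Li Na was 23.']
print(reattach_closing_quotes(broken))
# prints: ['He wrote: "I am tired.', 'I failed."', 'Li Na was 23.']
```

The same function also handles the case where the closing quote ends up as a separate item in the list.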

The problem with the above code is that it is limited to sentences like:

After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

and cannot handle:

"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you," her boyfriend posted a heartbreaking message online after Du died of suffocation.

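One way to cover both the quote-last and the quote-first layout might be to mask each quoted span with a placeholder before tokenizing and to restore it afterwards, so the tokenizer never sees the periods inside the quote. The following is a rough Python 3 sketch under several assumptions: quotes are straight and balanced, the placeholder text QUOTE<n>MASK does not occur in the input, and naive_sent_tokenize is only a stand-in so the example runs without nltk (in practice nltk's sent_tokenize could be passed as the tokenize argument).

```python
import re

QUOTE_RE = re.compile(r'"[^"]*"')


def naive_sent_tokenize(text):
    """Hypothetical stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def tokenize_keeping_quotes(paragraph, tokenize=naive_sent_tokenize):
    """Tokenize a paragraph while treating each double-quoted span as one unit."""
    quotes = []

    def mask(match):
        # Remember the quoted span and leave a numbered placeholder in its place.
        quotes.append(match.group(0))
        return 'QUOTE{}MASK'.format(len(quotes) - 1)

    masked = QUOTE_RE.sub(mask, paragraph)

    def unmask(match):
        return quotes[int(match.group(1))]

    # Tokenize the masked text, then put the original quotes back.
    return [re.sub(r'QUOTE(\d+)MASK', unmask, s) for s in tokenize(masked)]


post = '"I am tired. I failed to protect you," her boyfriend posted online.'
print(tokenize_keeping_quotes(post))
```

A known limitation of this sketch: because the quote's own final period is hidden, a quote that ends a sentence is glued to whatever follows it in the same paragraph, so it is best applied after splitting the text into paragraphs.
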
Just to be sure, my Python/nltk versions are:

$ python -c "import nltk; print nltk.__version__"
'3.0.3'
$ python -V
Python 2.7.6


Beyond the computational aspect of the text processing, there's something subtly different about the grammar of the text in the question.

The fact that the quote is introduced by a colon (:) is atypical of traditional English grammar. This might have been popularized in Chinese news writing, because in Chinese:

啊杜窒息死亡后,男友在网上发了令人心碎的消息: "..."

In traditional English, in a very prescriptive grammatical sense, it would have been:

After Du died of suffocation, her boyfriend posted a heartbreaking message online, "..."

And a post-quotation statement would have been signalled by a comma at the end of the quote instead of a full stop, e.g.:

"...," her boyfriend posted a heartbreaking message online after Du died of suffocation.
