How to extract quotations from text using NLTK

Question

I have a project in which I need to extract quotations from a huge set of articles. By quotations I mean things said by people, e.g.: Alen said "text to be extracted." I'm using NLTK for my other NLP-related tasks, so any solution using NLTK or any other Python library would be quite useful.

Thanks.

Recommended answer

As Mayur mentioned, you can use a regular expression to pick up everything between quotation marks:

import re
quotes = re.findall(r'".*?"', text)  # non-greedy: each match stops at the next closing quote

The problem you'll run into is that a surprising amount of the text that appears between quotation marks is not actually a quotation.
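To make the false-positive problem concrete, here is a minimal sketch. The sample sentence is invented for illustration; only the first of the three matches is something a person actually said, while the other two are scare quotes and a title.

```python
import re

# Invented sample text: one real quotation, one scare-quoted word, one title.
sample = (
    'Alen said "text to be extracted." '
    'The so-called "experts" disagreed, citing the essay "On Quoting".'
)

# Capture the text between each pair of straight double quotes (non-greedy).
candidates = re.findall(r'"(.*?)"', sample)

for candidate in candidates:
    print(candidate)
# text to be extracted.
# experts
# On Quoting
```

A plain between-the-quotes pattern cannot tell these cases apart, which is why some extra signal around the quotation marks is needed.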

If you're working with academic articles, you can look for a number after the closing quotation mark to pick up the footnote number. For non-academic articles, you could instead try something like:

"(said|writes|argues|concludes)(,)? \".*?\""

This can be more precise, but it risks missing quotations such as block quotes. (Block quotes will cause you problems anyway, because they can include a newline before the closing quotation mark.)
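Both heuristics can be sketched in a few lines of Python. The names `ATTRIBUTED` and `FOOTNOTED` and the sample strings are hypothetical. The footnote pattern assumes the footnote marker is a run of plain digits placed directly after the closing quotation mark, which will not hold for every citation style. Passing `re.DOTALL` is one possible way to let a match run across the newlines inside a block quote; it is offered here as an option rather than a complete fix.

```python
import re

# A reporting verb, an optional comma, then a quoted span.
# re.DOTALL lets "." match newlines, so a multi-line quote is kept whole.
ATTRIBUTED = re.compile(
    r'\b(?:said|writes|argues|concludes),? "(.*?)"', re.DOTALL
)

# A quoted span followed immediately by a footnote number.
# Assumption: the footnote marker is plain digits right after the quote.
FOOTNOTED = re.compile(r'"([^"]*)"(\d+)')

# Invented examples for illustration.
news = 'Alen said, "the first line\nand the second line." Critics call it "bold".'
paper = 'Smith calls this "a category error"12 and moves on.'

print(ATTRIBUTED.findall(news))   # ['the first line\nand the second line.']
print(FOOTNOTED.findall(paper))   # [('a category error', '12')]
```

Note that "bold" is skipped because no reporting verb precedes it, which is the gain in precision. The cost is that any quotation introduced by a verb outside the list is missed as well.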

As for using NLTK, I can't think of anything there that would be of much help, other than perhaps WordNet for finding synonyms for "said".
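A rough sketch of the WordNet idea follows. The helper names `wordnet_speech_verbs` and `speech_verb_pattern` are hypothetical. The WordNet lookup assumes NLTK is installed and the WordNet corpus has been downloaded (for example with `nltk.download("wordnet")`). WordNet returns base forms such as "say" or "state", so inflected forms like "said" or "stated" would still need to be produced separately, and the synonym list will probably need manual pruning.

```python
import re


def wordnet_speech_verbs(seed="say"):
    """Collect verb lemmas from every WordNet verb synset of `seed`.

    Assumes NLTK and its WordNet data are available; imported lazily so the
    regex helper below can be used without them.
    """
    from nltk.corpus import wordnet as wn

    verbs = set()
    for synset in wn.synsets(seed, pos=wn.VERB):
        for lemma in synset.lemma_names():
            verbs.add(lemma.replace("_", " "))  # multi-word lemmas use "_"
    return verbs


def speech_verb_pattern(verbs):
    """Build a 'verb, optional comma, quoted span' regex from a verb list."""
    # Longest first, so a longer verb is tried before any shorter prefix of it.
    alternation = "|".join(
        sorted((re.escape(v) for v in verbs), key=len, reverse=True)
    )
    return re.compile(r'\b(?:%s),? "(.*?)"' % alternation, re.DOTALL)


# Usage with a hand-written list of inflected verbs:
pattern = speech_verb_pattern(["said", "stated"])
print(pattern.findall('Alen stated, "text to be extracted." Then he left.'))
# ['text to be extracted.']
```

Keeping the verb list as a plain argument means the WordNet output can be reviewed, inflected, and trimmed before it is turned into a pattern.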
