How to extract sentences containing specific person names using R


Question

I am using R to extract sentences containing specific person names from texts and here is a sample paragraph:

Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.

In this short paragraph, there are several person names such as: Johann Reuchlin, Melanchthon, Johann Eck. With the help of openNLP package, three person names Martin Luther, Paul and Melanchthon can be correctly extracted and recognized. Then I have two questions:

  1. How could I extract sentences containing these names?
  2. As the output of named entity recognizer is not so promising, if I add "[[ ]]" to each name such as [[Johann Reuchlin]], [[Melanchthon]], how could I extract sentences containing these name expressions [[A]], [[B]] ...?

Recommended answer

You can do this with `strsplit` and `grep`. First I made an object `para` holding your paragraph, then:

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]


> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                                    
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"                                                                               
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"    

Or a bit cleaner:

sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
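
Splitting on `"\\."` drops the final period and leaves a leading space on every sentence after the first, as the output above shows. A minimal sketch of one alternative is to split on the whitespace that follows sentence-ending punctuation. The shortened `para` below is a stand-in for the real paragraph, and the pattern is an assumption about how sentences end; it would still break on abbreviations such as "Dr. Smith".

```r
# Shortened stand-in for the paragraph (assumption, not the full text)
para <- paste(
  "He accepted a call to Wittenberg by Martin Luther.",
  "Melanchthon became professor of Greek at the age of 21.",
  "He studied the Scripture, especially of Paul.",
  "He was present at the disputation of Leipzig (1519) as a spectator."
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# Split at whitespace preceded by ., ! or ? (lookbehind requires perl = TRUE),
# so each sentence keeps its punctuation and has no leading space
sentences <- unlist(strsplit(para, "(?<=[.!?])\\s+", perl = TRUE))

# Keep only the sentences that mention at least one of the names
hits <- sentences[grepl(paste(toMatch, collapse = "|"), sentences)]
print(hits)
```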

If you are looking for the sentences that each person is in as separate returns then:

toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)

[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Edit 3: To add each person's name to the result, do something simple such as:

foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
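
As a variation, the name can label the list element instead of being prepended to the sentences, which keeps names and sentences apart. This is only a sketch: the `sentences` vector is a shortened stand-in, and `byName` is a hypothetical variable name. `fixed = TRUE` treats each name as a literal string rather than a regular expression.

```r
# Shortened stand-in sentences (assumption, not the full paragraph)
sentences <- c(
  "He accepted a call to Wittenberg by Martin Luther",
  "Melanchthon became professor of Greek at the age of 21",
  "He studied the Scripture, especially of Paul",
  "Johann Eck having attacked his views, Melanchthon replied"
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# value = TRUE makes grep return the matching strings instead of indices
byName <- lapply(toMatch, function(nm) {
  grep(nm, sentences, fixed = TRUE, value = TRUE)
})
names(byName) <- toMatch

# Look up the sentences for one person by name
byName[["Melanchthon"]]
```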

EDIT 4:

And if you wanted to find sentences that had multiple people/places/things (words), then just add an argument for those two such as:

toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")

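and set `perl` to TRUE, because the lookahead syntax `(?=...)` is only supported by the Perl-compatible regex engine: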

foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = T)])}


> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"                                                                                                                                         
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] "Paul"                                                                   
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] "Melanchthon"                                                                                                                          
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"                                                                                                     
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
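
If the lookahead pattern feels hard to read, the same "all of these words" condition can be sketched with one `grepl` call per word, combined with a logical AND. The helper name `containsAll` is hypothetical and the sentences are a shortened stand-in.

```r
# Shortened stand-in sentences (assumption, not the full paragraph)
sentences <- c(
  "Melanchthon became professor of Greek at the age of 21",
  "He studied the Scripture, especially of Paul",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture"
)

# Hypothetical helper: TRUE for each element of x that contains every word
containsAll <- function(words, x) {
  # One logical vector per word, combined element-wise with &
  Reduce(`&`, lapply(words, grepl, x = x, fixed = TRUE))
}

both <- sentences[containsAll(c("Melanchthon", "Scripture"), sentences)]
print(both)
```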

Edit 5: To answer your other question:

Given:

sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"

gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])

will give you the words inside the double brackets.

> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen"        "Wittenberg"      "Martin Luther"   "Johann Reuchlin"
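
To get from the bracketed names back to the sentences that contain them, one possible sketch is to run the same pattern over a vector of sentences and then group the sentences by name. The `marked` vector and the variable names below are made up for illustration, and the names are assumed to be tagged with `[[ ]]` beforehand.

```r
# Hypothetical sentences with names already wrapped in [[ ]]
marked <- c(
  "He accepted a call by [[Martin Luther]], recommended by [[Johann Reuchlin]].",
  "He studied the Scripture and Evangelical doctrine.",
  "[[Johann Eck]] having attacked his views, [[Melanchthon]] replied."
)
pattern <- "\\[\\[.*?\\]\\]"

# One character vector of tagged names per sentence (empty if none)
tagged <- regmatches(marked, gregexpr(pattern, marked, perl = TRUE))
namesPerSentence <- lapply(tagged, function(m) gsub("\\[\\[|\\]\\]", "", m))

# Pair every name with the sentence it came from, then group by name
pairs <- data.frame(
  name = unlist(namesPerSentence),
  sentence = rep(marked, lengths(namesPerSentence)),
  stringsAsFactors = FALSE
)
sentencesByName <- split(pairs$sentence, pairs$name)
sentencesByName[["Melanchthon"]]
```

Sentences without any tagged name, such as the second one here, simply drop out of the result.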
