R将语料库分解为句子 [英] R break corpus into sentences

查看:121
本文介绍了R将语料库分解为句子的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

  1. 我有许多PDF文档,已经将其阅读成库tm的语料库.一个人怎么能把语料分解成句子呢?

  1. I have a number of PDF documents, which I have read into a corpus with library tm. How can one break the corpus into sentences?

这可以通过从软件包qdap [*]中读取readLinessentSplit的文件来完成.该功能需要一个数据框.它还将需要放弃语料库并单独读取所有文件.

It can be done by reading the file with readLines followed by sentSplit from package qdap [*]. That function requires a dataframe. It would also would require to abandon the corpus and read all files individually.

如何在tm中的语料库上传递函数sentSplit {qdap}?还是有更好的方法?.

How can I pass function sentSplit {qdap} over a corpus in tm? Or is there a better way?.

注意:库openNLP中有一个函数sentDetect,现在为Maxent_Sent_Token_Annotator-适用相同的问题:如何将其与语料库[tm]结合?/p>

Note: there was a function sentDetect in library openNLP, which is now Maxent_Sent_Token_Annotator - the same question applies: how can this be combined with a corpus [tm]?

推荐答案

我不知道如何重塑语料库,但这将是一个很棒的功能.

I don't know how to reshape a corpus but that would be a fantastic functionality to have.

我想我的方法将是这样的:

I guess my approach would be something like this:

使用这些软件包

# Load Packages
require(tm)
require(NLP)
require(openNLP)

我将文本设置为句子功能,如下所示:

I would set up my text to sentences function as follows:

convert_text_to_sentences <- function(text, lang = "en") {
  # Function to compute sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'. 
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)

  # Convert text to class String from package NLP
  text <- as.String(text)

  # Sentence boundaries in text
  sentence.boundaries <- annotate(text, sentence_token_annotator)

  # Extract sentences
  sentences <- text[sentence.boundaries]

  # return sentences
  return(sentences)
}

还有我对整形语料库功能的理解(注意:除非您以某种方式修改此功能并适当地复制它们,否则您将失去元属性)

And my hack of a reshape corpus function (NB: you will lose the meta attributes here unless you modify this function somehow and copy them over appropriately)

reshape_corpus <- function(current.corpus, FUN, ...) {
  # Extract the text from each document in the corpus and put into a list
  text <- lapply(current.corpus, Content)

  # Basically convert the text
  docs <- lapply(text, FUN, ...)
  docs <- as.vector(unlist(docs))

  # Create a new corpus structure and return it
  new.corpus <- Corpus(VectorSource(docs))
  return(new.corpus)
}

其工作方式如下:

## create a corpus
dat <- data.frame(doc1 = "Doctor Who is a British science fiction television programme produced by the BBC. The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor. He explores the universe in his TARDIS (acronym: Time and Relative Dimension in Space), a sentient time-travelling space ship. Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired. Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.",
                  doc2 = "The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive (2005–10) awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.[3][4] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor. In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody \"for evolving with technology and the times like nothing else in the known television universe.\"[5]",
                  doc3 = "The programme is listed in Guinness World Records as the longest-running science fiction television show in the world[6] and as the \"most successful\" science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.[7] During its original run, it was recognised for its imaginative stories, creative low-budget special effects, and pioneering use of electronic music (originally produced by the BBC Radiophonic Workshop).",
                  stringsAsFactors = FALSE)

current.corpus <- Corpus(VectorSource(dat))
# A corpus with 3 text documents

## reshape the corpus into sentences (modify this function if you want to keep meta data)
reshape_corpus(current.corpus, convert_text_to_sentences)
# A corpus with 10 text documents

我的sessionInfo输出

My sessionInfo output

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
  [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] NLP_0.1-0     openNLP_0.2-1 tm_0.5-9.1   

loaded via a namespace (and not attached):
  [1] openNLPdata_1.5.3-1 parallel_3.0.1      rJava_0.9-4         slam_0.1-29         tools_3.0.1  

这篇关于R将语料库分解为句子的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆