Bind character vector to list into dataframe


Question

I have a list of URLs and have extracted the content as follows:

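library(httr)
library(stringr)  # provides str_extract_all
link="http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"
get.link=GET(link)
# pass the response object and then the extracted text, not undefined x2/y2
get.content=content(get.link,as="text")
extract.content=str_extract_all(get.content,"<p>(.*?)</p>")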

This gives a "list of 1" containing the text. The length of each list varies with the URL. I would like to bind the URL [link] to the content [extract.content], turn the result into a data frame, and then import that into a Corpus. My attempts fail, e.g. this does not work because the rows have different lengths:

all=data.frame(url.vec=c(link1,link2),text.vec=c(extract.content1,extract.content2))

Does anyone know how to combine a character [vector] with a character [list]?

Recommended answer

I would do this using the XML package. You should avoid using regular expressions on HTML/XML documents; use XPath instead. Here I create a small function that, given a link, creates the corpus.

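library(XML)
library(tm)

# Build a tm corpus from the <p> elements of the page at `link`
create.corpus <- function(link){
  doc <- htmlParse(link)                     # fetch and parse the HTML
  parag <- xpathSApply(doc,'//p',xmlValue)   # text of every paragraph
  cc <- Corpus(VectorSource(parag))          # one document per paragraph
  meta(cc,type='corpus','link') <- link      # keep the source URL as corpus metadata
  cc
}
## call it
cc <- create.corpus(link)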
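
To see what the XPath step inside the function extracts without fetching a live page, here is a minimal sketch on an inline HTML string. The `html` string is a made-up example, not content from the linked site. Unlike the `<p>(.*?)</p>` regex in the question, whose matches keep the `<p>` tags and any nested markup, `xmlValue` returns only the text.

```r
library(XML)

# Hypothetical inline page, used instead of a live URL so the sketch runs offline.
html <- "<html><body><p>First <b>bold</b> paragraph.</p><div><p>Second paragraph.</p></div></body></html>"

# asText = TRUE tells htmlParse that `html` is the document itself, not a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# '//p' selects every <p> element at any depth; xmlValue returns its text
# content with nested tags such as <b> stripped.
parag <- xpathSApply(doc, "//p", xmlValue)
parag
```

The result is a plain character vector with one element per paragraph, which is the shape `VectorSource()` expects.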

Check the result:

> meta(cc,type='corpus')
# $create_date
# [1] "2014-01-03 17:40:50 GMT"
# 
# $creator
# [1] ""
# 
# $link
# [1] "http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"

> cc
# A corpus with 36 text documents
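
If a data frame pairing each URL with its text is still wanted, as in the question, one possible approach is to repeat every URL once per extracted paragraph so that both columns end up the same length. The sketch below assumes the paragraphs for each URL are already held in a list of character vectors; `links` and `paragraphs` are hypothetical stand-ins, while `url.vec` and `text.vec` are the column names from the question.

```r
# Hypothetical stand-ins: in practice each list element would be the result of
# xpathSApply(doc, '//p', xmlValue) for the corresponding URL.
links <- c("http://example.com/a", "http://example.com/b")
paragraphs <- list(
  c("First paragraph of a.", "Second paragraph of a."),
  c("Only paragraph of b.")
)

# lengths() gives the number of paragraphs per URL; rep() repeats each URL
# that many times, so url.vec lines up row by row with the flattened text.
all <- data.frame(
  url.vec  = rep(links, times = lengths(paragraphs)),
  text.vec = unlist(paragraphs),
  stringsAsFactors = FALSE
)
all
```

Each row then holds one paragraph together with the URL it came from, and `all$text.vec` can be passed to `VectorSource()` in the same way as `parag` in the function above.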
