Creating a dataset from an XML file in R statistics

Views: 28 | Published: 2021/10/2 18:42:14 | Tags: r, xml-parsing


Problem description

I am trying to download an XML file of journal article records and create a dataset for further interrogation in R. I'm completely new to XML and quite a novice at R. I cobbled together some code using pieces from two sources: GoogleScholarXScraper and Extracting records from pubMed.

library(RCurl)
library(XML)
library(stringr)

#Search terms
SearchString<-"cancer+small+cell+non+lung+survival+plastic"
mySearch<-str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=",SearchString,"&usehistory=y",sep="",collapse=NULL)

#Search
pub.esearch<-getURL(mySearch)

#Extract QueryKey and WebEnv
pub.esearch<-xmlTreeParse(pub.esearch,asText=TRUE)
key<-as.numeric(xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["QueryKey"]]))
env<-xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["WebEnv"]])

#Fetch Records
myFetch<-str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&WebEnv=",env,"&retmode=xml&query_key=",key)
pub.efetch<-getURL(myFetch)
myxml<-xmlTreeParse(pub.efetch,asText=TRUE,useInternalNodes=TRUE)

#Create dataset of article characteristics
#This doesn't work
pub.data<-NULL
pub.data<-data.frame(
  journal = xpathSApply(myxml,"//PubmedArticle/MedlineCitation/MedlineJournalInfo/MedlineTA", xmlValue),
  abstract = xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Abstract/AbstractText",xmlValue),
  affiliation = xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Affiliation", xmlValue),
  year = xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year", xmlValue),
  stringsAsFactors=FALSE)

The main problem I seem to have is that the XML file I get back is not uniformly structured. For example, some references have a node structure like this:

<Abstract>
  <AbstractText>The Wilms' tumor gene... </AbstractText>

while others have labelled sections, like this:

<Abstract>
  <AbstractText Label="BACKGROUND &#38; AIMS" NlmCategory="OBJECTIVE">Some background text.</AbstractText>
  <AbstractText Label="METHODS" NlmCategory="METHODS"> Some text on methods.</AbstractText>

When I extract 'AbstractText' I am hoping to get 24 rows of data back (there are 24 records when I run this made-up search today), but xpathSApply returns every labelled 'AbstractText' section as a separate element, so the column no longer has one entry per article. Is there a way to collapse the XML structure in this instance, or to ignore the labels? Is there a way to make xpathSApply return NA when nothing is found at the end of a path? I am aware of xmlToDataFrame, which sounds like it should fit the bill, but whenever I try it, it doesn't give me anything sensible.
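The mismatch described above can be reproduced offline. The sketch below uses a small hand-written document as a stand-in for the efetch response (the article contents are invented for illustration and the structure is heavily simplified), and shows that a document-wide XPath returns one element per matching node rather than one per article.

```r
library(XML)

# Hand-written stand-in for an efetch response (illustrative only):
# article 1 has a plain abstract, article 2 has three labelled sections,
# article 3 has no abstract at all.
toy <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article>
    <Abstract><AbstractText>Plain abstract.</AbstractText></Abstract>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article>
    <Abstract>
      <AbstractText Label="BACKGROUND">Some background text.</AbstractText>
      <AbstractText Label="METHODS">Some text on methods.</AbstractText>
      <AbstractText Label="RESULTS">Some results.</AbstractText>
    </Abstract>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article>
  </Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(toy, asText = TRUE)

# Number of article records in the document
n_articles <- length(getNodeSet(doc, "//PubmedArticle"))

# Document-wide query: one result per AbstractText node, not per article
abstracts <- xpathSApply(
  doc,
  "//PubmedArticle/MedlineCitation/Article/Abstract/AbstractText",
  xmlValue
)

print(n_articles)         # count of records
print(length(abstracts))  # count of AbstractText nodes; differs from n_articles
```

Because the two counts differ, columns built this way cannot be combined row-for-row in a data frame.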

Thanks for your help.
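One possible approach, offered as a minimal sketch and not as the answer from the original thread: select each PubmedArticle node first, then evaluate relative XPath expressions inside each node, pasting multiple matches together and returning NA when nothing matches. The helper name `collapse_or_na` and the toy document are hypothetical and exist only for this illustration; the XPath expressions are the ones from the question.

```r
library(XML)

# Hand-written stand-in for an efetch response (contents are invented).
toy <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <Article>
      <Journal><JournalIssue><PubDate><Year>2010</Year></PubDate></JournalIssue></Journal>
      <Abstract><AbstractText>Plain abstract.</AbstractText></Abstract>
    </Article>
    <MedlineJournalInfo><MedlineTA>J Toy</MedlineTA></MedlineJournalInfo>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <Article>
      <Abstract>
        <AbstractText Label="BACKGROUND">Some background text.</AbstractText>
        <AbstractText Label="METHODS">Some text on methods.</AbstractText>
      </Abstract>
    </Article>
    <MedlineJournalInfo><MedlineTA>Toy Rev</MedlineTA></MedlineJournalInfo>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <Article>
      <Journal><JournalIssue><PubDate><Year>2011</Year></PubDate></JournalIssue></Journal>
    </Article>
    <MedlineJournalInfo><MedlineTA>J Toy</MedlineTA></MedlineJournalInfo>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(toy, asText = TRUE)

# Hypothetical helper: run a relative XPath inside one article node,
# paste all matches into a single string, and return NA if none match.
collapse_or_na <- function(node, path, sep = " ") {
  vals <- xpathSApply(node, path, xmlValue)
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = sep)
}

# One node per record, so every column gets exactly one value per article
articles <- getNodeSet(doc, "//PubmedArticle")

pub.data <- data.frame(
  journal  = sapply(articles, collapse_or_na,
                    path = "./MedlineCitation/MedlineJournalInfo/MedlineTA"),
  abstract = sapply(articles, collapse_or_na,
                    path = "./MedlineCitation/Article/Abstract/AbstractText"),
  year     = sapply(articles, collapse_or_na,
                    path = "./MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"),
  stringsAsFactors = FALSE
)

print(pub.data)
```

With the real response, `doc` would be replaced by the `myxml` object from the question. Labelled sections are joined with a space here, which discards the Label attributes; keeping them would need a different function than `xmlValue`.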
