How to get data out of a nested XML structure?
Problem description
I'm trying to work with an API that returns data as nested XML, and I want to save it as a data frame. My problem is that I don't know how to get values out of this nested XML. Here is an example:
# Sample data
library(xml2)
url <- "https://clinicaltrials.gov/api/query/full_studies?expr=neuro&min_rnk=1&max_rnk=20&fmt=xml"
download.file(url, destfile = "xml_data.xml")
fil <- "xml_data.xml"
dat <- xml2::read_xml(fil)
This gives a nested XML file, but I don't understand how to work with this structure.
<FullStudiesResponse>
....
<FullStudyList>
<FullStudy Rank="1">
<Struct Name="Study">
<Struct Name="ProtocolSection">
<Struct Name="IdentificationModule">
<Field Name="NCTId">NCT01843582</Field>
I can get to FullStudyList with a command like:
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy")
But if, for example, I want to get all NCTId or Rank values, how can I refer to them? So far I have tried:
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy/NCTId")
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy/@NCTId")
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy//NCTId")
None of these work. Or is there a better way to work with nested XML to get the data into a data frame?
Recommended answer
The short answer is: don't use XML. The documentation from that website says that you can specify the fmt you want. It doesn't have to be XML, and JSON is much easier to handle in R.
Try requesting JSON instead:
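library(httr)
library(jsonlite)
library(tibble)
# Request JSON (fmt=json) and read the body as text so that fromJSON() can parse it
resp <- GET("https://clinicaltrials.gov/api/query/full_studies?expr=neuro&min_rnk=1&max_rnk=20&fmt=json")
res <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))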
The result is a nested list, but I guess that you are interested in the data stored in FullStudies:
df <- as_tibble(res$FullStudiesResponse$FullStudies)
This gives us:
# A tibble: 20 x 2
Rank Study$ProtocolS~ $$$OrgStudyIdIn~ $$$$OrgStudyIdT~ $$$$OrgStudyIdL~ $$$Organization~ $$$$OrgClass $$$BriefTitle $$$OfficialTitle $$$Acronym $$StatusModule$~
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 NCT02642055 NEURO+001 NA NA Neuro+ INDUSTRY Efficacy of ~ Efficacy of NEU~ NA May 2016
2 2 NCT01801813 RC12_0416 NA NA Nantes Universi~ OTHER Risk Factors~ Observational S~ Craniosco~ March 2016
3 3 NCT03813290 DSRB A/2018/006~ NA NA National Health~ OTHER_GOV A Neuro-Tech~ A Neuro-Technol~ NA February 2020
4 4 NCT03773926 2018-A00604-51 NA NA Zeta Technologi~ INDUSTRY Neuro-feedba~ Neuro-feedback ~ TNTA December 2018
5 5 NCT04189172 AAG-O-H-1630 NA NA Aesculap AG INDUSTRY MiDura-Study~ Multicenter, In~ MiDura May 2020
6 6 NCT03756337 PIC-20 NA NA Oticon Medical INDUSTRY Neuro 1 vs. ~ Comparison of A~ NA November 2018
7 7 NCT03484143 P17.03 NA NA Vielight Inc. INDUSTRY Neuro RX Gam~ Vielight Neuro ~ NA June 2020
8 8 NCT02138110 InVivo-100-101 NA NA InVivo Therapeu~ INDUSTRY The INSPIRE ~ The INSPIRE Stu~ NA December 2019
9 9 NCT03935724 A2017SCI03 NA NA Neuroplast INDUSTRY Clinical Stu~ A Multi-center,~ SCI2 September 2020
10 10 NCT03798002 RiphahI Maryam ~ NA NA Riphah Internat~ OTHER Neuro-muscul~ Effects of Neur~ NA August 2019
11 11 NCT03655262 R61MH113772 U.S. NIH Grant/~ https://project~ University of C~ OTHER Treating Pho~ Treating Phobia~ NA April 2019
12 12 NCT04418609 Neuro-COVID-19 NA NA University of Z~ OTHER Neuro-COVID-~ Neuro-COVID-19:~ Neuro-COV~ June 2020
13 13 NCT01174329 1234 NA NA Universidad Aut~ OTHER Treatment of~ Difference in S~ SALELECTR~ July 2010
14 14 NCT04205019 A2019SCI04 NA NA Neuroplast INDUSTRY Safety Stem ~ A 3 Months Open~ SSCiSCI September 2020
15 15 NCT02941627 PIC_07 NA NA Oticon Medical INDUSTRY The Neuro Zt~ The Neuro Zti C~ NA February 2017
16 16 NCT03328195 P17.02 NA NA Vielight Inc. INDUSTRY Vielight Neu~ A Pilot Study E~ NA September 2020
17 17 NCT02401841 Policlinico 12 NA NA Policlinico Hos~ OTHER Resolution o~ Resolution of N~ NA October 2015
18 18 NCT03882567 03/2015 NA NA Universidad Rey~ OTHER Effectivenes~ Effectiveness o~ SCENAR October 2019
19 19 NCT04583163 2019-0945 NA NA Hackensack Meri~ OTHER Variability ~ Inter- and Intr~ NA October 2020
20 20 NCT01845155 CMTR-TC-02 NA NA German Center f~ OTHER Neuro-Music-~ Neuro-Music-The~ NA February 2014
# ... with 103 more variables: $$$OverallStatus <chr>, $$$ExpandedAccessInfo$HasExpandedAccess <chr>, $$$StartDateStruct$StartDate <chr>, $$$$StartDateType <chr>,
# $$$PrimaryCompletionDateStruct$PrimaryCompletionDate <chr>, $$$$PrimaryCompletionDateType <chr>, $$$CompletionDateStruct$CompletionDate <chr>,
# $$$$CompletionDateType <chr>, $$$StudyFirstSubmitDate <chr>, $$$StudyFirstSubmitQCDate <chr>, $$$StudyFirstPostDateStruct$StudyFirstPostDate <chr>,
# $$$$StudyFirstPostDateType <chr>, $$$LastUpdateSubmitDate <chr>, $$$LastUpdatePostDateStruct$LastUpdatePostDate <chr>, $$$$LastUpdatePostDateType <chr>,
# $$$ResultsFirstSubmitDate <chr>, $$$ResultsFirstSubmitQCDate <chr>, $$$ResultsFirstPostDateStruct$ResultsFirstPostDate <chr>, $$$$ResultsFirstPostDateType <chr>,
# $$$LastKnownStatus <chr>, $$SponsorCollaboratorsModule$ResponsibleParty$ResponsiblePartyType <chr>, $$$$ResponsiblePartyInvestigatorFullName <chr>,
# $$$$ResponsiblePartyInvestigatorTitle <chr>, $$$$ResponsiblePartyInvestigatorAffiliation <chr>, $$$$ResponsiblePartyOldNameTitle <chr>,
# $$$$ResponsiblePartyOldOrganization <chr>, $$$LeadSponsor$LeadSponsorName <chr>, $$$$LeadSponsorClass <chr>, $$$CollaboratorList$Collaborator <list>,
# $$OversightModule$OversightHasDMC <chr>, $$$IsFDARegulatedDrug <chr>, $$$IsFDARegulatedDevice <chr>, $$$IsUnapprovedDevice <chr>, $$$IsUSExport <chr>,
# $$DescriptionModule$BriefSummary <chr>, $$$DetailedDescription <chr>, $$ConditionsModule$ConditionList$Condition <list>, $$$KeywordList$Keyword <list>,
# $$DesignModule$StudyType <chr>, $$$PhaseList$Phase <list>, $$$DesignInfo$DesignAllocation <chr>, $$$$DesignInterventionModel <chr>,
# $$$$DesignPrimaryPurpose <chr>, $$$$DesignMaskingInfo$DesignMasking <chr>, $$$$$DesignWhoMaskedList$DesignWhoMasked <list>, $$$$$DesignMaskingDescription <chr>,
# $$$$DesignObservationalModelList$DesignObservationalModel <list>, $$$$DesignTimePerspectiveList$DesignTimePerspective <list>,
# $$$$DesignInterventionModelDescription <chr>, $$$EnrollmentInfo$EnrollmentCount <chr>, $$$$EnrollmentType <chr>, $$$PatientRegistry <chr>,
# $$$TargetDuration <chr>, $$ArmsInterventionsModule$ArmGroupList$ArmGroup <list>, $$$InterventionList$Intervention <list>,
# $$OutcomesModule$PrimaryOutcomeList$PrimaryOutcome <list>, $$$SecondaryOutcomeList$SecondaryOutcome <list>, $$$OtherOutcomeList$OtherOutcome <list>,
# $$EligibilityModule$EligibilityCriteria <chr>, $$$HealthyVolunteers <chr>, $$$Gender <chr>, $$$MinimumAge <chr>, $$$MaximumAge <chr>, $$$StdAgeList$StdAge <list>,
# $$$StudyPopulation <chr>, $$$SamplingMethod <chr>, $$ContactsLocationsModule$OverallOfficialList$OverallOfficial <list>, $$$LocationList$Location <list>,
# $$$CentralContactList$CentralContact <list>, $$IPDSharingStatementModule$IPDSharing <chr>, $$ReferencesModule$ReferenceList$Reference <list>,
# $$$SeeAlsoLinkList$SeeAlsoLink <list>, $DerivedSection$MiscInfoModule$VersionHolder <chr>, $$$RemovedCountryList$RemovedCountry <list>,
# $$ConditionBrowseModule$ConditionMeshList$ConditionMesh <list>, $$$ConditionAncestorList$ConditionAncestor <list>,
# $$$ConditionBrowseLeafList$ConditionBrowseLeaf <list>, $$$ConditionBrowseBranchList$ConditionBrowseBranch <list>,
# $$InterventionBrowseModule$InterventionBrowseLeafList$InterventionBrowseLeaf <list>, $$$InterventionBrowseBranchList$InterventionBrowseBranch <list>,
# $ResultsSection$ParticipantFlowModule$FlowGroupList$FlowGroup <list>, $$$FlowPeriodList$FlowPeriod <list>, $$$FlowPreAssignmentDetails <chr>,
# $$$FlowRecruitmentDetails <chr>, $$BaselineCharacteristicsModule$BaselinePopulationDescription <chr>, $$$BaselineGroupList$BaselineGroup <list>,
# $$$BaselineDenomList$BaselineDenom <list>, $$$BaselineMeasureList$BaselineMeasure <list>, $$OutcomeMeasuresModule$OutcomeMeasureList$OutcomeMeasure <list>,
# $$AdverseEventsModule$EventsFrequencyThreshold <chr>, $$$EventsTimeFrame <chr>, $$$EventGroupList$EventGroup <list>, $$$SeriousEventList$SeriousEvent <list>,
# $$$OtherEventList$OtherEvent <list>, $$MoreInfoModule$CertainAgreement$AgreementPISponsorEmployee <chr>, $$$$AgreementRestrictiveAgreement <chr>,
# $$$PointOfContact$PointOfContactTitle <chr>, $$$$PointOfContactOrganization <chr>, $$$$PointOfContactEMail <chr>, $$$$PointOfContactPhone <chr>, ...
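The column names in the printout show that the tibble holds nested data frames. To pull out individual fields such as Rank and NCTId, one option is to walk the nested columns with `$`, and another is to let jsonlite flatten them. The following is only a minimal sketch: the inline JSON string is a hand-written stand-in that mimics the shape of the API response (two studies, identification module only, placeholder titles) so that it runs without network access.

```r
library(jsonlite)

# Hypothetical stand-in for the API response: same nesting as
# FullStudiesResponse, cut down to two studies and a single module.
txt <- '{"FullStudiesResponse": {"FullStudies": [
  {"Rank": 1, "Study": {"ProtocolSection": {"IdentificationModule":
    {"NCTId": "NCT02642055", "BriefTitle": "placeholder title A"}}}},
  {"Rank": 2, "Study": {"ProtocolSection": {"IdentificationModule":
    {"NCTId": "NCT01801813", "BriefTitle": "placeholder title B"}}}}
]}}'

res <- fromJSON(txt)
studies <- res$FullStudiesResponse$FullStudies

# Option 1: walk the nested data-frame columns with `$`.
ids <- data.frame(
  Rank  = studies$Rank,
  NCTId = studies$Study$ProtocolSection$IdentificationModule$NCTId,
  stringsAsFactors = FALSE
)

# Option 2: flatten the nested data frames into dot-separated column names.
flat <- jsonlite::flatten(studies)
names(flat)
```

Option 1 is convenient when only a handful of fields are needed. Option 2 gives an ordinary flat data frame, although list columns such as the condition or location lists would presumably still need separate handling.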
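If you need to stay with XML, the XPath attempts in the question fail because NCTId is not an element name: it is the value of the Name attribute on a Field element, while Rank is an attribute of FullStudy itself. The following is a minimal sketch with xml2, assuming the structure shown in the question and no XML namespace on the document. The inline XML is a hand-written, cut-down stand-in, not real API output.

```r
library(xml2)

# Hand-written stand-in mirroring the structure shown in the question.
doc <- read_xml('<FullStudiesResponse>
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="Study"><Struct Name="ProtocolSection">
        <Struct Name="IdentificationModule">
          <Field Name="NCTId">NCT01843582</Field>
        </Struct>
      </Struct></Struct>
    </FullStudy>
    <FullStudy Rank="2">
      <Struct Name="Study"><Struct Name="ProtocolSection">
        <Struct Name="IdentificationModule">
          <Field Name="NCTId">NCT02642055</Field>
        </Struct>
      </Struct></Struct>
    </FullStudy>
  </FullStudyList>
</FullStudiesResponse>')

studies <- xml_find_all(doc, "//FullStudyList/FullStudy")

# Rank is an attribute of each FullStudy node.
rank <- as.integer(xml_attr(studies, "Rank"))

# NCTId is the text of the Field whose Name attribute equals "NCTId".
# xml_find_first() returns one match per study, so rows stay aligned.
nct_id <- xml_text(xml_find_first(studies, ".//Field[@Name='NCTId']"))

out <- data.frame(Rank = rank, NCTId = nct_id, stringsAsFactors = FALSE)
```

Searching relative to each FullStudy node (note the leading `.` in the XPath) rather than across the whole document keeps Rank and NCTId paired even if a study happens to lack one of the fields.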