R: nested list from JSON/XML (clinicaltrials.gov) to data.frame (tidy)


Problem description

Aim

For university research, I am trying to process the clinical study data that is publicly available on clinicaltrials.gov.

For reproducibility, I would like to use the downloaded JSON or XML files directly rather than retrieve the data via the web API, which has already been described (how-to-get-data-out-of-nested-xml-structure).

Update 1: The structure of the JSON file is published here

Update 2: The structure of the XML file is published here

I think tidyjson::read_json and tidyjson::spread_all do the trick! See the answer section.

What I need

For my workflow, I need to convert the data to data.frames (tidy data.frames would be even better). I prefer JSON; however, if there were a solution for the XML format, I would be very glad.

Test data

I used jsonlite::fromJSON("NCT0455805.json"):

test <- list(FullStudy = list(Rank = 254369L, Study = list(ProtocolSection = list(
    IdentificationModule = list(NCTId = "NCT01455805", OrgStudyIdInfo = list(
        OrgStudyId = "SS2011UK"), Organization = list(OrgFullName = "Spinal Simplicity LLC", 
        OrgClass = "INDUSTRY"), BriefTitle = "Minuteman Spinal Fusion Implant Versus Surgical Decompression for Lumbar Spinal Stenosis", 
        OfficialTitle = "Efficacy and Quality of Life Following Treatment of Lumbar Spinal Stenosis, Spondylolisthesis or Degenerative Disc Disease With the Minuteman Interspinous Interlaminar Fusion Implant Versus Surgical Decompression"), 
    StatusModule = list(StatusVerifiedDate = "October 2020", 
        OverallStatus = "Active, not recruiting", ExpandedAccessInfo = list(
            HasExpandedAccess = "No"), StartDateStruct = list(
            StartDate = "June 2012"), PrimaryCompletionDateStruct = list(
            PrimaryCompletionDate = "March 2024", PrimaryCompletionDateType = "Anticipated"), 
        CompletionDateStruct = list(CompletionDate = "March 2024", 
            CompletionDateType = "Anticipated"), StudyFirstSubmitDate = "October 13, 2011", 
        StudyFirstSubmitQCDate = "October 18, 2011", StudyFirstPostDateStruct = list(
            StudyFirstPostDate = "October 20, 2011", StudyFirstPostDateType = "Estimate"), 
        LastUpdateSubmitDate = "October 22, 2020", LastUpdatePostDateStruct = list(
            LastUpdatePostDate = "October 26, 2020", LastUpdatePostDateType = "Actual")), 
    SponsorCollaboratorsModule = list(ResponsibleParty = list(
        ResponsiblePartyType = "Sponsor"), LeadSponsor = list(
        LeadSponsorName = "Spinal Simplicity LLC", LeadSponsorClass = "INDUSTRY"), 
        CollaboratorList = list(Collaborator = list(list(CollaboratorName = "The Leeds Teaching Hospitals NHS Trust", 
            CollaboratorClass = "OTHER")))), OversightModule = list(
        OversightHasDMC = "Yes"), DescriptionModule = list(BriefSummary = "Lumbar spinal stenosis (LSS), is a common disorder of narrowing of the spinal canal in the lower part of the back. This causes discomfort in the legs when standing or walking because of pressure on the spinal nerves.There are several treatment options for LSS including physiotherapy, lumbar surgical decompression procedures such as laminectomy, Foraminotomy, Discectomy and more recently devices for interspinous distraction such as the XSTOP® and from May 2011 Minuteman\231.\n\nSurgical decompression for LSS involves the removal of excess bone, ligament, and soft-tissue allowing more room for the nerves. The operation is usually preformed under general anaesthetic and with an average stay in hospital for 2-3 nights. Whereas the Minuteman\231 implant is preformed as a day case under local or general anaesthetic and involves implanting the device into the space between two back bones to relieve pressure on the nerves and, therefore, pain in the legs.\n\nThis is a multi centred (four sites) randomised controlled trial with a total sample of 50 participants after obtaining their informed consent. Participants will attend the pain clinic at the Hospitals for a baseline visit where they will be randomised with a ratio of 1:1 to receive either the Minuteman\231 Interspinous interlaminar fusion Implant or standard surgical decompression for the treatment of lumbar spinal stenosis (LSS). Following randomisation arrangements will be made for the participant to receive the randomised treatment. If allocated to Minuteman\231 Implant, the treatment will be conducted by the Pain Specialist identified at the site. If allocated to surgical decompression, the treatment will be conducted by the neuro/spinal-surgeon identified at the site. Participates will be followed up regularly for 60 months post implant to assess clinical efficacy, safety, participants function and quality of life of each treatment.", 
        DetailedDescription = "This is a prospective randomised study monitoring patients for up to 5 years post treatment. Only patients who have an appropriately diagnosed Lumbar Spinal Stenosis with intermittent claudication with/without low back pain, with no adequate symptomatic relief after at least 6 months of conservative treatment will be asked to give consent to be involved. Potential participants will be approached for enrollment 17days before the planned baseline visit. Patients will be given oral and written information about the trial as well as the patient information leaflet for the study. If informed consent is given their participation in this study will be for a maximum of 5 years."), 
    ConditionsModule = list(ConditionList = list(Condition = c("Lumbar Spinal Stenosis", 
    "Spondylolisthesis", "Degenerative Disc Disease"))), DesignModule = list(
        StudyType = "Interventional", PhaseList = list(Phase = "Not Applicable"), 
        DesignInfo = list(DesignAllocation = "Randomized", DesignInterventionModel = "Parallel Assignment", 
            DesignPrimaryPurpose = "Treatment", DesignMaskingInfo = list(
                DesignMasking = "None (Open Label)")), EnrollmentInfo = list(
            EnrollmentCount = "50", EnrollmentType = "Anticipated")), 
    ArmsInterventionsModule = list(ArmGroupList = list(ArmGroup = list(
        list(ArmGroupLabel = "Minuteman Fusion Implant", ArmGroupType = "Active Comparator", 
            ArmGroupDescription = "Minuteman\231 interspinous interlaminar fusion Implant (interspinous interlaminar fusion device) which gained CE Mark approval in May 2011", 
            ArmGroupInterventionList = list(ArmGroupInterventionName = "Device: Minuteman Fusion Implant")), 
        list(ArmGroupLabel = "Surgical decompression", ArmGroupType = "Other", 
            ArmGroupDescription = "Surgical decompression refers to the following operations Laminectomy, Foraminotomy, Discectomy or any other surgical procedure that the clinician feels is relevant for the decompression of lumbar spinal stenosis.", 
            ArmGroupInterventionList = list(ArmGroupInterventionName = "Procedure: surgical decompression")))), 
        InterventionList = list(Intervention = list(list(InterventionType = "Device", 
            InterventionName = "Minuteman Fusion Implant", InterventionDescription = "The Minuteman\231 interspinous interlaminar fusion device consists of a central threaded portion that has a two-part wing plate hinged near its proximal end, with spikes on the extended distal end of the wing plate, and a multi-spiked end cap plate that is located at the distal end of the device and is retained and tightened in place with a locking hex nut. Compression between the spiked wing plate and the spiked end cap plate serves to fix the spinous processes in place and to facilitate fusion, together with bone graft fusion material placed within the device. The threaded external body has been designed to provide ease of distraction and insertion via a minimally invasive surgical procedure.", 
            InterventionArmGroupLabelList = list(InterventionArmGroupLabel = "Minuteman Fusion Implant"), 
            InterventionOtherNameList = list(InterventionOtherName = "The Minuteman\231 interspinous interlaminar fusion device")), 
            list(InterventionType = "Procedure", InterventionName = "surgical decompression", 
                InterventionDescription = "Surgical decompression refers to the following operations Laminectomy, Foraminotomy, Discectomy or any other surgical procedure that the clinician feels is relevant for the decompression of lumbar spinal stenosis", 
                InterventionArmGroupLabelList = list(InterventionArmGroupLabel = "Surgical decompression"))))), 
    OutcomesModule = list(PrimaryOutcomeList = list(PrimaryOutcome = list(
        list(PrimaryOutcomeMeasure = "Change from baseline of clinical efficacy up to 60 months post procedure", 
            PrimaryOutcomeDescription = "These include:\n\nVisual Analogue Scale (VAS) pain scores Leg Pain\nVisual Analogue Scale (VAS) pain scores Back Pain\nOswestry Disability Index (ODI)\nZurich Claudication Questionnaire (ZCQ)\nAssessment of Physical Function via distance walked in 5 minutes and number of repetitions of sitting to standing in 1 minute.\n\nThe main outcome will be a comparison between treatment groups based on the change from baseline at each follow-up visit for each of the measures listed above.", 
            PrimaryOutcomeTimeFrame = "8 weeks and up to 60 months post procedure."))), 
        SecondaryOutcomeList = list(SecondaryOutcome = list(list(
            SecondaryOutcomeMeasure = "measures of quality of life", 
            SecondaryOutcomeDescription = "These include:\n\nChange in functional status questionnaire from baseline\nParticipants global impression of change from baseline (PGIC)\nClinician's global Impression of change from baseline (CGIC)\nEmployment status", 
            SecondaryOutcomeTimeFrame = "8 weeks and up to 60 months post procedure."), 
            list(SecondaryOutcomeMeasure = "Adverse events related to device and procedure", 
                SecondaryOutcomeTimeFrame = "safety to be assessed at 8 weeks and up to 60 months post procedure.")))), 
    EligibilityModule = list(EligibilityCriteria = "Inclusion Criteria:\n\nIs male or a non pregnant female aged 18years or older\nBMI = 35kg/m2\nHas chronic leg pain with or without back pain of greater than 6 months duration,which is partially or completely relieved by either sitting or adopting a flexed posture and who are suitable in the clinicians opinion for posterior lumbar surgery\nPre-operative ODI score = 20%\nPre-operative ZCQ Physical Function Domain =2\nPre-operative VAS Leg pain score = 4\nHas completed at least 6 months of conservative treatment without obtaining adequate symptomatic relief or has worsening neurological symptoms.\nHas degenerative changes at 1 or 2 levels confirmed by MRI or CT Myelogram within the last 12 months) with one or more of the following:\nLumbar spinal stenosis with intermittent neurogenic claudication\nDegeneration of the disc (as evidenced by imaging on MRI)\nAnnular thickening\nDegenerative Spondylolisthesis = Meyerding Grade 1\nThickening of ligamentum flavum\n\nExclusion Criteria:\n\nFixed motor deficit\nHas undergone previous lumbar spinal surgery\nIs unwilling or unable to give consent or adhere to the follow up schedule\nHas active infection or metastatic disease\nHas spondylolisthesis > grade 1\nHas neurogenic bladder or bowel disease\nHas a history of Osteopenia and or Osteoporosis. Evaluation of possible Osteopenia and or Osteoporosis will be conducted via a bone density scan prior to randomisation if ANY of the Bone Mass Evaluation criteria is met\nPatients who are not deemed fit for anaesthesia/major surgery due to underlying medical condition", 
        HealthyVolunteers = "No", Gender = "All", MinimumAge = "18 Years", 
        StdAgeList = list(StdAge = c("Adult", "Older Adult"))), 
    ContactsLocationsModule = list(OverallOfficialList = list(
        OverallOfficial = list(list(OverallOfficialName = "Ganesan Baranidharan, Dr", 
            OverallOfficialAffiliation = "Leeds Teaching Hospitals NHS Trust", 
            OverallOfficialRole = "Principal Investigator"))), 
        LocationList = list(Location = list(list(LocationFacility = "Taunton & Somerset NHS Foundation Trust of Musgrove Park Hospital", 
            LocationCity = "Taunton", LocationState = "Somerset", 
            LocationZip = "TA1 5DA", LocationCountry = "United Kingdom"), 
            list(LocationFacility = "The Ipswich Hospital NHS Trust", 
                LocationCity = "Ipswich", LocationState = "Suffolk", 
                LocationZip = "IP4 5PD", LocationCountry = "United Kingdom"), 
            list(LocationFacility = "Pain and Interventional Neuromodulation Research Group, Pain Management Dept, Seacroft Hospital, Leeds Teaching Hospitals NHS Trust", 
                LocationCity = "Leeds", LocationState = "West Yorkshire", 
                LocationZip = "LS14 6UH", LocationCountry = "United Kingdom"), 
            list(LocationFacility = "The Dudley Group NHS Foundation Trust, Russell Hall Hospital", 
                LocationCity = "Birmingham", LocationZip = "DY1 2HQ", 
                LocationCountry = "United Kingdom"))))), DerivedSection = list(
    MiscInfoModule = list(VersionHolder = "February 26, 2021"), 
    ConditionBrowseModule = list(ConditionMeshList = list(ConditionMesh = list(
        list(ConditionMeshId = "D000013130", ConditionMeshTerm = "Spinal Stenosis"), 
        list(ConditionMeshId = "D000055959", ConditionMeshTerm = "Intervertebral Disc Degeneration"), 
        list(ConditionMeshId = "D000013168", ConditionMeshTerm = "Spondylolisthesis"), 
        list(ConditionMeshId = "D000003251", ConditionMeshTerm = "Constriction, Pathologic"))), 
        ConditionAncestorList = list(ConditionAncestor = list(
            list(ConditionAncestorId = "D000020763", ConditionAncestorTerm = "Pathological Conditions, Anatomical"), 
            list(ConditionAncestorId = "D000013122", ConditionAncestorTerm = "Spinal Diseases"), 
            list(ConditionAncestorId = "D000001847", ConditionAncestorTerm = "Bone Diseases"), 
            list(ConditionAncestorId = "D000009140", ConditionAncestorTerm = "Musculoskeletal Diseases"), 
            list(ConditionAncestorId = "D000013169", ConditionAncestorTerm = "Spondylolysis"), 
            list(ConditionAncestorId = "D000055009", ConditionAncestorTerm = "Spondylosis"))), 
        ConditionBrowseLeafList = list(ConditionBrowseLeaf = list(
            list(ConditionBrowseLeafId = "M26992", ConditionBrowseLeafName = "Intervertebral Disc Degeneration", 
                ConditionBrowseLeafAsFound = "Degenerative Disc Disease", 
                ConditionBrowseLeafRelevance = "high"), list(
                ConditionBrowseLeafId = "M14546", ConditionBrowseLeafName = "Spondylolisthesis", 
                ConditionBrowseLeafAsFound = "Spondylolisthesis", 
                ConditionBrowseLeafRelevance = "high"), list(
                ConditionBrowseLeafId = "M14510", ConditionBrowseLeafName = "Spinal Stenosis", 
                ConditionBrowseLeafAsFound = "Spinal Stenosis", 
                ConditionBrowseLeafRelevance = "high"), list(
                ConditionBrowseLeafId = "M5058", ConditionBrowseLeafName = "Constriction, Pathologic", 
                ConditionBrowseLeafAsFound = "Stenosis", ConditionBrowseLeafRelevance = "high"), 
            list(ConditionBrowseLeafId = "M21103", ConditionBrowseLeafName = "Pathological Conditions, Anatomical", 
                ConditionBrowseLeafRelevance = "low"), list(ConditionBrowseLeafId = "M14502", 
                ConditionBrowseLeafName = "Spinal Diseases", 
                ConditionBrowseLeafRelevance = "low"), list(ConditionBrowseLeafId = "M3708", 
                ConditionBrowseLeafName = "Bone Diseases", ConditionBrowseLeafRelevance = "low"), 
            list(ConditionBrowseLeafId = "M10680", ConditionBrowseLeafName = "Musculoskeletal Diseases", 
                ConditionBrowseLeafRelevance = "low"), list(ConditionBrowseLeafId = "M14547", 
                ConditionBrowseLeafName = "Spondylolysis", ConditionBrowseLeafRelevance = "low"), 
            list(ConditionBrowseLeafId = "M26580", ConditionBrowseLeafName = "Spondylosis", 
                ConditionBrowseLeafRelevance = "low"), list(ConditionBrowseLeafId = "T6038", 
                ConditionBrowseLeafName = "Quality of Life", 
                ConditionBrowseLeafRelevance = "low"))), ConditionBrowseBranchList = list(
            ConditionBrowseBranch = list(list(ConditionBrowseBranchAbbrev = "BC05", 
                ConditionBrowseBranchName = "Muscle, Bone, and Cartilage Diseases"), 
                list(ConditionBrowseBranchAbbrev = "All", ConditionBrowseBranchName = "All Conditions"), 
                list(ConditionBrowseBranchAbbrev = "BC23", ConditionBrowseBranchName = "Symptoms and General Pathology"), 
                list(ConditionBrowseBranchAbbrev = "BXM", ConditionBrowseBranchName = "Behaviors and Mental Disorders"))))))))

What I have achieved

I can easily read a batch of JSON files into a list as described here (x = vector with the paths to the files):

library(parallel)
library(jsonlite)

# read all JSON files in parallel; paths$path holds the file paths
cl <- makeCluster(detectCores() - 1)
json_list <- parLapply(cl, paths$path, function(x) jsonlite::fromJSON(x))
stopCluster(cl)

What I tried

I tried the option simplifyDataFrame = TRUE in jsonlite::fromJSON; however, I get these warning messages:

1: In (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
  row names were found from a short variable and have been discarded
2: In (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
  row names were found from a short variable and have been discarded

I tried a solution proposed (how-to-get-data-out-of-nested-xml-structure) for the nested lists generated directly with the web API of clinicaltrials.gov.

as_tibble(test$FullStudy$Study)
Error: Tibble columns must have compatible sizes.
* Size 2: Column `DerivedSection`.
* Size 11: Column `ProtocolSection`.
i Only values of size one are recycled.
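The size mismatch arises because as_tibble treats each top-level element of the list as a column: in the test data ProtocolSection has 11 elements and DerivedSection has 2, so they cannot be recycled to a common length. As a rough base R sketch (not the approach from the answer), the nested list can instead be flattened into a long path/value table with unlist. The object `mini` is a hypothetical, cut-down stand-in for `test`, with values copied from the test data.

```r
# Hypothetical miniature of the `test` list, for illustration only
mini <- list(FullStudy = list(
  Rank = 254369L,
  Study = list(
    ProtocolSection = list(
      IdentificationModule = list(NCTId = "NCT01455805"),
      ConditionsModule = list(ConditionList = list(
        Condition = c("Lumbar Spinal Stenosis", "Spondylolisthesis")))),
    DerivedSection = list(
      MiscInfoModule = list(VersionHolder = "February 26, 2021")))))

# The two sections have different lengths, which is what as_tibble() rejects
lengths(mini$FullStudy$Study)

# unlist() walks the whole tree and builds dotted path names for every leaf;
# mixed types are coerced to character
flat <- unlist(mini)

# One row per leaf value: a long, tidy key/value table
long <- data.frame(path = names(flat), value = unname(flat),
                   stringsAsFactors = FALSE)
long
```

Repeated elements (for example the two conditions) end up as separate rows with numbered path names, so nothing is dropped, but every value becomes a character string and has to be converted back where needed.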

I tried to use tidyjson; however, I could not manage to get a tidy data.frame from my nested lists.

Recommended answer

tidyjson works perfectly:

It is important to read the JSON file directly with tidyjson::read_json to get the right format (tbl_json, S3: tbl_json/tbl_df/tbl/data.frame) for further processing.

# load the library
library(tidyjson)

# load the JSON file
test <- tidyjson::read_json("NCT0455805.json")

# check the data structure
str(test)
# tbl_json [1 x 2] (S3: tbl_json/tbl_df/tbl/data.frame)

# make a tibble
test %>% tidyjson::spread_all()

# A tibble: 1 x 42
  ..JSON  document.id FullStudy.Rank FullStudy.Study~ FullStudy.Study~ FullStudy.Study~ FullStudy.Study~ FullStudy.Study~ FullStudy.Study~ FullStudy.Study~
  <chr>         <int>          <dbl> <chr>            <chr>            <chr>            <chr>            <chr>            <chr>            <chr>
1 "{\"F~            1         254369 NCT01455805      Minuteman Spina~ Efficacy and Qu~ October 2020     Active, not rec~ October 13, 2011 October 18, 2011
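As far as I know, spread_all() only spreads scalar values, so repeated elements such as the entries in LocationList are not expanded into rows. A minimal sketch of how such an array could be turned into one row per element with tidyjson's enter_object, gather_array and spread_values follows. The inline JSON string is a hypothetical, shortened stand-in for the real file (the second facility name is abbreviated), so the example runs without the download.

```r
library(tidyjson)

# Hypothetical, shortened document that mimics the structure of the test data
json <- '{"FullStudy": {"Study": {"ProtocolSection": {
  "IdentificationModule": {"NCTId": "NCT01455805"},
  "ContactsLocationsModule": {"LocationList": {"Location": [
    {"LocationFacility": "The Ipswich Hospital NHS Trust", "LocationCity": "Ipswich"},
    {"LocationFacility": "Seacroft Hospital", "LocationCity": "Leeds"}
  ]}}}}}}'

locations <- json %>%
  # keep the study identifier as an ordinary column before descending
  spread_values(NCTId = jstring("FullStudy", "Study", "ProtocolSection",
                                "IdentificationModule", "NCTId")) %>%
  # walk down to the array of locations
  enter_object("FullStudy", "Study", "ProtocolSection",
               "ContactsLocationsModule", "LocationList", "Location") %>%
  # one row per array element, then one column per scalar field
  gather_array() %>%
  spread_all() %>%
  # drop the JSON payload and keep a plain tibble
  tibble::as_tibble()

locations
```

The same pattern should carry over to the other repeated lists in the file (for example ArmGroup or ConditionMesh) by changing the path passed to enter_object; the resulting tables can then be joined on NCTId.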
