Parsing of PMCID table row to column form


Problem description

dput(t1)
structure(list(PMCID = c("PMC7809753", "PMC7809753", "PMC7809753", 
"PMC7809753", "PMC7809753", "PMC7790830", "PMC7790830", "PMC7790830", 
"PMC7790830", "PMC7790830"), table = c("Table 1", "Table 1", 
"Table 1", "Table 1", "Table 1", "Table 1", "Table 1", "Table 1", 
"Table 1", "Table 1"), row = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 
4L, 5L), text = c("Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK. Inactivation: CDA, dCMPD, PN-I.; Efflux=MRP4,7,8; Refs.=[14, 30–33, 78–80]", 
"Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[44, 51, 81–84]", 
"Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44, 85–90]", 
"Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, 91, 92]", 
"Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutamylation); Efflux=P-gp, MRP1-5, BCRP; Refs.=[16, 93, 94]", 
"Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; Cell count(×109/l): PLT=9; BM Blast (%)=70.5; Karyotype=46,XX,t(8,21)(q22;q22)", 
"Patients no.=2; Age (years)=41; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=14.58; Cell count(×109/l): HB=103; Cell count(×109/l): PLT=62; BM Blast (%)=60.4; Karyotype=46,XX", 
"Patients no.=3; Age (years)=49; Gender=M; FAB subtype=M4; Cell count(×109/l): WBC=4.84; Cell count(×109/l): HB=69; Cell count(×109/l): PLT=100; BM Blast (%)=88; Karyotype=45,XY,-7", 
"Patients no.=4; Age (years)=65; Gender=M; FAB subtype=M5; Cell count(×109/l): WBC=220; Cell count(×109/l): HB=85; Cell count(×109/l): PLT=52; BM Blast (%)=86.8; Karyotype=46,XY", 
"Patients no.=5; Age (years)=61; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=4.61; Cell count(×109/l): HB=71; Cell count(×109/l): PLT=197; BM Blast (%)=32.4; Karyotype=46,XX"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))

Above is my sample data frame, which looks like this:

head(t1)
# A tibble: 6 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7809753 Table…     1 Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK.…
2 PMC7809753 Table…     2 Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[…
3 PMC7809753 Table…     3 Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44,…
4 PMC7809753 Table…     4 Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, …
5 PMC7809753 Table…     5 Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutam…
6 PMC7790830 Table…     1 Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; …

Take the paper PMC7809753, whose output is shown above. Its first table, "Properties of the chemotherapeutic drugs used in AML", looks like this. In my data frame, Table 1 of PMC7809753 appears as 5 rows, which correspond to the rows of the table in the picture I have attached.

Now the issue is: how do I parse each table of a particular PMCID into a tabular or column-like structure, as shown in the paper?

UPDATE: Based on the PMCID, I can split the rows into a list.

aa <- split(t1, f = t1$PMCID) 

which gives me this

$PMC7790830
# A tibble: 5 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7790830 Table…     1 Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; …
2 PMC7790830 Table…     2 Patients no.=2; Age (years)=41; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=14.58; Cell count(×109/l): HB=103…
3 PMC7790830 Table…     3 Patients no.=3; Age (years)=49; Gender=M; FAB subtype=M4; Cell count(×109/l): WBC=4.84; Cell count(×109/l): HB=69; …
4 PMC7790830 Table…     4 Patients no.=4; Age (years)=65; Gender=M; FAB subtype=M5; Cell count(×109/l): WBC=220; Cell count(×109/l): HB=85; C…
5 PMC7790830 Table…     5 Patients no.=5; Age (years)=61; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=4.61; Cell count(×109/l): HB=71; …

$PMC7809753
# A tibble: 5 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7809753 Table…     1 Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK.…
2 PMC7809753 Table…     2 Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[…
3 PMC7809753 Table…     3 Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44,…
4 PMC7809753 Table…     4 Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, …
5 PMC7809753 Table…     5 Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutam…

UPDATE v2

I tried to collapse the rows that share a PMCID into one, based on the solution below.

Convert duplicate rows to separate columns in R

library(splitstackshape)
library(data.table)
DT <- setDT(t1)[, do.call(paste, c(.SD, list(collapse=', '))) , PMCID]
DT1 <- cSplit(DT, 'V1', sep='[ ,]+', fixed=FALSE, stripWhite=TRUE)
setnames(DT1, 2:ncol(DT1), rep(names(t1)[-1], 41))
DT1

So the problem remains as above: how do I separate the rows in each list element into columns or some tabular form, as shown in the picture?

Solution

I think it may be helpful to use the tidypmc package with your europepmc output. Here is an example of extracting the first table from your PMC article using pmc_table. This also uses map from purrr, which is part of the tidyverse.

library(tidypmc)
library(tidyverse)
library(europepmc)

doc <- map("PMC7809753", epmc_ftxt)
tbls <- pmc_table(doc[[1]])
tbls[[1]]

Output

# A tibble: 7 x 6
  Drug                Target           Influx            Metabolisma                                 Efflux         Refs.        
  <chr>               <chr>            <chr>             <chr>                                       <chr>          <chr>        
1 Cytarabine (Ara-C)  DNA polymerases  ENT1, CNT3, OCTN1 "Activation: dCK, dCMPK, NDK. Inactivation… MRP4,7,8       [14, 30–33, …
2 Daunorubicin (DNR)  DNA, Topoisomer… Passive diffusion ""                                          P-gp, MRP1,7,… [44, 51, 81–…
3 Mitoxantrone (MX)   DNA, Topoisomer… Passive diffusion ""                                          P-gp, MRP1, B… [44, 85–90]  
4 Etoposide (VP-16)   Topoisomerase II Passive diffusion ""                                          P-gp, MRP1-3,… [16, 91, 92] 
5 Methotrexate (MTX)  DHFR, TS, AICAR… RFC, PCFT         "Aldehyde oxidase, FPGS (polyglutamylation… P-gp, MRP1-5,… [16, 93, 94] 
6 Venetoclax (VEN)    Bcl-2            Passive diffusion ""                                          P-gp           [72, 95]     
7 Gemtuzumab Ozogami… DNA              Ab-mediated endo… "Lysosomal Calicheamicin cleavage from Ab,… P-gp, MRP1     [73, 77]     

Edit (1/30/21): To automate this process for multiple articles (and based on your other question and approach), consider the following.

You can have a vector containing your PMCIDs, and use that with map. This will create docs, a list containing the XML of every article in that vector.

Then you can use map again to store all the tables in my_tables, which will be a list with one element per article.

b <- epmc_search(query = 'cytarabine aml OPEN_ACCESS:Y', limit = 6)
pmcids <- b$pmcid[b$isOpenAccess=="Y"]
docs <- map(pmcids, epmc_ftxt)
my_tables <- map(docs, pmc_table)

You can then access, for example, article 2 table 1 by:

my_tables[[2]][[1]]

Edit (1/31/21): To name each article by its PMCID, you can use set_names and chain it to map with %>%. set_names adds names to your vector. When you call it without supplying names, it uses the vector's own elements as the names. For example:

docs <- pmcids %>%
  set_names() %>%
  map(., epmc_ftxt)

You can call my_tables <- map(docs, pmc_table) separately afterwards, or add it to the chain (storing the whole result as my_tables) if you are only interested in the tables and not the full documents.
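As a rough illustration of how the names carry through the whole chain, here is a minimal sketch that runs offline. The functions fetch_doc and get_tables are hypothetical stand-ins (not part of any package) for epmc_ftxt and pmc_table, and the two PMCIDs are simply the ones from the question.

```r
library(purrr)

# Hypothetical stand-ins so the sketch runs without network access;
# in the real chain these would be epmc_ftxt and pmc_table.
fetch_doc  <- function(id) paste("full text of", id)
get_tables <- function(doc) list(paste("Table 1 from", doc))

pmcids <- c("PMC7809753", "PMC7790830")

my_tables <- pmcids %>%
  set_names() %>%       # with no names supplied, the values become the names
  map(fetch_doc) %>%    # map() keeps the names of its input
  map(get_tables)       # so the final list is still named by PMCID

names(my_tables)
my_tables[["PMC7790830"]][[1]]
```

Because map preserves names, naming the vector once at the start is enough; every later step can be looked up by PMCID.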

Ultimately, you could then access individual tables using the PMCID like this:

my_tables[["PMC7806552"]][[1]]
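If only the flattened text column is at hand, it may also be possible to parse it directly, since each row is a list of key=value pairs separated by a semicolon and a space. The following is a sketch under assumptions rather than part of the answer above: it assumes keys never contain "=", that cells never contain a semicolon followed by whitespace, and it uses a small stand-in for t1 with the same four columns. The helper parse_rows is a hypothetical name.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Small stand-in for t1 with the same columns (PMCID, table, row, text).
t1 <- tibble::tibble(
  PMCID = c("PMC7809753", "PMC7809753", "PMC7790830"),
  table = "Table 1",
  row   = c(1L, 2L, 1L),
  text  = c(
    "Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion",
    "Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion",
    "Patients no.=1; Gender=M; Karyotype=46,XX,t(8,21)(q22;q22)"
  )
)

# Hypothetical helper: turn the "key=value; key=value" text of one table
# into one column per key.
parse_rows <- function(df) {
  df %>%
    # one "key=value" pair per row; ";" without trailing space is left alone
    separate_rows(text, sep = ";\\s+") %>%
    # split on the first "=" only, keeping any later "=" inside the value
    separate(text, into = c("key", "value"),
             sep = "=", extra = "merge", fill = "right") %>%
    pivot_wider(names_from = key, values_from = value)
}

# Parse each table of each article separately, so that tables with
# different headers do not end up sharing columns.
parsed <- t1 %>%
  split(list(.$PMCID, .$table), drop = TRUE) %>%
  map(parse_rows)

parsed[["PMC7809753.Table 1"]]
```

Splitting by both PMCID and table before widening keeps each result limited to its own headers. The approach only recovers what is in the text column, so rows missing from the data frame stay missing, whereas pmc_table reads the table from the article itself.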
