Creating a function to fetch Europe PMC literature that skips papers which don't return tables


Problem description

This is a follow-up to an earlier question of mine.

I am currently following a rather complex approach to do what I want.

But Ben proposed a simple solution:

library(tidypmc)
library(tidyverse)
library(europepmc)

doc <- map("PMC7809753", epmc_ftxt)
tbls <- pmc_table(doc[[1]])
tbls[[1]]

My objective is to search Europe PMC for open-access papers on a drug, a disease, etc., then fetch their data **in tabular form** and save it.

The first part is handled by this:

library(europepmc)
b <-epmc_search(query = 'cytarabine aml OPEN_ACCESS:Y',limit = 20)
pmcids <- b$pmcid[b$isOpenAccess=="Y"]

This gives me pmcids, a character vector.

For the second part, Ben's suggestion works really well:

doc <- map("PMC7809753", epmc_ftxt)
tbls <- pmc_table(doc[[1]])
tbls[[1]]

To combine the two steps, with the help of a generous Stack Overflow user, I got this function:

b <- epmc_search(query = 'cytarabine aml OPEN_ACCESS:Y', limit = 6)
pmcids <- b$pmcid[b$isOpenAccess == "Y"]
pub_tables <- lapply(pmcids, function(pmc_id) {
  message("-- Trying ", pmc_id, "...")
  doc <- tryCatch(pmc_xml(pmc_id),
                  error = function(e) {
                    message("------ Failed to recover PMCID")
                    return(NULL)
                  })
  if(!is.null(doc)) {
    #-- If succeed, try to get table
    tables <- pmc_table(doc)
    if(!is.null(tables)) {
      #-- If succeed, try to get table name
      table_caps <- pmc_caption(doc) %>%
        filter(tag == "table")
      names(tables) <- paste(table_caps$label, table_caps$text, sep = " - ")
    }
    return(tables)
  } else {
    #-- If fail, return NA
    return(NA)
  }
})
names(pub_tables) <- pmcids

This works well, but I got this error:

Error in names(tables) <- paste(table_caps$label, table_caps$text, sep = " - ") : 
  'names' attribute [3] must be the same length as the vector [2]

These are the PMCIDs returned by the query with the limit set to 6:

"PMC7837979" "PMC7809753" "PMC7790830" "PMC7797573" "PMC7806552" "PMC7836575"

How do I skip the papers for which I get no information and move on to the next one? In other words, how do I work around this error?

I have very little experience writing complicated functions, but as I understand the code, this chunk should handle that case, and I am not sure why it does not:

} else {
    #-- If fail, return NA
    return(NA)
  }


Error in names(tables) <- paste(table_caps$label, table_caps$text, sep = " - ") : 
      'names' attribute [3] must be the same length as the vector [2]

For example, when the limit is set to 4 it works well: pub_tables is returned as a list, and the last PMCID is returned as

$PMC7797573
NULL

But the problem occurs with "PMC7806552". So how do I get a NULL result when fetching the tables fails, and then move on to the next PMCID?

Any help would be highly appreciated.

Or is there a simpler way of doing it?

Recommended answer

Here is the function, modified slightly so that it works. The only edit is that I added these lines:

table_caps <- table_caps %>% group_by(label) %>% 
   summarise(text = paste(text, collapse=" "), 
             tag = "table")

after the initial definition of the table_caps object. The problem was that some table captions had multiple sentences, so table_caps had more rows than there were tables (3 caption rows for 2 tables in the error above). The added lines paste the sentences of each caption together, leaving one row per table label.
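The mismatch and the fix can be reproduced without downloading anything. The following is only a sketch on made-up data: the table_caps and tables objects are hypothetical stand-ins for what pmc_caption and pmc_table return, and the caption texts are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the filtered pmc_caption() output:
# one row per caption sentence, so "Table 1" appears twice.
table_caps <- tibble(
  tag   = c("table", "table", "table"),
  label = c("Table 1", "Table 1", "Table 2"),
  text  = c("Patient characteristics.", "Values are medians.", "Response rates.")
)

# Hypothetical stand-in for the pmc_table() output: one element per table.
tables <- list(data.frame(x = 1), data.frame(y = 2))

# 3 caption rows versus 2 tables: assigning names directly would fail
nrow(table_caps)
length(tables)

# Collapse the sentences of each caption into a single row per label
caps_merged <- table_caps %>%
  group_by(label) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop")

names(tables) <- paste(caps_merged$label, caps_merged$text, sep = " - ")
names(tables)
```

One thing worth checking on real papers: group_by sorts the labels as strings, so in a paper with ten or more tables "Table 10" would sort before "Table 2" and the names could end up attached to the wrong tables.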

b <-epmc_search(query = 'cytarabine aml OPEN_ACCESS:Y',limit = 10)
pmcids <- b$pmcid[b$isOpenAccess=="Y"]
pub_tables <- lapply(pmcids, function(pmc_id) {
  message("-- Trying ", pmc_id, "...")
  doc <- tryCatch(pmc_xml(pmc_id), 
                  error = function(e) {
                    message("------ Failed to recover PMCID")
                    return(NULL)
                  })
  if(!is.null(doc)) { 
    #-- If succeed, try to get table
    tables <- pmc_table(doc)
    if(!is.null(tables)) {
      #-- If succeed, try to get table name
      table_caps <- pmc_caption(doc) %>%
        filter(tag == "table")
      table_caps <- table_caps %>% group_by(label) %>% 
        summarise(text = paste(text, collapse=" "), 
                  tag = "table")
      names(tables) <- paste(table_caps$label, table_caps$text, sep = " - ")
    }
    return(tables) 
  } else {
    #-- If fail, return NA
    return(NA)
  }
})
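If the goal is simply to skip any paper that fails for whatever reason, one option is to wrap the whole per-paper step in tryCatch rather than only the download. The sketch below shows the pattern on made-up input: fetch_tables and the IDs are hypothetical placeholders, not part of tidypmc or europepmc, and the function fails on purpose for one ID to imitate the error above.

```r
# Hypothetical stand-in for the per-paper work (download, parse, name tables).
# It raises an error for one ID to imitate the length-mismatch failure.
fetch_tables <- function(pmc_id) {
  if (pmc_id == "PMC_BAD") {
    stop("'names' attribute must be the same length as the vector")
  }
  list(data.frame(id = pmc_id))
}

ids <- c("PMC_OK1", "PMC_BAD", "PMC_OK2")  # placeholder IDs, not real PMCIDs

results <- lapply(ids, function(pmc_id) {
  tryCatch(
    fetch_tables(pmc_id),
    error = function(e) {
      # Report the failure and return NULL so the loop continues
      message("------ Skipping ", pmc_id, ": ", conditionMessage(e))
      NULL
    }
  )
})
names(results) <- ids

# Drop the papers that were skipped
results <- Filter(Negate(is.null), results)
names(results)
```

This trades precision for robustness: every error is turned into a skipped paper, so a genuine bug in the per-paper code would also be silently skipped unless the messages are read.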
