Downloading all full-text articles in PMC and PubMed databases

Question

According to one of the questions answered by the NCBI Help Desk, we cannot "bulk-download" PubMed Central. However, can I use the NCBI E-utilities to download all full-text papers in the PMC database using EFetch, or at least find all corresponding PMCIDs using ESearch in the Entrez Programming Utilities? If yes, then how? If E-utilities cannot be used, is there any other way to download all full-text articles?

Recommended answer

First of all, before you go downloading files in bulk, I highly recommend you read the E-utilities usage guidelines.

If you want full-text articles, you're going to want to limit your search to open access files. Furthermore, I suggest also restricting your search to Medline articles if you want articles that are any good. Then you can do the search.

Using Biopython, this gives us:

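from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder: NCBI asks for a contact address, use your own

search_query = 'medline[sb] AND "open access"[filter]'

# getting search results for the query
search_results = Entrez.read(Entrez.esearch(db="pmc", term=search_query, retmax=10, usehistory="y"))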
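To see what that search amounts to on the wire, here is a minimal sketch that builds the equivalent ESearch request URL with the standard library only. The helper name `build_esearch_url` is made up for illustration; the parameters (`db`, `term`, `retmax`, `usehistory`) are the same ones passed to `Entrez.esearch` above.

```python
from urllib.parse import urlencode

# Base URL of the ESearch endpoint of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pmc", retmax=10, use_history=True):
    """Return the ESearch URL for a query (hypothetical helper, not part of Biopython)."""
    params = {"db": db, "term": term, "retmax": retmax}
    if use_history:
        # Ask the server to keep the result set so EFetch can refer to it later.
        params["usehistory"] = "y"
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url('medline[sb] AND "open access"[filter]')
print(url)
```

Nothing is sent over the network here; the sketch only shows how the query string and the usehistory flag end up in the request that Biopython issues for you.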

You can use the search function on the PMC website and it will display the generated query that you can copy/paste into your code. Now that you've done the search, you can actually download the files:

handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml", retstart=0, retmax=int(search_results["Count"]), webenv=search_results["WebEnv"], query_key=search_results["QueryKey"])
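The call above asks for every result in a single request. A batched variant could look like the sketch below, which is only one way to do it: `batch_ranges`, `fetch_all` and `make_entrez_fetcher` are hypothetical helper names, the batch size is arbitrary, and the 0.4 s pause is an assumption chosen to stay under NCBI's limit of three requests per second without an API key. The efetch arguments are copied from the call above.

```python
import time
from typing import Callable, Iterator, List, Tuple


def batch_ranges(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (retstart, retmax) pairs that cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_all(total: int, batch_size: int,
              fetch_batch: Callable[[int, int], str],
              pause: float = 0.4) -> List[str]:
    """Call fetch_batch(retstart, retmax) once per batch, pausing between requests."""
    chunks = []
    for retstart, retmax in batch_ranges(total, batch_size):
        chunks.append(fetch_batch(retstart, retmax))
        time.sleep(pause)  # be polite to the servers
    return chunks


def make_entrez_fetcher(search_results) -> Callable[[int, int], str]:
    """Build a fetch_batch function from the esearch results (needs Biopython)."""
    from Bio import Entrez  # imported lazily so the helpers above work without it

    def fetch_batch(retstart: int, retmax: int) -> str:
        handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml",
                               retstart=retstart, retmax=retmax,
                               webenv=search_results["WebEnv"],
                               query_key=search_results["QueryKey"])
        try:
            return handle.read()
        finally:
            handle.close()

    return fetch_batch
```

With the `search_results` from the search step, usage would be something like `fetch_all(int(search_results["Count"]), 100, make_entrez_fetcher(search_results))`, which returns one XML string per batch.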

  • You might want to download in batches by varying retstart and retmax in a loop, in order to avoid flooding the servers.
  • If handle contains only one file, handle.read() returns the whole XML file as a string. If it contains more, each article is contained in its own <article></article> node.
  • The full text is only available in XML, and the default parser available in pubmed doesn't handle XML namespaces, so you're going to be on your own with ElementTree (or another parser) to parse your XML.
  • Here, the articles are found thanks to the internal history of E-utilities, which is enabled by the usehistory="y" argument in Entrez.esearch() and accessed with the webenv and query_key arguments in Entrez.efetch().

A few tips about XML parsing with ElementTree: Element.remove() only removes direct children, so you can't delete a grandchild node directly and you're probably going to want to delete some nodes recursively. node.text returns the text in node, but only up to the first child, so you'll need to do something along the lines of "".join(node.itertext()) if you want to get all the text in a given node.
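The ElementTree tips can be sketched as follows. The XML in `SAMPLE` is a small hand-written stand-in, not real PMC output, and `remove_descendants` and `article_texts` are hypothetical helper names; dropping `table-wrap` elements is just an example of a node you might not want in the extracted text.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a multi-article response (not real PMC output).
SAMPLE = """<pmc-articleset>
  <article>
    <body><p>Aspirin <italic>reduces</italic> fever.</p><table-wrap><p>Table 1</p></table-wrap></body>
  </article>
  <article>
    <body><p>Second article.</p></body>
  </article>
</pmc-articleset>"""


def remove_descendants(root, tag):
    """Remove every descendant of root with the given tag, at any depth."""
    # Element.remove() only works on direct children, so visit each
    # element and remove matching children from their own parent.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)  # note: this also drops the child's tail text


def article_texts(xml_string, drop_tags=("table-wrap",)):
    """Return the body text of each <article> node as a plain string."""
    root = ET.fromstring(xml_string)
    texts = []
    for article in root.iter("article"):
        body = article.find("body")
        for tag in drop_tags:
            remove_descendants(body, tag)
        # node.text would stop at the first child; itertext() walks everything.
        texts.append("".join(body.itertext()).strip())
    return texts


first_p = ET.fromstring(SAMPLE).find("article/body/p")
print(repr(first_p.text))        # only the text before <italic>
print(article_texts(SAMPLE))
```

Real responses may use XML namespaces, in which case the tag names passed to `iter()` and `find()` would need the namespace prefix in `{uri}tag` form.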
