Want to pull a journal title from an RCSB Page using python & BeautifulSoup


Problem Description

I am trying to get specific information about the original citing paper in the Protein Data Bank given only the 4 letter PDBID of the protein.

To do this I am using the Python libraries requests and BeautifulSoup. To try to build the code, I went to the page for a particular protein, in this case 1K48, and also saved the HTML for the page (by hitting command+S and saving the HTML to my desktop).

First, some things to note:

1) The url for this page is: http://www.rcsb.org/pdb/explore.do?structureId=1K48

2) You can get to the page for any protein by replacing the last four characters with the appropriate PDBID.

3) I am going to want to perform this procedure on many PDBIDs, in order to sort a large list by the Journal they originally appeared in.

4) Searching through the HTML, one finds the journal title located inside a form here:

<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">  
    <p><span id="se_abstractTitle"><a onclick="c(0);">Refined</a> <a onclick="c(1);">structure</a> <a onclick="c(2);">and</a> <a onclick="c(3);">metal</a> <a onclick="c(4);">binding</a> <a onclick="c(5);">site</a> of the <a onclick="c(8);">kalata</a> <a onclick="c(9);">B1</a> <a onclick="c(10);">peptide.</a></span></p>                                                        
    <p><a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Skjeldal, L.&#39;);">Skjeldal, L.</a>,&nbsp;&nbsp;<a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Gran, L.&#39;);">Gran, L.</a>,&nbsp;&nbsp;<a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Sletten, K.&#39;);">Sletten, K.</a>,&nbsp;&nbsp;<a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Volkman, B.F.&#39;);">Volkman, B.F.</a></p> 
    <p>
        <b>Journal:</b>     
        (2002)
        <span class="se_journal">Arch.Biochem.Biophys.</span>
        <span class="se_journal"><b>399: </b>142-148</span>         
    </p>

A lot more is in the form but it is not relevant. What I do know is that my journal title, "Arch.Biochem.Biophys", is located within a span tag with class "se_journal".

And so I wrote the following code:

def JournalLookup():
    PDBID= '1K48'

    import requests
    from bs4 import BeautifulSoup

    session = requests.session()

    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' %PDBID)

    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")

Ideally I'd be able to use find instead of findAll as these are the only two in the document, but I used findAll to at least verify I'm getting an empty list. I assumed that it would return a list containing the two span tags with class "se_journal", but it instead returns an empty list.

After spending several hours going through possible solutions, including a piece of code that printed every span in doc, I have concluded that the document returned by requests does not include the lines I want at all.

Does anybody know why this is the case, and what I could possibly do to fix it?

Thanks.

Recommended Answer

The content you are interested in is provided by JavaScript. It's easy to verify: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. The page also displays a friendly message:

"This browser is either not Javascript enabled or has it turned off. This site will not function correctly without Javascript."

For JavaScript-driven pages you cannot use Python Requests on its own. There are some alternatives, one being dryscrape.
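
A minimal sketch of that approach (not tested here; dryscrape needs the webkit-server backend installed, and on a headless machine you may also need an X virtual framebuffer; the helper name journal_spans is just for illustration): render the page in a headless WebKit session so the JavaScript runs, then hand the rendered HTML to BeautifulSoup exactly as before.

import dryscrape
from bs4 import BeautifulSoup

def journal_spans(pdbid):
    # Render the page in a headless WebKit session so the JavaScript that
    # fills in the citation details actually runs.
    session = dryscrape.Session()
    session.set_attribute('auto_load_images', False)  # images are not needed
    session.visit('http://www.rcsb.org/pdb/explore.do?structureId=%s' % pdbid)

    # session.body() is the HTML *after* the scripts have executed, so the
    # se_journal spans should now be present.
    soup = BeautifulSoup(session.body(), 'html.parser')
    return [span.get_text(strip=True)
            for span in soup.find_all('span', class_='se_journal')]

print(journal_spans('1K48'))  # should include 'Arch.Biochem.Biophys.' for this entry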

PS: Do not import libraries/modules within a function. It is not recommended in Python, and PEP 8 says:

Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

This SO question (http://stackoverflow.com/questions/128478/should-python-import-statements-always-be-at-the-top-of-a-module) explains why that is not the recommended way to do it.
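
Concretely, this is the question's lookup unchanged apart from moving the imports to module level (plus a return so the result is usable). It will still come back empty for this page, for the JavaScript reason above; the point here is only where the imports live.

import requests
from bs4 import BeautifulSoup

def JournalLookup():
    PDBID = '1K48'
    session = requests.session()
    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' % PDBID)
    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")
    return Journal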
