Want to pull a journal title from an RCSB Page using python & BeautifulSoup


Problem Description

I am trying to get specific information about the original citing paper in the Protein Data Bank given only the 4 letter PDBID of the protein.

To do this I am using the Python libraries requests and BeautifulSoup. To try to build the code, I went to the page for a particular protein, in this case 1K48, and also saved the HTML for the page (by hitting command+S and saving the HTML to my desktop).

First, some things to note:

1) The url for this page is: http://www.rcsb.org/pdb/explore.do?structureId=1K48

2) You can get to the page for any protein by replacing the last four characters with the appropriate PDBID.

3) I am going to want to perform this procedure on many PDBIDs, in order to sort a large list by the Journal they originally appeared in.

4) Searching through the HTML, one finds the journal title located inside a form here:

<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">  
    <p><span id="se_abstractTitle"><a onclick="c(0);">Refined</a> <a onclick="c(1);">structure</a> <a onclick="c(2);">and</a> <a onclick="c(3);">metal</a> <a onclick="c(4);">binding</a> <a onclick="c(5);">site</a> of the <a onclick="c(8);">kalata</a> <a onclick="c(9);">B1</a> <a onclick="c(10);">peptide.</a></span></p>                                                        
    <p><a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Skjeldal, L.&#39;);">Skjeldal, L.</a>,&nbsp;&nbsp;<a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Gran, L.&#39;);">Gran, L.</a>,&nbsp;&nbsp;<a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Sletten, K.&#39;);">Sletten, K.</a>,&nbsp;&nbsp;<a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor(&#39;Volkman, B.F.&#39;);">Volkman, B.F.</a></p> 
    <p>
        <b>Journal:</b>     
        (2002)
        <span class="se_journal">Arch.Biochem.Biophys.</span>
        <span class="se_journal"><b>399: </b>142-148</span>         
    </p>

A lot more is in the form but it is not relevant. What I do know is that my journal title, "Arch.Biochem.Biophys", is located within a span tag with class "se_journal".

And so I wrote the following code:

def JournalLookup():
    PDBID= '1K48'

    import requests
    from bs4 import BeautifulSoup

    session = requests.session()

    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' %PDBID)

    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")

Ideally I'd be able to use find instead of findAll as these are the only two in the document, but I used findAll to at least verify I'm getting an empty list. I assumed that it would return a list containing the two span tags with class "se_journal", but it instead returns an empty list.

After spending several hours going through possible solutions, including a piece of code that printed every span in doc, I have concluded that the document returned by requests does not include the lines I want at all.

Does anybody know why this is the case, and what I could possibly do to fix it?

Thanks.

Recommended Answer

The content you are interested in is provided by JavaScript. It's easy to verify: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. The page also displays a friendly message:

"This browser is either not Javascript enabled or has it turned off. This site will not function correctly without Javascript."

For JavaScript-driven pages you cannot use Python Requests on its own. There are some alternatives, one being dryscrape.
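
A minimal sketch of that approach (not tested here; dryscrape needs the webkit-server backend installed, and on a headless machine you may also need an X virtual framebuffer; the helper name journal_spans is just for illustration): render the page in a headless WebKit session so the JavaScript runs, then hand the rendered HTML to BeautifulSoup exactly as before.

import dryscrape
from bs4 import BeautifulSoup

def journal_spans(pdbid):
    # Render the page in a headless WebKit session so the JavaScript that
    # fills in the citation details actually runs.
    session = dryscrape.Session()
    session.set_attribute('auto_load_images', False)  # images are not needed
    session.visit('http://www.rcsb.org/pdb/explore.do?structureId=%s' % pdbid)

    # session.body() is the HTML *after* the scripts have executed, so the
    # se_journal spans should now be present.
    soup = BeautifulSoup(session.body(), 'html.parser')
    return [span.get_text(strip=True)
            for span in soup.find_all('span', class_='se_journal')]

print(journal_spans('1K48'))  # should include 'Arch.Biochem.Biophys.' for this entry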

PS: Do not import libraries/modules within a function. It is not recommended in Python, and PEP 8 says:

Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

This SO question (http://stackoverflow.com/questions/128478/should-python-import-statements-always-be-at-the-top-of-a-module) explains why that is not the recommended way to do it.
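
Concretely, this is the question's lookup unchanged apart from moving the imports to module level (plus a return so the result is usable). It will still come back empty for this page, for the JavaScript reason above; the point here is only where the imports live.

import requests
from bs4 import BeautifulSoup

def JournalLookup():
    PDBID = '1K48'
    session = requests.session()
    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' % PDBID)
    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")
    return Journal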
