Python request api is not fetching data inside table bodies


Problem description

I am trying to scrape a web page to get table values from the text returned in a requests response.

</thead>
 <tbody class="stats"></tbody>
 <tbody class="annotation"></tbody>
 </table>
 </div>

There is actually data inside these tbody elements, but I am unable to access it using requests.

Here is my code:

import requests

server = "http://www.ebi.ac.uk/QuickGO/GProtein"
# Browser-like User-Agent header
header = {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}
payloads = {'ac': 'Q9BRY0'}
# Send the custom User-Agent along with the query parameters
response = requests.get(server, params=payloads, headers=header)

print(response.text)
#soup = BeautifulSoup(response.text, 'lxml')
#print(soup)
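
One way to confirm that the table bodies are already empty in the HTML the server returns (which would suggest the rows are filled in later by JavaScript in the browser) is to scan the response text for `tbody` elements and record whether each one contains anything. The sketch below uses only the standard library; the names `TbodyProbe` and `probe_tbodies` and the sample string are illustrative and not part of the original code.

```python
from html.parser import HTMLParser


class TbodyProbe(HTMLParser):
    """Record, for each <tbody>, whether it has any child tag or non-blank text."""

    def __init__(self):
        super().__init__()
        self.results = []   # one [class attribute, has_content] entry per <tbody>
        self._depth = 0     # greater than 0 while inside a <tbody>

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.results.append([dict(attrs).get("class"), False])
            self._depth += 1
        elif self._depth:
            self.results[-1][1] = True   # a child tag such as <tr>

    def handle_endtag(self, tag):
        if tag == "tbody" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.results[-1][1] = True


def probe_tbodies(html):
    probe = TbodyProbe()
    probe.feed(html)
    probe.close()
    return [(cls, has_content) for cls, has_content in probe.results]


# Sample based on the snippet in the question; with the real page you
# would pass response.text instead.
sample = '<table><tbody class="stats"></tbody><tbody class="annotation"></tbody></table>'
print(probe_tbodies(sample))
```

If every entry reports no content, parsing the response differently will not help, and the data has to be requested from wherever the page's scripts fetch it.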

Solution

Frankly, I'm beginning to lose interest in routine scraping with tools like Selenium, and beyond that I wasn't sure it would work here. This approach does.

You would only do this, in this form at least, if you had more than a few files to download.

>>> import bs4
>>> form = '''<form method="POST" action="GAnnotation"><input name="a" value="" type="hidden"><input name="termUse" value="ancestor" type="hidden"><input name="relType" value="IPO=" type="hidden"><input name="customRelType" value="IPOR+-?=" type="hidden"><input name="protein" value="Q9BRY0" type="hidden"><input name="tax" value="" type="hidden"><input name="qualifier" value="" type="hidden"><input name="goid" value="" type="hidden"><input name="ref" value="" type="hidden"><input name="evidence" value="" type="hidden"><input name="with" value="" type="hidden"><input name="source" value="" type="hidden"><input name="q" value="" type="hidden"><input name="col" value="proteinDB,proteinID,proteinSymbol,qualifier,goID,goName,aspect,evidence,ref,with,proteinTaxon,date,from,splice" type="hidden"><input name="select" value="normal" type="hidden"><input name="aspectSorter" value="" type="hidden"><input name="start" value="0" type="hidden"><input name="count" value="25" type="hidden"><input name="format" value="gaf" type="hidden"><input name="gz" value="false" type="hidden"><input name="limit" value="22" type="hidden"></form>'''
>>> soup = bs4.BeautifulSoup(form, 'lxml')
>>> action = soup.find('form').attrs['action']
>>> action 
'GAnnotation'
>>> inputs = soup.find_all('input')
>>> params = {}
>>> for field in inputs:
...     params[field.attrs['name']] = field.attrs['value']
...
>>> import requests
>>> r = requests.post('http://www.ebi.ac.uk/QuickGO/GAnnotation', data=params)
>>> r
<Response [200]>
>>> open('temp.htm', 'w').write(r.text)
4082

The downloaded file is what you would receive if you simply clicked on the button.

Details for the Chrome browser:

  • Open the page in Chrome.
  • Right-click on the 'Download' link.
  • Select 'Inspect'.
  • Select the 'Network' tab in the Chrome developer tools (near the top), and then 'All'.
  • Click on 'Download' in the page.
  • Click on 'Download' in the newly opened window.
  • 'quickgoUtil.js:36' will appear in the 'Initiator' column.
  • Click on it.
  • Now you can set a breakpoint on `form.submit();` by clicking on its line number.
  • Click on 'Download' again; execution will pause at the breakpoint.
  • In the right-hand pane, notice 'Local'. One of its entries is `form`. You can expand it to see the contents of the form.

You want the outerHTML property of this element: it supplies the information used in the code above, namely the form's action and its name-value pairs (and, implicitly, the fact that POST is used).
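
As a rough sketch of how the method, action and name-value pairs can be pulled out of a copied outerHTML string, the block below uses the standard library's `html.parser` rather than bs4, and resolves the relative action against the page URL with `urljoin` instead of typing the full URL by hand. The names `FormReader` and `read_form` are made up for this illustration, and the form string is a shortened version of the one shown earlier.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormReader(HTMLParser):
    """Collect the method, action and named input values of the first <form>."""

    def __init__(self):
        super().__init__()
        self.method = None
        self.action = None
        self.fields = {}
        self._state = "before"   # "before", "inside" or "done"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self._state == "before":
            self._state = "inside"
            self.method = (attrs.get("method") or "GET").upper()
            self.action = attrs.get("action") or ""
        elif tag == "input" and self._state == "inside" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._state == "inside":
            self._state = "done"


def read_form(outer_html, page_url):
    reader = FormReader()
    reader.feed(outer_html)
    reader.close()
    # A relative action such as "GAnnotation" is resolved against the page URL.
    return reader.method, urljoin(page_url, reader.action or ""), reader.fields


# Shortened version of the form, for illustration only.
html = ('<form method="POST" action="GAnnotation">'
        '<input name="protein" value="Q9BRY0" type="hidden">'
        '<input name="format" value="gaf" type="hidden"></form>')
method, url, fields = read_form(html, "http://www.ebi.ac.uk/QuickGO/GProtein")
print(method, url, fields)
```

Resolving the action this way gives the same URL that the session above passes to `requests.post`.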

Now use the requests module to submit a request to the website.
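
When a dict is passed as `data=`, requests sends it as a form-encoded (`application/x-www-form-urlencoded`) body. The sketch below uses the standard library's `urlencode` to illustrate that encoding on a few of the form's fields. It is meant as an approximation of what goes over the wire, not a trace of what requests does internally.

```python
from urllib.parse import urlencode, parse_qsl

# A few of the name-value pairs from the form; the full dict works the same way.
params = {"protein": "Q9BRY0", "customRelType": "IPOR+-?=", "tax": ""}

# Characters such as '+', '?' and '=' inside a value are percent-encoded.
body = urlencode(params)
print(body)

# Decoding the body recovers the original pairs, including empty values.
decoded = dict(parse_qsl(body, keep_blank_values=True))
print(decoded == params)
```

Empty values such as `tax` are still sent as `tax=`, which is why the hidden inputs with empty values appear in `params` at all.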

Here's a list of the items in params in case you want to make other requests.

>>> for item in params.keys():
...     item, params[item]
... 
('qualifier', '')
('source', '')
('count', '25')
('protein', 'Q9BRY0')
('format', 'gaf')
('termUse', 'ancestor')
('gz', 'false')
('with', '')
('goid', '')
('start', '0')
('customRelType', 'IPOR+-?=')
('evidence', '')
('aspectSorter', '')
('tax', '')
('relType', 'IPO=')
('limit', '22')
('col', 'proteinDB,proteinID,proteinSymbol,qualifier,goID,goName,aspect,evidence,ref,with,proteinTaxon,date,from,splice')
('q', '')
('ref', '')
('select', 'normal')
('a', '')
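
For several downloads, one option is to keep the captured dict as a template and copy it with a few values replaced for each request. The helper below is a minimal sketch: `params_for` is a hypothetical name, `base` holds only a handful of the fields listed above, and "P12345" is a placeholder accession. Which values the server accepts for each field is not covered here.

```python
def params_for(base_params, **overrides):
    """Return a copy of base_params with some values replaced."""
    unknown = set(overrides) - set(base_params)
    if unknown:
        # Guard against typos in field names, which would otherwise be sent silently.
        raise KeyError(f"not a field of the form: {sorted(unknown)}")
    updated = dict(base_params)
    updated.update(overrides)
    return updated


# Subset of the captured fields, for illustration.
base = {"protein": "Q9BRY0", "format": "gaf", "start": "0", "count": "25", "limit": "22"}

# "P12345" is a placeholder accession used only for illustration.
for accession in ["Q9BRY0", "P12345"]:
    request_params = params_for(base, protein=accession)
    print(request_params["protein"], request_params["format"])
    # With the full params dict, each request would then be sent as before:
    # requests.post('http://www.ebi.ac.uk/QuickGO/GAnnotation', data=request_params)
```

Because each call returns a new dict, the template itself is never modified between requests.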
