BeautifulSoup returning different html than view source


Question

I'm brand new to BeautifulSoup, so forgive me if my question is stupid. I've been googling and trying the suggestions in every Stack Overflow thread I could find since 6am, but to no avail.

My problem is that I have a .csv file of gene names. Some of them are in Ensembl format, which means I MUST use the Ensembl database to look up the info I need. For the rest I can use the NCBI database.

My code itself is fine. I know this because every query sent to NCBI returns the info I need, and I'm able to extract it all with BeautifulSoup and write it to a CSV. HOWEVER, either urlopen or BeautifulSoup is not working the way I've been led to understand it works.

When I put the following URL into my address bar, the correct webpage loads: http://uswest.ensembl.org/Gallus_gallus/Gene/Summary?db=core;g=ENSGALG00000016955;r=1:165302186-165480795;t=ENSGALT00000027404

I can then view source and check out the HTML. Yet when I run:

html = urlopen("http://uswest.ensembl.org/Gallus_gallus/Gene/Summary?db=core;g=ENSGALG00000016955;r=1:165302186-165480795;t=ENSGALT00000027404")
soup = BeautifulSoup(html, 'lxml')

The HTML it outputs is not at all what I get when I load the same URL in my browser and view source. I know that for pages with JavaScript, inspect element and view source will differ, but urlopen should ALWAYS return the same HTML as view source.

I need to extract the string after "Description". Visiting the link in my browser, I can inspect the source and see the tags I need to find with BeautifulSoup. However, unless urlopen works properly and returns the correct HTML, there is nothing I can do. My RA job depends on getting this done by tonight.

Any suggestions?

Recommended answer

Parts of the page, for instance the "Summary", are loaded by the JavaScript referenced in the script tags. However, the text you are looking for is embedded in the HTML itself. The following code locates the text after the Description label:

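import requests
from bs4 import BeautifulSoup

url = "http://uswest.ensembl.org/Gallus_gallus/Gene/Summary?db=core;g=ENSGALG00000016955;r=1:165302186-165480795;t=ENSGALT00000027404"
# Fetch the raw HTML; this is the static page source, before any JavaScript runs.
r = requests.get(url, timeout=5)
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(r.text, "html.parser")
# The description text sits in the first div with class "rhs".
description = soup.find("div", {'class': "rhs"})
print(description.text)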
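
The code above takes the first div with class "rhs", which only works as long as the description happens to come first. As a rough sketch, the lookup can instead be tied to the "Description" label itself. The markup in SAMPLE_HTML is a simplified, hypothetical stand-in (it assumes each label lives in a div with class "lhs" next to its value in a div with class "rhs") and is not copied from the Ensembl page; find_field is likewise an illustrative helper name, so check both against the real page source before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a simplified stand-in for the gene summary page,
# assumed for illustration only and not copied from Ensembl.
SAMPLE_HTML = """
<div class="summary_panel">
  <div class="row">
    <div class="lhs">Description</div>
    <div class="rhs"><p>example gene description</p></div>
  </div>
  <div class="row">
    <div class="lhs">Location</div>
    <div class="rhs"><p>Chromosome 1</p></div>
  </div>
</div>
"""


def find_field(html, label):
    """Return the text of the div.rhs paired with the div.lhs whose text
    equals `label`, or None if no such pair exists."""
    soup = BeautifulSoup(html, "html.parser")
    for lhs in soup.find_all("div", class_="lhs"):
        if lhs.get_text(strip=True) == label:
            # The value is assumed to be the next sibling div with class "rhs".
            rhs = lhs.find_next_sibling("div", class_="rhs")
            return rhs.get_text(strip=True) if rhs else None
    return None


description = find_field(SAMPLE_HTML, "Description")
print(description)
```

With a live page, the same helper could be called on r.text from the requests example. Returning None for a missing label makes it easier to skip genes without a description when looping over a CSV of gene names.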
