Why is there a table when I scrape with BeautifulSoup, but not pandas

Views: 71 | Published: 2020/9/20 7:23:47 | Tags: python pandas beautifulsoup

Question

I am trying to scrape the entries on this page into a tab-delimited format (mainly the sequence and the UniProt accession number).

When I run:

url = 'www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=1000&listname='    
table = pd.read_html(url)
print(table)

I get:

Traceback (most recent call last):
  File "scrape_signalpeptides.py", line 7, in <module>
    table = pd.read_html(url)
  File "/Users/ION/anaconda3/lib/python3.7/site-packages/pandas/io/html.py", line 1094, in read_html
    displayed_only=displayed_only)
  File "/Users/ION/anaconda3/lib/python3.7/site-packages/pandas/io/html.py", line 916, in _parse
    raise_with_traceback(retained)
  File "/Users/ION/anaconda3/lib/python3.7/site-packages/pandas/compat/__init__.py", line 420, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No tables found

So then I tried the BeautifulSoup approach:

import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

url = 'http://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=1000&listname='
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
print(soup)

and I can see there is data there. Does anyone have an idea why I cannot parse this page with pandas.read_html?

Edit 1: Based on the suggestion below, I ran this:

from bs4 import BeautifulSoup
import requests
s = requests.session()
s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
res = s.get('https://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=2&listname=')
print(res)

I tried the URL in all three forms (bare www, http and https), and in every case I get connection errors, e.g.

urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x1114f0898>: Failed to establish a new connection: [Errno 61] Connection refused

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.signalpeptide.de', port=443): Max retries exceeded with url: /index.php?sess=&m=listspdb_bacteria&s=details&id=2&listname= (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x1114f0898>: Failed to establish a new connection: [Errno 61] Connection refused'

ConnectionRefusedError: [Errno 61] Connection refused

Recommended answer

The url variable differs between your two scripts.

Side by side:

url = 'www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=1000&listname=' # pandas
url = 'http://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=1000&listname=' # BeautifulSoup

I suspect the http:// prefix matters for pandas to recognize the string as a URL rather than as the HTML itself. After all, pandas.read_html interprets its argument dynamically, as described in the documentation:

A URL, a file-like object, or a raw string containing HTML. Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.

In particular, the sentence "If you have a URL that starts with 'https' you might try removing the 's'" leads me to believe that the http:// prefix is what tells it the argument is a link, as opposed to a "file-like object" or raw HTML.
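The scheme check described above can be sketched with the standard library alone. This is a minimal illustration, not pandas' own logic: normalize_url is a hypothetical helper name, and downgrading https to http only makes sense when the site serves the same page over plain http, as it appears to here.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str, default_scheme: str = "http") -> str:
    """Return `url` with an explicit scheme, downgrading https to http.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    url = url.strip()
    if "://" not in url:
        # No scheme at all (e.g. 'www.example.org/page'): prepend one so the
        # string is no longer mistaken for something other than a URL.
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)
    if parts.scheme == "https":
        # Follow the documentation's hint and drop the 's'.
        parts = parts._replace(scheme="http")
    return urlunsplit(parts)


bare = "www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=1000&listname="
fixed = normalize_url(bare)
print(fixed)
```

The normalized string could then be passed to pd.read_html(fixed) in place of the bare www address from the question.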

If the error is about exceeding max retries, you probably need to use a requests.session with headers. Code I have used for this before looked like:

import requests

s = requests.session()
s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
res = s.get('your_url')

At that point you should be able to use the res object the same way you would the result of a plain requests.get() call (you can access attributes like .text and so on). I'm not too sure how s.headers works; I copied it from another SO post and it fixed my script!
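For readers wondering what s.headers does: headers stored on a session are merged into every request sent through it, replacing the default python-requests User-Agent that some servers reject. The sketch below shows this without touching the network by preparing a request and inspecting it. The User-Agent string is the one from the code above; the shortened URL and the id parameter are only illustrative.

```python
import requests

# User-Agent string copied from the snippet above; any browser-like value
# would serve the same purpose.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
)

session = requests.Session()
session.headers["User-Agent"] = USER_AGENT  # applies to every later request

# Build the request but do not send it, so the merged headers can be
# inspected offline. The URL and params are shortened for illustration.
request = requests.Request(
    "GET", "http://www.signalpeptide.de/index.php", params={"id": "2"}
)
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.url)
```

Calling session.get(...) performs the same merge internally before sending the request.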

Part of the error message from your last code block is:

ssl.CertificateError: hostname 'www.signalpeptide.de' doesn't match either of 'www.kg13.art', 'www.thpr.net'

This means their SSL certificate is not valid for that hostname, so https probably won't work because the host cannot be verified. I switched the URL to http and printed the resulting HTML:

from bs4 import BeautifulSoup
import requests

s = requests.session()
s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
res = s.get('http://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=2&listname=')
print(res.text)

Result:

C:\Users\rparkhurst\PycharmProjects\Workspace\venv\Scripts\python.exe C:/Users/rparkhurst/PycharmProjects/Workspace/new_workspace.py <!doctype html> <html class="no-js" lang="en"> <head> <meta charset="utf-8"/> <meta name="viewport" content="width=device-width, initial-scale=1.0"/> <title>Signal Peptide Database</title> <link rel="stylesheet" href="css/foundation.css"> <link href='http://cdnjs.cloudflare.com/ajax/libs/foundicons/3.0.0/foundation-icons.css' rel='stylesheet' type='text/css'> <link href="css/custom.css" rel="stylesheet" type="text/css"> </head> <body> <div class="top-bar"> <div class="row"> <div class="top-bar-left"> <div class="top-bar-title"> <span data-responsive-toggle="responsive-menu" data-hide-for="medium"> <span class="menu-icon dark" data-toggle></span> </span> <a href="./"><img src="img/logo.jpg" alt="logo" id="logo"></a> </div> </div> <div class="top-bar-right"> <h3 class="hide-for-small">Signal Peptide Website</h3> <div id="responsive-menu"> <ul class="dropdown menu" data-dropdown-menu> <li><a href="./?m=myprotein">Search my Protein</a></li> <li><a href="./?m=searchspdb">Advanced Search</a></li> <li><a href="./?m=listspdb">Database Search</a></li> <li><a href="./?m=references">References</a></li> <li><a href="./?m=hints">Hints</a></li> <li><a href="./?m=links">Links</a></li> <li><a href="./?m=imprint">Imprint</a></li> </ul> </div> </div> </div> </div> <br> <div class="row columns"> <div class="content"> <span class="headline">Signal Peptide Database - Bacteria</span><br><br> <form action="index.php" method="post"><input type="hidden" name="sess" value=""> <input type="hidden" name="m" value="listspdb_bacteria"> <input type="hidden" name="id" value="2"> <input type="hidden" name="a" value="save"> <table cellspacing="2" cellpadding="2" border="0"> <tr> <td colspan="3" class="tabhead"> <b>Entry Details</b></td></tr> <tr height="23"> <td class="highlight">ID</td> <td class="highlight" width="50"> </td> <td class="highlight">2</td> </tr> <tr 
height="23"> <td class="highlight">Source Database</td> <td class="highlight" width="50"> </td> <td class="highlight">UniProtKB/Swiss-Prot</td> </tr> <tr height="23"> <td class="highlight">UniProtKB/Swiss-Prot Accession Number</td> <td class="highlight" width="50"> </td> <td class="highlight">A6X5T5    (Created: 2009-01-20 Updated: 2009-01-20)</td> </tr> <tr height="23"> <td class="highlight">UniProtKB/Swiss-Prot Entry Name</td> <td class="highlight" width="50"> </td> <td class="highlight"><a target="_new" class="bblack" href="http://www.uniprot.org/uniprot/14KL_OCHA4">14KL_OCHA4</a></td> </tr> <tr height="23"> <td class="highlight">Protein Name</td> <td class="highlight" width="50"> </td> <td class="highlight">Lectin-like protein BA14k</td> </tr> <tr height="23"> <td class="highlight">Gene</td> <td class="highlight" width="50"> </td> <td class="highlight">Oant_3884</td> </tr> <tr height="23"> <td class="highlight">Organism Scientific</td> <td class="highlight" width="50"> </td> <td class="highlight">Ochrobactrum anthropi (strain ATCC 49188 / DSM 6882 / NCTC 12168)</td> </tr> <tr height="23"> <td class="highlight">Organism Common</td> <td class="highlight" width="50"> </td> <td class="highlight"></td> </tr> <tr height="23"> <td class="highlight">Lineage</td> <td class="highlight" width="50"> </td> <td class="highlight">Bacteria<br>  Proteobacteria<br>    Alphaproteobacteria<br>      Rhizobiales<br>        Brucellaceae<br>          Ochrobactrum<br></td> </tr> <tr height="23"> <td class="highlight">Protein Length [aa]</td> <td class="highlight" width="50"> </td> <td class="highlight">151</td> </tr> <tr height="23"> <td class="highlight">Protein Mass [Da]</td> <td class="highlight" width="50"> </td> <td class="highlight">17666</td> </tr> <tr height="23"> <td class="highlight">Features</td> <td class="highlight" width="50"> </td> <td 
class="highlight"><table><tr><td><b>Type</b></td><td><b>Description</b></td><td><b>Status</b></td><td><b>Start</b></td><td><b>End</b></td></tr><tr><td class="w"><font color="red">signal peptide</font>   </td><td class="w"><font color="red"></font>   </td><td class="w"><font color="red">potential</font>   </td><td class="w"><font color="red">1</font>   </td><td class="w"><font color="red">26</font></td></tr><tr><td class="w"><font color="blue">chain</font>   </td><td class="w"><font color="blue">Lectin-like protein BA14k</font>   </td><td class="w"><font color="blue"></font>   </td><td class="w"><font color="blue">27</font>   </td><td class="w"><font color="blue">151</font></td></tr><tr><td class="w"><font color="green">transmembrane region</font>   </td><td class="w"><font color="green"></font>   </td><td class="w"><font color="green">potential</font>   </td><td class="w"><font color="green">83</font>   </td><td class="w"><font color="green">103</font></td></tr></table></td> </tr> <tr height="23"> <td class="highlight">SP Length</td> <td class="highlight" width="50"> </td> <td class="highlight">26</td> </tr> <tr valign="top"> <td class="highlight"></td><td class="highlight" width="50"> </td><td class="highlightfixed">----+----1----+----2----+----3----+----4----+----5</td></tr><tr valign="top"> <td class="highlight">Signal Peptide</td><td class="highlight" width="50"> </td><td class="highlightfixed">MNIFKQTCVGAFAVIFGATSIAPTMA</td></tr><tr valign="top"> <td class="highlight"> Sequence</td><td class="highlight" width="50"> </td><td class="highlightfixed"><font color="red">MNIFKQTCVGAFAVIFGATSIAPTMA</font><font color="blue">APLNLERPVINHNVEQVRDHRRPP<br>RHYNGHRPHRPGYWNGHRGYRHYRHGYRRYND</font><font color="green">GWWYPLAAFGAGAIIGGA<br>VSQ</font><font color="blue">PRPVYRAPRMSNAHVQWCYNRYKSYRSSDNTFQPYNGPRRQCYSPYS<br>R</td></tr><tr valign="top"> <td class="highlight"> Original</td><td class="highlight" width="50"> </td><td 
class="highlightfixed">MNIFKQTCVGAFAVIFGATSIAPTMAAPLNLERPVINHNVEQVRDHRRPP<br>RHYNGHRPHRPGYWNGHRGYRHYRHGYRRYNDGWWYPLAAFGAGAIIGGA<br>VSQPRPVYRAPRMSNAHVQWCYNRYKSYRSSDNTFQPYNGPRRQCYSPYS<br>R</td></tr><tr valign="top"> <td class="highlight"></td><td class="highlight" width="50"> </td><td class="highlightfixed">----+----1----+----2----+----3----+----4----+----5</td></tr><tr height="23"> <td class="highlight">Hydropathies</td> <td class="highlight" width="50"> </td> <td class="highlight"><a href="./hydropathy/hydropathy.php?id=2" target="_new"><img src="./hydropathy/hydropathy.php?id=2" border="0" width="600"></a></td> </tr> <tr> <td colspan="3" class="nohighlight"> </td> </tr> <tr> <td colspan="3" class="tabhead" align="center"><input class="button" type="reset" value="Back" onclick="history.back(-1);"></td> </tr> </table> </form></div> <hr> <div class="row"> <div class="small-4 medium-3 columns"><a href="./">Home</a>   <a href="./?m=imprint">Imprint</a></div> <div class="small-8 medium-9 columns text-right"> © 2007-2017 <a href="mailto:kapp@mpi-cbg.de">Katja Kapp</a>, Dresden & <a href="http://www.thpr.net/">thpr.net e. K.</a>, Dresden, Germany, last update 2010-06-11 </div> </div><br><br> <script src="js/vendor/jquery.js"></script> <script src="js/foundation.js"></script> <script> $(document).foundation(); </script> </body> </html> Process finished with exit code 0

So it seems this solves your issue.
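To get from the fetched HTML to the tab-delimited output the question asks for, one possible approach is to walk the rows of the details table and keep those with a label, a spacer, and a value cell. The sketch below runs on an abridged sample modeled on the HTML printed above, so it needs no network access; entry_details is a hypothetical helper name, and the real page may need extra handling (for example, the nested Features table).

```python
from bs4 import BeautifulSoup

# Abridged sample modeled on the HTML printed above; a real run would pass
# res.text instead.
SAMPLE_HTML = """
<table cellspacing="2" cellpadding="2" border="0">
<tr><td colspan="3" class="tabhead"><b>Entry Details</b></td></tr>
<tr height="23"><td class="highlight">ID</td><td class="highlight" width="50"> </td><td class="highlight">2</td></tr>
<tr height="23"><td class="highlight">UniProtKB/Swiss-Prot Accession Number</td><td class="highlight" width="50"> </td><td class="highlight">A6X5T5    (Created: 2009-01-20 Updated: 2009-01-20)</td></tr>
<tr valign="top"><td class="highlight">Signal Peptide</td><td class="highlight" width="50"> </td><td class="highlightfixed">MNIFKQTCVGAFAVIFGATSIAPTMA</td></tr>
</table>
"""


def entry_details(html: str) -> dict:
    """Map each label cell to its value cell (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for row in soup.find_all("tr"):
        # Only direct child cells, so rows of nested tables are not mixed in.
        cells = row.find_all("td", recursive=False)
        if len(cells) != 3:
            continue  # header rows and nested rows have a different shape
        label = cells[0].get_text(strip=True)
        value = cells[2].get_text(strip=True)
        if label:
            details[label] = value
    return details


details = entry_details(SAMPLE_HTML)
# The accession cell also carries created/updated dates; keep the first token.
accession = details["UniProtKB/Swiss-Prot Accession Number"].split()[0]
signal_peptide = details["Signal Peptide"]
print(f"{accession}\t{signal_peptide}")
```

Looping this over a range of id values and writing one line per entry would give the tab-delimited file, assuming every entry page follows the same layout.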
