BeautifulSoup inconsistent behavior


Question

I am completely puzzled by the behavior of the following HTML-scraping code, which I ran in two different environments, and I need help finding the root cause of the discrepancy.

import sys
import bs4
import md5
import logging
from urllib2 import urlopen
from platform import platform

# Log particulars of the environment
logging.warning("OS platform is %s" %platform())
logging.warning("Python version is %s" %sys.version)
logging.warning("BeautifulSoup is at %s and its version is %s" %(bs4.__file__, bs4.__version__))

# Open web-page and read HTML
url = 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=JXIG&size=all'
response = urlopen(url)
html = response.read()

# Calculate MD5 to ensure that the same string was downloaded
print "MD5 sum for html string downloaded is %s" %md5.new(html).hexdigest()

# Make beautiful soup
soup = bs4.BeautifulSoup(html, 'html')
contigsTable = soup.find("table", {"class" : "zebra"})
contigs = []

# Parse table in soup to find all records
for row in contigsTable.findAll('tr'):
    column = row.findAll('td')
    if len(column) > 2:
        contigs.append(column[1])

# Expect identical results on any machine that this is run
print "Number of contigs identified is %s" %len(contigs)

On Machine 1, running this returns:

WARNING:root:OS platform is Linux-3.10.10-031010-generic-x86_64-with-Ubuntu-12.04-precise   
WARNING:root:Python version is 2.7.3 (default, Jun 22 2015, 19:33:41)  
[GCC 4.6.3]  
WARNING:root:BeautifulSoup is at /usr/local/lib/python2.7/dist-packages/bs4/__init__.pyc and its version is 4.3.2  
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf  

Number of contigs identified is 630  

On Machine 2, the very same code returns:

WARNING:root:OS platform is Linux-2.6.32-431.46.2.el6.nersc.x86_64-x86_64-with-debian-6.0.6
WARNING:root:Python version is 2.7.4 (default, Apr 17 2013, 10:26:13) 
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /global/homes/i/img/.local/lib/python2.7/site-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf

Number of contigs identified is 462


The number of contigs found is different. The same code parses the same HTML (the MD5 sums are identical) with the same BeautifulSoup version in two environments that are not strikingly different from each other, yet it yields different results, which has turned into a production nightmare. Manual inspection confirms that the result returned on Machine 2 is incorrect, but I have so far been unable to explain why.

Has anyone had a similar experience? Do you notice anything wrong with this code, or should I stop trusting BeautifulSoup altogether?

Recommended answer

You are seeing the differences between the parsers that BeautifulSoup chooses automatically for the "html" markup type you specified. Which parser is picked depends on which modules are available in the current Python environment:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
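To compare two machines, it can help to list which of these parser packages are importable on each one. The following is a minimal sketch, assuming Python 3 and only the standard library; the names PREFERENCE and installed_parsers are hypothetical helpers, and the check only approximates BeautifulSoup's own selection logic by following the ranking quoted above.

```python
import importlib.util

# Ranking taken from the quoted documentation: lxml first, then html5lib,
# then Python's built-in parser (which ships with the standard library).
PREFERENCE = ["lxml", "html5lib", "html.parser"]


def installed_parsers():
    """Return the names in PREFERENCE that are importable here, best first."""
    found = []
    for name in PREFERENCE:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            found.append(name)
    return found


if __name__ == "__main__":
    parsers = installed_parsers()
    print("Installed parsers, best first: %s" % parsers)
    print("An unspecified parser would probably resolve to: %s" % parsers[0])
```

If the two machines print different lists, that difference alone may be enough to explain two different contig counts from the same HTML.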

To get consistent behavior across platforms, name the parser explicitly:

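from bs4 import BeautifulSoup

# Pick exactly one of these and use the same one on every machine:
soup = BeautifulSoup(html, "html.parser")  # Python's built-in parser
soup = BeautifulSoup(html, "html5lib")     # requires the html5lib package
soup = BeautifulSoup(html, "lxml")         # requires the lxml package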

See also: Installing a parser
