Passing lxml output to BeautifulSoup
Question
My offline code works fine, but I'm having trouble passing a web page from urllib via lxml to BeautifulSoup. I'm using urllib for basic authentication, then lxml to parse (it gives good results on the specific pages we need to scrape), and then BeautifulSoup.
#! /usr/bin/python
import urllib.request
import urllib.error
from io import StringIO
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
file = open("sample.html")
doc = file.read()
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
# working perfectly
With that working, I tried to feed it a page via urllib:
# attempt 1
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
# TypeError: initial_value must be str or None, not bytes
Trying to deal with the error message, I tried:
# attempt 2
html = etree.parse(bytes.decode(doc), parser)
#OSError: Error reading file
I didn't know what to do about the OSError, so I looked for another method. I found suggestions to use lxml.html instead of lxml.etree, so the next attempt was:
# attempt 3
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
html = html.document_fromstring(doc)
print (html)
# <Element html at 0x140c7e0>
soup = BeautifulSoup(html) # also tried (html, "lxml")
# TypeError: expected string or buffer
This clearly gives a structure of some sort, but how do I pass it to BeautifulSoup? My question is twofold: how can I pass a page from urllib to lxml.etree (as in attempt 1, which is closest to my working code)? Or, how can I pass an lxml.html structure to BeautifulSoup (as above)? I understand that both revolve around datatypes, but I don't know what to do about them.
Python 3.3, lxml 3.0.1, BeautifulSoup 4. I'm new to Python. Thanks to the internet for code fragments and examples.
Recommended answer
BeautifulSoup can use the lxml parser directly; there is no need to go to these lengths.
BeautifulSoup(doc, 'lxml')
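As a rough sketch of how this relates to the three attempts in the question: urllib's page.read() returns bytes, which BeautifulSoup accepts as-is. The same bytes can go to etree.parse through BytesIO rather than StringIO. Attempt 2 fails because etree.parse treats a plain string as a file name. An lxml element can be serialized with lxml.html.tostring before being handed to BeautifulSoup. The sample bytes below are an assumption standing in for the downloaded page, so the snippet runs without a network connection.

```python
from io import BytesIO

from bs4 import BeautifulSoup
from lxml import etree
import lxml.html

# Hypothetical stand-in for: doc = urllib.request.urlopen(req).read()
doc = b"<html><body><p class='msg'>Hello</p></body></html>"

# Direct route from the answer: bs4 takes the bytes and drives lxml itself.
soup = BeautifulSoup(doc, "lxml")

# Attempt 1, adjusted: bytes need BytesIO, not StringIO.
tree = etree.parse(BytesIO(doc), etree.HTMLParser())

# Attempt 3, adjusted: serialize the lxml element, then pass it to bs4.
# The module is imported as lxml.html so no variable named html shadows it.
root = lxml.html.document_fromstring(doc)
soup_from_lxml = BeautifulSoup(lxml.html.tostring(root), "lxml")

print(soup.p.get_text())
print(tree.getroot().tag)
print(soup_from_lxml.find("p", class_="msg").get_text())
```

If only the BeautifulSoup object is needed, the first route is probably enough; the other two are shown only because the question asked about them.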