Passing lxml output to BeautifulSoup

Question

My offline code works fine but I'm having trouble passing a web page from urllib via lxml to BeautifulSoup. I'm using urllib for basic authentication then lxml to parse (it gives a good result with the specific pages we need to scrape) then to BeautifulSoup.

#! /usr/bin/python
import urllib.request 
import urllib.error 
from io import StringIO
from bs4 import BeautifulSoup 
from lxml import etree 
from lxml import html 

file = open("sample.html")
doc = file.read()
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
# working perfectly

With that working, I tried to feed it a page via urllib:

# attempt 1
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
# TypeError: initial_value must be str or None, not bytes
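
The TypeError in attempt 1 comes from the stream type rather than from lxml: `urlopen(...).read()` returns `bytes`, while `io.StringIO` only accepts `str`. A minimal sketch of the two ways around it, using only the standard library; the `doc` literal is a stand-in for the downloaded page, and the UTF-8 encoding is an assumption about that page:

```python
from io import BytesIO, StringIO

# Stand-in for page.read(); the real call returns bytes just like this literal.
doc = b"<html><body><p>hello</p></body></html>"

# StringIO rejects bytes, which reproduces the error from attempt 1.
try:
    StringIO(doc)
    error = None
except TypeError as exc:
    error = exc

# Option 1: keep the data as bytes and wrap it in a byte stream.
byte_stream = BytesIO(doc)

# Option 2: decode to str first (UTF-8 is assumed here), then use StringIO.
text_stream = StringIO(doc.decode("utf-8"))
```

Either stream is a file-like object, which is the kind of argument the working offline code hands to `etree.parse`.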

Trying to deal with the error message, I tried:

# attempt 2
html = etree.parse(bytes.decode(doc), parser)
#OSError: Error reading file

I didn't know what to do about the OSError so I sought another method. I found suggestions to use lxml.html instead of lxml.etree so the next attempt is:

# attempt 3
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
html = html.document_fromstring(doc)
print (html)
# <Element html at 0x140c7e0>
soup = BeautifulSoup(html) # also tried (html, "lxml")
# TypeError: expected string or buffer

This clearly gives a structure of some sort, but how do I pass it to BeautifulSoup? My question is twofold: how can I pass a page from urllib to lxml.etree (as in attempt 1, which is closest to my working code)? Or, how can I pass an lxml.html structure to BeautifulSoup (as above)? I understand that both revolve around datatypes, but I don't know what to do about them.

Python 3.3, lxml 3.0.1, BeautifulSoup 4. I'm new to Python. Thanks to the internet for code fragments and examples.

Recommended answer

BeautifulSoup can use the lxml parser directly, so there is no need to go to these lengths. Pass it the page content as read from urllib and name the parser:

BeautifulSoup(doc, 'lxml')
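
For completeness, here is a hedged sketch covering both halves of the question alongside the one-liner above. It assumes `bs4` and `lxml` are installed; the `doc` literal is a stand-in for `page.read()`, and the variable names are illustrative rather than taken from the original code. The point it tries to show is that BeautifulSoup wants markup (`str` or `bytes`), not an lxml element, so an lxml tree has to be serialized before it is handed over.

```python
from io import BytesIO

from bs4 import BeautifulSoup
from lxml import etree
import lxml.html

# Stand-in for page.read(); urlopen(...).read() returns bytes like this.
doc = (b"<html><head><title>Sample</title></head>"
       b"<body><p class='x'>Hello</p></body></html>")

# 1. The answer's approach: give BeautifulSoup the raw bytes and name the parser.
soup_direct = BeautifulSoup(doc, "lxml")

# 2. lxml.etree route: etree.parse expects a file-like object, filename or URL,
#    so the bytes are wrapped in BytesIO rather than passed as a string.
tree = etree.parse(BytesIO(doc), etree.HTMLParser())
serialized = etree.tostring(tree.getroot(), pretty_print=True, method="html")
soup_from_etree = BeautifulSoup(serialized, "lxml")

# 3. lxml.html route: document_fromstring returns an element, which must be
#    serialized back to markup before BeautifulSoup can parse it.
root = lxml.html.document_fromstring(doc)
soup_from_html = BeautifulSoup(lxml.html.tostring(root), "lxml")

print(soup_direct.title.string)
```

Routes 2 and 3 parse the page twice (once in lxml, once in BeautifulSoup), which is why the direct call is usually the simpler choice unless the intermediate lxml tree is needed for something else.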
