Passing lxml output to BeautifulSoup

Views: 522 | Posted: 2016/8/5 19:17:22 | Tags: python, beautifulsoup, lxml

Problem description

My offline code works fine, but I'm having trouble passing a web page from urllib via lxml to BeautifulSoup. I'm using urllib for basic authentication, then lxml to parse (it gives a good result with the specific pages we need to scrape), and then BeautifulSoup.

#! /usr/bin/python
import urllib.request 
import urllib.error 
from io import StringIO
from bs4 import BeautifulSoup 
from lxml import etree 
from lxml import html 

file = open("sample.html")
doc = file.read()
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
# working perfectly

With that working, I tried to feed it a page via urllib:

# attempt 1
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
# TypeError: initial_value must be str or None, not bytes

Trying to deal with the error message, I tried:

# attempt 2
html = etree.parse(bytes.decode(doc), parser)
#OSError: Error reading file

I didn't know what to do about the OSError, so I looked for another method. I found suggestions to use lxml.html instead of lxml.etree, so the next attempt was:

# attempt 3
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
html = html.document_fromstring(doc)
print (html)
# <Element html at 0x140c7e0>
soup = BeautifulSoup(html) # also tried (html, "lxml")
# TypeError: expected string or buffer

This clearly gives a structure of some sort, but how do I pass it to BeautifulSoup? My question is twofold: how can I pass a page from urllib to lxml.etree (as in attempt 1, which is closest to my working code)? Or, how can I pass an lxml.html structure to BeautifulSoup (as above)? I understand that both revolve around datatypes, but I don't know what to do about them.

Python 3.3, lxml 3.0.1, BeautifulSoup 4. I'm new to Python. Thanks to the internet for code fragments and examples.
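
One possible way to handle both halves of the question, offered as a sketch rather than a definitive answer. In Python 3, `page.read()` returns bytes, so wrapping it in `BytesIO` instead of `StringIO` should avoid the TypeError from attempt 1. The OSError in attempt 2 most likely comes from `etree.parse()` treating a plain string as a file name or URL; `etree.fromstring()` parses the data itself. For attempt 3, BeautifulSoup expects markup rather than an lxml element, so the element can be serialized first with `lxml.html.tostring()`. The `doc` value below is a hypothetical stand-in for the bytes returned by `urllib.request.urlopen(req).read()`, and `"html.parser"` is just one parser choice.

```python
from io import BytesIO

from bs4 import BeautifulSoup
from lxml import etree
import lxml.html

# Hypothetical stand-in for page.read(); urlopen(...).read() returns bytes.
doc = b"<html><body><p class='msg'>Hello</p></body></html>"

# Part 1: bytes -> lxml.etree -> BeautifulSoup.
# BytesIO accepts bytes, whereas StringIO only accepts str.
tree = etree.parse(BytesIO(doc), etree.HTMLParser())
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
soup1 = BeautifulSoup(result, "html.parser")

# Alternative for part 1: parse the data directly instead of a file-like object.
root_alt = etree.fromstring(doc, etree.HTMLParser())

# Part 2: lxml.html element -> BeautifulSoup.
# Serialize the element back to markup before handing it over.
root = lxml.html.document_fromstring(doc)
markup = lxml.html.tostring(root, encoding="unicode")
soup2 = BeautifulSoup(markup, "html.parser")

print(soup1.p.get_text())
print(soup2.p.get_text())
```

Importing `lxml.html` as a module (rather than `from lxml import html`) also avoids the name clash in the attempts above, where the variable `html` overwrites the imported `html` module.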
