BeautifulSoup in Python not parsing right
Question
I am running Python 2.7.5 and using the built-in html parser for what I am about to describe.
The task I am trying to accomplish is to take a chunk of html that is essentially a recipe. Here is an example.
html_chunk = "<h1>Mini Potato Knishes</h1><p>Posted by bettyboop50 at recipegoldmine.com May 10, 2001</p><p>Makes about 42 miniature knishes</p><p>These are just yummy for your tummy!</p><p>3 cups mashed potatoes (about<br>&nbsp;&nbsp;&nbsp; 2 very large potatoes)<br>2 eggs, slightly beaten<br>1 large onion, diced<br>2 tablespoons margarine<br>1 teaspoon salt (or to taste)<br>1/8 teaspoon black pepper<br>3/8 cup Matzoh meal<br>1 egg yolk, beaten with 1 tablespoon water</p><p>Preheat oven to 400 degrees F.</p><p>Saute diced onion in a small amount of butter or margarine until golden brown.</p><p>In a medium bowl, combine mashed potatoes, sauteed onion, eggs, margarine, salt, pepper, and Matzoh meal.</p><p>Form mixture into small balls about the size of a walnut. Brush with egg yolk mixture, place on a well-greased baking sheet, and bake for 20 minutes or until well browned.</p>"
The goal is to separate out the header, junk, ingredients, instructions, serving, and number of ingredients.
Here is my code that accomplishes that:
from bs4 import BeautifulSoup

def list_to_string(items):
    joined = ""
    for item in items:
        joined += str(item)
    return joined

def get_ingredients(soup):
    # The ingredients paragraph is the first <p> that contains <br> tags
    for p in soup.find_all('p'):
        if p.find('br'):
            return p

def get_instructions(p_list, ingredient_index):
    return p_list[ingredient_index + 1:]

def get_junk(p_list, ingredient_index):
    return p_list[:ingredient_index]

def get_serving(p_list):
    for item in p_list:
        item_str = str(item).lower()
        # test each serving-related keyword individually
        if any(word in item_str for word in ("yield", "make", "serve", "serving")):
            yield_index = p_list.index(item)
            del p_list[yield_index]
            return item

def ingredients_count(ingredients):
    ingredients_list = ingredients.find_all(text=True)
    return len(ingredients_list)

def get_header(soup):
    return soup.find('h1')

def html_chunk_splitter(soup):
    ingredients = get_ingredients(soup)
    if ingredients is None:
        error = 1
        header = ""
        junk_string = ""
        instructions_string = ""
        serving = ""
        count = ""
    else:
        p_list = soup.find_all('p')
        serving = get_serving(p_list)
        ingredient_index = p_list.index(ingredients)
        junk_list = get_junk(p_list, ingredient_index)
        instructions_list = get_instructions(p_list, ingredient_index)
        junk_string = list_to_string(junk_list)
        instructions_string = list_to_string(instructions_list)
        header = get_header(soup)
        error = ""
        count = ingredients_count(ingredients)
    return (header, junk_string, ingredients, instructions_string,
            serving, count, error)
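The key heuristic above is that the ingredients paragraph is the one containing <br> tags. A minimal sketch of that check on a made-up mini chunk (the HTML here is hypothetical, not from my data):

```python
from bs4 import BeautifulSoup

html = ("<h1>Title</h1><p>intro</p>"
        "<p>1 cup flour<br>2 eggs</p><p>Mix well.</p>")
soup = BeautifulSoup(html, "html.parser")

# pick the first <p> that contains a <br> -- that's the ingredient list
ingredients = next(p for p in soup.find_all("p") if p.find("br"))
print(ingredients.get_text())  # -> "1 cup flour2 eggs"
```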
It works well except in situations where I have chunks that contain strings like "Sauté", because soup = BeautifulSoup(html_chunk) causes Sauté to turn into SautÃ©. This is a problem because I have a huge CSV file of recipes like the html_chunk above, and I'm trying to structure all of them nicely and then get the output back into a database. I tried checking whether Sauté comes out right using this HTML previewer, and it still comes out as SautÃ©. I don't know what to do about this.
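This is classic mojibake: UTF-8 bytes being decoded as Latin-1. The effect can be reproduced without BeautifulSoup at all:

```python
# UTF-8 encodes "é" as the two-byte sequence 0xC3 0xA9
data = "Sauté".encode("utf-8")

# Decoding those bytes as Latin-1 treats each byte as its own character,
# which produces exactly the "SautÃ©" garbling described above
print(data.decode("latin-1"))  # -> "SautÃ©"
```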
What's stranger is that when I do what BeautifulSoup's documentation shows
BeautifulSoup("SacrÃ© bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

I get

# SacrÃ© bleu!
But my colleague tried that on his Mac, running from terminal, and he got exactly what the documentation shows.
I really appreciate all your help. Thank you.
Answer
BeautifulSoup tries to guess the encoding and sometimes gets it wrong, but you can specify the encoding explicitly with the from_encoding parameter, for example:
soup = BeautifulSoup(html_text, from_encoding="UTF-8")
The encoding is usually available in the headers of the webpage.
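A fuller sketch of the same idea: from_encoding only takes effect when the markup is passed in as raw bytes (if you pass an already-decoded str, the damage has usually been done earlier, e.g. when reading the CSV):

```python
from bs4 import BeautifulSoup

# "Sauté" as raw UTF-8 bytes, as it might arrive from a file or CSV dump
raw = "<p>Sauté</p>".encode("utf-8")

# Without a hint BeautifulSoup may guess the wrong encoding;
# from_encoding pins it down
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
print(soup.p.get_text())  # -> "Sauté"
```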