BeautifulSoup in Python not parsing right
Question
I am running Python 2.7.5 and using the built-in html parser for what I am about to describe.
The task I am trying to accomplish is to take a chunk of html that is essentially a recipe. Here is an example.
html_chunk = "<h1>Mini Potato Knishes</h1><p>Posted by bettyboop50 at recipegoldmine.com May 10, 2001</p><p>Makes about 42 miniature knishes</p><p>These are just yummy for your tummy!</p><p>3 cups mashed potatoes (about<br>&nbsp;&nbsp;&nbsp; 2 very large potatoes)<br>2 eggs, slightly beaten<br>1 large onion, diced<br>2 tablespoons margarine<br>1 teaspoon salt (or to taste)<br>1/8 teaspoon black pepper<br>3/8 cup Matzoh meal<br>1 egg yolk, beaten with 1 tablespoon water</p><p>Preheat oven to 400 degrees F.</p><p>Saute diced onion in a small amount of butter or margarine until golden brown.</p><p>In a medium bowl, combine mashed potatoes, sauteed onion, eggs, margarine, salt, pepper, and Matzoh meal.</p><p>Form mixture into small balls about the size of a walnut. Brush with egg yolk mixture, place on a well-greased baking sheet, and bake for 20 minutes or until well browned.</p>"
The goal is to separate out the header, junk, ingredients, instructions, serving, and number of ingredients.
Here is my code that accomplishes that:
from bs4 import BeautifulSoup

def list_to_string(items):
    joined = ""
    for item in items:
        joined += str(item)
    return joined

def get_ingredients(soup):
    # The ingredients paragraph is the first <p> that contains <br> tags
    for p in soup.find_all('p'):
        if p.find('br'):
            return p

def get_instructions(p_list, ingredient_index):
    return p_list[ingredient_index + 1:]

def get_junk(p_list, ingredient_index):
    return p_list[:ingredient_index]

def get_serving(p_list):
    for item in p_list:
        item_str = str(item).lower()
        # test each serving-related keyword individually
        if any(word in item_str for word in ("yield", "make", "serve", "serving")):
            yield_index = p_list.index(item)
            del p_list[yield_index]
            return item

def ingredients_count(ingredients):
    ingredients_list = ingredients.find_all(text=True)
    return len(ingredients_list)

def get_header(soup):
    return soup.find('h1')

def html_chunk_splitter(soup):
    ingredients = get_ingredients(soup)
    if ingredients is None:
        error = 1
        header = ""
        junk_string = ""
        instructions_string = ""
        serving = ""
        count = ""
    else:
        p_list = soup.find_all('p')
        serving = get_serving(p_list)
        ingredient_index = p_list.index(ingredients)
        junk_list = get_junk(p_list, ingredient_index)
        instructions_list = get_instructions(p_list, ingredient_index)
        junk_string = list_to_string(junk_list)
        instructions_string = list_to_string(instructions_list)
        header = get_header(soup)
        error = ""
        count = ingredients_count(ingredients)
    return (header, junk_string, ingredients, instructions_string,
            serving, count, error)
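The key heuristic above is that the ingredients paragraph is the one containing <br> tags. A minimal sketch of that check on a made-up mini chunk (the HTML here is hypothetical, not from my data):

```python
from bs4 import BeautifulSoup

html = ("<h1>Title</h1><p>intro</p>"
        "<p>1 cup flour<br>2 eggs</p><p>Mix well.</p>")
soup = BeautifulSoup(html, "html.parser")

# pick the first <p> that contains a <br> -- that's the ingredient list
ingredients = next(p for p in soup.find_all("p") if p.find("br"))
print(ingredients.get_text())  # -> "1 cup flour2 eggs"
```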
It works well except in situations where I have chunks that contain strings like "Sauté", because soup = BeautifulSoup(html_chunk) causes Sauté to turn into SautÃ©. This is a problem because I have a huge CSV file of recipes like the html_chunk above, and I'm trying to structure all of them nicely and then get the output back into a database. I tried checking whether Sauté comes out right using this HTML previewer, and it still comes out as SautÃ©. I don't know what to do about this.
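This is classic mojibake: UTF-8 bytes being decoded as Latin-1. The effect can be reproduced without BeautifulSoup at all:

```python
# UTF-8 encodes "é" as the two-byte sequence 0xC3 0xA9
data = "Sauté".encode("utf-8")

# Decoding those bytes as Latin-1 treats each byte as its own character,
# which produces exactly the "SautÃ©" garbling described above
print(data.decode("latin-1"))  # -> "SautÃ©"
```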
What's stranger is that when I do what BeautifulSoup's documentation shows
BeautifulSoup("SacrÃ© bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

I get

# SacrÃ© bleu!
But my colleague tried that on his Mac, running from terminal, and he got exactly what the documentation shows.
I really appreciate all your help. Thank you.
Answer
BeautifulSoup tries to guess the encoding and sometimes gets it wrong, but you can specify the encoding explicitly with the from_encoding parameter, for example:
soup = BeautifulSoup(html_text, from_encoding="UTF-8")
The encoding is usually available in the headers of the webpage.
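A fuller sketch of the same idea: from_encoding only takes effect when the markup is passed in as raw bytes (if you pass an already-decoded str, the damage has usually been done earlier, e.g. when reading the CSV):

```python
from bs4 import BeautifulSoup

# "Sauté" as raw UTF-8 bytes, as it might arrive from a file or CSV dump
raw = "<p>Sauté</p>".encode("utf-8")

# Without a hint BeautifulSoup may guess the wrong encoding;
# from_encoding pins it down
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
print(soup.p.get_text())  # -> "Sauté"
```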