Wrong accented characters using Beautiful Soup in Python on a local HTML file


Problem description


I'm quite familiar with Beautiful Soup in Python; I have always used it to scrape live sites.

Now I'm scraping a local HTML file (link, in case you want to test the code). The only problem is that accented characters are not displayed correctly (this never happened to me when scraping live sites).

This is a simplified version of the code:

import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('AH.html'), "html.parser")
tables = soup.find_all('table')
titles = tables[0].find_all('tr')
print(titles[55].text)

which prints the following output

2:22 - Il Destino Ãˆ GiÃ  Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]

while the correct output should be

2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]


I looked for a solution, read many questions and answers, and found this answer, which I implemented as follows:

import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
import codecs

response = open('AH.html')
content = response.read()
html = codecs.decode(content, 'utf-8')
soup = BeautifulSoup(html, "html.parser")

However, it raises the following error:

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\score.py", line 8, in <module>
    html = codecs.decode(content, 'utf-8')
TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'str')

I guess the problem is easy to solve, but how?

Solution

from bs4 import BeautifulSoup


with open("AH.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)

I have to say that your first snippet is actually fine and should work.

As for the second snippet, you are trying to decode a str, which cannot work: decoding applies to bytes objects, and open() in text mode has already decoded the file contents into a str.
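For illustration, the second snippet can be made to work by opening the file in binary mode, so that read() returns bytes that can then be decoded. This is only a minimal sketch, assuming the file is stored as UTF-8; the temporary file `sample.html` is a hypothetical stand-in for AH.html so that the example is self-contained:

```python
import os
import tempfile

# Hypothetical stand-in for AH.html: a small file stored as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "wb") as f:
    f.write("<td>Il Destino È Già Scritto</td>".encode("utf-8"))

# Binary mode ("rb") makes read() return bytes instead of str.
with open(path, "rb") as f:
    content = f.read()

# bytes -> str; codecs.decode(content, "utf-8") would do the same here.
html = content.decode("utf-8")

print(type(content).__name__, "->", type(html).__name__)
```

The resulting `html` string can be passed to BeautifulSoup exactly as in the question's second snippet.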

I believe you are on Windows, where the default encoding is cp1252, not UTF-8.
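To see why a cp1252 default would garble the accents, here is a small in-memory sketch. It assumes the file on disk is UTF-8 and that it is being decoded as cp1252; the sample string is taken from the expected output in the question:

```python
title = "Il Destino È Già Scritto"

# Encode as UTF-8, which is how the file is assumed to be stored on disk.
raw = title.encode("utf-8")

# Decode the same bytes as cp1252, as a text-mode open() would do
# on a system whose default encoding is cp1252.
garbled = raw.decode("cp1252")

# ascii() escapes non-ASCII characters, so the difference stays visible
# even on a console that cannot display them.
print(ascii(title))
print(ascii(garbled))

# Decoding the bytes as UTF-8 restores the original text.
restored = raw.decode("utf-8")
```

Each accented letter takes two bytes in UTF-8, and cp1252 turns every one of those bytes into a separate character, which is why a single È becomes a two-character sequence starting with Ã.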

Could you please run the following code:

import sys

print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)

Then check whether the output is UTF-8 or cp1252.
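One caveat: in Python 3, sys.getdefaultencoding() reports utf-8 regardless of platform, so it does not reveal which encoding open() will use. When no encoding argument is given, open() in text mode normally falls back to the locale's preferred encoding. A minimal sketch for checking that value directly (the variable name is just illustrative):

```python
import locale
import sys

# Encoding that open() falls back to in text mode when encoding= is omitted.
default_file_encoding = locale.getpreferredencoding(False)

print("open() default:", default_file_encoding)
print("stdout:", sys.stdout.encoding)
```

If the first line shows cp1252, a UTF-8 file opened without an explicit encoding will be decoded incorrectly.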

Note that if you are using VSCode with Code-Runner, you should run your code in the terminal as py code.py

SOLUTIONS (from the chat)

(1) If you are on Windows 10:

  • Open Control Panel and change view by Small icons
  • Click Region
  • Click the Administrative tab
  • Click on Change system locale...
  • Tick the box "Beta: Use Unicode UTF-8..."
  • Click OK and restart your PC

(2) If you are not on Windows 10, or just don't want to change that setting, then in the first snippet change open("AH.html") to open("AH.html", encoding="UTF-8"), that is, write:

from bs4 import BeautifulSoup

with open("AH.html", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)
