HTML from a webpage does not display foreign language characters correctly
Problem description
Apologies if the title is misleading.
I am trying to find out the language of a given song by querying a lyric site and then using CLD2 to check the language of the lyrics. However, with certain songs (such as the example given below) the foreign language characters aren't being encoded properly, which means that CLD2 throws up this error: input contains invalid UTF-8 around byte 2121 (of 32761)
import requests
import re
from bs4 import BeautifulSoup
import cld2


def checklang(lyrics):
    # cld2.detect() is where the "invalid UTF-8" error is raised
    isReliable, textBytesFound, details = cld2.detect(lyrics)
    language = re.search("ENGLISH", str(details))
    if language is None:
        print("foreign lang")
    if len(re.findall("Unknown", str(details))) < 2:
        print("foreign lang")
    if language is not None:
        print("english")


response = requests.get("https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html")
soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter += 1
    if counter == 21:  # the 21st <div> is expected to hold the lyrics
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break
It is also worth mentioning that this is not limited to non-Latin characters; it sometimes occurs with apostrophes or other punctuation as well.
Can anyone shed some light on why this is happening or what I could do to work around it?
Recommended answer
Requests makes an educated guess about the encoding of the response based on the HTTP headers. When the Content-Type header declares a text type but no charset, it falls back to ISO-8859-1.
Unfortunately, in the given example, response.encoding shows ISO-8859-1 even though response.content contains <meta charset="utf-8">.
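The effect of that mismatch can be reproduced without any network access. The following is a minimal sketch, not the site's actual data: the sample strings are assumptions chosen for illustration. It shows that UTF-8 bytes decoded as ISO-8859-1 turn into mojibake, that a curly apostrophe is affected in the same way as Korean text, and that the damage is reversible because ISO-8859-1 maps every byte to exactly one code point.

```python
# Sample text (assumed for illustration): Korean syllables and a curly apostrophe.
original = "뚜두뚜두"
apostrophe = "\u2019"  # RIGHT SINGLE QUOTATION MARK, common in lyrics

raw = original.encode("utf-8")        # the bytes a UTF-8 page actually contains

wrong = raw.decode("iso-8859-1")      # what response.text holds with the wrong encoding
right = raw.decode("utf-8")           # what response.text holds with encoding = 'utf-8'

# Each 3-byte Korean syllable becomes three unrelated Latin-1 characters.
print(repr(wrong))
print(right)

# Punctuation is hit too: the apostrophe's bytes E2 80 99 become "â" plus two
# C1 control characters, which a language detector may well reject.
broken_apostrophe = apostrophe.encode("utf-8").decode("iso-8859-1")
print(repr(broken_apostrophe))

# ISO-8859-1 is a one-byte-per-character codec, so the round trip is lossless.
recovered = wrong.encode("iso-8859-1").decode("utf-8")
```

This also suggests a workaround for text that has already been mis-decoded: re-encode it as ISO-8859-1 and decode it as UTF-8, as in the last line.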
Here's my solution, based on the Response Content section of the requests documentation.
import requests
from bs4 import BeautifulSoup
import pycld2 as cld2  # used here in place of the cld2 package


def checklang(lyrics):
    isReliable, textBytesFound, details = cld2.detect(lyrics)
    # Print each detected language as (name, code, percent, score)
    for detail in details:
        print(detail)


response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')
print(response.encoding)     # the encoding Requests guessed from the headers
response.encoding = 'utf-8'  ### key change ###
soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter += 1
    if counter == 21:
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break
Output:
ISO-8859-1
('ENGLISH', 'en', 74, 833.0)
('Korean', 'ko', 20, 3575.0)
('Unknown', 'un', 0, 0.0)
Lyrics found!
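Hard-coding 'utf-8' works for this site, but other pages may declare a different charset. One possible generalization, sketched below, is to read the charset from the page's own <meta> tag and fall back to UTF-8 when none is declared. The names sniff_meta_charset and decode_html are hypothetical helpers written for this example; they are not part of requests or BeautifulSoup, and the sample bytes are made up.

```python
import codecs
import re

# Matches both <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: return the charset declared in a <meta> tag, else default."""
    match = _META_CHARSET.search(raw[:4096])  # declarations sit near the top of the page
    if not match:
        return default
    name = match.group(1).decode("ascii").lower()
    try:
        codecs.lookup(name)  # make sure Python knows this codec
    except LookupError:
        return default
    return name


def decode_html(raw: bytes) -> str:
    """Hypothetical helper: decode page bytes using the declared charset."""
    return raw.decode(sniff_meta_charset(raw), errors="replace")


# Made-up sample standing in for response.content.
sample = '<html><head><meta charset="utf-8"></head><body>뚜두뚜두</body></html>'.encode("utf-8")
print(sniff_meta_charset(sample))
print(decode_html(sample))
```

In the script above, this would replace the hard-coded assignment with response.encoding = sniff_meta_charset(response.content). Requests also exposes response.apparent_encoding, which guesses the encoding from the body, and BeautifulSoup can be given response.content (bytes) instead of response.text so that it does its own detection. Either may be simpler than a custom helper.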