HTML from a webpage does not display foreign language characters correctly


Problem description

Apologies if the title is misleading.

I am trying to find out the language of a given song by querying a lyric site and then using CLD2 to check the language of the lyrics. However, with certain songs (such as the example given below) the foreign language characters aren't being encoded properly, which means that CLD2 throws up this error: input contains invalid UTF-8 around byte 2121 (of 32761)

import requests
import re
from bs4 import BeautifulSoup
import cld2

def checklang(lyrics):
    try:
        isReliable, textBytesFound, details = cld2.detect(lyrics)
        language = re.search("ENGLISH", str(details))

        if language is None:
            print("foreign lang")

        if len(re.findall("Unknown", str(details))) < 2:
            print("foreign lang")

        if language is not None:
            print("english")
    except Exception as exc:
        # For certain songs this raises:
        # "input contains invalid UTF-8 around byte 2121 (of 32761)"
        print(exc)

response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')

soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter += 1
    if counter == 21:
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break

It is also worth mentioning that this is not limited to non-latin characters and sometimes occurs with apostrophes or other punctuation.

Can anyone shed some light on why this is happening or what I could do to work around it?

Answer

Requests should make educated guesses about the encoding of the response based on the HTTP headers.

Unfortunately, in the given example response.encoding shows ISO-8859-1 even though response.content contains <meta charset="utf-8">.
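The mismatch is easy to reproduce without the network: any UTF-8 multi-byte sequence (Hangul, but also a typographic apostrophe, which would explain the punctuation cases mentioned in the question) turns into mojibake when the bytes are decoded as ISO-8859-1. A minimal sketch in plain Python:

```python
for ch in ("뚜", "\u2019"):            # a Hangul syllable and a curly apostrophe
    raw = ch.encode("utf-8")           # the bytes the server actually sends
    wrong = raw.decode("iso-8859-1")   # requests' fallback codec
    right = raw.decode("utf-8")        # the codec <meta charset="utf-8"> declares
    print(repr(wrong), "->", right)
```

Because ISO-8859-1 maps every byte to a character, the wrong decode never raises; it just yields several junk characters per original character, which is what later trips up cld2.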

Here's my solution, based on the Response Content section of the requests documentation.

import requests
import re
from bs4 import BeautifulSoup
# import cld2
import pycld2 as cld2

def checklang(lyrics):
    # try:
    isReliable, textBytesFound, details = cld2.detect(lyrics)
    # language = re.search("ENGLISH", str(details))
    for detail in details:
        print(detail)

response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')

print(response.encoding)
response.encoding = 'utf-8'                         ### key change ###

soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter+=1
    if counter == 21:
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break

Output of SO65630066.py:

ISO-8859-1
('ENGLISH', 'en', 74, 833.0)
('Korean', 'ko', 20, 3575.0)
('Unknown', 'un', 0, 0.0)
Lyrics found!
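Hard-coding 'utf-8' works for this site, but it would silently mis-decode a page that declares something else. Two less brittle options are requests' own response.apparent_encoding (a guess based on the body bytes) or reading the charset the page itself declares. The helper below is a hypothetical sketch of the latter; it only handles the modern <meta charset="..."> form, not the older http-equiv variant:

```python
import re

def encoding_from_meta(content: bytes, default: str = "utf-8") -> str:
    # Look for <meta charset="..."> in the raw bytes, before any decoding.
    match = re.search(rb'<meta\s+charset=["\']?([\w-]+)', content, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default

html = b'<html><head><meta charset="UTF-8"></head><body>...</body></html>'
print(encoding_from_meta(html))   # utf-8
```

You would then set response.encoding = encoding_from_meta(response.content) before touching response.text.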
