HTML from a webpage does not display foreign language characters correctly


Problem description

Apologies if the title is misleading.

I am trying to find out the language of a given song by querying a lyric site and then using CLD2 to check the language of the lyrics. However, with certain songs (such as the example given below) the foreign language characters aren't being encoded properly, which means that CLD2 throws up this error: input contains invalid UTF-8 around byte 2121 (of 32761)

import requests
import re
from bs4 import BeautifulSoup
import cld2

def checklang(lyrics):
    try:
        isReliable, textBytesFound, details = cld2.detect(lyrics)
        language = re.search("ENGLISH", str(details))

        if language == None:
            print("foreign lang")

        if len(re.findall("Unknown", str(details))) < 2:
            print("foreign lang")

        if language != None:
            print("english")
    except Exception as e:
        print(e)

response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')

soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
# The 21st <div> on the page holds the lyrics
for item in soup.select("div"):
    counter += 1
    if counter == 21:
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break

It is also worth mentioning that this is not limited to non-Latin characters; it sometimes occurs with apostrophes or other punctuation.
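For illustration (a minimal sketch, not from the original post): the classic way an apostrophe gets mangled is when the UTF-8 bytes for a curly quote are decoded with a single-byte codec:

# Minimal illustration (not part of the original question): a curly apostrophe
# (U+2019) becomes three UTF-8 bytes, and a wrong decode garbles it.
text = "don’t"                       # contains U+2019 RIGHT SINGLE QUOTATION MARK
raw = text.encode("utf-8")           # b'don\xe2\x80\x99t'
print(raw.decode("iso-8859-1"))      # 'don' + 'â' + two control characters + 't'
print(raw.decode("cp1252"))          # 'donâ€™t' - the familiar mojibake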

Can anyone shed some light on why this is happening or what I could do to work around it?

Answer

Requests should make educated guesses about the encoding of the response based on the HTTP headers.

Unfortunately, in the given example, response.encoding reports ISO-8859-1 even though response.content contains <meta charset="utf-8">.
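To see the mismatch for yourself, you can compare the header-derived encoding with requests' body-based guess (a small sketch added here for illustration, not part of the original answer):

import requests

response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')

# Encoding taken from the Content-Type header (requests falls back to
# ISO-8859-1 for text/* responses without an explicit charset).
print(response.headers.get('Content-Type'))
print(response.encoding)               # ISO-8859-1

# Encoding guessed from the response body by charset detection.
print(response.apparent_encoding)      # utf-8 (or similar)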

Here's my solution, based on the Response Content section of the requests documentation.

import requests
import re
from bs4 import BeautifulSoup
# import cld2
import pycld2 as cld2

def checklang(lyrics):
    isReliable, textBytesFound, details = cld2.detect(lyrics)
    # language = re.search("ENGLISH", str(details))
    for detail in details:
        print(detail)

response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')

print(response.encoding)
response.encoding = 'utf-8'                         ### key change ###

soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter += 1
    if counter == 21:
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break

Output (\SO\65630066.py):

ISO-8859-1
('ENGLISH', 'en', 74, 833.0)
('Korean', 'ko', 20, 3575.0)
('Unknown', 'un', 0, 0.0)
Lyrics found!
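As a follow-up (not part of the original answer, just a common variation): rather than hard-coding 'utf-8', you can let requests' own charset detection or BeautifulSoup's handling of the raw bytes choose the encoding:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')

# Option 1: trust requests' body-based guess instead of the header default.
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'html.parser')

# Option 2: hand BeautifulSoup the raw bytes and let it read <meta charset="utf-8">.
soup = BeautifulSoup(response.content, 'html.parser')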
