Convert in utf16

Views: 26. Published: 2021/9/15 19:40:08. Tags: python, html, utf-8, python-unicode


Problem description

I am crawling several websites and extracting the product names. Some of the names contain errors like these:

Malecon 12 Jahre 0,05 ltr.<br>Reserva Superior
Bols Watermelon Lik\u00f6r 0,7l
Hayman\u00b4s Sloe Gin
Ron Zacapa Edici\u00f3n Negra
Havana Club A\u00f1ejo Especial
Caol Ila 13 Jahre (G&amp;M Discovery)

How can I fix that? I am using xpath and re.search to get the names.

Every Python file starts with this line: # -*- coding: utf-8 -*-

Edit:

This is the source code that shows how I get the information.

import re

if '"articleName":' in details:
    # Keep only the text between "articleName": and "imageTitle.
    closer_to_product = details.split('"articleName":', 1)[1]
    closer_to_product_2 = closer_to_product.split('"imageTitle', 1)[0]
    if debug_product == 1:
        print('product before try:' + repr(closer_to_product_2))
    try:
        # Capture the quoted value that is followed by a comma.
        found_product = re.search(r'"(.*?)",', closer_to_product_2).group(1)
    except AttributeError:
        # re.search returned None, so no quoted value was found.
        found_product = ''
    if debug_product == 1:
        print('cleared product: ', '>>>' + repr(found_product) + '<<<')
    if not found_product:
        print(product_detail_page, found_product)
        items['products'] = 'default'
    else:
        items['products'] = found_product
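
To see what the extraction step hands back, here is a minimal sketch run on a made-up page fragment. The sample text and the helper name extract_article_name are assumptions for illustration and do not come from the crawled sites. It suggests that the escapes arrive as literal backslash sequences, which would explain why they show up unchanged in the product names.

```python
import re

def extract_article_name(details):
    """Return the quoted value that follows "articleName":, or '' if absent."""
    if '"articleName":' not in details:
        return ''
    tail = details.split('"articleName":', 1)[1]
    tail = tail.split('"imageTitle', 1)[0]
    match = re.search(r'"(.*?)",', tail)
    return match.group(1) if match else ''

# Hypothetical fragment; the backslash is doubled so the string holds a literal \u00f6.
sample = '{"articleName":"Bols Watermelon Lik\\u00f6r 0,7l","imageTitle":"x"}'
name = extract_article_name(sample)
print(repr(name))
```

The repr shows a doubled backslash, meaning the name contains the six characters \u00f6 rather than the single character ö.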

Details

product_details = information.xpath('/*').extract()
product_details = [details.strip() for details in product_details]

Solution

Where is the problem (Python 3.8.3)?

import html

strings = [
    'Bols Watermelon Lik\u00f6r 0,7l',
    'Hayman\u00b4s Sloe Gin',
    'Ron Zacapa Edici\u00f3n Negra',
    'Havana Club A\u00f1ejo Especial',
    'Caol Ila 13 Jahre (G&amp;M Discovery)',
    'Old Pulteney \\u00b7 12 Years \\u00b7 40% vol',
    'Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L',
]

for s in strings:
    # Resolve HTML entities first, then turn literal \uXXXX sequences into characters.
    print(html.unescape(s)
          .encode('raw_unicode_escape')
          .decode('unicode_escape'))

Bols Watermelon Likör 0,7l
Hayman´s Sloe Gin
Ron Zacapa Edición Negra
Havana Club Añejo Especial
Caol Ila 13 Jahre (G&M Discovery)
Old Pulteney · 12 Years · 40% vol
Killepitsch Kräuterlikör 42% 0,7 L
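
The chain does its work in three steps. The sketch below prints each intermediate value for a single name; the sample string is an assumption that combines an HTML entity with literal \u escapes.

```python
import html

# Literal backslash-u sequences plus an HTML entity (made-up sample).
raw = 'Kr\\u00e4uterlik\\u00f6r (G&amp;M)'

step1 = html.unescape(raw)                  # &amp; becomes &
step2 = step1.encode('raw_unicode_escape')  # str -> bytes, backslashes are left as they are
step3 = step2.decode('unicode_escape')      # bytes -> str, \uXXXX becomes the character

print(step1)
print(step2)
print(step3)
```

Note that unicode_escape interprets every backslash sequence it recognises, so a name that contains a backslash for some other reason would be altered as well.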

Edit: Use .encode('raw_unicode_escape').decode('unicode_escape') to handle doubled reverse solidi (backslashes); see Python Specific Encodings in the codecs documentation.
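
Because the names are cut out of an "articleName":"..." fragment, the text looks like JSON, and JSON string escapes can also be decoded with json.loads. This is an alternative to the codec chain, not what the answer uses, and it assumes the extracted value is the body of a valid JSON string. The helper name decode_json_string is hypothetical.

```python
import html
import json

def decode_json_string(body):
    """Decode the body of a JSON string literal given without its surrounding quotes."""
    # Put the quotes back so json.loads sees a complete JSON string,
    # then resolve any HTML entities in the result.
    return html.unescape(json.loads('"' + body + '"'))

decoded = decode_json_string('Edici\\u00f3n Negra (G&amp;M)')
print(decoded)
```

If the body is not valid JSON, for example because it contains an unescaped double quote, json.loads raises json.JSONDecodeError, so callers may want to catch that and fall back to the raw text.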
