Converting a latin string to unicode in python

Views: 100 Published: 2021/7/16 21:51:30 python unicode scrapy latin

Problem description

I am working with Scrapy. I scraped some sites and stored the items from the scraped pages in JSON files, but some of the strings come out in the following format.

l = ["Holding it Together",
     "Fowler RV Trip",
     "S\u00e9n\u00e9gal - Mali - Niger","H\u00eatres et \u00e9tang",
     "Coll\u00e8ge marsan","N\u00b0one",
     "Lines through the days 1 (Arabic) \u0633\u0637\u0648\u0631 \u0639\u0628\u0631 \u0627\u0644\u0623\u064a\u0627\u0645 1",
     "\u00cdndia, Tail\u00e2ndia &amp; Cingapura"]

I expect the list to contain strings in different formats, but I want to convert them and store the strings in the list with their original names, like below:

l = ["Holding it Together",
     "Fowler RV Trip",
     "Lines through the days 1 (Arabic) سطور عبر الأيام 1 | شمس الدين خ | Blogs",
     "Índia, Tailândia & Cingapura "]

Thanks in advance.

Solution

You have byte strings containing Unicode escapes. In Python 2, you can convert them to unicode objects with the unicode_escape codec:

>>> print "H\u00eatres et \u00e9tang".decode("unicode_escape")
Hêtres et étang
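
In Python 3, str objects have no decode method, so the same idea needs one extra step. A minimal sketch, assuming the scraped value is an ASCII-only str that holds literal backslash-u sequences (the variable names here are illustrative, not from the question):

```python
# The doubled backslashes make this a str containing the literal
# characters \u00ea and \u00e9, which is how the scraped text looks.
raw = "H\\u00eatres et \\u00e9tang"

# Go through ASCII bytes first, then let the unicode_escape codec
# turn each escape sequence into the character it names.
decoded = raw.encode("ascii").decode("unicode_escape")

print(decoded)
```

Note that encode("ascii") raises UnicodeEncodeError if the string already contains non-ASCII characters, and that unicode_escape also interprets other backslash sequences such as \n.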

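And you can encode it back to a byte string. Note that latin1 only covers the accented Western European characters here; the Arabic entry would need an encoding such as utf-8: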

>>> s = "H\u00eatres et \u00e9tang".decode("unicode_escape")
>>> s.encode("latin1")
'H\xeatres et \xe9tang'
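
To show why the choice of encoding matters for this list, here is a small Python 3 sketch. The sample strings are taken from the question; the variable names are only for illustration:

```python
text = "Hêtres et étang"

# Latin-1 uses one byte per character for these accented letters.
latin1_bytes = text.encode("latin1")

# UTF-8 uses two bytes for each of them.
utf8_bytes = text.encode("utf-8")

# Arabic characters are outside Latin-1, so encoding them fails there.
arabic = "\u0633\u0637\u0648\u0631"
try:
    arabic.encode("latin1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent every string in the list.
arabic_utf8 = arabic.encode("utf-8")
```

For a list that mixes French, Portuguese and Arabic titles, utf-8 is probably the safer choice when writing the results back out.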

You can filter the list and decode only the strings that are not already unicode, like this:

for s in l: 
    if not isinstance(s, unicode): 
        print s.decode('unicode_escape')
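
For Python 3, where every str is already Unicode, the isinstance check no longer separates the two cases. One possible approach, sketched here under assumptions, is to replace only the literal \uXXXX sequences with a regular expression so that already-decoded text passes through unchanged. The helper name decode_escapes and the titles list are hypothetical, and the html.unescape step is an addition that is not part of the answer above; it is there because the desired output in the question shows "&" rather than "&amp;".

```python
import html
import re

# Matches a literal backslash, the letter u, and four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_escapes(s):
    # Hypothetical helper: swap each literal \uXXXX sequence for the
    # character it names and leave everything else untouched.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


titles = [
    "Holding it Together",
    "S\\u00e9n\\u00e9gal - Mali - Niger",
    "\\u00cdndia, Tail\\u00e2ndia &amp; Cingapura",
    "سطور عبر الأيام",  # already decoded, should pass through unchanged
]

# Decode the escapes first, then resolve HTML entities such as &amp;
cleaned = [html.unescape(decode_escapes(t)) for t in titles]

for title in cleaned:
    print(title)
```

This sketch does not combine surrogate pairs, so characters outside the Basic Multilingual Plane would need extra handling. If the strings are read from the JSON file with json.load, the parser already turns \uXXXX escapes into characters, and a step like this is only needed when the escapes are present as literal text.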
