Convert Unicode Escape to Hebrew text


Problem description

I have the following text in a JSON file:

"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"

which represents the text "אחוזת פולג" in Hebrew.

No matter which encoding/decoding I use, I can't seem to get it right with Python 3.

If, for example, I try:

text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')

print(text)

I get that text is:

b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'

which, as bytes, is almost the correct text. If I were able to remove just one backslash from each pair and turn

b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'

into

text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'

(note how I changed each double backslash to a single backslash), then

text.decode('utf-8')

would yield the correct text in Hebrew.

But I am struggling to do so and couldn't manage to write a piece of code that does this for me (rather than manually, as I just showed...).

Any help is much appreciated...

Solution

This string does not "represent" Hebrew text (at least not as Unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs (U+00D7), currency signs (U+00A4), and some weird control characters.
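To see this concretely, one can list the distinct characters the escapes decode to. The snippet below is only an illustrative sketch; the names `broken` and `describe` are made up for this example and do not come from the question.

```python
import unicodedata

# The string as Python sees it once the \u escapes have been interpreted.
broken = ("\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa "
          "\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092")

def describe(ch):
    # Control characters have no Unicode name, so fall back to the category.
    fallback = "<unnamed, category %s>" % unicodedata.category(ch)
    return unicodedata.name(ch, fallback)

# Print every distinct character with its code point and name.
for ch in sorted(set(broken)):
    print("U+%04X" % ord(ch), describe(ch))
```

Every code point is below U+0100, and none of them lies in the Hebrew block (U+0590 to U+05FF).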

It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.
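One plausible way such data could arise is sketched below. This is an assumption about what happened, not something the question confirms: the Hebrew text is encoded as UTF-8, the bytes are misread as Latin-1, and the result is written by a JSON serializer that escapes non-ASCII characters.

```python
import json

# The intended text "אחוזת פולג", written with escapes to avoid editor issues.
original = "\u05d0\u05d7\u05d5\u05d6\u05ea \u05e4\u05d5\u05dc\u05d2"

# Step 1: the text is encoded as UTF-8 (two bytes per Hebrew letter).
utf8_bytes = original.encode("utf-8")

# Step 2: the bytes are wrongly decoded as Latin-1, one character per byte.
misread = utf8_bytes.decode("latin-1")

# Step 3: json.dumps escapes every non-ASCII character as \u00XX by default.
print(json.dumps(misread))
```

With these assumptions the printed line is the same escaped string as in the question.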

Assuming that this is what literally is saved in your JSON file:

"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"

you can recover the Hebrew text as follows:

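(jsonInput
  .encode('latin-1')             # ASCII text from the file -> bytes
  .decode('raw_unicode_escape')  # interpret each \u00XX escape -> character U+00XX
  .encode('latin-1')             # each character U+00XX -> the single byte 0xXX
  .decode('utf-8')               # read those bytes as UTF-8
)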

For the above example, it gives:

'אחוזת פולג'
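For reference, here is the same chain split into named steps so each intermediate value can be inspected. It is a sketch under the assumption that the input is the raw file text, backslashes included; `json_input` and the `step` variables are names chosen for this example.

```python
# Raw string: the backslashes are literal, as they would be in the file on disk.
json_input = (r'"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa '
              r'\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"')

step1 = json_input.encode("latin-1")        # ASCII text -> bytes
step2 = step1.decode("raw_unicode_escape")  # \u00XX escapes -> characters U+00XX
step3 = step2.encode("latin-1")             # characters U+00XX -> bytes 0xXX
recovered = step3.decode("utf-8")           # bytes interpreted as UTF-8

print(recovered)
```

Because `json_input` here still contains the surrounding double quotes, `recovered` contains them too; strip them if only the text is wanted.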

If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer already interprets the escape sequences for you. That is, after the text element has been loaded by the JSON deserializer, it should be sufficient to just encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of Unicode, whereas your strangely broken text encodes each byte of the UTF-8 encoding as an ASCII escape of a UTF-16 code unit.
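A minimal sketch of the deserializer route, using Python's standard `json` module. The key `"name"` and the surrounding object are hypothetical; the question does not show the structure of the actual file.

```python
import json

# Hypothetical file content; the "name" key is an assumption for illustration.
raw = (r'{"name": "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa '
       r'\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"}')

data = json.loads(raw)  # the parser already turns \u00XX into characters

# Only the latin-1 / utf-8 round trip is still needed.
fixed = data["name"].encode("latin-1").decode("utf-8")
print(fixed)
```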

I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time; it might be that the latin-1 step no longer works properly. Please don't apply this transformation to your whole JSON file unless the JSON itself contains only ASCII; otherwise it would only make everything worse.
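One possible way to handle mixed data is to repair the parsed values string by string and leave alone any string that does not survive the round trip. This is a heuristic sketch, not a guaranteed fix: the function `repair` is hypothetical, and a legitimate Latin-1 string that happens to form valid UTF-8 would still be altered.

```python
def repair(value):
    """Recursively undo the UTF-8-read-as-Latin-1 damage where possible."""
    if isinstance(value, str):
        try:
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not representable in Latin-1, or not valid UTF-8: keep as is.
            return value
    if isinstance(value, list):
        return [repair(item) for item in value]
    if isinstance(value, dict):
        return {repair(key): repair(item) for key, item in value.items()}
    return value  # numbers, booleans, None


sample = {
    "broken": "\u00d7\u0090\u00d7\u0097",  # damaged text
    "valid": "\u05e9\u05dc\u05d5\u05dd",   # already correct Hebrew
    "ascii": "abc",
}
print(repair(sample))
```

Already-correct Hebrew fails the `latin-1` encoding step and is returned unchanged, while pure ASCII strings round-trip to themselves.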
