Python - codec encoding ascii to unicode: error

Question

:) I am trying to reverse the transliteration of an input file (currently in English) back to its original form (in Hindi).

A sample of the input file looks like this:

E-k- b-u-d-z*dhi-m-aan- p-ksii#

E-k- ghn-e- j-ngg-l- m-e-ng E-k- b-h-u-t- UUNNc-aa p-e-dr thaa#
U-s- k-ii p-t-z*t-o-ng s-e- l-d-ii shaakhaay-e-ng m-j-*zb-uut- b-aaj-u-O-ng k-ii t-r-h- pheil-ii h-u-II thiing#
w-n- h-NNs-o-ng k-aa E-k- jhu-nhz*D- I-s- p-e-dr p-r- n-i-w-aas- k-r-t-aa thaa#
w-e- s-b- y-h-aaNN s-u-r-ksi-t- the- AUr- b-dre- AAr-aam- s-e- r-h-t-e- the-#
U-n- m-e-ng s-e- E-k- p-ksii b-h-u-t- b-u-d-z*dhi-m-aan- thaa#
I-s- b-u-d-z*dhi-m-aan- p-ksii n-e- E-k- d-i-n- p-e-dr k-ii j-dr m-e-ng s-e- E-k- l-t-aa k-o- U-g-t-e- d-e-khaa# 
I-s- k-e- b-aar-e- m-e-ng U-s-n-e- d-uus-r-e- p-ksi-y-o-ng s-e- b-aat- k-ii#
"k-z*y-aa t-u-m-z*h-e-ng w-h- l-t-aa d-i-khaaII d-e-t-ii h-ei", U-s- n-e- U-n- s-e- p-uuchaa "t-u-m-z*h-e-ng I-s-e- n-Shz*T- k-r- d-e-n-aa c-aah-i-E-"#
"I-s-e- k-z*y-o-ng n-Shz*T- k-r- d-e-n-aa c-aah-i-E-?" h-NNs-o-ng n-e- AAshz*c-*ry- s-e- p-uuchaa "y-h- t-o- I-t-n-ii cho-T-ii s-e- h-ei#
h-m-e-ng y-h- k-z*y-aa h-aan-i- p-h-u-NNc-aa s-k-t-ii h-ei"#
"m-e-r-e- m-i-tro-ng," b-u-d-z*dhi-m-aan- p-ksii n-e- U-t-z*t-r- d-i-y-aa "w-h- cho-T-ii s-ii l-t-aa j-l-z*d-ii h-ii b-drii h-o- j-aay-e-g-ii#
y-h- h-m-aar-e- p-e-dr p-r- c-Dh*z k-r- U-s- s-e- l-i-p-T-t-ii j-aay-e-g-ii AUr- phi-r- m-o-T-ii AUr- m-j-*zb-uut- h-o- j-aay-e-g-ii"#
"t-o- k-z*y-aa h-u-AA"#

Its equivalent meaning in English is:

A WISE OLD BIRD.

Deep in the forest stood a very tall tree.
Its leafy branches spread out like long arms.
This was the home of a flock of wild geese.
They were safe there.
One of the geese was a wild old bird.
One  day this wise old bird noticed  a small creeper growing at the foot of the tree.
He spoke to the other birds about it.
"Do you see that creeper ?" he said to them.
"You must destroy it."
"Why must we destroy it ?" asked the geese in surprise.
"It is so small.
What harm can it do?"
"My friends," replied the wise old bird, " that little creeper will soon grow.

My script is as follows:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys
CODEC = 'utf-8'
input_file=sys.argv[1]
output_file=sys.argv[2]
list1=[]



f=open(input_file,'r')
f1 = open(output_file,'w')

english_hindi_dict={'A' : u'अ' ,  'AA' : u'आ ' , 'I' : u'इ' , 'II' : u'ई ' , 'U' : u'उ ' ,\
                'UU' : u'ऊ' , 'r' : u'ऋ' , 'E' : u'ए' , 'ai' : u'ऐ' , 'O' : u'ओ' , 'AU' : u'औ' ,\
                'k' : u'क' , 'kh' : u'ख' , 'g' : u'ग' , 'gh' : u'घ' , 'c' : u'च' , 'ch' : u'छ',\
                'j': u'ज' , 'jh' : u'झ' , 'tr' : u'त्र' , 'T' : u'ट'  , 'Th' : u'ठ' , 'D' : u'ड',\
                'dr' : u'ड' , 'Dh' : u'ढ' , 'Na' : u'ण' , 'th' : u'त' ,  'tha' : u'थ',\
                'd' : u'द' , 'dh': u'ध' , 'n' : u'न' , 'p' : u'प' , 'ph' : u'फ' ,\
                'b' : u'ब' , 'bh' : u'भ' , 'm' : u'म' , 'y' : u'य' , 'r' : u'र' , 'l' : u'ल' ,\
                'w' : u'व' , 'sh' : u'श' , 'sha' : u'ष', 's' : u'स' , 'h' : u'ह' , 'ks' : u'क्ष' ,\
                'i' : u'ि' , 'ii' : u'ी' , 'u' : u'ु' , 'uu' : u'ू' , 'e' : u'े' ,\
                'aa' : u'ै' , 'o' : u'ो' , 'AU' : u'ौ' ,'H' : u'्' ,'mn' : u'ं' ,\
                'NN' : u'ँ' , 'AW' : u'ॅ' , 'rr' : u'ृ' , '4' : u'४' , '6': u'६'  , '8' : u'८',\
                '2' : u'२' , '5' : u'५' , '3' : u'३' , '7' : u'७' , '9' : u'९' , '1' : u'१'}
for line in f:
      #line=line.strip() to remove a line from its newline character....  
      #line=line.rstrip('.')   
      line=line.replace('-','')
      line=line.replace('#','|') # i am using the or symbol for poornviram
      #line=line.replace('।','')
      #line = line.lower()
for word in line:
    for ch in word:
        if (ch in english_hindi_dict) :
            translatedToken = english_hindi_dict[ch]
        else :
                translatedToken = ch

#{ translatedToken = english_hindi_dict[ch] }

#for ch in line:
    f1.write(translatedToken)
    #print translatedToken
    #line = line.replace( char,english_hindi_dict[char] )   
      #list1.append(line)
f.close()

f1.write(' '.join(list1))

f1.close()

The error I get is:

python transliterate_eh_nw.py Hstory.txt op1.txt
Traceback (most recent call last):
  File "transliterate_eh_nw.py", line 43, in <module>
    f1.write(translatedToken)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u092f' in position 0: ordinal not in range(128)

Could you please tell me how to deal with this error? Thank you. :)

Recommended answer

You have a few problems other than the one you asked about.

(1) A conceptual problem: "E-k- b-u-d-z*dhi-m-aan- p-ksii#" is not "english". It is Hindi written in ASCII using some romanization scheme. It looks like ITRANS, but ITRANS doesn't have AA and A; it has only aa and a. Does the scheme have a name? Can you supply a URL? Your objective is better described as "transliterate some Hindi text from the unnamed romanization to Devanagari script".

(2) Showing the result of translating your text from Hindi to English ("A WISE OLD BIRD" etc.) is only moderately useful. Showing the expected Devanagari output would be a better idea.

(3) As remarked by @kaiser.se, the transliteration dictionary has multi-character keys (up to 3 characters!), some of which are prefixes of others. Presumably AA must be recognised in priority to A, gh must be recognised before g, etc. Iterating over the items of a dictionary happens in an order that is predictable but for your purposes should be regarded as random. In the code that follows, I've given priority to longer "keys".
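The prefix problem in point (3) can be shown in a few lines. This is only a sketch, written for Python 3 rather than the Python 2 used in the thread; the three-entry `table` is a subset of the question's dictionary with the trailing spaces dropped, and the helper name `transliterate` is made up for illustration.

```python
# Three entries from the question's table where one key is a prefix of another.
table = {'A': 'अ', 'AA': 'आ', 'AU': 'औ'}

def transliterate(text, mapping):
    """Replace keys longest-first so 'AA' is not consumed as two 'A's."""
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    return text

def transliterate_shortest_first(text, mapping):
    """Same replacement loop, but in the wrong order, to show the failure."""
    for key in sorted(mapping, key=len):
        text = text.replace(key, mapping[key])
    return text

good = transliterate('AA-AU-A', table)
bad = transliterate_shortest_first('AA-AU-A', table)
```

With shortest-first ordering every 'A' is replaced before 'AA' or 'AU' can match, so `bad` contains a doubled अ and a stray Latin 'U', while `good` contains one Devanagari letter per group.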

(4) Either the dictionary is missing some letter keys (a S t z) or the transliteration rules are more complicated than any of us has guessed so far.

(5) The meaning of the characters # * and - is not 100% obvious. It appears from your input text that z and * appear only in combination, as z*.

(6) It would be a good idea to explain how to interpret e.g. shaakhaay-e-ng: does it start with sh then aa, or does it start with sha then a? What are the rules?
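To make the ambiguity in point (6) concrete, here is a small greedy tokenizer, sketched in Python 3. It is not the scheme's actual rule (that rule is exactly what is unknown); it only shows what "always take the longest matching key" would do with keys taken from the question's dictionary. The function name `greedy_tokens` is hypothetical.

```python
def greedy_tokens(text, keys):
    """Split text by always taking the longest key that matches at each position."""
    ordered = sorted(keys, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        for key in ordered:
            if text.startswith(key, pos):
                tokens.append(key)
                pos += len(key)
                break
        else:
            # No key matches here: keep the single character unchanged.
            tokens.append(text[pos])
            pos += 1
    return tokens

keys = ['sh', 'sha', 'aa', 'kh']  # all four appear in the question's table
tokens = greedy_tokens('shaakhaa', keys)
# tokens is ['sha', 'a', 'kh', 'aa']
```

Longest-match reads the start as sha followed by a stray a, whereas the reading sh + aa + kh + aa leaves nothing unmapped. That difference is why the rules need to be stated rather than guessed.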

The answer to the problem that you asked about is, of course, as several others have pointed out: you need to encode your unicode output using an encoding that is supported by your display device and can represent Devanagari, e.g. UTF-8. In Python 2, writing a unicode object to a file opened with plain open() implicitly encodes it with the ascii codec, which is what raises the UnicodeEncodeError.
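One way to do the encoding, sketched here under the assumption that the output goes to a file: open the file with `io.open` and an explicit `encoding`, so that unicode text is encoded as UTF-8 on write. `io.open` exists in both Python 2.6+ and Python 3; the temporary path below is made up for the example.

```python
import io
import os
import tempfile

# The character from the traceback (u'\u092f') followed by DEVANAGARI LETTER HA.
text = u'\u092f\u0939'

# Hypothetical output location, for illustration only.
path = os.path.join(tempfile.mkdtemp(), 'devanagari.txt')

# The file object encodes to UTF-8 itself, so no implicit ascii encoding occurs.
with io.open(path, 'w', encoding='utf-8') as out:
    out.write(text)

# Read it back to confirm that the text survives the round trip.
with io.open(path, 'r', encoding='utf-8') as src:
    round_trip = src.read()
```

The alternative, used in the code that follows, is to keep a byte-oriented file and call `.encode('utf8')` on each unicode string before writing it.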

Here is some code:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

input_data = """
E-k- b-u-d-z*dhi-m-aan- p-ksii#

E-k- ghn-e- j-ngg-l- m-e-ng E-k- b-h-u-t- UUNNc-aa p-e-dr thaa#
[snip]
"t-o- k-z*y-aa h-u-AA"#
"""

roman_devanagari_dict={'A' : u'अ' ,  'AA' : u'आ ' , 'I' : u'इ' , 'II' : u'ई ' , 'U' : u'उ ' ,\
[snip]
            '2' : u'२' , '5' : u'५' , '3' : u'३' , '7' : u'७' , '9' : u'९' , '1' : u'१'}

#Presuming we need to do the 3-letter cases then the 2-letter then the 1-letter
replacements = [(-len(k), unicode(k), v) for k, v in roman_devanagari_dict.items()]
replacements.sort()

data = input_data.decode('ascii')

for _junk, from_text, to_text in replacements:
    data = data.replace(from_text, to_text)

# Presuming the '-' are inter-character markers, delete them last, not first
data = data.replace(u'-', '')
data = data.replace(u'#', '')
print "untransliterated:", set(c for c in data if 0x20 < ord(c) < 0x7f)

BOM = u'\ufeff'
outf = open('devanagari.txt', 'w')
outf.write(BOM.encode('utf8')) # for the benefit of clueless Windows s/w
outf.write(data.encode('utf8'))
outf.close()

Output:

एक बुदz*धिमैन पक्षी

एक घने जनगगल मेनग एक बहुt ऊँचै पेड थa उ स की पtztोनग से लदी षaखैयेनग मजzबूt बैजुओनग की tरह फेिली हुई तीनग वन हँसोनग कै एक झुनहzड इस पेड पर निवैस करtै थa वे सब यहैँ सुरक्षिt ते ौर बडे आ रैम से रहtे ते उ न मेनग से एक पक्षी बहुt बुदzधिमैन थa इस बुदzधिमैन पक्षी ने एक दिन पेड की जड मेनग से एक लtै को उ गtे देखै इस के बैरे मेनग उ सने दूसरे पक्षियोनग से बैt की "कzयै tुमzहेनग वह लtै दिखैई देtी हेि", उ स ने उ न से पूछै "tुमzहेनग इसे नSहzट कर देनै चैहिए" "इसे कzयोनग नSहzट कर देनै चैहिए?" हँसोनग ने आ शzरय से पूछै "यह tो इtनी छोटी से हेि हमेनग यह कzयै हैनि पहुँचै सकtी हेि" "मेरे मित्रोनग," बुदzधिमैन पक्षी ने उ tztर दियै "वह छोटी सी लtै जलzदी ही बडी हो जैयेगी यह हमैरे पेड पर चढz कर उ स से लिपटtी जैयेगी ौर फिर मोटी ौर मजzबूt हो जैयेगी" "tो कzयै हुआ "

which has only a few recognisable words when shoved through Google Translate.

Update:

  • Three of the entries (AA, II, and U) have a space after the Devanagari equivalent. Perhaps the spaces should be removed.

The general pattern for consonants appears to be:

DEVANAGARI LETTER XA is represented by x
DEVANAGARI LETTER XXA is represented by X
DEVANAGARI LETTER XHA is represented by xh
DEVANAGARI LETTER XXHA is represented by Xh

However, 3 entries break the pattern:
SSA -> sha but pattern says S
TA -> th but pattern says t
THA -> tha but pattern says th

Note: changing the above 3 entries stopped my code from complaining that S and t were left unchanged when transliterating your sample text, and removed the seemingly anomalous sha and tha entries.
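The change described in the note could look like the following Python 3 sketch. The `corrections` mapping is an assumption derived from the x / X / xh / Xh pattern above, not a confirmed part of the romanization scheme, and `table` holds only the three entries in question.

```python
# The three entries from the question's table that break the pattern.
table = {
    'sha': '\u0937',  # DEVANAGARI LETTER SSA
    'th': '\u0924',   # DEVANAGARI LETTER TA
    'tha': '\u0925',  # DEVANAGARI LETTER THA
}

# Assumed replacements that follow the pattern (an inference, not a known rule).
corrections = {
    'S': '\u0937',   # SSA
    't': '\u0924',   # TA
    'th': '\u0925',  # THA
}

# Drop the anomalous three-letter keys, then apply the corrected entries.
patched = {key: value for key, value in table.items()
           if key not in ('sha', 'tha')}
patched.update(corrections)
```

After the patch, `th` maps to THA instead of TA, and the keys `S` and `t` exist, which is what stops them from being reported as untransliterated.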

  • Two entries (D and dr) are mapped to the same character, DEVANAGARI LETTER DDA. D is the expected entry for that character; perhaps dr should be mapped elsewhere.

There is no entry for DEVANAGARI LETTER NGA (U+0919); perhaps it should be encoded as ng, since there are a few words ending in ng in the sample text.

Are the uncatered-for "z*" occurrences in the sample text anything to do with DEVANAGARI LETTER ZA (U+095B)?
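Two of the observations in this update (values with a trailing space, and two keys sharing one Devanagari letter) can be checked mechanically. The following is a Python 3 sketch on a four-entry subset of the question's table; the variable names are made up for illustration.

```python
from collections import defaultdict

# Subset of the question's table, copied with its original quirks.
table = {
    'AA': '\u0906 ',  # LETTER AA followed by a space, as in the question
    'D': '\u0921',    # DEVANAGARI LETTER DDA
    'dr': '\u0921',   # also DDA
    'k': '\u0915',    # DEVANAGARI LETTER KA
}

# Keys whose value carries leading or trailing whitespace.
padded = sorted(key for key, value in table.items() if value != value.strip())

# Group keys by (stripped) value to find letters reachable from several keys.
keys_by_value = defaultdict(list)
for key, value in table.items():
    keys_by_value[value.strip()].append(key)
shared = {value: sorted(keys) for value, keys in keys_by_value.items()
          if len(keys) > 1}
```

Here `padded` lists AA and `shared` reports that D and dr both map to DDA. A check like this cannot tell which of the two keys is wrong; that still needs the definition of the scheme.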
