python encoding utf-8

Problem description

I am writing some scripts in Python. I build a string that I save to a file. This string holds a lot of data, taken from the directory tree and filenames of a directory. According to convmv, my whole directory tree is in UTF-8.

I want to keep everything in UTF-8 because I will save it in MySQL afterwards. For now, in MySQL, which is in UTF-8, I have problems with some characters (like é or è; I am French).

I want Python to always treat strings as UTF-8. I read some information on the internet and did it like this.

My script begins with this:

 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 def createIndex():
     import codecs
     toUtf8=codecs.getencoder('UTF8')
     #lot of operations & building indexSTR the string who matter
     findex=open('config/index/music_vibration_'+date+'.index','a')
     findex.write(codecs.BOM_UTF8)
     findex.write(toUtf8(indexSTR)) #this bugs!

When I run it, I get this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2171: ordinal not in range(128)

Edit: I see that in my file the accents are written correctly. After creating this file, I read it and write it into MySQL. But I still have an encoding problem and I don't understand why. My MySQL database is in utf8, or seems to be: the SQL query SHOW variables LIKE 'char%' returns only utf8 or binary.

My function looks like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

def saveIndex(index,date):
    import MySQLdb as mdb
    import codecs

    sql = mdb.connect('localhost','admin','*******','music_vibration')
    sql.charset="utf8"
    findex=open('config/index/'+index,'r')
    lines=findex.readlines()
    for line in lines:
        if line.find('#artiste') != -1:
            artiste=line.split('[:::]')
            artiste=artiste[1].replace('\n','')

            c=sql.cursor()
            c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom="'+artiste+'"')
            nbr=c.fetchone()
            if nbr[0]==0:
                c=sql.cursor()
                iArt+=1
                c.execute('INSERT INTO artistes(nom,status,path) VALUES("'+artiste+'",99,"'+artiste+'/")'.encode('utf8')

The artist names, which display correctly in the file, are written incorrectly to the database. What is the problem?

Solution

You don't need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode before it can encode it back to UTF-8. That is what is failing here:

>>> data = u'\u00c3'            # Unicode data
>>> data = data.encode('utf8')  # encoded to UTF-8
>>> data
'\xc3\x83'
>>> data.encode('utf8')         # Try to *re*-encode it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Just write your data directly to the file; there is no need to encode already-encoded data.
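A minimal sketch of that idea, written so it runs on both Python 2 and Python 3: encode only values that are still unicode, and pass already-encoded bytes through untouched. The helper name to_utf8_bytes is hypothetical and not part of the original script.

```python
def to_utf8_bytes(value):
    """Return UTF-8 bytes for value without re-encoding encoded data.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, bytes):
        # Already a byte string (str on Python 2): write it as-is.
        return value
    # A unicode object: this is the only case that needs encoding.
    return value.encode('utf8')


already_encoded = u'\u00e9'.encode('utf8')    # the two bytes 0xc3 0xa9
from_bytes = to_utf8_bytes(already_encoded)   # passed through unchanged
from_unicode = to_utf8_bytes(u'\u00e9')       # encoded exactly once
```

Both calls yield the same two bytes, so the result can be handed to a file opened in binary or append mode without triggering the implicit ASCII decode.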

If you build up unicode values instead, you would indeed have to encode them before they can be written to a file. In that case you'd want to use codecs.open(), which returns a file object that encodes unicode values to UTF-8 for you.
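As a rough illustration of the codecs.open() route, the sketch below appends a unicode line to a temporary file and then reads the raw bytes back. The function name write_index, the temporary path, and the sample line are assumptions made for the example; only the '#artiste' marker and the '[:::]' separator come from the question.

```python
import codecs
import os
import tempfile


def write_index(path, text):
    """Append unicode text to path; codecs.open() encodes it as UTF-8."""
    with codecs.open(path, 'a', 'utf8') as findex:
        findex.write(text)


# Hypothetical location and content, used only for this demonstration.
index_path = os.path.join(tempfile.mkdtemp(), 'example.index')
write_index(index_path, u'#artiste[:::]B\u00e9nabar\n')

# Read the file back in binary mode to inspect what was stored on disk.
with open(index_path, 'rb') as raw:
    raw_bytes = raw.read()
```

The unicode é (U+00E9) ends up on disk as the two UTF-8 bytes 0xc3 0xa9, with no explicit call to encode() in the calling code.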

You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot read UTF-8 otherwise (such as MS Notepad).
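To see what the BOM amounts to, the sketch below builds a payload that starts with codecs.BOM_UTF8 and decodes it two ways. The plain utf8 codec keeps the BOM as a U+FEFF character at the start of the text, while the standard utf-8-sig codec strips it. The sample payload is made up for illustration.

```python
import codecs

# A BOM followed by the UTF-8 encoding of a short accented word.
payload = codecs.BOM_UTF8 + u'\u00e9t\u00e9'.encode('utf8')

with_plain_utf8 = payload.decode('utf8')      # BOM survives as u'\ufeff'
with_utf8_sig = payload.decode('utf-8-sig')   # BOM is removed on decode
```

A reader that does not expect the BOM sees that stray U+FEFF as part of the first field, which is one reason to leave it out unless a consuming tool requires it.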

For your MySQL insert problem, you need to do two things:

  • Add charset='utf8' to your MySQLdb.connect() call.

  • Use unicode objects, not str objects, when querying or inserting, and use SQL parameters so the MySQL connector can do the right thing for you:

    artiste = artiste.decode('utf8')  # it is already UTF8, decode to unicode
    
    c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
    
    # ...
    
    c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
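The parameter style can be tried without a MySQL server. The sketch below swaps in the standard-library sqlite3 module purely so that it is self-contained; it is not MySQLdb, and sqlite3 uses ? placeholders where MySQLdb uses %s. The table layout and the function name insert_artiste_if_missing are assumptions made for the example.

```python
import sqlite3

# In-memory stand-in for the real database (assumed schema).
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE artistes '
    '(id INTEGER PRIMARY KEY, nom TEXT, status INTEGER, path TEXT)'
)


def insert_artiste_if_missing(conn, artiste):
    """Insert artiste unless a row with that name exists; return True if inserted."""
    cur = conn.execute('SELECT COUNT(id) FROM artistes WHERE nom=?', (artiste,))
    if cur.fetchone()[0]:
        return False
    # Values travel as parameters, so accents and quotes need no manual escaping.
    conn.execute(
        'INSERT INTO artistes(nom, status, path) VALUES(?, 99, ?)',
        (artiste, artiste + u'/'),
    )
    return True


name = u'Zo\u00e9 "Live"'          # accented letter plus embedded quotes
first = insert_artiste_if_missing(conn, name)
second = insert_artiste_if_missing(conn, name)
stored = conn.execute('SELECT nom, path FROM artistes').fetchall()
```

With string concatenation, the embedded double quotes in this name would have broken the query from the question; passed as a parameter, the value is stored unchanged.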
    

It may actually work better if you used codecs.open() to decode the contents automatically instead:

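import codecs
import MySQLdb as mdb

sql = mdb.connect('localhost','admin','*******','music_vibration', charset='utf8')
artists_inserted = 0

with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
    for line in findex:
        if u'#artiste' not in line:
            continue

        # codecs.open() has already decoded the line, so artiste is a unicode object
        artiste=line.split(u'[:::]')[1].strip()

        # the lookup and insert run once per matching line, inside the loop
        cursor = sql.cursor()
        cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
        if not cursor.fetchone()[0]:
            cursor = sql.cursor()
            cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
            artists_inserted += 1

sql.commit()  # MySQLdb does not autocommit by default; needed for transactional tables such as InnoDB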
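The line-parsing step can be checked on its own, without a database. In this sketch the generator name iter_artistes and the sample lines are hypothetical; the '#artiste' marker and the '[:::]' separator are taken from the question's index format, and io.StringIO stands in for the decoded file object.

```python
import io


def iter_artistes(lines):
    """Yield the artist name from each decoded '#artiste' line."""
    for line in lines:
        if u'#artiste' not in line:
            continue
        # Field 1 follows the '[:::]' separator; strip() drops the newline.
        yield line.split(u'[:::]')[1].strip()


# Made-up sample standing in for an index file opened with codecs.open().
sample = io.StringIO(
    u'#artiste[:::]Beno\u00eet\n'
    u'#album[:::]Something\n'
    u'#artiste[:::]C\u00e9line \n'
)
names = list(iter_artistes(sample))
```

Keeping the parsing separate from the database calls makes it easy to confirm that the names are proper unicode values before they reach MySQL.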

You may want to brush up on Unicode, UTF-8, and encodings.
