How to store Arabic text in a MySQL database using Python?

Problem description

I have an Arabic string, say:

txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

I want to write this Arabic text into a MySQL database. I tried using

txt = smart_str(txt)

or

txt = text.encode('utf-8') 

Neither of these worked, as they converted the string to

u'Arabic (\xd8\xa7\xd9\x84\xd8\xb7\xd9\x8a\xd8\xb1\xd8\xa7\xd9\x86)' 

Also, my database character set is already set to utf-8:

ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Because of these new code points, my database displays characters that correspond to the encoded bytes rather than the Arabic text. Please help. I want my Arabic text to be retained.

Also, does a quick export of this Arabic text from the MySQL database write the same Arabic text into files, or will it convert it back to unicode again?

I used the following code to insert:

cur.execute("INSERT INTO tab1(id, username, text, created_at) VALUES (%s, %s, %s, %s)", (smart_str(id), smart_str(user_name), smart_str(text), date))

Before this, when I didn't use smart_str, it threw an error saying only 'latin-1' is allowed.

Solution

A few clarifications first, because they will help you in the future as well.

txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

This is not an Arabic string. It is a unicode object made up of Unicode code points. If you simply print it, and your terminal supports Arabic, you get output like this:

>>> txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'
>>> print(txt)
Arabic (الطيران)

Now, to get the same output, Arabic (الطيران), in your database, you need to encode the string.

Encoding means taking these code points and converting them to bytes so that computers know what to do with them.
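
As a minimal sketch of that step (the variable names code_points, encoded and decoded are illustrative, not from the original answer), the snippet below looks at the code points of the Arabic word and then turns the whole string into UTF-8 bytes and back:

```python
# -*- coding: utf-8 -*-
txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

# The seven Arabic letters sit at positions 8-14 of the string.
# Each one is a single code point (a number), not a byte.
code_points = [ord(ch) for ch in txt[8:15]]

# Encoding maps code points to bytes. In UTF-8 each of these
# Arabic letters takes two bytes, while ASCII characters take one.
encoded = txt.encode('utf-8')

# Decoding is the reverse step: bytes back to code points.
decoded = encoded.decode('utf-8')

print(len(txt), len(encoded), decoded == txt)
```

The string has 16 characters but its UTF-8 form has 23 bytes, which is why the two must never be treated as interchangeable.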

The most common encoding is utf-8, because it supports all the characters of English plus those of many other languages (including Arabic). There are others too; for example, windows-1256 also supports Arabic. Some encodings have no entry for those numbers (called code points), and when you try to encode with them, you get an error like this:

>>> print(txt.encode('latin-1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 8-14: ordinal not in range(256)

What this tells you is that some number in the unicode object does not exist in the latin-1 table, so the program doesn't know how to convert it to bytes.
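
To check this for several encodings at once, a small sketch along these lines may help. The helper name can_encode is made up for this illustration, and cp1256 is Python's codec name for windows-1256:

```python
# -*- coding: utf-8 -*-
txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'


def can_encode(text, codec):
    """Hypothetical helper: report whether codec has a byte
    representation for every code point in text."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# utf-8 and cp1256 both cover these Arabic letters; latin-1 does not.
results = {codec: can_encode(txt, codec)
           for codec in ('utf-8', 'cp1256', 'latin-1')}
print(results)
```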

Computers store bytes, so when storing or transmitting information you always need to encode/decode it correctly.

This encode/decode step is sometimes called the unicode sandwich: everything outside is bytes, everything inside is unicode.
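
The same idea applies to the question about writing the text to a file. The following is only a sketch, assuming a temporary file named arabic.txt (a name chosen here for illustration): the program works with unicode, and io.open encodes on the way out and decodes on the way in.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

# Illustrative file location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'arabic.txt')

# Outer layer (output): unicode is encoded to UTF-8 bytes on write.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(txt)

# What actually sits on disk is bytes.
with io.open(path, 'rb') as f:
    raw = f.read()

# Outer layer (input): bytes are decoded back to unicode on read.
with io.open(path, 'r', encoding='utf-8') as f:
    restored = f.read()

print(raw == txt.encode('utf-8'), restored == txt)
```

Opening the resulting file in an editor that reads UTF-8 should show the Arabic text itself, not escape sequences.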


With that out of the way, you need to encode the data correctly before you send it to your database:

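import MySQLdb

# Parameterized query; the driver fills in the %s placeholders.
q = u"""
    INSERT INTO
       tab1(id, username, text, created_at)
    VALUES (%s, %s, %s, %s)"""

# charset and SET NAMES make the connection itself use UTF-8.
conn = MySQLdb.connect(host="localhost",
                       user='root',
                       password='',
                       db='',
                       charset='utf8',
                       init_command='SET NAMES UTF8')
cur = conn.cursor()
# Encode each unicode value to UTF-8 bytes before sending it.
cur.execute(q, (id.encode('utf-8'),
                user_name.encode('utf-8'),
                text.encode('utf-8'), date))
conn.commit()  # MySQLdb does not autocommit by default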
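
If many values need this treatment, the per-field encode calls can be folded into one place. This is a sketch only: encode_params is a hypothetical helper, not part of MySQLdb, and the sample values are invented for the illustration. It encodes text values and leaves everything else (such as the date) untouched.

```python
# -*- coding: utf-8 -*-
import datetime

text_type = type(u'')  # unicode on Python 2, str on Python 3


def encode_params(params, encoding='utf-8'):
    """Hypothetical helper: encode text values to bytes and pass
    non-text values (dates, numbers, None) through unchanged."""
    return tuple(p.encode(encoding) if isinstance(p, text_type) else p
                 for p in params)


txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'
created_at = datetime.datetime(2014, 1, 1, 12, 0, 0)  # sample value

params = encode_params((u'1', u'some_user', txt, created_at))
print(params)
```

The resulting tuple could then be passed as the second argument to cur.execute in place of the hand-written one.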

To confirm that it is being inserted correctly, make sure you are using mysql from a terminal or application that supports Arabic; otherwise, even if it is inserted correctly, you will see garbage characters when your program displays it.
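
One plausible source of such garbage, offered as a sketch rather than a diagnosis of the asker's setup, is that correctly encoded UTF-8 bytes get interpreted as latin-1 somewhere along the way. The snippet below reproduces exactly the garbled value shown in the question and shows that the bytes themselves are still intact (the names garbled and repaired are illustrative):

```python
# -*- coding: utf-8 -*-
txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

utf8_bytes = txt.encode('utf-8')

# Misreading the UTF-8 bytes as latin-1 turns every byte into its
# own character, which is what the garbled output looks like.
garbled = utf8_bytes.decode('latin-1')

# Undoing the wrong decode and applying the right one recovers
# the original text, so no information was lost.
repaired = garbled.encode('latin-1').decode('utf-8')

print(repr(garbled))
print(repaired == txt)
```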
