如何获得SQLAlchemy来将Unicode省略号正确插入到mySQL表中? [英] How do I get SQLAlchemy to correctly insert a unicode ellipsis into a mySQL table?

查看:82
本文介绍了如何获得SQLAlchemy来将Unicode省略号正确插入到mySQL表中?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用feedparser解析RSS feed,并使用SQLAlchemy将其插入到mySQL表中.实际上,我能够使它运行得很好,但是今天,提要中有一个描述中带有省略号的项目,并且出现了以下错误:

I am trying to parse an RSS feed with feedparser and insert it into a mySQL table using SQLAlchemy. I was actually able to get this running just fine but today the feed had an item with an ellipsis character in the description and I get the following error:

UnicodeEncodeError:'latin-1'编解码器无法在位置35处编码字符u'\ u2026':序数不在范围内(256)

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)

如果我向引擎添加convert_unicode = True选项,我可以使插入内容通过,但省略号不会显示出来,这只是奇怪的字符.据我所知,这是有道理的,因为据我所知,latin-1中没有水平省略号.即使我将编码设置为utf-8,也似乎没有什么不同.如果我使用phpmyadmin进行插入并包含省略号,则表示一切正常.

If I add the convert_unicode=True option to the engine I am able to get the insert to go through but the ellipsis doesn't show up it's just weird characters. This seems to make sense since to the best of my knowledge there is no horizontal ellipsis in latin-1. Even if I set the encoding to utf-8 it doesn't seem to make a difference. If I do an insert using phpmyadmin and include the ellipsis it goes through fine.

我在想我只是不了解字符编码或如何使SQLAlchemy使用我指定的字符编码.有谁知道如何在不使用怪异字符的情况下输入文本?

I'm thinking I just don't understand character encodings or how to get SQLAlchemy to use one I specify. Does anyone know how to get the text to go in without weird characters?

更新

我想我已经弄清楚了这个原因,但是我不太确定为什么这很重要...

I think I have figured this one out but I'm not really sure why it matters...

这是代码:

import sys
import feedparser
import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table

COMMON_CHANNEL_PROPERTIES = [
  ('Channel title:','title', None),
  ('Channel description:', 'description', 100),
  ('Channel URL:', 'link', None),
]

COMMON_ITEM_PROPERTIES = [
  ('Item title:', 'title', None),
  ('Item description:', 'description', 100),
  ('Item URL:', 'link', None),
]

INDENT = u' '*4

def feedinfo(url, output=sys.stdout):
  feed_data = feedparser.parse(url)
  channel, items = feed_data.feed, feed_data.entries

  #adding charset=utf8 here is what fixed the problem

  db = create_engine('mysql://user:pass@localhost/db?charset=utf8')
  metadata = MetaData(db)
  rssItems = Table('rss_items', metadata,autoload=True)
  i = rssItems.insert();

  for label, prop, trunc in COMMON_CHANNEL_PROPERTIES:
    value = channel[prop]
    if trunc:
      value = value[:trunc] + u'...'
    print >> output, label, value
  print >> output
  print >> output, "Feed items:"
  for item in items:
    i.execute({'title':item['title'], 'description': item['description'][:100]})
    for label, prop, trunc in COMMON_ITEM_PROPERTIES:
      value = item[prop]
      if trunc:
        value = value[:trunc] + u'...'
      print >> output, INDENT, label, value
    print >> output, INDENT, u'---'
  return

if __name__=="__main__":
  url = sys.argv[1]
  feedinfo(url)

这是不带charset选项的运行代码的输出/回溯:

Here's the output/traceback from running the code without the charset option:

Channel title: [H]ardOCP News/Article Feed
Channel description: News/Article Feed for [H]ardOCP...
Channel URL: http://www.hardocp.com

Feed items:
     Item title: Windows 8 UI is Dropping the 'Start' Button
     Item description: After 15 years of occupying a place of honor on the desktop, the "Start" button will disappear from ...
     Item URL: http://www.hardocp.com/news/2012/02/05/windows_8_ui_dropping_lsquostartrsquo_button/
     ---
     Item title: Which Crashes More? Apple Apps or Android Apps
     Item description: A new study of smartphone apps between Android and Apple conducted over a two month period came up w...
     Item URL: http://www.hardocp.com/news/2012/02/05/which_crashes_more63_apple_apps_or_android/
     ---
Traceback (most recent call last):
  File "parse.py", line 47, in <module>
    feedinfo(url)
  File "parse.py", line 36, in feedinfo
    i.execute({'title':item['title'], 'description': item['description'][:100]})
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/expression.py", line 2758, in execute
    return e._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2304, in _execute_clauseelement
    return connection._execute_clauseelement(elem, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1538, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_context
    context)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 330, in do_execute
    cursor.execute(statement, parameters)
  File "build/bdist.linux-i686/egg/MySQLdb/cursors.py", line 159, in execute
  File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 264, in literal
  File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 202, in unicode_literal
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)

因此,看起来好像将字符集添加到mysql connect字符串中就可以了.我想它默认为latin-1吗?我尝试将content_engine上的编码标志设置为utf8,但没有执行任何操作.有人知道当表和字段设置为utf8 unicode时为什么会使用latin-1吗?我还尝试了在发送之前使用.encode('cp1252')对item ['description]进行编码,即使没有在连接字符串中添加charset选项,效果也很好.那不应该与latin-1一起使用,但显然可以吗?我有解决方案,但希望得到一个答案:)

So it looks like adding the charset to the mysql connect string did it. I suppose it defaults to latin-1? I had tried setting the encoding flag on content_engine to utf8 and that did nothing. Anyone know why it would use latin-1 when the tables and fields are set to utf8 unicode? I also tried encoding item['description] using .encode('cp1252') before sending it off and that worked as well even without adding the charset option to the connection string. That shouldn't have worked with latin-1 but apparently it did? I've got the solution but would love an answer :)

推荐答案

错误消息

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' 
in position 35: ordinal not in range(256)

似乎表明某些Python语言代码正在尝试将字符\u2026转换为Latin-1(ISO8859-1)字符串,并且失败.毫不奇怪,该字符为 U+2026 HORIZONTAL ELLIPSIS ,在ISO8859中没有单个等效字符-1.

seems to indicate that some Python language code is trying to convert the character \u2026 into a Latin-1 (ISO8859-1) string, and it is failing. Not surprising, that character is U+2026 HORIZONTAL ELLIPSIS, which has no single equivalent character in ISO8859-1.

您通过在SQLAlchemy连接调用中添加查询?charset=utf8来解决此问题:

You fixed the problem by adding the query ?charset=utf8 in your SQLAlchemy connection call:

import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table

db = create_engine('mysql://user:pass@localhost/db?charset=utf8')

其中的 数据库网址 SQLAlchemy文档告诉我们,使用mysql-python驱动程序,以mysql开头的URL表示MySQL语言.

The section Database Urls of the SQLAlchemy documentation tells us that a URL beginning with mysql indicates a MySQL dialect, using the mysql-python driver.

以下部分 自定义DBAPI connect()参数 ,告诉我们查询参数已传递给基础DBAPI.

The following section, Custom DBAPI connect() arguments, tells us that query arguments are passed to the underlying DBAPI.

那么, mysql-python 驱动程序由什么参数{charset: 'utf8'}组成?他们文档的 函数和属性 部分说到charset属性"...如果存在,则连接字符集如果不相等,则将更改为此字符集."

So, what does the mysql-python driver make of a parameter {charset: 'utf8'}? Section Functions and attributes of their documentation says of the charset attribute "...If present, the connection character set will be changed to this character set, if they are not equal."

要了解连接字符集的含义,请转到 10.1 .4. MySQL 5.6参考手册的连接字符集和排序规则 .长话短说,MySQL可以将传入的查询解释为与数据库字符集不同的编码,并且与返回的查询结果的编码不同.

To find out what the connection character set means, we turn to 10.1.4. Connection Character Sets and Collations of the MySQL 5.6 reference manual. To make a long story short, MySQL can have interpret incoming queries as an encoding different than the database's character set, and different than the encoding of the returned query results.

由于您报告的错误消息看起来像是Python而不是SQL错误消息,因此我推测SQLAlchemy或mysql-python中的某些内容正在尝试将查询转换为默认的连接编码latin-1,然后将其发送.这就是触发错误的原因.但是,在connect()调用中的查询字符串?charset=utf8会更改连接编码,并且U+2026 HORIZONTAL ELLIPSIS可以通过.

Since the error message you reported looks like a Python rather than a SQL error message, I'll speculate that something in SQLAlchemy or mysql-python is attempting to convert the query to a default connection encoding of latin-1 before sending it. This is what triggers the error. However, the query string ?charset=utf8 in your connect() call changes the connection encoding, and the U+2026 HORIZONTAL ELLIPSIS is able to get through.

更新:您还问:如果我删除字符集选项,然后使用.encode('cp1252')对描述进行编码,它将顺利进行.省略号如何获得通过cp1252而不是unicode吗?"

Update: you also ask, "if I remove the charset option and then encode the description using .encode('cp1252') it will go through just fine. How is an ellipsis able to get through with cp1252 but not unicode?"

编码cp1252在字节值\x85处具有水平省略号字符.因此,可以将包含U+2026 HORIZONTAL ELLIPSIS的Unicode字符串编码为cp1252,而不会出现错误.

The encoding cp1252 has a horizontal ellipsis character at byte value \x85. Thus it is possible to encode a Unicode string containing U+2026 HORIZONTAL ELLIPSIS into cp1252 without error.

还请记住,在Python中,Unicode字符串和字节字符串是两种不同的数据类型.可以合理地推测MySQLdb可能具有通过SQL连接仅发送字节字符串的策略.因此,它将把以Unicode字符串接收的查询编码为字节字符串,但将仅留下以字节字符串接收的查询. (这是推测,我没有看过源代码.)

Remember also that in Python, Unicode strings and byte strings are two different data types. It's reasonable to speculate that MySQLdb might have a policy of sending only byte strings over a SQL connection. Thus it would encode a query received as a Unicode string into a byte string, but would leave a query received as a byte string alone. (This is speculation, I haven't looked at the source code.)

在发布的回溯中,最后两行(最接近错误发生的位置)显示方法名称literal,后跟unicode_literal.这倾向于支持MySQLdb将以Unicode字符串形式接收的查询编码为字节字符串的理论.

In the traceback you posted, the last two lines (closest to where the error occur) show the method names literal, followed by unicode_literal. That tends to support the theory that MySQLdb is encoding the query it receives as a Unicode string into a byte string.

当您自己对查询字符串进行编码时,您将绕过MySQLdb进行此编码方式不同的部分.但是请注意,如果您对查询字符串的编码方式与MySQL连接字符集所要求的编码方式不同,那么您将出现编码不匹配的情况,并且您的文本可能会存储错误.

When you encode the query string yourself, you bypass the part of MySQLdb that does this encoding differently. Note, however, that if you encode the query string differently than the MySQL connection charset calls for, then you'll have an encoding mismatch, and your text will likely be stored wrong.

这篇关于如何获得SQLAlchemy来将Unicode省略号正确插入到mySQL表中?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆