python: open and read a file containing Germanic umlauts as unicode
Problem description
I have written a program that reads words from a text file, enters them into an sqlite database, and also treats them as strings. But I need to enter some words containing German umlauts and ß: äöüß.
Here is a prepared piece of code:
I tried both # -*- coding: iso-8859-15 -*- and # -*- coding: utf-8 -*-. No difference(!)
# -*- coding: iso-8859-15 -*-
import sqlite3
dbname = 'sampledb.db'
filename ='text.txt'
con = sqlite3.connect(dbname)
cur = con.cursor()
cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,name)''')
#f=open(filename)
#text = f.readlines()
#f.close()
text = u'süß'
print (text)
cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,))
con.commit()
sentence = "The name is: %s" %(text,)
print (sentence)
con.close()
The above code runs well. But I need to read 'text' from a file containing the word 'süß'. So when I uncomment the three lines (f=open(filename) ...) and comment out text = u'süß', I get this error:
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.
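A likely cause, assuming the three commented lines are restored unchanged: f.readlines() returns a list of lines, and sqlite3 cannot bind a list as a query parameter. The sketch below reproduces this with an in-memory database. It is written for Python 3 (an assumption; the question uses Python 2), where the exact exception class and message vary by version, so it catches the base class sqlite3.Error. The helper name try_insert is hypothetical.

```python
import sqlite3


def try_insert(value):
    """Insert value into a throwaway in-memory table; return None or the error."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("create table table1 (id INTEGER PRIMARY KEY, name)")
        con.execute("insert into table1 (id, name) VALUES (NULL, ?)", (value,))
        return None
    except sqlite3.Error as exc:
        return exc
    finally:
        con.close()


lines = [u"s\xfc\xdf\n"]                       # what readlines() returns: a list holding 'süß'
list_error = try_insert(lines)                 # a list is not a bindable type
string_error = try_insert(lines[0].strip())    # a single string binds fine
print(type(list_error).__name__, string_error)
```

Binding one element of the list (or the result of f.read()) instead of the list itself avoids the error.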
I tried the codecs module to read the file as utf-8 and as iso-8859-15. But I could not decode the result to the string 'süß', which I need to complete my sentence at the end of the code.
Once I tried decoding to utf-8 before inserting into the database. It worked, but I could not use the result as a string.
Is there a way I can import süß from a file and use it both for inserting into sqlite and as a string?
More details:
Here I add more details for clarification. I have used codecs.open before. The text file containing the word süß is saved as utf-8. Using f=codecs.open(filename, 'r', 'utf-8') and text=f.read(), I read the file as unicode u'\ufeffs\xfc\xdf'. Inserting this unicode into sqlite3 goes smoothly: cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,)).
The problem is here: sentence = "The name is: %s" %(text,) gives u'The name is: \ufeffs\xfc\xdf', and I also need print(text) to output süß, whereas print(text) raises this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>.
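The u'\ufeff' at position 0 is a byte order mark (BOM) that some editors write at the start of UTF-8 files. The plain 'utf-8' codec keeps it as a character, while the 'utf-8-sig' codec drops it. A minimal sketch in Python 3 syntax (an assumption; io.open takes the same arguments in Python 2), using a temporary file as a stand-in for text.txt:

```python
import io
import os
import tempfile

# Create a stand-in for text.txt: the UTF-8 bytes of 'süß' preceded by a BOM.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as raw:
    raw.write(b"\xef\xbb\xbf" + u"s\xfc\xdf".encode("utf-8"))

with io.open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()        # the BOM survives as the first character

with io.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()     # the BOM is consumed by the codec

os.remove(path)

# ascii() keeps the output printable on consoles that cannot show umlauts.
print(ascii(with_bom), ascii(without_bom))
```

With the BOM gone, the string contains only characters that a Western European console encoding can represent.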
Thanks.
Recommended answer
I could sort out the problem. Thanks for the help.
Here it is:
# -*- coding: iso-8859-1 -*-
import sys
import codecs
import sqlite3
f = codecs.open("suess_sweet.txt", "r", "utf-8") # suess_sweet.txt file contains two
text_in_unicode = f.read() # comma-separated words: süß, sweet
f.close()
stdout_encoding = sys.stdout.encoding or sys.getfilesystemencoding()
con = sqlite3.connect('dict1.db')
cur = con.cursor()
cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,German,English)''')
[ger,eng] = text_in_unicode.split(',')
cur.execute('''insert into table1 (id,German,English) VALUES (NULL,?,?)''',(ger,eng))
con.commit()
sentence = "The German word is: %s" %(ger,)
print sentence.encode(stdout_encoding)
con.close()
I got some help from this page (it's in German), and the output is:
The German word is: ?süß
A small problem remains: the '?'. I thought that the unicode u' had been replaced by ? after encoding. sentence gives:
>>> sentence
u'The German word is: \ufeffs\xfc\xdf '
and the encoded sentence gives:
>>> sentence.encode(stdout_encoding)
'The German word is: ?s\xfc\xdf '
So it is not what I thought.
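Comparing the two outputs, the '?' sits where \ufeff was: the byte order mark has no equivalent in legacy code pages, so an encoder that substitutes unencodable characters emits '?' in its place. A small sketch in Python 3 syntax, assuming cp1252 as the output encoding and errors="replace" to mimic that substitution (both are assumptions; the actual value of stdout_encoding is not stated here):

```python
text = u"\ufeffs\xfc\xdf"   # BOM followed by 'süß', as read from the file

# Strict encoding refuses the BOM and raises UnicodeEncodeError.
try:
    text.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With substitution enabled, the BOM becomes '?' and the umlauts encode normally.
replaced = text.encode("cp1252", errors="replace")
print(strict_failed, replaced)
```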
A simple solution comes to mind: to get rid of the question mark, use the replace function:
sentence = "The German word is: %s" %(ger,)
to_print = sentence.encode(stdout_encoding)
to_print = to_print.replace('?','')
>>> print(to_print)
The German word is: süß
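One caveat with replace('?',''): it would also delete any genuine question marks in the text. A hedged alternative is to remove the byte order mark from the unicode string before encoding. The sketch below uses Python 3 syntax and a hypothetical helper name, strip_bom; neither comes from the answer, and the sample content with a trailing '?' is made up for illustration.

```python
BOM = u"\ufeff"


def strip_bom(text):
    """Return text without a leading byte order mark, if one is present."""
    return text[len(BOM):] if text.startswith(BOM) else text


raw = u"\ufeffs\xfc\xdf, sweet?"    # illustrative content: BOM, 'süß', and a real '?'
ger, eng = [part.strip() for part in strip_bom(raw).split(",")]
sentence = u"The German word is: %s" % (ger,)

# ascii() keeps the output printable on consoles that cannot show umlauts.
print(ascii(sentence))   # no BOM left in the sentence
print(eng)               # the genuine '?' is kept
```

Because the BOM is removed before any encoding happens, no '?' placeholder is produced and nothing needs to be replaced afterwards.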
Thank you :)