python: open and read a file containing germanic umlaut as unicode


Problem description

I have written a program that reads words from a text file, inserts them into an sqlite database, and also uses them as strings. But I need to enter some words containing German umlauts and ß: äöüß.

I tried both # -*- coding: iso-8859-15 -*- and # -*- coding: utf-8 -*-. No difference(!)

Here is a prepared piece of code:

    # -*- coding: iso-8859-15 -*-
    import sqlite3

    dbname = 'sampledb.db'
    filename ='text.txt'


    con = sqlite3.connect(dbname)
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,name)''')    

    #f=open(filename)
    #text = f.readlines()
    #f.close()

    text = u'süß'

    print (text)
    cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,))       

    con.commit()

    sentence = "The name is: %s" %(text,)

    print (sentence)
    con.close()

The above code runs well. But I need to read text from a file containing the word süß. So when I uncomment the three lines (f=open(filename) ...) and comment out text = u'süß', I get this error:

    sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.
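
One likely cause of this error, independent of the encoding: f.readlines() returns a list of lines, and sqlite3 cannot bind a whole list to a single ? placeholder. The sketch below assumes Python 3 and an in-memory database; the helper name insert_words is hypothetical and only serves the illustration.

```python
import sqlite3


def insert_words(con, lines):
    """Insert each non-empty line as its own row (hypothetical helper)."""
    words = [line.strip() for line in lines if line.strip()]
    con.executemany("insert into table1 (id, name) values (NULL, ?)",
                    [(word,) for word in words])
    con.commit()
    return words


con = sqlite3.connect(":memory:")
con.execute("create table if not exists table1 (id integer primary key, name)")

lines = ["süß\n"]  # what f.readlines() returns for a one-line file

try:
    # Binding the whole list to one placeholder is what fails.
    con.execute("insert into table1 (id, name) values (NULL, ?)", (lines,))
    list_binding_failed = False
except sqlite3.Error:
    # The exact exception class can differ between Python versions.
    list_binding_failed = True

inserted = insert_words(con, lines)
rows = [name for (name,) in con.execute("select name from table1")]
```

Passing one string per row, with the trailing newline stripped, avoids the binding error regardless of which codec was used to read the file.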

I tried the codecs module to read the file as utf-8 and as iso-8859-15. But I could not decode the result to the string süß, which I need to complete my sentence at the end of the code.

Once I tried decoding to utf-8 before inserting into the database. It worked, but I could not use it as a string.

Is there a way I can import süß from a file and use it both for inserting into sqlite and as a string?

More details:

Here I add more details for clarification. I have used codecs.open before. The text file containing the word süß is saved as utf-8. Using f=codecs.open(filename, 'r', 'utf-8') and text=f.read(), I read the file as the unicode string u'\ufeffs\xfc\xdf'. Inserting this unicode string into sqlite3 goes smoothly: cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,)).

The problem is here: sentence = "The name is: %s" %(text,) gives u'The name is: \ufeffs\xfc\xdf', and I also need print(text) to output süß, but print(text) raises this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>.
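
The leading \ufeff is a byte order mark (BOM) that some editors write at the start of UTF-8 files. A minimal sketch, assuming Python 3 and a throwaway temporary file: the utf-8-sig codec decodes the file and drops the BOM if one is present, so the text can be used as an ordinary string and as an SQL parameter alike.

```python
import codecs
import os
import tempfile

# Create a sample file the way a BOM-writing editor would save it.
path = os.path.join(tempfile.mkdtemp(), "text.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "süß".encode("utf-8"))

# Plain utf-8 keeps the BOM as the first character ...
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# ... while utf-8-sig removes it when present.
with open(path, encoding="utf-8-sig") as f:
    text = f.read()

sentence = "The name is: %s" % (text,)
```

With the Python 2 code from the question, the same codec name can be passed as the third argument of codecs.open.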

Thanks.

Recommended answer

I could sort out the problem. Thanks for the help.

Here it is:

    # -*- coding: iso-8859-1 -*-

    import sys
    import codecs
    import sqlite3

    f = codecs.open("suess_sweet.txt", "r", "utf-8")    # suess_sweet.txt file contains two
    text_in_unicode = f.read()                          # comma-separated words: süß, sweet
    f.close()

    stdout_encoding = sys.stdout.encoding or sys.getfilesystemencoding()

    con = sqlite3.connect('dict1.db')
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,German,English)''')

    [ger,eng] = text_in_unicode.split(',')

    cur.execute('''insert into table1 (id,German,English) VALUES (NULL,?,?)''',(ger,eng))

    con.commit()

    sentence = "The German word is: %s" %(ger,)

    print sentence.encode(stdout_encoding)

    con.close()

I got some help from this page (it's in German), and the output is:

    The German word is: ?süß

Still, a small problem is the '?'. I thought that the unicode prefix u' was replaced by ? after encoding. sentence gives:

    >>> sentence
    u'The German word is: \ufeffs\xfc\xdf '

and the encoded sentence gives:

    >>> sentence.encode(stdout_encoding)
    'The German word is: ?s\xfc\xdf '

So that was not what I thought.
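
The ? sits exactly where the BOM character \ufeff was, which suggests that the target encoding could not represent it and substituted a placeholder. A small sketch, assuming Python 3 and assuming cp1252 as the console encoding (the actual value of stdout_encoding is not shown in this answer): with errors="replace" an unencodable character becomes ?, while the default strict handler raises UnicodeEncodeError, as seen in the question.

```python
sentence = "The German word is: \ufeffsüß"

# 'replace' substitutes ? for anything the target encoding lacks.
replaced = sentence.encode("cp1252", errors="replace")

# The default (strict) error handler raises instead.
try:
    sentence.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```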

A simple solution that comes to mind for getting rid of the question mark is to use the replace function:

    sentence = "The German word is: %s" %(ger,)
    to_print = sentence.encode(stdout_encoding)
    to_print = to_print.replace('?','')

    >>> print(to_print)
    The German word is: süß
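
One caveat: replace('?','') also deletes any genuine question marks in the text. An alternative sketch, assuming Python 3 and again assuming cp1252 as the output encoding: strip the BOM from the decoded text before formatting or encoding, so nothing unencodable is left. The sample string with a question mark is made up for illustration.

```python
raw = "\ufeffsüß, sweet?"  # decoded file content with a leading BOM

# Remove BOM characters at the start only; everything else stays untouched.
clean = raw.lstrip("\ufeff")
ger, eng = [part.strip() for part in clean.split(",")]

sentence = "The German word is: %s" % (ger,)
encoded = sentence.encode("cp1252")  # no error: the BOM is gone
```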

Thank you :)
