Parsing a CSV file with English and Hindi characters in Python


Problem description


I am trying to parse a CSV file which has both English and Hindi characters, and I am using UTF-16. It works fine, but as soon as it hits a Hindi character it fails. I am at a loss here.

Here's the code:

import csv
import codecs

csvReader = csv.reader(codecs.open('/home/kuberkaul/Downloads/csv.csv', 'rb', 'utf-16'))
for row in csvReader:
        print row

The error that I get is:

> Traceback (most recent call last):
>   File "csvreader.py", line 8, in <module>
>     for row in csvReader:
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-18: ordinal not in range(128)
> kuberkaul@ubuntu:~/Desktop$

How do I solve this?

Edit 1:

I tried the solutions and used the unicode csv reader, and now it gives this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

The code is:

import csv
import codecs, io


def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

filename = '/home/kuberkaul/Downloads/csv.csv'
reader = unicode_csv_reader(codecs.open(filename))
print reader
for rows in reader:
  print rows

Solution

As the documentation says, in a big Note near the top:

This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.

If you follow the link to the examples, it shows you the solution: encode each line to UTF-8 before passing it to csv. They even give you a nice wrapper, so you can just replace csv.reader with unicode_csv_reader and the rest of your code is unchanged:

csvReader = unicode_csv_reader(codecs.open('/home/kuberkaul/Downloads/csv.csv', 'rb', 'utf-16'))
for row in csvReader:
    print row


Of course the print isn't going to be very useful, as the str of a list uses the repr of each element, so you're going to get something like [u'foo', u'bar', u'\u0910\u0911'].

You can fix that in the usual ways—e.g., print u', '.join(row) will work if you remember the u, and if Python is able to guess your terminal's encoding (which it can on Mac and modern linux, but may not be able to on Windows and old linux, in which case you'll need to map an explicit encode over each column).
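For instance, here is a minimal sketch of that last option, reusing the unicode_csv_reader wrapper defined above and assuming a UTF-8 terminal (the 'utf-8' target encoding is an assumption; substitute whatever encoding your terminal actually uses):

import csv
import codecs

csvReader = unicode_csv_reader(codecs.open('/home/kuberkaul/Downloads/csv.csv', 'rb', 'utf-16'))
for row in csvReader:
    # each cell is already unicode; join them and encode explicitly
    # instead of letting print guess the terminal encoding
    print u', '.join(row).encode('utf-8')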
