Fastest way to do data type conversion using csv.DictReader in Python


Problem description

I'm working with a CSV file in python, which will have ~100,000 rows when in use. Each row has a set of dimensions (as strings) and a single metric (float).

As csv.DictReader or csv.reader return values as string only, I'm currently iterating over all rows and converting the one numeric value to a float.

for i in csvDict:
    i[col] = float(i[col])

Is there a better way that anyone could suggest to do this? I've been playing around with various combinations of map, izip, itertools and have searched extensively for some samples of doing it more efficiently, but unfortunately haven't had much success.

In case it helps: I'm doing this on appengine. I believe that what I'm doing may be resulting in me hitting this error: Exceeded soft process size limit with 267.789 MB after servicing 11 requests total - I only get it when the CSV is quite large.

Edit: My Goal

I'm parsing this CSV so that I can use it as a data source for the Google Visualizations API. The final data set will be loaded into a gviz DataTable for querying. Types must be specified when this table is constructed. My problem could also be solved if anyone knows of a good gviz csv->datatable converter in Python!

Edit2: My Code

I believe that my issue has to do with the way I attempt to fixCsvTypes(). Also, data_table.LoadData() expects an iterable object.

import csv
import datetime
import StringIO  # Python 2 module

import gviz_api

class GvizFromCsv(object):
  """Convert CSV to Gviz ready objects."""

  def __init__(self, csvFile, dateTimeFormat=None):
    self.fileObj = StringIO.StringIO(csvFile)
    self.csvDict = list(csv.DictReader(self.fileObj))
    self.dateTimeFormat = dateTimeFormat
    self.headers = {}
    self.ParseHeaders()
    self.fixCsvTypes()

  def IsNumber(self, st):
    try:
        float(st)
        return True
    except ValueError:
        return False

  def IsDate(self, st):
    try:
      datetime.datetime.strptime(st, self.dateTimeFormat)
      return True
    except ValueError:
      return False

  def ParseHeaders(self):
    """Attempts to figure out header types for gviz, based on first row"""
    for k, v in self.csvDict[0].items():
      if self.IsNumber(v):
        self.headers[k] = 'number'
      elif self.dateTimeFormat and self.IsDate(v):
        self.headers[k] = 'date'
      else:
        self.headers[k] = 'string'

  def fixCsvTypes(self):
    """Only fixes numbers."""
    update_to_numbers = []
    for k,v in self.headers.items():
      if v == 'number':
        update_to_numbers.append(k)
    for i in self.csvDict:
      for col in update_to_numbers:
        i[col] = float(i[col])

  def CreateDataTable(self):
    """creates a gviz data table"""
    data_table = gviz_api.DataTable(self.headers)
    data_table.LoadData(self.csvDict)
    return data_table

Solution

I first parsed the CSV file with a regex, but since the data in each row of the file is very strictly arranged, we can simply use the split() function:

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

def transf(surname,x,y):
    return (surname,float(x),float(y))

with open('surnames.csv') as f:

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( transf(*line.split(',')[0:3]) for line in f )
    # to populate the data table by iterating in the CSV file
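Note that split(',') only works while no field contains a quoted comma. As a hedged variation on the code above, the sketch below does the same row-by-row conversion with csv.reader, which handles quoting. It is written for Python 3; the helper name typed_rows, the converters tuple and the sample data are made up for illustration and are not part of gviz_api.

```python
import csv
import io

def typed_rows(lines, converters):
    """Yield one tuple per CSV row, applying converters[i] to field i.

    Fields beyond len(converters) are dropped, like the [0:3] slice above.
    """
    for fields in csv.reader(lines):
        if not fields:  # skip blank lines
            continue
        yield tuple(conv(value) for conv, value in zip(converters, fields))

# In-memory stand-in for surnames.csv, with one quoted comma added (assumed data).
sample = io.StringIO(
    'surname,percent,cumulative percent,rank\n'
    '"SMITH, JR",1.006,1.006,1,\n'
    'JOHNSON,0.810,1.816,2,\n'
)
next(sample)  # skip the header line
rows = list(typed_rows(sample, (str, float, float)))
```

Because typed_rows is a generator, it could be handed to data_table.LoadData() in place of the generator expression above instead of being collected with list().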

Or without a function to be defined:

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])] for line in f )
    # to populate the data table by iterating in the CSV file

At one point I believed I had to populate the data table one row at a time, because I was using a regex and needed to extract the match groups before converting the numeric strings to floats. With split(), everything can be done in a single LoadData() call.
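The same one-pass idea should also work with csv.DictReader, the reader the question started from. The sketch below (Python 3) converts the numeric columns lazily instead of first building list(csv.DictReader(...)), so only one row dict is alive at a time. The name float_columns and the sample text are assumptions for illustration.

```python
import csv
import io

def float_columns(reader, numeric_cols):
    """Lazily yield each row dict with the named columns converted to float."""
    for row in reader:
        for col in numeric_cols:
            row[col] = float(row[col])
        yield row

# Small in-memory sample (assumed data).
csv_text = 'surname,percent\nSMITH,1.006\nJOHNSON,0.810\n'
reader = csv.DictReader(io.StringIO(csv_text))
converted = float_columns(reader, ['percent'])
first = next(converted)  # nothing is parsed until a row is requested
```

Whether this helps with the memory limit depends on what the consumer does with the rows; if it stores every row, as a data table does, the saving is only the intermediate list.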

Hence, your code can be shortened. By the way, I don't see why it needs to define a class; a function seems enough to me:

def GvizFromCsv(filename):
  """ creates a gviz data table from a CSV file """

  data_table = gviz_api.DataTable([('col1','string','SURNAME'),
                                   ('col2','number','ONE'    ),
                                   ('col3','number','TWO'    ) ])

  #  --- with such a table schema , lines in the file must be like that: ---  
  #  blah, number, number, ...anything else...\n 
  #  SMITH,1.006,1.006, ...anything else...\n 
  #  JOHNSON,0.810,1.816, ...anything else...\n 
  #  WILLIAMS,0.699,2.515, ...anything else...\n

  with open(filename) as f:
    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])]
                         for line in f )
  return data_table


Now you need to check whether the way you read the CSV data from another API can be plugged into this code, so that the data table is still populated by iterating.
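As one possible way to do that when the CSV arrives as a string (as csvFile does in the question), here is a rough Python 3 sketch. It guesses the column types from the first data row, as ParseHeaders() does, but peeks at that row and chains it back rather than materializing the whole file as a list. The function names and sample text are hypothetical, and date detection is left out.

```python
import csv
import io
import itertools

def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

def infer_and_convert(csv_text):
    """Return (headers, rows): column types guessed from the first data row,
    and a generator of dicts with the 'number' columns converted to float."""
    reader = csv.DictReader(io.StringIO(csv_text))
    first = next(reader)  # raises StopIteration if there are no data rows
    headers = {k: 'number' if is_number(v) else 'string'
               for k, v in first.items()}
    numeric = [k for k, kind in headers.items() if kind == 'number']

    def rows():
        # Put the peeked row back in front of the remaining rows.
        for row in itertools.chain([first], reader):
            for col in numeric:
                row[col] = float(row[col])
            yield row

    return headers, rows()

headers, rows = infer_and_convert(
    'surname,percent\nSMITH,1.006\nJOHNSON,0.810\n')
```

The headers dict has the same shape the question passes to gviz_api.DataTable(), and rows is an iterable of dicts like the one it passes to LoadData(); I have not tested this against gviz_api itself.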
