Data Type Recognition/Guessing of CSV data in python

Question

My problem is in the context of processing data from large CSV files.

I'm looking for the most efficient way to determine (that is, guess) the data type of a column based on the values found in that column. I'm potentially dealing with very messy data. Therefore, the algorithm should be error-tolerant to some extent.

Here's an example:

arr1 = ['0.83', '-0.26', '-', '0.23', '11.23']               # ==> recognize as float
arr2 = ['1', '11', '-1345.67', '0', '22']                    # ==> recognize as int
arr3 = ['2/7/1985', 'Jul 03 1985, 00:00:00', '', '4/3/2011'] # ==> recognize as date
arr4 = ['Dog', 'Cat', '0.13', 'Mouse']                       # ==> recognize as str

Bottom line: I'm looking for a python package or an algorithm that can detect either


  • the schema of a CSV file, or even better
  • the data type of an individual column as an array

The question "Method for guessing type of data represented currently represented as strings" goes in a similar direction. I'm worried about performance, though, since I may be dealing with many large spreadsheets (which is where the data comes from).

Recommended answer

You may be interested in this python library which does exactly this kind of type guessing on CSVs and XLS files for you:

  • https://github.com/okfn/messytables
  • https://messytables.readthedocs.org/ - docs

It happily scales to very large files, to streaming data off the internet etc.
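To illustrate why type guessing does not have to read a whole file, here is a minimal standard-library sketch that pulls only the first rows of a CSV stream into per-column lists. It is not messytables' API; `sample_columns` and `max_rows` are hypothetical names chosen for this example, and it assumes the first row is a header with unique column names.

```python
import csv
import io
from itertools import islice


def sample_columns(lines, max_rows=1000):
    """Collect up to max_rows data rows from an iterable of CSV lines.

    Returns a dict mapping each header name to the list of raw string
    values seen in that column. Only the sampled rows are ever read,
    so the input can be a large file or a network stream.
    """
    reader = csv.reader(lines)
    header = next(reader, None)  # assumption: first row is the header
    if header is None:
        return {}
    columns = {name: [] for name in header}
    for row in islice(reader, max_rows):
        # zip() silently drops extra cells in ragged rows
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


if __name__ == "__main__":
    stream = io.StringIO("price,animal\n0.83,Dog\n-0.26,Cat\n0.23,Mouse\n")
    print(sample_columns(stream, max_rows=2))
```

Any iterable of text lines works as input, for example an open file object, so the same function can sit in front of whatever guessing logic is applied to each column.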

There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)

The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
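For readers who only want the general idea, the following is a rough sketch of an error-tolerant, vote-based guesser written with the standard library. It is not the messytables code linked above. The names `classify` and `guess_column_type`, the 0.75 threshold, the two date formats, and the treatment of `''` and `'-'` as missing values are all assumptions made for the question's examples.

```python
from collections import Counter
from datetime import datetime

# Assumption: only these two layouts occur; extend as needed.
# Note that %b (month abbreviation) depends on the current locale.
DATE_FORMATS = ("%m/%d/%Y", "%b %d %Y, %H:%M:%S")


def classify(value):
    """Return the most specific type name that parses a single cell."""
    for name, parser in (("int", int), ("float", float)):
        try:
            parser(value)
            return name
        except ValueError:
            pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            pass
    return "str"


def guess_column_type(values, threshold=0.75, missing=("", "-")):
    """Guess a column type, tolerating a minority of bad cells.

    Cells listed in `missing` are ignored. A type wins if at least
    `threshold` of the remaining cells parse as it; ints also count
    as floats. Anything else falls back to "str".
    """
    cleaned = (v.strip() for v in values)
    votes = Counter(classify(v) for v in cleaned if v not in missing)
    total = sum(votes.values())
    if total == 0:
        return "str"
    if votes["int"] / total >= threshold:
        return "int"
    if (votes["int"] + votes["float"]) / total >= threshold:
        return "float"
    if votes["date"] / total >= threshold:
        return "date"
    return "str"


if __name__ == "__main__":
    print(guess_column_type(["0.83", "-0.26", "-", "0.23", "11.23"]))
    print(guess_column_type(["1", "11", "-1345.67", "0", "22"]))
```

With these settings the four example arrays from the question come out as float, int, date and str respectively. The threshold is the knob that controls how much mess is tolerated: arr2 is called int only because four of its five values are integers. Trying several `strptime` formats per cell is comparatively slow, so on large files it is probably worth running this on a sample of rows rather than on every value.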
