使用python清理大数据 [英] cleaning big data using python

查看:68
本文介绍了使用python清理大数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我必须在 python 中清理输入数据文件.由于拼写错误,数据字段可能包含字符串而不是数字.我想识别所有作为字符串的字段,并使用 Pandas 用 NaN 填充这些字段.另外,我想记录这些字段的索引.

I have to clean a input data file in python. Due to typo error, the datafield may have strings instead of numbers. I would like to identify all fields which are a string and fill these with NaN using pandas. Also, I would like to log the index of those fields.

最粗略的方法之一是遍历每个字段并检查它是否为数字,但是如果数据很大,这会消耗大量时间.

One of the crudest way is to loop through each and every field and checking whether it is a number or not, but this consumes lot of time if the data is big.

我的 csv 文件包含类似于下表的数据:

Country  Count  Sales
USA         1   65000
UK          3    4000
IND         8       g
SPA         3    9000
NTH         5   80000

....假设我有 60,000 行这样的数据.

.... Assume that i have 60,000 such rows in the data.

理想情况下,我想确定 IND 行在 SALES 列下具有无效值.有关如何有效执行此操作的任何建议?

Ideally I would like to identify that row IND has an invalid value under SALES column. Any suggestions on how to do this efficiently?

推荐答案

read_csv:

There is a na_values argument to read_csv:

na_values : 类似列表或字典,默认 None
     要识别为 NA/NaN 的其他字符串.如果 dict 通过,特定的每列 NA 值

na_values : list-like or dict, default None
       Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

df = pd.read_csv('city.csv', sep='\s+', na_values=['g'])

In [2]: df
Out[2]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000

使用 pandas.isnull,您只能选择'Sales'列或'Country'系列中带有NaN的那些行:

Using pandas.isnull, you can select only those rows with NaN in the 'Sales' column, or the 'Country' series:

In [3]: df[pd.isnull(df['Sales'])]
Out[3]: 
  Country  Count  Sales
2     IND      8    NaN

In [4]: df[pd.isnull(df['Sales'])]['Country']
Out[4]: 
2    IND
Name: Country

如果它已经在 DataFrame 中,您可以使用 apply 将这些数字字符串转换为整数(使用 str.isdigit):

df = pd.DataFrame({'Count': {0: 1, 1: 3, 2: 8, 3: 3, 4: 5}, 'Country': {0: 'USA', 1: 'UK', 2: 'IND', 3: 'SPA', 4: 'NTH'}, 'Sales': {0: '65000', 1: '4000', 2: 'g', 3: '9000', 4: '80000'}})

In [12]: df
Out[12]: 
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8      g
3     SPA      3   9000
4     NTH      5  80000

In [13]: df['Sales'] = df['Sales'].apply(lambda x: int(x) 
                                                  if str.isdigit(x)
                                                  else np.nan)

In [14]: df
Out[14]: 
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000

这篇关于使用python清理大数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆