Dataframe merge creates duplicate records in pandas (0.7.3)


Question

When I merge two CSV files, of the format (date, someValue), I see some duplicate records.

If I cut the number of records in half, the problem goes away. If I double the size of both files, it gets worse. Any help is appreciated!

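import pandas as pd

# pandas 0.7.3 API: from_csv uses the first column (date) as the index,
# so reset_index() turns it back into a regular column.
i = pd.DataFrame.from_csv('i.csv')
i = i.reset_index()
e = pd.DataFrame.from_csv('e.csv')
e = e.reset_index()

# Left join on the 'date' column of both frames.
total_df = pd.merge(i, e, right_index=False, left_index=False,
                    right_on=['date'], left_on=['date'], how='left')
total_df = total_df.sort(column='date')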

(Note the duplicate records for 11/15, 11/16, 12/17, 12/18.)

In [7]: total_df
Out[7]:
                  date  Cost  netCost
25 2012-11-15 00:00:00     1        2
26 2012-11-15 00:00:00     1        2
31 2012-11-16 00:00:00     1        2
32 2012-11-16 00:00:00     1        2
37 2012-11-17 00:00:00     1        2
2  2012-11-18 00:00:00     1        2
5  2012-11-19 00:00:00     1        2
8  2012-11-20 00:00:00     1        2
11 2012-11-21 00:00:00     1        2
14 2012-11-22 00:00:00     1        2
17 2012-11-23 00:00:00     1        2
20 2012-11-24 00:00:00     1        2
23 2012-11-25 00:00:00     1        2
29 2012-11-26 00:00:00     1        2
35 2012-11-27 00:00:00     1        2
0  2012-11-28 00:00:00     1        2
3  2012-11-29 00:00:00     1        2
6  2012-11-30 00:00:00     1        2
9  2012-12-01 00:00:00     1        2
12 2012-12-02 00:00:00     1        2
15 2012-12-03 00:00:00     1        2
18 2012-12-04 00:00:00     1        2
21 2012-12-05 00:00:00     1        2
24 2012-12-06 00:00:00     1        2
30 2012-12-07 00:00:00     1        2
36 2012-12-08 00:00:00     1        2
1  2012-12-09 00:00:00     2        2
4  2012-12-10 00:00:00     2        2
7  2012-12-11 00:00:00     2        2
10 2012-12-12 00:00:00     2        2
13 2012-12-13 00:00:00     1        2
16 2012-12-14 00:00:00     2        2
19 2012-12-15 00:00:00     2        2
22 2012-12-16 00:00:00     2        2
27 2012-12-17 00:00:00     1        2
28 2012-12-17 00:00:00     1        2
33 2012-12-18 00:00:00     1        2
34 2012-12-18 00:00:00     1        2
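
Duplicated keys in a merged frame can also be listed programmatically instead of by eye. The following is a minimal sketch, assuming a recent pandas release rather than 0.7.3; the three-row frame is a hypothetical miniature of the output above, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the merged result: 2012-11-15 appears twice.
total_df = pd.DataFrame({
    "date": pd.to_datetime(["2012-11-15", "2012-11-15", "2012-11-17"]),
    "Cost": [1, 1, 1],
    "netCost": [2, 2, 2],
})

# keep=False flags every row whose 'date' occurs more than once,
# not just the second and later copies.
dupes = total_df[total_df.duplicated("date", keep=False)]
print(dupes)

# Rows per date; anything above 1 is a duplicated key.
counts = total_df.groupby("date").size()
repeated = counts[counts > 1]
print(repeated)
```

Comparing `len(total_df)` with `len(i)` is another quick check: a left merge on a key that is unique in both files should not add rows.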



i.csv

date,Cost
2012-11-15 00:00:00,1
2012-11-16 00:00:00,1
2012-11-17 00:00:00,1
2012-11-18 00:00:00,1
2012-11-19 00:00:00,1
2012-11-20 00:00:00,1
2012-11-21 00:00:00,1
2012-11-22 00:00:00,1
2012-11-23 00:00:00,1
2012-11-24 00:00:00,1
2012-11-25 00:00:00,1
2012-11-26 00:00:00,1
2012-11-27 00:00:00,1
2012-11-28 00:00:00,1
2012-11-29 00:00:00,1
2012-11-30 00:00:00,1
2012-12-01 00:00:00,1
2012-12-02 00:00:00,1
2012-12-03 00:00:00,1
2012-12-04 00:00:00,1
2012-12-05 00:00:00,1
2012-12-06 00:00:00,1
2012-12-07 00:00:00,1
2012-12-08 00:00:00,1
2012-12-09 00:00:00,2
2012-12-10 00:00:00,2
2012-12-11 00:00:00,2
2012-12-12 00:00:00,2
2012-12-13 00:00:00,1
2012-12-14 00:00:00,2
2012-12-15 00:00:00,2
2012-12-16 00:00:00,2
2012-12-17 00:00:00,1
2012-12-18 00:00:00,1



e.csv

date,netCost
2012-11-15 00:00:00,2
2012-11-16 00:00:00,2
2012-11-17 00:00:00,2
2012-11-18 00:00:00,2
2012-11-19 00:00:00,2
2012-11-20 00:00:00,2
2012-11-21 00:00:00,2
2012-11-22 00:00:00,2
2012-11-23 00:00:00,2
2012-11-24 00:00:00,2
2012-11-25 00:00:00,2
2012-11-26 00:00:00,2
2012-11-27 00:00:00,2
2012-11-28 00:00:00,2
2012-11-29 00:00:00,2
2012-11-30 00:00:00,2
2012-12-01 00:00:00,2
2012-12-02 00:00:00,2
2012-12-03 00:00:00,2
2012-12-04 00:00:00,2
2012-12-05 00:00:00,2
2012-12-06 00:00:00,2
2012-12-07 00:00:00,2
2012-12-08 00:00:00,2
2012-12-09 00:00:00,2
2012-12-10 00:00:00,2
2012-12-11 00:00:00,2
2012-12-12 00:00:00,2
2012-12-13 00:00:00,2
2012-12-14 00:00:00,2
2012-12-15 00:00:00,2
2012-12-16 00:00:00,2
2012-12-17 00:00:00,2
2012-12-18 00:00:00,2


Recommended answer

This does seem to be a bug in pandas 0.7.3 or NumPy 1.6. It only happens when the column being merged on is a date (internally converted to numpy.datetime64). My solution was to convert the date into a string before merging:

from datetime import datetime
import pandas as pd

def _DatetimeToString(datetime64):
  # datetime64 holds nanoseconds since the epoch; convert to seconds.
  # (Python 2 code: `long` would be `int` on Python 3.)
  timestamp = datetime64.astype(long)/1000000000
  return datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')

i = pd.DataFrame.from_csv('i.csv')
i = i.reset_index()
i['date'] = i['date'].map(_DatetimeToString)
e = pd.DataFrame.from_csv('e.csv')
e = e.reset_index()
# Convert the key in e as well, so both sides are merged on strings.
e['date'] = e['date'].map(_DatetimeToString)

total_df = pd.merge(i, e, right_index=False, left_index=False,
                    right_on=['date'], left_on=['date'], how='left')
total_df = total_df.sort(column='date')
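
For readers on a current pandas release, the same merge can be written with today's API. The sketch below is an assumption-laden modernization, not the original poster's code: it uses `pd.read_csv`, `sort_values`, and the `validate` argument of `pd.merge`, none of which were available in 0.7.3, and it reads the first three rows of each file from in-memory strings so it runs without the CSV files.

```python
import io
import pandas as pd

# Stand-ins for i.csv and e.csv (first three rows of each file).
i_csv = io.StringIO(
    "date,Cost\n"
    "2012-11-15 00:00:00,1\n"
    "2012-11-16 00:00:00,1\n"
    "2012-11-17 00:00:00,1\n"
)
e_csv = io.StringIO(
    "date,netCost\n"
    "2012-11-15 00:00:00,2\n"
    "2012-11-16 00:00:00,2\n"
    "2012-11-17 00:00:00,2\n"
)

# parse_dates converts the 'date' column to datetime64 on load.
i = pd.read_csv(i_csv, parse_dates=["date"])
e = pd.read_csv(e_csv, parse_dates=["date"])

# validate="one_to_one" raises pandas.errors.MergeError if the merge key
# is not unique in either input frame.
total_df = pd.merge(i, e, on="date", how="left", validate="one_to_one")
total_df = total_df.sort_values("date").reset_index(drop=True)
print(total_df)
```

If string keys are still wanted, `i["date"].dt.strftime("%Y-%m-%d")` produces the same `YYYY-MM-DD` format that `_DatetimeToString` targets.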
