Pandas left merging 'Date' keys with different date formats (Not Timestamps)

Problem description

Hello Stack Overflow community, I am having an issue where Pandas is not understanding my merge conditions. It works with the other 'keys', but breaks as soon as I include the "Date" column as a key. The "Date" columns are string objects in both dataframes (not timestamps).

In other words, I want all 4 'keys' to be identical before "left merging" the columns from df2 to df, without losing any data in df. Also, when I open the csv files in Excel, the date formats look exactly the same (ex: 5/10/2015).

But Pandas reads the date column in "csv_file1", [df], in a hyphenated, year-first form such as "2015-5-6":

In [1]: df['Date']
Out[1]: 
         Date 
0   2015-5-11    
1   2015-5-11    
2   2015-5-10   
3   2015-5-12  

Pandas reads the date column in "csv_file2", [df2], as "5/6/2015" :

In [2]: df2['Date']
Out[2]: 
         Date 
0   5/11/2015    
1   5/11/2015    
2   5/12/2015 
3   5/13/2015
4   5/17/2015 

The dtypes for both are "object"; I do not understand why Pandas would read the format of the 'Date' columns differently.
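One likely explanation, offered here as an assumption because the CSV files themselves are not shown: read_csv keeps an unparsed date column as the literal text stored in the file, while Excel may re-render anything it recognizes as a date in its own display format. The sketch below uses a made-up two-line file (the `raw` string is hypothetical) to show that the text comes back unchanged.

```python
import io

import pandas as pd

# Hypothetical file contents standing in for csv_file1.
raw = "Date,Hour\n2015-5-11,1\n"

# Without parse_dates, the Date column is kept as plain text.
df = pd.read_csv(io.StringIO(raw))
print(df["Date"].tolist())
```

If that is what is happening, the two files simply store their dates as different text, and opening them in a plain text editor would show the difference.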

Here is what the dataframes look like before the left-merge:

In [3]: df
Out[3]: 
         Date Hour    Make   Model  Gas Rating  Safety Rating
0   2015-5-11    1   Honda   Accord         9             8
1   2015-5-11    0   Toyota  Camry          9            10
2   2015-5-10   23   Chevy   Sonic          7             6
3   2015-5-12   13   Honda   Civic          8             7

In [4]: df2
Out[4]: 
         Date Hour    Make   Model  Mileage Rating  Speed Rating
0   5/11/2015    1   Honda  Accord              10             7
1   5/11/2015    0  Toyota   Camry              10             7
2   5/12/2015   23   Honda   Civic               9             6
3   5/13/2015   23   Honda   Civic               9             6
4   5/17/2015    7   Chevy  Impala

This is what happens when I try to left-merge:

In [5]: final = pd.merge(left=df, right=df2, how='left', on=['Date', 'Hour', 'Make', 'Model'])


In [6]: final
Out[6]: 
            Date Hour   Make   Model  Gas Rating  Safety Rating  Mileage Rating \
   0   2015-5-11    1  Honda   Accord         9             8           NaN   
   1   2015-5-11    0  Toyota  Camry          9            10           NaN     
   2   2015-5-10   23  Chevy   Sonic          7             6           NaN   
   3   2015-5-12   13  Honda   Civic          8             7           NaN   


     Speed Rating  
   0          NaN  
   1          NaN  
   2          NaN  
   3          NaN    
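The all-NaN result is what a merge on string keys produces when no key tuple is identical on both sides, since the comparison is exact text equality. A minimal sketch, assuming reduced two-row versions of df and df2 (the rows and the `check` name are illustrative only), uses the `indicator` argument of pd.merge to confirm that nothing matched:

```python
import pandas as pd

# Hypothetical miniature versions of df and df2.
df = pd.DataFrame({
    "Date": ["2015-5-11", "2015-5-10"],
    "Hour": [1, 23],
    "Make": ["Honda", "Chevy"],
    "Model": ["Accord", "Sonic"],
    "Gas Rating": [9, 7],
})
df2 = pd.DataFrame({
    "Date": ["5/11/2015", "5/12/2015"],
    "Hour": [1, 23],
    "Make": ["Honda", "Honda"],
    "Model": ["Accord", "Civic"],
    "Speed Rating": [7, 6],
})

keys = ["Date", "Hour", "Make", "Model"]

# indicator=True adds a "_merge" column saying where each row was found.
check = pd.merge(df, df2, how="left", on=keys, indicator=True)
print(check["_merge"].tolist())

# The same calendar day spelled two ways is two different strings.
print("2015-5-11" == "5/11/2015")
```

Every row is reported as found only on the left side, even though the first row describes the same day, hour, make and model in both frames.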

If I try merging without the 'Date' key, the data transfers correctly for the most part, but the result contains excess rows because of duplicates in both dataframes. It is also not accurate, because I only need rows where all four keys ('Date', 'Hour', 'Make', 'Model') match, plus everything that was already in df before the merge.

There will always be many more duplicates of Make/Model and Hour in df2, so I only want to left-merge the rows that match df, no matter how many duplicate instances there are within df. I also do not want to lose any data in df, so any dates in df that are not found in df2 should remain.
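The effect of dropping 'Date' from the key list can be sketched with a small example. The data here is hypothetical and already uses one shared date format; `loose` and `strict` are illustrative names.

```python
import pandas as pd

# Hypothetical data: one row in df, two rows in df2 that share
# Hour/Make/Model but fall on different dates.
df = pd.DataFrame({
    "Date": ["2015-05-12"],
    "Hour": [23],
    "Make": ["Honda"],
    "Model": ["Civic"],
    "Gas Rating": [8],
})
df2 = pd.DataFrame({
    "Date": ["2015-05-12", "2015-05-13"],
    "Hour": [23, 23],
    "Make": ["Honda", "Honda"],
    "Model": ["Civic", "Civic"],
    "Speed Rating": [6, 5],
})

# Three keys: each df row is paired with every df2 row sharing them.
loose = pd.merge(df, df2, how="left", on=["Hour", "Make", "Model"])

# Four keys: only the df2 row with the same date is attached.
strict = pd.merge(df, df2, how="left", on=["Date", "Hour", "Make", "Model"])

print(len(loose), len(strict))
```

With three keys the single df row is duplicated once per matching df2 row, and the two date columns come back suffixed as Date_x and Date_y. With all four keys the row count of df is preserved.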

If the 'Date' merge condition worked, this is the output I am trying to achieve:

In [7]: final
Out[7]: 
                Date Hour   Make   Model  Gas Rating  Safety Rating  Mileage Rating \
       0   5/11/2015    1  Honda   Accord         9             8            10   
       1   5/11/2015    0  Toyota  Camry          9            10            10     
       2   5/10/2015   23  Chevy   Sonic          7             6           NaN   
       3   5/12/2015   13  Honda   Civic          8             7             8   


          Speed Rating  
       0            7  
       1            7  
       2          NaN  
       3            7 

Does anyone have an idea why this is happening? I have even tried splitting the 'Date' column into 3 columns ('Month', 'Day', 'Year') and changing the dtype to int64, bool, and object, with no success there either. So I assume it has something to do with the format.

Thanks ahead of time Stack Overflow community!

Solution

Running the code below before the merge should put the dates into a common format so that the merge works properly.

import time

# time.strptime parses one string at a time, so apply it to each value.
# df holds year-month-day strings; df2 holds month/day/year strings.
df['Date'] = df['Date'].apply(lambda s: time.strftime('%Y-%m-%d', time.strptime(s, '%Y-%m-%d')))
df2['Date'] = df2['Date'].apply(lambda s: time.strftime('%Y-%m-%d', time.strptime(s, '%m/%d/%Y')))

It would have been nice to simply change one of the dates, but the Python time library adds a leading 0 to the month and day with the %m and %d directives. The %-m and %-d directives would not add the leading 0s, but they don't work across all systems. See here for more information on that oddity.
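As an alternative to the time module, the same normalization can be sketched with pd.to_datetime, which turns both columns into real datetime values so the merge compares dates rather than text. This is a sketch rather than the answer's own method: the two-row frames are hypothetical, and the format strings assume the columns look exactly as shown in the question.

```python
import pandas as pd

# Hypothetical miniature versions of df and df2.
df = pd.DataFrame({
    "Date": ["2015-5-11", "2015-5-10"],
    "Hour": [1, 23],
    "Make": ["Honda", "Chevy"],
    "Model": ["Accord", "Sonic"],
    "Gas Rating": [9, 7],
})
df2 = pd.DataFrame({
    "Date": ["5/11/2015", "5/12/2015"],
    "Hour": [1, 23],
    "Make": ["Honda", "Honda"],
    "Model": ["Accord", "Civic"],
    "Speed Rating": [7, 6],
})

# Parse each column with the format its file actually uses.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
df2["Date"] = pd.to_datetime(df2["Date"], format="%m/%d/%Y")

keys = ["Date", "Hour", "Make", "Model"]
final = pd.merge(df, df2, how="left", on=keys)

# Rebuild month/day/year text without leading zeros in a portable way,
# avoiding the platform-specific %-m and %-d directives.
final["Date"] = [f"{d.month}/{d.day}/{d.year}" for d in final["Date"]]
print(final)
```

The first row picks up its Speed Rating from df2, while the second row, which has no counterpart in df2, is kept with NaN.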
