Combine two pandas Data Frames (join on a common column)
Question
I have 2 dataframes:
restaurant_ids_dataframe
Data columns (total 13 columns):
business_id 4503 non-null values
categories 4503 non-null values
city 4503 non-null values
full_address 4503 non-null values
latitude 4503 non-null values
longitude 4503 non-null values
name 4503 non-null values
neighborhoods 4503 non-null values
open 4503 non-null values
review_count 4503 non-null values
stars 4503 non-null values
state 4503 non-null values
type 4503 non-null values
dtypes: bool(1), float64(3), int64(1), object(8)
and
restaurant_review_frame
Int64Index: 158430 entries, 0 to 229905
Data columns (total 8 columns):
business_id 158430 non-null values
date 158430 non-null values
review_id 158430 non-null values
stars 158430 non-null values
text 158430 non-null values
type 158430 non-null values
user_id 158430 non-null values
votes 158430 non-null values
dtypes: int64(1), object(7)
I would like to join these two DataFrames into a single dataframe using the DataFrame.join() command in pandas.
I have tried the following line of code:
#the following line of code creates a left join of restaurant_ids_frame and restaurant_review_frame on the column 'business_id'
restaurant_review_frame.join(other=restaurant_ids_dataframe,on='business_id',how='left')
But when I try this I get the following error:
Exception: columns overlap: Index([business_id, stars, type], dtype=object)
I am very new to pandas and have no clue what I am doing wrong as far as executing the join statement is concerned.
Any help would be much appreciated.
Recommended answer
You can use merge to combine two dataframes into one:
import pandas as pd
pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer')
where on specifies a field name that exists in both dataframes to join on, and how defines whether it is an inner/outer/left/right join, with outer using the 'union of keys from both frames (SQL: full outer join)'. Since both dataframes have a 'stars' column (and a 'type' column, as the error message shows), the merge will by default create suffixed columns such as stars_x and stars_y in the combined dataframe. As @DanAllan mentioned for the join method, you can modify the suffixes for merge by passing them as a kwarg. The default is suffixes=('_x', '_y'). If you wanted something like stars_restaurant_id and stars_restaurant_review, you can do:
pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer', suffixes=('_restaurant_id', '_restaurant_review'))
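To see what the how and suffixes arguments do, here is a minimal sketch on toy data. The two small frames and their values are made up for illustration and only mimic the shape of the frames in the question; they are not the asker's data.

```python
import pandas as pd

# Hypothetical toy frames: both have the key 'business_id' and an
# overlapping non-key column 'stars', like the frames in the question.
restaurants = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "name": ["Alpha", "Bravo", "Charlie"],
    "stars": [4.0, 3.5, 5.0],
})
reviews = pd.DataFrame({
    "business_id": ["a", "a", "b", "d"],
    "review_id": ["r1", "r2", "r3", "r4"],
    "stars": [5, 3, 4, 2],
})

# Outer merge with default suffixes: 'stars' is split into stars_x / stars_y.
default_merge = pd.merge(restaurants, reviews, on="business_id", how="outer")

# Same merge with custom suffixes for the overlapping column.
named_merge = pd.merge(
    restaurants, reviews, on="business_id", how="outer",
    suffixes=("_restaurant_id", "_restaurant_review"),
)

# Inner merge keeps only keys present in both frames ('a' and 'b' here).
inner_merge = pd.merge(restaurants, reviews, on="business_id", how="inner")

print(default_merge.columns.tolist())
print(named_merge.columns.tolist())
print(len(default_merge), len(inner_merge))
```

With how='outer', restaurant 'c' (no reviews) and review key 'd' (no matching restaurant) both survive with missing values filled in as NaN; with how='inner' they are dropped.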
The parameters are explained in detail in this link.
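For comparison, the join call from the question can probably be made to work as well. The sketch below rests on two assumptions that go beyond the answer: DataFrame.join matches the caller's on column against the index of the other frame, so the restaurant frame is indexed by business_id first, and the overlapping columns are disambiguated with lsuffix and rsuffix. The toy frames and the suffix names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames with an overlapping 'stars' column.
restaurants = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "name": ["Alpha", "Bravo", "Charlie"],
    "stars": [4.0, 3.5, 5.0],
})
reviews = pd.DataFrame({
    "business_id": ["a", "a", "b", "d"],
    "review_id": ["r1", "r2", "r3", "r4"],
    "stars": [5, 3, 4, 2],
})

# join() looks up reviews['business_id'] in the index of the other frame,
# so move business_id into the index of the restaurant frame.
restaurants_by_id = restaurants.set_index("business_id")

# Left join: every review is kept; the suffixes resolve the 'stars' overlap.
joined = reviews.join(
    restaurants_by_id,
    on="business_id",
    how="left",
    lsuffix="_review",
    rsuffix="_restaurant",
)

print(joined)
```

The review with business_id 'd' has no matching restaurant, so its name and stars_restaurant values are missing in the result.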