Python dataframe check if a value in a column dataframe is within a range of values reported in another dataframe
Problem description
Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.
I have two dataframes, and I need to add a column to the first dataframe that is True if a certain value of the first dataframe is between two values of the second dataframe, and False otherwise.
For example:
first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})
first_df
code1 code2
0 1 10
1 1 22
2 2 15
3 2 15
4 3 7
5 1 130
6 1 2
second_df
code1 code2_end code2_start
0 1 15 5
1 1 25 20
2 2 20 11
3 2 20 11
4 3 10 5
5 1 120 110
6 1 230 220
For each row in the first dataframe I need to check whether the value in the code2 column falls within one of the ranges defined by the rows of the second dataframe, second_df. For example, in row 1 of first_df, code1=1 and code2=22. Checking second_df, I have 4 rows with code1=1 (rows 0, 1, 5 and 6), and the value code2=22 is in the interval identified by code2_start=20 and code2_end=25, so the function should return True.
Now consider an example where the function should return False: in row 5 of first_df, code1=1 and code2=130, but there is no interval containing 130 where code1=1.
I have tried to use this function
def check(first_df,second_df):
    for i in range(len(first_df):
        return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()
and to vectorize it
first_df['output'] = np.vectorize(check)(first_df, second_df)
but obviously with no success.
I would be happy for any input you could provide.
thx.
A.
As a practical example:
first_df.code1[0] = 1
Therefore I need to find all the rows of second_df where
second_df.code1 == first_df.code1[0]
0 True
1 True
2 False
3 False
4 False
5 True
6 True
For rows 0, 1, 5 and 6, where the result is True, I need to check whether the value
first_df.code2[0]
10
falls within one of the ranges identified by
second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
code2_start code2_end
0 5 15
1 20 25
5 110 120
6 220 230
Since the value of first_df.code2[0] is 10, it is between 5 and 15, the range identified by row 0, so my function should return True. For first_df.code1[6] the value would still be 1, so the table of ranges would be the same as above, but first_df.code2[6] is 2 and there is no interval containing 2, so the result should be False.
first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
This works because when you do something like second_df.code2_start <= first_df.code2 you get a boolean Series. If you then perform a logical AND on two of these boolean Series, you get a Series which has value True where both Series were True and False otherwise.
Here's an example:
>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
1 2 3 output
0 2 4 6 True
1 3 6 9 True
2 4 8 10 True
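As a rough sketch, the same kind of element-wise comparison can be run on the frames from the question. It assumes the upper bound is tested with >=, and the name row_wise is only illustrative. Note that pandas aligns the Series on their index, so row i of first_df is compared with row i of second_df only; the other rows sharing the same code1 are not searched.

```python
import pandas as pd

first_df = pd.DataFrame({'code1': [1, 1, 2, 2, 3, 1, 1],
                         'code2': [10, 22, 15, 15, 7, 130, 2]})
second_df = pd.DataFrame({'code1': [1, 1, 2, 2, 3, 1, 1],
                          'code2_start': [5, 20, 11, 11, 5, 110, 220],
                          'code2_end': [15, 25, 20, 20, 10, 120, 230]})

# Row i of first_df is checked against the range in row i of second_df only
row_wise = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
print(row_wise.tolist())
```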
EDIT:
So based on your updated question and my new interpretation of your problem, I would do something like this:
import pandas as pd
# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])
# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()
# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)
What I do here is define a function called checkRange which takes as input x, a single row of df_1, and code_range, the entire df_2 DataFrame. It first finds the rows of code_range which have the same c1 value as the given row, x.c1. Then the non-matching rows are discarded. This is done in the first 2 lines:
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
Next, we get a boolean Series which tells us, for each range in the reduced code_range DataFrame, whether x.c2 falls within it:
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
Finally, since we only care that x.c2 falls within at least one of the ranges, we return the value of check.any(). When we call any() on a boolean Series, it returns True if any of the values in the Series are True.
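A small sketch of how any() behaves on a boolean Series, using made-up values. The empty case is worth noting: if no row of code_range matches x.c1, the filtered Series is empty and any() gives False, so checkRange reports no match.

```python
import pandas as pd

# One True among the values is enough for any() to be True
some_match = pd.Series([False, True, False])
print(some_match.any())

# All False: no range contained the value
no_match = pd.Series([False, False])
print(no_match.any())

# Empty Series, as when no row shares the same c1 value
empty = pd.Series([], dtype=bool)
print(empty.any())
```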
To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to pass checkRange the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) of the DataFrame.
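For larger frames, calling a Python function once per row with apply() can be slow. One possible alternative, which is not part of the answer above, is to merge the two frames on code1 and test all pairs at once. The sketch below uses the frames from the question; the names pairs and hit are just illustrative.

```python
import pandas as pd

first_df = pd.DataFrame({'code1': [1, 1, 2, 2, 3, 1, 1],
                         'code2': [10, 22, 15, 15, 7, 130, 2]})
second_df = pd.DataFrame({'code1': [1, 1, 2, 2, 3, 1, 1],
                          'code2_start': [5, 20, 11, 11, 5, 110, 220],
                          'code2_end': [15, 25, 20, 20, 10, 120, 230]})

# Pair every row of first_df with every range that shares its code1.
# reset_index() keeps the original row label in a column named 'index'.
pairs = first_df.reset_index().merge(second_df, on='code1', how='left')

# True where code2 lies inside the paired range (both bounds inclusive)
pairs['hit'] = pairs['code2'].between(pairs['code2_start'], pairs['code2_end'])

# A row of first_df passes if at least one of its paired ranges contains code2
first_df['output'] = pairs.groupby('index')['hit'].any()
print(first_df)
```

The merge builds one row per (value, candidate range) pair, so it trades memory for speed when many ranges share the same code1.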