Dropping NaN rows and certain columns in specific Excel files using glob/merge
Question
I am loading Excel files in a for loop and merging them. I would like to drop NaN rows in the final file, and drop the Company, Emails, and Created columns, which end up duplicated, from every file except the last one loaded.
Here is my current for loop, followed by the merge into a single DataFrame:
import glob
import re
from functools import reduce
import pandas as pd

all_users_sheets_hosts = []
for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    all_users_sheets_hosts.append(df)
    j = re.search(r'(\d+)', f)
    df.columns = df.columns.str.replace('.*Hosted Meetings.*', 'Hosted Meetings' + ' ' + j.group(1), regex=True)
all_users_sheets_hosts = reduce(lambda left,right: pd.merge(left,right,on=['First Name', 'Last Name'], how='outer'), all_users_sheets_hosts)
Here are the first few rows of the resulting DataFrame:
Company_x First Name Last Name Emails_x Created_x Hosted Meetings 03112016 Facilitated Meetings_x Attended Meetings_x Company_y Emails_y ... Created_x Hosted Meetings 04122016 Facilitated Meetings_x Attended Meetings_x Company_y Emails_y Created_y Hosted Meetings 04212016 Facilitated Meetings_y Attended Meetings_y
0 TS X Y X@Y.com 03/10/2016 0.0 0.0 0.0 TS X@Y.com ... 03/10/2016 0.0 0.0 2.0 NaN NaN NaN NaN NaN NaN
1 TS X Y X@Y.com 03/10/2016 0.0 0.0 0.0 TS X@Y.com ... 01/25/2016 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN
2 TS X Y X@Y.com 03/10/2016 0.0 0.0 0.0 TS X@Y.com ... 04/06/2015 9.0 10.0 17.0 NaN NaN NaN NaN NaN NaN
Recommended answer
To prevent multiple Company, Emails, Created, Facilitated Meetings and Attended Meetings columns, drop them from the right DataFrame. To remove rows with all NaN values, use result.dropna(how='all', axis=0):
import functools
import glob
import re

import pandas as pd

all_users_sheets_hosts = []
for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    all_users_sheets_hosts.append(df)
    # Use the first run of digits in the path (e.g. 03112016) as the column suffix
    j = re.search(r'(\d+)', f)
    # regex=True is needed in pandas 2.0+, where str.replace matches literally by default
    df.columns = df.columns.str.replace('.*Hosted Meetings.*',
                                        'Hosted Meetings' + ' ' + j.group(1),
                                        regex=True)

# Drop rows of all NaNs from the final DataFrame in `all_users_sheets_hosts`
all_users_sheets_hosts[-1] = all_users_sheets_hosts[-1].dropna(how='all', axis=0)

def mergefunc(left, right):
    # Remove the shared descriptive columns from `right` so they are not duplicated
    cols = ['Company', 'Emails', 'Created', 'Facilitated Meetings', 'Attended Meetings']
    right = right.drop(cols, axis=1)
    result = pd.merge(left, right, on=['First Name', 'Last Name'], how='outer')
    return result

all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)
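As a small illustration of why how='all' is used rather than the default, the sketch below compares the two settings on a made-up frame. The column names and values are hypothetical and not taken from the asker's files.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely NaN, row 2 is only partly NaN.
df = pd.DataFrame({
    "First Name": ["X", np.nan, "A"],
    "Hosted Meetings": [0.0, np.nan, np.nan],
})

# how='all' removes a row only when every value in it is NaN.
all_nan_dropped = df.dropna(how="all", axis=0)

# how='any' (the default) removes a row as soon as one value is NaN.
any_nan_dropped = df.dropna(how="any", axis=0)

print(len(all_nan_dropped))
print(len(any_nan_dropped))
```

With how='all', the partly filled row for "A" survives. The default would discard it along with the empty row.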
Since the Company et al. columns will exist only in the left DataFrame, those columns will not proliferate. Note, however, that if the left and right DataFrames have different values in those columns, only the values from the first DataFrame in all_users_sheets_hosts will be kept.
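The caveat about differing values can be seen in a minimal sketch that uses two in-memory frames in place of the Excel exports. The frames `march` and `april` and their contents are invented for illustration, and only the Company column is included to keep the example short.

```python
import functools

import pandas as pd

# Hypothetical stand-ins for two loaded exports.
march = pd.DataFrame({
    "First Name": ["X", "A"],
    "Last Name": ["Y", "B"],
    "Company": ["TS", "TS"],
    "Hosted Meetings 03112016": [0.0, 1.0],
})
april = pd.DataFrame({
    "First Name": ["X", "A"],
    "Last Name": ["Y", "B"],
    "Company": ["TS", "Acme"],  # differs from march for the user A B
    "Hosted Meetings 04122016": [2.0, 3.0],
})

def mergefunc(left, right):
    # Drop the shared descriptive column from the right frame before merging.
    right = right.drop(columns=["Company"])
    return pd.merge(left, right, on=["First Name", "Last Name"], how="outer")

merged = functools.reduce(mergefunc, [march, april])
print(merged.columns.tolist())
print(merged)
```

There is a single Company column in the result, and the "Acme" value from the second frame is silently discarded in favour of the value from the first.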
Alternatively, if the left and right DataFrames have the same values for the Company et al. columns, another option is to simply merge on those columns too:
def mergefunc(left, right):
    cols = ['First Name', 'Last Name', 'Company', 'Emails', 'Created',
            'Facilitated Meetings', 'Attended Meetings']
    result = pd.merge(left, right, on=cols, how='outer')
    return result

all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)
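The condition that the shared columns hold the same values matters for this variant. The sketch below, again with invented frames and only Company as the extra key, shows what an outer merge does when one user's Company differs between files.

```python
import pandas as pd

# Hypothetical frames; Company is part of the merge key here.
keys = ["First Name", "Last Name", "Company"]
left = pd.DataFrame({
    "First Name": ["X", "A"],
    "Last Name": ["Y", "B"],
    "Company": ["TS", "TS"],
    "Hosted Meetings 03112016": [0.0, 1.0],
})
right = pd.DataFrame({
    "First Name": ["X", "A"],
    "Last Name": ["Y", "B"],
    "Company": ["TS", "Acme"],  # mismatch for the user A B
    "Hosted Meetings 04122016": [2.0, 3.0],
})

# Rows match only when every key column agrees; with how='outer',
# unmatched rows from both sides are kept and padded with NaN.
merged = pd.merge(left, right, on=keys, how="outer")
print(merged)
```

No _x or _y suffixed columns appear, but the user whose Company differs is split into two rows, each with NaN in one of the Hosted Meetings columns. When the values do agree, every user stays on a single row.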