Choose which columns to concatenate in a loop over files with different numbers of columns


Problem description

I have a dictionary:

# file1 lists 2 columns while file2 lists 3
dict1 = {'file1': ['colA', 'colB'], 'file2': ['colY', 'colS', 'colX']}  # etc.

First, how should I build the dictionary so that it separates the columns destined for concatenation into a single column from the columns that must remain unaffected in the final dataframe?

The column names differ from file to file, which makes such a customized process very difficult to automate. What do you think?

For each file, I want to concatenate the listed columns into a new column. This should be automated.

for k, v in dict1.items():
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(k, v)) #reads to a df
    df['new'] = df.astype(str).apply(' '.join, axis=1)#concatenation

How can I make this work every time, regardless of how many columns each dictionary entry lists?

Example:

a = {'colA' : [123,124,112,165],'colB' :['alpha','beta','gamma','delta']}
file1 = pd.DataFrame(data = a)
file1

colA   colB
123    alpha
124    beta
112    gamma
165    delta

b = {'colY' : [123,124,112,165],'colS' :['alpha','beta','gamma','delta'], 'colX' :[323,326,378,399] }
file2 = pd.DataFrame(data = b)
file2

colY  colS      colX 
123   alpha     323
124   beta      326
112   gamma     378
165   delta     399

Desired result:

file1

col_all
123 alpha
124 beta
112 gamma
165 delta

file2

col_all
123 alpha 323
124 beta 326
112 gamma 378
165 delta 399

Note

file2, for example, could have five more columns, but only three should be concatenated into one column. How can I build the initial dict so that it defines which columns to concatenate and which simply stay in place, unaffected?

Recommended answer

You have to select the column names to concatenate, e.g. the first 3 columns selected by position:

for k, v in dict1.items():
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(k, v)) #reads to a df
    df['new'] = df.iloc[:, :3].astype(str).apply(' '.join, axis=1)#concatenation
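
As a quick check of the positional approach, the sketch below rebuilds the question's sample frames in memory instead of reading them through arcpy. The helper name `concat_first_n` and the extra column `colZ` are assumptions made for illustration; they do not appear in the original post.

```python
import pandas as pd

def concat_first_n(df, n, new_col="new", sep=" "):
    """Join the first n columns (by position) into one string column."""
    out = df.copy()
    # iloc slicing clips at the frame's width, so n may exceed the column count
    out[new_col] = out.iloc[:, :n].astype(str).apply(sep.join, axis=1)
    return out

file1 = pd.DataFrame({
    "colA": [123, 124, 112, 165],
    "colB": ["alpha", "beta", "gamma", "delta"],
})
file2 = pd.DataFrame({
    "colY": [123, 124, 112, 165],
    "colS": ["alpha", "beta", "gamma", "delta"],
    "colX": [323, 326, 378, 399],
    "colZ": [1.5, 2.5, 3.5, 4.5],  # hypothetical extra column, must stay untouched
})

result1 = concat_first_n(file1, 3)  # only 2 columns exist, both are joined
result2 = concat_first_n(file2, 3)  # colZ is not part of the joined string
print(result1["new"].tolist())
print(result2["new"].tolist())
```

Selecting by position only works when the columns to join always come first in each file; otherwise a label-based selection is the safer choice.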

If you create a list of possible column names, use intersection:

L = ['colA','colB','colS']  # candidate column names across all files
for k, v in dict1.items():
    # columns=v names the columns so they can be selected by label
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(k, v), columns=v) #reads to a df
    cols = df.columns.intersection(L)
    df['new'] = df[cols].astype(str).apply(' '.join, axis=1)#concatenation

Or filter with a boolean mask from isin:

L = ['colA','colB','colS']  # candidate column names across all files
for k, v in dict1.items():
    # columns=v names the columns so they can be selected by label
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(k, v), columns=v) #reads to a df
    mask = df.columns.isin(L)
    df['new'] = df.loc[:, mask].astype(str).apply(' '.join, axis=1)#concatenation
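
To see how one shared candidate list behaves across files with different columns, here is a minimal sketch on shortened in-memory versions of the sample frames. The names `wanted` and `concat_wanted` are hypothetical, and a file that matches none of the candidate names is not handled here.

```python
import pandas as pd

file1 = pd.DataFrame({"colA": [123, 124], "colB": ["alpha", "beta"]})
file2 = pd.DataFrame({
    "colY": [123, 124],
    "colS": ["alpha", "beta"],
    "colX": [323, 326],
})

# One candidate list shared by all files; each file uses only the names it has
wanted = ["colA", "colB", "colS"]

def concat_wanted(df, wanted, new_col="new", sep=" "):
    """Join only the columns whose names appear in `wanted`."""
    mask = df.columns.isin(wanted)  # boolean array aligned with df.columns
    out = df.copy()
    out[new_col] = out.loc[:, mask].astype(str).apply(sep.join, axis=1)
    return out

r1 = concat_wanted(file1, wanted)  # joins colA and colB
r2 = concat_wanted(file2, wanted)  # only colS matches; colY and colX stay as-is
print(r1)
print(r2)
```

With the `isin` mask, the joined values follow the column order of the dataframe, not the order of the candidate list.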

If you want another data structure that carries a second list with the columns to concatenate, one possible solution is a list of tuples:

L = [('file1', ['colA', 'colB'], ['colA','colB']), 
     ('file2', ['colY','colS','colX'], ['colY','colS'])]

for i, j, k in L:
    print (i)
    print (j)
    print (k)

file1
['colA', 'colB']
['colA', 'colB']
file2
['colY', 'colS', 'colX']
['colY', 'colS']

So your solution should be rewritten as:

for i, j, k in L:
    # columns=j names the columns so that df[k] can select them by label
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(i, j), columns=j) #reads to a df
    df['new'] = df[k].astype(str).apply(' '.join, axis=1)#concatenation
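
The same idea can also be expressed as a nested dictionary, which may read more clearly than positional tuples. The sketch below is one possible layout, not the answer's method: the `spec` structure with its `read` and `concat` keys is an assumption, and the `tables` dict of in-memory frames stands in for the arcpy reads so the example runs on its own.

```python
import pandas as pd

# Stand-in for the arcpy reads: file name -> table (sample data from the question)
tables = {
    "file1": pd.DataFrame({
        "colA": [123, 124, 112, 165],
        "colB": ["alpha", "beta", "gamma", "delta"],
    }),
    "file2": pd.DataFrame({
        "colY": [123, 124, 112, 165],
        "colS": ["alpha", "beta", "gamma", "delta"],
        "colX": [323, 326, 378, 399],
    }),
}

# Hypothetical spec: which columns to read, and which subset to concatenate
spec = {
    "file1": {"read": ["colA", "colB"], "concat": ["colA", "colB"]},
    "file2": {"read": ["colY", "colS", "colX"], "concat": ["colY", "colS"]},
}

results = {}
for name, cfg in spec.items():
    # Fail early if a column to concatenate was never read
    missing = set(cfg["concat"]) - set(cfg["read"])
    if missing:
        raise ValueError(f"{name}: columns not read: {sorted(missing)}")
    df = tables[name][cfg["read"]].copy()
    df["col_all"] = df[cfg["concat"]].astype(str).apply(" ".join, axis=1)
    results[name] = df

print(results["file2"])
```

Here `colX` is read for file2 but left out of `col_all`, which matches the note in the question about columns that should simply remain in the dataframe.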
