合并具有一个包含另一个(但不相同)的索引的数据帧 [英] Merge dataframes that have indices that one contains another (but not the same)

查看:209
本文介绍了合并具有一个包含另一个(但不相同)的索引的数据帧的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

例如df1的形状(533,2176),指数如 Elkford(5901003)DM 01010 , df2的形状为(743,12),指数如 5901003 ; df1索引括号中的数字将与df2的数字相匹配。而且正如形状所示,一些指数根本不匹配。现在我想要一个形状(533,2176 + 12)的数据集,即在增加列时保持匹配的行。

For example df1 has shape (533, 2176), indices such as Elkford (5901003) DM 01010, df2 has shape (743, 12), indices such as 5901003; the number in the bracket of indices of df1 will match that of df2. And as the shape has shown some indices don't match at all. And now I want a dataset that has shape (533, 2176+12) i.e. keeping the matched rows while augmenting the columns.

加载数据

import pandas as pd

from tabulate import tabulate

if __name__ == '__main__':
    # Read data
    census_subdivision_profile = pd.read_excel('../data/census_subdivision_profile.xlsx', sheetname='Data',
                                               index_col='Geography', encoding='utf-8').T
    print(tabulate(census_subdivision_profile.head(), headers="keys", index_col='CNSSSBDVSN', tablefmt='psql'))
    print(census_subdivision_profile.shape)

    census_subdivision_count = pd.read_csv('../data/augmented/census_subdivision.csv', encoding='utf-8')
    print(tabulate(census_subdivision_count.head(), headers='keys', tablefmt='psql'))
    print(census_subdivision_count.shape)

使用第一个答案我有错误:

Using the first answer I've got error:

Traceback (most recent call last):
  File "/Users/Chu/Documents/dssg/ongoing/economy_vs_tourism.py", line 26, in <module>
    census_subdivision_profile.index = census_subdivision_profile.index.map(extract_id)
  File "/anaconda/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 2727, in map
    mapped_values = self._arrmap(self.values, mapper)
  File "pandas/_libs/algos_common_helper.pxi", line 1212, in pandas._libs.algos.arrmap_object (pandas/_libs/algos.c:31954)
  File "/Users/Chu/Documents/dssg/ongoing/economy_vs_tourism.py", line 10, in extract_id
    return int(m.group(0)[1:-1])
ValueError: invalid literal for int() with base 10: 'Part 1) (5917054'

仅仅因为

Index([u'Canada (01)   20000',
       u'British Columbia / Colombie-Britannique (59)   21010',
       u'East Kootenay (5901)   01010', u'Elkford (5901003) DM 01010',
       u'Sparwood (5901006) DM 01010', u'Fernie (5901012) CY 01010',
       u'East Kootenay A (5901017) RDA 02020',
       u'East Kootenay B (5901019) RDA 01020', u'Cranbrook (5901022) CY 01011',
       u'Kimberley (5901028) CY 01010',

另一个是

Int64Index([5931813, 5941833, 5949832, 5919012, 5923033, 5924836, 5941016,
            5955040, 5923809, 5941801,

数据框太大抱歉我不能把它放在这里

The data frame is too large sorry I can't put it here

推荐答案

file1.csv

,col_1,col_2
5901001,a,-1
5901002,b,-2
5901003,c,-3
5901004,d,-4
5901005,e,-5
5901006,f,-6
5901007,g,-7
5901008,h,-8
5901009,i,-9
5901010,k,-10

这里 df1.shape =(10,2)

file2。 csv

,col_3
Elkford (Part 1) (5901003) DM 01010,1
Ahia (5901004) DM 01010,2
Canada (01)   20000,4
Fork (5901005) DM 01010,3
England (34)   20000,4

此处 df2.shape =(3,1)

运行此脚本:

import re

import pandas as pd
import numpy as np


def extract_id(s):
    m = re.search('\((\d{7})\)', s)
    if m:
        return int(m.group(1))


df1 = pd.read_csv('file1.csv', index_col=0)
df2 = pd.read_csv('file2.csv', index_col=0)


indexes = df2.index.map(extract_id)
mask = ~np.isnan(indexes)
# filter incorrect row (without id)
df2 = df2[mask]
# convert index
df2.index = indexes[mask]

df = pd.concat([df1, df2], axis=1)

print(df)

输出:

        col_1  col_2  col_3
5901001     a     -1    NaN
5901002     b     -2    NaN
5901003     c     -3    1.0
5901004     d     -4    2.0
5901005     e     -5    3.0
5901006     f     -6    NaN
5901007     g     -7    NaN
5901008     h     -8    NaN
5901009     i     -9    NaN
5901010     k    -10    NaN

此处 df.shape =(10,2 + 1)

这篇关于合并具有一个包含另一个(但不相同)的索引的数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆