Taking two or more data frames and extracting data on unique keys in Python


Problem description

Firstly, I have 2 data frames. In the first one, each row holds the name of a user followed by the pages he has liked, so the number of columns varies from person to person. The 1st column is the user name, and the pages liked by him are stored across the row, so the number of columns for 'random guy' will differ from 'mank nion'. 'BlackBuck', '500 Startups' etc. are names of pages. Let's say the name of this data frame is User_page:

random guy      BlackBuck            GiveMeSport    Analytics Ninja 
mank nion       DJ CHETAS            Celina Jaitly  Gurkeerat Singh
pop rajuel      WOW Editions         500 Startups   Biswapati Sarkar
Roshan ghai     MensXP               No Abuse       the smartian 

Now I have another data frame which is much the same as the one above, but in place of each page's name there is the page's category. As you may know, there are different categories of pages on fb; for example, 'BlackBuck''s category is 'Transport/Freight'. There are pages with the same name but different categories, which is why I can't use the name directly as a key. This is how the data frame looks; let's say its name is User_category:

random guy      Transport/Freight    Sport      Insurance Company 
mank nion       Arts/Entertainment   Actress    Actor/Director
pop rajuel      Concert Tour         App Page   Actor/Director
Roshan ghai     News/Media Website   Community  Public Figure  
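As a small illustration of why the page name alone is ambiguous as a merge key (a minimal sketch with hypothetical data, not taken from the frames above):

```python
import pandas as pd

# Hypothetical example: two different pages share the name "Phoenix".
pages = pd.DataFrame({'page_name': ['Phoenix', 'Phoenix'],
                      'category': ['Musician/Band', 'City']})
likes = pd.DataFrame({'user': ['random guy'], 'page_name': ['Phoenix']})

# Merging on the ambiguous name matches BOTH pages,
# duplicating the user's single like.
merged = likes.merge(pages, on='page_name')
print(len(merged))  # 2 rows for one like
```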

Now I have two more data frames. In the first one, the 1st column holds the name of an fb page and there are 162 more columns, one per tag: the (i, j) element is 1 if the ith page falls under the jth tag, and is left empty otherwise. Let's say the name of this data frame is Page_tag:

    name of page              tag 1        tag2        tag3
    BlackBuck                     1          1             
    GiveMeSport                   1                      1
    Analytics Ninja               1                      1
    DJ CHETAS                                1           1

The other one has the category names as its 1st column and the same 162 tag columns, like this. Let's say the name of this data frame is Category_tag:

    category_name             tag 1        tag2        tag3
    Sport                                     1           1
    App Page                      1                       1
    Actor/Director                1
    Public Figure                         1               1

Now I have to get the tag counts for each user from the pages he has liked. First I check whether a liked page exists in the Page_tag data frame (the 3rd one in my question); if it does, I take the tag counts, i.e. how many times each specific tag appears for that user. That is the first step. Since the number of pages in the Page_tag data frame is limited, for the pages that are left out I fall back to the page's category (from the 2nd data frame in this question), and for that category I count the user's tags from the data frame named Category_tag (the 4th data frame). Summing the tag counts, my output looks something like this:

username             tag1                   tag2           tag3 
random guy              1                      2             2 
mank nion               2                      1             3
pop rajuel              4                      0             2 
Roshan ghai             0                      2             1

The (i, j) element of this data frame shows the number of times the jth tag appears for the ith user. I have written code for this and more in R, but I am stuck on this particular step. The R code wasn't optimal because I used loops many times. I want to do this optimally, hopefully in pandas.

This is the code I have written so far, but I think it can be done in a more efficient way. PS: the actual numbers of users and columns are very large.

from io import StringIO
import pandas as pd
import numpy as np
# DATA FRAME IMPORT AND MELT
data1 = u'''
random guy,BlackBuck,GiveMeSport,Analytics Ninja
mank nion,DJ CHETAS,Celina Jaitly,Gurkeerat Singh
pop rajuel,WOW Editions,500 Startups,Biswapati Sarkar
Roshan ghai,MensXP,No Abuse,the smartian
'''

## reading and melting the datasheet of user name and page_liked
df1 = pd.read_csv(StringIO(data1), sep=",", header=None)

df1 = pd.melt(df1, id_vars=[0], value_vars=[1,2,3])[[0,'value']]
df1.columns = ['user', 'page_name']

data2 = u'''
random guy,Transport/Freight,Sport,Insurance Company
mank nion,Arts/Entertainment,Actress,Actor/Director
pop rajuel,Concert Tour,App Page,Actor/Director
Roshan ghai,News/Media Website,Community,Public Figure
'''


#reading and melting the data sheet of user name and category of page liked
df2 = pd.read_csv(StringIO(data2), sep=",", header=None)
df2 = pd.melt(df2, id_vars=[0], value_vars=[1,2,3])[[0,'value']]
df2.columns = ['user', 'categories']




data3 = u'''
page_name,tag1,tag2,tag3
BlackBuck,1,1,0
GiveMeSport,1,0,1
Gurkeerat Singh,1,0,1
DJ CHETAS,0,1,1
'''

##reading the meta data of page_name and tag
df3 = pd.read_csv(StringIO(data3), sep=",")

data4 = u'''
category,tag1,tag2,tag3
Sport,0,1,1
App Page,1,0,1
Actor/Director,1,0,0
Public Figure,0,1,1
'''

##reading the metadata of category and tag
df4 = pd.read_csv(StringIO(data4), sep=",")

##adding a column category in df1 based on index
category = df2['categories']
df1['category'] = category


##creating a list of page which i have in meta_data
meta_list = list(df3.iloc[:,0])

## creating two empty dataframes with column name as same as df1
new_df1=pd.DataFrame(columns=['user','page_name','category'])
new_df2=pd.DataFrame(columns=['user','page_name','category'])

#checking if the page is in meta_list: if it is, add the row to new_df1, else to new_df2
for i in range(len(df1)):
    if df1.iloc[i,1] in meta_list:
        x = df1.iloc[i]
        new_df1 = new_df1.append(x, ignore_index=True)
    else:
        y = df1.iloc[i]
        new_df2 = new_df2.append(y, ignore_index=True)


## merging new_df1 and new_df2 on page_name and category respectively
mdf1 = pd.merge(new_df1, df3, how='left', on=['page_name'])
mdf2 = pd.merge(new_df2, df4, how='left', on=['category'])

## concatenating mdf1 and mdf2 and summing the tags for each of them
finaldf = pd.concat([mdf1[['user', 'tag1', 'tag2', 'tag3']].groupby(['user']).agg(sum),
                     mdf2[['user', 'tag1', 'tag2', 'tag3']].groupby(['user']).agg(sum)]).reset_index()

## finally grouping on user and summing the tags for each user
finaldf1 = finaldf.groupby(['user']).agg(sum).reset_index()

Answer

Thank you for your code. Now it is clearer.

I tried to optimize your loops: I think you can use isin with any to build a mask for boolean indexing. I also simplified the code around concat:

##adding a column category in df1 based on index
df1['category'] =  df2['categories']

##creating a list of page which i have in meta_data
meta_list = list(df3.iloc[:,0])

mask = df1.isin(meta_list).any(axis=1)
new_df1 = df1[mask]
new_df2 = df1[~mask]

## merging new_df1 and new_df2 on page_name and category respectively
mdf1 = pd.merge(new_df1, df3, how='left', on='page_name')
mdf2 = pd.merge(new_df2, df4, how='left', on='category')
## concatenating mdf1 and mdf2
finaldf = pd.concat([mdf1, mdf2])
## finally grouping on user and summing the tags for each user
finaldf1 = finaldf.groupby('user', as_index=False).sum()
print (finaldf1)
          user  tag1  tag2  tag3
0  Roshan ghai   0.0   1.0   1.0
1    mank nion   1.0   1.0   2.0
2   pop rajuel   2.0   0.0   1.0
3   random guy   2.0   1.0   1.0
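As a minimal, self-contained sketch of the isin/any boolean-indexing split used above (the frame and meta_list here are tiny hypothetical stand-ins for df1 and the real metadata):

```python
import pandas as pd

# Tiny hypothetical stand-ins for df1 and meta_list.
df = pd.DataFrame({'user': ['a', 'b', 'c'],
                   'page_name': ['X', 'Y', 'Z']})
meta_list = ['X', 'Z']

# True for rows where any column value appears in meta_list.
mask = df.isin(meta_list).any(axis=1)

known = df[mask]      # pages found in the metadata -> merge with df3
unknown = df[~mask]   # pages left out -> fall back to the category lookup
print(list(known['page_name']), list(unknown['page_name']))  # ['X', 'Z'] ['Y']
```

Splitting once with a vectorized mask replaces the row-by-row `append` loop entirely.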

