How can I create a multiindex data frame with the following datasets?

Problem description

I have to create a multi-index data frame from data contained in two different data frames. For each index value of the second data frame (Date), and for each row of the first data frame, if the value in the Date column of the first data frame equals that index value, I want a multi-index data frame containing each date, the number of tweets published that day, and the features of each row.

This is the first data frame, with data from Twitter:

        Date            Full text   Retweets    Likes
333     2018-04-13  RT @Tesla...    2838             0
332     2018-04-13  @timkhiggins... 7722             40733
331     2018-04-13  @TheEconomist.. 1911             18634

This is the second data frame, with Tesla stock market data:

                Open        High     Low         Close  Volume       Gap
Date                        
2018-04-13  283.000000  296.859985   279.519989  294.089996 8569400  11.089996
2018-04-14  303.320000  304.940002   291.619995  291.970001 7286800  -11.349999
2018-04-25  287.760010  288.000000   273.420013  275.010010 8945800  -12.750000

This is what I tried:

for i in TeslaData.index:
    for row in sortedTweetsData.iterrows():
        if row[1]==i:
            NumTweetsByDay+=1
            for num in NumTweetsByDay:
                idx=pd.MultiIndex.from_product([[i],[NumTweetsBy]])
                colum=col
                df= pd.DataFrame(row,idx,column)

The output that I am looking for is the following:

Date        Number of Tweets    Full text       Retweets    Likes

2018-04-13        1              RT @Tesla...    2838        0
                  2              @timkhiggins... 7722        40733
                  3              @TheEconomist.. 1911        18634

Recommended answer

If I understand correctly, you want to keep only the Twitter rows whose date also has an entry in the stock dataset.

You can do this with isin():

# convert datatypes first:
sortedTweetsData['Date'] = pd.to_datetime(sortedTweetsData['Date'])
TeslaData.index = pd.to_datetime(TeslaData.index)

# do filtering
df = sortedTweetsData[sortedTweetsData['Date'].isin(TeslaData.index.values)]
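As a minimal sketch of what the filter does, the snippet below applies isin() to a few dates. The values and the names tweet_dates and stock_dates are made up for illustration; both sides are converted to datetimes first, as in the answer.

```python
import pandas as pd

# Hypothetical dates, converted to datetimes on both sides.
tweet_dates = pd.Series(pd.to_datetime(["2018-04-13", "2018-04-13", "2018-04-20"]))
stock_dates = pd.to_datetime(["2018-04-13", "2018-04-14", "2018-04-25"])

# isin() returns a boolean mask: True where the tweet date is also a stock date.
mask = tweet_dates.isin(stock_dates)

# Indexing with the mask keeps only the matching rows.
kept = tweet_dates[mask]
print(mask.tolist())
print(len(kept))
```

The mask has one entry per tweet row, so it can be used to index the full tweet data frame in the same way.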

Next, you can determine how many tweets each date group has:

groupsizes = df.groupby(by='Date').size()

Then use that to build a list of tuples that defines your multiindex (there is likely a more elegant way to do this):

tups = [(ix, gs + 1) for ix in groupsizes.index.values for gs in range(groupsizes[ix])]
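To make the shape of the tuple list concrete, here is a small sketch with made-up group sizes. It iterates with Series.items() instead of indexing by date, which is an equivalent rewrite chosen for readability, not the answer's exact code.

```python
import pandas as pd

# Hypothetical group sizes: two tweets on one date, one tweet on another.
groupsizes = pd.Series(
    [2, 1],
    index=pd.to_datetime(["2018-04-13", "2018-04-14"]),
)

# One (date, running number) tuple per tweet, numbered from 1 within each date.
tups = [
    (date, n + 1)
    for date, size in groupsizes.items()
    for n in range(size)
]
print(tups)
```

Each date appears once per tweet, paired with a counter that restarts at 1 for every new date.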

Finally:

df.index = pd.MultiIndex.from_tuples(tups, names=['Date', 'Number of Tweets'])
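One caveat with the steps above: groupby() sorts the dates, so the tuples come out in ascending date order, and assigning them to df.index only labels rows correctly if df is already ordered the same way. A possible alternative, sketched here under that concern, numbers the rows in place with groupby().cumcount() and then calls set_index(). The sample frames and the names tweets, stock, matched and result are hypothetical, modelled on the tables in the question (with one extra made-up tweet on a date that has no stock entry).

```python
import pandas as pd

# Hypothetical sample data modelled on the question's tables.
tweets = pd.DataFrame(
    {
        "Date": ["2018-04-13", "2018-04-13", "2018-04-13", "2018-04-20"],
        "Full text": ["RT @Tesla...", "@timkhiggins...", "@TheEconomist..", "(made-up tweet)"],
        "Retweets": [2838, 7722, 1911, 5],
        "Likes": [0, 40733, 18634, 1],
    },
    index=[333, 332, 331, 330],
)
stock = pd.DataFrame(
    {"Close": [294.089996, 291.970001, 275.010010]},
    index=pd.to_datetime(["2018-04-13", "2018-04-14", "2018-04-25"]),
)
stock.index.name = "Date"

# Convert the tweet dates so they compare equal to the stock index values.
tweets["Date"] = pd.to_datetime(tweets["Date"])

# Keep only tweets whose date appears in the stock index.
matched = tweets[tweets["Date"].isin(stock.index)].copy()

# Number the tweets within each date, starting at 1, in their current row order.
matched["Number of Tweets"] = matched.groupby("Date").cumcount() + 1

# Promote the two columns to a two-level index.
result = matched.set_index(["Date", "Number of Tweets"])
print(result)
```

Because cumcount() numbers rows where they stand, this version does not depend on the tweets being sorted by date beforehand.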
