How can I create a MultiIndex data frame with the following datasets?
Question
I have to create a MultiIndex data frame from data contained in two different data frames. For each index value of the second data frame (Date) and each row of the first data frame, if the value in the Date column of the first data frame equals the index of the second data frame, I want to build a MultiIndex data frame containing each date, the number of tweets published that day, and the features of each row.
This is the first data frame, with data from Twitter:
           Date        Full text  Retweets  Likes
333  2018-04-13     RT @Tesla...      2838      0
332  2018-04-13  @timkhiggins...      7722  40733
331  2018-04-13  @TheEconomist..      1911  18634
This is the second data frame, with data from the Tesla stock market:
                  Open        High         Low       Close   Volume         Gap
Date
2018-04-13  283.000000  296.859985  279.519989  294.089996  8569400   11.089996
2018-04-14  303.320000  304.940002  291.619995  291.970001  7286800  -11.349999
2018-04-25  287.760010  288.000000  273.420013  275.010010  8945800  -12.750000
This is what I tried to do:
for i in TeslaData.index:
    for row in sortedTweetsData.iterrows():
        if row[1]==i:
            NumTweetsByDay+=1
    for num in NumTweetsByDay:
        idx=pd.MultiIndex.from_product([[i],[NumTweetsBy]])
        colum=col
        df= pd.DataFrame(row,idx,column)
The output that I am looking for is the following:
Date        Number of Tweets  Full text        Retweets  Likes
2018-04-13  1                 RT @Tesla...         2838      0
            2                 @timkhiggins...      7722  40733
            3                 @TheEconomist..      1911  18634
Recommended answer
If I understand correctly, you want to filter the Twitter data by date, keeping only the tweets whose date also has an entry in the stock dataset.
You can do this with isin():
# convert datatypes first:
sortedTweetsData['Date'] = pd.to_datetime(sortedTweetsData['Date'])
TeslaData.index = pd.to_datetime(TeslaData.index)
# do filtering
df = sortedTweetsData[sortedTweetsData['Date'].isin(TeslaData.index.values)]
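To see what the filter keeps, here is a minimal sketch on a tiny sample. The rows, the reduced set of columns, and the extra 2018-04-16 tweet are made up for illustration; only the variable names come from the question.

```python
import pandas as pd

# Hypothetical sample data, loosely modeled on the frames in the question.
sortedTweetsData = pd.DataFrame(
    {
        "Date": ["2018-04-13", "2018-04-13", "2018-04-16"],
        "Full text": ["RT @Tesla...", "@timkhiggins...", "made-up tweet"],
        "Retweets": [2838, 7722, 10],
        "Likes": [0, 40733, 5],
    }
)
TeslaData = pd.DataFrame(
    {"Close": [294.089996, 291.970001]},
    index=pd.Index(["2018-04-13", "2018-04-14"], name="Date"),
)

# Convert both sides to datetime so the comparison is like-for-like.
sortedTweetsData["Date"] = pd.to_datetime(sortedTweetsData["Date"])
TeslaData.index = pd.to_datetime(TeslaData.index)

# True where the tweet's date also appears in the stock index.
mask = sortedTweetsData["Date"].isin(TeslaData.index)
filtered = sortedTweetsData[mask]
print(filtered)
```

In this sample the 2018-04-16 tweet is dropped because the stock frame has no row for that date.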
Next, you can determine how many tweets each group has:
groupsizes = df.groupby(by='Date').size()
and use that to build a list of tuples that defines your MultiIndex (there is likely a more elegant way to do this):
tups = [(ix, gs + 1) for ix in groupsizes.index.values for gs in range(groupsizes[ix])]
Finally:
df.index = pd.MultiIndex.from_tuples(tups, names=['Date', 'Number of Tweets'])
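One possible shorter route, offered as a sketch and not as the answer's own method, is to number the tweets within each date using groupby(...).cumcount() and then call set_index(). The sample rows and the names tweets, stock_dates, kept and result below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; the last row has a date missing from stock_dates.
tweets = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2018-04-13", "2018-04-13", "2018-04-13", "2018-04-16"]
        ),
        "Full text": [
            "RT @Tesla...",
            "@timkhiggins...",
            "@TheEconomist..",
            "made-up tweet",
        ],
        "Retweets": [2838, 7722, 1911, 10],
        "Likes": [0, 40733, 18634, 5],
    }
)
stock_dates = pd.to_datetime(["2018-04-13", "2018-04-14"])

# Keep only tweets whose date appears in the stock data.
kept = tweets[tweets["Date"].isin(stock_dates)].copy()

# cumcount() numbers rows 0, 1, 2, ... within each date; add 1 to start at 1.
kept["Number of Tweets"] = kept.groupby("Date").cumcount() + 1

# Move both columns into the index to get the two-level MultiIndex.
result = kept.set_index(["Date", "Number of Tweets"])
print(result)
```

Because cumcount() labels each row in place, this variant does not depend on the rows already being ordered by date. The tuple list above is generated in sorted date order, so it lines up with df only when df is sorted the same way.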