How do you transpose a dask dataframe (convert columns to rows) to approach tidy data principles


Problem Description


TLDR: I created a dask dataframe from a dask bag. The dask dataframe treats every observation (event) as a column. So, instead of having rows of data for each event, I have a column for each event. The goal is to transpose the columns to rows in the same way that pandas can transpose a dataframe using df.T.
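For reference, here is the pandas behavior I want to reproduce. The tiny frame below is made up purely for illustration:

import pandas as pd

# each event currently lives in its own column
wide = pd.DataFrame({'event1': [1, 'a'], 'event2': [2, 'b']},
                    index=['id', 'label'])

# .T swaps the axes, giving one row per event
tidy = wide.T
#         id label
# event1   1     a
# event2   2     b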


Details: I have sample twitter data from my timeline here. To get to my starting point, here is the code to read a JSON file from disk into a dask.bag and then convert it into a dask.dataframe:

import dask.bag as db
import dask.dataframe as dd
import json


# read the raw text, parsing each line of JSON into a dict
b = db.read_text('./sampleTwitter.json').map(json.loads)

# convert the bag of dicts into a dask dataframe
df = b.to_dataframe()
df.head()


The Problem: All my individual events (i.e. tweets) are recorded as columns rather than rows. In keeping with tidy data principles, I would like to have a row for each event. pandas has a transpose method for dataframes, and dask.array has a transpose method for arrays. My goal is to do the same transpose operation, but on a dask dataframe. How would I do that?
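One workaround I am aware of, if the data fits in memory, is to collect into pandas first and transpose there, though that defeats the point of using dask. A sketch of that fallback:

# fallback sketch: materialize to pandas, then transpose there
pdf = df.compute()   # pulls all partitions into one pandas DataFrame
tidy = pdf.T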






Edit for solution

This code resolves the original transpose problem, cleans Twitter json files by defining the columns you want to keep and dropping the rest, and creates a new column by applying a function to a Series. Then, we write a MUCH smaller, cleaned file to disk.

import glob
import os
import re

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

# pull in all files (glob does not expand '~', so expand it explicitly)
filenames = glob.glob(os.path.expanduser('~/sampleTwitter*.json'))

# one delayed pandas read per file, stitched into a single dask dataframe
dfs = [delayed(pd.read_json)(fn, 'records') for fn in filenames]
df = dd.from_delayed(dfs)

# see all the fields of the dataframe
fields = list(df.columns)

# identify the fields we want to keep
keepers = ['coordinates', 'id', 'user', 'created_at', 'lang']

# remove the keepers from the list, leaving only the fields to drop
for f in keepers:
    if f in fields:
        fields.remove(f)

# drop the fields we don't want and only keep what's necessary
df = df.drop(fields, axis=1)

# create a new column by applying a function to the coordinates Series
clean = df.coordinates.apply(
    lambda x: (x['coordinates'][0], x['coordinates'][1]),
    meta=('coords', tuple))
df['coords'] = clean

# making new filenames from old filenames to save cleaned files
newfilenames = []
for fn in filenames:
    newfilenames.append(re.search(r'(?<=/).+?(?=\.)', fn).group() + 'cleaned.json')

# custom saver function for dataframes using newfilenames
def saver(frame, filename):
    return frame.to_json('./' + filename)

# converting back to a list of delayed objects, one per partition
dfs = df.to_delayed()

# delayed(saver)(d, fn) defers the *call*; the original
# delayed(saver(df, fn)) would have invoked saver eagerly
writes = [delayed(saver)(d, fn) for d, fn in zip(dfs, newfilenames)]

# writing the cleaned, MUCH smaller objects back to disk
dd.compute(*writes)
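As a side note, newer dask versions can also write one JSON file per partition directly via to_json, which could replace the manual saver/delayed machinery above. The 'cleaned-*.json' name pattern below is just an example:

# each '*' in the pattern is replaced with the partition number
df.to_json('cleaned-*.json')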


Answer


I think you can get the result you want by bypassing bag altogether, with code like

import glob

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

filenames = glob.glob('sampleTwitter*.json')

# 'records' orientation makes each JSON record one row, so the frame
# is already tidy and no transpose is needed
dfs = [delayed(pd.read_json)(fn, 'records') for fn in filenames]
ddf = dd.from_delayed(dfs)
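Because pd.read_json(fn, 'records') already treats each JSON record as a row, the resulting dask dataframe comes out with one row per tweet. A quick check (assuming the files load as above):

ddf.head()   # one row per tweet, with the tweet fields as columns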

