How do you transpose a dask dataframe (convert columns to rows) to approach tidy data principles


Problem Description


TLDR: I created a dask dataframe from a dask bag. The dask dataframe treats every observation (event) as a column. So, instead of having rows of data for each event, I have a column for each event. The goal is to transpose the columns to rows in the same way that pandas can transpose a dataframe using df.T.
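For reference, here is the pandas behavior I want to reproduce. The tiny frame below is made up purely for illustration:

import pandas as pd

# each event currently lives in its own column
wide = pd.DataFrame({'event1': [1, 'a'], 'event2': [2, 'b']},
                    index=['id', 'label'])

# .T swaps the axes, giving one row per event
tidy = wide.T
#         id label
# event1   1     a
# event2   2     b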


Details: I have sample twitter data from my timeline here. To get to my starting point, here is the code to read a JSON file from disk into a dask.bag and then convert it into a dask.dataframe:

import dask.bag as db
import dask.dataframe as dd
import json


# read the raw text, parsing each line of JSON into a dict
b = db.read_text('./sampleTwitter.json').map(json.loads)

# convert the bag of dicts into a dask dataframe
df = b.to_dataframe()
df.head()


The Problem: All my individual events (i.e. tweets) are recorded as columns rather than rows. In keeping with tidy data principles, I would like to have a row for each event. pandas has a transpose method for dataframes, and dask.array has a transpose method for arrays. My goal is to do the same transpose operation, but on a dask dataframe. How would I do that?
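One workaround I am aware of, if the data fits in memory, is to collect into pandas first and transpose there, though that defeats the point of using dask. A sketch of that fallback:

# fallback sketch: materialize to pandas, then transpose there
pdf = df.compute()   # pulls all partitions into one pandas DataFrame
tidy = pdf.T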






Edit for solution

This code resolves the original transpose problem, cleans Twitter json files by defining the columns you want to keep and dropping the rest, and creates a new column by applying a function to a Series. Then, we write a MUCH smaller, cleaned file to disk.

import glob
import os
import re

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

# pull in all files (glob does not expand '~', so expand it explicitly)
filenames = glob.glob(os.path.expanduser('~/sampleTwitter*.json'))

# one delayed pandas read per file, stitched into a single dask dataframe
dfs = [delayed(pd.read_json)(fn, 'records') for fn in filenames]
df = dd.from_delayed(dfs)

# see all the fields of the dataframe
fields = list(df.columns)

# identify the fields we want to keep
keepers = ['coordinates', 'id', 'user', 'created_at', 'lang']

# remove the keepers from the list, leaving only the fields to drop
for f in keepers:
    if f in fields:
        fields.remove(f)

# drop the fields we don't want and only keep what's necessary
df = df.drop(fields, axis=1)

# create a new column by applying a function to the coordinates Series
clean = df.coordinates.apply(
    lambda x: (x['coordinates'][0], x['coordinates'][1]),
    meta=('coords', tuple))
df['coords'] = clean

# making new filenames from old filenames to save cleaned files
newfilenames = []
for fn in filenames:
    newfilenames.append(re.search(r'(?<=/).+?(?=\.)', fn).group() + 'cleaned.json')

# custom saver function for dataframes using newfilenames
def saver(frame, filename):
    return frame.to_json('./' + filename)

# converting back to a list of delayed objects, one per partition
dfs = df.to_delayed()

# delayed(saver)(d, fn) defers the *call*; the original
# delayed(saver(df, fn)) would have invoked saver eagerly
writes = [delayed(saver)(d, fn) for d, fn in zip(dfs, newfilenames)]

# writing the cleaned, MUCH smaller objects back to disk
dd.compute(*writes)
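As a side note, newer dask versions can also write one JSON file per partition directly via to_json, which could replace the manual saver/delayed machinery above. The 'cleaned-*.json' name pattern below is just an example:

# each '*' in the pattern is replaced with the partition number
df.to_json('cleaned-*.json')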


Answer


I think you can get the result you want by bypassing bag altogether, with code like

import glob

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

filenames = glob.glob('sampleTwitter*.json')

# 'records' orientation makes each JSON record one row, so the frame
# is already tidy and no transpose is needed
dfs = [delayed(pd.read_json)(fn, 'records') for fn in filenames]
ddf = dd.from_delayed(dfs)
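Because pd.read_json(fn, 'records') already treats each JSON record as a row, the resulting dask dataframe comes out with one row per tweet. A quick check (assuming the files load as above):

ddf.head()   # one row per tweet, with the tweet fields as columns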

