Function should clean data to half the size, instead it enlarges it by an order of magnitude


Problem Description



This has been driving me nuts all weekend. I am trying to merge data for different assets around a common timestamp. Each asset's data is a value in a dictionary. The data of interest is stored in lists in one column, so it needs to be separated first. Here is a sample of the unprocessed df:

data['Dogecoin'].head()

    market_cap_by_available_supply  price_btc   price_usd   volume_usd
0   [1387118554000, 3488670]    [1387118554000, 6.58771e-07]    [1387118554000, 0.000558776]    [1387118554000, 0.0]
1   [1387243928000, 1619159]    [1387243928000, 3.18752e-07]    [1387243928000, 0.000218176]    [1387243928000, 0.0]
2   [1387336027000, 2191987]    [1387336027000, 4.10802e-07]    [1387336027000, 0.000267749]    [1387336027000, 0.0]

Then I apply this function to separate market_cap_by_available_supply, saving its components into a new dataframe for that asset:

import pandas as pd
from pandas import DataFrame

data2 = {}
#sorting function
for coin in data:
    #separates timestamp and market cap from their respective list inside each element
    TS = data[coin].market_cap_by_available_supply.map(lambda r: r[0])
    cap = data[coin].market_cap_by_available_supply.map(lambda r: r[1])
    #creates a DataFrame and stores it in the dictionary under the coin's name
    df = DataFrame(columns=['timestamp','cap'])
    df.timestamp = TS
    df.cap = cap
    df.columns = ['timestamp', str(coin)+'_cap']
    data2[coin] = df
    #converts timestamp into datetime 'yyyy-mm-dd'
    data2[coin]['timestamp'] = pd.to_datetime(data2[coin]['timestamp'], unit='ms').dt.date

It seemed to work perfectly, producing the correct data. Sample:

data2['Namecoin'].head()
timestamp   Namecoin_cap
0   2013-04-28  5969081
1   2013-04-29  7006114
2   2013-04-30  7049003
3   2013-05-01  6366350
4   2013-05-02  5848626 
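As an aside, the same split can be done in one pass by expanding the list column directly — a minimal sketch, assuming data is loaded as above:

import pandas as pd

# Expand the [timestamp, cap] pairs into two columns in a single step.
pairs = data['Dogecoin']['market_cap_by_available_supply']
doge = pd.DataFrame(pairs.tolist(), columns=['timestamp', 'Dogecoin_cap'])
doge['timestamp'] = pd.to_datetime(doge['timestamp'], unit='ms').dt.date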

However, when I attempted to merge all the dataframes, I got a memory error. I've spent hours trying to find the root cause, and it seems the 'sorting function' above increases the size of the dataframe from 12 MB to 131 MB! It should do the opposite. Any ideas?

On a side note, here is the data: https://www.mediafire.com/?9pcwroe1x35nnwl

I open it with this pickle function:

with open("CMC_no_tuple_data.pickle", "rb") as myFile:
    data = pickle.load(myFile)

EDIT: Sorry for the typo in the pickle file name. @Goyo: to compute the size I simply saved data and data2 via pickle.dump and looked at their respective file sizes. @Padraic Cunningham: you used the sort function I provided and it produced a smaller file? That is not my case; I get a memory error when trying to merge the dataframes.
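For reference, both the pickled size and the in-memory size of a frame can be checked directly — a sketch, assuming data2 is built as above:

import pickle

# Compare on-disk pickle size with pandas' own deep memory accounting.
blob = pickle.dumps(data2['Bitcoin'])
print(len(blob))                                       # pickled size in bytes
print(data2['Bitcoin'].memory_usage(deep=True).sum())  # in-memory size in bytes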

Solution

When you merge your dataframes, you are joining on values that are not unique, so each join produces many matches. As you add more and more currencies, the result grows like a Cartesian product rather than a join.
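A toy example (not part of the original answer) makes the fan-out concrete: with two rows per key on each side, a single merge already returns four rows for that key, and the growth compounds with every additional join.

import pandas as pd

# Each 'a' on the left pairs with each 'a' on the right: 2 x 2 = 4 rows.
left = pd.DataFrame({'timestamp': ['a', 'a'], 'x': [1, 2]})
right = pd.DataFrame({'timestamp': ['a', 'a'], 'y': [3, 4]})
print(len(pd.merge(left, right, on='timestamp', how='left')))  # 4

In the snippet below, I added code to sort the values and then remove the duplicates before merging.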

from pandas import Series, DataFrame
import pandas as pd

coins='''
Bitcoin
Ripple
Ethereum
Litecoin
Dogecoin
Dash
Peercoin
MaidSafeCoin
Stellar
Factom
Nxt
BitShares
'''
coins = [c for c in coins.split('\n') if c]  # drop the empty strings from the surrounding newlines
API = 'https://api.coinmarketcap.com/v1/datapoints/'
data = {}

for coin in coins:
    print(coin)
    try:
        data[coin] = pd.read_json(API + coin)
    except Exception:
        pass
data2 = {}
for coin in data:
    TS = data[coin].market_cap_by_available_supply.map(lambda r: r[0])
    TS = pd.to_datetime(TS,unit='ms').dt.date
    cap = data[coin].market_cap_by_available_supply.map(lambda r: r[1])
    df = DataFrame(columns=['timestamp','cap'])
    df.timestamp = TS
    df.cap = cap
    df.columns = ['timestamp',coin+'_cap']
    # sort so keep='last' retains the largest cap for each timestamp, then dedupe
    df = df.sort_values(by=['timestamp', coin+'_cap'])
    df = df.drop_duplicates(subset='timestamp', keep='last')
    data2[coin] = df

df = data2['Bitcoin']
keys = list(data2.keys())  # list() so we can remove() under Python 3
keys.remove('Bitcoin')
for coin in keys:
    df = pd.merge(left=df,right=data2[coin],left_on='timestamp', right_on='timestamp', how='left')
    print(len(df), len(df.columns))
df.to_csv('caps.csv')
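The chain of left joins can also be written as a fold over the per-coin frames — a minimal sketch, assuming data2 has been built as in the snippet above:

from functools import reduce
import pandas as pd

# Fold every per-coin frame into one wide frame keyed on timestamp.
frames = list(data2.values())
caps = reduce(lambda left, right: pd.merge(left, right, on='timestamp', how='left'), frames)
caps.to_csv('caps.csv')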

EDIT: I have added tables below showing how the size of the result grows as you do the join operations.

This table shows the number of rows after joining 5, 10, 15, 20, 25, and 30 currencies.

Rows, Columns
1015, 5
1255, 10
5095, 15
132071, 20
4195303, 25
16778215, 30

This table shows how removing the duplicates makes each join match only a single row.

Rows, Columns
1000, 5
1000, 10
1000, 15
1000, 20
1000, 25
1000, 30
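A cheap safeguard against this kind of fan-out is to assert that the join key is unique in every frame before merging — a sketch, not part of the original answer:

# Fail fast if any frame still carries duplicate timestamps.
for coin, frame in data2.items():
    assert frame['timestamp'].is_unique, coin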
