Sampling groups in Pandas

Views: 77 | Published: 2020/5/24 1:56:32 | Tags: python, pandas

Question

Say I want to do a stratified sample from a dataframe in Pandas so that I get 5% of rows for every value of a given column. How can I do that?

For example, in the dataframe below, I would like to sample 5% of the rows associated with each value of the column Z. Is there any way to sample groups from a dataframe loaded in memory?

More generally, what if I have this dataframe on disk in a huge file (e.g. an 8 GB csv file)? Is there any way to do this sampling without having to load the entire dataframe into memory?

Recommended answer

How about loading only the 'Z' column into memory using the 'usecols' option of read_csv? Say the file is sample.csv. If the file has many columns, reading just one of them should use much less memory. Then, assuming that single column fits into memory, I think this will work for you.

import numpy as np
import pandas as pd

stratfraction = 0.05
# Load only the Z column
df = pd.read_csv('sample.csv', usecols=['Z'])
gp = df.groupby('Z')
# Number of rows to take per value of Z, rounded up so every group gets at least one
n_per_group = np.ceil(gp.size() * stratfraction).astype(int)
# Collect the index labels of the first n rows of each group (not a random draw)
stratsample = []
for key, idx in gp.groups.items():
    stratsample.extend(idx[:n_per_group.loc[key]])
# read_csv has no "rows to keep" option, so build the list of rows to skip instead.
# skiprows counts file lines from 0 and line 0 is the header, so data row i is file line i + 1.
RowsToSkip = sorted(set(df.index + 1).difference(i + 1 for i in stratsample))
# Load only the requested rows (no idea how well this works for a really giant list though)
df3 = pd.read_csv('sample.csv', skiprows=RowsToSkip)
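
The code above keeps the first rows of each group. If a random draw per group is wanted instead, the same two-pass idea can be sketched as below. This is only a sketch under some assumptions: the function names `sample_groups_in_memory` and `sample_groups_from_csv` and the `seed` parameter are made up for illustration, and the file is assumed to be a plain CSV with a single header line and one record per line. It uses the callable form of `skiprows` so that no large skip list has to be built.

```python
import numpy as np
import pandas as pd


def sample_groups_in_memory(df, column, frac, seed=None):
    """Randomly draw about `frac` of the rows for each value of `column`."""
    rng = np.random.RandomState(seed)
    # Round up so that every group contributes at least one row.
    sizes = np.ceil(df.groupby(column).size() * frac).astype(int)
    parts = [
        group.sample(n=int(sizes.loc[key]), random_state=rng)
        for key, group in df.groupby(column)
    ]
    # Restore the original row order.
    return pd.concat(parts).sort_index()


def sample_groups_from_csv(path, column, frac, seed=None):
    """Two passes over the file: pick row numbers first, then load only those rows."""
    # Pass 1: read only the grouping column and choose row positions per group.
    keys = pd.read_csv(path, usecols=[column])
    chosen = sample_groups_in_memory(keys, column, frac, seed).index
    # Data row i sits on file line i + 1 because line 0 is the header.
    keep = {int(i) + 1 for i in chosen}
    # Pass 2: skip every data line that was not chosen, but never the header.
    return pd.read_csv(path, skiprows=lambda line: line > 0 and line not in keep)
```

For a dataframe already in memory, `sample_groups_in_memory(df, 'Z', 0.05)` covers the first part of the question, and `sample_groups_from_csv('sample.csv', 'Z', 0.05)` covers the on-disk case. On pandas 1.1 or later, `df.groupby('Z').sample(frac=0.05)` is a shorter in-memory alternative, although its rounding for very small groups may differ from the round-up used here.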
