What is the return value of map_partitions?
Question
The dask API says that map_partitions can be used to "apply a Python function on each DataFrame partition." From this description, and given the usual behaviour of "map", I would expect the return value of map_partitions to be (something like) a list whose length equals the number of partitions. Each element of the list should be one of the return values of the function calls.
However, with the following code, I am not sure what the return value depends on:
import numpy as np
import pandas as pd
import dask.dataframe as dd

# generate example dataframe
pdf = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(pdf, npartitions=3)

# define helper function for map_partitions; VAL is the return value
VAL = pd.Series({'A': 1})
# VAL = pd.DataFrame({'A': [1]})  # other return values used in this example
# VAL = None
# VAL = 1

def helper(x):
    print('function called\n')
    return VAL

# check result
out = ddf.map_partitions(helper).compute()
print(len(out))
- VAL = pd.Series({'A': 1}) causes 4 function calls (probably one to infer the dtype and 3 for the partitions) and an output with len == 3 and type pd.Series.
- VAL = pd.DataFrame({'A': [1]}) results in the same numbers, but the resulting type is pd.DataFrame.
- VAL = None causes a TypeError. Why? Couldn't a possible use of map_partitions be to do something rather than to return something?
- VAL = 1 results in only 2 function calls. The result of map_partitions is the integer 1.
Therefore, I want to ask some questions:
- How is the return value of map_partitions determined?
- What influences the number of function calls besides the number of partitions? What criteria does a function have to fulfil to be called once with each partition?
- What should be the return value of a function that only "does" something, i.e. a procedure?
- How should a function that returns arbitrary objects be designed?
Recommended answer
The Dask DataFrame.map_partitions function returns a new Dask DataFrame or Series, based on the output type of the mapped function. See the API documentation for a thorough explanation.
How is the return value of map_partitions determined?
See the API docs referred to above.
What influences the number of function calls besides the number of partitions? What criteria does a function have to fulfil to be called once with each partition?
You're correct that we're calling it once immediately to guess the dtypes/columns of the output. You can avoid this by specifying the meta= keyword directly. Other than that, the function is called once per partition.
What should be the return value of a function that only "does" something, i.e. a procedure?
You could always return an empty dataframe. You might also want to consider transforming your dataframe into a sequence of dask.delayed objects, which are more commonly used for ad-hoc computations.
How should a function that returns arbitrary objects be designed?
If your function doesn't return series/dataframes, then I recommend converting your dataframe to a sequence of dask.delayed objects with the DataFrame.to_delayed method.