Load images into a Dask Dataframe


Problem description

I have a Dask dataframe that stores image paths in one column (called img_paths). Next, I want to use those paths to load the images into another column (called img_loaded) and then apply some pre-processing functions.

However, the loading (image reading) step gives me a different result with each approach: once I get the delayed imread function wrapped instead of the image, once the images load correctly (I can see the arrays), and the rest of the time I get a FileNotFoundError.

Besides the examples below, I have also tried the map_partitions function, but I end up with similar outputs, except that I never get the arrays. In the end, I would like to use map_partitions rather than apply.

Here is my code, with a description of each problem:

import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
from skimage.io import imread

imgs = ['https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/so/so-logo.png?v=9c558ec15d8a'] * 42

# create a pandas dataframe using image paths
df = pd.DataFrame({"img_paths": imgs})

# convert it into dask dataframe
ddf = dd.from_pandas(df, npartitions=2)

# convert imread function as delayed
delayed_imread = dask.delayed(imread, pure=True)

First try: use a lambda function to apply the delayed imread to each cell

ddf["img_loaded"] = ddf.img_paths.apply(lambda x: delayed_imread(x))
ddf.compute()

Here, calling compute() returns the delayed imread wrappers rather than the loaded images. I do not understand why.

Second try: pass the delayed imread function directly to apply

ddf["img_loaded"] = ddf.img_paths.apply(delayed_imread)
ddf.compute()

This worked! At least, I can see the loaded images as arrays. But I do not understand why this behaves differently from the first attempt (the one using the lambda function).

Third try: apply the plain, non-delayed imread function

ddf["load"] = ddf.img_paths.apply(imread)  # or: lambda x: imread(x)
ddf.compute()

Here, purely as an experiment, I used the plain skimage.io.imread function instead of the delayed one, both with and without a lambda. Each time I got a FileNotFoundError, which I do not understand. Why can't it find the images at the given paths (which are correct) when I use the non-delayed imread function?

Fourth try: map_partitions with the plain imread function

ddf["img_loaded"] = ddf.map_partitions(lambda df: df.img_paths.apply(lambda x: imread(x)), meta=("images", np.uint8)).compute()
ddf.compute()

Recommended answer

Solution

import pandas as pd
import dask
import dask.dataframe as dd
import numpy as np
from skimage.io import imread

imgs = ['https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/so/so-logo.png?v=9c558ec15d8a'] * 4

# create a pandas dataframe using image paths
df = pd.DataFrame({"img_paths": imgs})

# convert it into dask dataframe
ddf = dd.from_pandas(df, npartitions=2)

# convert imread function as delayed
delayed_imread = dask.delayed(imread, pure=True)

# tell Dask the output type of the function via meta
ddf['img_paths'].apply(imread, meta=('img_loaded', np.uint8)).compute()

# OR wrap it in dask.delayed, in which case the output type is inferred as `object`
ddf['img_paths'].apply(delayed_imread).compute()

Explanation

If you apply the print function without computing anything, you can see the reason for the FileNotFoundError raised by the non-delayed apply(imread) call in the question:

ddf['img_paths'].apply(print)

Output:

> foo
> foo

When you add an apply call to the graph without specifying meta, Dask runs the function on the dummy string foo to infer the output type, so imread ends up trying to open a file named foo.

To get a better understanding, I encourage you to try:

ddf.apply(print, axis=1)

and try to predict what gets printed.

The reason the lambda version returns delayed wrappers is that apply expects a function reference, which it then calls. By creating a lambda function that calls the delayed function, you are basically double-delaying your function.
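
To make the notion of a delayed wrapper concrete, here is a minimal sketch using dask.delayed on its own. The function `load` is a hypothetical stand-in for imread; the point is only that calling a delayed function returns a lazy Delayed object, which stays lazy until it is computed explicitly.

```python
import dask
from dask.delayed import Delayed


def load(path):
    # Hypothetical stand-in for imread: returns the length of the path.
    return len(path)


delayed_load = dask.delayed(load, pure=True)

# Calling the delayed function does not run `load`; it returns a lazy task.
task = delayed_load("a.png")
print(type(task).__name__)

# The task is only evaluated when it is computed explicitly.
value = task.compute()

# Several lazy tasks can be evaluated together with dask.compute,
# which returns a tuple of results.
tasks = [delayed_load(p) for p in ["a.png", "bb.png"]]
values = dask.compute(*tasks)
print(value, values)
```

A column that ends up holding such Delayed objects therefore needs this extra compute step before the arrays become visible.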
