Does it make sense to use Google DataFlow/Apache Beam to parallelize image processing or crawling tasks?

Question

I am considering Google DataFlow as an option for running a pipeline that involves steps like:

  1. Downloading images from the web;
  2. Processing the images.

I like that DataFlow manages the lifetime of the VMs required to complete the job, so I don't need to start or stop them myself, but all the examples I have come across use it for data-mining kinds of tasks. I wonder whether it is a viable option for other batch tasks like image processing and crawling.

Answer

This use case is a possible application for Dataflow/Beam.

If you want to do this in a streaming fashion, you could have a crawler generating URLs and adding them to a PubSub or Kafka queue, and code a Beam pipeline to do the following (a minimal sketch follows the list):

  1. Read from PubSub
  2. Download the website contents in a ParDo
  3. Parse image URLs from the website in another ParDo*
  4. Download each image and process it, again with a ParDo
  5. Store the results in GCS, BigQuery, or elsewhere, depending on what information you need from the images.

You can do the same with a batch job, just changing the source you're reading the URLs from.
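
For example, if the URLs lived in a text file on GCS (path hypothetical), the PubSub read in the sketch above could be swapped for a bounded source, reusing the same functions:

```python
# Batch variant: ReadFromText yields strings directly, so no decode step
# is needed; everything downstream of the read is unchanged.
with beam.Pipeline(options=PipelineOptions(save_main_session=True)) as p:
    (p
     | 'ReadURLs' >> beam.io.ReadFromText('gs://my-bucket/urls.txt')
     | 'DownloadPage' >> beam.Map(download_page)
     | 'ParseImageURLs' >> beam.FlatMap(parse_image_urls)
     | 'ProcessImage' >> beam.Map(process_image)
     | 'Store' >> beam.io.WriteToBigQuery(
           'my-project:images.metadata',
           schema='url:STRING,size_bytes:INTEGER'))
```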

*After parsing those image URLs, you may also want to reshuffle your data, to gain some parallelism.
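
In Beam this corresponds to the built-in Reshuffle transform. Inserted between the parsing and processing steps of the sketch above, it breaks fusion so the expensive image downloads get redistributed across workers instead of staying on whichever worker parsed the page:

```python
# Excerpt of the pipeline above, with a Reshuffle added after parsing
# to redistribute image URLs across workers before the heavy step.
     | 'ParseImageURLs' >> beam.FlatMap(parse_image_urls)
     | 'Reshuffle' >> beam.Reshuffle()
     | 'ProcessImage' >> beam.Map(process_image)
```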
