Google Cloud Data Fusion -- building pipeline from REST API endpoint source


Problem description

Attempting to build a pipeline to read from a 3rd party REST API endpoint data source.

I am using the HTTP (version 1.2.0) plugin found in the Hub.

The URL of the request is: https://api.example.io/v2/somedata?return_count=false

Sample response body:

{
  "paging": {
    "token": "12456789",
    "next": "https://api.example.io/v2/somedata?return_count=false&__paging_token=123456789"
  },
  "data": [
    {
      "cID": "aerrfaerrf",
      "first": true,
      "_id": "aerfaerrfaerrf",
      "action": "aerrfaerrf",
      "time": "1970-10-09T14:48:29+0000",
      "email": "example@aol.com"
    },
    {...}
  ]
}

The main error in the logs is:

java.lang.NullPointerException: null
    at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.getNextPage(BaseHttpPaginationIterator.java:118) ~[1580429892615-0/:na]
    at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.ensurePageIterable(BaseHttpPaginationIterator.java:161) ~[1580429892615-0/:na]
    at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.hasNext(BaseHttpPaginationIterator.java:203) ~[1580429892615-0/:na]
    at io.cdap.plugin.http.source.batch.HttpRecordReader.nextKeyValue(HttpRecordReader.java:60) ~[1580429892615-0/:na]
    at io.cdap.cdap.etl.batch.preview.LimitingRecordReader.nextKeyValue(LimitingRecordReader.java:51) ~[cdap-etl-core-6.1.1.jar:na]
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232]

Possible issues

After trying to troubleshoot this for a while, I'm thinking the issue might be with

  • The Data Fusion HTTP plugin has a lot of methods to deal with pagination
    • Based on the response body above, it seems like the best option for Pagination Type is Link in Response Body
    • For the required Next Page JSON/XML Field Path parameter, I've tried $.paging.next and paging/next. Neither works.
    • I have verified that the link in /paging/next works when opened in Chrome
  • When simply trying to view the response URL in Chrome, a prompt pops up asking for a username and password
    • Only the API key needs to be entered as the username to get past this prompt in Chrome
    • To do this in the Data Fusion HTTP plugin, the API key is used as the Username in the Basic Authentication section (a standalone sketch reproducing both the auth and the pagination follows this list)
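For what it's worth, the request and pagination behavior described above can be reproduced outside Data Fusion. Below is a minimal Python sketch (not the plugin's actual implementation), assuming the hypothetical api.example.io endpoint and the API-key-as-username Basic Authentication described above; it follows paging.next the same way a Link in Response Body pagination type would, which helps confirm whether $.paging.next actually resolves on every page.

import requests

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder; used as the Basic Auth username
url = "https://api.example.io/v2/somedata?return_count=false"

while url:
    resp = requests.get(url, auth=(API_KEY, ""))  # API key as username, empty password
    resp.raise_for_status()
    body = resp.json()

    for record in body.get("data", []):
        print(record)

    # Manual equivalent of the plugin's "$.paging.next" field path. A missing or
    # null "next" on the last page must terminate the loop; an unguarded null here
    # is the kind of value that could surface as the NullPointerException seen in
    # BaseHttpPaginationIterator.getNextPage.
    url = (body.get("paging") or {}).get("next")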

Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?

Recommended answer

In answer to:

Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?

This is not the optimal way to achieve this. The better approach would be to ingest the data (see the Service APIs Overview) into Pub/Sub, and then use Pub/Sub as the source for your pipeline. This provides a simple and reliable staging location for your data on its way to processing, storage, and analysis; see the documentation for the Pub/Sub API. To use this in conjunction with Dataflow, the steps to follow are in the official documentation here: Using Pub/Sub with Dataflow.
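As a rough illustration of that suggestion, the sketch below publishes each record from the hypothetical api.example.io endpoint to a Pub/Sub topic using the google-cloud-pubsub client; the project ID, topic name, and API key are placeholders, not values from the question. A Data Fusion (or Dataflow) pipeline would then read from that topic instead of calling the REST API directly.

import json

import requests
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

PROJECT_ID = "your-project"   # hypothetical placeholder
TOPIC_ID = "somedata-ingest"  # hypothetical placeholder
API_KEY = "YOUR_API_KEY"      # hypothetical placeholder; Basic Auth username as above

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

url = "https://api.example.io/v2/somedata?return_count=false"
while url:
    body = requests.get(url, auth=(API_KEY, "")).json()
    for record in body.get("data", []):
        # Pub/Sub messages are raw bytes, so serialize each record as JSON.
        future = publisher.publish(topic_path, json.dumps(record).encode("utf-8"))
        future.result()  # block until the message is accepted by Pub/Sub
    url = (body.get("paging") or {}).get("next")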

