Pentaho Hadoop File Input

Problem Description

I'm trying to retrieve data from a standalone Hadoop (version 2.7.2, with default configuration properties) HDFS using Pentaho Kettle (version 6.0.1.0-386). Pentaho and Hadoop are not on the same machine, but each machine can reach the other.

I created a new "Hadoop File Input" with the following properties:

Environment | File/Folder | Wildcard | Required | Include subfolders
            | url-to-file |          | N        | N

url-to-file is built like: ${PROTOCOL}://${USER}:${PASSWORD}@${IP}:${PORT}${PATH_TO_FILE}

eg: hdfs://hadoop:@the_ip:50010/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt

Password is empty

I checked, and the file exists in HDFS and downloads correctly via the web manager and the Hadoop command line.
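
As a side check (not part of the original post), the same file can be probed from the Pentaho machine outside of Kettle with a small Hadoop client program. This is only a sketch: the NameNode port below is an assumption (it has to be whatever fs.default.name / fs.defaultFS points to on the cluster, not necessarily 8020), and the host and path are the placeholders from the URL above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode RPC endpoint; adjust to the cluster's fs.defaultFS value.
        conf.set("fs.defaultFS", "hdfs://the_ip:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/hadoop/red_libelium/Ikusi/"
                + "libelium_waspmote_AC_2_libelium_waspmote/"
                + "libelium_waspmote_AC_2_libelium_waspmote.txt");

            // exists() and getFileStatus() go through the NameNode RPC interface,
            // the same interface Kettle uses when it lists the input files.
            System.out.println("exists: " + fs.exists(file));
            FileStatus status = fs.getFileStatus(file);
            System.out.println("length: " + status.getLen() + " bytes");
        }
    }
}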

Scenario A) When I use ${PROTOCOL} = hdfs and ${PORT} = 50010, I get errors in both the Pentaho and Hadoop consoles:

Pentaho:

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016/04/05 15:23:46 - FileInputList - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : org.apache.commons.vfs2.FileSystemException: Could not list the contents of folder "hdfs://hadoop@172.21.0.35:50010/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt".
2016/04/05 15:23:46 - FileInputList -   at org.apache.commons.vfs2.provider.AbstractFileObject.getChildren(AbstractFileObject.java:1193)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.core.fileinput.FileInputList.createFileList(FileInputList.java:243)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.core.fileinput.FileInputList.createFileList(FileInputList.java:142)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.trans.steps.textfileinput.TextFileInputMeta.getTextFileList(TextFileInputMeta.java:1580)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.trans.steps.textfileinput.TextFileInput.init(TextFileInput.java:1513)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.trans.step.StepInitThread.run(StepInitThread.java:69)
2016/04/05 15:23:46 - FileInputList -   at java.lang.Thread.run(Thread.java:745)
2016/04/05 15:23:46 - FileInputList - Caused by: java.io.EOFException: End of File Exception between local host is: "EI001115/192.168.231.248"; destination host is: "172.21.0.35":50010; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2016/04/05 15:23:46 - FileInputList -   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
2016/04/05 15:23:46 - FileInputList -   at com.sun.proxy.$Proxy70.getListing(Unknown Source)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:554)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2016/04/05 15:23:46 - FileInputList -   at java.lang.reflect.Method.invoke(Method.java:606)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
2016/04/05 15:23:46 - FileInputList -   at com.sun.proxy.$Proxy71.getListing(Unknown Source)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:693)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl$9.call(HadoopFileSystemImpl.java:126)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl$9.call(HadoopFileSystemImpl.java:124)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl.callAndWrapExceptions(HadoopFileSystemImpl.java:200)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl.listStatus(HadoopFileSystemImpl.java:124)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.big.data.impl.vfs.hdfs.HDFSFileObject.doListChildren(HDFSFileObject.java:115)
2016/04/05 15:23:46 - FileInputList -   at org.apache.commons.vfs2.provider.AbstractFileObject.getChildren(AbstractFileObject.java:1184)
2016/04/05 15:23:46 - FileInputList -   ... 6 more
2016/04/05 15:23:46 - FileInputList - Caused by: java.io.EOFException
2016/04/05 15:23:46 - FileInputList -   at java.io.DataInputStream.readInt(DataInputStream.java:392)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
2016/04/05 15:23:48 - cfgbuilder - Warning: The configuration parameter [org] is not supported by the default configuration builder for scheme: sftp

Hadoop:

2016-04-05 14:22:56,045 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: fiware-hadoop:50010:DataXceiver error processing unknown operation  src: /192.168.231.248:62961 dst: /172.21.0.35:50010
java.io.IOException: Version Mismatch (Expected: 28, Received: 26738 )
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:60)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
        at java.lang.Thread.run(Thread.java:745)

Scenario B) In other cases, using different port numbers (50070, 9000, ...), I only get an error from Pentaho; the standalone Hadoop does not seem to receive any request.

Reading some Pentaho documentation, it seems the Big Data plugin is built for Hadoop 2.2.x, while I'm trying to connect to 2.7.2. Could that be the source of the problem? Is there a plugin that works with newer versions? Or is my URL to the HDFS file simply wrong?
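
For what it's worth, one way to see which Hadoop client version a given set of jars actually contains (for example, the jars shipped with the Big Data plugin shim) is to print the bundled version info from a tiny program run with those jars on the classpath. This is only a side check, not something from the original post:

import org.apache.hadoop.util.VersionInfo;

public class PrintHadoopClientVersion {
    public static void main(String[] args) {
        // Reports the version of the Hadoop client library found on the classpath,
        // i.e. what the client side actually speaks -- not the cluster's own version.
        System.out.println("Hadoop client version: " + VersionInfo.getVersion());
        System.out.println("Built: " + VersionInfo.getBuildVersion());
    }
}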

Thank you everyone for your time; any hint will be more than welcome.

Solution

I will answer the question myself, because I solved the issue and it is too large for a simple comment.

The issue was solved by making some changes to the Hadoop configuration:

  1. I changed the configuration in core-site.xml

from:

<property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop:9000</value>
</property>

to:

<property>
    <name>fs.default.name</name>
    <value>hdfs://server_ip_address:8020</value>
</property>

Since I was having problems with port 9000, I finally changed it to port 8020 (related issue); see the verification sketch after this list.

  2. Open port 8020 (just in case you have a firewall rule blocking it).
  3. The Pentaho Kettle transformation URL will then look like: ${PROTOCOL}://${USER}:${PASSWORD}@${HOST}:${PORT}${FILE_PATH}, where ${PORT} is now 8020.
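
A note on why the original URL most likely failed (my reading of the logs, not something stated explicitly in this answer): 50010 is the default DataNode data-transfer port, while an HDFS client such as Kettle's Hadoop File Input must talk to the NameNode RPC endpoint defined by fs.default.name / fs.defaultFS. Sending client RPC traffic to the DataNode port is what produces the DataXceiver "Version Mismatch" error shown in the Hadoop log above. A minimal sketch to verify the new endpoint from the Pentaho machine, assuming the Hadoop client jars are on the classpath and reusing the placeholder host from the core-site.xml snippet:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeEndpointCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Must match fs.default.name / fs.defaultFS in the cluster's core-site.xml.
        conf.set("fs.defaultFS", "hdfs://server_ip_address:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // listStatus() is the same call the Pentaho stack trace ends in
            // (DistributedFileSystem.listStatus), so if this succeeds the step should too.
            for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
                System.out.println(status.getPath());
            }
        }
    }
}

With the values from the question, the resolved Kettle URL would then presumably be hdfs://hadoop:@the_ip:8020/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt.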

This way I was able to preview data from HDFS in the Pentaho transformation.

Thank you all for your time.
