Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming


Problem description



There is a small problem when I try Cloudera 5.4.2, based on this article:

Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm

It fetches tweets using Flume and Twitter streaming for data analysis. Everything went smoothly: create the Twitter app, create a directory on HDFS, configure Flume and start fetching data, then create a schema on top of the tweets.

Then, here is the problem. Twitter streaming converts tweets to Avro format and sends Avro events to the downstream HDFS sink. When the Avro-backed Hive table loads the data, I get an error saying "Avro block size is invalid or too large".

So what is an Avro block, and what is the limit on block size? Can I change it? What does this message actually mean? Is it the file's fault, or the fault of some records? If Twitter's streaming hit bad data, it should have crashed. And if converting the tweets to Avro format succeeded, then, conversely, the Avro data should read back correctly, right?

I also tried avro-tools-1.7.7.jar:

java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232

{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT @ugaunion: .@ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}

{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DM開放してます(`・ω・´)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40

at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)

at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
... 4 more

The same problem. I googled it a lot and found no answers at all.
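For what it's worth, the -40 in the error is consistent with the reader landing on bytes that are not a real block header: Avro stores each block's record count and byte size as zigzag-encoded varints, so a stray byte mid-stream can decode to a negative "block size". A minimal sketch of that decoding (helper names are mine, not part of Avro's API):

```python
def zigzag_decode(n: int) -> int:
    """Undo Avro's zigzag mapping: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ..."""
    return (n >> 1) ^ -(n & 1)


def read_varint(data: bytes, pos: int = 0):
    """Read one little-endian base-128 varint (Avro's wire format for longs).

    Returns (unsigned_value, next_position).
    """
    shift = 0
    result = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


# A single corrupt byte 0x4F, read where a block size is expected,
# decodes to -40 -- the exact value in the exception above.
raw, _ = read_varint(b"\x4f")
print(zigzag_decode(raw))  # -40
```

In a healthy Avro container every block starts with two such varints (record count, then byte size), followed by the serialized records and a 16-byte sync marker. A negative decoded value therefore means the reader is no longer aligned on a block boundary, i.e. the file was corrupted on the way in, not that some size limit is too small.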

Could anyone who has also met this problem give me a solution? Or, if you fully understand Avro or the Twitter streaming underneath, could somebody offer a clue?

It is a really interesting problem. Think about it.

Solution

Use Cloudera TwitterSource

Otherwise you will meet this problem:

Unable to correctly load twitter avro data into hive table

In the article, this is the Apache TwitterSource:

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.

But it should be the Cloudera TwitterSource:

https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/

http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
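As a pointer, part 3 above ends up querying the Avro files through an external Hive table using the Avro SerDe; a sketch of what such a table definition looks like (the table name, location, and schema URL here are placeholders, not taken from this question):

```sql
CREATE EXTERNAL TABLE tweets
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/flume/tweets'
  TBLPROPERTIES ('avro.schema.url' = 'hdfs:///user/flume/tweets.avsc');
```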

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
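The one-line type change above sits inside a larger agent definition. A minimal sketch of such a flume.conf, following the cdh-twitter-example layout (credentials, hostnames, and paths are placeholders; everything beyond the Twitter.type line is illustrative, not quoted from the original article):

```properties
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# The Cloudera source, provided by flume-sources-1.0-SNAPSHOT.jar
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, bigdata

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
```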

And do not just download the pre-built jar: because our Cloudera version is 5.4.2, you would otherwise get this error:

Cannot run Flume because of JAR conflict

You should compile it yourself using Maven:

https://github.com/cloudera/cdh-twitter-example

Download and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.

Steps:

wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip

sudo yum install apache-maven

mvn package

Then put the built jar into the Flume plugins directory:

/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar

Notice: run yum update to the latest version first, otherwise the compile (mvn package) fails due to a security problem.
