Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming
Problem description
There is a tiny problem when I try Cloudera 5.4.2, based on this article:
Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm
It fetches tweets using Flume and Twitter streaming for data analysis. Everything goes fine: create the Twitter app, create a directory on HDFS, configure Flume, start fetching data, and create a schema on top of the tweets.
Then, here is the problem. Twitter streaming converts tweets to Avro format and sends Avro events to the downstream HDFS sinks, but when the Avro-backed Hive table loads the data, I got the error message "Avro block size is invalid or too large".
Oh, what is an Avro block, and what is the limit on the block size? Can I change it? What does this message mean? Is it the file's fault, or some records' fault? If Twitter streaming had hit bad data, it should have crashed. And if converting the tweets to Avro format succeeded, then, conversely, the Avro data should read back correctly, right?
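For background on where the "-40" comes from: in an Avro object container file, each data block is preceded by two longs, the object count and the block size in bytes, both stored as variable-length zig-zag integers. If the reader lands on bytes that are not really a block header (e.g. because the stream is corrupted or mis-framed), the decoded size can come out negative, which is exactly what the error reports. The following is a minimal sketch of Avro's zig-zag long decoding (not the library's actual code), showing how a single stray byte decodes to -40:

```python
def decode_zigzag_long(buf: bytes, pos: int = 0):
    """Decode one Avro variable-length zig-zag long; return (value, next_pos)."""
    n, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        n |= (b & 0x7F) << shift  # low 7 bits carry data
        if not (b & 0x80):        # high bit clear: last byte of the varint
            break
        shift += 7
    # Undo zig-zag: even numbers map to >= 0, odd numbers to < 0
    return (n >> 1) ^ -(n & 1), pos

# A block header is two such longs: object count, then block size in bytes.
# A stray 0x4F byte where the size should be decodes to -40 -- the very
# value in "Block size invalid or too large: -40".
value, _ = decode_zigzag_long(b"\x4f")
print(value)  # → -40
```

So the message usually points at a corrupted or mis-written file rather than a configurable limit: the reader has lost sync with the block framing.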
I also tried avro-tools-1.7.7.jar:
java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232
{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT @ugaunion: .@ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}
{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DM開放してます(`・ω・´)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
... 4 more
The same problem. I have googled a lot, with no answers at all.
Could anyone give me a solution if you have met this problem too? Or could somebody give a clue if you fully understand the Avro internals or the Twitter streaming underneath?
It is a really interesting problem. Think about it.
Use the Cloudera TwitterSource; otherwise you will meet this problem:
Unable to correctly load twitter avro data into hive table
In the article, this is the Apache TwitterSource:
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using the streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.
But it should be the Cloudera TwitterSource:
https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
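For reference, a minimal agent configuration using the Cloudera source might look like the following. This is a sketch based on the cdh-twitter-example layout; the agent, channel, and sink names and the OAuth placeholders are illustrative, and the placeholder values must be replaced with your own Twitter app credentials:

```
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# The key line: the Cloudera implementation, not the Apache one
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, flume
```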
And do not just download the pre-built jar, because our Cloudera version is 5.4.2; otherwise you will get this error:
Cannot run Flume because of JAR conflict
You should compile it using Maven:
https://github.com/cloudera/cdh-twitter-example
Download and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.
Steps:
wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
sudo yum install apache-maven
mvn package
Then put the resulting jar into the Flume plugins directory:
/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
Notice: run yum update to the latest version first, otherwise the compile (mvn package) fails due to a security problem.