Does Apache Spark process unstructured multi-line data?

Problem description

The data looks like this:

make,Model,MPG,Cylinders,Engine Disp,Horsepower,Weight,Accelerate,Year,Origin
amc,amc ambassador dpl,15,8,390,190,3850,8.5,70,Indian
amc,amc gremlin,21,6,199,90,2648,15,70,Indian
amc,amc hornet,18,6,199,97,2774,15.5,70,Indian
amc,amc rebel sst,16,8,304,150,3433,12,70,Indian
.............

Now, the above is purely structured data, which I have happily processed with Spark in Scala, as shown below:

val rawData = sc.textFile("/hdfs/spark/cars2.txt")
case class Car(make: String, model: String, mpg: Int, cylinders: Int, engineDisp: Int,
               horsepower: Int, weight: Int, accelerate: Double, year: Int, origin: String)
// Skip the header row before parsing; otherwise .toInt fails on "MPG" etc.
val carsData = rawData.filter(!_.startsWith("make,")).map(_.split(",")).map(x =>
  Car(x(0), x(1), x(2).toInt, x(3).toInt, x(4).toInt, x(5).toInt, x(6).toInt,
      x(7).toDouble, x(8).toInt, x(9)))
carsData.take(2)
carsData.cache()
carsData.map(x => (x.origin, 1)).reduceByKey(_ + _).collect() // records per origin
val indianCars = carsData.filter(_.origin == "Indian")
indianCars.count()
// Build (sum of weights, number of records) per make, then divide for the average.
val makeWeightSum = indianCars.map(x => (x.make, x.weight)).combineByKey(
  (w: Int) => (w, 1),
  (acc: (Int, Int), w: Int) => (acc._1 + w, acc._2 + 1),
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))
makeWeightSum.collect()
val makeWeightAvg = makeWeightSum.map { case (make, (sum, cnt)) => (make, sum / cnt) }
makeWeightAvg.collect()
makeWeightAvg.saveAsTextFile("carsMakeWeightAvg.txt")

Now, I can also do this analysis in Hive, so why do I need Spark? (Spark might be fast, but who really wants to travel by rocket?) So the question is: does Spark process multi-line unstructured data like the following?

Brand:Nokia, Model:1112, price:100, time:201604091,
redirectDomain:xyz.com, type:online,status:completed,
tx:credit,country:in,

Brand:samsung, Model:s6, price:5000, time:2016045859,
redirectDomain:abc.com, type:online,status:completed,

.....thousands of records...

Answer

Yes, Spark can be used to do that.

A DataFrame is a distributed collection of data organized into named columns. Spark SQL supports operating on a variety of data sources through the DataFrame interface, and you can manually specify the options of the data source for data like this.
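As a minimal sketch (assuming Spark 2.x or later, where the spark-shell provides a SparkSession named spark; the question itself uses the older RDD API), manually specifying the source format and its options looks like this:

// Read the cars file as CSV, specifying the options by hand.
val df = spark.read
  .format("csv")
  .option("header", "true")      // the first line holds the column names
  .option("inferSchema", "true") // derive column types from the values
  .load("/hdfs/spark/cars2.txt")

df.groupBy("Origin").count().show() // e.g. count records per origin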

References: Spark DataFrames; multi-line input in Spark

Note: your data is not all that unstructured. It is more like a CSV file, and with a few basic transformations it can be converted to a Dataset/DataFrame, as sketched below.
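As a minimal sketch of those transformations (assuming a spark-shell with sc and spark in scope; the file path and the selected field names are hypothetical, and a blank line is taken as the record separator, as in the sample above):

import spark.implicits._ // for toDF; already imported in spark-shell

// Treat a blank line as the record delimiter, so one record may span several lines.
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n\n")

// Parse "Brand:Nokia, Model:1112, ..." into Map(Brand -> Nokia, Model -> 1112, ...).
val records = sc.textFile("/hdfs/spark/phones.txt").map { rec =>
  rec.split(",").map(_.trim).filter(_.contains(":")).map { kv =>
    val Array(k, v) = kv.split(":", 2)
    k.trim -> v.trim
  }.toMap
}

// Project the fields of interest; absent keys default to an empty string.
val df = records
  .map(m => (m.getOrElse("Brand", ""), m.getOrElse("Model", ""), m.getOrElse("price", "")))
  .toDF("Brand", "Model", "price")

df.show()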

If you are just evaluating the various tools/frameworks that could be used for this, I would also suggest looking at Apache Flink.
