Best way to convert an online CSV to a DataFrame in Scala


Question

I am trying to figure out the most efficient way to load this online CSV file into a DataFrame in Scala.

To save you a download, the CSV file referenced in the code looks like this:

"Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"
"DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"
"MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"
....

From my research, I start by downloading the CSV and placing it into a ListBuffer (a List is immutable, so it cannot be appended to in place):

import scala.collection.mutable.ListBuffer

val sc = new SparkContext(conf)

var stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()


import scala.io.Source

val bufferedSource = Source.fromURL("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")

for (line <- bufferedSource.getLines) {
    val cols = line.split(",").map(_.trim)

    stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"

}
bufferedSource.close

val stockInfoNYSE_List = stockInfoNYSE_ListBuffer.toList
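
The loop above splits each line and then joins the same nine columns back together, so the list ends up holding largely the original lines. A minimal sketch of reading the lines directly is shown here; `readLines` is a hypothetical helper, and `Source.fromString` with made-up content stands in for `Source.fromURL` so the example runs offline:

```scala
import scala.io.Source

// Hypothetical helper: read every line of a Source into an immutable List,
// closing the source even if reading fails.
def readLines(source: Source): List[String] =
  try source.getLines().toList
  finally source.close()

// Stand-in for Source.fromURL(...): a header line plus two data lines.
val sample = Source.fromString(
  "\"Symbol\",\"Name\"\n\"DDD\",\"3D Systems Corporation\"\n\"MMM\",\"3M Company\""
)
val lines = readLines(sample)
```

Because `toList` builds the immutable list in one step, no mutable buffer is needed in this variant.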

So now we have a list. You can get each value like this:

// SYMBOL : stockInfoNYSE_List(1).split(",")(0)
// COMPANY NAME : stockInfoNYSE_List(1).split(",")(1)
// IPOYear : stockInfoNYSE_List(1).split(",")(5)
// Sector : stockInfoNYSE_List(1).split(",")(6)
// Industry : stockInfoNYSE_List(1).split(",")(7)
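
One caveat with indexing into `split(",")`: if a quoted field itself contains a comma (a company name ending in ", Inc.", for example), the indexes shift. The sketch below shows one common workaround, a regex that only splits on commas outside double quotes. `splitCsvLine` is a hypothetical helper, the sample row is made up, and the regex assumes quotes are balanced with no escaped quotes inside fields:

```scala
// Split on commas that are followed by an even number of double quotes,
// i.e. commas that sit outside any quoted field, then strip the outer quotes.
def splitCsvLine(line: String): Array[String] =
  line
    .split(""",(?=(?:[^"]*"[^"]*")*[^"]*$)""", -1)
    .map(_.trim.stripPrefix("\"").stripSuffix("\""))

// Made-up row whose second field contains a comma.
val fields = splitCsvLine("\"XYZ\",\"Example Holdings, Inc.\",\"1.00\"")
// fields(1) keeps the embedded comma instead of being cut in two
```

For anything beyond simple input, a dedicated CSV parser is likely the safer choice.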

Here is where I get stuck: how do I get this into a DataFrame? Below are the wrong approaches I have taken. I have not put all the values in yet; this was a simple test.

case class StockMap(Symbol: String, Name: String)
val caseClassDS = Seq(StockMap(stockInfoNYSE_List(1).split(",")(0),
  stockInfoNYSE_List(1).split(",")(1))).toDS()

caseClassDS.show()

The problem with the approach above: I can only figure out how to add one row by hard-coding it. I want every row in the list.

Second failed attempt:

val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val test = stockInfoNYSE_List.toDF

This just gives an array of whole lines, but I want the values split into separate columns:

Array(["Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"], ["DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"], ["MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"],....... 

Recommended answer

case class TestClass(Symbol: String, Name: String, LastSale: String, MarketCap: String, ADR_TSO: String, IPOyear: String, Sector: String, Industry: String, Summary_Quote: String)

// Drop the header row; despite its name, stockDF is still a ListBuffer[String]
val stockDF = stockInfoNYSE_ListBuffer.drop(1)

// Strip the quotes, split on commas, and map each line onto the case class
val demoDS = stockDF.map(line => {
  val fields = line.replace("\"", "").split(",")
  TestClass(fields(0), fields(1), fields(2), fields(3), fields(4), fields(5), fields(6), fields(7), fields(8))
})

// toDS needs the Spark implicits in scope (e.g. import sqlContext.implicits._)
demoDS.toDS.show

+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
|Symbol|                Name|LastSale|      MarketCap|      ADR_TSO|IPOyear|           Sector|            Industry|       Summary_Quote|
+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
|   DDD|3D Systems Corpor...|   18.09|  2058834640.41|          n/a|    n/a|       Technology|Computer Software...|http://www.nasdaq...|
|   MMM|          3M Company|  211.68|126423673447.68|          n/a|    n/a|      Health Care|Medical/Dental In...|http://www.nasdaq...|
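
As an alternative to splitting by hand, Spark 2.2 and later can run its own CSV parser over lines that are already in memory, via `DataFrameReader.csv(Dataset[String])`. The following is a sketch under assumptions: a local `SparkSession` named `spark`, an illustrative app name, and a shortened three-column list called `lines` standing in for the downloaded data.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for the sketch.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-sketch")
  .getOrCreate()
import spark.implicits._

// Stand-in for the downloaded lines (header first).
val lines = List(
  "\"Symbol\",\"Name\",\"LastSale\"",
  "\"DDD\",\"3D Systems Corporation\",\"18.09\"",
  "\"MMM\",\"3M Company\",\"211.68\""
)

// Parse the in-memory lines with Spark's CSV reader; the first line becomes
// the column names, and quoted fields are handled by the parser.
val df = spark.read
  .option("header", "true")
  .csv(lines.toDS())

df.show()
```

With this approach no case class is required, and every column comes back as a string unless a schema or the `inferSchema` option is supplied.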
