How to load CSV file with records on multiple lines?


Problem Description

I use Spark 2.3.0.

As part of an Apache Spark project, I am using this data set to work on. When trying to read the CSV using Spark, rows in the Spark dataframe do not correspond to the correct rows in the CSV file (see sample CSV here). The code looks like the following:

answer_df = sparkSession.read.csv('./stacksample/Answers_sample.csv', header=True, inferSchema=True, multiLine=True)
answer_df.show(2)

Output

+--------------------+-------------+--------------------+--------+-----+--------------------+
|                  Id|  OwnerUserId|        CreationDate|ParentId|Score|                Body|
+--------------------+-------------+--------------------+--------+-----+--------------------+
|                  92|           61|2008-08-01T14:45:37Z|      90|   13|"<p><a href=""htt...|
|<p>A very good re...| though.</p>"|                null|    null| null|                null|
+--------------------+-------------+--------------------+--------+-----+--------------------+
only showing top 2 rows

However, when I used pandas, it worked like a charm.

import pandas as pd

df = pd.read_csv('./stacksample/Answers_sample.csv')
df.head(3)

Output

Index Id    OwnerUserId CreationDate    ParentId    Score   Body
0   92  61  2008-08-01T14:45:37Z    90  13  <p><a href="http://svnbook.red-bean.com/">Vers...
1   124 26  2008-08-01T16:09:47Z    80  12  <p>I wound up using this. It is a kind of a ha...

My observation: Apache Spark is treating every physical line in the CSV file as a dataframe record (which is reasonable), but pandas, on the other hand, intelligently (I am not sure based on which parameters) figures out where a record actually ends.
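For what it's worth, this is standard RFC 4180 quoting, which pandas (and Python's built-in csv module) implement by default: a quoted field may span multiple physical lines, and a doubled quote ("") inside it stands for a literal quote. A minimal sketch with made-up data:

import csv
import io

# Made-up two-column sample: the Body field is quoted, spans three
# physical lines, and escapes embedded quotes by doubling them.
raw = ('Id,Body\n'
       '92,"<p><a href=""http://example.com"">link</a></p>\n'
       '\n'
       '<p>Second paragraph.</p>"\n')

for row in csv.reader(io.StringIO(raw)):
    print(row)
# ['Id', 'Body']
# ['92', '<p><a href="http://example.com">link</a></p>\n\n<p>Second paragraph.</p>']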

Question: I would like to know how I can instruct Spark to load the dataframe properly.

The data to be loaded is as follows, with the lines starting with 92 and 124 being two records.

Id,OwnerUserId,CreationDate,ParentId,Score,Body
92,61,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Version Control with Subversion</a></p>

<p>A very good resource for source control in general. Not really TortoiseSVN specific, though.</p>"
124,26,2008-08-01T16:09:47Z,80,12,"<p>I wound up using this. It is a kind of a hack, but it actually works pretty well. The only thing is you have to be very careful with your semicolons. : D</p>

<pre><code>var strSql:String = stream.readUTFBytes(stream.bytesAvailable);      
var i:Number = 0;
var strSqlSplit:Array = strSql.split("";"");
for (i = 0; i &lt; strSqlSplit.length; i++){
    NonQuery(strSqlSplit[i].toString());
}
</code></pre>
"

Answer

I think you should use option("escape", "\"") as it seems that " is used as the so-called quote escape character.

// By default Spark's CSV parser expects backslash-escaped quotes;
// setting escape to " makes it treat a doubled quote ("") inside a
// quoted field as a literal quote, per RFC 4180.
val q = spark.read
  .option("multiLine", true)
  .option("header", true)
  .option("escape", "\"")
  .csv("input.csv")
scala> q.show
+---+-----------+--------------------+--------+-----+--------------------+
| Id|OwnerUserId|        CreationDate|ParentId|Score|                Body|
+---+-----------+--------------------+--------+-----+--------------------+
| 92|         61|2008-08-01T14:45:37Z|      90|   13|<p><a href="http:...|
|124|         26|2008-08-01T16:09:47Z|      80|   12|<p>I wound up usi...|
+---+-----------+--------------------+--------+-----+--------------------+
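Since the question uses PySpark, the same options apply there as keyword arguments; a sketch of the equivalent call, using the path from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine lets a record span physical lines; escape='"' tells the
# parser that quotes inside a quoted field are doubled ("") rather
# than backslash-escaped, which is what this file uses.
answer_df = spark.read.csv(
    './stacksample/Answers_sample.csv',
    header=True,
    inferSchema=True,
    multiLine=True,
    escape='"',
)
answer_df.show(2)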
