Convert CSV to Parquet file using Python

Question

I am trying to convert a .csv file to a .parquet file.
The CSV file (Temp.csv) has the following format:

1,Jon,Doe,Denver

I am using the following Python code to convert it to Parquet:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import os

if __name__ == "__main__":
    sc = SparkContext(appName="CSV2Parquet")
    sqlContext = SQLContext(sc)

    schema = StructType([
            StructField("col1", IntegerType(), True),
            StructField("col2", StringType(), True),
            StructField("col3", StringType(), True),
            StructField("col4", StringType(), True)])
    dirname = os.path.dirname(os.path.abspath(__file__))
    csvfilename = os.path.join(dirname,'Temp.csv')    
    rdd = sc.textFile(csvfilename).map(lambda line: line.split(","))
    df = sqlContext.createDataFrame(rdd, schema)
    parquetfilename = os.path.join(dirname,'output.parquet')    
    df.write.mode('overwrite').parquet(parquetfilename)

The result is only a folder named output.parquet, not the single Parquet file I am looking for, followed by the following error on the console.

I have also tried running the following code and ran into a similar issue.

from pyspark.sql import SparkSession
import os

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
dirname = os.path.dirname(os.path.abspath(__file__))
csvfilename = os.path.join(dirname,'Temp.csv')    
df = spark.read.csv(csvfilename)

# Displays the content of the DataFrame to stdout
df.show()
parquetfilename = os.path.join(dirname,'output.parquet')    
df.write.mode('overwrite').parquet(parquetfilename)

What is the best way to do this? I am using Windows and Python 2.7.

Recommended answer

Using the packages pyarrow and pandas, you can convert CSVs to Parquet without running a JVM in the background:

import pandas as pd
df = pd.read_csv('example.csv')
df.to_parquet('output.parquet')

One limitation you will run into is that pyarrow is only available for Python 3.5+ on Windows. Either use Linux/OSX to run the code as Python 2 or upgrade your Windows setup to Python 3.6.
