PySpark: How to specify column with comma as decimal


Question

I am working with PySpark and loading a CSV file. I have a column with numbers in European format, meaning that a comma replaces the dot and vice versa.

For example, I have 2.416,67 instead of 2,416.67.

My data in the .csv file looks like this:

ID;    Revenue
21;    2.645,45
23;   31.147,05
...
55;    1.009,11

In pandas, such a file can easily be read by specifying decimal=',' and thousands='.' options inside pd.read_csv() to read European formats.

Pandas code:

import pandas as pd
df = pd.read_csv("filepath/revenues.csv", sep=';', decimal=',', thousands='.')

I don't know how this can be done in PySpark.

PySpark code:

from pyspark.sql.types import StructType, StructField, FloatType, StringType
schema = StructType([
            StructField("ID", StringType(), True),
            StructField("Revenue", FloatType(), True)
                    ])
df = spark.read.csv("filepath/revenues.csv", sep=';', encoding='UTF-8', schema=schema, header=True)

Can anyone suggest how we can load such a file in PySpark using the above-mentioned .csv() function?

Answer

You won't be able to read it as a float because of the format of the data: when Spark parses a value like 2.645,45 against a FloatType field, the parse fails and (in the default PERMISSIVE mode) the value comes back null. You need to read it as a string, clean it up, and then cast it to float:

from pyspark.sql.functions import col, regexp_replace

df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
# Column resolution is case-insensitive by default, so 'revenue' matches the CSV's "Revenue" header.
df = df.withColumn('revenue', regexp_replace('revenue', '\\.', ''))  # strip the thousands separator
df = df.withColumn('revenue', regexp_replace('revenue', ',', '.'))   # turn the decimal comma into a dot
df = df.withColumn('revenue', df['revenue'].cast("float"))
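
As a quick sanity check (my addition, not part of the original answer), you can confirm that the column is now numeric:

df.printSchema()  # revenue should now show as float
df.show()         # e.g. 2.645,45 becomes 2645.45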

You can probably just chain all of these together too:

df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = (
         df
         .withColumn('revenue', regexp_replace('revenue', '\\.', ''))
         .withColumn('revenue', regexp_replace('revenue', ',', '.'))
         # Use col() here: df['revenue'] would still refer to the original, uncleaned string column.
         .withColumn('revenue', col('revenue').cast("float"))
     )
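
If you prefer a single transformation, the two replacements and the cast can also be folded into one expression. This is a minimal, untested sketch of the same idea:

# Alternative: nest the replacements and cast in one withColumn call
df = df.withColumn(
    'revenue',
    regexp_replace(regexp_replace(col('revenue'), '\\.', ''), ',', '.').cast('float')
)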

Please note that I haven't tested this, so there may be a typo or two in there.

