datatype for handling big numbers in pyspark

Question


I am using Spark with Python. After uploading a CSV file, I needed to parse a column in the CSV file which has numbers that are 22 digits long. For parsing that column I used LongType(). I used the map() function for defining the column. Following are my commands in pyspark.

>>> test=sc.textFile("test.csv")
>>> header=test.first()
>>> schemaString = header.replace('"','')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType = LongType()
>>> testschema = StructType(testfields)
>>> testHeader = test.filter(lambda l: "test_date" in l)
>>> testNoHeader = test.subtract(testHeader)
>>> test_temp = testNoHeader.map(lambda k: k.split(",")).map(lambda
p:(p[0],p[1],p[2],p[3],p[4],***float(p[5].strip('"'))***,p[6],p[7]))
>>> test_temp.top(2)

Note: I have also tried 'long' and 'bigint' in place of 'float' in my variable test_temp, but the error in Spark was 'keyword not found'. The following is the output:

[('2012-03-14', '7', '1698.00', 'XYZ02abc008793060653', 'II93', 8.27370028700801e+21, 'W0W0000000000007', '879870080088815007'), ('2002-03-14', '1', '999.00', 'ABC02E000050086941', 'II93', 8.37670028702205e+21, 'A0B0080000012523', '870870080000012421')]

The values in my csv file are as follows: 8.27370028700801e+21 is 8273700287008010012345, and 8.37670028702205e+21 is 8376700287022050054321.
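For reference, a quick plain-Python check (not among the commands above) illustrates why a float cannot hold all 22 digits:

exact = 8273700287008010012345               # the value from the csv file
approx = float("8273700287008010012345")     # what the map() step produces

print(approx)                # scientific notation, as in the output above
print(int(approx) == exact)  # False: a 64-bit double keeps only ~15-16 significant digits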

When I create a data frame out of it and then query it,

>>> test_df = sqlContext.createDataFrame(test_temp, testschema)
>>> test_df.registerTempTable("test")
>>> sqlContext.sql("SELECT test_column FROM test").show()

the test_column gives the value 'null' for all the records.

So, how do I solve this problem of parsing big numbers in Spark? I would really appreciate your help.

Solution

Well, types matter. Since you convert your data to float, you cannot use LongType in the DataFrame. It doesn't blow up only because PySpark is relatively forgiving when it comes to types.

Also, 8273700287008010012345 is too large to be represented as a LongType, which can only represent values between -9223372036854775808 and 9223372036854775807.
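To make both points concrete, here is a minimal sketch (assuming the sc and sqlContext from the question; the null output is the behavior the question reports on its Spark setup, and may differ on newer versions that verify types):

from pyspark.sql.types import StructType, StructField, LongType

# 1) The 22-digit value exceeds the signed 64-bit range that LongType covers.
LONG_MAX = 9223372036854775807            # 2**63 - 1
print(8273700287008010012345 > LONG_MAX)  # True

# 2) Python floats fed into a LongType column do not fail outright;
#    on the asker's setup they come back as null, as shown in the question.
rdd_long = sc.parallelize([(8.27370028700801e+21, )])
schema_long = StructType([StructField("x", LongType(), True)])
sqlContext.createDataFrame(rdd_long, schema_long).show()
## +----+
## |   x|
## +----+
## |null|
## +----+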

If you want to convert your data to a DataFrame you'll have to use DoubleType:

from pyspark.sql.types import *

rdd = sc.parallelize([(8.27370028700801e+21, )])
schema = StructType([StructField("x", DoubleType(), False)])
rdd.toDF(schema).show()

## +-------------------+
## |                  x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+

Typically it is a better idea to handle this with DataFrames directly:

from pyspark.sql.functions import col

str_df = sc.parallelize([("8273700287008010012345", )]).toDF(["x"])
str_df.select(col("x").cast("double")).show()

## +-------------------+
## |                  x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+

If you don't want to use Double you can cast to Decimal with specified precision:

str_df.select(col("x").cast(DecimalType(38))).show(1, False)

## +----------------------+
## |x                     |
## +----------------------+
## |8273700287008010012345|
## +----------------------+
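If the goal is to keep the original pipeline, a possible adaptation (a sketch reusing the variable names from the question, not code from the answer) is to parse the field with Python's decimal.Decimal and declare the column as DecimalType, so the DataFrame holds the exact 22-digit value instead of a lossy float:

from decimal import Decimal
from pyspark.sql.types import StructType, DecimalType

# Parse the sixth field as an exact Decimal instead of a lossy float.
test_temp = testNoHeader.map(lambda k: k.split(",")).map(
    lambda p: (p[0], p[1], p[2], p[3], p[4],
               Decimal(p[5].strip('"')),   # keeps all 22 digits
               p[6], p[7]))

# Declare the matching column type: precision 38, scale 0.
testfields[5].dataType = DecimalType(38, 0)
test_df = sqlContext.createDataFrame(test_temp, StructType(testfields))
test_df.printSchema()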
