Datatype for handling big numbers in pyspark


Problem description

I am using Spark with Python. After uploading a CSV file, I needed to parse a column whose values are 22-digit numbers. For parsing that column I used LongType(), and I used the map() function for defining the column. Following are my commands in pyspark.

>>> from pyspark.sql.types import StructType, StructField, StringType, LongType
>>> test = sc.textFile("test.csv")
>>> header = test.first()
>>> schemaString = header.replace('"', '')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType = LongType()
>>> testschema = StructType(testfields)
>>> testHeader = test.filter(lambda l: "test_date" in l)
>>> testNoHeader = test.subtract(testHeader)
>>> test_temp = testNoHeader.map(lambda k: k.split(",")).map(lambda p: (p[0], p[1], p[2], p[3], p[4], float(p[5].strip('"')), p[6], p[7]))
>>> test_temp.top(2)

Note: I have also tried 'long' and 'bigint' in place of 'float' in my variable test_temp, but the error in Spark was 'keyword not found'. And following is the output:

[('2012-03-14', '7', '1698.00', 'XYZ02abc008793060653', 'II93', 8.27370028700801e+21, 'W0W0000000000007', '879870080088815007'), ('2002-03-14', '1', '999.00', 'ABC02E000050086941', 'II93', 8.37670028702205e+21, 'A0B0080000012523', '870870080000012421')]

The values in my CSV file are as follows: 8.27370028700801e+21 is 8273700287008010012345 and 8.37670028702205e+21 is 8376700287022050054321.

When I create a data frame out of it and then query it,

>>> test_df = sqlContext.createDataFrame(test_temp, testschema)
>>> test_df.registerTempTable("test")
>>> sqlContext.sql("SELECT test_column FROM test").show()

the test_column gives the value 'null' for all the records.

So, how do I solve this problem of parsing big numbers in Spark? I'd really appreciate your help.

Solution

Well, types matter. Since you convert your data to float you cannot use LongType in the DataFrame. It doesn't blow up only because PySpark is relatively forgiving when it comes to types.

Also, 8273700287008010012345 is too large to be represented as LongType, which can only represent values between -9223372036854775808 and 9223372036854775807.
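
As a quick sanity check (plain Python, not part of the original answer), you can verify that the value does not fit in the signed 64-bit range LongType uses:

value = 8273700287008010012345
print(value > 2**63 - 1)  # True: larger than the LongType maximum of 9223372036854775807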

If you want to convert your data to a DataFrame you'll have to use DoubleType:

from pyspark.sql.types import *

rdd = sc.parallelize([(8.27370028700801e+21, )])
schema = StructType([StructField("x", DoubleType(), False)])
rdd.toDF(schema).show()

## +-------------------+
## |                  x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
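
Keep in mind that a double carries only about 15-17 significant decimal digits, so the low-order digits of a 22-digit number are lost. A small illustration in plain Python (added here for clarity, not from the original answer):

v = 8273700287008010012345
print(int(float(v)) == v)  # False: the double representation rounds away the low-order digits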

Typically it is a better idea to handle this with DataFrames directly:

from pyspark.sql.functions import col

str_df = sc.parallelize([("8273700287008010012345", )]).toDF(["x"])
str_df.select(col("x").cast("double")).show()

## +-------------------+
## |                  x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+

If you don't want to use Double you can cast to Decimal with a specified precision:

str_df.select(col("x").cast(DecimalType(38))).show(1, False)

## +----------------------+
## |x                     |
## +----------------------+
## |8273700287008010012345|
## +----------------------+
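
Applied back to the original CSV flow, one possible approach is to keep the 22-digit field as a string in the schema and cast it to a wide decimal afterwards. This is only a sketch, not the answer's own code; the column name test_column and the sample values are taken from the question:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

# Build a small DataFrame with the big number kept as a string.
df = sqlContext.createDataFrame(
    [("2012-03-14", "8273700287008010012345")],
    ["test_date", "test_column"],
)
# Cast the string column to a decimal wide enough for 22 digits (precision 38, scale 0).
df = df.withColumn("test_column", col("test_column").cast(DecimalType(38, 0)))
df.show(1, False)  # test_column keeps all 22 digits: 8273700287008010012345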

