datatype for handling big numbers in pyspark
Problem description
I am using Spark with Python. After uploading a csv file, I need to parse a column in it that holds numbers 22 digits long. To parse that column I used LongType(). I used the map() function to define the column. Following are my commands in pyspark.
>>> test=sc.textFile("test.csv")
>>> header=test.first()
>>> schemaString = header.replace('"','')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType = LongType()
>>> testschema = StructType(testfields)
>>> testHeader = test.filter(lambda l: "test_date" in l)
>>> testNoHeader = test.subtract(testHeader)
>>> test_temp = testNoHeader.map(lambda k: k.split(",")).map(lambda p: (p[0], p[1], p[2], p[3], p[4], float(p[5].strip('"')), p[6], p[7]))
>>> test_temp.top(2)
Note: I have also tried 'long' and 'bigint' in place of 'float' in my variable test_temp, but the error in Spark was 'keyword not found'. The following is the output:
[('2012-03-14', '7', '1698.00', 'XYZ02abc008793060653', 'II93', 8.27370028700801e+21, 'W0W0000000000007', '879870080088815007'), ('2002-03-14', '1', '999.00', 'ABC02E000050086941', 'II93', 8.37670028702205e+21, 'A0B0080000012523', '870870080000012421')]
The values in my csv file are as follows: 8.27370028700801e+21 is 8273700287008010012345, and 8.37670028702205e+21 is 8376700287022050054321.
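The loss of the trailing digits can be reproduced in plain Python, independent of Spark: a 64-bit float carries only 53 bits of mantissa (roughly 15-17 significant decimal digits), so a 22-digit integer cannot survive a round-trip through float(). A minimal sketch using one of the values above:

```python
# One of the 22-digit values from the CSV, converted the same way as float(p[5]).
s = "8273700287008010012345"

as_float = float(s)   # lossy: float has only ~15-17 significant digits
as_int = int(s)       # exact: Python ints have arbitrary precision

print(as_float)
print(int(as_float) == as_int)  # False: the low digits were lost
```

This is why the output shows 8.27370028700801e+21 instead of the original number, regardless of anything Spark does later.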
When I create a data frame out of it and then query it,
>>> test_df = sqlContext.createDataFrame(test_temp, testschema)
>>> test_df.registerTempTable("test")
>>> sqlContext.sql("SELECT test_column FROM test").show()
the test_column gives the value 'null' for all the records.
So, how can this problem of parsing big numbers in Spark be solved? I would really appreciate your help.
Accepted answer
Well, types matter. Since you convert your data to float, you cannot use LongType in the DataFrame. It doesn't blow up only because PySpark is relatively forgiving when it comes to types.
Also, 8273700287008010012345 is too large to be represented as a LongType, which can represent only values between -9223372036854775808 and 9223372036854775807.
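Those bounds are simply the signed 64-bit integer range, which can be checked in plain Python:

```python
# LongType is a signed 64-bit integer, so its range is [-2**63, 2**63 - 1].
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

print(LONG_MIN)  # -9223372036854775808
print(LONG_MAX)  # 9223372036854775807

# The 22-digit value from the question overflows this range:
print(8273700287008010012345 > LONG_MAX)  # True
```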
If you want to convert your data to a DataFrame, you'll have to use DoubleType:
from pyspark.sql.types import *
rdd = sc.parallelize([(8.27370028700801e+21, )])
schema = StructType([StructField("x", DoubleType(), False)])
rdd.toDF(schema).show()
## +-------------------+
## | x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
Typically, it is a better idea to handle this with DataFrames directly:
from pyspark.sql.functions import col
str_df = sc.parallelize([("8273700287008010012345", )]).toDF(["x"])
str_df.select(col("x").cast("double")).show()
## +-------------------+
## | x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
If you don't want to use Double, you can cast to Decimal with a specified precision:
str_df.select(col("x").cast(DecimalType(38))).show(1, False)
## +----------------------+
## |x |
## +----------------------+
## |8273700287008010012345|
## +----------------------+