Converting pandas DataFrame with NumPy values to pyspark.sql.DataFrame


Problem Description

I created a two-column pandas DataFrame with np.random.randint and applied a groupby operation to generate a second two-column DataFrame. df.col1 is a series of lists and df.col2 a series of integers; the elements inside the lists, like those of the second column, are of type numpy.int64, as a result of np.random.randint.

df.a        df.b
3            7
5            2
1            8
...

After the groupby operations:

df.col1        df.col2
[1,2,3...]    1
[2,5,6...]    2
[6,4,....]    3
...
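The question never shows the exact groupby call, so the following is a minimal sketch of an assumed setup that reproduces the shapes above (the aggregation into lists is a guess):

import numpy as np
import pandas as pd

# two columns of random NumPy integers, as in the question
np_list = np.random.randint(0, 2500, size=(10000, 2))
df = pd.DataFrame(np_list, columns=list('ab'))

# assumed aggregation: collect the 'a' values for each 'b' into a list
grouped = df.groupby('b')['a'].apply(list).reset_index()
grouped.columns = ['col2', 'col1']
grouped = grouped[['col1', 'col2']]

print(type(grouped['col1'].iloc[0][0]))  # <class 'numpy.int64'>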

When I try to create the pyspark.sql DataFrame with spark.createDataFrame(df), I get this error: TypeError: not supported type: type 'numpy.int64'.
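For reference, a minimal reproduction, assuming an active SparkSession named spark; on the Spark version used here this raises the TypeError (newer versions may convert such values automatically):

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# a column of lists whose elements are numpy.int64
pdf = pd.DataFrame({'col1': [list(np.random.randint(0, 10, 3)) for _ in range(3)],
                    'col2': [1, 2, 3]})

spark.createDataFrame(pdf)  # TypeError: not supported type: <class 'numpy.int64'>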

Going back to the DataFrame generation, I tried different methods to convert the elements from numpy.int64 to Python int, but none of them worked:

# failed attempt: IntegerType is a Spark SQL type, not a NumPy dtype
np_list = np.random.randint(0, 2500, size=(10000, 2)).astype(IntegerType)
df = pd.DataFrame(np_list, columns=list('ab'), dtype='int')

I also tried to map with lambda x: int(x) or x.item(), but the type still remains numpy.int64.
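The reason the mapping attempts appear to do nothing is that pandas stores the mapped values back in an int64 column, so reading an element returns numpy.int64 again. A small sketch:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 10, size=5))
converted = s.map(int)          # also true for s.map(lambda x: x.item())
print(converted.dtype)          # int64 -- pandas coerces back to a NumPy dtype
print(type(converted.iloc[0]))  # <class 'numpy.int64'>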

According to the pyspark.sql documentation, it should be possible to load a pandas DataFrame, but it does not seem compatible when the values are NumPy types. Any hints?

Thanks!

Recommended Answer

Well, the way you are doing it doesn't work. If you have something like this, you will get the error because of the first column: Spark doesn't understand a list whose elements are of type numpy.int64:

df.col1        df.col2
[1,2,3...]    1
[2,5,6...]    2
[6,4,....]    3
...

If you have something like this instead, it should be okay:

df.a        df.b
3            7
5            2
1            8

In terms of your code, try this:

import numpy as np
import pandas as pd

# plain integer columns: createDataFrame handles the int64 dtype directly
np_list = np.random.randint(0, 2500, size=(10000, 2))
df = pd.DataFrame(np_list, columns=list('ab'))
spark_df = spark.createDataFrame(df)

You don't really need to cast this to int again; if you want to do it explicitly, it is array.astype(int). Then just call spark_df.head(). This should work!
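For the grouped list column itself, one common workaround (a sketch, not part of the answer above) is to convert the numpy.int64 elements to plain Python ints before the conversion, after which Spark can infer an array column:

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({'col1': [list(np.random.randint(0, 10, 3)) for _ in range(3)],
                    'col2': [1, 2, 3]})

# convert each numpy.int64 inside the lists to a plain Python int
pdf['col1'] = pdf['col1'].apply(lambda xs: [int(x) for x in xs])

spark_df = spark.createDataFrame(pdf)
spark_df.show(3)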
