Converting pandas DataFrame with NumPy values to pyspark.sql.DataFrame


Question


I created a two-column pandas DataFrame with np.random.randint, then applied groupby operations to produce a second two-column DataFrame. df.col1 is a Series of lists and df.col2 is a Series of integers; the elements inside the lists are of type numpy.int64, as are the elements of the second column, since they come from np.random.randint.

df.a        df.b
3            7
5            2
1            8
...

groupby operations 

df.col1        df.col2
[1,2,3...]    1
[2,5,6...]    2
[6,4,....]    3
...


When I try to create the pyspark.sql DataFrame with spark.createDataFrame(df), I get this error: TypeError: not supported type: type 'numpy.int64'.
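
For reference, a minimal sketch that reproduces the setup (assuming an active SparkSession named spark; the values are illustrative):

import numpy as np
import pandas as pd

# A frame like the one above: a list column plus an integer column.
df = pd.DataFrame({
    'col1': [list(np.random.randint(0, 10, size=3)) for _ in range(3)],
    'col2': [1, 2, 3],
})

print(type(df['col1'][0][0]))  # <class 'numpy.int64'>
# spark.createDataFrame(df)   # raises TypeError: not supported type: 'numpy.int64'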


Going back to the df generation, I tried different methods to convert the elements from numpy.int64 to Python int, but none of them worked:

np_list = np.random.randint(0, 2500, size=(10000, 2)).astype(IntegerType)  # does not work: IntegerType is a PySpark SQL type, not a NumPy dtype
df = pd.DataFrame(np_list, columns=list('ab'), dtype='int')


I also tried mapping with lambda x: int(x) or x.item(), but the type still remains numpy.int64.
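
For what it's worth (not part of the original question), one plausible reason these casts appear to do nothing: pandas stores integer columns as NumPy arrays, so a cast Series comes back with a NumPy dtype anyway, and a map over the list column has to descend into each list to reach its elements. A sketch of a conversion that does reach them:

# Convert each numpy.int64 inside the lists to a plain Python int.
df['col1'] = df['col1'].apply(lambda xs: [int(x) for x in xs])

# Note: an integer column keeps a NumPy dtype however it is cast, since
# pandas backs it with a NumPy array; only object-dtype cells (such as
# these lists) can hold plain Python ints.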


According to the pyspark.sql documentation, it should be possible to load a pandas DataFrame, but it seems incompatible when the DataFrame carries NumPy values. Any hints?

Thanks!

Answer


Well, the way you are doing it doesn't work. If you have something like the following, you will get the error because of the first column: Spark doesn't understand a list whose elements are of type numpy.int64.

df.col1        df.col2
[1,2,3...]    1
[2,5,6...]    2
[6,4,....]    3
...
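
Not part of the original answer, but one possible workaround for keeping the list column is to convert the values to plain Python ints and pass an explicit schema instead of letting Spark infer one. A sketch, assuming an active SparkSession named spark and the two-column df above:

from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType

schema = StructType([
    StructField('col1', ArrayType(IntegerType()), True),
    StructField('col2', IntegerType(), True),
])

# Rebuild the rows with native Python ints so Spark can serialize them.
rows = [([int(x) for x in xs], int(n)) for xs, n in zip(df['col1'], df['col2'])]
spark_df = spark.createDataFrame(rows, schema=schema)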


If you have something like this instead, it should be fine:

df.a        df.b
3            7
5            2
1            8


In terms of your code, try this:

np_list = np.random.randint(0, 2500, size=(10000, 2))  # plain integer array; no cast needed
df = pd.DataFrame(np_list, columns=list('ab'))
spark_df = spark.createDataFrame(df)


You don't really need to cast this to int again, and if you want to do it explicitly, it is array.astype(int). Then just call spark_df.head(). This should work!
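
A short usage sketch of that advice (the cast is optional; note that head() is a method and must be called):

# Optional explicit cast, as mentioned above.
np_list = np.random.randint(0, 2500, size=(10000, 2)).astype(int)
df = pd.DataFrame(np_list, columns=list('ab'))
spark_df = spark.createDataFrame(df)

# head() is a method; spark_df.head without parentheses is just the bound method.
print(spark_df.head(3))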
