How do I extract a value (I want an int, not a Row) from a DataFrame and do simple calculations on it?


Question

I have a DataFrame in Apache Spark, call it "df", with 3 columns and about 1000 rows. One of the columns, call it "column x", stores a double in each row that is either 1.00 or 0.00. I need the number of rows in "column x" that equal 1.00, to use as a variable.

I know at least two ways of doing it, but I can't figure out how to finish either of them.

For the first one, I made a new DataFrame by selecting "column x", call it df2 (dropping the other columns that I don't need for this):

df2 = df.select('column_x')

Then I created another DataFrame that groups the 1.00 and 0.00 values, call it grouped_df:

grouped_df = df2.map(lambda label: (label, 1)).reduceByKey(lambda a, b: a + b)

This DataFrame now consists of only 2 rows instead of 1000. The first row holds the 1.00 rows added together into a double, and the second row the 0.00 rows.

Now here is the problem: I have no idea how to "extract" the element into a value so I can use it in a calculation. I have only managed to use .take(1) or collect() to display that the DataFrame's element is correct, but I can't do, for example, a simple division with it, since it doesn't return an int.

The other way is to filter out all the 0.00 rows in df2 and then call .count() on the filtered DataFrame, since that seems to return an int I can use.

It looks like this:
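
The snippet that belonged here is not preserved. As a rough sketch only (not the asker's original code), the filter-then-count approach described above might look like the following. The helper name count_ones is hypothetical, and df is assumed to be a PySpark DataFrame that has a column named column_x.

```python
def count_ones(df, column="column_x"):
    """Count rows where `column` equals 1.0 and return a plain Python int.

    Assumption: `df` is a PySpark DataFrame; `column_x` is the column
    name used in the question.
    """
    # df[column] builds a Column expression for the filter condition.
    # count() is an action that returns an int to the driver, so there
    # is no Row object to unpack afterwards.
    return df.filter(df[column] == 1.0).count()
```

Under these assumptions, count_ones(df2) / float(df2.count()) would give the fraction of rows equal to 1.00 as an ordinary float.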

Recommended answer

Once you have the final DataFrame with the aggregated counts for the column, you can call collect() on it. This returns the rows of the DataFrame as a list of Row objects.
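
The answer does not show how the aggregated counts are produced. One way it might be done with the DataFrame API is sketched here, assuming df is a PySpark DataFrame with a column named column_x as in the question; the helper name counts_by_value is hypothetical.

```python
def counts_by_value(df, column="column_x"):
    """Return {value: row_count} for `column` as plain Python numbers.

    Assumption: `df` is a PySpark DataFrame. groupBy(...).count() adds a
    column named "count" holding the number of rows in each group.
    """
    rows = df.groupBy(column).count().collect()  # list of Row objects
    # Use row["count"] rather than row.count: Row is a tuple subclass,
    # so the attribute form resolves to the built-in tuple.count method.
    return {row[column]: row["count"] for row in rows}
```

With this sketch, counts_by_value(df).get(1.0, 0) would be the number of rows equal to 1.00, already an int.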

From the list of Rows, you can access a column value by column name and assign it to a variable, as below:

>>> df.show()
+--------+----+
|    col1|col2|
+--------+----+
|column_x|1000|
|column_y|2000|
+--------+----+

>>>
>>> test = df.collect()
>>> test
[Row(col1=u'column_x', col2=1000), Row(col1=u'column_y', col2=2000)]
>>>
>>> count_x = test[0].col2
>>> count_x
1000
>>>
>>> count_y = test[1].col2
>>> count_y
2000
>>>
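
Once the values are plain ints, ordinary arithmetic works on them. The sketch below reuses the two rows from the session above to do the kind of division the question asks about. It uses a namedtuple as a stand-in for pyspark.sql.Row so that it runs without a Spark session; that substitution, and the names counts and share_x, are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark; a real
# Row supports the same attribute access (row.col2) and positional
# access (row[1]).
Row = namedtuple("Row", ["col1", "col2"])

# The list that df.collect() returned in the session above.
test = [Row(col1="column_x", col2=1000), Row(col1="column_y", col2=2000)]

# Key the counts by label so the lookup does not depend on row order.
counts = {row.col1: row.col2 for row in test}

count_x = counts["column_x"]
count_y = counts["column_y"]

# These are ordinary ints, so division behaves as usual. float() keeps
# the result fractional on Python 2 as well.
share_x = count_x / float(count_x + count_y)
print(share_x)
```

Looking the rows up by label rather than by position (test[0], test[1]) avoids relying on the order in which collect() happens to return them.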
