Rpy2 pandas2ri.ri2py() is converting NA values to integers

Problem description

I'm using rpy2 version 2.8.4 with R 3.3.0 and Python 2.7.10 to create an R data frame:

import rpy2.robjects as ro
from rpy2.robjects import r
from rpy2.robjects import pandas2ri

# Col3 is a factor whose last two entries are missing (NA_Integer).
df = ro.DataFrame({'Col1': ro.vectors.IntVector([1, 2, 3, 4, 5]),
                   'Col2': ro.vectors.StrVector(['a', 'b', 'c', 'd', 'e']),
                   'Col3': ro.vectors.FactorVector([1, 2, 3, ro.NA_Integer, ro.NA_Integer])})
print(df)

   | Col2 | Col3 | Col1 |
   ----------------------
 1 |  a   | 1    | 1    |
 2 |  b   | 2    | 2    |
 3 |  c   | 3    | 3    |
 4 |  d   | NA   | 4    |
 5 |  e   | NA   | 5    |

and I can convert this to a pandas dataframe without any trouble.

pandas2ri.ri2py(df)

   | Col2 | Col3 | Col1 |
   ----------------------
 1 |  a   | 1    | 1    |
 2 |  b   | 2    | 2    |
 3 |  c   | 3    | 3    |
 4 |  d   | NA   | 4    |
 5 |  e   | NA   | 5    |

However, I notice that the FactorVector metadata includes 'NA' as a factor level,

r.assign('df', df)  # bind the data frame to the name `df` on the R side first
print(r('levels(df$Col3)'))

[1] "1"  "2"  "3"  "NA"

which I understand is not default behaviour when creating factors in R.

If I drop 'NA' from the factor levels,

r.assign('df', df)
r('df$Col3 <- factor(as.numeric(levels(df$Col3))[df$Col3])')

then I get a very different result when converting the R dataframe to a pandas dataframe.

df2 = r['df']
pandas2ri.ri2py(df2)

   | Col2 | Col3 | Col1 |
   ----------------------
 1 |  a   | 1    | 1    |
 2 |  b   | 2    | 2    |
 3 |  c   | 3    | 3    |
 4 |  d   | 1    | 4    |
 5 |  e   | 1    | 5    |

My question is whether this is a bug, or whether I am doing something wrong by assuming that NA_Integer values should not be included as factor levels within R data frames.

Solution

The conversion of a factor column in an R data.frame to a column in a pandas DataFrame is done by the pandas2ri conversion code. Nothing there handles NAs in a specific way, so this must happen upstream of the conversion. If you look at your column "Col3", you'll see that NA is already listed as a level in the factor.

>>> print(df.rx2("Col3"))
[1] 1  2  3  NA NA
Levels: 1 2 3 NA

This happens even before the R data.frame is created:

>>> lst = [1, 2, 3, ro.NA_Integer, ro.NA_Integer]
>>> print(ro.vectors.FactorVector(lst))
[1] 1  2  3  NA NA
Levels: 1 2 3 NA

What is happening is that the FactorVector constructor in rpy2 uses a different default for the exclude parameter than R's factor() function does, so NA is kept as a level. (I think this was done so that the integer codes work as indices into the vector of levels by default.)

R's default behaviour can be restored with:

>>> v = ro.vectors.FactorVector(lst, exclude=ro.StrVector(["NA"]))
>>> print(v)
[1] 1    2    3    <NA> <NA>
Levels: 1 2 3
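
To make the effect of exclude concrete, here is a minimal pure-Python sketch of how a factor's levels and integer codes could be built with and without excluding NA. The helper make_factor and its exclude_na flag are hypothetical names invented for this illustration, and None stands in for NA; this imitates the idea only and is not how rpy2 or R implement it.

```python
def make_factor(values, exclude_na=True):
    """Return (codes, levels) for `values`, using None to stand in for NA.

    Hypothetical helper for illustration only; not rpy2 or R code.
    """
    present = {v for v in values if v is not None}
    levels = [str(v) for v in sorted(present)]
    if not exclude_na and None in values:
        levels.append("NA")  # NA becomes an ordinary level
    codes = []
    for v in values:
        if v is None:
            # With NA excluded the code itself is missing; otherwise it
            # points at the "NA" level like any other value.
            codes.append(None if exclude_na else levels.index("NA") + 1)
        else:
            codes.append(levels.index(str(v)) + 1)  # 1-based, as in R
    return codes, levels


lst = [1, 2, 3, None, None]
print(make_factor(lst, exclude_na=False))  # NA kept as a fourth level
print(make_factor(lst, exclude_na=True))   # NA left out of the levels
```

With NA kept as a level, every element has a valid index into the levels, which matches the motivation guessed at above. With NA excluded, the missing entries have no level to point to, so whatever converts the factor has to handle them explicitly.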

The underlying issue is that there is no standard for representing missing values (nothing comparable to an IEEE standard). R uses an arbitrary extreme value as a sentinel (for integers, the smallest 32-bit value), whereas Python has no built-in notion of a missing value.
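
As a rough illustration of the sentinel idea, the sketch below decodes 1-based factor codes into labels and maps R's integer NA sentinel to None. The names R_NA_INTEGER and decode_factor are made up for this example; the only fact assumed about R is that NA_integer_ is stored as the smallest 32-bit integer. It does not reproduce what pandas2ri does internally.

```python
R_NA_INTEGER = -2**31  # value R uses to mark a missing integer


def decode_factor(codes, levels):
    """Map 1-based factor codes to labels, turning the NA sentinel into None.

    Hypothetical helper for illustration only; not part of rpy2 or pandas.
    """
    decoded = []
    for code in codes:
        if code == R_NA_INTEGER:
            decoded.append(None)  # explicit check: missing stays missing
        else:
            decoded.append(levels[code - 1])
    return decoded


codes = [1, 2, 3, R_NA_INTEGER, R_NA_INTEGER]
print(decode_factor(codes, ["1", "2", "3"]))
```

A converter that skips the explicit check would treat the sentinel as an ordinary integer code, which is one way a missing value can end up looking like a regular one after conversion.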
