Pandas DataFrame Object Inheritance or Object Use?
Question
I am building a library for working with very specific structured data, and I am building my infrastructure on top of Pandas. Currently I am writing a set of data containers for different use cases, such as CTMatrix for Country x Time data, to house the methods appropriate for all Country x Time structured data.
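I am currently debating between two designs.

Option 1: Object inheritance

    import pandas as pd

    class CTMatrix(pd.DataFrame):
        # methods etc. here
        pass

Or Option 2: Object use

    import pandas as pd

    class CTMatrix(object):
        def __init__(self, *args, **kw):
            self._data = pd.DataFrame(*args, **kw)
        # then use getter and setter methods to control access to _data etc.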
From a software engineering perspective, is there an obvious choice here?
My thoughts so far are:
Option 1:
- Can use DataFrame methods directly on the CTMatrix class (like CTMatrix.sort()) without having to support them via methods on the encapsulated _data object, as Option #2 requires
- Updates and new methods in Pandas are inherited, except for methods that are overridden by local class methods
But
- Complications with some methods such as __init__(), which has to pass its arguments up to the superclass: super(MyDF, self).__init__(*args, **kw)
Option 2:
- More control over the class and its behavior
- Possibly more resilient to updates in Pandas?

But
- Having to use a getter() or a non-hidden attribute to use the object like a DataFrame, such as CTMatrix.data.sort()
Are there any additional downsides for taking the approach in Option #1?
Recommended answer
I would avoid subclassing DataFrame, because many of the DataFrame methods will return a new DataFrame and not another instance of your CTMatrix object.
There are a few open issues on GitHub around this, e.g.:
https://github.com/pydata/pandas/issues/2485
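The point about return types can be checked with a small sketch. The class names NaiveCT and HookedCT and the sample values are made up for illustration. The second class relies on the `_constructor` property that the pandas documentation describes for subclasses; this sketch only exercises one method and does not show that every DataFrame method preserves the subclass.

```python
import pandas as pd


class NaiveCT(pd.DataFrame):
    """Hypothetical subclass with no extra hooks (Option 1 as written)."""


class HookedCT(pd.DataFrame):
    """Hypothetical subclass that overrides the _constructor property."""

    @property
    def _constructor(self):
        # pandas uses this when a method has to build a new frame
        # of the same dimension as the original.
        return HookedCT


# Made-up Country x Time values: rows are countries, columns are years.
values = {"2000": [1.0, 2.0], "2001": [1.5, 2.5]}
naive = NaiveCT(values, index=["AU", "NZ"])
hooked = HookedCT(values, index=["AU", "NZ"])

naive_result = naive.sort_index(ascending=False)
hooked_result = hooked.sort_index(ascending=False)

# Compare the types that come back from the same method call.
print(type(naive_result).__name__)
print(type(hooked_result).__name__)
```

Even with the hook in place, custom attributes and operations that change dimensionality (for example selecting a single column) need separate handling, so the maintenance concern raised in the answer still applies.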
More generally, this is a question of composition vs. inheritance. I would be especially wary of benefit #2 under Option 1 (inheriting Pandas updates and new methods automatically). It might seem great now, but unless you keep a close eye on updates to Pandas (and it is a fast-moving target), you can easily end up with unexpected consequences, and your code will become intertwined with Pandas.
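For comparison, here is a minimal sketch of the composition approach (Option 2). The method names change() and sort_by_country(), the data property, and the sample values are assumptions made for illustration; they are not part of the question or of Pandas. Only the DataFrame calls diff() and sort_index() are real Pandas methods.

```python
import pandas as pd


class CTMatrix:
    """Country x Time container that wraps a DataFrame instead of subclassing it."""

    def __init__(self, data):
        # Copy so later changes to the caller's frame do not leak in.
        self._data = pd.DataFrame(data).copy()

    @property
    def data(self):
        # Read access to the underlying frame for ad hoc Pandas work.
        return self._data

    def change(self):
        # Hypothetical domain method: difference between consecutive years.
        return CTMatrix(self._data.diff(axis=1))

    def sort_by_country(self):
        # Explicit delegation: each exposed operation re-wraps its result.
        return CTMatrix(self._data.sort_index())


# Made-up values: rows are countries, columns are years.
frame = pd.DataFrame({2000: [1.0, 2.0], 2001: [1.5, 2.5]}, index=["NZ", "AU"])
ct = CTMatrix(frame)

delta = ct.change()
ordered = ct.sort_by_country()
print(type(delta).__name__, list(ordered.data.index))
```

Every method the wrapper exposes returns a CTMatrix, because the wrapper decides what to re-wrap. The cost is the one noted in the question: each DataFrame operation you want has to be delegated explicitly or reached through the data attribute.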