Build stand-alone but composable "atomic" filter functions for a SQL database/Pandas dataframe?


Problem description

Okay, I asked this: Functional chaining / composing filter functions of DataFrame in Python? and it was erroneously marked as a duplicate.

So we're trying again:

What I have is a bunch of data that I can load as a SQL table or a Pandas dataframe. What I'd like to do is offer a bunch of simple filter functions that can be composed (but I will not know the order of composition until run time). So I want to offer the user an as-syntactically-simple-as-possible interface to these functions.

Ideally, what I'd like to do is offer the user a toolbox of functions (say, is_size, is_red, max_price, for_male, for_female, is_shirt, etc.) and then let them mix and match those however they'd like to get their result:

result = my_clothes.is_red().is_size('large').max_price(100).for_male()

which, of course, would return the same as

result = my_clothes.max_price(100).is_size('large').for_male().is_red(), etc.

Now, as I stated in the previous question, I can do this in Pandas using pipes:

def get_color(df, color):
    return df[df['color'] == color]

def is_shirt(df):
    return df[df['shirt'] == True]

# poll is an existing DataFrame with 'shirt' and 'color' columns
(poll.pipe(is_shirt)
     .pipe(get_color, color='red')
)

That's a little ugly syntactically for the audience this library is intended for.
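Since the order of composition is only known at run time, one way to keep the plain filter functions and still apply them in any order is to fold a list of them over the dataframe. This is only a sketch: the helper apply_filters, the sample data, and the column names type, color and price are assumptions made up for illustration.

```python
from functools import reduce

import pandas as pd


def is_shirt(df):
    return df[df["type"] == "shirt"]


def get_color(df, color):
    return df[df["color"] == color]


def max_price(df, price):
    return df[df["price"] < price]


def apply_filters(df, filters):
    """Apply (function, kwargs) pairs to df, left to right (hypothetical helper)."""
    return reduce(lambda acc, step: step[0](acc, **step[1]), filters, df)


# Made-up sample data, just to exercise the filters.
clothes = pd.DataFrame(
    {
        "type": ["shirt", "shirt", "hat", "shirt"],
        "color": ["red", "blue", "red", "red"],
        "price": [15, 30, 10, 120],
    }
)

# The list could be assembled from user input at run time.
chosen = [(is_shirt, {}), (get_color, {"color": "red"}), (max_price, {"price": 100})]
result = apply_filters(clothes, chosen)

# Row filters of this kind commute, so reversing the order keeps the same rows.
reordered = apply_filters(clothes, list(reversed(chosen)))
```

This does not fix the syntax complaint, but it shows that the filters themselves can stay as ordinary stand-alone functions.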

I also figured out a way to build a class around the dataframe. It has a "chain" member that gets built up by each filter, followed by a call to a "done" function that signals we're returning the dataframe we've constructed:

import pandas as pd

class wrapper_for_dataframe():
    my_data = pd.DataFrame()  # the actual data
    chain = pd.DataFrame()    # working copy that the filters narrow down

    def is_shirt_chain(self):
        self.chain = self.chain[self.chain.type == 'shirt']
        return self

    def max_price_chain(self, price):
        self.chain = self.chain[self.chain.price < price]
        return self

    def done(self):
        temp = self.chain.copy()
        # reset the chain to the full data for the next query
        # (the original snippet copied from self.schedule here; my_data is assumed)
        self.chain = self.my_data.copy()
        return temp

So, with that, I can do things like:

result = wrapper_for_dataframe_instance.is_shirt_chain().max_price_chain(200).done()

(Note: the above notation may not be 100% correct; it's simplified from something else I built, but I can get that to work)

So this is closer, but it suffers from the usual problems you get when you build a wrapper around a dataframe; it's sort of a pain to do "normal" pandas stuff with the DF (seemingly you have to build a function for everything, although I think there's probably a way to pass normal Pandas functions to the underlying dataframe).

There are a number of other reasons why this is bad (what happens if you have more than 1 chain at a time? Chaos? Is having an "Extra" copy of the data a good idea? Probably not)

So, is there another way of doing this? I think Django has this facility, but that's a little heavyweight.

Another thought was SQLAlchemy; I could shift the whole thing out of Pandas and into the SQL realm and build functions that make use of the or_ function and SQLAlchemy filtering (like this: Using OR in SQLAlchemy). But that means I have to learn SQLAlchemy (which I will do, if that's the best solution here).
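For what it's worth, the SQL-side version of the idea can be sketched without any ORM at all. The following uses the standard-library sqlite3 module rather than SQLAlchemy, and combines filters with AND only; the Query class, the clothes table and its columns are all hypothetical names invented for this illustration.

```python
import sqlite3


class Query:
    """Immutable bundle of WHERE fragments and their bound parameters."""

    def __init__(self, table, clauses=(), params=()):
        self.table = table  # trusted table name, never user input
        self.clauses = tuple(clauses)
        self.params = tuple(params)

    def _add(self, clause, *params):
        # Return a new Query so separate chains never share state.
        return Query(self.table, self.clauses + (clause,), self.params + params)

    def is_shirt(self):
        return self._add("type = ?", "shirt")

    def is_red(self):
        return self._add("color = ?", "red")

    def max_price(self, price):
        return self._add("price < ?", price)

    def sql(self):
        where = " AND ".join(self.clauses) or "1 = 1"
        return f"SELECT * FROM {self.table} WHERE {where}"

    def run(self, conn):
        return conn.execute(self.sql(), self.params).fetchall()


# Made-up in-memory table, just to exercise the filters.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clothes (type TEXT, color TEXT, price REAL)")
conn.executemany(
    "INSERT INTO clothes VALUES (?, ?, ?)",
    [("shirt", "red", 15), ("shirt", "blue", 30), ("hat", "red", 10)],
)

my_clothes = Query("clothes")
rows = my_clothes.is_red().is_shirt().max_price(100).run(conn)
same_rows = my_clothes.max_price(100).is_shirt().is_red().run(conn)
```

Because every filter returns a fresh Query, several chains can start from the same my_clothes object without stepping on each other, and nothing is fetched until run is called.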

Anyway, any ideas? Thanks.

Solution

So, using hints from this: How to redirect all methods of a contained class in Python? I think I can do it:

import pandas as pd

def tocontainer(func):
    # Wrap a DataFrame method so that DataFrame results get re-wrapped;
    # anything else (scalars, Series, None) is returned unchanged.
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, pd.DataFrame):
            result = Container(result)
        return result
    return wrapper

class Container(object):
    def __init__(self, df):
        self.contained = df

    def __str__(self):
        # __str__ must return a string rather than display one
        return str(self.contained)

    def __getitem__(self, item):
        result = self.contained[item]
        if isinstance(result, type(self.contained)):
            result = Container(result)
        return result

    def __getattr__(self, item):
        # Only called when normal lookup fails: forward to the dataframe
        result = getattr(self.contained, item)
        if callable(result):
            result = tocontainer(result)
        return result

    def __repr__(self):
        return repr(self.contained)

    def max_price(self, cost):
        return Container(self.contained[self.contained.price < cost])

    def is_shirt(self):
        return Container(self.contained[self.contained.is_shirt == True])

    def _repr_html_(self):
        return self.contained._repr_html_()

so I can do things like:

my_data = pd.read_csv('my_data.csv')
my_clothes = Container(my_data)
cheap_shirts = my_clothes.is_shirt().max_price(20)

which is exactly what I wanted. Note the necessary calls to wrap the contained dataframe back up into the container class in each simple filter function. This may be bad for memory reasons, but it's the best solution I can think of so far.
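A possible alternative to wrapping, offered only as a sketch: pandas supports subclassing DataFrame and overriding the _constructor property, so that operations which produce a new frame hand back the subclass. The filters then need no explicit re-wrapping and every normal pandas method stays available. The class name Clothes, the sample data and the column names type and price are assumptions for illustration.

```python
import pandas as pd


class Clothes(pd.DataFrame):
    """DataFrame subclass whose filters return Clothes again."""

    @property
    def _constructor(self):
        # pandas uses this when an operation produces a new frame,
        # so boolean indexing keeps the subclass.
        return Clothes

    def is_shirt(self):
        return self[self["type"] == "shirt"]

    def max_price(self, cost):
        return self[self["price"] < cost]


# Made-up sample data, just to exercise the filters.
my_clothes = Clothes({"type": ["shirt", "hat", "shirt"], "price": [15, 10, 120]})
cheap_shirts = my_clothes.is_shirt().max_price(20)
```

One thing to watch for with this approach: a filter method that shares its name with a column (is_shirt in the Container version) hides attribute-style access to that column, so bracket access such as self["is_shirt"] would be needed inside the method.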

I'm sure I'll run into some of the caveats mentioned in the above-linked SO answer, but this will work for now. I see many variations on this question (but not quite the same), so I hope this helps someone.

ADDED BONUS: Took me a while to figure out how to get the data frames of a composed class to look nice in IPython, but the _repr_html_ function does the trick (note the single, not double, underscore).
