同时最大化/优化 3 个结果 [英] Maximizing / Optimizing 3 results at the same time

查看:45
本文介绍了同时最大化/优化 3 个结果的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含 100 多列和 3500 行的 csv 文件,它看起来像这样(只是一个例子):

将pandas导入为pd数据 = pd.DataFrame(data={'利润': [90, -70, 111, 40, -5, -1],'Crit1': [真,真,假,真,假,真],'Crit2':[假,假,假,真,真,假],'Crit3':[真,真,假,真,真,真],'Crit4':[假,真,真,假,假,假],'Crit5': [真、假、假、真、真、真]})

我想定义 3 个结果:

1 - totalProfit:是Profit"列的总和

2 - posValues:列结果中有多少正值

3 - negValues:列结果中有多少负值

totalProfit = data['Profit'].sum() # 总和为 165

在这个例子中,posValues 为 3,negValues 为 3.但我不知道如何用公式计算它们.

我想找到过滤的列的最佳组合(真/假),以增加 posValues 并减少 negValues,同时最大化 totalProfit.

猜测,我认为最好的组合是:Crit1 和 Crit5 设置为 True

print(data[(data.Crit5 == True) & (data.Crit1 == True)])totalResult = data['Profit'][(data.Crit5 == True) &(data.Crit1 == True)].sum()打印(总结果)

因此,通过这种组合,我们将得到 totalResult = 129、posValues = 2 和 negValues = 1 并且组合是:将 Crit1 和 Crit5 设置为 True

请记住,过滤所有列不是强制性的,我可以有一些未过滤的(如示例中所示).

我怎样才能有一个代码来增加 posValues 并减少 negValues,同时最大化 totalProfit,并显示 True/False 列的最佳组合?

解决方案

TL;DR

我们在这里提出了一种有效的算法,虽然仍然可能具有指数时间的最坏情况(尽管正式分析可能证明即使最坏情况也比这好得多),但实际上具有 O(n^2) 中值时间,适用于数百列.作为测试,在 15min31s 上找到了随机 n=120(选择器列)x nrows=3500 df 的解决方案单核.

警告:下面的答案相当长.

更好的算法解决方案

虽然我的

但是,我们可以通过使用下一节中描述的算法来找到避免探索完整组合的方法.

波束搜索算法

此对数图显示了蛮力搜索和光束搜索的单个运行时间(点)和中值时间(线).正如预期的那样,蛮力时间是指数级的.波束搜索速度要快得多,中值时间似乎是 n 多项式(对数图上的一条直线).使用 2 次多项式可以获得对中值时间的合理拟合(因此,O(n^2)):

请注意,最大尺寸 n=64,在数千次运行中,我们从未达到我们设置的最大预算 max_iters = 100_000.因此,这些解决方案都是最优的.

另请注意,实际时间存在长尾离散,具体取决于搜索面的复杂性.这是 n = 64 的各个时间直方图:

但这可以通过提前停止搜索的能力得到缓解.事实上,我们的观察是,即使在最长的情况下,非常好的解决方案(或通常是最优方案本身)在搜索的早期就会出现.

I have a csv file with more than 100 columns and 3500 rows which looks like this (just an example):

import pandas as pd

data = pd.DataFrame(data={
    'Profit': [90, -70, 111, 40, -5, -1],
    'Crit1': [True, True, False, True, False, True],
    'Crit2': [False, False, False, True, True, False],
    'Crit3': [True, True, False, True, True, True],
    'Crit4': [False, True, True, False, False, False],
    'Crit5': [True, False, False, True, True, True]
})

I'd like to define 3 results:

1 - totalProfit: is the sum of the column "Profit"

2 - posValues: how many positive values in the column result

3 - negValues: how many negative values in the column result

totalProfit = data['Profit'].sum() # The sum is 165

In this example, posValues will be 3 and negValues will be 3. But I don't know how to count them with a formula.

I'd like to find the best combination of the columns filtered (true / false) to increase the posValues and decrease the negValues, while maximising the totalProfit.

By guess, I think that the best combination is : Crit1 and Crit5 set to True

print(data[(data.Crit5 == True) & (data.Crit1 == True)])

totalResult = data['Profit'][(data.Crit5 == True)  & (data.Crit1 == True)].sum()

print(totalResult)

So, with this combination, we'll have totalResult = 129, posValues = 2 and negValues = 1 and the combination is : Set Crit1 and Crit5 to True

Please keep in mind that, it's not mandatory to filter all the columns, I can have some unfiltered (as in the example).

How can I have a code that will increase the posValues and decrease the negValues, while maximising the totalProfit, and display what combination of True/False columns is the best, please ?

解决方案

TL;DR

We present here an efficient algorithm that, while still potentially having a worst-case that is exponential time (although a formal analysis might prove that even the worst-case is much better than that), in practice has O(n^2) median time and is practical for hundreds of columns. As a test, the solution for a random n=120 (selector columns) x nrows=3500 df was found in 15min31s on a single core.

Warning: fairly long answer below.

A better algorithmic solution

While my other answer focused on the trivial question of how to use pandas for this problem, it only provided a brute-force algorithm (simple and appropriate for small number of columns). In this answer, I address the computational complexity and how to cut it down to handle the kind of scale the OP mentioned (more than 100 columns, thousands of rows).

My initial intuition was that the problem may be NP-hard. The brute force approach has exponential complexity (it explicitly considers the O(2^n) combinations of n columns). That clearly is infeasible for even medium-size n. Other classic problems fall into that category, e.g. the Travelling Salesman and the deceptively simple Subset sum.

There is no obvious way to decide that a partial solution in the search path of all combinations is going to lead to the optimum. In fact, observing the solutions found by brute-force, ordered by descending score, shows that that their arrangement can be quite intricate. Here is a heat map of all column combinations for a randomly generated df with n=9 (and 500 rows). The rows have been sorted by the original df.profit descending. Columns are all the 512 solutions, ordered from best (left) to worst (right). The content of the heatmap is the resulting effective filter (or-ing all boolean columns selected by that solution), True shown as black and False as white:

We can, however, find ways to avoid exploring the full combinations by using an algorithm described in the next section.

Beam search algorithm

A Beam search is "a heuristic algorithm that explores a graph by expanding the most promising node in a limited set."

We use this approach to maintain a set of candidates (each a promising column subset) that are worthy of exploration. While often used as a heuristic to find reasonably good solutions (but not necessarily the best), in our case we only discard paths that can provably not lead to the optimum. As such, as long as we let the algorithm complete, we are guaranteed to deliver the best solution. We also observed that interrupting the search partway still yields excellent solutions, and often the optimum itself.

Definition of a candidate

A candidate is a particular column subset (a selection of columns to use in filtering df) and has a score corresponding to the profit it yields. In addition to being a solution on its own, it can also be used as prefix for "descendant" candidates, by merging it with another candidate. For example, a candidate {a,c} can be merged with another {d,z} to produce a new candidate {a,c,d,z}.

For a candidate x, we store the following quantities:

  • x.k: the column set
  • x.p: indicator of rows where profit > 0
  • x.n: indicator of rows where profit < 0 (we disregard profit == 0)
  • x.tot: sum(profit | (x.p + x.n)) (total profit for that candidate)
  • x.totp: sum(profit | x.p) (sum of positive-only profit for that candidate)

Cutting the exploration path

Some observations lead us to aggressively cut down the exploration path:

  1. The maximum profit that a candidate x and all its descendants can ever hope to achieve is x.totp: when adding columns in the column set, one could perhaps reduce the set of negative rows, but never increase the set of positive ones. So x.totp is the upper bound of any profit down the exploration path of this candidate.
  2. Each candidate should be uniquely visited: after visiting {a,b} by merging {a} with {b}, we could inadvertently revisit the same candidate by merging {b} with {a}. Thus, when merging x with y, we require that all columns in x.k are strictly smaller than every column y.k (max(x.k) < min(y.k)).
  3. When considering a merge of two candidates, there is no opportunity for improvement if either set of negative columns in one is a subset of the other: reject if x.n < y.n or y.n < x.n (< here being the 'is strict subset' set operator).

This leads to the following implementation:

class candidate(namedtuple('Cand', 'tot totp k p n')):
    """
    A candidate solution or partial solution.
    
    k: (colset) set of columns (e.g. {'a', 'b', 'c'})
    tot: sum(profit | (p | n)) (sum of profit for all rows selected by this colset)
    totp: sum(profit | p) (sum of profit for positive-only rows selected by this colset)
    p: bool np.array indicating where profit > 0 for this colset
    n: bool np.array indicating where profit < 0 for this colset
    """
    def name(self):
        cols = ''.join(sorted(self.k))
        return cols

    def __str__(self):
        cols = ''.join(sorted(self.k))
        return f'{cols} ({self.tot:.2f}, max {self.totp:.2f}, '
               f'|p| = {self.p.sum()}, |n| = {self.n.sum()})'

    def __repr__(self):
        return str(self)

def make_candidate(df, k):
    truth = df[k].values if k else np.ones(df.shape[0], dtype=bool)
    xk = frozenset({k} if k else {})  # frozenset can be used as dict key, if needed
    xp = (truth & (df['profit'] > 0)).values
    xn = (truth & (df['profit'] < 0)).values
    xtotp = df.loc[xp, 'profit'].sum()
    xtot = xtotp + df.loc[xn, 'profit'].sum()
    return candidate(xtot, xtotp, xk, xp, xn)    

def merge(beam, x, y):
    """merge two candidates x, y if deemed viable, else return None"""
    if max(x.k) >= min(y.k):
        return None  # avoid visiting same colset several times
    if (x.totp < y.tot or y.totp < x.tot:
        return None  # z could never best x or y
    zn = x.n * y.n  # intersection
    zp = x.p * y.p
    ztotp = beam.df.loc[zp, 'profit'].sum()
    if ztotp < beam.best.tot or ztotp <= x.tot or ztotp <= y.tot:
        return None  # z could never best the beam's best so far, or x or y
    ztot = ztotp + beam.df.loc[zn, 'profit'].sum()
    z = candidate(ztot, ztotp, x.k.union(y.k), zp, zn)
    return z

class Beam:
    def __init__(self, df, best, singles, sol):
        self.df = df
        self.best = best
        self.singles = singles
        self.sol = sol
        self.loops = 0
    
    @classmethod
    def from_df(cls, df):
        cols = [k for k in df.columns if k != 'profit']
        # make solutions, first: empty set, then single-column ones
        oa = make_candidate(df, None)
        singles = [make_candidate(df, k) for k in cols]
        best = max([oa] + singles, key=lambda x: x.tot)
        singles = [x for x in singles if x.totp > best.tot]
        return cls(df, best, singles, singles.copy())

    def add_candidate(self, z):
        if z is None:
            return False
        self.sol.append(z)
        if z.tot > self.best.tot:
            self.best = z
            self.prune()
        return True

    def prune(self):
        """remove solutions that cannot become better than the current best"""
        self.sol = [x for x in self.sol if x.totp >= self.best.tot]

    def __str__(self):
        return f'Beam: best: {self.best}, |sol| = {len(self.sol)}, ' \
               f'|singles| = {len(self.singles)}, loops = {self.loops}'

    def __repr__(self):
        return str(self)
    
    def optimize(self, max_iters=None, report_freq=None):
        i = 0
        while self.sol and (max_iters is None or i < max_iters):
            if report_freq is not None and i % report_freq == 0:
                print(f'loop {i:5d}, beam = {self}')
            x = self.sol.pop(0)
            for y in self.singles:
                self.add_candidate(merge(self, x, y))
            i += 1
            self.loops += 1
        if report_freq:
            print(f'done {i:5d}, beam = {self}')

For experimentation, we can generate random frames:

def gen_colnames():
    for n in count(1):
        for t in combinations(ascii_lowercase, n):
            yield ''.join(t)

def colnames(n):
    if n > len(ascii_lowercase):
        return [f'c{k}' for k in range(n)]
    else:
        return list(islice(gen_colnames(), n))

def generate_df(ncols, nrows = 500):
    df = pd.DataFrame({'profit': np.random.normal(scale=100, size=nrows)})
    cols = pd.DataFrame(np.random.choice([False, True], size=(nrows, ncols)), columns=colnames(ncols))
    df = pd.concat([df, cols], axis=1)
    return df

And then we can test it:

df = generate_df(15)

%%time
res = brute_force_all(df)
# CPU times: user 40.9 s, sys: 68 ms, total: 41 s

%%time
beam = Beam.from_df(df)
beam.optimize()
# CPU times: user 165 ms, sys: 3 µs, total: 165 ms

# verify that we have the same solution as the brute-force
assert beam.best.k == res.colset.iloc[0]

# summary of the beam result:
beam
# Beam: best: bf (2567.64, max 5930.26, |p| = 68, |n| = 51), |sol| = 0, |singles| = 15, loops = 228

Performance

We do not provide a formal analysis of the time or space complexity of our solution (we'll leave that as an exercise for the reader...), but here are some empirical observations obtained on a series of runs on random frames of various sizes:

This log-log plot shows individual run times (points) and median times (lines) for both the brute-force and the beam search. As expected, the brute force time is exponential. The beam search is considerably faster, and the median times appear to be polynomial with n (a straight line on the log-log plot). A reasonably good fit for the median time is obtained with a degree-2 polynom (so, O(n^2)):

Note that, up to size n=64, in thousands of runs we have never hit the maximum budget we had set max_iters = 100_000. The solutions were therefore all optimal.

Note also that there is a long-tail dispersion of the actual time, depending on the complexity of the search surface. Here is the individual times histogram for n = 64:

But this is mitigated by the ability to stop the search early. In fact, our observations are that, even in the longest cases, very good solutions (or often the optimal itself) appear quite early in the search.

这篇关于同时最大化/优化 3 个结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆