Filtering a Pandas DataFrame Using a Function

Question

This question is related to the question I posted yesterday, which can be found here.

So, I went ahead and applied the solution provided by Jan to the entire data set. The solution is as follows:

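import re

# Characters counted as "English": ASCII letters, digits, underscore, space, hyphen.
ENGLISH_CHARACTER = re.compile(r'[-a-zA-Z0-9_ ]')

def is_probably_english(row, threshold=0.90):
    # Fraction of characters in the app name that match the pattern above.
    name = row['App']
    matching = [character for character in name if ENGLISH_CHARACTER.search(character)]
    quotient = len(matching) / len(name)  # note: raises ZeroDivisionError for an empty name
    return quotient >= threshold

# One True/False value per row.
google_play_store_is_probably_english = google_play_store_no_duplicates.apply(is_probably_english, axis=1)

# Keep only the rows where the value above is True.
google_play_store_english = google_play_store_no_duplicates[google_play_store_is_probably_english]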

So, from what I understand, we apply the is_probably_english function to every row of the google_play_store_no_duplicates DataFrame and store the results, one boolean per row, in a separate object (google_play_store_is_probably_english). That object is then used to filter out the non-English apps in the google_play_store_no_duplicates DataFrame, and the end result is stored in a new DataFrame (google_play_store_english).

Does this make sense and does it seem like a sound way to approach the problem? Is there a better way to do this?

Recommended Answer

This makes sense, and I think this is the best way to do it. The function returns a boolean, as you said, and when you apply it to each row of the DataFrame (axis=1) you end up with a pd.Series of booleans, which is usually called a boolean mask. This concept is very useful in pandas whenever you want to filter rows by some condition.
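As a minimal sketch of the mask idea: the sample rows and the is_ascii_name helper below are made up for illustration (the real data set is not shown in the question), and the helper is a simplified stand-in for is_probably_english rather than the same test.

```python
import pandas as pd

# Hypothetical sample data, not taken from the real data set.
apps = pd.DataFrame({
    "App": ["Photo Editor", "中国語 AQリスニング", "Instagram"],
    "Rating": [4.1, 4.0, 4.5],
})

def is_ascii_name(row):
    # Simplified stand-in predicate: True when every character in the name is ASCII.
    return row["App"].isascii()

# apply(..., axis=1) calls the function once per row and returns a boolean Series.
mask = apps.apply(is_ascii_name, axis=1)

# Indexing the DataFrame with the mask keeps only the rows where it is True.
english_apps = apps[mask]
```

Here mask holds one value per row of apps, and apps[mask] returns a new DataFrame with the original index labels of the rows that passed.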

Here is an article about boolean masks in pandas.
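On the "is there a better way" part, one possible alternative (not from the original answer) is to build the same kind of mask with pandas string methods instead of a row-wise apply. This is only a sketch: probably_english_mask and the sample names are hypothetical, and it assumes the App column contains strings.

```python
import pandas as pd

def probably_english_mask(names, threshold=0.90):
    # Count the characters that match the character class used in the question.
    matching = names.str.count(r"[-a-zA-Z0-9_ ]")
    total = names.str.len()
    # Compare the matching fraction with the threshold, one value per name.
    return (matching / total) >= threshold

# Hypothetical sample data for illustration only.
apps = pd.DataFrame({"App": ["Photo Editor", "爱奇艺", "Instagram"]})

mask = probably_english_mask(apps["App"])
english_apps = apps[mask]
```

The result is still a boolean Series used in exactly the same way; the difference is that the counting runs over the whole column at once rather than calling a Python function per row.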
