Pandas .str.replace and case insensitivity


Problem description

Making the replacement case-insensitive does not seem to have any effect in the following example (I want to replace jr. or Jr. with jr):

In [0]: pd.Series('Jr. eng').str.replace('jr.', 'jr', regex=False, case=False)
Out[0]: 0    Jr. eng

Why? What am I misunderstanding?

Solution

The case argument is a convenience alternative to specifying flags=re.IGNORECASE. It has no bearing on the replacement if the replacement is not regex-based.

So, when regex=True, these are your possible choices:

pd.Series('Jr. eng').str.replace(r'jr\.', 'jr', regex=True, case=False)
# regex=True can be omitted only on pandas versions where it is the default (before 2.0):
# pd.Series('Jr. eng').str.replace(r'jr\.', 'jr', case=False)

0    jr eng
dtype: object

Or,

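import re  # needed for re.IGNORECASE
pd.Series('Jr. eng').str.replace(r'jr\.', 'jr', regex=True, flags=re.IGNORECASE)
# regex=True can be omitted only on pandas versions where it is the default (before 2.0):
# pd.Series('Jr. eng').str.replace(r'jr\.', 'jr', flags=re.IGNORECASE)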

0    jr eng
dtype: object

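You can also get cheeky and bypass both the case and flags keyword arguments by embedding the inline case-insensitivity flag (?i) at the start of the pattern. Since pandas 2.0 the regex argument defaults to False, so pass regex=True explicitly:

pd.Series('Jr. eng').str.replace(r'(?i)jr\.', 'jr', regex=True)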
0    jr eng
dtype: object

Note
You need to escape the period as \. in regex mode, because an unescaped dot is a metacharacter with a different meaning (it matches any character). If you want to escape metacharacters in patterns dynamically, you can use re.escape.
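As a minimal sketch of the re.escape suggestion above: the sample Series and the names literal, pattern, and result are made up for illustration, and regex=True is passed explicitly so the call does not depend on the pandas default.

```python
import re

import pandas as pd

# Hypothetical sample data, for illustration only.
s = pd.Series(["Jr. eng", "JR. dev", "Sr. eng", "Jrx eng"])

literal = "jr."               # text we want to match literally, in any letter case
pattern = re.escape(literal)  # escapes the dot so it matches only a real "."

# regex=True makes the replacement regex-based, so case=False is honored.
result = s.str.replace(pattern, "jr", regex=True, case=False)
print(result.tolist())  # ['jr eng', 'jr dev', 'Sr. eng', 'Jrx eng']
```

Because the dot is escaped, "Jrx eng" is left alone; with the unescaped pattern 'jr.' it would also have been rewritten.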

For more information on flags and anchors, see the Python re module documentation and the Regular Expression HOWTO.


From the pandas source code, it is clear that the case argument is ignored if regex=False:

# Check whether repl is valid (GH 13438, GH 15055)
if not (is_string_like(repl) or callable(repl)):
    raise TypeError("repl must be a string or callable")

is_compiled_re = is_re(pat)
if regex:
    if is_compiled_re:
        if (case is not None) or (flags != 0):
            raise ValueError("case and flags cannot be set"
                             " when pat is a compiled regex")
    else:
        # not a compiled regex
        # set default case
        if case is None:
            case = True

        # add case flag, if provided
        if case is False:
            flags |= re.IGNORECASE
    if is_compiled_re or len(pat) > 1 or flags or callable(repl):
        n = n if n >= 0 else 0
        compiled = re.compile(pat, flags=flags)
        f = lambda x: compiled.sub(repl=repl, string=x, count=n)
    else:
        f = lambda x: x.replace(pat, repl, n)

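You can see that the case argument is only checked inside the if regex: branch. When regex=False, the code falls through to the plain x.replace(pat, repl, n) call, which is always case-sensitive.

In other words, the only way to get a case-insensitive replacement here is to ensure regex=True so that the replacement is regex-based.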
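The difference between the two code paths can be sketched with the standard library alone. This is an illustration of the idea, not the pandas implementation; replace_ignore_case is a hypothetical helper name.

```python
import re


def replace_ignore_case(text, literal, repl):
    """Replace every case-insensitive occurrence of a literal string.

    Hypothetical helper: it mirrors the regex path (compile with
    re.IGNORECASE, then sub), with the literal escaped first.
    Note that repl is treated as a regex replacement template.
    """
    compiled = re.compile(re.escape(literal), flags=re.IGNORECASE)
    return compiled.sub(repl, text)


# Plain str.replace has no notion of flags, so the match is case-sensitive.
plain = "Jr. eng".replace("jr.", "jr")

# The compiled-regex path honors IGNORECASE.
insensitive = replace_ignore_case("Jr. eng", "jr.", "jr")

print(plain)        # Jr. eng
print(insensitive)  # jr eng
```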
