Separate comma-separated values within individual cells of Pandas Series using regex

Problem description

I have a CSV file exported from a database that I've converted into a Pandas DataFrame and am trying to clean up. One of the issues is that multiple values have been entered into single cells and need to be split up. The complicating factor is that there are string comments (also containing commas) that need to be kept intact. The problem is illustrated in the example below, in Series form.

What I have:

Index  |  values    
0      | 2.54,3.563
1      | bad design, right?

What I want:

Index  |   level_0   |  values      
0      |     0       |    2.54   
1      |     0       |    3.563 
2      |     1       |    bad design, right?      

As you can see, there are commas separating the values I want to split, with no following space after the comma, while the commas in string comments all have spaces after them. Seems like an easy thing to apply regex to split up. My solution below, using a strategy taken from another StackOverflow solution, is to use Series.str.split to separate the values into separate columns, then stack the columns. That strategy works great. However, in this case, the regex is apparently not identifying the split. Here's my code:

import pandas as pd

# Example Series:
data = pd.Series(("2.54,3.56", "3.24,5.864", "bad design, right?"), name = "values")

# Split cells with multiple entries into separate rows 
split_data = data.str.split('[,]\b').apply(pd.Series)

# Stack the results and pull out the index into a column (which is sample number in my case)
split_data = split_data.stack().reset_index(0)
split_data = split_data.reset_index(drop=True)

I'm new to regular expressions, but from the guides I've looked at and from using a couple regex sandboxes specific to Python, it seems like the regex [,]\b should split the values, but not the comments. However, it does not split anything with this regex.

Here's the result of the debugger, which says this should work: Debuggex Demo
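One likely reason for the mismatch between the debugger and the code, offered as an observation rather than as part of the original answer: an online regex tester receives the pattern text as typed, whereas Python first processes the string literal. In a normal (non-raw) string literal, `\b` is the backspace character, so the regex engine never sees a word-boundary token. A minimal sketch using only the standard `re` module (the variable names are illustrative):

```python
import re

plain = '[,]\b'   # non-raw literal: '\b' is the backspace character (0x08)
raw = r'[,]\b'    # raw literal: the backslash survives, so the regex sees a word boundary

print(repr(plain))                           # the last character is '\x08'
print(re.split(plain, "2.54,3.56"))          # no split: there is no comma followed by a backspace
print(re.split(raw, "2.54,3.56"))            # splits into '2.54' and '3.56'
print(re.split(raw, "bad design, right?"))   # no split: the comma is followed by a space
```

Writing regex patterns as raw strings (`r'...'`) avoids this class of problem in general.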

Am I missing something easy here? Any better ideas on making this work? I'm using Python 3.5, if that makes a difference. Thanks.

Recommended answer

I would be inclined to use a lookahead; how you do so depends on your expected data.

The first option is a negative lookahead, shown below. It matches "a comma that is not followed by whitespace". It is the better choice if you are sure that every comma inside a comment is followed by whitespace and you want something like "red,green" to be split.

data.str.split(r'[,](?!\s)').apply(pd.Series)
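As a rough end-to-end sketch of the negative-lookahead approach on the question's example Series: this version uses `Series.explode` instead of the `apply(pd.Series)` plus `stack` strategy from the question, which is a substitution on my part and assumes a pandas version that provides `explode` (0.25 or later). The names `parts`, `long_form` and `result` are illustrative.

```python
import pandas as pd

data = pd.Series(["2.54,3.56", "3.24,5.864", "bad design, right?"], name="values")

# Split only on commas that are NOT followed by whitespace.
# The pattern is longer than one character, so pandas treats it as a regex.
parts = data.str.split(r",(?!\s)")

# One row per piece; the original row label is repeated in the index.
long_form = parts.explode()

# Move the original row label into a column (named "index" by default)
# and renumber the rows from 0.
result = long_form.reset_index()
print(result)
```

The pieces are still strings at this point; converting the numeric ones is a separate step.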

Another option is a positive lookahead for something that looks like a valid value; your example was numbers, so for instance this would split only on a comma that is followed by a number:

data.str.split(r'[,](?=\d)').apply(pd.Series)
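One detail worth checking when writing the positive lookahead: the lookahead syntax is `(?=...)`, whereas `(?:...)` is a non-capturing group. A non-capturing group makes the digit part of the match, so a split would discard it. A small sketch with the standard `re` module, using the sample strings from the question (the variable names are illustrative):

```python
import re

value = "2.54,3.563"
comment = "bad design, right?"

# Positive lookahead: the digit is checked but not consumed by the match.
lookahead = re.compile(r",(?=\d)")
# Non-capturing group: the digit is part of the match, so split() removes it.
non_capturing = re.compile(r",(?:\d)")

print(lookahead.split(value))       # ['2.54', '3.563']
print(non_capturing.split(value))   # ['2.54', '.563'] (the leading 3 is lost)
print(lookahead.split(comment))     # ['bad design, right?']
```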

Regular expressions are very powerful, but honestly, I am not sure that this solution will be great for you if this is a long-term problem. Getting most cases right as a one-time migration should be fine, but longer term I would consider trying to solve the problem before it gets here. Anyway, here's Debuggex's python regex cheat sheet, in case it is useful to you: https://www.debuggex.com/cheatsheet/regex/python
