Extracting text from elements in Pandas column, writing to new column

Problem description

I have some data in a column (COL_NAME) of a Pandas DataFrame. I'd like to extract the text between '(' and ')'. For a given row, either this data exists or there are no parentheses at all, although there may be more than one set of parentheses in the data. I'd then like to write the text inside the parentheses to another column and remove the '(XXX)' from the original string.

That is, this:

COL_NAME
========
(info) text (yay!)
I love text
Text is fun
(more info) more text
lotsa text (boo!)

should become:

COL_NAME          NEW_COL
========          =======
text (yay!)       info
I love text       None
Text is fun       None
more text         more info
lotsa text (boo!) None

I can do this by isolating the column, iterating through its elements, splitting on the '(', creating two new lists, and then adding them to the DataFrame, but there's assuredly a more Pythonic/Pandic way of doing this, right?

Thanks!

Recommended answer

It isn't clear why the second set of parentheses shouldn't match. Maybe because of the ! character.

Then you can use str.extract with a regular expression.

The regular expression \(([A-Za-z0-9 _]+)\) means:

  1. \( matches a literal ( character
  2. ( begins a new group
  3. [A-Za-z0-9 _] is a character set matching any letter (upper or lower case), digit, underscore, or space
  4. + matches the preceding element (the character set) one or more times
  5. ) ends the group
  6. \) matches a literal ) character
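
The breakdown above can be checked outside pandas with Python's standard re module. This is only a minimal sketch: the names PATTERN, samples, and extracted are illustrative and not part of the original answer.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so that each
# piece can carry the matching comment from the list above.
PATTERN = re.compile(r"""
    \(                 # a literal opening parenthesis
    (                  # begin a capturing group
    [A-Za-z0-9 _]+     # one or more letters, digits, spaces or underscores
    )                  # end the group
    \)                 # a literal closing parenthesis
""", re.VERBOSE)

samples = ["(info) text (yay!)", "I love text", "(more info) more text"]

extracted = []
for s in samples:
    m = PATTERN.search(s)  # first match only, or None
    extracted.append(m.group(1) if m else None)

print(extracted)  # ['info', None, 'more info']
```

Note that whitespace inside a character class is kept even in verbose mode, so the space in [A-Za-z0-9 _] still counts.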

The second set of parentheses isn't matched because the regex excludes the ! character: it isn't in the character set [A-Za-z0-9 _].
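
To see the effect of the character set concretely, here is a small sketch using the standard re module. The second pattern, LOOSE, is an assumption added purely for comparison and is not in the original answer: it accepts any character except a closing parenthesis, so it would also pick up (yay!).

```python
import re

STRICT = r"\(([A-Za-z0-9 _]+)\)"  # pattern from the answer
LOOSE = r"\(([^)]+)\)"            # hypothetical alternative: anything but ')'

text = "(info) text (yay!)"

strict_hits = re.findall(STRICT, text)
loose_hits = re.findall(LOOSE, text)

print(strict_hits)  # ['info']
print(loose_hits)   # ['info', 'yay!']
```

Which behaviour is wanted depends on the data; the question's expected output keeps (yay!) in place, which is what the stricter pattern gives.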

import io

import pandas as pd

temp = """(info) text (yay!)
I love text
Text is fun
(more info) more text
lotsa text (boo!)"""

# Read one string per line into a single column named 'original'
df = pd.read_csv(io.StringIO(temp), header=None, names=['original'])
print(df)
#                original
#0     (info) text (yay!)
#1            I love text
#2            Text is fun
#3  (more info) more text
#4      lotsa text (boo!)

pattern = r"\(([A-Za-z0-9 _]+)\)"
# Capture the text inside the first matching parentheses (NaN if none);
# expand=False returns a Series for a single group
df['col1'] = df['original'].str.extract(pattern, expand=False)
# Remove the matched "(...)" and trim the leftover whitespace;
# regex=True is required in recent pandas versions
df['col2'] = df['original'].str.replace(pattern, "", regex=True).str.strip()
print(df)
#                original       col1               col2
#0     (info) text (yay!)       info        text (yay!)
#1            I love text        NaN        I love text
#2            Text is fun        NaN        Text is fun
#3  (more info) more text  more info          more text
#4      lotsa text (boo!)        NaN  lotsa text (boo!)
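
One detail worth noting: str.extract captures only the first match in each string, whereas str.replace removes every match by default. If a row contained several alphanumeric groups in parentheses, the two new columns would no longer correspond. The sketch below shows one possible way to keep them consistent by limiting the replacement to a single occurrence with n=1. The sample row "(a) keep (b)" and the variable names are made up for illustration, and regex=True is passed explicitly on the assumption of a recent pandas version.

```python
import pandas as pd

pat = r"\(([A-Za-z0-9 _]+)\)"
df = pd.DataFrame(
    {"original": ["(info) text (yay!)", "(a) keep (b)", "I love text"]}
)

# expand=False returns a Series when the pattern has a single group
df["col1"] = df["original"].str.extract(pat, expand=False)

# n=1 removes only the first match, mirroring what extract captured;
# strip() drops the space left behind at the start of the string
df["col2"] = df["original"].str.replace(pat, "", n=1, regex=True).str.strip()

print(df["col1"].tolist())
print(df["col2"].tolist())  # ['text (yay!)', 'keep (b)', 'I love text']
```

Without n=1, the second row would lose both (a) and (b) even though only a ends up in col1.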
