Python Pandas - How to format and split text in a column?


Problem description

I have a set of strings in a dataframe, as shown below:

ID TextColumn
1 This is line number one
2 I love pandas, they are so puffy
3 [This $tring is with specia| characters, yes it is!]

A. I want to format this string to eliminate all the special characters.
B. Once formatted, I'd like to get a list of unique words (whitespace being the only separator).

This is the code I wrote. The get_df_by_id dataframe holds one selected row, say ID 3:

#replace all special characters
formatted_title = get_df_by_id['title'].str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]' , '')
# then split the words
results = set()
get_df_by_id['title'].str.lower().str.split().apply(results.update)
print results

But when I check the output, I can see that the special characters are still in the list.

Output

set([u'[this', u'is', u'it', u'specia|', u'$tring', u'is!]', u'characters,', u'yes', u'with'])

The intended output should look like this:

set([u'this', u'is', u'it', u'specia', u'tring', u'is', u'characters,', u'yes', u'with'])

Why does the formatted dataframe still retain the special characters?
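One likely cause can be read off the code above: str.replace returns a new Series (stored in formatted_title), but the split is then run on the original get_df_by_id['title'] column, which was never modified. The sketch below reproduces that under assumptions: the one-row frame is a stand-in for the data in the question, and the pattern [^A-Za-z0-9\s] is a simplified substitute for the longer character class, not the asker's exact regex. regex=True is passed explicitly because newer pandas releases no longer treat the pattern as a regex by default.

```python
import pandas as pd

# Hypothetical stand-in for the single-row frame described in the question.
get_df_by_id = pd.DataFrame(
    {"title": ["[This $tring is with specia| characters, yes it is!]"]}
)

# str.replace returns a NEW Series; the 'title' column itself is unchanged.
# The character class here is a simplified assumption, not the original pattern.
formatted_title = get_df_by_id["title"].str.replace(
    r"[^A-Za-z0-9\s]", "", regex=True
)

# Splitting the original column still sees the special characters...
raw_words = set()
for words in get_df_by_id["title"].str.lower().str.split():
    raw_words.update(words)

# ...whereas splitting the cleaned Series does not.
clean_words = set()
for words in formatted_title.str.lower().str.split():
    clean_words.update(words)

print(sorted(raw_words))
print(sorted(clean_words))
```

Using formatted_title (or assigning the result back to the column) in the split step is the smallest change that makes the two steps work together.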

Recommended answer

I think you can first replace the special characters (I added \| to the end of the character class), then lowercase the text and split on \s+ (any run of whitespace). With expand=True the output is a DataFrame, so you can stack it into a Series, then call drop_duplicates and finally tolist:

# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching
print(df['title'].str
                 .replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]', '', regex=True)
                 .str.lower()
                 .str.split(r'\s+', expand=True)
                 .stack()
                 .drop_duplicates()
                 .tolist())

['this', 'is', 'line', 'number', 'one', 'i', 'love', 'pandas', 'they', 'are', 
'so', 'puffy', 'tring', 'with', 'specia', 'characters', 'yes', 'it']
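As a self-contained variant of the same idea, here is a minimal sketch that builds the sample data from the question and uses explode instead of split(expand=True) plus stack. The helper name unique_words, the column name title for the sample frame, and the simplified pattern [^A-Za-z0-9\s] are assumptions made for illustration; they are not part of the original answer. Series.explode requires pandas 0.25 or later.

```python
import pandas as pd

# Sample data reconstructed from the question (column name assumed to be 'title').
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "title": [
        "This is line number one",
        "I love pandas, they are so puffy",
        "[This $tring is with specia| characters, yes it is!]",
    ],
})


def unique_words(column: pd.Series) -> list:
    """Return unique lowercase words in order of first appearance (hypothetical helper)."""
    # Drop everything that is not a letter, digit, or whitespace, then lowercase.
    cleaned = column.str.replace(r"[^A-Za-z0-9\s]", "", regex=True).str.lower()
    # split() with no pattern splits on runs of whitespace; explode gives one word per row.
    # dropna guards against rows that contained no words at all.
    return cleaned.str.split().explode().dropna().drop_duplicates().tolist()


words = unique_words(df["title"])
print(words)
```

Compared with expand=True and stack, explode avoids building a wide intermediate DataFrame padded with missing values, which may matter when row lengths vary a lot.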
