Replacing punctuation in a data frame based on punctuation list

Views: 109 | Published: 2020/4/29 3:24:31 | Tags: python pandas dataframe large-data

Question

Using Canopy and Pandas, I have a data frame a, which is defined by:

import pandas as pd

a=pd.read_csv('text.txt')
df=pd.DataFrame(a)
df.columns=["test"]

test.txt is a single-column file containing a list of strings made up of text, digits, and punctuation.

Suppose df looks like:

test
%hgh&12
abc123!!!
porkyfries

I want my results to be:

test
hgh12
abc123
porkyfries

Effort so far:

from string import punctuation  # import the punctuation list from Python itself

a=pd.read_csv('text.txt')
df=pd.DataFrame(a)
df.columns=["test"]  # define the dataframe

for p in list(punctuation):
    df2=df.med.str.replace(p,'')
    df2=pd.DataFrame(df2)
    df2

The command above basically just returns the same data set. I would appreciate any leads.

The reason I am using Pandas is that the data is huge, spanning about 1M rows, and the code will later be applied to lists of up to 30M rows. Long story short, I need to clean the data very efficiently for big data sets.
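One likely reason the loop in the attempt leaves the data almost unchanged: every pass calls replace on the original df, so each iteration discards the previous result and only the replacement for the last punctuation character survives. The sketch below is one way to keep the loop but chain the results. The sample rows and the column name "test" are assumptions made for illustration (the attempt refers to df.med, presumably a column in the real data), and regex=False is passed so that characters such as "." or "*" are treated literally.

```python
from string import punctuation

import pandas as pd

# Sample data mirroring the question; the column name "test" is assumed.
df = pd.DataFrame({"test": ["test", "%hgh&12", "abc123!!!", "porkyfries"]})

# Start from the original column and feed each result into the next replace,
# so every punctuation character is stripped, not only the last one.
cleaned = df["test"]
for p in punctuation:
    # regex=False treats p as a literal character, not as a pattern
    cleaned = cleaned.str.replace(p, "", regex=False)

df2 = pd.DataFrame({"test": cleaned})
print(df2)
```

This still makes one pass over the column per punctuation character, so for millions of rows a single regex replacement is probably the better fit.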

Recommended answer

Using replace with the correct regex would be easier:

In [41]:

import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]

Use a regex pattern that matches anything that is not a word character (\w: letters, digits, and underscore) or whitespace (\s):

In [49]:

df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]
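If the replacement should follow Python's punctuation list exactly, as the question title suggests, the pattern can be built from string.punctuation instead of [^\w\s]. The following is a minimal sketch under stated assumptions: the sample value "porky_fries" and the "clean" column are hypothetical, added only to show that the underscore is removed here, whereas \w would keep it. re.escape is used so that characters like "]" and "\" stay literal inside the character class.

```python
import re
from string import punctuation

import pandas as pd

# Hypothetical sample data; "porky_fries" is included to show underscore handling.
df = pd.DataFrame({"text": ["test", "%hgh&12", "abc123!!!", "porky_fries"]})

# Build one character class from the punctuation list; re.escape guards
# characters that are special inside a regex, such as "]", "\" and "-".
pattern = "[" + re.escape(punctuation) + "]"

# A single vectorised pass over the column removes every listed character.
df["clean"] = df["text"].str.replace(pattern, "", regex=True)
print(df)
```

Note that string.punctuation covers ASCII punctuation only, so non-ASCII marks would be left in place by this pattern, while [^\w\s] removes them.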
