Remove Special Chars from a TSV file using Regex

Question

I have a file called "X.tsv". Before I export it to sub-files in Python, I want to use a regex to remove special characters (including double spaces), while keeping periods, single spaces, tabs, slashes (/) and hyphens (-).

I want to implement it in the following code.

import pandas as pd
from itertools import chain, combinations

df = pd.read_table('xa.tsv')

def all_subsets(ss):
    # Every combination of the items in ss, from size 0 up to len(ss)
    return chain.from_iterable(combinations(ss, r) for r in range(len(ss) + 1))

# Exclude the extra columns from the subsets
excluded = {'acm_classification', 'publicationId', 'publisher', 'publication_link', 'source'}
cols = [x for x in df.columns if x not in excluded]
subsets = all_subsets(cols)
for subset in subsets:
    if len(subset) > 0:  # Skip the empty subset
        # Each sub file holds the chosen columns plus the classification column
        df1 = df[list(subset) + ['acm_classification']]
        df1.to_csv('_'.join(subset) + '.csv', index=False)

Recommended answer

You could use read_csv() to load the TSV file, specifying the columns you want to keep and \t as the delimiter:

import pandas as pd
import re

def normalise(text):
    # Replace each listed special character with a space
    text = re.sub('[{}]'.format(re.escape('",$!@#%^&*()')), ' ', text.strip())
    text = re.sub(r'\s+', ' ', text)        # Convert multiple whitespace into a single space
    return text.strip()                     # Drop a space left behind where an edge character was replaced

fieldnames = ['title', 'abstract', 'keywords', 'general_terms', 'acm_classification']
# dtype='object' and na_filter=False keep every cell as a string
df = pd.read_csv('xa.tsv', delimiter='\t', usecols=fieldnames, dtype='object', na_filter=False)
df = df.applymap(normalise)  # On pandas 2.1+, df.map(normalise) is the non-deprecated spelling
print(df)

You can then use df.applymap() to apply a function to each cell to format it as you need. In this example the function removes any leading or trailing spaces, replaces each character in your list of special characters with a space, and converts runs of whitespace into a single space. Loading with dtype='object' and na_filter=False keeps every cell as a string, so the function can be applied to all of them.
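The question describes the characters to keep (periods, single spaces, tabs, / and -) rather than the ones to remove, so a whitelist pattern may be a closer fit than the fixed list above. The following is only a sketch of that alternative, not the answerer's code. The name normalise_keep is hypothetical, and it assumes that letters, digits and underscores (everything matched by \w) also count as non-special. Because it runs on one cell at a time, after the tab delimiters have already been consumed by the parser, it collapses any whitespace run into a single space.

```python
import re

def normalise_keep(text):
    """Keep word characters, whitespace, '.', '/' and '-'; drop everything else."""
    # Anything outside the allowed set becomes a space so words are not glued together
    text = re.sub(r'[^\w\s./-]', ' ', text)
    # Collapse runs of whitespace (including the spaces just introduced) into one space
    text = re.sub(r'\s+', ' ', text)
    # Remove a leading or trailing space left by an edge character
    return text.strip()

sample = 'Deep  learning (2019): state-of-the-art, v1.0 w/ GPU!'
cleaned = normalise_keep(sample)
print(cleaned)
```

It can be passed to df.applymap() in place of normalise. Note that \w matches Unicode letters in Python 3, so accented characters are kept as well.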

The resulting dataframe could then be further processed using your all_subsets() function before saving.
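As a rough illustration of that last step, the sketch below feeds a dataframe through all_subsets() and yields one (file name, sub-dataframe) pair per non-empty column subset. The helper subset_frames is a hypothetical name introduced here, and the one-row dataframe is toy data standing in for the cleaned result. It assumes every column other than acm_classification takes part in the subsets, which holds when the file was loaded with the usecols list shown earlier.

```python
from itertools import chain, combinations

import pandas as pd

def all_subsets(ss):
    # Every combination of the items in ss, from size 0 up to len(ss)
    return chain.from_iterable(combinations(ss, r) for r in range(len(ss) + 1))

def subset_frames(df, label='acm_classification'):
    """Yield (file name, sub-dataframe) for each non-empty subset of the other columns."""
    cols = [c for c in df.columns if c != label]
    for subset in all_subsets(cols):
        if subset:  # Skip the empty subset
            yield '_'.join(subset) + '.csv', df[list(subset) + [label]]

# Toy stand-in for the dataframe produced by the cleaning step
df = pd.DataFrame({
    'title': ['A'],
    'abstract': ['B'],
    'keywords': ['C'],
    'acm_classification': ['X'],
})

names = [name for name, _ in subset_frames(df)]
print(names)

# To write the files, call frame.to_csv(name, index=False) for each yielded pair.
```

Keeping the file writing outside the generator makes it easy to inspect the names first. With n candidate columns there are 2**n - 1 files, so the count grows quickly.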
