Regex to delete whitespace in a CSV file that uses quotes to delimit text?


Question


I'm new to programming and regex and have read Mastering Regular Expressions, but I can't find out how to get rid of tabs, newlines, and strange non-word or non-digit characters (mostly icons and strange non-Western line breaks?) within the text column of my TSV file. It's UTF-8 encoded and in Swedish.

It looks like this:

"from_user","month","full_text"
"bellaboo",4,"RT @BodilMalmsten: \"om man klarar av att föra ett bestick till munnen eller      behöver hjälp på toaletten\"
Have a heart, borgarrådet
Have a hea,RT @BodilMalmsten: Borgarrådet om riktlinjerna \"om man klarar av att föra ett   bestick till munnen eller behöver hjälp på toaletten\"
Hjälp
1   min dröm
2   allas önskningar
3   viljan att segra
H,RT @BodilMalmsten: Klarar du av att föra ett bestick till munnen eller behöver hjälp på  toaletten?
http://t.co/fcvcf0U2dW"

Can anyone please help me so I can get on with the text analysis I actually want to do with this file?

Solution

Since you tagged the question with python-3.x, here is a Python 3.x answer.

I think the problem you have is that a CSV reader will get upset with all the newlines inside the third column. This program strips out all the extra newlines and normalizes all the white space (words are separated by a single space).

I'm using a "verbose" Python pattern with comments to make it clear how it matches the columns. The tricky one is the third one, which can contain newlines. It just matches anything until a terminating double-quote is seen.
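
As a small illustration of how the three groups line up, the sketch below applies the same row pattern to a short made-up sample. The sample string and the names `ROW_PATTERN`, `sample`, and `rows` are assumptions for illustration only; the pattern itself is the one used in the program.

```python
import re

# Same verbose pattern as in the program; whitespace and comments inside it are ignored.
ROW_PATTERN = re.compile(r'''
    "([^"]+)"          # group 1: first column, quoted
    \s*,\s*            # separating comma and optional white space
    (\S+)              # group 2: second column, unquoted
    \s*,\s*            # separating comma and optional white space
    "((?:\\"|[^"])*)"  # group 3: quoted text, may hold escaped quotes and newlines
''', re.VERBOSE)

# Hypothetical sample: the first record's text column spans two lines.
sample = '"anna",4,"first \\"quoted\\" line\nsecond line"\n"bo",5,"plain"\n'

# Each match yields one logical row, even when the text column contains newlines.
rows = [m.groups() for m in ROW_PATTERN.finditer(sample)]
for row in rows:
    print(row)
```

Because the third group tries the escaped-quote alternative first, a `\"` inside the text is consumed as a pair and does not end the column early.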

I'm not sure exactly how you want to clean the string; the pattern I gave just replaces all "control characters" (ASCII 0x01 through 0x1f inclusive, plus the ASCII DEL character 0x7f) with spaces. Then the whitespace normalization cleans up any extra spaces.

import re
import sys

_, infile, outfile = sys.argv

s_pat_row = r'''
    "([^"]+)"  # match column; this is group 1
    \s*,\s*  # match separating comma and any optional white space
    (\S+)  # match column; this is group 2
    \s*,\s*  # match separating comma and any optional white space
    "((?:\\"|[^"])*)"  # match string data that can include escaped quotes
'''
pat_row = re.compile(s_pat_row, re.MULTILINE|re.VERBOSE)

s_pat_clean = r'''[\x01-\x1f\x7f]'''
pat_clean = re.compile(s_pat_clean)

row_template = '"{}",{},"{}"\n'

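# Read and write as UTF-8 explicitly so the Swedish characters survive on any platform.
with open(infile, "rt", encoding="utf-8") as inf, open(outfile, "wt", encoding="utf-8") as outf:
    data = inf.read()
    for m in pat_row.finditer(data):
        row = m.groups()
        # Replace control characters (including tabs and newlines) with spaces.
        cleaned = pat_clean.sub(' ', row[2])
        # Collapse every run of whitespace into a single space.
        cleaned = ' '.join(cleaned.split())
        outrow = row_template.format(row[0], row[1], cleaned)
        outf.write(outrow)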

You can edit the pattern specified in s_pat_clean to clean any characters you need cleaned.
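
For example, here is a minimal sketch of a wider cleaning pattern. The extra ranges are only a guess at what "icons and strange characters" might cover in your data (zero-width marks, Unicode line separators, the byte-order mark, and two common symbol and emoji ranges); they are not exhaustive, and the names `CLEAN_PATTERN` and `clean_text` are made up for this illustration.

```python
import re

# Control characters from the original pattern plus some assumed extras.
CLEAN_PATTERN = re.compile(
    r'['
    r'\x01-\x1f\x7f'          # ASCII control characters and DEL (original pattern)
    r'\u200b-\u200f'          # zero-width spaces and directional marks
    r'\u2028\u2029'           # Unicode line and paragraph separators
    r'\ufeff'                 # byte-order mark
    r'\u2600-\u27bf'          # miscellaneous symbols and dingbats
    r'\U0001f000-\U0001faff'  # many emoji blocks
    r']'
)

def clean_text(text):
    """Replace unwanted characters with spaces, then collapse whitespace."""
    replaced = CLEAN_PATTERN.sub(' ', text)
    return ' '.join(replaced.split())

# Accented Swedish letters are outside every range above, so they pass through.
print(clean_text('Hj\u00e4lp\u2028p\u00e5\t\ttoaletten \u2764 \U0001f600\u200b!'))
```

Note that `str.split()` with no arguments already treats Unicode whitespace such as U+2028 as a separator, so listing those characters in the pattern is mostly for clarity.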

To use this, save it in a file called cleaner.py, put your input in a file called data.txt, and then run:

python3 cleaner.py data.txt cleaned.txt

Results are saved in the output file cleaned.txt.

The result of running this on the example you provided:

"from_user","month","full_text"
"bellaboo",4,"RT @BodilMalmsten: \"om man klarar av att föra ett bestick till munnen eller behöver hjälp på toaletten\" Have a heart, borgarrådet Have a hea,RT @BodilMalmsten: Borgarrådet om riktlinjerna \"om man klarar av att föra ett bestick till munnen eller behöver hjälp på toaletten\" Hjälp 1 min dröm 2 allas önskningar 3 viljan att segra H,RT @BodilMalmsten: Klarar du av att föra ett bestick till munnen eller behöver hjälp på toaletten? http://t.co/fcvcf0U2dW"

Now a CSV reader should have no trouble parsing the file.
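
As a rough sketch of that last step, the standard `csv` module can read output of this shape if it is told that quotes are escaped with a backslash rather than doubled. The inline sample text and the names `cleaned_text` and `rows` are assumptions for illustration.

```python
import csv
import io

# Hypothetical cleaned content in the same shape as the program's output.
cleaned_text = (
    '"from_user","month","full_text"\n'
    '"bellaboo",4,"RT @BodilMalmsten: \\"om man klarar av\\" Hj\u00e4lp 1 min dr\u00f6m"\n'
)

# Quotes inside the text are written as \" so the backslash is declared as the
# escape character and quote doubling is switched off.
reader = csv.reader(io.StringIO(cleaned_text), quotechar='"',
                    escapechar='\\', doublequote=False)
rows = list(reader)
for row in rows:
    print(row)
```

For the real file, the `io.StringIO` object would be replaced by something like `open("cleaned.txt", newline="", encoding="utf-8")`.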

EDIT: Re-ran the program with correct input and replaced output example with result of running on correct input. When the input has accents, they are correctly passed through as you can see above.
