read csv with escape characters


Problem description


I have a csv file with some text in it, among other things. I want to tokenize this text (split it into a list of words) and am having problems with how pd.read_csv interprets escape characters.

My csv file looks like this:

text, number
one line\nother line, 12

and the code is as follows:

df = pd.read_csv('test.csv')
word_tokenize(df.iloc[0,0])

The output is:

['one', 'line\\nother', 'line']

while what I want is:

['one', 'line', 'other', 'line']

The problem is pd.read_csv() is not interpreting the \n as a newline character but as two characters (\ and n).

I've tried setting the escapechar argument to '\' and to '\\', but both just remove the backslash from the string without interpreting any newline character, i.e. the string becomes one linenother line.
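The escapechar behavior described above can be reproduced with the question's file contents inlined through StringIO (so the snippet is self-contained; 'test.csv' itself is not shipped here):

```python
import io
import pandas as pd

# The file contents from the question, inlined as a string.
csv_data = "text, number\none line\\nother line, 12\n"

# Default read: the backslash and the 'n' survive as two literal characters.
df_default = pd.read_csv(io.StringIO(csv_data))

# With escapechar='\\', pandas strips the backslash but does NOT turn the
# remaining sequence into a newline.
df_escaped = pd.read_csv(io.StringIO(csv_data), escapechar="\\")

print(df_default.iloc[0, 0])  # one line\nother line
print(df_escaped.iloc[0, 0])  # one linenother line
```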

If I explicitly set df.iloc[0,0] = 'one line\nother line', word_tokenize works just fine, because \n is actually interpreted as a newline character this time.

Ideally I would do this simply changing the way pd.read_csv() interprets the file, but other solutions are also ok.

Solution

The question is a bit poorly worded. I guess pandas escaping the \ in the string is confusing nltk.word_tokenize. pandas.read_csv can only use one separator (or a regex, but I doubt you want that), so it will always read the text column as "one line\nother line", and escape the backslash to preserve it. If you want to further parse and format it, you could use converters. Here's an example:

import pandas as pd
import re

df = pd.read_csv(
    "file.csv", converters={"text": lambda s: re.split(r"\\n| ", s)}
)

The above results in:

                       text   number
0  [one, line, other, line]       12
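A self-contained version of the converter snippet can be checked directly by inlining the question's file contents through StringIO (standing in for file.csv):

```python
import io
import re
import pandas as pd

# Inlined stand-in for the question's test.csv / file.csv.
csv_data = "text, number\none line\\nother line, 12\n"

# The converter splits the raw cell value on a literal backslash-n
# sequence or a space, producing a list of words per row.
df = pd.read_csv(
    io.StringIO(csv_data),
    converters={"text": lambda s: re.split(r"\\n| ", s)},
)

print(df)  # matches the output shown above
```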

Edit: In case you need to use nltk to do the splitting (say the splitting depends on the language model), you would need to unescape the string before passing it on to word_tokenize; try something like this:

lambda s: word_tokenize(s.encode('utf-8').decode('unicode_escape'))
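The unescaping step can be checked on its own, without nltk installed (word_tokenize needs the punkt model, so a plain str.split stands in for it here purely for illustration):

```python
# The raw cell value as pandas reads it: backslash and 'n' as two characters.
raw = "one line\\nother line"

# 'unicode_escape' turns the two-character sequence \n into a real newline.
unescaped = raw.encode("utf-8").decode("unicode_escape")

# With a real newline present, a whitespace-based tokenizer splits the two
# lines apart; str.split() is used in place of word_tokenize.
print(unescaped.split())  # ['one', 'line', 'other', 'line']
```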

Note: Matching lists in queries is incredibly tricky, so you might want to convert them to tuples by altering the lambda like this:

lambda s: tuple(re.split("\\\\n| ", s))
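The point of the tuple conversion is that tuples are hashable while lists are not, so tuple-valued cells survive hashing-based operations such as drop_duplicates. A sketch with a duplicated row (inlined data, assumed from the question's format):

```python
import io
import re
import pandas as pd

# Two identical rows, to exercise deduplication on a tuple-valued column.
csv_data = (
    "text, number\n"
    "one line\\nother line, 12\n"
    "one line\\nother line, 12\n"
)

# Same converter as above, but wrapped in tuple() so each cell is hashable.
df = pd.read_csv(
    io.StringIO(csv_data),
    converters={"text": lambda s: tuple(re.split(r"\\n| ", s))},
)

# drop_duplicates would raise a TypeError on list-valued cells;
# with tuples it collapses the two identical rows into one.
print(df.drop_duplicates())
```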
