Read CSV with escape characters
Problem description
I have a CSV file that contains some text, among other columns. I want to tokenize this text (split it into a list of words), but I am having problems with how pd.read_csv interprets escape characters.
My csv file looks like this:
text, number
one line\nother line, 12
and the code is as follows:
import pandas as pd
from nltk.tokenize import word_tokenize

df = pd.read_csv('test.csv')
word_tokenize(df.iloc[0, 0])
The output is:
['one', 'line\\nother', 'line']
while what I want is:
['one', 'line', 'other', 'line']
The problem is that pd.read_csv() does not interpret the \n as a newline character but as two characters (\ and n).
I've tried setting the escapechar argument to '\' and to '\\', but both just remove the backslash from the string without interpreting the sequence as a newline character, i.e. the string becomes one linenother line.
If I explicitly set df.iloc[0,0] = 'one line\nother line', word_tokenize works just fine, because this time \n is actually interpreted as a newline character.
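The difference between the two strings can be reproduced without pandas or nltk. In this minimal sketch, str.split() stands in for word_tokenize (an assumption made only to keep the example dependency-free), and the variable names are illustrative.

```python
# What read_csv hands back: a backslash followed by the letter n (two characters).
from_file = r"one line\nother line"

# What a Python string literal produces: a single newline character.
literal = "one line\nother line"

print(len(from_file), len(literal))  # 20 19

# str.split() is a stand-in for word_tokenize here; it splits on whitespace only.
print(from_file.split())  # ['one', 'line\\nother', 'line']
print(literal.split())    # ['one', 'line', 'other', 'line']
```

The doubled backslash in the first list is only how Python displays a single literal backslash inside a string.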
Ideally I would do this simply by changing the way pd.read_csv() interprets the file, but other solutions are also OK.
The question is a bit poorly worded. My guess is that the literal backslash pandas keeps in the string is what confuses nltk.word_tokenize. pandas.read_csv can only use one separator (or a regex, but I doubt you want that), so it will always read the text column as "one line\nother line", with the backslash preserved as an ordinary character (which is why it shows up as \\ in the repr). If you want to parse and format the column further, you can use converters. Here's an example:
import re

import pandas as pd

df = pd.read_csv(
    "file.csv",
    # Split on a literal backslash + n, or on a space.
    converters={"text": lambda s: re.split(r"\\n| ", s)},
)
The above results in:
                       text  number
0  [one, line, other, line]      12
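For a self-contained check of the converter approach, the sketch below reads the sample data from an in-memory string instead of file.csv. The io.StringIO buffer and the SAMPLE and tokens names are stand-ins introduced for illustration, not part of the original answer.

```python
import io
import re

import pandas as pd

# In-memory stand-in for the CSV file from the question (hypothetical name).
# The raw-string part keeps \n as two literal characters, as in the file.
SAMPLE = "text, number\n" + r"one line\nother line, 12" + "\n"

df = pd.read_csv(
    io.StringIO(SAMPLE),
    # Split on the literal two-character sequence backslash + n, or on a space.
    converters={"text": lambda s: re.split(r"\\n| ", s)},
)

tokens = df.loc[0, "text"]
print(tokens)  # ['one', 'line', 'other', 'line']
```

The converter runs on each raw cell of the text column before pandas builds the DataFrame, so the column ends up holding Python lists.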
Edit: If you need nltk to do the splitting (say, because the splitting depends on the language model), you need to unescape the string before passing it to word_tokenize; try something like this:
lambda s: word_tokenize(s.encode('utf-8').decode('unicode_escape'))
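One caveat with the encode/decode round trip: the unicode_escape codec decodes non-escape bytes as Latin-1, so non-ASCII text comes out mangled. If that matters for your data, a narrower alternative is to translate only the escape sequences you expect. The sketch below is an assumption-laden illustration, not the original answer's method: unescape_basic and _ESCAPES are hypothetical names, and str.split() again stands in for word_tokenize.

```python
import re

# Escape sequences this sketch chooses to recognise (an illustrative subset).
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\"}


def unescape_basic(s: str) -> str:
    # Replace backslash + one of n, t, r or backslash in a single left-to-right
    # pass, so an escaped backslash followed by n is not turned into a newline.
    return re.sub(r"\\([ntr\\])", lambda m: _ESCAPES[m.group(1)], s)


raw = r"café line\nother line"

# The codec round trip unescapes \n but mangles the non-ASCII character.
via_codec = raw.encode("utf-8").decode("unicode_escape")

# The narrower helper leaves non-ASCII text untouched.
via_helper = unescape_basic(raw)

print(via_helper.split())  # ['café', 'line', 'other', 'line']
```

In a converter this would be used as lambda s: word_tokenize(unescape_basic(s)).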
Note: Matching lists in queries is incredibly tricky, so you might want to convert them to tuples by altering the lambda like this:
lambda s: tuple(re.split(r"\\n| ", s))