Solving "error: literal newline found in data" in Postgres?

Problem description

I have a data set with 32 million tuples, and I'm using COPY FROM to copy those tuples into a table. Within the first 7 million or so there are 3 improperly-formatted tuples. In these cases, instead of the next record being below the current record, it is instead appended to the end of the current record on the same line. So instead of

record1
record2

record1record2

I fixed this by navigating to where the line breaks should have been and pressing "Enter" once per instance, a typical newline entry. Once I fixed those lines, the COPY function was reading them in completely fine. To me, this implied that "Enter" was a valid way to insert a newline, as the COPY function read in those "Enter"-generated newline characters with no problem.

However, later on when I passed the 7 millionth tuple, I encountered:

ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
Context: COPY time_raw, line 7308000

SQL State: 22P04

I looked in the data at that record, and found that it was no different from the one above or below it. Just to make sure, I pressed "Backspace" at the beginning of record 7308000 to move it up one line to the end of record 7307999, just like the above format record1record2. Then I pressed "Enter" to ensure that the newline character that started record 7308000 would be identical to the character that COPY had previously taken in with no trouble. In that same vein, the newline character that ended record 7307999 was the exact same newline character I used to separate the earlier incorrectly-formatted records (once again, COPY took this newline character in with no problems). To cover my bases, I did the same to record 7308001, making sure that the newline that ended record 7308000 would be identical to the newline that I typed earlier, which COPY accepted. However, upon saving and trying to COPY from the file again, I got:

ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
Context: COPY time_raw, line 7307999

So apparently the error had moved up, and the newline character that COPY previously took in suddenly became invalid. Once again, I looked in the data and moved the beginning of 7307999 up to the end of 7307998, and then pressed "Enter" to insert a newline character that COPY had been recognizing as valid for over 7 million records. So at this point, I am confident that the end of 7307998 is a newline character that is supposed to have been proven to work in COPY earlier on. I run the query again and now I get:

ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
Context: COPY time_raw, line 7307998

The error has moved up again. To my knowledge, I am just putting in the same "Enter" newline characters that I did earlier, and for some reason past 7308000 COPY reads them as invalid.

These are lines 7307996-7308000 of the data set (56 columns, most meaningless)

2012-02-23T13:10:03.1769237+00:00   9863996 12604   13807   8171    0000    0001    0000    0000    0000    0000    0000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2012-02-23T13:10:03.1869189+00:00   9863997 12604   13807   8171    0000    0001    0000    0000    0000    0000    0000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2012-02-23T13:10:03.1969230+00:00   9863998 12604   13807   8171    0000    0001    0000    0000    0000    0000    0000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2012-02-23T13:10:03.2069124+00:00   9863999 12604   13807   8171    0000    0001    0000    0000    0000    0000    0000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2012-02-23T13:10:03.2169261+00:00   9864000 12604   13807   8171    0000    0001    0000    0000    0000    0000    0000    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

I added one "Enter" in the stackoverflow editor after each record in order to display them one after another, but the point is that the relevant middle line (7307998) is formatted exactly like the others, and I made sure that it was surrounded by newlines that were already shown to be valid characters by the COPY function. Every time I try to ensure that the newlines surrounding a record are the same as the ones I used earlier, the error goes up a line. I have no idea what to do in this situation.

My query is:

copy time_raw from E'C:\\Users\\bozon92\\Documents\\YorkU\\Summer 2016\\4080 Project\\Other Files\\allData.txt';

I have tried appending "with (format csv, delimiter '\t')" but that tells me COPY delimiter must be a single one-byte character, whereas "with (format csv, delimiter ' ')" (' ' being a literal tab space instead of \t) it gives me the same nature of error, just with a slightly different syntax:

ERROR:  unquoted newline found in data
HINT:  Use quoted CSV field to represent newline.
CONTEXT:  COPY time_raw, line 7307998

It's just "unquoted" instead of "literal".

As a note, I have been told that I can just cut the data at 7 million tuples, which I will probably eventually do, but I want to know why this issue happened, so I can avoid it in the future. I have no idea what is wrong with the data, because those particular records look completely fine and exactly the same format as the records before and after it, so how can I deal with this literal newline issue? I have no idea how to approach it because I can't find even a trace of what is wrong.

Recommended answer

According to this code snippet from the PostgreSQL source, copy.c:

/* Process \n */
if (c == '\n' && (!cstate->csv_mode || !in_quote))
{
    if (cstate->eol_type == EOL_CR || cstate->eol_type == EOL_CRNL)
        ereport(ERROR,
                (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
                 !cstate->csv_mode ?
                 errmsg("literal newline found in data") :
                 errmsg("unquoted newline found in data"),
                 !cstate->csv_mode ?
                 errhint("Use \"\\n\" to represent newline.") :
                 errhint("Use quoted CSV field to represent newline.")));
    cstate->eol_type = EOL_NL;      /* in case not set yet */
    /* If reach here, we have found the line terminator */
    break;
}

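It means that your input data contains the byte 0x0A somewhere inside your strings, e.g. you have "abcNxyz", where in place of N there is actually a byte with the value 0x0A. Note that the snippet raises this error only when eol_type is already EOL_CR or EOL_CRNL, that is, when COPY has already settled on CR or CRLF as the file's line terminator; a bare 0x0A encountered after that point is treated as data rather than as the end of a line.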
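Since the check in the snippet fires only when eol_type is EOL_CR or EOL_CRNL, one thing worth checking is whether the file mixes CRLF and bare LF line endings. The following is a minimal sketch of such a check, not something taken from the original answer. The function names are hypothetical, and whether the file in the question really mixes terminators is an assumption that the question alone cannot confirm.

```python
from collections import Counter


def classify_line_endings(lines, max_report=10):
    """Count CRLF vs. bare-LF endings and note where bare LFs occur.

    `lines` is any iterable of bytes objects split on LF, for example a
    file opened in binary mode. Returns (counts, bare_lf_lines), where
    bare_lf_lines holds at most `max_report` 1-based line numbers.
    """
    counts = Counter()
    bare_lf_lines = []
    for number, line in enumerate(lines, start=1):
        if line.endswith(b"\r\n"):
            counts["CRLF"] += 1
        elif line.endswith(b"\n"):
            counts["LF"] += 1
            if len(bare_lf_lines) < max_report:
                bare_lf_lines.append(number)
        else:
            counts["none"] += 1  # final line without a terminator
    return counts, bare_lf_lines


def normalize_to_lf(lines):
    """Yield each line with a trailing CRLF rewritten to a bare LF."""
    for line in lines:
        if line.endswith(b"\r\n"):
            yield line[:-2] + b"\n"
        else:
            yield line
```

To try it on a real file, open the file with `open(path, "rb")` and pass the file object to `classify_line_endings`. If both CRLF and LF counts are non-zero, writing `normalize_to_lf(src)` to a new file opened with `"wb"` (via `dst.writelines(...)`) gives a copy with uniform LF terminators. Both functions stream line by line, so they do not need to hold a 32-million-line file in memory.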

The solution is to use the string "abc\nxyz" instead, with a backslash followed by the letter n in place of the raw byte. You should be able to find all spurious newlines and replace them with \n using a script, perhaps in Python or Perl.
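As a rough illustration of that replacement, here is a small Python sketch that escapes a field value the way COPY's text format expects, so that a newline inside a field becomes the two characters `\n`. The helper names `escape_copy_text` and `format_copy_row` are hypothetical, and the sketch assumes you already have each record split into its field values; it does not handle NULL values.

```python
def escape_copy_text(value: str) -> str:
    """Escape one field value for COPY's text format."""
    # Escape the backslash first so the escapes added below are not doubled.
    return (
        value.replace("\\", "\\\\")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )


def format_copy_row(fields) -> str:
    """Join escaped fields with tabs and end the row with a single LF."""
    return "\t".join(escape_copy_text(field) for field in fields) + "\n"
```

For example, `escape_copy_text("abc\nxyz")` returns the eight-character string `abc\nxyz`, which COPY reads back as a field containing a real newline.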
