How to read a file by paragraphs or chunks into arrays

Question

I have a file that contains chunks of text separated by blank lines, like this:

block 1
some text
some text

block 2
some text
some text

How can I read it into arrays?

Recommended answer

This is asked often enough that I thought it'd be useful to explain what to do, but first I need to say this:

Don't try to read a file in one big gulp. That's called "slurping", and it's a bad idea unless you can guarantee that you'll ALWAYS get files significantly smaller than 1MB. See "Why is 'slurping' a file not a good practice?" for more information.

If I have a file that looks like:

block 1
some text
some text

block 2
some text
some text

and I tried to read it normally, I'd get something like:

File.read('foo.txt')
#=> "block 1\nsome text\nsome text\n\nblock 2\nsome text\nsome text\n"

which would leave me having to split it into separate lines, find the blank lines, and then break the result into chunks. Invariably, the naive solution is a regular expression, which kind of works but isn't optimal.

Or we could try:

File.readlines('foo.txt')
#=> ["block 1\n", "some text\n", "some text\n", "\n", "block 2\n", "some text\n", "some text\n"]

and then still have to find the blank lines and turn the array into sub-arrays.
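
For comparison, the manual grouping described above might look roughly like this. It is only a sketch: the `group_paragraphs` helper is a hypothetical name, and a temporary file stands in for foo.txt so the snippet runs on its own.

```ruby
require 'tempfile'

# Hypothetical helper: group an array of lines into sub-arrays,
# starting a new group at every blank line.
def group_paragraphs(lines)
  groups = [[]]
  lines.each do |line|
    text = line.chomp
    if text.empty?
      # A blank line closes the current group (unless it is still empty).
      groups << [] unless groups.last.empty?
    else
      groups.last << text
    end
  end
  groups.reject(&:empty?)
end

sample = "block 1\nsome text\nsome text\n\nblock 2\nsome text\nsome text\n"

# Tempfile.create returns the value of its block and removes the file afterwards.
grouped = Tempfile.create('foo') do |f|
  f.write(sample)
  f.flush
  group_paragraphs(File.readlines(f.path))
end

p grouped
#=> [["block 1", "some text", "some text"], ["block 2", "some text", "some text"]]
```

It works, but it is a fair amount of bookkeeping for something Ruby's IO methods can do directly.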

Instead, there are two easy ways to load the file.


  1. Keeping in mind the previous warning about slurping files, if it's a small file we can use:

File.readlines('foo.txt', "\n\n")
#=> ["block 1\nsome text\nsome text\n\n", "block 2\nsome text\nsome text\n"]

Notice the "\n\n" passed as the second parameter. That's the record ("line") separator. When it's omitted, Ruby uses the global $/, which is "\n" by default; $RS and $INPUT_RECORD_SEPARATOR are aliases for $/ documented in the English module. A record separator is the string that marks the end of one record in a text file: normally a single line end but, for our purposes, two consecutive line ends, or, in other words, the end of a paragraph. (A file with Windows-style "\r\n" line ends that isn't converted on read would need "\r\n\r\n" instead.)
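
To see how the separator argument changes what comes back, here is a small sketch. It assumes a temporary file as a stand-in for foo.txt, with three blank lines between the blocks. It also tries the empty-string separator, which Ruby's IO documentation describes as paragraph mode; that is not used elsewhere in this answer.

```ruby
require 'tempfile'

# Two blocks separated by THREE blank lines, to expose the difference.
text = "block 1\nsome text\n\n\n\nblock 2\nsome text\n"

by_line, by_pair, by_paragraph = Tempfile.create('foo') do |f|
  f.write(text)
  f.flush
  [
    File.readlines(f.path),          # default separator, $/ ("\n")
    File.readlines(f.path, "\n\n"),  # a record ends after every pair of newlines
    File.readlines(f.path, "")       # paragraph mode: a run of blank lines counts once
  ]
end

p by_line.size                 # one entry per physical line, blank ones included
p by_pair.map(&:rstrip)        # contains an empty record for the extra blank lines
p by_paragraph.map(&:rstrip)   # just the two blocks
```

With exactly one blank line between blocks, as in the sample file, "\n\n" and "" should give the same paragraphs, so the choice only matters when the spacing is irregular.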

Once read, it's easy to clean up the contents to remove the trailing line-ends:

File.readlines('foo.txt', "\n\n").map(&:rstrip)
#=> ["block 1\nsome text\nsome text", "block 2\nsome text\nsome text"]

Or break them into sub-arrays:

File.readlines('foo.txt', "\n\n").map{ |s| s.rstrip.split("\n") }
#=> [["block 1", "some text", "some text"], ["block 2", "some text", "some text"]]

All of these examples can be followed by a block similar to:

File.readlines('foo.txt', "\n\n").map(&:rstrip).each do |paragraph|
  # do something with the paragraph string
end

Or:

File.readlines('foo.txt', "\n\n").map{ |s| s.rstrip.split("\n") }.each do |paragraph|
  # do something with the sub-array `paragraph`
end


  2. If it's a big file, we can use Ruby's line-by-line IO: foreach if the file isn't already open, or each_line if it is. And, since you read the link above, you already know why we'd want line-by-line IO.

    File.foreach('foo.txt', "\n\n")  #=> #<Enumerator: File:foreach("foo.txt", "\n\n")>
    

    foreach returns an enumerator when called without a block, so we tack on to_a here to see the results as an array; normally we wouldn't have to do that:

    File.foreach('foo.txt', "\n\n").to_a
    #=> ["block 1\nsome text\nsome text\n\n", "block 2\nsome text\nsome text\n"]
    

    foreach is as easy to use as the examples above:

    File.foreach('foo.txt', "\n\n").map(&:rstrip) 
    #=> ["block 1\nsome text\nsome text", "block 2\nsome text\nsome text"]
    
    File.foreach('foo.txt', "\n\n").map{ |s| s.rstrip.split("\n") }
    #=> [["block 1", "some text", "some text"], ["block 2", "some text", "some text"]]
    

    Note: using map like that has much the same memory cost as slurping the file, since map collects every paragraph foreach yields into a single array before returning it. Instead, we need to do the manipulation of each paragraph inside the do block:

    File.foreach('foo.txt', "\n\n") do |paragraph|
      paragraph.rstrip.split("\n").each do |line|
        # do something with the individual line
      end
    end
    

    Doing that is a small hit in performance but, because the goal is to process by paragraphs or blocks, it's acceptable.
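
The each_line variant for an already opened file takes the same separator argument. Below is a minimal sketch, again assuming a temporary file as a stand-in for foo.txt. The `summary` array exists only so there is something to print; real code would do its work inside the block. The last part shows Enumerator::Lazy as one possible way to keep the chained map style without building an array of every paragraph.

```ruby
require 'tempfile'

text = "block 1\nsome text\nsome text\n\nblock 2\nsome text\nsome text\n"

summary = []
first_block = nil

Tempfile.create('foo') do |f|
  f.write(text)
  f.flush

  # each_line on an open file: one paragraph is read per iteration.
  File.open(f.path) do |file|
    file.each_line("\n\n") do |paragraph|
      lines = paragraph.rstrip.split("\n")
      summary << [lines.first, lines.size]  # keep only a small summary
    end
  end

  # Lazy enumerator: paragraphs are mapped one at a time, on demand.
  first_block = File.foreach(f.path, "\n\n")
                    .lazy
                    .map { |s| s.rstrip.split("\n") }
                    .first
end

p summary      #=> [["block 1", 3], ["block 2", 3]]
p first_block  #=> ["block 1", "some text", "some text"]
```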

    Also note, this is a community Wiki, so edit and contribute appropriately.
