RUBY CSV Calculate Return


Problem Description



      I am trying to calculate the 2-day return for a list of companies. All the information is saved in a CSV file. The structure is as follows: the first column is the company name, the second is the date, the third is the price, and the fourth column is return = p(t+2)/p(t).

      (1) The CSV is 1.8G. Using "CSV.each_with_index.." is very slow. If I use "CSV.foreach", it doesn't let me find the price two days ahead.

      (2) There are missing values in the price column. So even if I use CSV.each_with_index, i + 2 may not identify the right date.

      Thanks for your help.

      Input:

      [
          ['a', '2014-6-1', '1'],
          ['a', '2014-6-2', '2'],
          ['a', '2014-6-4', '3'],
          ['a', '2014-6-5', '4'],
          ['b', '2014-6-1', '1'],
          ['b', '2014-6-2', '2'],
          ['b', '2014-6-3', '3'],
          ['b', '2014-6-4', '4'],
          ['b', '2014-6-5', '5']
      
      ]
      

      Output:

      [
          ['a', '2014-6-1', '1', ''],     # Missing because no 2014-6-3 price for a
          ['a', '2014-6-2', '2', '1.5'],  # p(a2014-6-4)/p(a2014-6-2) = 1.5
          ['a', '2014-6-4', '3', ''],     # Missing because no 2014-6-6 price
          ['a', '2014-6-5', '4', ''],     # Missing because no 2014-6-7 price
          ['b', '2014-6-1', '1', '3'],
          ['b', '2014-6-2', '2', '2'],
          ['b', '2014-6-3', '3', '1.7'],
          ['b', '2014-6-4', '4', ''],
          ['b', '2014-6-5', '5', '']
      
      ]
      

      The logic I have in mind is as follows; it is the same as the logic in the first comment. I haven't coded the second part, as I am not sure what a good way is to merge a big CSV with itself in Ruby. I also thought about searching for the nth business day among the following observations, but I want to avoid using each_with_index as the CSV is very big. I don't know how to implement this logic in Ruby.

      (1) Calculate the nth business day after a date. (2) Merge the dataset with itself so I have the price on the nth business day.

      require 'csv'
      require 'time'           # Time.parse lives in the stdlib time library
      require 'business_time'
      # 30/60/90/365 business days 
      # cdate ncusip prc permno firm
      
      csvIn   = 'in.csv'
      csvOut  = 'out.csv'
      
      csv = CSV.open(csvOut, "w")
      csv << ['cdate', 'ncusip', 'prc', 'permon', 'firm', 'day60']
      
      
      CSV.foreach(csvIn, :headers => true) do |row|
          current_date = Time.parse(row['cdate'])
          day60 = 42.business_days.after(current_date)
          csv << [row['cdate'], row['ncusip'], row['prc'], row['permno'], row['firm'], day60]
      
      end
      
      
      csv.close
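
      The uncoded second step — merging the dataset with itself — could instead be expressed as a hash lookup keyed by [firm, date]. A minimal sketch along those lines (illustrative only, not from the original; it replaces the foreach loop above, reuses its csvIn/csv handles, and assumes the [firm, date] → price index fits in memory):

      # Pass 1: index price by [firm, normalized date string].
      price = {}
      CSV.foreach(csvIn, :headers => true) do |row|
        key = [row['firm'], Time.parse(row['cdate']).strftime('%Y-%m-%d')]
        price[key] = row['prc'].to_f
      end

      # Pass 2: look up the price 42 business days out (the day60 column) and emit the return.
      CSV.foreach(csvIn, :headers => true) do |row|
        day60 = 42.business_days.after(Time.parse(row['cdate']))
        later = price[[row['firm'], day60.strftime('%Y-%m-%d')]]
        ret   = later ? (later / row['prc'].to_f).round(2) : ''
        csv << [row['cdate'], row['ncusip'], row['prc'], row['permno'], row['firm'], ret]
      end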
      

      Solution

      Your code leads to some new requirements, such as finding the nth business day, but those are not clearly defined in the question; perhaps the more proper way is to open another question about the quickest way of finding the nth business day in Ruby.

      So let's focus only on the requirement you commented on in the result sample.

      Main points of the requirement:

      • read a large CSV file with date-formatted strings in it
      • for each day, find the price n days later (n = 2) within the same group
      • to each day's record, append the ratio of the two days' prices; if there is no price n days later, leave it blank (see the sketch after this list)
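
      For reference, a minimal in-memory version of just that requirement (my own illustration, not the answer's approach — it loads every row at once, which the 1.8G input won't allow, hence the streaming buffer developed below):

      require 'csv'
      require 'date'

      N = 2
      rows = CSV.read('data.csv')                 # [[firm, date_str, price], ...]
      # index price by [firm, date] for O(1) lookup of the date n days later
      price = rows.each_with_object({}) do |(firm, d, p), h|
        h[[firm, Date.strptime(d, '%Y-%m-%d')]] = p.to_f
      end
      out = rows.map do |firm, d, p|
        later = price[[firm, Date.strptime(d, '%Y-%m-%d') + N]]
        [firm, d, p, later ? (later / p.to_f).round(2) : '']
      end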

      Basic Benchmark:

      With the sample data repeated 40,000 times (9 rows × 40,000), I got a 10MB csv file of 360,000 records (a generation sketch follows).
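
      How such a test file can be generated — a sketch of my own, not part of the original answer:

      require 'csv'
      sample = [
        ['a','2014-6-1','1'], ['a','2014-6-2','2'], ['a','2014-6-4','3'], ['a','2014-6-5','4'],
        ['b','2014-6-1','1'], ['b','2014-6-2','2'], ['b','2014-6-3','3'], ['b','2014-6-4','4'], ['b','2014-6-5','5']
      ]
      CSV.open('data.csv', 'w') do |csv|
        40000.times { sample.each { |row| csv << row } }   # 9 rows x 40,000 = 360,000 records
      end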

      My first thought was to write a Buffer class that buffers the records which haven't yet met their matching record n days later. When a new record is pushed into the buffer, the buffer shifts out all the records that are n or more days older than the new record.

      But first I need to know the processing time of some basic operations that might be used in this implementation; then I can figure out the lower bound of the total processing time by choosing the more efficient operations:

      1. convert a date-formatted string to a Date, at least 360,000 times
      2. compare two dates, 360,000 times
      3. get the date that is n days after another date, 360,000 times
      4. calculate the days between two dates, 360,000 times
      5. compare two dates stored in an array of arrays, 360,000 times
      6. push a row into the buffer and shift one out, 360,000 times
      7. append a ratio or empty string to every record, 360,000 times

      I have heard that CSV is a very inefficient way to parse, so I will compare two file-parsing processing times too:

      1. using CSV.foreach to read the csv file row by row, parsing each row into an array
      2. using IO.read to read the csv file into a string at once, then splitting the string into arrays

      Basic benchmark script:

      require 'csv'
      require 'date'       # Date and Date.strptime live in the stdlib date library
      require 'benchmark'
      
      res = []             # collector used by test 7 below
      Benchmark.bm{|x|
        epoch=Date.new(1970,1,1)
        date1=Date.today
        date2=Date.today.next
        i1=1
        i2=200000
        date_str='2014-6-1'
        a = [[1,2,4,date2],2,[1,2,4,date1]]
        #  1. convert date formatted string to date at least 360,000 times
        x.report("1.string2date"){
          360000.times{Date.strptime(date_str,"%Y-%m-%d")}
        }
        #  2. compare two days for 360,000 times
        x.report("2.DateCompare"){
          360000.times{date2>=date1}
        }
        #  3. get the date that is n days after another date for 360,000 times
        x.report("3.DateAdd2   "){
          360000.times{date1 + 2}
        }
        #  4. calculate the days between two dates for 360,000 times
        x.report("4.Date Differ"){
          360000.times{date2-date1}
        }
        #  5. compare two dates stored in an array of arrays for 360,000 times 
        x.report("5.ArrDateComp"){
          360000.times{ a.last[3] > a.first[3]}
        }
        #  6. push a row into buffer and shift out for 360,000 times 
        x.report("6.array shift"){
          360000.times{ a<<[1,2,3]; a.shift}
        }
        #  7. append a ratio or empty string to every record for 360,000 times 
        x.report("7.Add Ratio  "){
          360000.times{ res << (['1','2014-6-1',"3"]<< (2==2 ? (3.to_f/2.to_f).round(2) : "" ))}
        }
      
        x.report('CSVparse    '){
          CSV.foreach("data.csv"){|row| 
          }
        }
        x.report('IOread      '){
        data = IO.read("data.csv").split.inject([]){|memo,o| memo << o.split(',')}.each{|x| }
        }
      }
      

      The result:

                         user     system      total        real
      1.string2date  0.827000   0.000000   0.827000 (  0.820001)
      2.DateCompare  0.078000   0.000000   0.078000 (  0.070000)
      3.DateAdd2     0.109000   0.000000   0.109000 (  0.110000)
      4.Date Differ  0.359000   0.000000   0.359000 (  0.360000)
      5.ArrDateComp  0.109000   0.000000   0.109000 (  0.110001)
      6.array shift  0.094000   0.000000   0.094000 (  0.090000)
      7.Add Ratio    0.530000   0.000000   0.530000 (  0.530000)
       CSVparse      2.902000   0.016000   2.918000 (  2.910005)
       IOread        0.515000   0.015000   0.530000 (  0.540000)
      

      Analyzing the result

      • Converting a date-formatted string to a Date is the slowest of all these operations, so it should be done during file parsing, to ensure the string-to-date conversion is executed only once per record.
      • Comparing two dates is about 4.6 times faster than calculating the days between them (0.078s vs 0.359s), so I'll store the date n days ahead in the buffer rather than an integer count of days since the epoch (both points are condensed in the snippet after this list).
      • The total processing time includes at least parts 1, 2, 3, 5, 6 and 7, so the lower bound of the estimated processing time is 1.75 seconds. There are some overheads not included.
      • With CSV parsing the lower bound would be 4.24 seconds.
      • With IO#read and split the lower bound would be 2.262 seconds.
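
      In code, those two decisions amount to doing the expensive work once per record, at parse time (a condensed illustration of the bullets above; both lines reappear in the implementation below):

      # parse the date string once, while the row is being read...
      row << Date.strptime(row[1], "%Y-%m-%d")
      # ...and buffer the precomputed target date rather than an integer day count
      row << (row[3] + 2)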

      The implementation of Buffer Class and push method

      class Buff
        def initialize
          @buff=[]
          @epoch = Date.new(1970,1,1)
          @n=2
        end
        def push_date( row )
          # store buff with two date values appended, ["a", "2014-6-1", "1", #<Date: 2014-06-01 ...>, #<Date: 2014-06-03 ...>]
          # the last element is the date n days after the record's date
          res = []
          @buff << (row << (row[3] + @n) )
          # flush every buffered record whose target date (index 4) has been
          # reached by the incoming row's date, or whose group differs
          while (@buff.last[3] >= @buff.first[4] || row[0] != @buff.first[0])
            # emit the ratio only on an exact date match within the same group; otherwise blank
            v = (@buff.last[3] == @buff.first[4] && row[0] == @buff.first[0] ? (row[2].to_f/@buff.first[2].to_f).round(2) : "")
            res <<(@buff.shift[0..2]<< v)
          end
          return res
        end
      
        def tails
          # whatever is still buffered at EOF has no price n days later: blank ratio
          @buff.inject([]) {|res,x|  res << (x[0..2]<< "")}
        end
        def clear
          @buff=[]
        end
      end
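
      To make the flushing behaviour concrete, here is a small hand trace (my own usage example; the caller appends the parsed Date before pushing, exactly as the benchmark below does):

      buff = Buff.new
      p buff.push_date(['a', '2014-6-1', '1', Date.new(2014,6,1)])
      # => []   nothing is n days old yet, so nothing is flushed
      p buff.push_date(['a', '2014-6-2', '2', Date.new(2014,6,2)])
      # => []
      p buff.push_date(['a', '2014-6-4', '3', Date.new(2014,6,4)])
      # => [["a", "2014-6-1", "1", ""], ["a", "2014-6-2", "2", 1.5]]
      #    2014-6-1 flushes blank (no 2014-6-3 price); 2014-6-2 gets 3/2 = 1.5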
      

      Benchmark

      buff=Buff.new
      res=[]
      Benchmark.bm{|x|
        buff.clear
        res = []
        x.report("CSVdate"){
          CSV.foreach("data.csv"){|row| 
            buff.push_date(row << Date.strptime(row[1],"%Y-%m-%d")).each{|x| res << x}
          }
          buff.tails.each{|x| res << x}
        }
      
        buff.clear
        res = []
        x.report("IOdate"){
          IO.read("data.csv").split.inject([]){|memo,o| memo << o.split(',')}.each {|row| 
            buff.push_date(row << Date.strptime(row[1],"%Y-%m-%d")).each{|x| res << x}
          }
          buff.tails.each{|x| res << x}
        }
      
      }
      puts "output result count:#{res.size}"
      puts "Here is the fist 12 sample outputs:"
      res[0..11].each{|x| puts x.to_s}
      

      Result

                   user     system      total        real
      CSVdate  6.411000   0.047000   6.458000 (  6.500009)
      IOdate   3.557000   0.109000   3.666000 (  3.710005)
      
      output result count:360000
      Here are the first 12 sample outputs:
      ["a", "2014-6-1", "1", ""]
      ["a", "2014-6-2", "2", 1.5]
      ["a", "2014-6-4", "3", ""]
      ["a", "2014-6-5", "4", ""]
      ["b", "2014-6-1", "1", 3.0]
      ["b", "2014-6-2", "2", 2.0]
      ["b", "2014-6-3", "3", 1.67]
      ["b", "2014-6-4", "4", ""]
      ["b", "2014-6-5", "5", ""]
      ["a", "2014-6-1", "1", ""]
      ["a", "2014-6-2", "2", 1.5]
      ["a", "2014-6-4", "3", ""]
      

      Conclusion

      • The real processing time is 3.557 seconds, about 57% slower than the estimated lower bound, but there are still some overheads not accounted for.
      • The CSV version is about 2 times slower than the IO#read version.
      • We should read the input file block by block with IO#read to avoid running out of memory on the 1.8G input (a sketch follows this list).
      • There must still be some room for tuning.
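
      A minimal sketch of that block-by-block reading (my own illustration, not part of the original answer; the chunk size is arbitrary, and the key point is carrying the trailing partial line over to the next chunk):

      CHUNK = 16 * 1024 * 1024                  # 16MB per read
      leftover = ''
      File.open('data.csv') do |f|
        while chunk = f.read(CHUNK)             # nil at EOF ends the loop
          chunk = leftover + chunk
          last_nl = chunk.rindex("\n")
          if last_nl.nil?                       # no line break in this chunk yet
            leftover = chunk
            next
          end
          leftover = chunk[(last_nl + 1)..-1]   # partial last line, kept for next round
          chunk[0...last_nl].split("\n").each do |line|
            row = line.split(',')
            buff.push_date(row << Date.strptime(row[1], "%Y-%m-%d")).each { |x| res << x }
          end
        end
        unless leftover.empty?                  # file may lack a trailing newline
          row = leftover.split(',')
          buff.push_date(row << Date.strptime(row[1], "%Y-%m-%d")).each { |x| res << x }
        end
      end
      buff.tails.each { |x| res << x }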

      UPDATE1:

      Tuning

      A faster push, obtained by changing the order of the group compare and the date compare:

      class Buff
        def push_fast( row ) 
          # store buff with two date value appended, ["a", "2014-6-1", "1",  #<Date: 2014-06-01 ((2456908j,0s,0n),+0s,2299161j)>,#<Date: 2014-06-03 ((2456908j,0s,0n),+0s,2299161j)>]
          # the last element of date is n days after the record's date 
          res = []
          row << (row[3] + @n) 
          # change the order of the two compares, can reduce the counts of date compares
          while @buff.first && (row[0] != @buff.first[0] || row[3] >= @buff.first[4] )
            v = (row[0] == @buff.first[0] && row[3] == @buff.first[4] ? (row[2].to_f/@buff.first[2].to_f).round(2) : "")
            res <<(@buff.shift[0..2]<< v)
          end
          @buff << row
          return res
        end
      end
      

      Benchmark result

                  user     system      total        real
      IOdate  3.806000   0.031000   3.837000 (  3.830005)
      IOfast  3.323000   0.062000   3.385000 (  3.390005)
      

      This gains about 0.48 seconds. Comparing the group first saves a lot of date-comparison time: when the group changes, all buffered records are shifted out without any date compare at all.
