RUBY CSV Calculate Return


Problem Description



      I am trying to calculate the 2-day return for a list of companies. All the information is saved in a CSV file. The structure is as follows: the first column is the company name, the second is the date, the third is the price, and the fourth column is return = p(t+2)/p(t).

      (1) The CSV is 1.8G. Using "CSV.each_with_index.." is very slow. If I use "CSV.foreach", it doesn't let me find the price two days ahead.

      (2) There are missing values in the price column. So even if I use CSV.each_with_index, i + 2 may not identify the right date.

      Thanks for your help.

      Input:

      [
          ['a', '2014-6-1', '1'],
          ['a', '2014-6-2', '2'],
          ['a', '2014-6-4', '3'],
          ['a', '2014-6-5', '4'],
          ['b', '2014-6-1', '1'],
          ['b', '2014-6-2', '2'],
          ['b', '2014-6-3', '3'],
          ['b', '2014-6-4', '4'],
          ['b', '2014-6-5', '5']
      
      ]
      

      Output:

      [
          ['a', '2014-6-1', '1', ''],     # Missing because no 2014-6-3 price for a
          ['a', '2014-6-2', '2', '1.5'],  # p(a2014-6-4)/p(a2014-6-2) = 1.5
          ['a', '2014-6-4', '3', ''],     # Missing because no 2014-6-6 price
          ['a', '2014-6-5', '4', ''],     # Missing because no 2014-6-7 price
          ['b', '2014-6-1', '1', '3'],
          ['b', '2014-6-2', '2', '2'],
          ['b', '2014-6-3', '3', '1.7'],
          ['b', '2014-6-4', '4', ''],
          ['b', '2014-6-5', '5', '']
      
      ]
      

      The logic I have in mind is as follows; it is the same as the logic in the first comment. I haven't coded the second part, as I am not sure what a good way is to merge a big CSV with itself in Ruby. I also thought about searching for the nth business day among the following observations, but I want to avoid using each_with_index as the CSV is very big. I don't know how to implement this logic in Ruby.

      (1) Calculate the nth business day after a date. (2) Merge the dataset with itself so I have the price on the nth business day.

      require 'csv'
      require 'time'           # Time.parse lives in the stdlib time library
      require 'business_time'
      # 30/60/90/365 business days 
      # cdate ncusip prc permno firm
      
      csvIn   = 'in.csv'
      csvOut  = 'out.csv'
      
      csv = CSV.open(csvOut, "w")
      csv << ['cdate', 'ncusip', 'prc', 'permon', 'firm', 'day60']
      
      
      CSV.foreach(csvIn, :headers => true) do |row|
          current_date = Time.parse(row['cdate'])
          day60 = 42.business_days.after(current_date)
          csv << [row['cdate'], row['ncusip'], row['prc'], row['permno'], row['firm'], day60]
      
      end
      
      
      csv.close
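
      The uncoded second step — merging the dataset with itself — could instead be expressed as a hash lookup keyed by [firm, date]. A minimal sketch along those lines (illustrative only, not from the original; it replaces the foreach loop above, reuses its csvIn/csv handles, and assumes the [firm, date] → price index fits in memory):

      # Pass 1: index price by [firm, normalized date string].
      price = {}
      CSV.foreach(csvIn, :headers => true) do |row|
        key = [row['firm'], Time.parse(row['cdate']).strftime('%Y-%m-%d')]
        price[key] = row['prc'].to_f
      end

      # Pass 2: look up the price 42 business days out (the day60 column) and emit the return.
      CSV.foreach(csvIn, :headers => true) do |row|
        day60 = 42.business_days.after(Time.parse(row['cdate']))
        later = price[[row['firm'], day60.strftime('%Y-%m-%d')]]
        ret   = later ? (later / row['prc'].to_f).round(2) : ''
        csv << [row['cdate'], row['ncusip'], row['prc'], row['permno'], row['firm'], ret]
      end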
      

      Solution

      Your code leads to some new requirements, such as finding the nth business day, but those are not clearly defined in the question; perhaps the more proper way is to open another question about the quickest way of finding the nth business day in Ruby.

      So let's focus only on the requirement you commented on in the result sample.

      Main points of the requirement:

      • read a large CSV file with date-formatted strings in it
      • for each day, find the price n days later (n = 2) within the same group
      • to each day's record, append the ratio of the two days' prices; if there is no price n days later, leave it blank (see the sketch after this list)
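
      For reference, a minimal in-memory version of just that requirement (my own illustration, not the answer's approach — it loads every row at once, which the 1.8G input won't allow, hence the streaming buffer developed below):

      require 'csv'
      require 'date'

      N = 2
      rows = CSV.read('data.csv')                 # [[firm, date_str, price], ...]
      # index price by [firm, date] for O(1) lookup of the date n days later
      price = rows.each_with_object({}) do |(firm, d, p), h|
        h[[firm, Date.strptime(d, '%Y-%m-%d')]] = p.to_f
      end
      out = rows.map do |firm, d, p|
        later = price[[firm, Date.strptime(d, '%Y-%m-%d') + N]]
        [firm, d, p, later ? (later / p.to_f).round(2) : '']
      end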

      Basic Benchmark:

      With the sample data repeated 40,000 times (9 rows × 40,000), I got a 10MB csv file of 360,000 records (a generation sketch follows).
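
      How such a test file can be generated — a sketch of my own, not part of the original answer:

      require 'csv'
      sample = [
        ['a','2014-6-1','1'], ['a','2014-6-2','2'], ['a','2014-6-4','3'], ['a','2014-6-5','4'],
        ['b','2014-6-1','1'], ['b','2014-6-2','2'], ['b','2014-6-3','3'], ['b','2014-6-4','4'], ['b','2014-6-5','5']
      ]
      CSV.open('data.csv', 'w') do |csv|
        40000.times { sample.each { |row| csv << row } }   # 9 rows x 40,000 = 360,000 records
      end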

      My first thought was to write a Buffer class that buffers the records which haven't yet met their matching record n days later. When a new record is pushed into the buffer, the buffer shifts out all the records that are n or more days older than the new record.

      But first I need to know the processing time of some basic operations that might be used in this implementation; then I can figure out the lower bound of the total processing time by choosing the more efficient operations:

      1. convert a date-formatted string to a Date, at least 360,000 times
      2. compare two dates, 360,000 times
      3. get the date that is n days after another date, 360,000 times
      4. calculate the days between two dates, 360,000 times
      5. compare two dates stored in an array of arrays, 360,000 times
      6. push a row into the buffer and shift one out, 360,000 times
      7. append a ratio or empty string to every record, 360,000 times

      I have heard that CSV is a very inefficient way to parse, so I will compare two file-parsing processing times too:

      1. using CSV.foreach to read the csv file row by row, parsing each row into an array
      2. using IO.read to read the csv file into a string at once, then splitting the string into arrays

      Basic benchmark script:

      require 'csv'
      require 'date'       # Date and Date.strptime live in the stdlib date library
      require 'benchmark'
      
      res = []             # collector used by test 7 below
      Benchmark.bm{|x|
        epoch=Date.new(1970,1,1)
        date1=Date.today
        date2=Date.today.next
        i1=1
        i2=200000
        date_str='2014-6-1'
        a = [[1,2,4,date2],2,[1,2,4,date1]]
        #  1. convert date formatted string to date at least 360,000 times
        x.report("1.string2date"){
          360000.times{Date.strptime(date_str,"%Y-%m-%d")}
        }
        #  2. compare two days for 360,000 times
        x.report("2.DateCompare"){
          360000.times{date2>=date1}
        }
        #  3. get the date that is n days after another date for 360,000 times
        x.report("3.DateAdd2   "){
          360000.times{date1 + 2}
        }
        #  4. calculate the days between two dates for 360,000 times
        x.report("4.Date Differ"){
          360000.times{date2-date1}
        }
        #  5. compare two dates stored in an array of arrays for 360,000 times 
        x.report("5.ArrDateComp"){
          360000.times{ a.last[3] > a.first[3]}
        }
        #  6. push a row into buffer and shift out for 360,000 times 
        x.report("6.array shift"){
          360000.times{ a<<[1,2,3]; a.shift}
        }
        #  7. append a ratio or empty string to every record for 360,000 times 
        x.report("7.Add Ratio  "){
          360000.times{ res << (['1','2014-6-1',"3"]<< (2==2 ? (3.to_f/2.to_f).round(2) : "" ))}
        }
      
        x.report('CSVparse    '){
          CSV.foreach("data.csv"){|row| 
          }
        }
        x.report('IOread      '){
        data = IO.read("data.csv").split.inject([]){|memo,o| memo << o.split(',')}.each{|x| }
        }
      }
      

      The result:

                         user     system      total        real
      1.string2date  0.827000   0.000000   0.827000 (  0.820001)
      2.DateCompare  0.078000   0.000000   0.078000 (  0.070000)
      3.DateAdd2     0.109000   0.000000   0.109000 (  0.110000)
      4.Date Differ  0.359000   0.000000   0.359000 (  0.360000)
      5.ArrDateComp  0.109000   0.000000   0.109000 (  0.110001)
      6.array shift  0.094000   0.000000   0.094000 (  0.090000)
      7.Add Ratio    0.530000   0.000000   0.530000 (  0.530000)
       CSVparse      2.902000   0.016000   2.918000 (  2.910005)
       IOread        0.515000   0.015000   0.530000 (  0.540000)
      

      Analyzing the result

      • Converting a date-formatted string to a Date is the slowest of all these operations, so it should be done during file parsing, to ensure the string-to-date conversion is executed only once per record.
      • Comparing two dates is about 4.6 times faster than calculating the days between them (0.078s vs 0.359s), so I'll store the date n days ahead in the buffer rather than an integer count of days since the epoch (both points are condensed in the snippet after this list).
      • The total processing time includes at least parts 1, 2, 3, 5, 6 and 7, so the lower bound of the estimated processing time is 1.75 seconds. There are some overheads not included.
      • With CSV parsing the lower bound would be 4.24 seconds.
      • With IO#read and split the lower bound would be 2.262 seconds.
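
      In code, those two decisions amount to doing the expensive work once per record, at parse time (a condensed illustration of the bullets above; both lines reappear in the implementation below):

      # parse the date string once, while the row is being read...
      row << Date.strptime(row[1], "%Y-%m-%d")
      # ...and buffer the precomputed target date rather than an integer day count
      row << (row[3] + 2)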

      The implementation of Buffer Class and push method

      class Buff
        def initialize
          @buff=[]
          @epoch = Date.new(1970,1,1)
          @n=2
        end
        def push_date( row )
          # store buff with two date values appended, ["a", "2014-6-1", "1", #<Date: 2014-06-01 ...>, #<Date: 2014-06-03 ...>]
          # the last element is the date n days after the record's date
          res = []
          @buff << (row << (row[3] + @n) )
          # flush every buffered record whose target date (index 4) has been
          # reached by the incoming row's date, or whose group differs
          while (@buff.last[3] >= @buff.first[4] || row[0] != @buff.first[0])
            # emit the ratio only on an exact date match within the same group; otherwise blank
            v = (@buff.last[3] == @buff.first[4] && row[0] == @buff.first[0] ? (row[2].to_f/@buff.first[2].to_f).round(2) : "")
            res <<(@buff.shift[0..2]<< v)
          end
          return res
        end
      
        def tails
          # whatever is still buffered at EOF has no price n days later: blank ratio
          @buff.inject([]) {|res,x|  res << (x[0..2]<< "")}
        end
        def clear
          @buff=[]
        end
      end
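
      To make the flushing behaviour concrete, here is a small hand trace (my own usage example; the caller appends the parsed Date before pushing, exactly as the benchmark below does):

      buff = Buff.new
      p buff.push_date(['a', '2014-6-1', '1', Date.new(2014,6,1)])
      # => []   nothing is n days old yet, so nothing is flushed
      p buff.push_date(['a', '2014-6-2', '2', Date.new(2014,6,2)])
      # => []
      p buff.push_date(['a', '2014-6-4', '3', Date.new(2014,6,4)])
      # => [["a", "2014-6-1", "1", ""], ["a", "2014-6-2", "2", 1.5]]
      #    2014-6-1 flushes blank (no 2014-6-3 price); 2014-6-2 gets 3/2 = 1.5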
      

      Benchmark

      buff=Buff.new
      res=[]
      Benchmark.bm{|x|
        buff.clear
        res = []
        x.report("CSVdate"){
          CSV.foreach("data.csv"){|row| 
            buff.push_date(row << Date.strptime(row[1],"%Y-%m-%d")).each{|x| res << x}
          }
          buff.tails.each{|x| res << x}
        }
      
        buff.clear
        res = []
        x.report("IOdate"){
          IO.read("data.csv").split.inject([]){|memo,o| memo << o.split(',')}.each {|row| 
            buff.push_date(row << Date.strptime(row[1],"%Y-%m-%d")).each{|x| res << x}
          }
          buff.tails.each{|x| res << x}
        }
      
      }
      puts "output result count:#{res.size}"
      puts "Here is the fist 12 sample outputs:"
      res[0..11].each{|x| puts x.to_s}
      

      Result

                   user     system      total        real
      CSVdate  6.411000   0.047000   6.458000 (  6.500009)
      IOdate   3.557000   0.109000   3.666000 (  3.710005)
      
      output result count:360000
      Here are the first 12 sample outputs:
      ["a", "2014-6-1", "1", ""]
      ["a", "2014-6-2", "2", 1.5]
      ["a", "2014-6-4", "3", ""]
      ["a", "2014-6-5", "4", ""]
      ["b", "2014-6-1", "1", 3.0]
      ["b", "2014-6-2", "2", 2.0]
      ["b", "2014-6-3", "3", 1.67]
      ["b", "2014-6-4", "4", ""]
      ["b", "2014-6-5", "5", ""]
      ["a", "2014-6-1", "1", ""]
      ["a", "2014-6-2", "2", 1.5]
      ["a", "2014-6-4", "3", ""]
      

      Conclusion

      • The real processing time is 3.557 seconds, about 57% slower than the estimated lower bound, but there are still some overheads not accounted for.
      • The CSV version is about 2 times slower than the IO#read version.
      • We should read the input file block by block with IO#read to avoid running out of memory on the 1.8G input (a sketch follows this list).
      • There must still be some room for tuning.
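
      A minimal sketch of that block-by-block reading (my own illustration, not part of the original answer; the chunk size is arbitrary, and the key point is carrying the trailing partial line over to the next chunk):

      CHUNK = 16 * 1024 * 1024                  # 16MB per read
      leftover = ''
      File.open('data.csv') do |f|
        while chunk = f.read(CHUNK)             # nil at EOF ends the loop
          chunk = leftover + chunk
          last_nl = chunk.rindex("\n")
          if last_nl.nil?                       # no line break in this chunk yet
            leftover = chunk
            next
          end
          leftover = chunk[(last_nl + 1)..-1]   # partial last line, kept for next round
          chunk[0...last_nl].split("\n").each do |line|
            row = line.split(',')
            buff.push_date(row << Date.strptime(row[1], "%Y-%m-%d")).each { |x| res << x }
          end
        end
        unless leftover.empty?                  # file may lack a trailing newline
          row = leftover.split(',')
          buff.push_date(row << Date.strptime(row[1], "%Y-%m-%d")).each { |x| res << x }
        end
      end
      buff.tails.each { |x| res << x }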

      UPDATE1:

      Tuning

      A faster push, obtained by changing the order of the group compare and the date compare:

      class Buff
        def push_fast( row ) 
          # store buff with two date value appended, ["a", "2014-6-1", "1",  #<Date: 2014-06-01 ((2456908j,0s,0n),+0s,2299161j)>,#<Date: 2014-06-03 ((2456908j,0s,0n),+0s,2299161j)>]
          # the last element of date is n days after the record's date 
          res = []
          row << (row[3] + @n) 
          # change the order of the two compares, can reduce the counts of date compares
          while @buff.first && (row[0] != @buff.first[0] || row[3] >= @buff.first[4] )
            v = (row[0] == @buff.first[0] && row[3] == @buff.first[4] ? (row[2].to_f/@buff.first[2].to_f).round(2) : "")
            res <<(@buff.shift[0..2]<< v)
          end
          @buff << row
          return res
        end
      end
      

      Benchmark result

                  user     system      total        real
      IOdate  3.806000   0.031000   3.837000 (  3.830005)
      IOfast  3.323000   0.062000   3.385000 (  3.390005)
      

      This gains about 0.48 seconds. Comparing the group first saves a lot of date-comparison time: when the group changes, all buffered records are shifted out without any date compare at all.
