如何在不将实际文件加载到环境的情况下将多个csv文件合并为一个大文件? [英] How to combine multiple csv files into one big file without loading the actual file into the environment?

查看:116
本文介绍了如何在不将实际文件加载到环境的情况下将多个csv文件合并为一个大文件?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

是否有不使用read.csv/read_csv函数将多个CSV文件组合到一个超级文件中的方法?

Is there anyway to combine multiple CSV files together into a super file without using the read.csv/read_csv functions?

我想将文件夹中的所有表格(CSV)合并到一个csv文件中,因为它们每个都代表一个单独的月份.该文件夹如下所示:

I want to combine all the tables (CSV) in the folder into one csv file, since each of them represents a separate month. The folder looks like this:

列表.文件(文件夹)

 [1] "2013-07 - Citi Bike trip data.csv" "2013-08 - Citi Bike trip data.csv" "2013-09 - Citi Bike trip data.csv"
 [4] "2013-10 - Citi Bike trip data.csv" "2013-11 - Citi Bike trip data.csv" "2013-12 - Citi Bike trip data.csv"
 [7] "2014-01 - Citi Bike trip data.csv" "2014-02 - Citi Bike trip data.csv" "2014-03 - Citi Bike trip data.csv"
[10] "2014-04 - Citi Bike trip data.csv" "2014-05 - Citi Bike trip data.csv" "2014-06 - Citi Bike trip data.csv"
[13] "2014-07 - Citi Bike trip data.csv" "2014-08 - Citi Bike trip data.csv" "201409-citibike-tripdata.csv"     
[16] "201410-citibike-tripdata.csv"      "201411-citibike-tripdata.csv"      "201412-citibike-tripdata.csv"     
[19] "201501-citibike-tripdata.csv"      "201502-citibike-tripdata.csv"      "201503-citibike-tripdata.csv"     
[22] "201504-citibike-tripdata.csv"      "201505-citibike-tripdata.csv"      "201506-citibike-tripdata.csv"     
[25] "201507-citibike-tripdata.csv"      "201508-citibike-tripdata.csv"      "201509-citibike-tripdata.csv"     
[28] "201510-citibike-tripdata.csv"      "201511-citibike-tripdata.csv"      "201512-citibike-tripdata.csv"     
[31] "201601-citibike-tripdata.csv"      "201602-citibike-tripdata.csv"      "201603-citibike-tripdata.csv"     

我尝试了以下操作,但确实获得了大数据,这是一个包含33个元素和3.6 Gbs的大型列表.但是,整个过程花费了一段时间.考虑到网站每月更新一次的事实,数据量的增加将使合并过程更加缓慢.因此,有人可以帮助我将所有数据文件组合在一起而无需将它们加载到环境中吗?可以在以下位置找到数据源: https://s3.amazonaws.com/tripdata/index.html .

I tried the following and did get the big data, which is a large list of 33 elements and 3.6 Gbs. However, the full process took a while. Considering the fact that the website is updated monthly, the increasing data size will make the merging process more slowly. Thus, could someone help me combine all the data files together without loading them into the environment? The data source could be found here: https://s3.amazonaws.com/tripdata/index.html.

filenames<- list.files(folder, full.names =TRUE)
data<- lapply(filenames,read_csv)

数据文件看起来像这样,这不是我想要的形式.我想有一张大桌子,将所有信息合并在一起.

The data file looks like this, which is not the form I want. I would like to have a big table with all the information merged together.

> head(data)
[[1]]
Source: local data frame [843,416 x 15]

   tripduration           starttime            stoptime start station id      start station name start station latitude
          (int)              (time)              (time)            (int)                   (chr)                  (dbl)
1           634 2013-07-01 00:00:00 2013-07-01 00:10:34              164         E 47 St & 2 Ave               40.75323
2          1547 2013-07-01 00:00:02 2013-07-01 00:25:49              388        W 26 St & 10 Ave               40.74972
3           178 2013-07-01 00:01:04 2013-07-01 00:04:02              293   Lafayette St & E 8 St               40.73029
4          1580 2013-07-01 00:01:06 2013-07-01 00:27:26              531  Forsyth St & Broome St               40.71894
5           757 2013-07-01 00:01:10 2013-07-01 00:13:47              382 University Pl & E 14 St               40.73493
6           861 2013-07-01 00:01:23 2013-07-01 00:15:44              511      E 14 St & Avenue B               40.72939
7           550 2013-07-01 00:01:59 2013-07-01 00:11:09              293   Lafayette St & E 8 St               40.73029
8           288 2013-07-01 00:02:16 2013-07-01 00:07:04              224   Spruce St & Nassau St               40.71146
9           766 2013-07-01 00:02:16 2013-07-01 00:15:02              432       E 7 St & Avenue A               40.72622
10          773 2013-07-01 00:02:23 2013-07-01 00:15:16              173      Broadway & W 49 St               40.76065
..          ...                 ...                 ...              ...                     ...                    ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
  station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)

[[2]]
Source: local data frame [1,001,958 x 15]

   tripduration           starttime            stoptime start station id        start station name start station latitude
          (int)              (time)              (time)            (int)                     (chr)                  (dbl)
1           664 2013-08-01 00:00:00 2013-08-01 00:11:04              449           W 52 St & 9 Ave               40.76462
2          2115 2013-08-01 00:00:01 2013-08-01 00:35:16              254           W 11 St & 6 Ave               40.73532
3           385 2013-08-01 00:00:03 2013-08-01 00:06:28              460        S 4 St & Wythe Ave               40.71286
4           653 2013-08-01 00:00:10 2013-08-01 00:11:03              398  Atlantic Ave & Furman St               40.69165
5           954 2013-08-01 00:00:11 2013-08-01 00:16:05              319       Park Pl & Church St               40.71336
6           145 2013-08-01 00:00:37 2013-08-01 00:03:02              521           8 Ave & W 31 St               40.75045
7           331 2013-08-01 00:01:25 2013-08-01 00:06:56             2000  Front St & Washington St               40.70255
8           194 2013-08-01 00:01:26 2013-08-01 00:04:40              313 Washington Ave & Park Ave               40.69610
9           598 2013-08-01 00:01:40 2013-08-01 00:11:38              528           2 Ave & E 31 St               40.74291
10          360 2013-08-01 00:01:45 2013-08-01 00:07:45              500        Broadway & W 51 St               40.76229
..          ...                 ...                 ...              ...                       ...                    ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
  station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)

推荐答案

我会用这个:

library(data.table)
multmerge = function(path){
  filenames=list.files(path=path, full.names=TRUE)
  rbindlist(lapply(filenames, fread))
} 
path <- "C:/Users/kkk/Desktop/test/test1"
mergeA <- multmerge(path)
write.csv(mergeA, "mergeA.csv")

该解决方案发布在不同的线程下,作为合并多个文件的一种方式

That solution was posted under a different thread as a way to merge multiple files

这篇关于如何在不将实际文件加载到环境的情况下将多个csv文件合并为一个大文件?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆