合并大量的csv数据集 [英] Merging a large number of csv datasets

查看:219
本文介绍了合并大量的csv数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是2个样本数据集。

PRISM-APPT_1895.csv
https://copy.com/SOO2KbCHBX4MRQbn

PRISM-APPT_1895.csv https://copy.com/SOO2KbCHBX4MRQbn

PRISM-APPT_1896.csv
https://copy.com/JDytBqLgDvk6JzUe

PRISM-APPT_1896.csv https://copy.com/JDytBqLgDvk6JzUe

我有100种这样的数据集,我正在尝试合并到一个数据框中,将其导出到csv,然后将其合并到另一个非常大的数据集中。

I have 100 of these types of data sets that I'm trying to merge into one data frame, export that to csv, and then merge that into another very large dataset.

我需要将所有内容合并为gridNumber和Year,创建一个时间序列数据集。

I need to merge everything by "gridNumber" and "Year", creating a time series dataset.

最初,我导入了所有的年度数据集,然后尝试将它们与这些数据集合:

Originally, I imported all of the annual datasets and then tried to merge them with this :

df <- join_all(list(Year_1895,    Year_1896,    Year_1897,    Year_1898,    Year_1899,    Year_1900,    Year_1901,    Year_1902,   
                    Year_1903,    Year_1904,    Year_1905,    Year_1906,   Year_1907,    Year_1908,    Year_1909,    Year_1910,   
                    Year_1911,    Year_1912,    Year_1913,    Year_1914,   Year_1915,    Year_1916,    Year_1917,    Year_1918,   
                    Year_1919,    Year_1920,    Year_1921,    Year_1922,   Year_1923,    Year_1924,    Year_1925,    Year_1926,   
                    Year_1927,    Year_1928,    Year_1929,    Year_1930,   Year_1931,    Year_1932,    Year_1933,    Year_1934,   
                    Year_1935,    Year_1936,    Year_1937,    Year_1938,   Year_1939,    Year_1940,    Year_1941,    Year_1942,   
                    Year_1943,    Year_1944,    Year_1945,    Year_1946,   Year_1947,    Year_1948,    Year_1949,    Year_1950,   
                    Year_1951,    Year_1952,    Year_1953,    Year_1954,   Year_1955,    Year_1956,    Year_1957,    Year_1958,   
                    Year_1959,    Year_1960,    Year_1961,    Year_1962,   Year_1963,    Year_1964,    Year_1965,    Year_1966,   
                    Year_1967,    Year_1968,    Year_1969,    Year_1970,   Year_1971,    Year_1972,    Year_1973,    Year_1974,   
                    Year_1975,    Year_1976,    Year_1977,    Year_1978,   Year_1979,    Year_1980,    Year_1981,    Year_1982,   
                    Year_1983,    Year_1984,    Year_1985,    Year_1986,   Year_1987,    Year_1988,    Year_1989,    Year_1990,   
                    Year_1991,    Year_1992,    Year_1993,    Year_1994,   Year_1995,    Year_1996,    Year_1997,    Year_1998,   
                    Year_1999,    Year_2000), 
                    by = c("gridNumber","Year"),type="full")

但是R一直在崩溃,因为我认为合并有点大它处理,所以我正在寻找一些会更好的工作。也许data.table?或另一个选择。

But R keeps crashing because I think the merge is a bit to large for it to handle, so I'm looking for something that would work better. Maybe data.table? Or another option.

感谢您提供的任何帮助。

Thanks for any help you can provide.

推荐答案

几个月后,你的问题没有答案。我找不到您的数据集,但是,我会显示一种方式来完成这项工作。在awk中是微不足道的。

Almost nine months later and your question has no answer. I could not find your datasets, however, I will show one way to do the job. It is trivial in awk.

这是一个最小的awk脚本:

Here is a minimal awk script:

BEGIN {
    for(i=0;i<10;i++) {
        filename = "out" i ".csv";
        while(getline < filename) print $0;
        close(filename);
    }
}

该脚本运行为
awk -f s.awk
其中s.awk是文本文件中的上述脚本。

The script is run as awk -f s.awk where s.awk is the above script in a text file.

脚本创建十个文件名:out0.csv,out1.csv ... out9.csv。这些是已经存在的数据文件。打开第一个文件,将所有记录发送到标准输出。然后关闭文件,创建并打开下一个文件。上面的脚本很少提供命令行读/重定向。通常使用awk处理从另一个文件读取的长列表文件名;使用语句来根据各种标准有选择地忽略行或列。

This script creates ten filenames: out0.csv, out1.csv ... out9.csv. These are the already-existing files with the data. The first file is opened and all records sent to the standard output. The file is then closed and the next filename created and opened. The above script has little to offer over a command line read/redirect. You would typically use awk to process a long list of filenames read from another file; with statements to selectively ignore lines or columns depending on various criteria.

这篇关于合并大量的csv数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆