How to import csv data where some observations are on two rows


Problem description


I have a dataset with a couple of million rows in csv format that I want to import into Stata. I can do this, but there is a problem: a small percentage (but still many) of the observations are split across two lines in the CSV file. Most entries occupy only one line. The troublesome observations that take up two lines are still comma-delimited in the same pattern as the rest. But in the Stata dataset, each such observation shows up as two rows, each containing only part of the data.

I used import delimited to import the data. Is there anything that can be done at the data import stage of the process in Stata? I would prefer to not have to deal with this in the original CSV file if possible.

Update

Here is an example of what the csv file looks like:

var1,var2,var3,var4,var5 
text 1,    text 2,text 3   ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1     
         2,text 13,text14,text15
text16,text17,text18,text19,text20 

Notice that there is no comma at the end of the line. Also notice that the problem is with the observation that begins with text 11.

This is basically how it shows up in Stata:

    var1     var2     var3     var4     var5 
1   text 1   text 2   text 3   text 4   text 5
2   text 6   text 7   text 8   text9    text10
3   text 11  text 1
4   2        text 13  text14  text15
5   text16   text17   text18   text19   text20

The number sometimes sitting right next to the text (as in text9) is not a mistake; it just illustrates that the data is more complex than shown here.

Of course, this is how I need the data:

    var1     var2     var3     var4     var5 
1   text 1   text 2   text 3   text 4   text 5
2   text 6   text 7   text 8   text9    text10
3   text 11  text 12  text 13  text14   text15
4   text16   text17   text18   text19   text20

Solution

A convoluted way is (comments inline):

clear
set more off

*----- example data -----

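// read each line as a single string variable by naming a delimiter (;)
// that does not occur in the file; change it if your data contain semicolons
insheet using "~/Desktop/stata_tests/test.csv", names delim(;)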

list

*----- what you want -----

// compute number of commas
gen numcom = length(var1var2var3var4var5) ///
    - length(subinstr(var1var2var3var4var5, ",", "", .))

// save all data
tempfile orig
save "`orig'"

// keep observations that are fine
drop if numcom != 4

// save fine data
tempfile origfine
save "`origfine'"

*-----

// load all data
use "`orig'", clear

// keep offending observations
drop if numcom == 4

// for the -reshape-
gen i = int((_n-1)/2) +1
bysort i : gen j = _n

// check that pairs add up to 4 commas
by i : egen check = total(numcom)
assert check == 4

// no longer necessary
drop numcom check

// reshape wide
reshape wide var1var2var3var4var5, i(i) j(j)

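// gen definitive variable; trim the padding around the break so the halves
// join as "text 12" (drop the trims if that whitespace is meaningful)
gen var1var2var3var4var5 = rtrim(var1var2var3var4var51) + ltrim(var1var2var3var4var52)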
keep var1var2var3var4var5

// append new observations with original good ones
append using "`origfine'"

// split
split var1var2var3var4var5, parse(,) gen(var)

// we're "done"
drop var1var2var3var4var5 numcom
list

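But we don't really have the details of your data, so this may or may not work. It's just meant to be a rough draft. It assumes every broken observation spans exactly two consecutive lines (the assert stops the run if a pair does not add up to 4 commas), and the rejoined observations end up at the top of the dataset, ahead of the intact ones. Depending on the memory space occupied by your data, and other details, you may need to improve parts of the code to make it more efficient.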
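The same pairing idea can also be sketched without reshape, which keeps the original row order. This is only a rough sketch under the same assumption that each broken observation sits on exactly two consecutive lines. The variable names line, bad, k, half1 and half2 are made up for the illustration, and the example data is typed in with input (without the header line) so the block runs on its own.

```stata
clear
input str80 line
"text 1,    text 2,text 3   ,text 4,text 5"
"text 6,text 7,text 8,text9,text10"
"text 11,text 1     "
"         2,text 13,text14,text15"
"text16,text17,text18,text19,text20"
end

// count commas per raw line; a complete line has 4
gen numcom = length(line) - length(subinstr(line, ",", "", .))

// flag incomplete lines and number them in file order
gen byte bad = numcom != 4
gen long k = sum(bad)

// odd-numbered incomplete lines are first halves, even-numbered are second halves
gen byte half1 = bad & (mod(k, 2) == 1)
gen byte half2 = bad & (mod(k, 2) == 0)

// glue each first half to the line that follows it, trimming the padding
replace line = rtrim(line) + ltrim(line[_n+1]) if half1
drop if half2

// every remaining line should now be complete
replace numcom = length(line) - length(subinstr(line, ",", "", .))
assert numcom == 4

// split into var1 ... var5
split line, parse(,) gen(var)
drop line numcom bad k half1 half2
list
```

If the assert fails, some observation is probably broken over more than two lines, or a field contains a comma of its own, and the pairing rule needs rethinking for your data.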

Note: the file test.csv looks like

var1,var2,var3,var4,var5 
text 1,    text 2,text 3   ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1     
         2,text 13,text14,text15
text16,text17,text18,text19,text20

Note 2: I'm using insheet because I don't have Stata 13 at the moment. import delimited is the way to go if available.
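For Stata 13 or later, the read-in step might look roughly like the sketch below. The path is the one from my test above, and the name line is just an illustration. I read the header as data and drop it, so I don't have to rely on how import delimited would name a variable built from a header containing commas.

```stata
// read every line as one string column by naming a delimiter (;) assumed
// not to occur in the file; the header is read as data, not as names
import delimited using "~/Desktop/stata_tests/test.csv", ///
    delimiters(";") varnames(nonames) stringcols(_all) clear

// with varnames(nonames) the single column is called v1
rename v1 line

// the first row is the header line var1,var2,var3,var4,var5
drop in 1
list
```

From there the comma counting and pairing work on line exactly as on var1var2var3var4var5 above.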

Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.
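As a quick illustration of that counting trick (a toy example; the variable name line is made up): removing every comma shortens the string by exactly one character per comma, so the difference in lengths is the number of commas.

```stata
clear
input str40 line
"text 6,text 7,text 8,text9,text10"
"text 11,text 1"
"2,text 13,text14,text15"
end

// length before minus length after deleting all commas = number of commas
gen numcom = length(line) - length(subinstr(line, ",", "", .))

// the complete line has 4 commas; the two halves have 1 and 3
list
```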
