Importing and analysing non-rectangular .csv files in R
Question
I'm moving to R from Mathematica, where I don't need to anticipate data structures during import; in particular, I don't need to know in advance whether my data is rectangular.
I have many .csv files formatted as follows:
tasty,chicken,cinnamon
not_tasty,butter,pepper,onion,cardamom,cayenne
tasty,olive_oil,pepper
okay,olive_oil,onion,potato,black_pepper
not_tasty,tomato,fenugreek,pepper,onion,potato
tasty,butter,cheese,wheat,ham
Rows have differing lengths and contain only strings.
In R, how should I approach this problem?
What have you tried?
I have tried using read.table:
dataImport <- read.table("data.csv", header = FALSE)
class(dataImport)
##[1] "data.frame"
dim(dataImport)
##[1] 6 1
dataImport[1]
##[1] tasty,chicken,cinnamon
##6 Levels: ...
From the documentation, I interpret this as a single column in which each list of ingredients is a distinct row. I can extract the first three rows as follows; each row is of class factor but appears to contain more data than I expect:
dataImport[c(1,2,3),1]
## my rows
rowOne <- dataImport[c(1),1];
class(rowOne)
## "factor"
rowOne
## [1] tasty,chicken,cinnamon
## 6 Levels: not_tasty,butter,cheese [...]
This is as far as I've got with this problem for now. I would appreciate advice on whether read.table is suitable for this data structure.
My goal is to group the data by the first element of each row and analyse the differences between each type of recipe. In case it helps inform the choice of data structure, in Mathematica I would do the following:
dataImport=Import["data.csv"];
tasty = Cases[dataImport, {"tasty", ingr__} :> {ingr}]
Discussion of answers
@G.Grothendieck has provided a solution using read.table, with subsequent processing using the reshape2 package. This seems tremendously useful and I'll investigate it later. The general advice here solved my issue, hence the accept.
@MrFlick's suggestion of using the tm package was useful for later analysis using DataframeSource.
Accepted answer
read.table
Try read.table with fill = TRUE:
d1 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE)
giving:
> d1
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper
4      okay olive_oil     onion potato black_pepper
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham
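One caveat: according to the read.table documentation, the number of columns is determined from the first five lines of input, so a longer row further down a file may not be read as intended. The sketch below is not part of the original answer. It recreates the sample data in a temporary file and uses count.fields to size the columns from the whole file. The names path and n_col are illustrative.

```r
# Recreate the sample file from the question in a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
), path)

# Width of the widest row in the whole file, not only the first five lines.
n_col <- max(count.fields(path, sep = ","))

# Supplying col.names fixes the number of columns up front.
d1 <- read.table(path, sep = ",", as.is = TRUE, fill = TRUE,
                 col.names = paste0("V", seq_len(n_col)))
dim(d1)
```

For the sample data the widest row already appears in the first five lines, so the result matches the plain fill = TRUE call; the extra step only matters for files where it does not.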
read.table with NA
Or, to fill the empty cells with NA values, add na.strings = "":
d2 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")
giving:
> d2
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon   <NA>         <NA>    <NA>
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper   <NA>         <NA>    <NA>
4      okay olive_oil     onion potato black_pepper    <NA>
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham    <NA>
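Having real NA values rather than empty strings makes the ragged part easy to measure. As a small sketch (not from the original answer), the block below reads the same sample data from an in-memory string and counts the ingredients in each row; csv_text and n_ingredients are illustrative names.

```r
# Sample data from the question, held in memory instead of data.csv.
csv_text <- "tasty,chicken,cinnamon
not_tasty,butter,pepper,onion,cardamom,cayenne
tasty,olive_oil,pepper
okay,olive_oil,onion,potato,black_pepper
not_tasty,tomato,fenugreek,pepper,onion,potato
tasty,butter,cheese,wheat,ham"

d2 <- read.table(text = csv_text, sep = ",", as.is = TRUE,
                 fill = TRUE, na.strings = "")

# Count the non-missing cells in every column except the label column V1.
n_ingredients <- rowSums(!is.na(d2[-1]))

# Average number of ingredients per label.
tapply(n_ingredients, d2$V1, mean)
```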
Long form
If you want it in long form:
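library(reshape2)
# d2 as read above has no id column, so add a row identifier first
d2$id <- seq_len(nrow(d2))
# [-3] drops the "variable" column that melt() creates
long <- na.omit(melt(d2, id.vars = c("id", "V1"))[-3])
long <- long[order(long$id), ]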
giving:
> long
id V1 value
1 1 tasty chicken
7 1 tasty cinnamon
2 2 not_tasty butter
8 2 not_tasty pepper
14 2 not_tasty onion
20 2 not_tasty cardamom
26 2 not_tasty cayenne
3 3 tasty olive_oil
9 3 tasty pepper
4 4 okay olive_oil
10 4 okay onion
16 4 okay potato
22 4 okay black_pepper
5 5 not_tasty tomato
11 5 not_tasty fenugreek
17 5 not_tasty pepper
23 5 not_tasty onion
29 5 not_tasty potato
6 6 tasty butter
12 6 tasty cheese
18 6 tasty wheat
24 6 tasty ham
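The long form lends itself to the grouping the question asks about. As a sketch (not from the original answer), the block below rebuilds an equivalent long table in base R, so that it runs without reshape2, and then cross-tabulates label against ingredient. The names csv_text, n_var and tab are illustrative.

```r
# Sample data from the question, held in memory instead of data.csv.
csv_text <- "tasty,chicken,cinnamon
not_tasty,butter,pepper,onion,cardamom,cayenne
tasty,olive_oil,pepper
okay,olive_oil,onion,potato,black_pepper
not_tasty,tomato,fenugreek,pepper,onion,potato
tasty,butter,cheese,wheat,ham"

d2 <- read.table(text = csv_text, sep = ",", as.is = TRUE,
                 fill = TRUE, na.strings = "")

# Base R stand-in for melt(): stack columns V2..V6 under a repeated id and label.
n_var <- ncol(d2) - 1
long <- na.omit(data.frame(
  id    = rep(seq_len(nrow(d2)), times = n_var),
  V1    = rep(d2$V1, times = n_var),
  value = unlist(d2[-1], use.names = FALSE),
  stringsAsFactors = FALSE
))

# Rows are labels, columns are ingredients, cells count the recipes.
tab <- table(label = long$V1, ingredient = long$value)
tab[, c("onion", "pepper")]
```

Each cell of tab is the number of recipes with that label that use that ingredient, which is one simple way to compare the recipe types.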
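Wide form with 0/1 binary variables
To represent the variable portion as 0/1 binary variables, try this:
# dcast() is the reshape2 function; cast() comes from the older reshape package
wide <- dcast(long, id + V1 ~ value, value.var = "value")
wide[-(1:2)] <- 0 + !is.na(wide[-(1:2)])
This gives one row per recipe, with id and V1 followed by one 0/1 column per ingredient.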
List in a data frame
A different representation is the following list column in a data frame, so that ag$value is a list of character vectors:
ag <- aggregate(value ~., transform(long, value = as.character(value)), c)
ag <- ag[order(ag$id), ]
giving:
> ag
id V1 value
4 1 tasty chicken, cinnamon
1 2 not_tasty butter, pepper, onion, cardamom, cayenne
5 3 tasty olive_oil, pepper
3 4 okay olive_oil, onion, potato, black_pepper
2 5 not_tasty tomato, fenugreek, pepper, onion, potato
6 6 tasty butter, cheese, wheat, ham
> str(ag)
'data.frame': 6 obs. of 3 variables:
$ id : int 1 2 3 4 5 6
$ V1 : chr "tasty" "not_tasty" "tasty" "okay" ...
$ value:List of 6
..$ 15: chr "chicken" "cinnamon"
..$ 1 : chr "butter" "pepper" "onion" "cardamom" ...
..$ 17: chr "olive_oil" "pepper"
..$ 11: chr "olive_oil" "onion" "potato" "black_pepper"
..$ 6 : chr "tomato" "fenugreek" "pepper" "onion" ...
..$ 19: chr "butter" "cheese" "wheat" "ham"
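A ragged list can also be built directly, which is closer in spirit to the Mathematica Cases idiom in the question. This is a different technique from the read.table approach above, shown only as a sketch: it splits each line on commas and groups the ingredient vectors by their first field. The names lines, fields, labels, ingredients and by_label are illustrative, and in practice lines would come from readLines("data.csv").

```r
# Sample data from the question; with a real file use readLines("data.csv").
lines <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)

# One character vector per row, of whatever length the row has.
fields <- strsplit(lines, ",", fixed = TRUE)

# First field is the label, the rest are the ingredients.
labels <- vapply(fields, `[`, character(1), 1)
ingredients <- lapply(fields, `[`, -1)

# Group the ingredient vectors by label.
by_label <- split(ingredients, labels)

# Ingredient frequencies within the "tasty" recipes.
table(unlist(by_label$tasty))
```

Because nothing is forced into a rectangle, this avoids padding with NA, at the cost of working with lists instead of a data frame.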