Importing and analysing non-rectangular .csv files in R


Question

I'm moving to R from Mathematica, where I don't need to anticipate data structures during import; in particular, I do not need to anticipate the rectangularness of my data before import.

I have many .csv files formatted as follows:

tasty,chicken,cinnamon
not_tasty,butter,pepper,onion,cardamom,cayenne
tasty,olive_oil,pepper
okay,olive_oil,onion,potato,black_pepper
not_tasty,tomato,fenugreek,pepper,onion,potato
tasty,butter,cheese,wheat,ham

Rows have differing lengths and contain only strings.

In R, how should I approach this problem?

What have you tried?

I have tried using read.table:

dataImport <- read.table("data.csv", header = FALSE)
class(dataImport)
##[1] "data.frame"
dim(dataImport)
##[1] 6   1
dataImport[1]
##[1] tasty,chicken,cinnamon
##6 Levels: ...

From the documentation, I interpret this as a single column with each list of ingredients as a distinct row. I can extract the first three rows as follows; each row is of class factor but appears to contain more data than I expect:

dataImport[c(1,2,3),1]
## my rows
rowOne <- dataImport[c(1),1];
class(rowOne)
## "factor"
rowOne
## [1] tasty,chicken,cinnamon
## 6 Levels: not_tasty,butter,cheese [...]

This is as far as I've pursued this problem for now. I would appreciate advice on the suitability of read.table for this data structure.

My goal is to group the data by the first element of each row and analyse the differences between each type of recipe. In case it helps influence data structure advice, in Mathematica I would do the following:

dataImport=Import["data.csv"];
tasty = Cases[dataImport, {"tasty", ingr__} :> {ingr}]

Discussion of answers

@G.Grothendieck has provided a solution using read.table and subsequent processing with the reshape2 package. This seems tremendously useful and I'll investigate it later. The general advice here solved my issue, hence the accept.

@MrFlick's suggestion of using the tm package was useful for later analysis using DataframeSource.

Recommended answer

read.table

Try read.table with fill = TRUE:

d1 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE)

giving:

> d1
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon                            
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper                            
4      okay olive_oil     onion potato black_pepper        
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham   
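
One caveat worth noting: according to the read.table documentation, the number of columns is determined from the first five lines of input unless col.names is longer. In the sample data the longest row is among the first five, so fill = TRUE alone is enough, but in other files it may not be. A minimal sketch of a guard against that, assuming the sample rows are written to a temporary file (the names csv_text, path and n_cols are only illustrative):

```r
# Sample rows from the question, written to a temporary file for illustration
csv_text <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# Count the fields on every line and take the widest row
n_cols <- max(count.fields(path, sep = ","))

# Supplying col.names of that length makes read.table allocate enough columns
d <- read.table(path, sep = ",", as.is = TRUE, fill = TRUE,
                col.names = paste0("V", seq_len(n_cols)))
```

Passing col.names explicitly means the column count no longer depends on where the longest row happens to sit in the file.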

read.table with NA

or, to fill the empty cells with NA values, add na.strings = "":

d2 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")

giving:

> d2
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon   <NA>         <NA>    <NA>
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper   <NA>         <NA>    <NA>
4      okay olive_oil     onion potato black_pepper    <NA>
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham    <NA>

Long form

If you want it in long form:

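library(reshape2)
d2$id <- seq_len(nrow(d2))  # row identifier; melt() below needs it as an id variable
long <- na.omit(melt(d2, id.vars = c("id", "V1"))[-3])  # [-3] drops the "variable" column
long <- long[order(long$id), ]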

giving:

> long
   id        V1        value
1   1     tasty      chicken
7   1     tasty     cinnamon
2   2 not_tasty       butter
8   2 not_tasty       pepper
14  2 not_tasty        onion
20  2 not_tasty     cardamom
26  2 not_tasty      cayenne
3   3     tasty    olive_oil
9   3     tasty       pepper
4   4      okay    olive_oil
10  4      okay        onion
16  4      okay       potato
22  4      okay black_pepper
5   5 not_tasty       tomato
11  5 not_tasty    fenugreek
17  5 not_tasty       pepper
23  5 not_tasty        onion
29  5 not_tasty       potato
6   6     tasty       butter
12  6     tasty       cheese
18  6     tasty        wheat
24  6     tasty          ham
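
Once the data is in long form, grouping by the first element of each row (the stated goal of the question) can be done with base R. The following is only a sketch: it rebuilds an equivalent long table without reshape2 so that it runs on its own, and the names ingredients, keep, by_label and counts are illustrative rather than part of the answer above.

```r
# Sample rows from the question, read directly from a character vector
csv_text <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)
d2 <- read.table(text = csv_text, sep = ",", as.is = TRUE,
                 fill = TRUE, na.strings = "")

# Build a long table in base R: one row per (recipe, ingredient) pair
ingredients <- as.matrix(d2[-1])
keep <- !is.na(ingredients)
long <- data.frame(
  id    = row(ingredients)[keep],
  V1    = d2$V1[row(ingredients)[keep]],
  value = ingredients[keep],
  stringsAsFactors = FALSE
)
long <- long[order(long$id), ]

# Group ingredients by the label in the first column
by_label <- split(long$value, long$V1)

# Cross-tabulate labels against ingredients to compare recipe types
counts <- table(long$V1, long$value)
```

Here by_label$tasty holds every ingredient that appears in a tasty recipe, and counts shows how often each ingredient occurs under each label.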

Wide form with 0/1 binary variables

To represent the variable portion as 0/1 binary variables try this:

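wide <- dcast(long, id + V1 ~ value, value.var = "value")  # dcast() is reshape2's casting function
wide[-(1:2)] <- 0 + !is.na(wide[-(1:2)])  # 1 where the ingredient is present, 0 otherwise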

This gives one row per recipe, with the id and V1 columns followed by one 0/1 column per ingredient.
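
For comparison, a similar 0/1 indicator layout can be sketched in base R with table(). This is an alternative to the reshape2 route, not a reproduction of it, and it assumes the rows are available as a character vector (csv_text, rows, id, value and indicator are illustrative names):

```r
# Sample rows from the question
csv_text <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)

# Split each line into its fields
rows <- strsplit(csv_text, ",", fixed = TRUE)

# Recipe id repeated once per ingredient (every field except the label)
id <- rep(seq_along(rows), lengths(rows) - 1)
value <- unlist(lapply(rows, `[`, -1))

# Cross-tabulate and convert counts to 0/1 indicators
indicator <- (table(id, value) > 0) + 0L
```

Each row of indicator corresponds to one recipe and each column to one ingredient; the label column is not included and would need to be attached separately.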

List in a data frame

A different representation would be the following list in a data frame, so that ag$value is a list of character vectors:

ag <- aggregate(value ~., transform(long, value = as.character(value)), c)
ag <- ag[order(ag$id), ]

giving:

> ag
  id        V1                                    value
4  1     tasty                        chicken, cinnamon
1  2 not_tasty butter, pepper, onion, cardamom, cayenne
5  3     tasty                        olive_oil, pepper
3  4      okay   olive_oil, onion, potato, black_pepper
2  5 not_tasty tomato, fenugreek, pepper, onion, potato
6  6     tasty               butter, cheese, wheat, ham

> str(ag)
'data.frame':   6 obs. of  3 variables:
 $ id   : int  1 2 3 4 5 6
 $ V1   : chr  "tasty" "not_tasty" "tasty" "okay" ...
 $ value:List of 6
  ..$ 15: chr  "chicken" "cinnamon"
  ..$ 1 : chr  "butter" "pepper" "onion" "cardamom" ...
  ..$ 17: chr  "olive_oil" "pepper"
  ..$ 11: chr  "olive_oil" "onion" "potato" "black_pepper"
  ..$ 6 : chr  "tomato" "fenugreek" "pepper" "onion" ...
  ..$ 19: chr  "butter" "cheese" "wheat" "ham"
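
If a plain list is all that is needed, which is close to what the Mathematica Cases call in the question produces, the data frame step can be skipped. The following is a base-R sketch under the assumption that no field contains an embedded or quoted comma; the names rows, labels, ingredients and by_label are illustrative.

```r
# Sample rows from the question, written to a temporary file for illustration
csv_text <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# Read raw lines and split on commas; rows may have any length
rows <- strsplit(readLines(path), ",", fixed = TRUE)

# First field is the label, the rest are the ingredients
labels <- vapply(rows, `[`, character(1), 1)
ingredients <- lapply(rows, `[`, -1)

# Group the ingredient vectors by label
by_label <- split(ingredients, labels)
tasty <- by_label[["tasty"]]
```

Here tasty is a list with one character vector per tasty recipe.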
