Converting sparse matrix to ARFF using awk

Question

I am working with an extremely large data set in a sparse matrix format.

The data has the following format (3 tab-separated columns, where the string in the first column corresponds to a row, the string in the second column corresponds to an attribute, and the value in the third column is a weighted score):

church place 3
church institution 6
man place 86
man food 63
woman book 37

I would like to convert this to arff format using awk (if possible) so that using the above as an input, I can obtain the following output:

@relation 'filename'
@attribute "place" string
@attribute "institution" string
@attribute "food" string
@attribute "book" string


@data
3,6,0,0,church
86,0,63,0,man
0,0,0,37,woman

I have seen an awk script (http://stackoverflow.com/questions/9234232/too-many-attributes-for-arff-format-in-weka) that produces a result quite similar to what I need. However, its input is a bit different. I tried to adapt the code provided by changing FS = "|" to "\t", but it does not produce the desired results. Does anyone have a suggestion as to how I can adapt that awk code to convert my input to my desired output?

Recommended answer

I've no idea what arff is (nor do I need to know to help you transpose your text to a different format) so let's start with this:

$ cat tst.awk
BEGIN{ FS="\t" }    # input columns are tab-separated
NR==1 { printf "@relation '%s'\n", FILENAME }
{
    row = $1
    attr = $2

    # remember each row label once, in order of first appearance
    if (!seenRow[row]++) {
        rows[++numRows] = row
    }

    # print each attribute once, the first time it is seen
    if (!seenAttr[attr]++) {
        printf "@attribute \"%s\" string\n", attr
        attrs[++numAttrs] = attr
    }

    score[row,attr] = $3
}
END {
    print "\n\n@data"
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        row = rows[rowNr]
        for (attrNr=1;attrNr<=numAttrs;attrNr++)  {
            attr = attrs[attrNr]
            # a missing (row, attribute) pair prints as 0
            printf "%d,", score[row,attr]
        }
        print row
    }
}
$
$ cat file
church  place   3
church  institution     6
man     place   86
man     food    63
woman   book    37
$
$ awk -f tst.awk file
@relation 'file'
@attribute "place" string
@attribute "institution" string
@attribute "food" string
@attribute "book" string


@data
3,6,0,0,church
86,0,63,0,man
0,0,0,37,woman

Now, tell us what's wrong with that and we can go from there.
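
One possible follow-up, assuming the file is meant to be loaded by Weka (as the linked question suggests): ARFF expects one @attribute line for every column in the @data section, and numeric scores would normally be declared numeric rather than string. The sketch below is a variant of tst.awk along those lines, not something taken from the question. The function name sparse_to_arff and the attribute name "name" for the trailing row label are made up for illustration.

```shell
# Hypothetical variant of tst.awk, wrapped in a shell function.
# Assumptions (not from the question): scores are declared numeric, and the
# trailing row label gets its own string attribute called "name".
sparse_to_arff() {
    awk -v rel="$1" '
    BEGIN { FS = "\t" }                      # input columns are tab-separated
    {
        if (!seenRow[$1]++)  rows[++numRows]   = $1   # row labels, first-seen order
        if (!seenAttr[$2]++) attrs[++numAttrs] = $2   # attributes, first-seen order
        score[$1,$2] = $3
    }
    END {
        printf "@relation \047%s\047\n", rel          # \047 is a single quote
        for (a = 1; a <= numAttrs; a++)
            printf "@attribute \"%s\" numeric\n", attrs[a]
        print "@attribute \"name\" string"            # column holding the row label
        print ""
        print "@data"
        for (r = 1; r <= numRows; r++) {
            for (a = 1; a <= numAttrs; a++)
                printf "%d,", score[rows[r],attrs[a]] # missing pairs print as 0
            print rows[r]
        }
    }' "$1"
}
```

Calling sparse_to_arff file prints the header and data to standard output. Unlike tst.awk, the @attribute lines are written in the END block, so that the extra attribute for the row label can follow the score attributes.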
