Converting CSV to SequenceFile


Problem description

I have a CSV file that I would like to convert to a SequenceFile, which I would ultimately use to create NamedVectors for a clustering job. I have been using the seqdirectory command to make a SequenceFile, and then feeding that output into seq2sparse with the -nv option to create NamedVectors. This seems to give one big vector as output, but I ultimately want each line of my CSV to become a NamedVector. Where am I going wrong?

Solution

The seqdirectory command treats every file as one document, so in reality you have only one document, and hence you get only one vector. To make it work properly, you would need to make each line of your CSV file a file of its own, where the key of the document is the name of the file and the value is its content. Nonetheless, this is quite impractical if your corpus is large, as disk reading and writing can become painfully slow.

In practice, you are better off following the links I share in this comment: https://stackoverflow.com/a/11948318/863772
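For a small corpus, the one-file-per-line workaround described above can be sketched roughly as follows. This is only an illustration, not part of the original answer: the function name `split_csv_into_documents`, the `row_NNNNNNNN.txt` naming scheme, and the choice to join the fields with spaces are all assumptions made for the example.

```python
import csv
from pathlib import Path


def split_csv_into_documents(csv_path, out_dir, has_header=False):
    """Write each CSV row to its own text file and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row if there is one
        for index, row in enumerate(reader):
            if not row:
                continue  # ignore blank lines
            # Assumed naming scheme: the file name is what ends up as the document key.
            doc_path = out / f"row_{index:08d}.txt"
            # Assumption: joining the fields with spaces gives plain text to tokenize.
            doc_path.write_text(" ".join(row), encoding="utf-8")
            written.append(doc_path)
    return written
```

The resulting directory could then be used as the input to seqdirectory, so that every row is seen as a separate document. As the answer notes, writing one small file per row becomes slow for a large corpus, so this is only reasonable for modest inputs.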
