Converting CSV to SequenceFile
Problem description
I have a CSV file which I would like to convert to a SequenceFile, which I would ultimately use to create NamedVectors for a clustering job. I've been using the seqdirectory command to try to make a SequenceFile, and then feeding that output into seq2sparse with the -nv option to create NamedVectors. This seems to produce one big vector as output, but I ultimately want each line of my CSV to become a NamedVector. Where am I going wrong?
The seqdirectory command treats every file as a document, so in reality you have only one document and hence get only one vector. To make it work properly, you would have to make each line of your CSV file a file of its own, where the key of the document is the name of the file and the value is its content. However, this is quite impractical if your corpus is large, as disk reads and writes can become painfully slow.
In practice you are better off following the links I share in this comment: https://stackoverflow.com/a/11948318/863772
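For a small CSV, the one-file-per-line approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `split_csv_rows`, the `row_00000000.txt` naming scheme, and the `skip_header` flag are hypothetical choices, not part of Mahout, and the sketch inherits the disk I/O cost the answer warns about.

```python
import csv
from pathlib import Path


def split_csv_rows(csv_path, out_dir, skip_header=False):
    """Write each CSV row to its own file and return the paths created.

    Hypothetical helper: the output directory is meant to be passed to
    seqdirectory, which (per the answer above) treats each file as one document.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        if skip_header:
            next(reader, None)  # drop the header row, if the file has one
        for index, row in enumerate(reader):
            if not row:
                continue  # ignore blank lines
            # Assumed naming scheme; the file name serves as the document key.
            target = out_dir / f"row_{index:08d}.txt"
            with open(target, "w", newline="", encoding="utf-8") as dst:
                # Re-serialize the row so quoted fields stay intact.
                csv.writer(dst, lineterminator="\n").writerow(row)
            written.append(target)
    return written
```

After running it, the output directory can be given to seqdirectory and the result passed to seq2sparse with -nv, as in the question. For a large corpus this creates one small file per row, which is exactly the slow path the answer advises against.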