Cosine similarity on large data sets

Problem description

I'm currently studying data mining and text comparison, and came across this: https://en.wikipedia.org/wiki/Cosine_similarity.

Since I have successfully implemented this algorithm to compare two strings, I decided to try a more complex task. I iterated over my DB, which contains about 250k documents, and compared one random document from the DB against every document in that DB.

Comparing all these items took 316.35898590088 seconds, that is, more than 5 minutes to compare one document against all 250k documents!

These results raised several issues, and I would like to ask for some suggestions. For clarity, I'll first describe some details that might be useful.

• PHP was chosen as the programming language.
• Documents are stored in MySQL.
• The cosine similarity implementation consists of only this function; there are no stop words or any other fancy things.
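
The asker's actual function is not shown in the post, so the following is only an assumed sketch of the brute-force approach described above, written in Python rather than PHP. It assumes plain word tokenization and raw term counts, with no stop-word handling; the names term_frequencies, cosine_similarity and rank_against_all are illustrative.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Split text into lowercase word tokens and count them."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_against_all(query_text, documents):
    """Brute force: score one document against every document in the set."""
    query_tf = term_frequencies(query_text)
    scores = [
        (doc_id, cosine_similarity(query_tf, term_frequencies(text)))
        for doc_id, text in documents.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Note that rank_against_all re-tokenizes every stored document on each query, which is probably where much of the time goes in a setup like the one described.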

Questions

• Is there any way to achieve better performance? Where should I start: by tuning the algorithm (e.g. preparing the vectors in advance), by using other technologies, or something else?
• How and where should I store these comparison results? For example, I want to plot some graphs in which I can see all 250k documents by similarity score, so that I can identify which are most similar, and so on.

Solution

Both PHP and MySQL are about the worst choices you could have made.

Efficient cosine similarity is at the heart of Lucene. The key acceleration technique is compressed inverted indexes. But you really don't want to reimplement them in PHP...
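
To illustrate why an inverted index helps, here is a minimal Python sketch. It is not Lucene's implementation: the index is held in memory and is not compressed, the weights are raw term counts scaled to unit length, and build_index and search are hypothetical names. The idea is that a query only touches documents that share at least one term with it, and all vector norms are computed once at indexing time.

```python
import heapq
import math
import re
from collections import Counter, defaultdict


def build_index(documents):
    """Build an inverted index: term -> list of (doc_id, unit-length weight)."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        tf = Counter(re.findall(r"\w+", text.lower()))
        norm = math.sqrt(sum(c * c for c in tf.values()))
        for term, count in tf.items():
            # Pre-normalized weights mean no norms are needed at query time.
            index[term].append((doc_id, count / norm))
    return index


def search(index, query_text, top_k=10):
    """Return the top_k (doc_id, cosine score) pairs for the query text."""
    tf = Counter(re.findall(r"\w+", query_text.lower()))
    norm = math.sqrt(sum(c * c for c in tf.values()))
    scores = defaultdict(float)
    for term, count in tf.items():
        # Only the posting lists of the query's own terms are visited.
        for doc_id, weight in index.get(term, ()):
            scores[doc_id] += (count / norm) * weight
    return heapq.nlargest(top_k, scores.items(), key=lambda pair: pair[1])


docs = {1: "apple banana", 2: "banana cherry", 3: "dog"}
index = build_index(docs)
results = search(index, "apple banana")
```

In this example document 3 never appears in results because it shares no term with the query. Keeping only the top_k matches per document, instead of every pairwise score, is one possible way to keep the stored results manageable.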
