Get the most repeated similar fields in a MySQL database


Question

Let's assume we have a database like this:

Actions_tbl:

---------------------------------------------------------
id | Action_name                              | user_id
---------------------------------------------------------
1  | John reads one book                      | 1
2  | reading the book by john                 | 1
3  | Joe is jumping over fire                 | 2
4  | reading another book                     | 2
5  | John reads the book in library           | 1
6  | Joe read a book                          | 2
7  | read a book                              | 3
8  | jumping with no reason is Ronald's habit | 3
---------------------------------------------------------


Users_tbl:

---------------------
user_id | user_name
---------------------
1       | John
2       | Joe
3       | Ronald
4       | Araz
---------------------


I'm wondering whether I can select the most repeated similar action, regardless of its user, and replace the user name in it with my own user name.

"reads one book", "reading the book", "reading another book", "reads the book in library", "read a book" and "read a book" are the ones with the most words in common, so the action about reading a book is repeated 6 times. My system should show one of those six sentences at random, with Araz in place of the user name.

For example: Araz reads the book

My idea was to run

select replace(a.Action_name, b.user_name, 'Araz') from Actions_tbl a, Users_tbl b where a.user_id = b.user_id group by ...

and then use levenshtein() on the results.

But this approach performs very poorly!

Suppose I want to do the same thing on a large database and across several different tables. This would bring my server down!

Is there a better idea?

At http://www.artfulsoftware.com/infotree/queries.php#552 the levenshtein() function is implemented as a MySQL function. But first, do you think it performs well enough? And second, how would I use it in my case? Maybe a self-join can fix this issue, but I'm not that good with SQL!

* Similar actions are actions that have more than X% of their words in common.

**More information and notes:**

  1. I'm limited to PHP and MySQL.

This is just an example; in my real project the actions are long paragraphs, which is why performance matters. The real scenario is this: users have entered descriptions for several of their projects, and those descriptions may be very similar (users tend to work in the same area). I want to fill in the description of the next project automatically, based on the previous entries, to save time.

I would appreciate any pragmatic solution. I looked at NLP-related solutions; although they are interesting, I don't think many of them are accurate enough or can be implemented in PHP.

The output should make sense and be a proper paragraph like those of all the other projects. That's why I was thinking of choosing from the previous ones.


Thank you for your thoughtful answers; I would appreciate it if you could shed some light on the situation.

Recommended answer

What you are describing is a text clustering process: you are trying to find similar pieces of text and then arbitrarily choose one of them. I am not familiar with any database that does this form of text mining.

For what you describe, a pretty basic text mining technique would probably work. Create a term-document matrix with all the words except the user names. Then use singular value decomposition to get the largest singular value and its vector (this is the first principal component of the correlation matrix). Similar activities should cluster along this line.
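As a rough illustration of the term-document idea above, here is a minimal Python sketch (the asker is limited to PHP and MySQL, so treat this as language-neutral pseudocode that happens to run). It builds a count matrix from the sample actions, drops the user names, and approximates the leading singular vector with plain power iteration instead of a library SVD routine. The names `tokenize`, `term_document_matrix`, `first_singular_vector` and `document_scores` are hypothetical helpers, and the matrix is not centered or scaled, which is an assumption of this sketch.

```python
import math
import re

# Assumption: lowercased user names taken from Users_tbl, removed before counting.
USER_NAMES = {"john", "joe", "ronald", "araz"}

# Sample rows copied from Actions_tbl, in id order.
ACTIONS = [
    "John reads one book",
    "reading the book by john",
    "Joe is jumping over fire",
    "reading another book",
    "John reads the book in library",
    "Joe read a book",
    "read a book",
    "jumping with no reason is Ronald's habit",
]


def tokenize(text):
    """Lowercase the text, split it into words, and drop user names."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in USER_NAMES]


def term_document_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0.0] * len(vocab)
        for w in tokenize(d):
            row[index[w]] += 1.0
        rows.append(row)
    return vocab, rows


def first_singular_vector(rows, iterations=200):
    """Approximate the leading right singular vector of A by power iteration on A^T A."""
    n_docs, n_terms = len(rows), len(rows[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms
    for _ in range(iterations):
        av = [sum(a * x for a, x in zip(row, v)) for row in rows]  # A v
        atav = [sum(rows[i][j] * av[i] for i in range(n_docs)) for j in range(n_terms)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in atav))
        v = [x / norm for x in atav]
    return v


def document_scores(docs):
    """Project every document onto the leading singular direction."""
    _, rows = term_document_matrix(docs)
    v = first_singular_vector(rows)
    return [sum(a * x for a, x in zip(row, v)) for row in rows]


SCORES = document_scores(ACTIONS)
for text, score in zip(ACTIONS, SCORES):
    print(f"{score:8.4f}  {text}")
```

On this sample the book-related actions get clearly positive scores while the two "jumping" actions score essentially zero, because they share no words with the dominant group. For long paragraphs, stop-word removal and stemming would probably be needed before the scores become meaningful.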

If you have a limited vocabulary and have the terms in a table, you could measure the distance between two actions by the proportion of words that overlap. Do you have a list of all the words in the actions?
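The word-overlap measure can be sketched as follows, again in Python purely for illustration. The proportion used here is the Jaccard ratio (shared words divided by all distinct words in the two actions), which is one possible reading of "proportion of words that overlap", not necessarily the one the answer had in mind. The helpers `word_set`, `overlap` and `most_repeated` and the threshold value are assumptions of this sketch.

```python
import re

# Sample rows copied from Actions_tbl (id -> Action_name).
ACTIONS = {
    1: "John reads one book",
    2: "reading the book by john",
    3: "Joe is jumping over fire",
    4: "reading another book",
    5: "John reads the book in library",
    6: "Joe read a book",
    7: "read a book",
    8: "jumping with no reason is Ronald's habit",
}

# Assumption: lowercased user names from Users_tbl, ignored when comparing.
USER_NAMES = frozenset({"john", "joe", "ronald", "araz"})


def word_set(text, excluded=USER_NAMES):
    """Set of lowercased words in the text, minus the excluded names."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in excluded}


def overlap(a, b):
    """Jaccard ratio: shared words divided by all distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_repeated(actions, threshold):
    """Ids of the largest group of actions whose overlap with one action meets the threshold."""
    bags = {aid: word_set(text) for aid, text in actions.items()}
    best = []
    for bag in bags.values():
        group = [other for other, obag in bags.items() if overlap(bag, obag) >= threshold]
        if len(group) > len(best):
            best = group
    return sorted(best)


print(most_repeated(ACTIONS, 0.1))
```

With exact word matching, "read", "reads" and "reading" count as different words, so the six book-related actions are linked mostly through "book" and only group together at a low threshold such as 0.1. The pairwise comparison is quadratic in the number of actions, so for a large table the word sets would likely need to be precomputed and stored, for example in a terms table as the answer suggests.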
