找出MySQL/PHP中最受欢迎的单词 [英] Find out most popular words in MySQL / PHP

查看:77
本文介绍了找出MySQL/PHP中最受欢迎的单词的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含将近100,000条注释的数据库,我想检测最常用的单词(使用停用词来避免常见单词).

I have a database with almost 100,000 comments and I would like to detect the most used words (using stop words to avoid common words).

我只想这样做一次,然后使用一些最流行的单词标记包含它们的评论.

I want to do this only one time, and then use a few of the most popular words to Tag the comments that contains them.

您可以通过查询和PHP代码来帮助我吗? 谢谢!

Can you help me with the Query and PHP code to do this? Thanks!

推荐答案

我猜最简单的方法是:

  • 创建两个新表:keywords(id,单词)和keywords_comments(keyword_id,comment_id,count)
    • keywords保存唯一的ID和您在文本中找到的关键字
    • keywords_comments为包含该关键字的每个注释之间的每个连接存储一行.在count中,您将保存此关键字在注释中出现的次数.两列keyword_id + comment_id共同构成唯一键或直接构成主键.
    • Create two new tables: keywords (id, word) and keywords_comments (keyword_id, comment_id, count)
      • keywords saves an unique id and the keyword you found in a text
      • keywords_comments stores one row for each connection between each comment that contains that keyword. In count you wil save the number of times this keyword occurred in the comment. The two columns keyword_id + comment_id together form a unique or directly the primary key.

      您有以下两条评论:

      你好,你好吗!!

      Hello, how are you?!

      哇,你好.我叫Stefan.

      Wow, hello. My name is Stefan.

      现在,您将遍历两个对象,并按非字符将其拆分.对于每个文本,这将导致以下小写单词: -第一段文字:你好,你好吗 -第二个文字:哇,你好,我的名字叫斯蒂芬

      Now you would iterate over both of them and split them by non-characters. This would result in the following lowercase words for each text: - First text: hello, how, are, you - Second text: wow, hello, my, name, is, stefan

      一旦解析了其中一个文本,就可以将其再次插入数据库.我想您不想将100.000条评论加载到RAM.

      As soon as you have parsed one of this text, you can already insert it into the database again. I guess you do not want to load 100.000 comments to RAM.

      所以会这样:

      • 解析第一个文本并在上方获得关键字
      • 将每个关键字(如果还不存在的话)写到Tabke keywords
      • 设置从关键字到注释(keywords_comments)的引用,并正确设置计数(在我们的示例中,每个单词在每个文本中仅出现一次,您必须对它进行计数).
      • 解析第二个文本
      • Parse first text an get the keywords above
      • Write each keyword into the tabke keywords if it is not there yet
      • Set a reference from the keyword to the comment (keywords_comments) and set the count correctly (in our example each word occurs only once in each text, you have to count that).
      • Parse second text

      您可能必须对100.000条注释使用一个非常简单的改进,即使用计数变量或在每个注释中添加一个新字段 has_been_analyzed .然后,您可以从数据库中逐条阅读它们.

      A very easy improvement you probably have to use for 100.000 comments, is to use a counting variable or add a new field has_been_analyzed to each comment. Then you can read them comment by comment from the database.

      当我分块读取数据时,我通常会使用计数变量,并且知道数据不会从我开始的方向改变(即,直到我当前为止,它将保持一致).然后我做类似的事情:

      I usually use counting variables when I read data chunkwise and know that the data cannot not change from the direction I am starting (i.e. it will stay consistent up to the point I currently am). Then I do something like:

      SELECT * FROM table ORDER BY created ASC LIMIT 0, 100
      SELECT * FROM table ORDER BY created ASC LIMIT 100, 100
      SELECT * FROM table ORDER BY created ASC LIMIT 200, 100
      …
      

      请考虑到,只有在我们确定在我们认为已经阅读过的地方没有要添加的日期时,这才起作用.例如.使用DESC将不起作用,因为可能插入了数据.然后,整个偏移量将被破坏,我们将阅读两次文章,而永远不会阅读新文章.

      Consider that this only works if we know for sure that there are no dates to be added at a place we think we already read. E.g. using DESC would not work, as there could be data inserted. Then the whole offset would break and we would read one article twice and never read the new article.

      如果不能确保外部计数变量保持一致,则可以在阅读注释后立即添加一个新字段 analyzed (已分析),将其设置为true.然后,您始终可以看到哪些注释已被阅读,哪些注释未被阅读.然后,SQL查询将如下所示:

      If you cannot make sure that the outside counting variable stays consistent, you can add a new field analyzed which you set to true as soon as you have read the comment. Then you can always see which comments have already been read and which not. An SQL query would then look like this:

      SELECT * FROM table WHERE analyzed = 0 LIMIT 100 /* Reading chunks of 100 */
      

      只要不并行处理工作负载(具有多个客户端或线程),此方法就起作用.否则,您必须确保读数+设置为true是原子的(同步的).

      This works as long as you do not parallelize the workload (with multiple clients or threads). Otherwise you would have to make sure that reading + setting true is atomar (synchronized).

      这篇关于找出MySQL/PHP中最受欢迎的单词的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆