Using SQL to determine word count stats of a text field

Question

I've recently been working on some database search functionality and wanted to get some statistics, such as the average number of words per document (where a document is a text field in the database). The only approach I have found so far (without processing the data outside the DB in a language of choice) is:

SELECT AVG(LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1)
FROM documents

This seems to work*, but do you have other suggestions? I'm currently using MySQL 4 (and hope to move this app to version 5 soon), but I'm also interested in general solutions.

Thanks!

* I can imagine that this is a pretty rough way to determine the count, since it does not account for HTML in the content and the like. That's OK for this particular project, but again, are there better ways?
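To make the roughness concrete, here is a minimal Python sketch (not part of the original question) that mirrors the arithmetic of the query above and compares it with counting runs of letters and digits. The function names and sample strings are hypothetical, and the "word" definition (ASCII letters and digits only) is an assumption chosen for illustration.

```python
import re


def space_based_count(text: str) -> int:
    """Mirror LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1.

    This is simply the number of space characters plus one.
    """
    return len(text) - len(text.replace(" ", "")) + 1


def run_based_count(text: str) -> int:
    """Count maximal runs of ASCII letters and digits (an assumed word definition)."""
    return len(re.findall(r"[A-Za-z0-9]+", text))


# Hypothetical sample documents, chosen to show where the two counts diverge.
samples = ["one two three", "one  two   three", "", "<p>Hello world</p>"]
for sample in samples:
    print(repr(sample), space_based_count(sample), run_based_count(sample))
```

The two counts agree on single-spaced text but diverge on repeated spaces and on empty strings, where the space-based formula still reports one word. Neither handles HTML: the run-based count treats tag names as words, so markup would have to be stripped first.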

Update: To define what I mean by "better": more accurate, more efficient, or more "correct" (easy to maintain, good practice, etc.). For the content I have available, the query above is fast enough and accurate enough for this project, but I may need something similar in the future (hence the question).

Recommended answer

The text handling capabilities of MySQL aren't good enough for what you want. A stored function is an option, but will probably be slow. Your best bet for processing the data within MySQL is to add a user-defined function (UDF). If you're going to build a newer version of MySQL anyway, you could also add a native function.

The "correct" way is to process the data outside the DB, since DBs are for storage, not processing, and any heavy processing might put too much load on the DBMS. Additionally, calculating the word count outside of MySQL makes it easier to change the definition of what counts as a word. How about storing the word count in the DB and updating it whenever a document is changed?
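The store-and-update idea could look roughly like the following Python sketch. It is only an illustration under stated assumptions: it uses the standard-library sqlite3 module with an in-memory database as a stand-in for MySQL, and the word_count column, the helper names, and the word definition (runs of ASCII letters and digits) are all hypothetical.

```python
import re
import sqlite3

# Assumed definition of a word; changing it only requires editing this pattern.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def count_words(text):
    """Count words in application code rather than in the database."""
    return len(WORD_RE.findall(text or ""))


def save_document(conn, doc_id, content):
    """Insert or overwrite a document, refreshing its stored word count."""
    # INSERT OR REPLACE is SQLite syntax; MySQL spells its upsert differently.
    conn.execute(
        "INSERT OR REPLACE INTO documents (id, content, word_count) "
        "VALUES (?, ?, ?)",
        (doc_id, content, count_words(content)),
    )
    conn.commit()


def average_word_count(conn):
    """Average over the precomputed column; no string processing in SQL."""
    return conn.execute("SELECT AVG(word_count) FROM documents").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, content TEXT, word_count INTEGER NOT NULL)"
)
save_document(conn, 1, "one two three")
save_document(conn, 2, "four five")
save_document(conn, 2, "four five six seven eight")  # update replaces the count
print(average_word_count(conn))
```

With the count stored at write time, the statistics query reduces to a plain aggregate over an integer column, and the word definition lives in one place in the application.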

Example stored function:

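-- Stored functions require MySQL 5.0 or later.
DELIMITER $$
CREATE FUNCTION wordcount(str LONGTEXT)
       RETURNS INT
       DETERMINISTIC
       SQL SECURITY INVOKER
       NO SQL
  BEGIN
    DECLARE wordCnt, idx, maxIdx INT DEFAULT 0;
    -- Flags: is the current / previous character alphanumeric?
    DECLARE currChar, prevChar BOOL DEFAULT 0;
    SET maxIdx = CHAR_LENGTH(str);
    SET idx = 1;
    -- Scan one character at a time; a word starts wherever an
    -- alphanumeric character follows a non-alphanumeric one.
    WHILE idx <= maxIdx DO
        SET currChar = SUBSTRING(str, idx, 1) RLIKE '[[:alnum:]]';
        IF NOT prevChar AND currChar THEN
            SET wordCnt = wordCnt + 1;
        END IF;
        SET prevChar = currChar;
        SET idx = idx + 1;
    END WHILE;
    RETURN wordCnt;
  END
$$
DELIMITER ;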
