PHP word index, performance and reasonable results


Question

I'm currently working on an indexer for a search feature. The indexer will work over data from "fields". Fields look like:

  Field_id   Field_type   Field_name   Field_Data
- 101        text         Name         Intel i7
- 102        integer      Cores        4 physical, 4 virtual
- 103        select       Vendor       Intel
- 104        multitext    Description  The i7 is intel's next gen range of cpus.

The indexer would generate the following results/index:

  Keyword    Occurrences
- intel      101, 103, 104
- i7         101, 104
- physical   102
- virtual    102
- next       104
- gen        104
- range      104
- cpus       104   (*)
- cpu        104   (*)
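For illustration, the keyword/occurrences table above could be produced along these lines. This is a minimal sketch in Python (chosen for brevity; the same logic ports to PHP), and the stopword list, the tokenizer regex, the possessive stripping, and the names `tokenize` and `build_index` are all assumptions rather than anything from the original indexer. It does not generate the singular "cpu" entry marked (*).

```python
import re
from collections import defaultdict

# Assumed stopword list, taken from the words the question says were dropped.
STOPWORDS = {"the", "is", "of"}

def tokenize(text):
    """Lowercase, split into words, strip possessive 's, drop stopwords and bare numbers."""
    tokens = []
    for raw in re.findall(r"[a-z0-9']+", text.lower()):
        word = raw[:-2] if raw.endswith("'s") else raw
        word = word.strip("'")
        if word and not word.isdigit() and word not in STOPWORDS:
            tokens.append(word)
    return tokens

def build_index(fields):
    """Map each keyword to the sorted list of field IDs it occurs in."""
    index = defaultdict(set)
    for field_id, data in fields:
        for word in tokenize(data):
            index[word].add(field_id)
    return {word: sorted(ids) for word, ids in index.items()}

# The sample rows from the fields table above (Field_id, Field_Data).
FIELDS = [
    (101, "Intel i7"),
    (102, "4 physical, 4 virtual"),
    (103, "Intel"),
    (104, "The i7 is intel's next gen range of cpus."),
]

index = build_index(FIELDS)
```

Looking up `index["intel"]` then gives the field IDs where the word occurs, which is the same shape as the Occurrences column.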

So far it all looks nice and fine; however, there are some issues I'd like to sort out:


  • Filtering out common words (as you perhaps noticed, "the", "is", "of" and "intel's" are missing from the list)
  • With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both, or exact (i.e., "cpus" is different from "cpu")?
  • Continuing with the previous item, how can I determine a plural (different flavors: test=>tests, fish=>fish and leaf=>leaves)?
  • I'm currently using MySQL and I'm very concerned about performance; we have 500+ categories and we haven't even launched the site
  • Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name); do you think there would be a huge impact on the SQL server?
  • Search throttling; I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
  • There are other issues which I probably forgot about; if you spot any, you're welcome to yell at me ;-)
  • I do not need the search engine to crawl links; in fact, I specifically want it to not crawl links.

(By the way, I'm not biased towards Intel; it just so happens that I own an i7-based PC ;-))

Recommended answer

This is in response to your original question, and your later answer/question (https://stackoverflow.com/questions/3315910/php-word-index-performance-and-reasonable-results#answer-3316529).

I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.

I'm sure there are other ways to do this, either with your own custom code or with other search engines; Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.

I recommend reading Build a custom search engine with PHP before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.

In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:

Filtering out common words (as you perhaps noticed, "the", "is", "of" and "intel's" are missing from the list)

11.2.8. Stopwords

Stopwords are the words that will not be indexed. Typically you'd put most frequent words in the stopwords list because they do not add much value to search results but consume a lot of resources to process.
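As a rough illustration of "put the most frequent words in the stopwords list", the sketch below counts word frequencies over a set of documents and returns the top candidates. It is written in Python for brevity and is not part of Sphinx; the function name `frequent_words` and the tokenizer regex are assumptions, and the resulting list would still need a manual review before being used as a stopword file.

```python
import re
from collections import Counter

def frequent_words(documents, top_n):
    """Return the top_n most frequent words across all documents (stopword candidates)."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical sample data, only to show the call.
docs = ["the cpu is fast", "the gpu is the best", "the end"]
candidates = frequent_words(docs, 2)
```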

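With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both, or exact (i.e., "cpus" is different from "cpu")?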

11.2.9. Wordforms

Word forms are applied after tokenizing the incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.
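The idea in the quote can be sketched as a lookup table that is consulted before any stemmer. This is an illustrative Python sketch, not Sphinx's implementation; the `WORDFORMS` entries, the `normalize` function, and the optional `stem` callback are assumptions made for the example.

```python
# Assumed word-forms table: each variant maps to its normal form.
WORDFORMS = {
    "walks": "walk",
    "walked": "walk",
    "walking": "walk",
    "leaves": "leaf",  # an irregular plural a simple stemmer would get wrong
}

def normalize(token, wordforms, stem=None):
    """Use the forms table first; fall back to the stemmer only for unlisted words."""
    if token in wordforms:
        return wordforms[token]  # listed words bypass stemming entirely
    return stem(token) if stem else token

# A deliberately naive stand-in stemmer, only to show the exception behaviour.
def strip_s(token):
    return token[:-1] if token.endswith("s") else token
```

With this, `normalize("leaves", WORDFORMS, strip_s)` yields "leaf" from the table, while an unlisted word such as "tests" goes through the stemmer.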

Continuing with the previous item, how can I determine a plural (different flavors: test=>tests, fish=>fish and leaf=>leaves)?

Sphinx supports the Porter Stemming Algorithm:

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
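To give a feel for how this handles plurals, here is step 1a of the Porter algorithm only, sketched in Python. It is not the full stemmer (the later steps are omitted), and the function name `porter_step_1a` is made up for this example.

```python
def porter_step_1a(word):
    """Porter step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (removed)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word
```

Regular plurals such as "tests" and "cpus" reduce to "test" and "cpu", and "fish" is left alone. "leaves" becomes "leave" at this step rather than "leaf", which is the kind of case a word-forms exception is meant for.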

Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?

3.2. Attributes

A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (ie. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.

This can be achieved by specifying all the mentioned columns (excluding title and content, that are full-text fields) as attributes, indexing them, and then using API calls to setup filtering, sorting, and grouping.
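Conceptually, attributes are values stored alongside each matched document that can be filtered and grouped on without touching the full-text fields. The Python sketch below imitates that on an in-memory list; it does not call Sphinx, and the sample posts and the helpers `filter_by` and `count_by_month` are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical full-text matches, each carrying its attribute values.
MATCHES = [
    {"id": 1, "author_id": 7, "forum_id": 2, "post_date": date(2010, 6, 3)},
    {"id": 2, "author_id": 7, "forum_id": 5, "post_date": date(2010, 7, 19)},
    {"id": 3, "author_id": 9, "forum_id": 2, "post_date": date(2010, 7, 21)},
]

def filter_by(matches, **attrs):
    """Keep only matches whose attributes equal all the given values."""
    return [m for m in matches if all(m[k] == v for k, v in attrs.items())]

def count_by_month(matches):
    """Group matches by the month of post_date and count each group."""
    return Counter(m["post_date"].strftime("%Y-%m") for m in matches)

by_author = filter_by(MATCHES, author_id=7)
per_month = count_by_month(MATCHES)
```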

You can also use the 5.3. Extended query syntax to search specific fields (as opposed to filtering results by attributes):


field search operator: @vendor intel
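A user-facing term such as "vendor:intel" could be rewritten into that operator before the query is sent. The Python sketch below assumes a whitelist of field names (`ALLOWED_FIELDS`) and a helper called `to_extended_query`, neither of which comes from the original post. Because the field operator limits the keywords that follow it, unscoped terms are emitted first. Escaping of special query characters in user input is not handled here.

```python
# Assumed whitelist of searchable field names.
ALLOWED_FIELDS = {"name", "cores", "vendor", "description"}

def to_extended_query(user_query):
    """Rewrite 'field:value' terms as '@field value'; leave other terms unchanged."""
    plain, scoped = [], []
    for term in user_query.split():
        field, sep, value = term.partition(":")
        if sep and value and field.lower() in ALLOWED_FIELDS:
            scoped.append(f"@{field.lower()} {value}")
        else:
            plain.append(term)
    # Unscoped terms go first so no field operator applies to them.
    return " ".join(plain + scoped)
```

For example, "vendor:intel i7" becomes "i7 @vendor intel", while a term with an unknown prefix is passed through as an ordinary keyword.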

How does the search engine index a set of fields and bind the found phrases/keywords/etc. to a specific field ID?

8.6.1. Query

On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics. The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:

"matches":
Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).

"total":
Total amount of matches retrieved on server (ie. to the server side result set) by this query. You can retrieve up to this amount of matches from server for this query text with current query settings.

"total_found":
Total amount of matching documents in index (that were found and processed on server).

"words":
Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").

"error":
Query error message reported by searchd (string, human readable). Empty if there were no errors.

"warning":
Query warning message reported by searchd (string, human readable). Empty if there were no warnings.
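If each field row is indexed as its own document, the document IDs in "matches" are the field IDs you are after. The Python sketch below walks a result set shaped like the hash described above; the sample values are invented for illustration, the dict mirrors the PHP layout (keyed by document ID), and `ranked_matches` is a hypothetical helper rather than part of the Sphinx API.

```python
def ranked_matches(result):
    """Return (document ID, weight) pairs, highest weight first; raise on a query error."""
    if result.get("error"):
        raise RuntimeError(result["error"])
    matches = result.get("matches", {})
    pairs = ((doc_id, info["weight"]) for doc_id, info in matches.items())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Invented sample result, shaped after the keys listed above.
SAMPLE_RESULT = {
    "matches": {
        101: {"weight": 2, "attrs": {"field_type": "text"}},
        103: {"weight": 3, "attrs": {"field_type": "select"}},
    },
    "total": 2,
    "total_found": 2,
    "words": {"intel": {"docs": 2, "hits": 2}},
    "error": "",
    "warning": "",
}

ranked = ranked_matches(SAMPLE_RESULT)
```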

See also Listing 11 and Listing 13 from Build a custom search engine with PHP.
