Search engine using Python for bookmarked sites


Question

My idea is to build a search engine on top of my bookmarks file, which I have in CSV format.

The motivation is that I have a large number of bookmarks for educational resources, and I want to be able to search them and find related content for a particular topic or subject.

I am not a very good programmer (I can write simple programs in C++ and Java) and have recently started learning Python.

Is it possible to implement such a project in one month?

I have searched and found that Python has a csv module, and the only idea I have so far comes from the Udacity CS101 course on building a search engine with Python.

My question is whether this is possible, and where to start.

Recommended answer

I implemented a search engine in both Perl and Python at work. The first one was put together in great haste for a production problem and took 2 hours to build, from concept to run. I would like to open-source the final version, but I am not sure where to start since the work was done for an employer. Anyhow, here's the algorithm:

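import re

st = {}  # dictionary used as the search engine tree
for bokm in bookmarks:  # bookmarks: an iterable of bookmark strings
    bokm = re.sub(r'[\W_]+', ' ', bokm).upper()  # replace junk chars with spaces, normalize case
    ct = st  # cursor for traversing and building our tree
    for c in bokm.split():
        ct = ct.setdefault(c, {})  # create the child node if it is missing, then descend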

At this point you have a dict-tree built from your bookmarks. As written, the loop splits each bookmark on whitespace, so each level of the tree is keyed by a whole word; iterate over the string itself instead of bokm.split() if you want one character per level. The tree only finds matches from the beginning of a bookmark; you can modify the algorithm to index each bookmark starting from every word instead. Make sure to run pprint.pprint(st) to see the beauty of it for yourself.
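One way to make matches start at any word, as suggested above, is to insert every word-suffix of each bookmark into the tree. This is only a minimal sketch, assuming bookmarks are plain title strings; normalize, build_suffix_tree, and matches are hypothetical names used for illustration, and the sample titles are made up:

```python
import re

def normalize(text):
    """Replace runs of non-alphanumeric characters with a space and upper-case."""
    return re.sub(r'[\W_]+', ' ', text).upper()

def build_suffix_tree(bookmarks):
    """Insert every word-suffix of each bookmark so a query can start at any word."""
    tree = {}
    for bookmark in bookmarks:
        words = normalize(bookmark).split()
        for start in range(len(words)):
            node = tree
            for word in words[start:]:
                node = node.setdefault(word, {})
    return tree

def matches(query, tree):
    """Return True if the query's words appear consecutively in some bookmark."""
    node = tree
    for word in normalize(query).split():
        if word not in node:
            return False
        node = node[word]
    return True

# Sample data is made up for illustration.
tree = build_suffix_tree(["Intro to Linear Algebra", "Python csv module docs"])
print(matches("linear algebra", tree))  # phrase that starts mid-title
print(matches("algebra linear", tree))  # same words, wrong order
```

The trade-off is size: a bookmark with n words adds up to n(n+1)/2 nodes, which is usually acceptable for short bookmark titles.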

So let's say you are now searching and have typed the word "dog":

def search(word, st):
    word = re.sub(r'[\W_]+', ' ', word).upper()  # pass the query through the same filter!
    ct = st  # init our cursor
    for c in word.split():
        try:
            ct = ct[c]  # traverse the tree
        except KeyError:
            return False  # pattern diverged, no match
    return True  # ran out of query tokens and every one matched: found a match!

You can pretty much plug this in and start using it. It does not return WHICH bookmark matched; for that you need to record the bookmark at the end of its branch in the search tree, then recursively traverse the subtree below the last matched key and print every bookmark stored there.
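The recording and subtree traversal described above could look roughly like the following. This is a sketch under stated assumptions, not the original production code: the CSV is assumed to have "title" and "url" columns, the END sentinel key and the helper names add_bookmark and collect are hypothetical, and the sample rows are made up.

```python
import csv
import io
import re

END = None  # hypothetical sentinel key; word keys are always strings, so None cannot clash

def normalize(text):
    """Replace runs of non-alphanumeric characters with a space and upper-case."""
    return re.sub(r'[\W_]+', ' ', text).upper()

def add_bookmark(tree, title, url):
    """Walk/create the branch for this title and record the bookmark at its end."""
    node = tree
    for word in normalize(title).split():
        node = node.setdefault(word, {})
    node.setdefault(END, []).append((title, url))

def collect(node):
    """Recursively gather every bookmark stored at or below this node."""
    found = list(node.get(END, []))
    for key, child in node.items():
        if key is not END:
            found.extend(collect(child))
    return found

def search(query, tree):
    """Return all bookmarks whose title starts with the query's words."""
    node = tree
    for word in normalize(query).split():
        if word not in node:
            return []  # pattern diverged, no match
        node = node[word]
    return collect(node)

# Made-up CSV content with assumed "title" and "url" columns.
sample = io.StringIO(
    "title,url\n"
    "Python Tutorial,https://example.com/a\n"
    "Python Tutorial for Beginners,https://example.com/b\n"
    "Java Basics,https://example.com/c\n"
)
tree = {}
for row in csv.DictReader(sample):
    add_bookmark(tree, row["title"], row["url"])

for title, url in search("python tutorial", tree):
    print(title, url)
```

With a real bookmarks file, the io.StringIO object would be replaced by a file opened with open(path, newline=""), and the column names adjusted to whatever the export actually contains.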

PS: There are many possible ways to implement word search. The beauty of this method is that the lookup cost depends on the length of the query rather than on the size of your bookmarks file, so it finds matches almost instantly. The second benefit is that search() can be modified to show results as you type, with each key press, provided the tree is keyed character by character: each key press moves the cursor one level down the tree.
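To illustrate the search-as-you-type idea, here is a rough sketch of a character-keyed variant of the tree with a cursor that advances on each key press. The TypeAhead class, build_char_tree, and the sample titles are all hypothetical; the sketch also assumes each key press delivers a single character and does not collapse runs of punctuation the way the bookmark filter does.

```python
import re

def normalize(text):
    """Replace runs of non-alphanumeric characters with a space and upper-case."""
    return re.sub(r'[\W_]+', ' ', text).upper().strip()

def build_char_tree(bookmarks):
    """Build a tree with one character per level instead of one word per level."""
    tree = {}
    for bookmark in bookmarks:
        node = tree
        for ch in normalize(bookmark):
            node = node.setdefault(ch, {})
    return tree

class TypeAhead:
    """Keeps a cursor in the tree and advances it one key press at a time."""

    def __init__(self, tree):
        self.node = tree  # becomes None once the typed prefix stops matching

    def press(self, key):
        """Advance by one typed character; return True while a match is still possible."""
        ch = re.sub(r'[\W_]', ' ', key).upper()
        if self.node is not None:
            self.node = self.node.get(ch)
        return self.node is not None

# Sample titles are made up for illustration.
tree = build_char_tree(["Dog training", "Docker docs"])
session = TypeAhead(tree)
for key in "dog":
    print(key, session.press(key))
```

Each call to press() does a single dictionary lookup, which is why the per-keystroke cost stays constant no matter how many bookmarks are indexed.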
