Web page recommender system


Problem description

I am trying to build a recommender system that recommends web pages to a user based on their actions (Google searches, clicks, and explicit ratings of web pages). To give an idea, consider how Google News does it: it displays news articles from the web on a particular topic. In technical terms that is clustering, but my aim is similar. It will be content-based recommendation driven by the user's actions.

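So my questions are:

  1. How can I crawl the internet to find related web pages?
  2. Which algorithm should I use to extract data from a web page? Are textual analysis and word frequency the only way to do it?
  3. Lastly, which platform is best suited to this problem? I have heard of Apache Mahout, which comes with some reusable algorithms. Does it sound like a good fit?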

Recommended answer

As Thomas Jungblut said, one could write several books on your questions ;-) I will try to give you a list of brief pointers, but be aware that there is no ready-to-use, off-the-shelf solution.

  1. Crawling the internet: There are plenty of toolkits for doing this, such as Scrapy for Python, crawler4j and Heritrix for Java, or WWW::Robot for Perl. For extracting the actual content from web pages, have a look at boilerpipe.

http://scrapy.org/

http://crawler.archive.org/

http://code.google.com/p/crawler4j/

https://metacpan.org/module/WWW::Robot

http://code.google.com/p/boilerpipe/
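
To make the crawl-then-extract idea concrete, here is a minimal sketch in plain Python. It uses none of the toolkits above: `PageParser`, `parse_page`, and `crawl` are hypothetical names, the tag-based filter is only a crude stand-in for what boilerpipe does, and the `fetch` callable is an assumption that lets you plug in any downloader.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects outgoing links and visible text from one HTML page."""

    # Text inside these tags is ignored (a crude boilerplate filter).
    SKIP = {"script", "style", "nav", "footer", "header"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            # Links are collected everywhere, even inside skipped regions.
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Returns (absolute links, extracted text) for one page."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first crawl. `fetch(url)` must return HTML or None."""
    seen = {seed_url}
    queue = deque([seed_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        links, text = parse_page(html, url)
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

A real crawler also needs politeness delays, robots.txt handling, and error recovery, which is where a toolkit such as Scrapy earns its keep.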

  2. First of all, you can often use collaborative filtering instead of content-based approaches. But if you want good coverage, especially in the long tail, there is no way around analyzing the text. One thing to look at is topic modelling, e.g. latent Dirichlet allocation (LDA). Several LDA approaches are implemented in Mallet, Apache Mahout, and Vowpal Wabbit. For indexing, search, and text processing, have a look at Lucene. It is an awesome, mature piece of software.

http://mallet.cs.umass.edu/

http://mahout.apache.org/

http://hunch.net/~vw/

http://lucene.apache.org/
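
As a rough illustration of the text-analysis route, the sketch below scores unseen pages against a user profile using TF-IDF weights and cosine similarity. This is the simple word-frequency baseline, not LDA and not the API of any toolkit listed above. All function names, the tokenizer, and the feedback weights are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercases and splits on anything that is not a letter a-z."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """docs: page id -> text. Returns page id -> {term: TF-IDF weight}."""
    term_counts = {pid: Counter(tokenize(text)) for pid, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    return {
        pid: {
            term: tf * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, tf in counts.items()
        }
        for pid, counts in term_counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend_by_content(vectors, feedback, top_n=3):
    """feedback: page id -> weight (assumed scheme: a click counts 1.0,
    an explicit rating counts more). Returns unseen pages, best first."""
    profile = Counter()
    for pid, weight in feedback.items():
        for term, value in vectors[pid].items():
            profile[term] += weight * value
    scores = {
        pid: cosine(profile, vec)
        for pid, vec in vectors.items()
        if pid not in feedback
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Swapping the TF-IDF vectors for topic distributions from an LDA model would leave the profile-and-similarity logic unchanged, since only the page representation differs.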

  3. Besides Apache Mahout, which also contains things like LDA (see above), clustering, and text processing, there are other toolkits available if you want to focus on collaborative filtering: LensKit, which is also implemented in Java, and MyMediaLite (disclaimer: I am the main author), which is implemented in C# but also has a Java port.

http://lenskit.grouplens.org/

http://ismll.de/mymedialite

https://github.com/jcnewell/MyMediaLiteJava
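
For comparison with the content-based route, here is a minimal sketch of item-based collaborative filtering in plain Python. It does not reproduce the algorithms or APIs of Mahout, LensKit, or MyMediaLite. The function names and the assumption of positive numeric ratings are illustrative only.

```python
import math
from collections import defaultdict


def item_similarities(ratings):
    """ratings: user -> {page: rating}, ratings assumed positive.
    Returns page -> {other page: cosine similarity over co-rating users}."""
    item_users = defaultdict(dict)
    for user, pages in ratings.items():
        for page, rating in pages.items():
            item_users[page][user] = rating

    norms = {
        page: math.sqrt(sum(r * r for r in users.values()))
        for page, users in item_users.items()
    }
    sims = defaultdict(dict)
    pages = list(item_users)
    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            common = item_users[a].keys() & item_users[b].keys()
            if not common or not norms[a] or not norms[b]:
                continue
            dot = sum(item_users[a][u] * item_users[b][u] for u in common)
            sims[a][b] = sims[b][a] = dot / (norms[a] * norms[b])
    return sims


def recommend_collaborative(ratings, user, top_n=3):
    """Scores each unseen page by similarity to pages the user rated,
    weighted by the user's own ratings. Returns page ids, best first."""
    sims = item_similarities(ratings)
    seen = ratings[user]
    scores = defaultdict(float)
    for page, rating in seen.items():
        for other, sim in sims[page].items():
            if other not in seen:
                scores[other] += sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Note that a page nobody has rated never shows up in the results, which is the long-tail coverage gap that text analysis is meant to close.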
