Mapping the link network between blogs using R?


Question

I would like any advice on how to create and visualize a link map between blogs, so as to reflect the "social network" between them.

This is what I have in mind:

  1. Start with one (or several) blog home pages and collect all the links on those pages.
  2. Remove all the links that are internal links (that is, if I start from www.website.com, remove all links of the form "www.website.com/***"), but store all the external links.
  3. Go to each of those links (assuming they haven't been visited already) and repeat step 1.
  4. Continue until (say) X hops out from the first page.
  5. Plot the data collected.

I imagine that in order to do this in R, one would use RCurl/XML (Thanks Shane for your answer here), combined with something like igraph.

But since I don't have experience with either of them, is there someone here who might be willing to correct me if I have missed any important step, or share any useful snippet of code that would help with this task?
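
To make step 5 concrete, here is a minimal sketch of what the igraph side might look like, assuming the crawl has already produced a table of from/to URL pairs (the links data frame below is made-up example data, not output from any real crawl):

# a minimal sketch of step 5: plotting a link network with igraph,
# assuming a from/to edge list has already been collected
# (the 'links' data frame here is made-up example data)
library(igraph)

links <- data.frame(
  from = c("blog-a.com", "blog-a.com", "blog-b.com"),
  to   = c("blog-b.com", "blog-c.com", "blog-c.com"),
  stringsAsFactors = FALSE
)

g <- graph.data.frame(links, directed = TRUE)  # one vertex per blog, one edge per link
plot(g, vertex.label.cex = 0.8, edge.arrow.size = 0.5)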

p.s.: My motivation for this question is that in a week I am giving a talk at useR 2010 on "blogging and R", and I thought this might be a nice way both to give the audience something fun and to motivate them to do something like this themselves.

Many thanks!

Tal

Answer

NB: This example is a very BASIC way of getting the links and therefore would need to be tweaked in order to be more robust. :)

I don't know how useful this code is, but hopefully it will give you an idea of the direction to go in (just copy and paste it into R; it's a self-contained example once you've installed the packages RCurl and XML):

library(RCurl)
library(XML)

# fetch a page and return all the href values found in its <a> tags
get.links.on.page <- function(u) {
  doc <- getURL(u)
  html <- htmlTreeParse(doc, useInternalNodes = TRUE)
  nodes <- getNodeSet(html, "//html//body//a[@href]")
  urls <- sapply(nodes, xmlGetAttr, name = "href")  # take the href attribute specifically, not just the first attribute
  urls <- sort(urls)
  return(urls)
}

# a naive way of doing it. Python has 'urlparse', which is supposed to be rather good at this
get.root.domain <- function(u) {
  root <- unlist(strsplit(u, "/"))[3]
  return(root)
}
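
# (note: a slightly more robust alternative, assuming the httr package is
#  installed, would be something along the lines of httr::parse_url(u)$hostname;
#  this is an untested sketch and not part of the original answer)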

# a naive method to filter out duplicated, invalid and self-referencing urls.
filter.links <- function(seed, urls) {
  urls <- unique(urls)
  urls <- urls[which(substr(urls, start = 1, stop = 1) == "h")]
  urls <- urls[grep("http", urls, fixed = TRUE)]
  seed.root <- get.root.domain(seed)
  self.ref <- grep(seed.root, urls, fixed = TRUE)
  if (length(self.ref) > 0) urls <- urls[-self.ref]  # drop links back to the seed's own domain
  return(urls)
}

# crawl a single page: collect its links and keep only the external ones
main.fn <- function(seed) {
  raw.urls <- get.links.on.page(seed)
  filtered.urls <- filter.links(seed, raw.urls)
  return(filtered.urls)
}

### example  ###
seed <- "http://www.r-bloggers.com/blogs-list/"
urls <- main.fn(seed)

# crawl first 3 links and get urls for each, put in a list 
x <- lapply(as.list(urls[1:3]), main.fn)
names(x) <- urls[1:3]
x

If you copy and paste it into R, and then look at x, I think it'll make sense.
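
The example above only goes one hop out from the seed page. Below is a rough sketch of how you might iterate it to a fixed number of hops (steps 3 and 4 in the question) and accumulate a from/to edge list; the crawl() function and max.depth argument are names I've made up here, there is only very light error handling, and there is no politeness delay between requests:

# a rough sketch of a depth-limited crawl built on main.fn() above;
# 'crawl' and 'max.depth' are made-up names, with only minimal error
# handling and no politeness delay between requests
crawl <- function(seed, max.depth = 2) {
  edges <- data.frame(from = character(0), to = character(0),
                      stringsAsFactors = FALSE)
  visited <- character(0)
  frontier <- seed
  for (depth in seq_len(max.depth)) {
    next.frontier <- character(0)
    for (u in frontier) {
      if (u %in% visited) next
      visited <- c(visited, u)
      out <- tryCatch(main.fn(u), error = function(e) character(0))
      if (length(out) > 0) {
        edges <- rbind(edges,
                       data.frame(from = u, to = out, stringsAsFactors = FALSE))
      }
      next.frontier <- c(next.frontier, out)
    }
    frontier <- setdiff(unique(next.frontier), visited)
  }
  return(edges)
}

# e.g. two hops out from the seed, then plotted with igraph as in the
# sketch in the question above:
# edges <- crawl(seed, max.depth = 2)
# g <- graph.data.frame(edges, directed = TRUE)
# plot(g, vertex.label.cex = 0.8, edge.arrow.size = 0.5)

For anything beyond a handful of blogs you would also want to add a Sys.sleep() between requests and cap the total number of pages fetched.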

Either way, good luck mate! Tony Breyal
