Exercise: Web Crawler - concurrency not working


Problem description

I am going through the golang tour and working on the final exercise, changing a web crawler to crawl in parallel without repeating a crawl (http://tour.golang.org/#73). All I have changed is the Crawl function.

    // used records which URLs have already been crawled, so they are not fetched twice.
    var used = make(map[string]bool)

    func Crawl(url string, depth int, fetcher Fetcher) {
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("\nfound: %s %q\n\n", url, body)
        for _, u := range urls {
            if used[u] == false {
                used[u] = true
                Crawl(u, depth-1, fetcher)
            }
        }
        return
    }

In order to make it concurrent, I added the go statement in front of the call to Crawl, but instead of recursively crawling the linked pages, the program only finds the "http://golang.org/" page and no other pages.
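Presumably the change amounted to launching each recursive call in its own goroutine, roughly as follows (this is a reconstruction from the description above, not the exact code):

    for _, u := range urls {
        if used[u] == false {
            used[u] = true
            // Fire-and-forget: nothing waits for this goroutine to finish.
            go Crawl(u, depth-1, fetcher)
        }
    }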

Why doesn't the program work when I add the go statement to the call to Crawl?

Solution

The problem seems to be that your process exits before all URLs can be followed by the crawler. Because of the concurrency, main() returns before the worker goroutines have finished.

To circumvent this, you could use sync.WaitGroup:

func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
    // Mark this crawl as finished when the function returns.
    defer wg.Done()
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    // Note: used is still read and written from several goroutines here; see the note below.
    for _, u := range urls {
        if used[u] == false {
            used[u] = true
            // Account for the new goroutine before starting it.
            wg.Add(1)
            go Crawl(u, depth-1, fetcher, wg)
        }
    }
    return
}

And call Crawl in main as follows:

func main() {
    wg := &sync.WaitGroup{}

    // The initial, synchronous call to Crawl also runs a deferred wg.Done(),
    // so it needs a matching wg.Add(1); otherwise the counter goes negative and panics.
    wg.Add(1)
    Crawl("http://golang.org/", 4, fetcher, wg)

    wg.Wait()
}

Also, don't rely on the map being thread-safe: used is now read and written from multiple goroutines, which is a data race.
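One way to deal with that is to guard the visited-URL set with a sync.Mutex. The following is only a minimal sketch along those lines; the safeUrls type, its markNew method, and the visited variable are not part of the original answer (the sync package is already imported for the WaitGroup):

    // safeUrls wraps the visited-URL set with a mutex so it can be
    // shared between goroutines without a data race.
    type safeUrls struct {
        mu   sync.Mutex
        seen map[string]bool
    }

    // markNew reports whether url has not been seen before and, if so, marks it as seen.
    func (s *safeUrls) markNew(url string) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.seen[url] {
            return false
        }
        s.seen[url] = true
        return true
    }

With a package-level var visited = &safeUrls{seen: make(map[string]bool)}, the test-and-set in Crawl becomes a single call, if visited.markNew(u) { ... }, instead of reading and writing used directly.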

