Asynchronous crawling in F#

Question

When crawling web pages I need to be careful not to make too many requests to the same domain; for example, I want to leave 1 s between requests. From what I understand, it is the time between requests that matters. So, to speed things up, I want to use async workflows in F#: the idea is to issue requests at 1-second intervals without blocking while waiting for each response.

open System
open System.IO
open System.Net
open System.Threading

let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer : int) =
    async {
        let req = WebRequest.Create(uri) :?> HttpWebRequest
        req.UserAgent <- "Mozilla"
        try
            // Blocks the current thread for `timer` ms before sending the request
            Thread.Sleep(timer)
            let! resp = req.AsyncGetResponse()
            Console.WriteLine(uri.AbsoluteUri + " got response")
            use stream = resp.GetResponseStream()
            use reader = new StreamReader(stream)
            let html = reader.ReadToEnd()
            return html
        with
        | _ -> return "Bad Link"
    }

Then I do something like this:

let uri1 = System.Uri "http://rue89.com"
let timer = 1000
let jobs = [|for i in 1..10 -> getHtmlPrimitiveAsyncTimer uri1 timer|]

jobs
|> Array.mapi (fun i job ->
    Console.WriteLine("Starting job " + string i)
    Async.StartAsTask(job).Result)

Is this all right? I am very unsure about two things:
- Does Thread.Sleep work for delaying the request?
- Is using StartAsTask a problem?

I am a beginner in F# (and in coding in general, as you may have noticed), and everything involving threads scares me :)

Thanks!!

Recommended answer

I think what you want to do is:
- create 10 jobs, numbered 'n', each starting 'n' seconds from now
- run those all in parallel

Roughly like this:

let makeAsync uri n = async {
    // create the request
    do! Async.Sleep(n * 1000)
    // AsyncGetResponse etc
    }

let a = [| for i in 1..10 -> makeAsync uri i |]
let results = a |> Async.Parallel |> Async.RunSynchronously

Note that, of course, they won't all start exactly now: if, e.g., you have a 4-core machine, 4 will start running very soon but quickly execute up to the Async.Sleep, at which point the next 4 will run up to their sleeps, and so forth. Then after one second the first async wakes up and posts a request, another second later the second async wakes up, and so on, so this should work. The 1 s is only approximate, since the jobs start their timers very slightly staggered from one another. You may want to pad it a little, e.g. 1100 ms, if the cut-off you need is really exactly one second (network latencies and the like still leave some of this outside your program's control).
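A fuller version of the staggered-start idea might look like the sketch below. It is only an illustration: `fakeFetch` is a hypothetical stand-in for the real `AsyncGetResponse` call so the code runs offline, and the `delayMs` parameter and the 200 ms spacing are assumptions chosen to keep the demo short (use 1000 or 1100 for the spacing discussed above).

```fsharp
open System
open System.Diagnostics

// Hypothetical stand-in for the real HTTP call, so the sketch runs offline.
let fakeFetch (uri : Uri) = async {
    do! Async.Sleep 50              // pretend network latency
    return sprintf "<html>%s</html>" uri.AbsoluteUri }

// Job n waits n * delayMs before issuing its request.
let makeStaggered (fetch : Uri -> Async<string>) (delayMs : int) (uri : Uri) (n : int) = async {
    do! Async.Sleep (n * delayMs)   // non-blocking wait; no thread is held
    try
        return! fetch uri
    with _ ->
        return "Bad Link" }

let target = Uri "http://rue89.com"
let clock = Stopwatch.StartNew()
let results =
    [| for i in 1..5 -> makeStaggered fakeFetch 200 target i |]
    |> Async.Parallel
    |> Async.RunSynchronously
printfn "%d pages in %d ms" results.Length clock.ElapsedMilliseconds
```

Because each job computes its own delay from its index, the total run time is roughly the last job's delay plus one response time, rather than the sum of all the response times.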

Thread.Sleep is suboptimal: it works fine for a small number of requests, but each sleeping job ties up a thread, threads are expensive, and the approach won't scale to a large number of requests.
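To make the scaling point concrete, here is a small sketch (the count of 500 and the 200 ms wait are arbitrary assumptions). Async.Sleep schedules a timer and releases the thread, so hundreds of sleeping workflows can wait concurrently; with Thread.Sleep each of them would occupy a thread-pool thread for the whole wait.

```fsharp
open System.Diagnostics

// Each workflow waits 200 ms without holding a thread while it waits.
let sleeper (i : int) = async {
    do! Async.Sleep 200
    return i }

let stopwatch = Stopwatch.StartNew()
let finished =
    Array.init 500 sleeper
    |> Async.Parallel
    |> Async.RunSynchronously
printfn "%d sleepers finished in %d ms" finished.Length stopwatch.ElapsedMilliseconds
```

On a typical machine the elapsed time should be close to a single 200 ms wait rather than 500 of them, though the exact figure depends on the scheduler.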

You don't need StartAsTask unless you want to interoperate with .NET Tasks or later do a blocking rendezvous with the result via .Result. If you just want these to all run and then block to collect all the results in an array, Async.Parallel will do that fork-join parallelism for you just fine. If they're just going to print results, you can fire-and-forget via Async.Start which will drop the results on the floor.
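The three ways of starting a workflow mentioned above can be compared side by side in a short sketch. The `job` function is a hypothetical placeholder for a real request.

```fsharp
// Hypothetical job: waits briefly, then returns a value.
let job (n : int) = async {
    do! Async.Sleep 100
    return n * n }

// Fork-join: run all jobs, block once, get the results in input order.
let squares =
    [| for n in 1..3 -> job n |]
    |> Async.Parallel
    |> Async.RunSynchronously

// Interop with .NET Tasks: .Result blocks the calling thread until done.
let viaTask = (Async.StartAsTask (job 4)).Result

// Fire-and-forget: Async.Start takes Async<unit>, so the result is dropped explicitly.
job 5 |> Async.Ignore |> Async.Start
```

Note that calling `.Result` inside `Array.mapi`, as in the question, waits for each job to finish before the next one is started, so the jobs there run one after another rather than in parallel.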

(An alternative strategy is to use an agent as a throttle. Post all the http requests to a single agent, where the agent is logically single-threaded and sits in a loop, doing Async.Sleep for 1s, and then handling the next request. That's a nice way to make a general-purpose throttle... may be blog-worthy for me, come to think of it.)
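One possible shape for such an agent-based throttle, using F#'s MailboxProcessor, is sketched below. This is an assumption about how it could be built, not the answerer's implementation: `makeThrottle`, the `Request` type, and `hostOnly` (an offline stand-in for the real fetch) are all hypothetical names, and 200 ms replaces the 1 s interval to keep the demo quick.

```fsharp
open System
open System.Diagnostics

// A request carries the URI plus a channel on which to send back the page.
type Request = Uri * AsyncReplyChannel<string>

// Hypothetical throttle: a single agent releases one request per intervalMs.
let makeThrottle (intervalMs : int) (fetch : Uri -> Async<string>) =
    MailboxProcessor<Request>.Start(fun inbox ->
        let rec loop () = async {
            let! (uri, reply) = inbox.Receive()
            let work = async {
                let! html = async {
                    try
                        return! fetch uri
                    with _ ->
                        return "Bad Link" }
                reply.Reply html }
            // Start the fetch without awaiting it, so a slow response
            // does not delay the next request.
            Async.Start work
            do! Async.Sleep intervalMs
            return! loop () }
        loop ())

// Offline stand-in for a real HTTP fetch.
let hostOnly (uri : Uri) = async { return uri.Host }

let throttle = makeThrottle 200 hostOnly
let throttleClock = Stopwatch.StartNew()
let hosts =
    [| for i in 1..4 -> throttle.PostAndAsyncReply(fun ch -> (Uri "http://rue89.com", ch)) |]
    |> Async.Parallel
    |> Async.RunSynchronously
printfn "%d replies in %d ms" hosts.Length throttleClock.ElapsedMilliseconds
```

Callers simply post to the agent and await the reply; the spacing between requests lives in one place, so it applies no matter how many workflows are posting at once.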
