Asynchronous crawling in F#
Question
When crawling webpages I need to be careful not to make too many requests to the same domain; for example, I want to leave 1 s between requests. From what I understand, it is the time between requests that matters. So, to speed things up, I want to use async workflows in F#, the idea being to issue requests at 1 s intervals while avoiding blocking while waiting for each response.
open System
open System.IO
open System.Net
open System.Threading

let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer : int) =
    async {
        let req = WebRequest.Create(uri) :?> HttpWebRequest
        req.UserAgent <- "Mozilla"
        try
            // Blocks the current thread for `timer` ms before sending the request
            Thread.Sleep(timer)
            let! resp = req.AsyncGetResponse()
            Console.WriteLine(uri.AbsoluteUri + " got response")
            use stream = resp.GetResponseStream()
            use reader = new StreamReader(stream)
            let html = reader.ReadToEnd()
            return html
        with
        | _ -> return "Bad Link"
    }
Then I do something like this:
let uri1 = System.Uri "http://rue89.com"
let timer = 1000
let jobs = [| for i in 1..10 -> getHtmlPrimitiveAsyncTimer uri1 timer |]

jobs
|> Array.mapi (fun i job ->
    Console.WriteLine("Starting job " + string i)
    Async.StartAsTask(job).Result)
Is this alright? I am very unsure about two things:
- Does the Thread.Sleep call work for delaying the request?
- Is using StartAsTask a problem?
I am a beginner (as you may have noticed) in F# (and in coding in general, actually), and everything involving threads scares me :)
Thanks!!
Recommended answer
I think what you want to do is:
- create 10 jobs, numbered 'n', each starting 'n' seconds from now
- run them all in parallel
Roughly like this:
let makeAsync uri n = async {
    // create the request
    do! Async.Sleep(n * 1000)
    // AsyncGetResponse etc
}

let a = [| for i in 1..10 -> makeAsync uri i |]
let results = a |> Async.Parallel |> Async.RunSynchronously
Note that, of course, they won't all start at exactly the same moment. If, for example, you have a 4-core machine, 4 will start running very soon but then quickly execute up to the Async.Sleep, at which point the next 4 will run up to their sleeps, and so forth. Then, after one second, the first async wakes up and posts its request, another second later the second async wakes up, and so on, so this should work. The 1 s spacing is only approximate, since the jobs start their timers very slightly staggered from one another. You may want to pad it a little, e.g. to 1100 ms, if the cut-off you need really is exactly one second (network latency and the like still leave some of this outside your program's control).
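To make the staggering concrete, here is a minimal sketch of the idea above. The names `staggered` and `fakeFetch`, and the 200 ms spacing, are assumptions for illustration only; `fakeFetch` is a stand-in that just records when it was called, so the real `AsyncGetResponse` logic would go in its place.

```fsharp
open System.Diagnostics

// Hypothetical helper: job n waits n * spacingMs before calling the fetch function.
let staggered (spacingMs: int) (fetch: string -> Async<'b>) (n: int) (uri: string) : Async<'b> =
    async {
        // Non-blocking wait: no thread is held while the timer runs.
        do! Async.Sleep(n * spacingMs)
        return! fetch uri
    }

let clock = Stopwatch.StartNew()

// Stand-in for the real HTTP call (assumption): returns the time at which it ran.
let fakeFetch (uri: string) : Async<int64> =
    async { return clock.ElapsedMilliseconds }

let startTimes =
    [| for i in 0 .. 3 -> staggered 200 fakeFetch i "http://rue89.com" |]
    |> Async.Parallel
    |> Async.RunSynchronously

// Each job should begin roughly 200 ms after the previous one.
startTimes |> Array.iteri (fun i t -> printfn "job %d started at about %d ms" i t)
```

The printed times are only approximate, for the reasons given above: each job starts its own timer at a slightly different moment.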
Thread.Sleep is suboptimal: it will work OK for a small number of requests, but you're burning a thread while it sleeps, threads are expensive, and it won't scale to a large number of requests.
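As a rough illustration of the scaling point, the sketch below runs 200 jobs that each wait with Async.Sleep. The job name `sleepyJob`, the job count, and the 500 ms delay are made up for the example. Because no thread is held during the wait, the whole batch should finish in roughly the time of a single delay, not 200 of them.

```fsharp
open System.Diagnostics

// Hypothetical job: waits without occupying a thread, then returns its index.
let sleepyJob (delayMs: int) (i: int) : Async<int> =
    async {
        // Timer-based wait; the thread goes back to the pool in the meantime.
        do! Async.Sleep(delayMs)
        return i
    }

let sw = Stopwatch.StartNew()

let finished =
    [| for i in 1 .. 200 -> sleepyJob 500 i |]
    |> Async.Parallel
    |> Async.RunSynchronously

sw.Stop()
printfn "%d jobs finished in %d ms" finished.Length sw.ElapsedMilliseconds
```

Replacing the Async.Sleep call with Thread.Sleep would tie up one pool thread per sleeping job, which is the cost described above.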
You don't need StartAsTask unless you want to interoperate with .NET Tasks or later do a blocking rendezvous with the result via .Result. If you just want these to all run and then block to collect all the results in an array, Async.Parallel will do that fork-join parallelism for you just fine. If they're just going to print results, you can fire-and-forget via Async.Start, which will drop the results on the floor.
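The three options can be put side by side in a small sketch. The `job` function here is a hypothetical stand-in for a download, not code from the question.

```fsharp
// Hypothetical stand-in for a download job.
let job (i: int) : Async<string> =
    async {
        do! Async.Sleep(100)
        return sprintf "page %d" i
    }

// Fork-join: run all jobs, block once, and get the results in input order.
let pages =
    [| for i in 1 .. 3 -> job i |]
    |> Async.Parallel
    |> Async.RunSynchronously

// Fire-and-forget: Async.Start takes an Async<unit>, so the result is discarded explicitly.
job 4 |> Async.Ignore |> Async.Start

// Task interop: only useful when a .NET Task is actually wanted; .Result blocks the caller.
let viaTask = Async.StartAsTask(job 5).Result
```

In the question's code, calling .Result inside Array.mapi blocks on each job before the next one starts, so the jobs end up running one after another.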
(An alternative strategy is to use an agent as a throttle. Post all the HTTP requests to a single agent, where the agent is logically single-threaded and sits in a loop, doing Async.Sleep for 1 s and then handling the next request. That's a nice way to make a general-purpose throttle... may be blog-worthy for me, come to think of it.)
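One possible shape for such a throttle, sketched with F#'s MailboxProcessor. This is an assumption about how it could look, not the answerer's implementation: `ThrottleMsg`, `makeThrottle`, `fakeDownload`, and the 200 ms pause are all hypothetical, and error handling is left out. The agent starts each posted job without waiting for its response, then sleeps before taking the next message.

```fsharp
open System.Diagnostics

// Hypothetical message: the work to run plus a channel for sending back its result.
type ThrottleMsg = Async<string> * AsyncReplyChannel<string>

// Sketch of an agent that releases one posted job per pause interval.
let makeThrottle (pauseMs: int) =
    MailboxProcessor<ThrottleMsg>.Start(fun inbox ->
        let rec loop () =
            async {
                let! (work, reply) = inbox.Receive()
                // Start the job without awaiting it, so a slow response
                // does not hold up the next request.
                async {
                    let! result = work
                    reply.Reply result
                }
                |> Async.Start
                do! Async.Sleep(pauseMs)
                return! loop ()
            }
        loop ())

let throttle = makeThrottle 200
let clock = Stopwatch.StartNew()

// Stand-in for a real download (assumption): reports when it was released.
let fakeDownload (name: string) : Async<string> =
    async { return sprintf "%s released at about %d ms" name clock.ElapsedMilliseconds }

let results =
    [| for i in 1 .. 3 ->
        throttle.PostAndAsyncReply(fun ch -> (fakeDownload (sprintf "page%d" i), ch)) |]
    |> Async.Parallel
    |> Async.RunSynchronously

results |> Array.iter (printfn "%s")
```

Callers only ever see an ordinary Async value from PostAndAsyncReply, so the spacing policy lives in one place and can be changed without touching the download code.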