Scraping string from a large number of URLs with Julia

Question

Happy New Year!

I have just started to learn Julia, and the first mini-challenge I have set myself is to scrape data from a large list of URLs.

I have ca. 50k URLs in a CSV file (which I successfully parsed from a JSON with Julia using Regex). I want to scrape each one and return a matched string ("/page/12345/view", where 12345 is any integer).

I managed to do so using HTTP and Queryverse (although I started with CSV and CSVFiles, I am looking at different packages for learning purposes), but the script seems to stop after just under 2k URLs. I can't see an error such as a timeout.

May I ask if anyone can advise what I'm doing wrong, or how I could approach it differently? Explanations/links to learning resources would also be great!

using HTTP, Queryverse


URLs = load("urls.csv") |> DataFrame

patternid = r"/page/[0-9]+/view"

touch("ids.txt")
f = open("ids.txt", "a")

for row in eachrow(URLs)

    urlResponse = HTTP.get(row[:url])
    if Int(urlResponse.status) == 404
        continue
    end

    urlHTML = String(urlResponse.body)

    urlIDmatch = match(patternid, urlHTML)

    write(f, urlIDmatch.match, "\n")

end

close(f)

Answer

There can always be a server that detects your scraper and intentionally takes a very long time to respond.

Basically, since scraping is an IO-intensive operation, you should do it using a large number of asynchronous tasks. Moreover, this should be combined with the readtimeout parameter of the get function. Hence your code will look more or less like this:

asyncmap(1:nrow(URLs); ntasks=50) do n
    row = URLs[n, :]
    urlResponse = HTTP.get(row[:url], readtimeout=10)
    # the rest of your code comes here
end

Even if some servers delay their transmission, many connections will always keep working.
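
For completeness, here is one way the loop body from the question might be folded into this pattern. It is only a sketch under a few assumptions: urls.csv has a column named url, each request is wrapped in try/catch so that a single timeout or HTTP error does not stop the whole run (HTTP.jl throws an exception for 4xx/5xx statuses by default, as well as for timeouts), pages with no regex match are skipped, and the results returned by asyncmap are written to ids.txt in one pass at the end instead of sharing an open file handle between tasks.

using HTTP, Queryverse

# Assumption: urls.csv has a column named :url, and we want the first
# "/page/<integer>/view" match from each page, as in the question.
URLs = load("urls.csv") |> DataFrame
patternid = r"/page/[0-9]+/view"

# asyncmap returns one result per URL; `missing` marks pages that failed
# (timeout, HTTP error, ...) or contained no match.
ids = asyncmap(1:nrow(URLs); ntasks=50) do n
    url = URLs[n, :url]
    try
        urlResponse = HTTP.get(url; readtimeout=10)
        m = match(patternid, String(urlResponse.body))
        m === nothing ? missing : m.match
    catch e
        # HTTP.get throws for timeouts and for 4xx/5xx statuses by default;
        # log the failure and let the other tasks keep running.
        @warn "Request failed" url exception=e
        missing
    end
end

# Write the successful matches once all tasks have finished, rather than
# writing to a shared file handle from inside the tasks.
open("ids.txt", "a") do f
    for id in skipmissing(ids)
        write(f, id, "\n")
    end
end

Collecting the matches and writing them at the end avoids interleaved writes from concurrent tasks; if results need to be streamed as they arrive, a Channel or a lock around the file handle would be an alternative.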
