Scraping string from a large number of URLs with Julia
Question
Happy New Year!

I have just started to learn Julia, and the first mini challenge I have set myself is to scrape data from a large list of URLs.
I have ca. 50k URLs in a CSV file (which I successfully parsed from a JSON with Julia using regex). I want to scrape each one and return a matched string ("/page/12345/view", where 12345 is any integer).
I managed to do so using HTTP and Queryverse (I had started with CSV and CSVFiles, but I'm trying out different packages for learning purposes), but the script seems to stop after just under 2k URLs. I can't see an error such as a timeout.
May I ask if anyone can advise what I'm doing wrong, or how I could approach it differently? Explanations/links to learning resources would also be great!
using HTTP, Queryverse

URLs = load("urls.csv") |> DataFrame

patternid = r"/page/[0-9]+/view"

touch("ids.txt")
f = open("ids.txt", "a")

for row in eachrow(URLs)
    urlResponse = HTTP.get(row[:url])
    if Int(urlResponse.status) == 404
        continue
    end
    urlHTML = String(urlResponse.body)
    urlIDmatch = match(patternid, urlHTML)
    write(f, urlIDmatch.match, "\n")
end

close(f)
Answer
There can always be a server that detects your scraper and intentionally takes a very long time to respond.
Basically, since scraping is an IO-intensive operation, you should do it using a large number of asynchronous tasks. Moreover, this should be combined with the readtimeout parameter of the get function. Hence your code will look more or less like this:
asyncmap(1:nrow(URLs); ntasks=50) do n
    row = URLs[n, :]
    urlResponse = HTTP.get(row[:url], readtimeout=10)
    # the rest of your code comes here
end
Even if some servers delay their responses, many other connections will always keep working.
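As a fuller sketch of how the pieces could fit together (assuming HTTP.jl and DataFrames are loaded and `URLs` is the DataFrame from the question; the function name `scrape_ids` is just for illustration), each request can be wrapped in try/catch so a single timeout or connection error doesn't stop the whole run, and `status_exception=false` keeps HTTP.get from throwing on 404s. Note also that `match` returns `nothing` when the pattern isn't found, which the original loop didn't guard against:

```julia
using HTTP, DataFrames

function scrape_ids(URLs::DataFrame, outpath::AbstractString)
    patternid = r"/page/[0-9]+/view"
    open(outpath, "a") do f
        asyncmap(1:nrow(URLs); ntasks=50) do n
            url = URLs[n, :url]
            try
                # status_exception=false: return 4xx/5xx responses instead of throwing
                resp = HTTP.get(url; readtimeout=10, status_exception=false)
                resp.status == 200 || return nothing   # skip 404s and other errors
                m = match(patternid, String(resp.body))
                # match returns `nothing` when the page has no "/page/<id>/view" link
                m === nothing || write(f, m.match, "\n")
            catch e
                @warn "request failed" url exception=e  # timeouts and network errors land here
            end
            nothing
        end
    end
end
```

Since `asyncmap` runs the tasks as coroutines on one thread, the `write` calls don't interleave mid-line; each task writes its whole line at a yield point.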