Scraping string from a large number of URLs with Julia


Problem description

Happy New Year!

I have just started learning Julia, and the first mini-challenge I have set myself is to scrape data from a large list of URLs.

I have ca. 50k URLs in a CSV file (which I successfully parsed from a JSON with Julia using a regex). I want to scrape each one and return a matched string ("/page/12345/view", where 12345 is any integer).

I managed to do so using HTTP and Queryverse (although I had started with CSV and CSVFiles, I am trying out different packages for learning purposes), but the script seems to stop after just under 2k URLs. I can't see an error such as a timeout.

May I ask if anyone can advise what I'm doing wrong, or how I could approach it differently? Explanations/links to learning resources would also be great!

using HTTP, Queryverse


URLs = load("urls.csv") |> DataFrame

patternid = r"\/page\/[0-9]+\/view"

touch("ids.txt")
f = open("ids.txt", "a")

for row in eachrow(URLs)

    # Without status_exception=false, HTTP.get raises an error on 4xx/5xx
    # responses, so the 404 check below would never run
    urlResponse = HTTP.get(row[:url], status_exception=false)
    if Int(urlResponse.status) == 404
        continue
    end

    urlHTML = String(urlResponse.body)

    urlIDmatch = match(patternid, urlHTML)

    # match returns nothing when the pattern is absent; guard before writing
    if urlIDmatch !== nothing
        write(f, urlIDmatch.match, "\n")
    end

end

close(f)

Recommended answer

There can always be a server that detects your scraper and intentionally takes a very long time to respond.

Basically, since scraping is an IO-intensive operation, you should do it using a large number of asynchronous tasks. Moreover, this should be combined with the readtimeout parameter of the get function. Hence your code will look more or less like this:

# Run up to 50 requests concurrently; abandon any response that
# takes longer than 10 seconds instead of hanging forever
asyncmap(1:nrow(URLs); ntasks=50) do n
    row = URLs[n, :]
    urlResponse = HTTP.get(row[:url], readtimeout=10)
    # the rest of your code comes here
end

Even if some servers delay their responses, many other connections will still be making progress.
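
For completeness, here is a minimal sketch of what the whole script might look like once the loop is wrapped in asyncmap. The ntasks=50 and readtimeout=10 values are illustrative; status_exception=false stops HTTP.get from throwing on 404s, and the try/catch is needed because a timed-out request raises an exception rather than returning a status:

using HTTP, Queryverse

URLs = load("urls.csv") |> DataFrame
patternid = r"\/page\/[0-9]+\/view"

# Each task fetches one URL; a timeout or connection error yields
# nothing instead of aborting the whole run
ids = asyncmap(1:nrow(URLs); ntasks=50) do n
    try
        urlResponse = HTTP.get(URLs[n, :url], readtimeout=10, status_exception=false)
        urlResponse.status == 200 || return nothing
        m = match(patternid, String(urlResponse.body))
        return m === nothing ? nothing : m.match
    catch
        return nothing   # e.g. a timeout or a connection failure
    end
end

# Write all successful matches in one pass after the tasks finish,
# so concurrent tasks never interleave writes to the file
open("ids.txt", "a") do f
    for id in ids
        id === nothing || write(f, id, "\n")
    end
end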

