How to scrape many dynamic urls in Python


Problem description

I want to scrape one dynamic URL at a time. What I do is collect the URLs from all the hrefs, and then I want to scrape each of those URLs. This is what I am trying:

from bs4 import BeautifulSoup
import urllib.request 
import re

r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware')
soup = BeautifulSoup(r, "html.parser")
links = soup.find_all("a", href=re.compile(r"expexhibitorlist\.aspx\?categoryno=[0-9]+"))

linksfromcategories = ([link["href"] for link in links])
string = "http://i.cantonfair.org.cn/en/"

str1 = [string + x for x in linksfromcategories]
fulllinksfromcategories = '\n'.join(str1)
lfc = urllib.request.urlopen(fulllinksfromcategories).read()
soup2 = BeautifulSoup(lfc,"html.parser") 
print(soup2)

But it gives me the following error:

Traceback (most recent call last):
  File "D:\python\scarpepython.py", line 50, in <module>
    lfc = urllib.request.urlopen(fulllinksfromcategories).read()
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open
    response = self._open(req, data)
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open
    '_open', req)
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain
    result = func(*args)
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1243, in do_open
    r = h.getresponse()
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1174, in getresponse
    response.begin()
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 282, in begin
    version, status, reason = self._read_status()
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 264, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine:

Answer

str1 in your case contains a list of URLs. You are joining that list into a single newline-separated string and then trying to navigate to the result, which of course does not work.
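A minimal illustration of the problem (with hypothetical short URLs in place of the real category links): `'\n'.join(...)` produces one multi-line string, not a single valid URL, so `urlopen` cannot handle it.

```python
# Two stand-in URLs (hypothetical; the real list holds category page links).
urls = ["http://example.com/a", "http://example.com/b"]

# Joining with newlines collapses the list into ONE string containing '\n'.
joined = "\n".join(urls)
print(repr(joined))  # 'http://example.com/a\nhttp://example.com/b'

# urllib.request.urlopen(joined) would fail: this is not one valid URL.
```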

Instead, loop over the extracted URLs one by one and navigate to each:

linksfromcategories = [string + x for x in linksfromcategories]
for link in linksfromcategories:
    print(link)
    lfc = urllib.request.urlopen(link).read()
    soup2 = BeautifulSoup(lfc, "html.parser")
    print(soup2)
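Note that `BadStatusLine` can also surface when the server sends a malformed or abruptly closed response, for example because it dislikes Python's default `User-Agent`. A more defensive variant of the loop (a sketch, not part of the original answer; the `User-Agent` string is a placeholder) sets a browser-like header and skips links that fail instead of aborting the whole crawl:

```python
import http.client
import urllib.error
import urllib.request

# Hypothetical but realistic User-Agent; some servers reject the default one.
HEADERS = {"User-Agent": "Mozilla/5.0"}

def make_request(url):
    # Attach headers so the server is less likely to drop the connection.
    return urllib.request.Request(url, headers=HEADERS)

def fetch_all(urls, timeout=30):
    """Fetch each URL, skipping ones that fail instead of stopping the crawl."""
    pages = {}
    for url in urls:
        try:
            pages[url] = urllib.request.urlopen(make_request(url), timeout=timeout).read()
        except (http.client.BadStatusLine, urllib.error.URLError) as exc:
            print("skipping", url, "->", exc)
    return pages
```

Each value in `pages` can then be parsed with `BeautifulSoup(html, "html.parser")` exactly as in the loop above.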

