How to scrape many dynamic URLs in Python
Problem description
I want to scrape one dynamic URL at a time. What I did was scrape the URLs from all the href attributes, and now I want to fetch each of those URLs in turn.
What I am trying:
from bs4 import BeautifulSoup
import urllib.request
import re
r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware')
soup = BeautifulSoup(r, "html.parser")
links = soup.find_all("a", href=re.compile(r"expexhibitorlist\.aspx\?categoryno=[0-9]+"))
linksfromcategories = ([link["href"] for link in links])
string = "http://i.cantonfair.org.cn/en/"
str1 = [string + x for x in linksfromcategories]
fulllinksfromcategories = '\n'.join(str1)
lfc = urllib.request.urlopen(fulllinksfromcategories).read()
soup2 = BeautifulSoup(lfc,"html.parser")
print(soup2)
But it gives me the following error:
Traceback (most recent call last):
File "D:\python\scarpepython.py", line 50, in <module>
lfc = urllib.request.urlopen(fulllinksfromcategories).read()
File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open
response = self._open(req, data)
File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open
'_open', req)
File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain
result = func(*args)
File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1268, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1243, in do_open
r = h.getresponse()
File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1174, in getresponse
response.begin()
File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 282, in begin
version, status, reason = self._read_status()
File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 264, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine:
Recommended answer
str1 in your case contains a list of URLs. You are joining this list into a single newline-separated string and then trying to navigate to that combined string, which of course cannot work: urlopen expects a single URL, not several URLs glued together with newlines.
Instead, you meant to loop over the extracted URLs one by one and navigate to each:
linksfromcategories = [string + x for x in linksfromcategories]
for link in linksfromcategories:
    print(link)
    lfc = urllib.request.urlopen(link).read()
    soup2 = BeautifulSoup(lfc, "html.parser")
    print(soup2)
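To see why the original approach fails without hitting the network, here is a minimal sketch of the difference between joining the URLs and looping over them. The relative hrefs below are hypothetical stand-ins for the ones scraped from the page:

```python
# Minimal sketch: build full URLs and handle them one at a time,
# instead of joining them into a single newline-separated string.
base = "http://i.cantonfair.org.cn/en/"
relative_links = [
    "expexhibitorlist.aspx?categoryno=101",
    "expexhibitorlist.aspx?categoryno=102",
]  # hypothetical hrefs standing in for the scraped ones

full_links = [base + href for href in relative_links]

# The original code did this -- one string with an embedded newline,
# which is not something urlopen can fetch:
joined = "\n".join(full_links)
print(repr(joined))

# The working approach: iterate, so each item is a single well-formed URL.
for link in full_links:
    print(link)
    # lfc = urllib.request.urlopen(link).read()   # network call omitted here
```

Each iteration hands urlopen exactly one URL, which is why the BadStatusLine error disappears with the loop-based version.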