How to scrape URLs from a list using Python

Views: 327 | Posted: 2016/8/5 19:20:29 | Tags: python, web-scraping, beautifulsoup


Problem description

I want to scrape the URLs present in a list. Basically, I scrape a website and collect particular links from it. I then scrape each of those links and search the resulting pages for further particular links, which I also want to scrape. My code:

from bs4 import BeautifulSoup
import urllib.request
import re
r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware')
soup = BeautifulSoup(r, "html.parser")
links = soup.find_all("a", href=re.compile(r"expexhibitorlist\.aspx\?categoryno=[0-9]+"))
linksfromcategories = ([link["href"] for link in links])

string = "http://i.cantonfair.org.cn/en/"
linksfromcategories = [string + x for x in linksfromcategories]
subcatlinks = list()
for link in linksfromcategories:
  response = urllib.request.urlopen(link)
  soup2 = BeautifulSoup(response, "html.parser")
  links2 = soup2.find_all("a", href=re.compile(r"ExpExhibitorList\.aspx\?categoryno=[0-9]+"))
  linksfromsubcategories = ([link["href"] for link in links2])
  subcatlinks.append(linksfromsubcategories)
responses = urllib.request.urlopen(subcatlinks)
soup3 = BeautifulSoup(responses, "html.parser")
print (soup3)

And I get this error:

Traceback (most recent call last):
  File "D:\python\phase2.py", line 46, in <module>
    responses = urllib.request.urlopen(subcatlinks)
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\amanp\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 456, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'

Solution

You can only pass in one link at a time to urllib.request.urlopen as opposed to a whole list of them.

So you'll need another loop like this:

for link in subcatlinks:
    response = urllib.request.urlopen(link)
    soup3 = BeautifulSoup(response, "html.parser")
    print(soup3)
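
One detail the loop above depends on: in the question, subcatlinks.append(linksfromsubcategories) adds a whole list per page, so subcatlinks ends up as a list of lists, and the collected href values are relative. The following is a minimal sketch of one way to handle both, not the original poster's code. The helper names flatten_links and fetch_each, the opener parameter, and the category numbers in the example are assumptions made for illustration; only the base URL comes from the question. Parsing is left out so the sketch runs with the standard library alone.

```python
import urllib.request
from urllib.parse import urljoin

# Same prefix as `string` in the question
BASE_URL = "http://i.cantonfair.org.cn/en/"


def flatten_links(href_lists, base_url=BASE_URL):
    """Turn a list of href lists into one flat list of absolute URLs."""
    flat = []
    for hrefs in href_lists:
        # extend() adds each href individually; append() would nest the list
        flat.extend(urljoin(base_url, href) for href in hrefs)
    return flat


def fetch_each(links, opener=urllib.request.urlopen):
    """Open the links one at a time and yield (link, raw_bytes) pairs."""
    for link in links:
        # urlopen() takes a single URL string (or Request), never a list
        with opener(link) as response:
            yield link, response.read()


# Hypothetical hrefs, shaped like what the question's inner loop collects per page
example_hrefs = [
    ["ExpExhibitorList.aspx?categoryno=401", "ExpExhibitorList.aspx?categoryno=402"],
    ["ExpExhibitorList.aspx?categoryno=403"],
]
subcatlinks = flatten_links(example_hrefs)
print(subcatlinks)
```

Each raw_bytes value yielded by fetch_each(subcatlinks) can then be passed to BeautifulSoup(raw_bytes, "html.parser"), as in the loop above. The opener parameter exists only so the fetching step can be exercised without network access.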
