Skip errors while scraping a list of URLs from a CSV


Problem description

I managed to scrape a list of URLs from a CSV file, but I ran into a problem: the scraping stops when it hits a broken link. It also prints a lot of None lines; is it possible to get rid of them?

Would appreciate some help here. Thank you in advance!

Here is the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup #required to parse html
import requests #required to make request

#read file
with open('urls.csv','r') as f:
    csv_raw_cont=f.read()

#split by line
split_csv=csv_raw_cont.split('\n')

#specify separator
separator=";"

#iterate over each line
for each in split_csv:

    #specify the column index
    url_column_index=0 #in our example CSV the url is the first column, so we use 0

    #get the url
    url = each.split(separator)[url_column_index]

    #fetch content from server
    html = requests.get(url).content

    #soup fetched content
    soup = BeautifulSoup(html,'lxml')

    tags = soup.find("div", {"class": "productsPicture"}).findAll("a")

    for tag in tags:
       print(tag.get('href'))

The error output looks like this:

https://www.tennis-point.com/asics-gel-resolution-7-all-court-shoe-men-white-silver-02013802720000.html
None
https://www.tennis-point.com/cep-ultralight-run-sports-socks-men-black-light-green-12143000063000.html
None
https://www.tennis-point.com/asics-gel-solution-speed-3-clay-court-shoe-men-white-grey-02013802634000.html
None
https://www.tennis-point.com/asics-gel-solution-speed-3-all-court-shoe-men-white-silver-02013802723000.html
None
https://www.tennis-point.com/asics-gel-challenger-9-indoor-carpet-shoe-men-white-grey-02012401735000.html
None
https://www.tennis-point.com/asics-gel-court-speed-clay-court-shoe-men-dark-blue-yellow-02014202833000.html
None
https://www.tennis-point.com/asics-gel-court-speed-all-court-shoe-men-white-silver-02014202832000.html
None
Traceback (most recent call last):
File "/Users/imaging-adrian/Desktop/Python Scripts/close_to_work.py", line 33, in <module>
tags = soup.find("div", {"class": "productsPicture"}).findAll("a")
AttributeError: 'NoneType' object has no attribute 'findAll'
[Finished in 3.7s with exit code 1]
[shell_cmd: python -u "/Users/imaging-adrian/Desktop/Python Scripts/close_to_work.py"]
[dir: /Users/imaging-adrian/Desktop/Python Scripts]
[path: /Users/imaging-adrian/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/munki]
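The None lines and the crash share a root cause: `soup.find(...)` returns `None` when a page has no matching div (which is what raises the `AttributeError`), and `tag.get('href')` returns `None` for anchors that lack an `href` attribute. Since `Tag.get` behaves like `dict.get`, the filtering fix can be illustrated with plain dicts (a sketch using made-up attribute data, not the site's actual markup):

```python
# Tag.get behaves like dict.get: a missing attribute yields None
anchors = [
    {"href": "/shoe-a.html"},
    {"onclick": "zoom()"},      # an <a> with no href -> get() returns None
    {"href": "/shoe-b.html"},
]

# printing tag.get('href') unconditionally is what produces the stray "None" lines
links = [a.get("href") for a in anchors]
print(links)

# filtering out the None values removes them
clean = [h for h in links if h is not None]
print(clean)
```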

The links in my CSV file look like this:

https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E701Y-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E601N-4907;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E601N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E600N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E326Y-0174;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E801N-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-9093;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E804N-9095;
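Because every line ends with the `;` separator, `csv.reader` with `delimiter=';'` splits each row into the URL plus a trailing empty field, so `row[0]` is the URL. A minimal sketch of that parsing, with shortened placeholder URLs in place of the real ones:

```python
import csv
import io

# two semicolon-terminated lines, shaped like urls.csv (placeholder URLs)
sample = ("https://example.com/search?param=E701Y-0193;\n"
          "https://example.com/search?param=E601N-4907;\n")

reader = csv.reader(io.StringIO(sample), delimiter=';')
rows = [row for row in reader if row]   # skip any blank lines

# each row is [url, ''] because of the trailing separator
urls = [row[0] for row in rows]
print(urls)
```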

Answer

Here is a working version:

from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile, delimiter=';')
    writer = csv.writer(results)

    for row in reader:
        # get the url
        url = row[0]

        # fetch content from server
        html = requests.get(url).content

        # soup fetched content
        soup = BeautifulSoup(html, 'html.parser')

        # skip pages that have no productsPicture div
        divTag = soup.find("div", {"class": "productsPicture"})
        if divTag is None:
            continue

        for tag in divTag.findAll("a"):
            res = tag.get('href')
            if res is not None:
                writer.writerow([res])
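The version above skips pages without a productsPicture div, but a truly broken link can still raise inside `requests.get` (connection errors and timeouts surface as `requests.exceptions.RequestException`). One way to skip those too is to wrap the request in try/except and `continue`. The loop can be factored so the skip logic is testable without the network; the `fetch` and `extract` callables below are stand-ins for `requests.get(url).content` and the BeautifulSoup lookup, and the fake implementations are illustrative only:

```python
def scrape_rows(rows, fetch, extract):
    """Collect hrefs from each row's URL, skipping rows whose fetch fails.

    rows:    iterable of CSV rows with the URL in column 0
    fetch:   url -> html text; raises on a broken link
    extract: html -> list of href values (may contain None)
    """
    hrefs = []
    for row in rows:
        if not row:                      # skip blank lines
            continue
        url = row[0]
        try:
            html = fetch(url)
        except Exception as exc:         # broken link: report it and move on
            print(f"skipping {url}: {exc}")
            continue
        for href in extract(html):
            if href is not None:         # drop anchors without an href
                hrefs.append(href)
    return hrefs


# in real use: fetch = lambda u: requests.get(u, timeout=10).text,
# and extract parses the productsPicture div with BeautifulSoup
def fake_fetch(url):
    if "broken" in url:
        raise IOError("connection failed")
    return url + "-html"

def fake_extract(html):
    return [html + "-a", None]

print(scrape_rows([["good1"], ["broken1"], ["good2"]], fake_fetch, fake_extract))
```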
