Scraping a list of urls


Problem description

I am using Python 3.5 and trying to scrape a list of urls (from the same website), code as follows:

import urllib.request
from bs4 import BeautifulSoup



url_list = ['URL1',
            'URL2', 'URL3']

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker

# Scraping
def getPropNames():
    for propName in soup.findAll('div', class_="property-cta"):
        for h1 in propName.findAll('h1'):
            print(h1.text)

def getPrice():
    for price in soup.findAll('p', class_="room-price"):
        print(price.text)

def getRoom():
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)


for soups in soup():
    getPropNames()
    getPrice()
    getRoom()

So far, printing the soup, getPropNames, getPrice or getRoom individually seems to work. But I can't seem to get it to go through each of the urls and print getPropNames, getPrice and getRoom for each one.

I've only been learning Python for a few months, so I would greatly appreciate some help with this please!

Answer

Just think about what this code does:

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker

Let me show you an example:

def soup2():
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            return maker

For url_list = ['one', 'two', 'three'], the output is:

one
one a

Do you see now what is going on?

Basically, your soup function returns on the first return statement: it does not return an iterator or a list, only the first BeautifulSoup object. You are lucky (or not) that this object happens to be iterable :)

So change the code:

def soup3():
    soups = []
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            soups.append(maker)
    return soups

Then the output is:

one
one a
one b
one c
two
two a
two b
two c
three
three a
three b
three c
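Instead of accumulating a list, the same fix can also be sketched with a generator: yield hands one value back at a time and then resumes the loops, whereas return ends the function on its first execution. (A minimal stand-in using the same toy data as above, not the real scraper.)

```python
url_list = ['one', 'two', 'three']

def soup_gen():
    for url in url_list:
        for thing in ['a', 'b', 'c']:
            # yield suspends here and resumes on the next iteration,
            # unlike return, which would exit on the first pass
            yield 2 * thing

print(list(soup_gen()))  # ['aa', 'bb', 'cc', 'aa', 'bb', 'cc', 'aa', 'bb', 'cc']
```

In the original code, a generator would let for soups in soup(): actually visit one soup per url.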

But I believe that this still will not work :) Think about what is returned by sauce = urllib.request.urlopen(url), and what your code is actually iterating over in for things in sauce, that is, what each things is.
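urlopen returns a response object that iterates line by line over raw bytes, so for things in sauce builds a soup from each individual line rather than from the whole page. A hedged sketch of a restructured version: parse each page once and pass the soup into the scraping functions instead of using a global. The class names come from the question; the sample HTML is a hypothetical stand-in for the real pages (with the real site you would fetch html = urllib.request.urlopen(url).read() instead).

```python
from bs4 import BeautifulSoup

# Hypothetical page content, standing in for the real site's HTML
SAMPLE_HTML = """
<div class="property-cta"><h1>Prop A</h1></div>
<p class="room-price">£100</p>
<div class="featured-item-inner"><h5>Room 1</h5></div>
"""

def make_soups(html_pages):
    # One BeautifulSoup per page, built from the whole document,
    # not from individual lines of the response
    return [BeautifulSoup(html, 'html.parser') for html in html_pages]

def get_prop_names(soup):
    return [h1.text for div in soup.find_all('div', class_='property-cta')
                    for h1 in div.find_all('h1')]

def get_prices(soup):
    return [p.text for p in soup.find_all('p', class_='room-price')]

def get_rooms(soup):
    return [h5.text for div in soup.find_all('div', class_='featured-item-inner')
                    for h5 in div.find_all('h5')]

for soup in make_soups([SAMPLE_HTML]):
    print(get_prop_names(soup), get_prices(soup), get_rooms(soup))
```

Returning lists from the scraping functions (rather than printing inside them) also makes the per-url results easy to collect or test.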

Happy coding.
