Two different errors with the same code that I used to scrape other pages


Question

I used this code to scrape two pages from TripAdvisor, and it worked very well. But now it shows me two different errors:

with open("iletaitunsquare1.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ";", quoting=csv.QUOTE_MINIMAL)
    w.writerow(["inf_rest_name", "rest_eclf", "name_client", "date_rev_cli", "opinion_cl"])

    with requests. Session() as s:
        for offset in range (270,1230,10):
            url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d6575305-Reviews-or{offset}-Il_Etait_Un_Square-Paris_Ile_de_France.html'
            r = s.get(url)
            soup = bs(r.content, 'lxml')
            reviews = soup.select('.reviewSelector')
            ids = [review.get('data.reviewid') for review in reviews]
            r = s.post(
                    'https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=',
                    data = {'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
                    headers = {'Referer': r.url}
                    )

            soup = bs(r.content, 'lxml')
            if not offset:
                inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
                rest_eclf = soup.select_one('.header_links a').text.strip()

            for review in soup.select('.reviewSelector'):
                name_client = review.select_one('.info_text > div:first-child').text.strip()
                date_rev_cl = review.select_one('.ratingDate')['title'].strip()
                titre_rev_cl = review.select_one('.noQuotes').text.strip()
                opinion_cl = review.select_one('.partial_entry').text.replace("\n","").strip()
                row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}", f"{titre_rev_cl}", f"{opinion_cl}"]
                w.writerow(row)

Execution error:

"data = {'reviews': ','.join(ids), 'contextChoice': 'DETAIL'} TypeError: sequence item 0: expected str instance, NoneType found"

And after I changed only the values on line 6 (the page offsets) and line 7 (the URL):

with open("boutary.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ";", quoting=csv.QUOTE_MINIMAL)
    w.writerow(["inf_rest_name", "rest_eclf", "name_client", "date_rev_cl", "titre_rev_cl", "opinion_cl"])

    with requests.Session() as s:
        for offset in range(40, 290, 10):
            url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d9783452-Reviews-or{offset}-Boutary-Paris_Ile_de_France.html'
            r = s.get(url)
            soup = bs(r.content, 'lxml')
            reviews = soup.select('.reviewSelector')
            ids = [review.get('data-reviewid') for review in reviews]
            r = s.post(
                'https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=',
                data = {'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
                headers = {'referer': r.url}
                )

            soup = bs(r.content, 'lxml')

            if not offset:
                    inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
                    rest_eclf = soup.select_one('.header_links a').text.strip()

            for review in soup.select('.reviewSelector'):
                name_client = review.select_one('.info_text > div:first-child').text.strip()
                date_rev_cl = review.select_one('.ratingDate')['title'].strip()
                titre_rev_cl = review.select_one('.noQuotes').text.strip()
                opinion_cl = review.select_one('.partial_entry').text.replace("\n","").strip()
                row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}" , f"{titre_rev_cl}", f"{opinion_cl}"]
                w.writerow(row)

it shows me:

"row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}" , f"{titre_rev_cl}", f"{opinion_cl}"]

NameError: name 'inf_rest_name' is not defined"

These errors are strange because I previously used the same code with other URLs and it worked perfectly. Can you tell me what is happening? How can I run it properly? I would appreciate your help.

Recommended answer

This is because the original code (not posted here) relied on the truthy/falsy value of offset 0, which was the first offset in your prior question.

For example, with:

for offset in range(0, 10, 10):
    if not offset:

The first value, 0, is falsy, whereas numbers greater than 0 (in this scenario) are truthy. So "if not offset" is true only when the offset is 0, and that is when inf_rest_name is set. This ensures its value is set only on the first loop iteration rather than every time. The value doesn't change, so there is no need to read it again.

With the following, all values are truthy, so inf_rest_name never gets set.

for offset in range(40, 290, 10):
    if not offset:
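
The difference between the two ranges can be checked without touching the site. This is a minimal sketch; the helper name first_pass_offsets is hypothetical and only exists for this illustration.

```python
def first_pass_offsets(offsets):
    """Return the offsets for which `not offset` is True."""
    return [offset for offset in offsets if not offset]


# Starting at 0: the first offset is falsy, so the branch runs once.
print(first_pass_offsets(range(0, 30, 10)))

# Starting at 40: every offset is truthy, so the branch never runs.
print(first_pass_offsets(range(40, 70, 10)))
```

An empty result for the second range is exactly the situation in which inf_rest_name is never assigned.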

You could change it to:

if offset == firstvalue:

For example:

if offset == 40:
    inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
    rest_eclf = soup.select_one('.header_links a').text.strip()
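
Hard-coding the first value means the comparison has to be edited every time the range changes. One possible alternative, sketched here under the assumption that only the first iteration matters, is to track the iteration index with enumerate. The generator name pages_with_first_flag is hypothetical and not part of the original answer.

```python
def pages_with_first_flag(start, stop, step):
    """Yield (offset, is_first) pairs; is_first is True only for the first offset."""
    for index, offset in enumerate(range(start, stop, step)):
        yield offset, index == 0


for offset, is_first in pages_with_first_flag(40, 70, 10):
    if is_first:
        print(f"offset {offset}: read the restaurant name and header once")
    else:
        print(f"offset {offset}: reuse the values read earlier")
```

With this shape, changing the start of the range does not require touching the condition.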

Those lines also need to work with the first soup (the restaurant page), not the later soup, because the later one contains only the reviews.

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
        for offset in range(40, 290, 10):
            url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d9783452-Reviews-or{offset}-Boutary-Paris_Ile_de_France.html'
            r = s.get(url)
            soup = bs(r.content, 'lxml')
            if offset == 40:
                inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
                rest_eclf = soup.select_one('.header_links a').text.strip()
            reviews = soup.select('.reviewSelector')
            ids = [review.get('data-reviewid') for review in reviews]
            r = s.post(
                'https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=',
                data = {'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
                headers = {'referer': r.url}
                )

            soup = bs(r.content, 'lxml')

            for review in soup.select('.reviewSelector'):
                name_client = review.select_one('.info_text > div:first-child').text.strip()
                date_rev_cl = review.select_one('.ratingDate')['title'].strip()
                titre_rev_cl = review.select_one('.noQuotes').text.strip()
                opinion_cl = review.select_one('.partial_entry').text.replace("\n","").strip()
                row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}" , f"{titre_rev_cl}", f"{opinion_cl}"]


For your first code block, you are using an invalid attribute name ('data.reviewid'). It should be:

ids = [review.get('data-reviewid') for review in reviews]
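
To see why the wrong attribute name produces the TypeError from the question, here is a small sketch. Plain dicts stand in for the review tags (an assumption for illustration only; a BeautifulSoup tag's get also returns None for a missing attribute), and the ID values are made up.

```python
# Plain dicts stand in for the attributes of each review tag (illustration
# only; the real objects are BeautifulSoup tags). The IDs are made up.
reviews = [{"data-reviewid": "111"}, {"data-reviewid": "222"}]

# Wrong key: every lookup returns None, so join() raises TypeError.
bad_ids = [review.get("data.reviewid") for review in reviews]
try:
    ",".join(bad_ids)
    join_error = None
except TypeError as exc:
    join_error = str(exc)
print(join_error)

# Correct key, with any missing values filtered out before joining.
good_ids = [rid for rid in (review.get("data-reviewid") for review in reviews) if rid]
joined_ids = ",".join(good_ids)
print(joined_ids)
```

Filtering out missing values is an optional extra guard, not something the original answer does.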

Note that I have added an is None test to handle elements that are not found. This should be added to the top version as well.

import requests
from bs4 import BeautifulSoup as bs

with requests. Session() as s:
        for offset in range (270, 1230, 10):
            url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d6575305-Reviews-or{offset}-Il_Etait_Un_Square-Paris_Ile_de_France.html'
            r = s.get(url)
            soup = bs(r.content, 'lxml')
            if offset == 270:
                inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
                rest_eclf = soup.select_one('.header_links a').text.strip()
            reviews = soup.select('.reviewSelector')
            ids = [review.get('data-reviewid') for review in reviews]
            r = s.post(
                    'https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=',
                    data = {'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
                    headers = {'Referer': r.url}
                    )

            soup = bs(r.content, 'lxml')

            for review in soup.select('.reviewSelector'):
                name_client= review.select_one('.info_text > div:first-child')
                if name_client is None:
                    name_client = 'N/A'
                else:
                    name_client = name_client.text.strip()

                date_rev_cl = review.select_one('.ratingDate')
                if date_rev_cl is None:
                    date_rev_cl = 'N/A'
                else:
                    date_rev_cl  = date_rev_cl['title'].strip()

                titre_rev_cl = review.select_one('.noQuotes')
                if titre_rev_cl is None:
                    titre_rev_cl = 'N/A'
                else:
                    titre_rev_cl = titre_rev_cl.text.strip()

                opinion_cl = review.select_one('.partial_entry')
                if opinion_cl is None:
                    opinion_cl = 'N/A'
                else:
                    opinion_cl = opinion_cl.text.replace("\n","").strip()

                row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}", f"{titre_rev_cl}", f"{opinion_cl}"]
                print(row)
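
The repeated is None checks for the text fields could be folded into one helper. This is a sketch rather than part of the original answer: the name text_or_default is hypothetical, and SimpleNamespace stands in for whatever select_one returns. The date field reads the title attribute instead of the text, so it would still need its own check.

```python
from types import SimpleNamespace


def text_or_default(node, default="N/A"):
    """Return the cleaned text of node, or default when node is None."""
    if node is None:
        return default
    return node.text.replace("\n", "").strip()


# SimpleNamespace stands in for a tag returned by select_one (illustration only).
found = SimpleNamespace(text="  Very good \n")
print(text_or_default(found))
print(text_or_default(None))
```

A call such as text_or_default(review.select_one('.noQuotes')) would then replace each four-line if/else block.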
