Weird character that does not exist in the HTML source appears with Python BeautifulSoup


Question

I watched a video that teaches how to use BeautifulSoup and requests to scrape a website. Here's the code:

from bs4 import BeautifulSoup as bs4
import requests

pages_to_scrape = 1
pages = []  # URLs of the catalogue pages to fetch

for i in range(1, pages_to_scrape + 1):
    url = 'http://books.toscrape.com/catalogue/page-{}.html'.format(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    # print(soup.prettify())
    # Nested inside the page loop so that every page is processed.
    for j in soup.find_all('p', class_='price_color'):
        price = j.get_text()
        print(price)

The code is working well. But in the results I noticed a weird character before the currency symbol, and when I checked the HTML source I did not find that character. Any ideas why this character appears, and how can it be fixed? Is using replace enough, or is there a better approach?

Recommended answer

It seems to me that you have described the problem incorrectly. I assume you are on Windows, where your terminal or IDLE uses the default encoding of cp1252, but the data you are dealing with is UTF-8, so you have to configure your terminal/IDLE to use UTF-8.
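To illustrate the kind of mismatch described above, here is a minimal sketch that uses only the standard library. It assumes the page bytes are UTF-8 and that some step decodes them as cp1252; the sample price string is an illustrative value, not data taken from the question.

```python
# Illustrative price string (assumed sample value).
price = "£51.77"

# What travels over the network: the UTF-8 encoding of the text.
raw = price.encode("utf-8")

# Decoding UTF-8 bytes with the wrong codec yields a stray character,
# because "£" is two bytes in UTF-8 and cp1252 maps each byte separately.
garbled = raw.decode("cp1252")

# Decoding with the matching codec gives back the original text.
correct = raw.decode("utf-8")

# The damage is reversible as long as the wrong codec is known.
repaired = garbled.encode("cp1252").decode("utf-8")

print(ascii(garbled))   # escaped so it prints safely on any terminal
print(ascii(correct))
print(repaired == price)
```

This also suggests why a plain replace only hides the symptom: the extra character comes from a decoding step, so decoding the bytes with the right codec (or handing the raw bytes to the parser) removes it at the source.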

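import requests
from bs4 import BeautifulSoup


def main(url):
    # One session reuses the underlying connection for every request.
    with requests.Session() as req:
        for item in range(1, 10):
            r = req.get(url.format(item))
            print(r.url)
            # Pass the raw bytes so BeautifulSoup detects the encoding itself.
            soup = BeautifulSoup(r.content, 'html.parser')
            # (link text, price) for every book listed on the page.
            goal = [(x.h3.a.text, x.select_one("p.price_color").text)
                    for x in soup.select("li.col-xs-6")]
            print(goal)


main("http://books.toscrape.com/catalogue/page-{}.html")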

  1. Always try to follow the DRY principle, which means "Don't Repeat Yourself".
  2. Since you are requesting the same host, keep a single session so the TCP connection is reused instead of being opened and closed for every request. Repeatedly opening and closing connections can get your requests blocked, because the back end may treat the pattern as a DDoS attack. Imagine opening your browser, visiting a website, closing the browser, and repeating that cycle over and over!
  3. Wrapping the code in Python functions usually makes it cleaner and easier to read than one long block that reads like journal text.

Notes: pay attention to the usage of range(), the {} format string, and CSS selectors.
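As a small sketch of the range() and {} format-string part of that note, the snippet below builds the page URLs without making any request. The helper name build_page_urls and its parameters are hypothetical, introduced only for illustration.

```python
URL_TEMPLATE = "http://books.toscrape.com/catalogue/page-{}.html"


def build_page_urls(template, first, last):
    """Return the URLs for pages first..last (inclusive).

    Hypothetical helper: range() excludes its upper bound, hence last + 1.
    """
    return [template.format(number) for number in range(first, last + 1)]


urls = build_page_urls(URL_TEMPLATE, 1, 3)
for page_url in urls:
    print(page_url)
```

Keeping the template in one place means the page count can change without touching the URL string itself.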
