Webscraping with Python, I can't see the actual names of classes when I say inspect page

Problem description

Ok so I am just learning Python and I want to use web scraping. I was watching a tutorial, and the tutor's "inspect" page (or whatever it is called) looks totally different from mine. What he sees is class = "ProfileHeaderCard", while what I see is class = "css-1dbjc4n r-1iusvr4 r-16y2uox r-5f2r5o r-m611by". THE IMPORTANT PART is that the BeautifulSoup library does not work when I use my version of the class name, but it works when I use his. When I run print(soup.find('div', {"class":"css-1dbjc4n r-1iusvr4 r-16y2uox r-5f2r5o r-m611by"})), it returns None. What is going on lol, please help.

from bs4 import BeautifulSoup
import urllib.request

theurl = 'https://twitter.com/1kasecorba'
# Fetch the raw HTML returned by the server (no JavaScript is executed here)
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, 'html.parser')

# Search for the class name copied from the browser's inspect panel
print(soup.find('div', {"class":"css-1dbjc4n r-1iusvr4 r-16y2uox r-5f2r5o r-m611by"}))

Recommended answer

It does not find it because it is not there. Note that when you perform a GET request on a page, you often do not get the same source you see when you open the page in a browser and view its source (Ctrl + U).

I wrote a script that writes the source retrieved by urllib to a text file, and the class you are looking for is not in it. There is nothing wrong with the soup.find function, as the example on the last line shows.

from bs4 import BeautifulSoup
import urllib.request

theurl = 'https://twitter.com/1kasecorba'
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, 'html.parser')

# Dump the HTML that the server actually returned so you can search it yourself
file = open("page_source.txt", "w+", encoding="utf-8")
file.write(str(soup))
file.close()

# works like a charm: this class does appear in the fetched HTML
print(soup.find('button', {"class":"modal-btn modal-close modal-close-fixed js-close"}))

If you want to see the source as the browser renders it, you will need a tool like Selenium (there are probably better options; I can't give much advice on this topic).
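As a rough sketch of the Selenium route (not part of the original answer): the snippet below assumes Chrome and a matching chromedriver are available, uses a simple sleep instead of a proper wait, and reuses the auto-generated class name from the question, which Twitter can change at any time.

from bs4 import BeautifulSoup
from selenium import webdriver
import time

theurl = 'https://twitter.com/1kasecorba'

# Requires a local Chrome/Chromium install and a matching chromedriver
driver = webdriver.Chrome()
driver.get(theurl)

# Crude wait so the page's JavaScript has time to render the profile;
# an explicit WebDriverWait condition would be more robust
time.sleep(5)

# page_source holds the DOM after JavaScript ran, i.e. roughly what
# the browser's "inspect" panel shows
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# The generated class name from the question; treat it as an example only
print(soup.find('div', {"class": "css-1dbjc4n r-1iusvr4 r-16y2uox r-5f2r5o r-m611by"}))

Even with Selenium, selectors built from these generated class names break easily, so it is usually better to anchor on more stable attributes when they exist.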
