How to scrape text of specific cell from website using BeautifulSoup

Question

I've been trying to scrape text from a website for the past hour and have made no progress, simply because I know very little about how to actually use BeautifulSoup.

import requests
from bs4 import BeautifulSoup

def select_ticker():
    url = "https://www.barchart.com/stocks/performance/gap/gap-up?screener=nasdaq"

    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html)

    find = soup.findAll('td, {"data-ng-if:"row.blankRow"}')

    print(find)

I'm going to this website and trying to get the first symbol from the table. Right now that symbol is BFBG.

I know this should be extremely easy for someone who actually knows what they're doing with BeautifulSoup, but I don't understand how to search for things, and this website doesn't make it easy to search either.

I appreciate your time, and thanks for the help!

Recommended answer

Actually, you cannot scrape the first symbol from the HTML returned by the GET request, because the table appears to be filled in afterwards from a separate JSON request. You need to fetch that JSON instead.

import json
import urllib3

# Query the JSON endpoint that the page requests to fill the table
http = urllib3.PoolManager()
r = http.request('GET', 'https://core-api.barchart.com/v1/quotes/get?lists=stocks.gaps.up.nasdaq&orderDir=desc&fields=symbol,symbolName,lastPrice,priceChange,gapUp,highPrice,lowPrice,volume,tradeTime,symbolCode,symbolType,hasOptions&orderBy=gapUp&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1')
# The rows are under 'data'; the first row holds the first symbol in the table
print(json.loads(r.data)['data'][0]['symbol'])

And there you have the first symbol.

The JSON also contains all the other information you probably want to scrape.
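To illustrate pulling more than the symbol out of the response, here is a minimal sketch that works on a small inline sample instead of the live endpoint. Only the top-level 'data' list and the 'symbol' key are confirmed by the code above; the other keys are assumed to mirror the names in the URL's fields parameter, every value in SAMPLE is made up, and extract_rows is a hypothetical helper.

```python
import json

# Hypothetical payload: only the "data" list and the "symbol" key are confirmed
# by the answer. The other keys are assumed from the URL's "fields" parameter,
# and all values are made up for illustration.
SAMPLE = '''
{"data": [
  {"symbol": "BFBG", "symbolName": "Example Co", "lastPrice": "1.23", "gapUp": "0.10"},
  {"symbol": "XYZ", "symbolName": "Another Co", "lastPrice": "4.56", "gapUp": "0.05"}
]}
'''

def extract_rows(raw, fields):
    """Return one dict per quote, keeping only the requested fields."""
    payload = json.loads(raw)
    # .get() yields None for a field that a row does not contain
    return [{name: row.get(name) for name in fields} for row in payload["data"]]

rows = extract_rows(SAMPLE, ["symbol", "lastPrice"])
print(rows[0]["symbol"])  # symbol of the first row
for row in rows:
    print(row)
```

With the real response, passing r.data in place of SAMPLE should work the same way, provided the keys match what the endpoint actually returns.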

Here is how you can usually find those JSON endpoints:

Open the browser's developer console, go to the Network tab, select the XHR filter, and reload the page. If a lot of resources are fetched, you can also filter by domain name.

Separately, the syntax of this call is wrong: soup.findAll('td, {"data-ng-if:"row.blankRow"}'). The tag name and the attribute filter are inside a single string, so the whole string is treated as a tag name and nothing matches.

According to the BS4 documentation, you need to pass the tag name and a dictionary of attributes as separate arguments to the find_all method: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all

soup.find_all('td', {'data-ng-if':'row.blankRow'})
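As a rough check of the difference, the sketch below runs both calls against a small hand-written HTML snippet. The snippet is hypothetical and is not the real page markup; it only assumes that bs4 is installed and uses the standard library's "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page is structured differently.
html = """
<table>
  <tr>
    <td data-ng-if="row.blankRow">blank</td>
    <td class="symbol">BFBG</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag name and attribute filter passed as separate arguments
cells = soup.find_all("td", {"data-ng-if": "row.blankRow"})
print([cell.get_text() for cell in cells])

# The original call: the whole string is taken as a tag name, so nothing matches
broken = soup.find_all('td, {"data-ng-if:"row.blankRow"}')
print(len(broken))
```

This only demonstrates the find_all signature. On the live page the rows still come from the JSON request, so the corrected call alone would not return the symbol.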

Hope this helps.
