Scrape data from a link in a webpage using Beautiful Soup (Python)


Question

I am trying to scrape data (Insta ID, average likes, average comments) from a URL inside this webpage: https://starngage.com/app/global/influencer/ranking/india

The element id of the URL is: @priyankachopra

Similarly, I want to scrape data from all 1000 profiles in the same table.

Can anyone tell me how to do this?

import requests
from bs4 import BeautifulSoup
from prettytable import PrettyTable

tb = PrettyTable(['Name', 'Insta_ID', 'Followers'])
url = 'https://starngage.com/app/global/influencer/ranking/india'
resp = requests.get(url)

soup = BeautifulSoup(resp.text, 'html.parser')

table = soup.find('table', class_='table-responsive-sm')
td = table.findAll('tr')

for i in td[1:]:
    temp = i.select_one("td:nth-of-type(3)").text
    name, insta_id = temp.split('@')
    followers = i.select_one("td:nth-of-type(6)").text
    tb.add_row([name.strip(), insta_id.strip(), followers.strip()])

print(tb)
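A note on the `temp.split('@')` line above: it assumes the name cell contains exactly one `@`. A safer variant splits only on the last `@`, which always precedes the handle. A minimal sketch, using a sample cell value modeled on the table (the exact text is an assumption):

```python
# The 3rd table cell is assumed to look like "Priyanka Chopra @priyankachopra".
# split('@') would break if the display name itself contained an '@';
# rsplit with maxsplit=1 splits only on the last '@', just before the handle.
cell_text = "Priyanka Chopra @priyankachopra"
name, insta_id = cell_text.rsplit("@", 1)
print(name.strip(), insta_id.strip())  # → Priyanka Chopra priyankachopra
```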

Answer

You can do it like this. I haven't fully tested the complete code end to end, since a full run can take up to 10 minutes, but I have tested each part and it works fine for me. If it doesn't work, ask me in the comments. Here's the code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

ids = []   # Instagram handles collected from the ranking table
avgl = []  # average likes per profile
avgc = []  # average comments per profile
for i in range(1,101):
    url = f'https://starngage.com/app/global/influencer/ranking/india?page={i}'
    print(url)
    resp = requests.get(url)
    
    soup = BeautifulSoup(resp.text, 'lxml')
    
    table = soup.find('table', class_='table-responsive-sm')
    trs = table.findAll('tr')
    
    for tr in trs[1:]:
        temp = tr.select_one("td:nth-of-type(3)").text
        _,insta_id = temp.split('@')
        ids.append(insta_id.strip())

for insta_id in ids:
    page = requests.get("https://starngage.com/app/global/influencers/" + insta_id)
    soup=BeautifulSoup(page.content, 'lxml')
    
    x = soup.find("blockquote").find("p").text.strip()
    # Pull the two "is <number>" figures (average likes, then average
    # comments) out of the profile's summary sentence. You can tighten
    # this regex if you find a better approach.
    x = re.findall(r"is \d+", x)
    avl, avc = [y.replace("is ", "") for y in x]
    avgl.append(avl)
    avgc.append(avc)

df = pd.DataFrame({"Insta Id": ids, "Average Likes": avgl, "Average Comments": avgc})

print(df)

df.to_csv("test.csv")
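The regex step above relies on each profile's summary sentence containing two phrases of the form "is <number>" (average likes first, then average comments). A standalone sketch of just that extraction; the sample sentence is an assumption about the page's wording, not real scraped text:

```python
import re

# Hypothetical summary sentence, modeled on what the <blockquote> is
# assumed to contain for a profile page.
text = ("Average engagement rate is computed from recent posts: "
        "average likes is 351234 and average comments is 2456 per post.")

# "is computed" does not match because the pattern requires digits after "is ".
matches = re.findall(r"is \d+", text)          # ['is 351234', 'is 2456']
avg_likes, avg_comments = [m.replace("is ", "") for m in matches]
print(avg_likes, avg_comments)  # → 351234 2456
```

If the site ever rewords the sentence, a more targeted pattern such as `r"average likes is (\d+)"` would fail loudly instead of silently matching the wrong numbers.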

