Extracting data from a script tag using BeautifulSoup in Python


Question

I want to extract the "SNG_TITLE" and "ART_NAME" values from the code inside a "script" tag using BeautifulSoup in Python. (The whole script is too long to paste.)

<script>window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276641","UPLOAD_ID":0,"SNG_TITLE":"Heathens","ART_ID":"647650","PROVIDER_ID":"3","ART_NAME":"Twenty One Pilots","ARTISTS":[{"ART_ID":"647650","ROLE_ID":"0","ARTISTS_SONGS_ORDER":"1","ART_NAME":"Twenty One Pilots","ART_PICTURE":"259dcf52853363d79753ec301377645d","SMARTRADIO":"1","RANK":"487762","LOCALES":[],"__TYPE__":"artist"}],"ALB_ID":"13371165","ALB_TITLE":"Heathens","TYPE":0,"MD5_ORIGIN":"5cea723b83af1ff0a62d65d334b978d4","VIDEO":false,"DURATION":"195","ALB_PICTURE":"3dfc8c9e406cf1bba8ce0695a44a9b7e","ART_PICTURE":"259dcf52853363d79753ec301377645d","RANK_SNG":"967143","SMARTRADIO":"1","FILESIZE_AAC_64":0,"FILESIZE_MP3_64":"0","FILESIZE_MP3_128":"3135946","FILESIZE_MP3_256":0,"FILESIZE_MP3_320":"7839868","FILESIZE_FLAC":"21777150","FILESIZE":"3135946","GAIN":"-12","MEDIA_VERSION":"4","DISK_NUMBER":"1","TRACK_NUMBER":"1","VERSION":"","EXPLICIT_LYRICS":"0","RIGHTS":{"STREAM_ADS_AVAILABLE":true,"STREAM_ADS":"2000-01-01","STREAM_SUB_AVAILABLE":true,"STREAM_SUB":"2000-01-01"},"ISRC":"USAT21601930","DATE_ADD":1497886149,"HIERARCHICAL_TITLE":"","SNG_CONTRIBUTORS":{"mainartist":["Twenty One Pilots"],"engineer":["Adam Hawkins"],"mixer":["Adam Hawkins"],"masterer":["Chris Gehringer"],"drums":["Josh Dun"],"producer":["Mike Elizondo","Tyler Joseph"],"programmer":["Mike Elizondo","Tyler Joseph"],"vocals":["Tyler Joseph"],"writer":["Tyler Joseph"]},"LYRICS_ID":30553991,"__TYPE__":"song"},{"SNG_ID":"99976952","PRODUCT_TRACK_ID":"171067651","UPLOAD_ID":0,"SNG_TITLE":"Stressed Out","ART_ID":"647650","PROVIDER_ID":"3","ART_NAME":"Twenty One Pilots","ARTISTS":[{"ART_ID":"647650","ROLE_ID":"0","ARTISTS_SONGS_ORDER":"1","ART_NAME":"Twenty One Pilots", ...</script>

The idea of the code is to print the user name and all song and artist names that can be found on the given page.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

user_name = soup.find(class_='user-name')
print(user_name.text)

This prints the user name.

for script in soup.find_all('script'):
    print(script.contents) 

If I understand correctly, the script I need contains a dictionary, so I just need to find it and get its contents. The problem is that I don't know how to find exactly this "script". It doesn't have any attributes or anything else that makes it unique. So I tried a loop that finds all scripts on the page and prints their contents, but I'm not sure how to proceed from there.

How do I find only this particular "script" on the page? Can I access the values in a different way?

Recommended answer

Scripts don't change position in the page's code, so you can count them and use an index to get the correct script.

all_scripts[6]

The script's content is an ordinary string, so you can also use standard string functions, e.g.

if '{"loved"' in script.text:

Code with both methods. I use [:100] to display only part of the string.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

all_scripts = soup.find_all('script')

print('--- first method ---')
print(all_scripts[6].text[:100])

print('--- second method ---')
for number, script in enumerate(all_scripts):
    if '{"loved"' in script.text:
        print(number, script.text[:100])

Result:

--- first method ---
window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276
--- second method ---
6 window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276
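
If the number of scripts on the page ever changes, the fixed index stops pointing at the right one, so the string check can be wrapped in a small helper. This is only a sketch: SAMPLE_HTML is a made-up miniature page standing in for r.text, and find_script_by_marker is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the real response body (r.text).
SAMPLE_HTML = """
<html><body>
<script src="app.js"></script>
<script>var unrelated = 1;</script>
<script>window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[]}}}</script>
</body></html>
"""

def find_script_by_marker(soup, marker):
    """Return the first <script> tag whose content contains marker, or None."""
    for script in soup.find_all('script'):
        # .string is None for empty/external scripts, so fall back to ''.
        if marker in (script.string or ''):
            return script
    return None

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
script = find_script_by_marker(soup, 'window.__DZR_APP_STATE__')
print(script.string[:40])
```

Searching for the variable name instead of '{"loved"' is a design choice made here on the assumption that the name is less likely to change than the data inside it.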


Once you have the correct script, you can use slicing to get only the JSON string, use the json module to convert it to a Python dictionary, and then read the data from it.

import requests
from bs4 import BeautifulSoup
import json

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

all_scripts = soup.find_all('script')

# Skip the 27-character prefix 'window.__DZR_APP_STATE__ = ' so only the JSON remains
data = json.loads(all_scripts[6].get_text()[27:])

print('key:', data.keys())
print('key:', data['TAB'].keys())
print('key:', data['DATA'].keys())
print('---')

for item in data['TAB']['loved']['data']:
    print('ART_NAME:', item['ART_NAME'])
    print('SNG_TITLE:', item['SNG_TITLE'])
    print('---')

Result:

key: dict_keys(['TAB', 'DATA'])
key: dict_keys(['loved'])
key: dict_keys(['USER', 'FOLLOW', 'FOLLOWING', 'HAS_BLOCKED', 'IS_BLOCKED', 'IS_PUBLIC', 'CURATOR', 'IS_PERSONNAL', 'NB_FOLLOWER', 'NB_FOLLOWING'])
---
ART_NAME: Twenty One Pilots
SNG_TITLE: Heathens
---
ART_NAME: Twenty One Pilots
SNG_TITLE: Stressed Out
---
ART_NAME: Linkin Park
SNG_TITLE: Numb
---
ART_NAME: Three Days Grace
SNG_TITLE: Animal I Have Become
---
ART_NAME: Three Days Grace
SNG_TITLE: Painkiller
---
ART_NAME: Slipknot
SNG_TITLE: Before I Forget
---
ART_NAME: Slipknot
SNG_TITLE: Duality
---
ART_NAME: Skrillex
SNG_TITLE: Make It Bun Dem
---
ART_NAME: Skrillex
SNG_TITLE: Bangarang (feat. Sirah)
---
ART_NAME: Limp Bizkit
SNG_TITLE: Break Stuff
---
ART_NAME: Three Days Grace
SNG_TITLE: I Hate Everything About You
---
ART_NAME: Three Days Grace
SNG_TITLE: Time of Dying
---
ART_NAME: Three Days Grace
SNG_TITLE: I Am Machine
---
ART_NAME: Three Days Grace
SNG_TITLE: Riot
---
ART_NAME: Three Days Grace
SNG_TITLE: So What
---
ART_NAME: Three Days Grace
SNG_TITLE: Pain
---
ART_NAME: Three Days Grace
SNG_TITLE: Tell Me Why
---
ART_NAME: Three Days Grace
SNG_TITLE: Chalk Outline
---
ART_NAME: Three Days Grace
SNG_TITLE: Gone Forever
---
ART_NAME: Slipknot
SNG_TITLE: The Devil In I
---
ART_NAME: Linkin Park
SNG_TITLE: No More Sorrow
---
ART_NAME: Linkin Park
SNG_TITLE: Bleed It Out
---
ART_NAME: The Doors
SNG_TITLE: Roadhouse Blues
---
ART_NAME: The Doors
SNG_TITLE: Riders On The Storm
---
ART_NAME: The Doors
SNG_TITLE: Break On Through (To The Other Side)
---
ART_NAME: The Doors
SNG_TITLE: Alabama Song (Whisky Bar)
---
ART_NAME: The Doors
SNG_TITLE: People Are Strange
---
ART_NAME: My Chemical Romance
SNG_TITLE: Welcome to the Black Parade
---
ART_NAME: My Chemical Romance
SNG_TITLE: Teenagers
---
ART_NAME: My Chemical Romance
SNG_TITLE: Na Na Na [Na Na Na Na Na Na Na Na Na]
---
ART_NAME: My Chemical Romance
SNG_TITLE: Famous Last Words
---
ART_NAME: The Doors
SNG_TITLE: Soul Kitchen
---
ART_NAME: The Black Keys
SNG_TITLE: Lonely Boy
---
ART_NAME: Katy Perry
SNG_TITLE: I Kissed a Girl
---
ART_NAME: Katy Perry
SNG_TITLE: Hot N Cold
---
ART_NAME: Katy Perry
SNG_TITLE: E.T.
---
ART_NAME: Linkin Park
SNG_TITLE: Given Up
---
ART_NAME: My Chemical Romance
SNG_TITLE: Dead!
---
ART_NAME: My Chemical Romance
SNG_TITLE: Mama
---
ART_NAME: My Chemical Romance
SNG_TITLE: The Sharpest Lives
---
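
The slice [27:] works only while the text before the JSON is exactly 27 characters long and nothing follows the JSON. A possible alternative is to find the first '{' after the variable name and let json.JSONDecoder().raw_decode() parse a single value from there. This is a sketch under assumptions: script_text is a shortened, made-up stand-in for the real script content, the trailing ';' is added only to show that extra text after the JSON is tolerated, and extract_json_after is a hypothetical helper name.

```python
import json

def extract_json_after(text, marker):
    """Decode the first JSON object that follows marker in text."""
    # Position of the first '{' at or after the marker.
    start = text.index('{', text.index(marker))
    # raw_decode parses one JSON value starting at `start` and
    # ignores whatever comes after it (for example a trailing ';').
    data, _end = json.JSONDecoder().raw_decode(text, start)
    return data

# Shortened, hypothetical stand-in for the real script content.
script_text = (
    'window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":['
    '{"SNG_TITLE":"Heathens","ART_NAME":"Twenty One Pilots"}]}}};'
)

data = extract_json_after(script_text, 'window.__DZR_APP_STATE__')
for item in data['TAB']['loved']['data']:
    print('ART_NAME:', item['ART_NAME'])
    print('SNG_TITLE:', item['SNG_TITLE'])
```

With the real page, the same helper could be given the text of the script found earlier instead of script_text.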
