Extracting data from a script tag using BeautifulSoup in Python


Question

I want to extract the "SNG_TITLE" and "ART_NAME" values from the code inside a "script" tag using BeautifulSoup in Python. (The whole script is too long to paste.)

<script>window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276641","UPLOAD_ID":0,"SNG_TITLE":"Heathens","ART_ID":"647650","PROVIDER_ID":"3","ART_NAME":"Twenty One Pilots","ARTISTS":[{"ART_ID":"647650","ROLE_ID":"0","ARTISTS_SONGS_ORDER":"1","ART_NAME":"Twenty One Pilots","ART_PICTURE":"259dcf52853363d79753ec301377645d","SMARTRADIO":"1","RANK":"487762","LOCALES":[],"__TYPE__":"artist"}],"ALB_ID":"13371165","ALB_TITLE":"Heathens","TYPE":0,"MD5_ORIGIN":"5cea723b83af1ff0a62d65d334b978d4","VIDEO":false,"DURATION":"195","ALB_PICTURE":"3dfc8c9e406cf1bba8ce0695a44a9b7e","ART_PICTURE":"259dcf52853363d79753ec301377645d","RANK_SNG":"967143","SMARTRADIO":"1","FILESIZE_AAC_64":0,"FILESIZE_MP3_64":"0","FILESIZE_MP3_128":"3135946","FILESIZE_MP3_256":0,"FILESIZE_MP3_320":"7839868","FILESIZE_FLAC":"21777150","FILESIZE":"3135946","GAIN":"-12","MEDIA_VERSION":"4","DISK_NUMBER":"1","TRACK_NUMBER":"1","VERSION":"","EXPLICIT_LYRICS":"0","RIGHTS":{"STREAM_ADS_AVAILABLE":true,"STREAM_ADS":"2000-01-01","STREAM_SUB_AVAILABLE":true,"STREAM_SUB":"2000-01-01"},"ISRC":"USAT21601930","DATE_ADD":1497886149,"HIERARCHICAL_TITLE":"","SNG_CONTRIBUTORS":{"mainartist":["Twenty One Pilots"],"engineer":["Adam Hawkins"],"mixer":["Adam Hawkins"],"masterer":["Chris Gehringer"],"drums":["Josh Dun"],"producer":["Mike Elizondo","Tyler Joseph"],"programmer":["Mike Elizondo","Tyler Joseph"],"vocals":["Tyler Joseph"],"writer":["Tyler Joseph"]},"LYRICS_ID":30553991,"__TYPE__":"song"},{"SNG_ID":"99976952","PRODUCT_TRACK_ID":"171067651","UPLOAD_ID":0,"SNG_TITLE":"Stressed Out","ART_ID":"647650","PROVIDER_ID":"3","ART_NAME":"Twenty One Pilots","ARTISTS":[{"ART_ID":"647650","ROLE_ID":"0","ARTISTS_SONGS_ORDER":"1","ART_NAME":"Twenty One Pilots", ...</script>

The idea of the code is to print the user name and all the song and artist names that can be found on the given page.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

user_name = soup.find(class_='user-name')
print(user_name.text)

This prints the user name.

for script in soup.find_all('script'):
    print(script.contents) 

If I understand correctly, the script I need contains a dictionary, so I just need to find it and get its contents. The problem is that I don't know how to find exactly this "script": it has no attributes or anything else that makes it unique. So I tried a loop that finds all the scripts on the page and prints their contents, but I'm not sure how to proceed from there.

How do I find only this particular "script" on the page? Can I access the values in a different way?

Recommended answer

The scripts keep the same order in the page's HTML, so you can count them and use an index to get the right one.

all_scripts[6]

A script's content is an ordinary string, so you can also use standard string operations, e.g.

if '{"loved"' in script.text:

Code with both methods. I use [:100] to display only the first part of the string.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

all_scripts = soup.find_all('script')

print('--- first method ---')
print(all_scripts[6].text[:100])

print('--- second method ---')
for number, script in enumerate(all_scripts):
    if '{"loved"' in script.text:
        print(number, script.text[:100])

Result:

--- first method ---
window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276
--- second method ---
6 window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276
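The index 6 only holds while the page keeps the same number and order of scripts, so matching on the script's content is the sturdier of the two methods. A minimal sketch of that idea as a reusable helper is shown here; the function name `find_script_containing` and the `SAMPLE_HTML` string are hypothetical and stand in for the real page, which is not fetched.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Deezer HTML).
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[]}}}</script>
</body></html>
"""


def find_script_containing(soup, marker):
    """Return the first <script> tag whose text contains `marker`, or None."""
    for script in soup.find_all('script'):
        # .string is None for empty or external (src=...) scripts.
        text = script.string or ''
        if marker in text:
            return script
    return None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
state_script = find_script_containing(soup, 'window.__DZR_APP_STATE__')
print(state_script.string[:40])
```

Searching for the variable name rather than `{"loved"` is a design choice: the variable name is likely to stay the same across profile tabs, while the keys inside the data may differ.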


Once you have the right script, you can use slicing to keep only the JSON string, convert it to a Python dictionary with the json module, and then read the data from it.

import requests
from bs4 import BeautifulSoup
import json

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

all_scripts = soup.find_all('script')

# [27:] skips the 27-character prefix 'window.__DZR_APP_STATE__ = ' so only the JSON remains
data = json.loads(all_scripts[6].get_text()[27:])

print('key:', data.keys())
print('key:', data['TAB'].keys())
print('key:', data['DATA'].keys())
print('---')

for item in data['TAB']['loved']['data']:
    print('ART_NAME:', item['ART_NAME'])
    print('SNG_TITLE:', item['SNG_TITLE'])
    print('---')

Result:

key: dict_keys(['TAB', 'DATA'])
key: dict_keys(['loved'])
key: dict_keys(['USER', 'FOLLOW', 'FOLLOWING', 'HAS_BLOCKED', 'IS_BLOCKED', 'IS_PUBLIC', 'CURATOR', 'IS_PERSONNAL', 'NB_FOLLOWER', 'NB_FOLLOWING'])
---
ART_NAME: Twenty One Pilots
SNG_TITLE: Heathens
---
ART_NAME: Twenty One Pilots
SNG_TITLE: Stressed Out
---
ART_NAME: Linkin Park
SNG_TITLE: Numb
---
ART_NAME: Three Days Grace
SNG_TITLE: Animal I Have Become
---
ART_NAME: Three Days Grace
SNG_TITLE: Painkiller
---
ART_NAME: Slipknot
SNG_TITLE: Before I Forget
---
ART_NAME: Slipknot
SNG_TITLE: Duality
---
ART_NAME: Skrillex
SNG_TITLE: Make It Bun Dem
---
ART_NAME: Skrillex
SNG_TITLE: Bangarang (feat. Sirah)
---
ART_NAME: Limp Bizkit
SNG_TITLE: Break Stuff
---
ART_NAME: Three Days Grace
SNG_TITLE: I Hate Everything About You
---
ART_NAME: Three Days Grace
SNG_TITLE: Time of Dying
---
ART_NAME: Three Days Grace
SNG_TITLE: I Am Machine
---
ART_NAME: Three Days Grace
SNG_TITLE: Riot
---
ART_NAME: Three Days Grace
SNG_TITLE: So What
---
ART_NAME: Three Days Grace
SNG_TITLE: Pain
---
ART_NAME: Three Days Grace
SNG_TITLE: Tell Me Why
---
ART_NAME: Three Days Grace
SNG_TITLE: Chalk Outline
---
ART_NAME: Three Days Grace
SNG_TITLE: Gone Forever
---
ART_NAME: Slipknot
SNG_TITLE: The Devil In I
---
ART_NAME: Linkin Park
SNG_TITLE: No More Sorrow
---
ART_NAME: Linkin Park
SNG_TITLE: Bleed It Out
---
ART_NAME: The Doors
SNG_TITLE: Roadhouse Blues
---
ART_NAME: The Doors
SNG_TITLE: Riders On The Storm
---
ART_NAME: The Doors
SNG_TITLE: Break On Through (To The Other Side)
---
ART_NAME: The Doors
SNG_TITLE: Alabama Song (Whisky Bar)
---
ART_NAME: The Doors
SNG_TITLE: People Are Strange
---
ART_NAME: My Chemical Romance
SNG_TITLE: Welcome to the Black Parade
---
ART_NAME: My Chemical Romance
SNG_TITLE: Teenagers
---
ART_NAME: My Chemical Romance
SNG_TITLE: Na Na Na [Na Na Na Na Na Na Na Na Na]
---
ART_NAME: My Chemical Romance
SNG_TITLE: Famous Last Words
---
ART_NAME: The Doors
SNG_TITLE: Soul Kitchen
---
ART_NAME: The Black Keys
SNG_TITLE: Lonely Boy
---
ART_NAME: Katy Perry
SNG_TITLE: I Kissed a Girl
---
ART_NAME: Katy Perry
SNG_TITLE: Hot N Cold
---
ART_NAME: Katy Perry
SNG_TITLE: E.T.
---
ART_NAME: Linkin Park
SNG_TITLE: Given Up
---
ART_NAME: My Chemical Romance
SNG_TITLE: Dead!
---
ART_NAME: My Chemical Romance
SNG_TITLE: Mama
---
ART_NAME: My Chemical Romance
SNG_TITLE: The Sharpest Lives
---
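The slice `[27:]` works because the prefix `window.__DZR_APP_STATE__ = ` is exactly 27 characters long, and `json.loads` also requires that nothing follows the JSON object. If the spacing changes or the site appends a semicolon, both assumptions break. A possible alternative, sketched below on a shortened sample string, locates the opening brace and lets `json.JSONDecoder.raw_decode` parse one value while ignoring whatever comes after it. The helper name `extract_assigned_json` and the trailing semicolon in the sample are assumptions for illustration.

```python
import json


def extract_assigned_json(script_text, variable='window.__DZR_APP_STATE__'):
    """Return the JSON object assigned to `variable` in the script text."""
    # Find the variable name, then the first '{' after it,
    # so the exact spacing around '=' does not matter.
    after_name = script_text.index(variable) + len(variable)
    start = script_text.index('{', after_name)
    # raw_decode parses a single JSON value starting at `start`
    # and returns it together with the index where it ended.
    data, _end = json.JSONDecoder().raw_decode(script_text, start)
    return data


# Shortened, hypothetical sample of the script text (note the trailing ';').
sample = ('window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":['
          '{"SNG_TITLE":"Heathens","ART_NAME":"Twenty One Pilots"}]}}};')

data = extract_assigned_json(sample)
pairs = [(item['ART_NAME'], item['SNG_TITLE'])
         for item in data['TAB']['loved']['data']]
print(pairs)
```

With the real page, the argument would be the text of the script found earlier, for example `all_scripts[6].get_text()`.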
