Extract content of <script> with BeautifulSoup


Question

1/ I am trying to extract part of a script using Beautiful Soup, but it prints an empty result. What's wrong?

URL = "http://www.reuters.com/video/2014/08/30/woman-who-drank-restaurants-tainted-tea?videoId=341712453"
oururl= urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)

for script in soup("script"):
        script.extract()

list_of_scripts = soup.findAll("script")
print list_of_scripts

2/ The goal is to extract the value of the "transcript" key from this JSON:

<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "video": {
        "@type": "VideoObject",
        "headline": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",
        "caption": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",  
        "transcript": "Jan Harding is speaking out for the first time about the ordeal that changed her life.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               "Immediately my whole mouth was on fire."               The Utah woman was critically burned in her mouth and esophagus after taking a sip of sweet tea laced with a toxic cleaning solution at Dickey's BBQ.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               "It was like a fire beyond anything you can imagine. I mean, it was not like drinking hot coffee."               Authorities say an employee mistakenly mixed the industrial cleaning solution containing lye into the tea thinking it was sugar.               The Hardings hope the incident will bring changes in the restaurant industry to avoid such dangerous mixups.               SOUNDBITE: JIM HARDING, HUSBAND, SAYING:               "Bottom line, so no one ever has to go through this again."               The district attorney's office is expected to decide in the coming week whether criminal charges will be filed.",

Recommended answer

extract() removes the tag from the DOM. The loop in the question removes every script tag, so the later findAll("script") call returns an empty list.
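
The effect of extract() can be seen on a small in-memory page. This is only a sketch: the HTML string below is a made-up stand-in for the Reuters page, and it assumes Python 3 with the bs4 package installed and the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page, used here instead of fetching the real URL.
html = """
<html><head>
<script>var a = 1;</script>
<script type="application/ld+json">{"video": {"transcript": "hello"}}</script>
</head><body><p>text</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Count the script tags while they are still part of the tree.
before = len(soup.find_all("script"))

# extract() detaches each tag from the tree and returns the detached tag.
removed = [script.extract() for script in soup("script")]

# The tree no longer contains any script tags.
after = len(soup.find_all("script"))

print(before, after, len(removed))
```

The detached tags are still available in the removed list, but searching the soup no longer finds them, which matches the empty list seen in the question.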

Find the script tag with the type="application/ld+json" attribute and decode its contents with json.loads. You can then access the data as an ordinary Python data structure (a dict, for the given data).

import json
import urllib2

from bs4 import BeautifulSoup

URL = ("http://www.reuters.com/video/2014/08/30/"
       "woman-who-drank-restaurants-tainted-tea?videoId=341712453")
oururl= urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)

data = json.loads(soup.find('script', type='application/ld+json').text)
print data['video']['transcript']
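
For readers on Python 3, the same idea can be sketched without any network access. The HTML string here is a shortened, hypothetical stand-in for the page in the question, "html.parser" is chosen only to avoid an extra dependency, and the None check is an added safeguard that the original answer does not include.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page in the question.
html = """
<html><head>
<script type="text/javascript">var unrelated = true;</script>
<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "video": {
        "@type": "VideoObject",
        "transcript": "Jan Harding is speaking out for the first time."
    }
}
</script>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Select only the JSON-LD script tag by its type attribute.
tag = soup.find("script", type="application/ld+json")
if tag is None or tag.string is None:
    raise ValueError("no JSON-LD script tag found")

# The tag's content is plain JSON text, so json.loads turns it into a dict.
data = json.loads(tag.string)
transcript = data["video"]["transcript"]
print(transcript)
```

On a live page, the html string would instead come from an HTTP response body, and json.loads will raise an error if the embedded JSON is not well formed.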
