Parse the JavaScript returned from BeautifulSoup


Problem description

I would like to parse the webpage http://dcsd.nutrislice.com/menu/meadow-view/lunch/ to grab today's lunch menu. (I've built an Adafruit #IoT Thermal Printer and I'd like to automatically print the menu each day.)

I initially approached this using BeautifulSoup, but it turns out that most of the data is loaded by JavaScript, and I'm not sure BeautifulSoup can handle that. If you view the page source, you'll see the relevant data stored in bootstrapData['menuMonthWeeks'].

# Python 2 (urllib2) with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
# Download the page and parse the raw HTML
soup = BeautifulSoup(urllib2.urlopen(url).read())

This is an easy way to fetch the source and review it.

My question is: what is the easiest way to extract this data so that I can do something with it? Literally, all I want is a string something like:

Southwest Cheese Omelet, Potato Wedges, The Harvest Bar (THB), THB - Cheesy Pesto Bread, Ham Deli Sandwich, Red Pepper Sticks, Strawberries

I've thought about using WebKit to process the page and get the rendered HTML (i.e. what a browser does), but that seems unnecessarily complex. I'd rather simply find something that can parse the bootstrapData['menuMonthWeeks'] data.

Recommended answer

Something like PhantomJS may be more robust, but here's some basic Python code to extract the full menu:

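import json
import re
import urllib2

# Download the raw page source (Python 2)
text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
# Capture the value assigned to bootstrapData['menuMonthWeeks'] and decode it as JSON
menu = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))

print menu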
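The extraction step can also be tried offline. Below is a minimal Python 3 sketch (the code above is Python 2) that swaps in a slightly different technique: the regex only locates the assignment, and `json.JSONDecoder().raw_decode` then parses exactly one JSON value from that position, so whatever follows on the same line is ignored. `SAMPLE_PAGE` and `extract_menu_weeks` are hypothetical names made up for illustration, and the sample text only assumes the page contains an assignment of the shape described in the question.

```python
import json
import re

# Hypothetical stand-in for the downloaded page source; the real page is
# assumed to contain an assignment of this shape, as described in the question.
SAMPLE_PAGE = """
<script>
bootstrapData['menuType'] = "lunch";
bootstrapData['menuMonthWeeks'] = [{"days": [{"date": "2014-01-13", "menu_items": []}]}];
bootstrapData['other'] = {"a": 1};
</script>
"""

# Matches up to and including the whitespace after "=", so the match ends
# exactly where the JSON value begins.
ASSIGNMENT = re.compile(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*")


def extract_menu_weeks(page_text):
    """Return the decoded menuMonthWeeks value, or None if it is absent."""
    match = ASSIGNMENT.search(page_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the trailing ";" and more script).
    value, _end = json.JSONDecoder().raw_decode(page_text, match.end())
    return value


weeks = extract_menu_weeks(SAMPLE_PAGE)
print(weeks[0]["days"][0]["date"])
```

With a live page, `page_text` would be the decoded response body instead of `SAMPLE_PAGE`; if the site ever stops embedding valid JSON there, `raw_decode` raises a `ValueError`.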

After that, you'll want to search through the menu for the date you're interested in.

EDIT: Some overkill on my part:

import itertools
import json
import re
import urllib2

text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
menus = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))

# Flatten the per-week lists of days into a single stream of days
days = itertools.chain.from_iterable(menu['days'] for menu in menus)

# Skip days until the date matches; None if no day has that date
day = next(itertools.dropwhile(lambda day: day['date'] != '2014-01-13', days), None)

if day:
    # One food description per line
    print '\n'.join(item['food']['description'] for item in day['menu_items'])
else:
    print 'Day not found.'
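For the single comma-separated string the question asks for, the same lookup can be sketched in Python 3 as below. This is only a sketch under stated assumptions: `SAMPLE_WEEKS` is hypothetical data shaped like the structure the code above relies on (weeks with `days`, days with `date` and `menu_items`, items with `food.description`), `menu_line` is a made-up helper name, and skipping items without a `food` entry is a defensive guess rather than something confirmed about the real data.

```python
import datetime

# Hypothetical sample shaped like the structure the answer's code relies on:
# a list of weeks, each with "days"; each day has "date" and "menu_items".
SAMPLE_WEEKS = [
    {"days": [
        {"date": "2014-01-13", "menu_items": [
            {"food": {"description": "Southwest Cheese Omelet"}},
            {"food": None},  # assumed: some entries may carry no food
            {"food": {"description": "Potato Wedges"}},
        ]},
        {"date": "2014-01-14", "menu_items": []},
    ]},
]


def menu_line(weeks, date_string):
    """Return the day's food descriptions joined by ', ', or None if no such day."""
    for week in weeks:
        for day in week["days"]:
            if day["date"] == date_string:
                return ", ".join(
                    item["food"]["description"]
                    for item in day["menu_items"]
                    if item.get("food")  # skip entries without a food record
                )
    return None


print(menu_line(SAMPLE_WEEKS, "2014-01-13"))

# For a daily print job, look up today's date in ISO format (YYYY-MM-DD),
# which is the format the dates above use.
today = datetime.date.today().isoformat()
print(menu_line(SAMPLE_WEEKS, today) or "Day not found.")
```

In real use, `SAMPLE_WEEKS` would be replaced by the decoded `menuMonthWeeks` value, and the resulting string could be sent straight to the printer.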
