Parse the JavaScript returned from BeautifulSoup

Question

I would like to parse the webpage http://dcsd.nutrislice.com/menu/meadow-view/lunch/ to grab today's lunch menu. (I've built an Adafruit #IoT Thermal Printer and I'd like to print the menu automatically each day.)

I initially approached this using BeautifulSoup, but it turns out that most of the data is loaded by JavaScript, and I'm not sure BeautifulSoup can handle it. If you view the page source, you'll see the relevant data stored in bootstrapData['menuMonthWeeks'].

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://dcsd.nutrislice.com/menu/meadow-view/lunch/"
soup = BeautifulSoup(urllib2.urlopen(url).read())

This is an easy way to get the page source and look it over.

My question is: what is the easiest way to extract this data so that I can do something with it? Literally, all I want is a string something like:

Southwest Cheese Omelet, Potato Wedges, The Harvest Bar (THB), THB - Cheesy Pesto Bread, Ham Deli Sandwich, Red Pepper Sticks, Strawberries

I've thought about using WebKit to process the page and get the rendered HTML (i.e. what a browser does), but that seems unnecessarily complex. I'd rather simply find something that can parse the bootstrapData['menuMonthWeeks'] data.

Recommended answer

Something like PhantomJS may be more robust, but here's some basic Python code to extract the full menu:

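import json
import re
import urllib2

# Download the raw page source (Python 2).
text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
# Capture the JSON literal assigned to bootstrapData['menuMonthWeeks'] and parse it.
menu = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))

print menu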
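
The regular expression above depends on the greedy `(.*);` stopping at the right semicolon, which only holds if nothing else follows the assignment on the same line. As a hedged alternative, here is a minimal Python 3 sketch that locates the start of the assignment and lets `json.JSONDecoder.raw_decode` read exactly one JSON value. `SAMPLE_HTML` and `extract_menu_weeks` are hypothetical names made up for illustration, and the sample is a tiny stand-in for the real page, so the example runs without network access.

```python
import json
import re

# Hypothetical stand-in for the downloaded page source (not the real page).
# Two statements share a line on purpose, to show that trailing code is ignored.
SAMPLE_HTML = """
<script>
var bootstrapData = {};
bootstrapData['menuMonthWeeks'] = [{"days": [{"date": "2014-01-13", "menu_items": []}]}]; bootstrapData['other'] = {"a": 1};
</script>
"""

# Matches up to and including the "=" and any whitespace after it.
ASSIGNMENT = re.compile(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*")


def extract_menu_weeks(html):
    """Return the value assigned to bootstrapData['menuMonthWeeks'], or None."""
    match = ASSIGNMENT.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value from the start of the string and
    # returns it with the index where it ended; whatever follows is ignored.
    value, _end = json.JSONDecoder().raw_decode(html[match.end():])
    return value


weeks = extract_menu_weeks(SAMPLE_HTML)
print(weeks[0]["days"][0]["date"])  # 2014-01-13
```

This only works while the assigned value is valid JSON; if the page ever embeds a JavaScript object literal that is not JSON, a headless browser would be the safer route.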

After that, you'll want to search through the menu for the date you're interested in.

EDIT: Some overkill on my part:

import itertools
import json
import re
import urllib2

text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
menus = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))

# Flatten the per-week lists of days into a single stream of days.
days = itertools.chain.from_iterable(menu['days'] for menu in menus)

# Skip days until the wanted date turns up; None if it never does.
day = next(itertools.dropwhile(lambda day: day['date'] != '2014-01-13', days), None)

if day:
    print '\n'.join(item['food']['description'] for item in day['menu_items'])
else:
    print 'Day not found.'
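
Since the question asks for a single comma-separated string for today's menu, the lookup above can be adapted along these lines. This is a Python 3 sketch, not the answerer's code: `menu_line` and `SAMPLE_WEEKS` are hypothetical names, and the sample data only mirrors the keys the code above reads (`days`, `date`, `menu_items`, `food`, `description`); the real feed may contain more fields or entries shaped differently.

```python
import datetime
import itertools

# Hypothetical sample in the shape the answer's code assumes.
SAMPLE_WEEKS = [
    {"days": [
        {"date": "2014-01-13", "menu_items": [
            {"food": {"description": "Southwest Cheese Omelet"}},
            {"food": {"description": "Potato Wedges"}},
        ]},
        {"date": "2014-01-14", "menu_items": []},
    ]},
]


def menu_line(weeks, date):
    """Return the food descriptions for `date` joined by ', ', or None if absent."""
    days = itertools.chain.from_iterable(week["days"] for week in weeks)
    day = next((d for d in days if d["date"] == date), None)
    if day is None:
        return None
    return ", ".join(item["food"]["description"] for item in day["menu_items"])


print(menu_line(SAMPLE_WEEKS, "2014-01-13"))  # Southwest Cheese Omelet, Potato Wedges

# For a daily print job, the date would come from the clock instead,
# assuming the feed keeps using ISO dates (YYYY-MM-DD) as in the answer.
today = datetime.date.today().isoformat()
print(menu_line(SAMPLE_WEEKS, today))
```

Returning None for a missing day leaves it to the caller to decide whether to print nothing or a "Day not found." notice.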
