Python scraping pdf from URL

Question

I want to scrape the text from the URL "http://www.nycgo.com/venues/thalia-restaurant#menu". The text I'm interested in is in the 'menu' tab on the page. I tried using BeautifulSoup to get all the text on the page, but the output of the following code is missing all the text in the menu.

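# Python 3; on Python 2 use urllib2.urlopen instead
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

html = urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu")
html = html.read()
soup = BS(html, "html.parser")
print(soup.get_text())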

When I inspect elements of the menu, its content appears to be part of the page's HTML. I did notice that when viewing the page in a browser, the menu takes several seconds to load fully. I'm not sure whether that is why the code above fails to get the menu content.

Any insight would be appreciated.

Recommended answer

soup.get_text() returns all of the text from an HTML document (web page), but the problem here is that the menu is embedded in the page as a PDF, which Beautiful Soup cannot access. The actual PDF file is defined in JavaScript as follows:

{
    name: "menu",
    show: Boolean(1),
    url: "/assets/files/programs/rw/2016W/thalia-restaurant.pdf"
}

The simplest way to extract this is probably a regular expression. Using regular expressions on HTML is generally a bad idea, but here you're looking for something very specific: a file path wrapped in "quotes" and ending in .pdf. The following code finds it and builds the full URL:

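import re
from urllib.request import urlopen  # Python 3 location of urlopen

html = urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu")
html_doc = html.read()  # bytes, hence the bytes pattern below

# Capture a double-quoted string ending in .pdf;
# [^"] keeps the match inside a single pair of quotes.
match = re.search(rb'"([^"]*\.pdf)"', html_doc)
pdf_url = "http://www.nycgo.com" + match.group(1).decode('utf8')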

pdf_url is now:

'http://www.nycgo.com/assets/files/programs/rw/2016W/thalia-restaurant.pdf'
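
As a self-contained illustration of the same idea, the sketch below runs the pattern against an inline sample instead of the live page. The helper name find_pdf_url and the sample bytes are assumptions made for this example. It uses urljoin rather than string concatenation, and it returns None when nothing matches instead of raising.

```python
import re
from urllib.parse import urljoin

# Double-quoted string ending in .pdf; [^"] stops the match at the nearest quote.
PDF_PATTERN = re.compile(rb'"([^"]*\.pdf)"')


def find_pdf_url(html_doc, base_url):
    """Return the absolute URL of the first quoted .pdf path, or None."""
    match = PDF_PATTERN.search(html_doc)
    if match is None:
        return None
    return urljoin(base_url, match.group(1).decode("utf8"))


# Sample bytes modelled on the JavaScript snippet shown earlier (not the live page).
sample = (
    b'{ name: "menu", show: Boolean(1), '
    b'url: "/assets/files/programs/rw/2016W/thalia-restaurant.pdf" }'
)
print(find_pdf_url(sample, "http://www.nycgo.com/venues/thalia-restaurant"))
```

Checking for None before using the result avoids an AttributeError if the page layout changes and the PDF reference disappears.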

Extracting the text from the PDF is a little trickier, however. You can download the file first:

from urllib.request import urlretrieve  # Python 3 location of urlretrieve
urlretrieve(pdf_url, "download.pdf")
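
A server can answer with an HTML error page even when the URL ends in .pdf, so it may be worth confirming that the saved file really is a PDF before parsing it. The sketch below is one way to do that, relying on the fact that PDF files begin with the bytes %PDF-. The helper name looks_like_pdf and the throwaway demo files are assumptions made for this example.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demonstration with two throwaway files rather than a real download.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.4\n")
    with open(bad, "wb") as f:
        f.write(b"<html>Not Found</html>")
    results = (looks_like_pdf(good), looks_like_pdf(bad))

print(results)
```

In the workflow above, the equivalent call would be looks_like_pdf("download.pdf") right after urlretrieve returns.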

Then extract the text using the function from the answer to another question:

text = convert_pdf_to_txt("download.pdf")
print(text)

This returns:

NEW YOUR CITY 
RESTAURANT WEEK

WINTER 2016

MONDAY - FRIDAY
828 Eighth Avenue
New York City, 10019

Tel: 212.399.4444

www.restaurantthalia.com

LUNCH $25
FIRST COURSE
CREAMY POLENTA
fricassee of truffle mushrooms

...
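
The convert_pdf_to_txt function used above is not reproduced in this answer. As one possible stand-in, the sketch below wraps extract_text from the third-party pdfminer.six package. This is an assumption: the referenced answer may use a different library or API, and the output formatting may differ from what is shown above. The name pdf_to_text is hypothetical, and the import is deferred so the definition loads even where pdfminer.six is not installed.

```python
def pdf_to_text(path):
    """Return the text of the PDF at `path` using pdfminer.six (assumed installed)."""
    # Imported lazily: pdfminer.six is a third-party package
    # (pip install pdfminer.six), not part of the standard library.
    from pdfminer.high_level import extract_text

    return extract_text(path)


# Usage, once download.pdf exists:
# text = pdf_to_text("download.pdf")
# print(text)
```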
