Having trouble understanding where to look in source code, in order to create a web scraper
Problem Description
I am a noob with Python; I have been teaching myself on and off since this summer. I am going through the Scrapy tutorial, and occasionally reading more about HTML/XML to help me understand Scrapy. My project for myself is to imitate the Scrapy tutorial in order to scrape http://www.gamefaqs.com/boards/916373-pc. I want to get a list of the thread titles along with the thread URLs; it should be simple!
My problem lies in not understanding XPath, and also HTML, I guess. When viewing the source code for the GameFAQs site, I am not sure what to look for in order to pull the link and title. I want to say just look at the anchor tag and grab the text, but I am confused about how.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["http://www.gamefaqs.com"]
    start_urls = ["http://www.gamefaqs.com/boards/916373-pc"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//a')
        items = []
        for site in sites:
            item = DmozItem()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
I want to change this to work on GameFAQs, so what would I put in this path? I imagine the program returning results something like this:

thread name
thread url

I know the code is not really right, but can someone help me rewrite it to obtain the results? It would help me understand the scraping process better.
Recommended Answer
The layout and organization of a web page can change, and deep tag-based paths can be difficult to deal with. I prefer to pattern-match the text of the links. Even if the link format changes, matching the new pattern is simple.
For GameFAQs, the article links look like:
http://www.gamefaqs.com/boards/916373-pc/37644384
That's the protocol, the domain name, and the literal 'boards' path. '916373-pc' identifies the forum area and '37644384' is the article ID.
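As a quick sanity check (this is not part of the answer's spider, and the variable names are my own), you can see the same decomposition by splitting such a URL with Python 3's standard library:

```python
from urllib.parse import urlparse

url = "http://www.gamefaqs.com/boards/916373-pc/37644384"
parts = urlparse(url)
# parts.scheme is 'http', parts.netloc is 'www.gamefaqs.com'

# The path splits into the literal 'boards' segment, the forum area,
# and the article ID.
segments = parts.path.strip("/").split("/")
area, article_id = segments[1], segments[2]
print(area, article_id)  # 916373-pc 37644384
```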
We can match links for a specific forum area using a regular expression:
reLink = re.compile(r'.*\/boards\/916373-pc\/\d+$')
if reLink.match(link):
Or any forum area using:
reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')
if reLink.match(link):
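To see which URLs these patterns accept, here is a small standalone check (the sample URLs other than the article link are made up for illustration):

```python
import re

# The "any forum area" pattern from the answer
reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')

good = "http://www.gamefaqs.com/boards/916373-pc/37644384"
board_only = "http://www.gamefaqs.com/boards/916373-pc"   # board index, no article ID
other_page = "http://www.gamefaqs.com/pc/916373-pc/faqs"  # not a 'boards' article path

print(bool(reLink.match(good)))        # True
print(bool(reLink.match(board_only)))  # False
print(bool(reLink.match(other_page)))  # False
```

Only links that end in a numeric article ID under a /boards/area/ path match, so the board index and unrelated pages are filtered out.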
Adding link matching to your code, we get:
import re

reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//a')
    items = []
    for site in sites:
        # extract() returns a list of strings; each site is already an
        # anchor node, so select '@href' directly and take the first match
        links = site.select('@href').extract()
        if links and reLink.match(links[0]):
            item = DmozItem()
            item['link'] = links[0]
            item['desc'] = site.select('text()').extract()
            items.append(item)
    return items
Many sites have separate summary and detail pages, or description and file links, where the paths match a template with an article ID. If needed, you can parse the forum area and article ID like this:
reLink = re.compile(r'.*\/boards\/(?P<area>\d+-[^/]+)\/(?P<id>\d+)$')
m = reLink.match(link)
if m:
    areaStr = m.groupdict()['area']
    idStr = m.groupdict()['id']
idStr will be a string, which is fine for filling in a URL template, but if you need to calculate the previous ID, etc., then convert it to a number:
idInt = int(idStr)
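Putting those pieces together, a minimal sketch of computing the previous article's URL (the values are hard-coded here so the snippet stands alone; whether GameFAQs actually assigns consecutive IDs is an assumption):

```python
# Values that would come from the named groups above
areaStr = "916373-pc"
idStr = "37644384"

# Convert to a number, step back one ID, and fill in the URL template
idInt = int(idStr)
prevId = idInt - 1
prevUrl = "http://www.gamefaqs.com/boards/%s/%d" % (areaStr, prevId)
print(prevUrl)  # http://www.gamefaqs.com/boards/916373-pc/37644383
```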
I hope this helps.