Scrapy: URL error, program adds unnecessary characters (URL-codes)


Problem description

I'm using Scrapy to crawl a german forum: http://www.musikerboard.de/forum

It follows all subforums and extracts information from threads.

The problem: during crawling it gives me an error on multiple thread links:

2015-09-26 14:01:59 [scrapy] DEBUG: Ignoring response <404 http://www.musiker-board.de/threads/spotify-premium-paket.621224/%0A%09%09>: HTTP status code is not handled or not allowed

Except for this part, /%0A%09%09, the URL is fine; with it appended, the request returns a 404 error. I don't know why the program keeps adding the code to the end of the URL.

Here is my code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def urlfunc(value):
    value = value.replace("%0A", "")
    value = value.replace("%09", "")
    return value


class spidermider(CrawlSpider):
    name = 'memberspider'
    allowed_domains = ["musiker-board.de"]
    start_urls = ['http://www.musiker-board.de/forum/'
                  # 'http://www.musiker-board.de/'
                  ]  # urls from which the spider will start crawling
    rules = (
        Rule(LinkExtractor(allow=(r'forum/\w*',))),
        Rule(LinkExtractor(allow=(r'threads/\w+',), deny=(r'threads/\w+/[\W\d]+',),
                           process_value=urlfunc), callback='parse_thread'),
    )

Does someone have an explanation for why this keeps happening? (And a solution to it?)

Updated code

Answer

If you do some manual debugging and research you will find that the values at the end of the URL are meta-characters. %0A is a line feed and %09 is a horizontal tab: http://www.w3schools.com/tags/ref_urlencode.asp
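To see how those codes arise, you can percent-encode the raw whitespace yourself. This sketch reconstructs the href from the error message above, assuming the anchor in the page's HTML ends with a literal line feed and two tabs:

```python
from urllib.parse import quote

# Thread link reconstructed from the 404 error message: the href
# ends with a raw line feed ("\n") and two tabs ("\t").
href = "http://www.musiker-board.de/threads/spotify-premium-paket.621224/\n\t\t"

# Percent-encoding the raw characters yields exactly the suffix from the log:
print(quote(href, safe=":/"))
# → http://www.musiker-board.de/threads/spotify-premium-paket.621224/%0A%09%09
```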

Then if you enrich your urlfunc function with manual debug statements (and increase the log level to INFO to see the results better), you will see that the URLs do not end with these characters as literal text in the string; the raw whitespace characters are only converted to the %-codes when the URL is requested as a website.
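This is also why the replace-based urlfunc from the question has no effect: the extracted value contains the raw control characters, not the literal three-character strings "%0A" and "%09". A quick check (using the same reconstructed thread URL as an example):

```python
# Reconstructed value: raw whitespace, not percent-encoded text.
href = "http://www.musiker-board.de/threads/spotify-premium-paket.621224/\n\t\t"

# These replacements look for the literal substrings "%0A"/"%09",
# which are not present, so the value comes back unchanged:
cleaned = href.replace("%0A", "").replace("%09", "")
print(cleaned == href)  # → True: nothing was removed
```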

def urlfunc(value):
    print('original: ', value)
    value = value.replace('%0A', '').replace('%09', '')
    print('replaced: ', value)
    return value

This results in the following output:

original:  http://www.musiker-board.de/threads/spotify-premium-paket.621224/

replaced:  http://www.musiker-board.de/threads/spotify-premium-paket.621224/

original:  http://www.musiker-board.de/members/martin-hofmann.17/
replaced:  http://www.musiker-board.de/members/martin-hofmann.17/

The extra blank lines around the first result appear in the output because that value contains the meta-characters, which get printed as real whitespace.

So the solution is to strip the value:

def urlfunc(value):
    return value.strip()
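A quick check with the reconstructed thread URL from the error message confirms that strip() removes the trailing whitespace:

```python
def urlfunc(value):
    # strip() removes leading/trailing whitespace, including "\n" and "\t"
    return value.strip()

href = "http://www.musiker-board.de/threads/spotify-premium-paket.621224/\n\t\t"
print(urlfunc(href))
# → http://www.musiker-board.de/threads/spotify-premium-paket.621224/
```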

With this change you no longer get any debug messages telling you that the site was not found.
