Problems while trying to crawl links inside visited links with scrapy?
Question
In order to learn scrapy, I am trying to crawl some inner urls from a list of start_urls. The problem is that not all elements from start_urls have inner urls (here I would like to return NaN). Thus, how can I return the following 2-column dataframe (**):
visited_link, extracted_link
https://www.example1.com, NaN
https://www.example2.com, NaN
https://www.example3.com, https://www.extracted-link3.com
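For what it's worth, the NaN cells in that target dataframe fall out naturally if the CSV simply leaves those fields empty; pandas fills them in on read. A minimal sketch, using the table above as literal CSV text:

```python
import io

import pandas as pd

# The target CSV, with empty fields where no inner link was extracted
csv_text = (
    "visited_link,extracted_link\n"
    "https://www.example1.com,\n"
    "https://www.example2.com,\n"
    "https://www.example3.com,https://www.extracted-link3.com\n"
)

# pd.read_csv turns the empty fields into real NaN values
df = pd.read_csv(io.StringIO(csv_text))
print(df['extracted_link'].isna().tolist())  # → [True, True, False]
```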
So far I tried:
# -*- coding: utf-8 -*-
import scrapy
import pandas as pd

from ..items import ToyCrawlerItem  # the Item class defined in the project's items.py

class ToySpider(scrapy.Spider):
    name = "toy_example"
    allowed_domains = ["www.example.com"]
    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a")
        lis_ = []
        for l in links:
            item = ToyCrawlerItem()
            item['visited_link'] = response.url
            item['extracted_link'] = l.xpath('@href').extract_first()
            yield item
            lis_.append(item)
        df = pd.DataFrame(lis_)
        print('\n\n\n\n\n', df, '\n\n\n\n\n')
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
However, the above code is returning me:
Out:
extracted_link,visited_link
https://www.extracted-link.com,https://www.example1.com
I tried to manage the None issue values with:
if l == None:
    item['visited_link'] = 'NaN'
else:
    item['visited_link'] = response.url
But it is not working. Any idea of how to get (**)?

(**) yes, a dataframe; I know that I can do -o, but I will do dataframe operations.
UPDATE

After reading @rrschmidt's answer I tried:
def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    lis_ = []
    for l in links:
        item = ToyItem()
        if len(l) == 0:
            item['visited_link'] = 'NaN'
        else:
            item['visited_link'] = response.url
        #item['visited_link'] = response.url
        item['extracted_link'] = l.xpath('@href').extract_first()
        yield item
        print('\n\n\n Aqui:\n\n', item, "\n\n\n")
        lis_.append(item)
    df = pd.DataFrame(lis_)
    print('\n\n\n\n\n', df, '\n\n\n\n\n')
    df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
Nevertheless, it still returned the same wrong output. Could anybody help me clarify this issue?
Answer
As far as I can see there are two problems with your scraper:
- As parse is called for every element in start_urls, and you are creating and saving a new dataframe for each link, the dataframes you are generating are overwriting each other. This is why you only end up with the output of the last parse call in crawled_table.csv.

Solution for this: create the dataframe only one time and push all items into the same dataframe object.
Then save the dataframe in each parse call, just in case the scraper has to stop before finishing.
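The create-once / save-every-call bookkeeping can be illustrated without scrapy at all. This is a hypothetical stand-in class, not the actual spider, just to show the pattern:

```python
import csv
import io


class LinkTable:
    """Stand-in for the spider's shared state: one list created once,
    re-serialized to CSV on every parse call."""

    def __init__(self):
        self.rows = []  # created once, shared by all calls

    def record(self, visited_link, extracted_link):
        # every call appends to the SAME list instead of starting over
        self.rows.append({'visited_link': visited_link,
                          'extracted_link': extracted_link})
        # rewrite the full CSV each time, so an early stop still
        # leaves everything collected so far on disk
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=['visited_link', 'extracted_link'])
        writer.writeheader()
        writer.writerows(self.rows)
        return buf.getvalue()


table = LinkTable()
table.record('https://www.example1.com', 'NaN')
out = table.record('https://www.example3.com', 'https://www.extracted-link3.com')
print(out)  # both rows survive; nothing was overwritten
```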
- if l == None: won't work, as response.xpath returns an empty list if no matches were found. So doing if len(l) == 0: should do.
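The distinction is easy to see with a plain list standing in for the SelectorList that response.xpath returns (this sketch doesn't use scrapy itself):

```python
# What response.xpath(...) gives back when nothing matches: an empty
# (Selector)list, never None
links = []

# Comparing the list to None is always False, so that branch never runs
assert (links == None) is False

# The length check is what actually detects "no matches"
assert len(links) == 0
print("empty list detected via len(), not via == None")
```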
In a gist, here's how I would structure the scraper (code not tested!):
# -*- coding: utf-8 -*-
import scrapy
import pandas as pd

class ToySpider(scrapy.Spider):
    name = "toy_example"
    allowed_domains = ["www.example.com"]
    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    df = pd.DataFrame()

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
        items = []
        if len(links) == 0:
            item = ToyItem()
            # build item with visited_link = NaN here
            item['visited_link'] = response.url
            item['extracted_link'] = 'NaN'
            items.append(item)
        else:
            for l in links:
                item = ToyItem()
                # build the item as you previously did here
                item['visited_link'] = response.url
                item['extracted_link'] = l.xpath('@href').extract_first()
                items.append(item)

        items_df = pd.DataFrame(items)
        self.df = self.df.append(items_df, ignore_index=True)
        print('\n\n\n\n\n', self.df, '\n\n\n\n\n')
        self.df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

        return items
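One caveat if you run this against a current pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the self.df = self.df.append(...) line would need pd.concat instead. A minimal sketch of the replacement, with plain dicts standing in for the items:

```python
import pandas as pd

df = pd.DataFrame()  # the accumulator, as in the spider

# rows produced by one parse call (dicts standing in for ToyItem)
items_df = pd.DataFrame([
    {'visited_link': 'https://www.example3.com',
     'extracted_link': 'https://www.extracted-link3.com'},
])

# pd.concat replaces the removed DataFrame.append
df = pd.concat([df, items_df], ignore_index=True)
print(len(df))  # → 1
```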