How to find ALL the jobs listed in a website?


Question


I would like to get all the jobs posted on the website https://www.germanystartupjobs.com using Scrapy. As the jobs are loaded by a POST request, I set start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']. I found this URL on the first page in the Network tab of the Chrome dev tools, filtering by method:POST.


I thought that on the 2nd page I would get a different URL, but that doesn't seem to be the case here. I also tried

start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]


to generate more pages with indexes, which doesn't help. The current version of my code is here:

import scrapy
import json
import re
import textwrap 


class GermanyStartupJobs(scrapy.Spider):

    name = 'gsjobs'
    start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]

    def parse(self, response):

        data = json.loads(response.body)
        html = data['html']
        selector = scrapy.Selector(text=data['html'], type="html")
        hrefs = selector.xpath('//a/@href').extract()

        print "LENGTH = ", len(hrefs)

        for href in hrefs:
            yield scrapy.Request(href, callback=self.parse_detail)


    def parse_detail(self, response):

        try:
            full_d = str(response.xpath(
                '//div[@class="col-sm-5 justify-text"]//*/text()').extract())

            full_des_li = full_d.split(',')
            full_des_lis = []

            for f in full_des_li:
                ff = "".join((f.strip().replace('\n', '')).split())
                if len(ff) < 3:
                    continue 
                full_des_lis.append(f)

            full = 'u'+ str(full_des_lis)

            length = len(full)
            full_des_list = textwrap.wrap(full, length // 3)[:-1]

            full_des_list.reverse()


            # get the job title             
            try:
                title = response.css('.job-title').xpath('./text()').extract_first().strip()
            except:
                print "No title"
                title = ''

            # get the company name
            try:
                company_name = response.css('.company-title').xpath('./normal/text()').extract_first().strip()
            except:
                print "No company name"
                company_name = ''


            # get the company location  
            try:
                company_location = response.xpath('//a[@class="google_map_link"]/text()').extract_first().strip()
            except:
                print('No company location')
                company_location = ''

            # get the job poster email (if available)            
            try:
                pattern = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)

                for text in full_des_list:
                    email = pattern.findall(text)[-1]
                    if email is not None:
                        break   
            except:
                print('No email')
                email = ''

            # get the job poster phone number(if available)                        
            try:
                r = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
                phone = r.findall(full_des_list[0])[-1]

                if phone is not None:
                    phone = '+49-' +phone

            except:
                print('no phone')
                phone = ''

            yield {
                'title': title,
                'company name': company_name,
                'company_location': company_location, 
                'email': email,
                'phone': phone,
                'source': u"Germany Startup Job" 
            }

        except:
            print('Not valid')
            # raise Exception("Think better!!")


I would like to get similar info from at least the first 17 pages of the website. How could I achieve that and improve my code? After getting the required info, I plan to use multi-threading to speed up the process and nltk to search for the poster name (if available); a rough sketch of the nltk idea follows below.
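As an aside, the nltk part of that plan could start from something like this sketch: it runs nltk's stock tokenizer, POS tagger, and named-entity chunker over a piece of text and keeps the PERSON chunks. It is only an illustration of the idea, untested against this site's job pages, with a made-up example sentence, and it assumes the usual one-time nltk resource downloads.

import nltk

# One-time downloads (uncomment on first run):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')

def find_person_names(text):
    # Tokenize, POS-tag, chunk named entities, and keep PERSON chunks
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)
    return [' '.join(leaf[0] for leaf in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == 'PERSON']

# Hypothetical input, just to show the shape of the output
print(find_person_names('Please contact Anna Schmidt for details.'))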

Answer


You'll have to actually figure out how that data is passed between the client and the server in order to scrape the site that way, by looking at the content. The page of data you want, so to speak, probably can't be expressed in the URL alone.
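A quick way to check is to replay the POST request outside the browser and see what comes back for different pages. The sketch below does that with requests; the page form field is an assumption based on how such AJAX listing endpoints commonly paginate, so confirm the real field names in the request payload shown in the Network tab.

import requests

URL = 'https://www.germanystartupjobs.com/jm-ajax/get_listings/'

for page in (1, 2):
    # `page` is an assumed form field -- verify it in the dev tools payload
    resp = requests.post(URL, data={'page': page})
    data = resp.json()
    print(page, sorted(data.keys()), len(data.get('html', '')))

If the length of the returned html changes with the page number, the endpoint paginates via form data rather than via the URL.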


Have you analyzed the network connections the site makes when you visit it in a browser? It might be pulling content from URLs that you, too, can access to retrieve the data in a machine-readable fashion. That'd be a lot easier than scraping the rendered site.
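If the endpoint does paginate via POST form data, the paging can be driven from the spider itself with scrapy.FormRequest, keeping the question's JSON handling. A minimal sketch, again assuming a page form field and the 17 pages mentioned in the question (both unconfirmed):

import json

import scrapy


class GermanyStartupJobsSpider(scrapy.Spider):

    name = 'gsjobs'
    listings_url = 'https://www.germanystartupjobs.com/jm-ajax/get_listings/'

    def start_requests(self):
        # One POST per page; `page` is an assumed form field name
        for page in range(1, 18):
            yield scrapy.FormRequest(self.listings_url,
                                     formdata={'page': str(page)},
                                     callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)
        selector = scrapy.Selector(text=data['html'], type='html')
        for href in selector.xpath('//a/@href').extract():
            # urljoin guards against relative links in the returned HTML
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_detail)

    def parse_detail(self, response):
        # ... the same per-job extraction as in the question ...
        pass

Note that Scrapy already schedules these requests concurrently through its downloader, so a separate multi-threading layer shouldn't be needed for speed.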
