How to get the job description using scrapy?

Question

I'm new to scrapy and XPath, but I have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using scrapy. As you can see, the email and phone number are provided as plain text inside a <p> tag, which makes them hard to extract.

My idea is to first get the text inside the Job Overview, or at least all the text about this job, and then use a regex to get the email, the phone number and, if possible, the name of the person.

So I fired up the scrapy shell with the command scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and got the response from there.

Now I am trying to get all the text from the div job_description, but I actually get nothing. I used

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

It returns [u'\t\t\t\n\t\t ']

How do I get all the text from the page mentioned? Obviously, extracting the attributes mentioned above comes afterwards, but first things first.

Update: this selection only returns []:

response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()

Solution

You were close with

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

The div tag does not actually contain any direct text besides what you got:

<div class="job_description" (...)>
    "This is the text you are getting"
    <p>"This is the text you want"</p>
</div>

As you can see, the text you get with response.xpath('//div[@class="job_description"]/text()').extract() is the text that sits directly inside the div tag, not the text inside the tags nested in it. For that you would need:

response.xpath('//div[@class="job_description"]//*/text()').extract()

This selects every element nested at any depth inside div[@class="job_description"], not only the direct children, and returns the text of each one.

You will see that this also returns a lot of useless text, since you still get all the \n and similar whitespace. I therefore suggest narrowing your XPath down to the element you want instead of taking a broad approach.

For example, the entire job description would be in

response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
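
Once the text has been extracted, the regex step planned in the question could look roughly like the sketch below. The input list is a hypothetical stand-in for what .extract() might return, and both patterns are deliberately simple assumptions that will miss some valid emails and phone numbers, so treat them as a starting point rather than a complete solution.

```python
import re

# Hypothetical stand-in for the list returned by .extract(); not real page data.
texts = [
    '\n\t',
    'Contact: Jane Doe',
    'Email: ',
    'jane.doe@example.com',
    'Phone: +49 30 1234567',
    '\n',
]

# Drop whitespace-only entries and join the rest into one searchable string.
full_text = ' '.join(t.strip() for t in texts if t.strip())

# Simplistic patterns, assumed for illustration only.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
PHONE_RE = re.compile(r'\+?\d[\d\s()/-]{6,}\d')

emails = EMAIL_RE.findall(full_text)
phones = PHONE_RE.findall(full_text)

print(emails)
print(phones)
```

Extracting the person's name is harder to do with a regex alone, because names have no fixed shape; a label such as "Contact:" in the text, if the page has one, would be the more reliable hook.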
