Using LXML with Html, Requests, and ETree, it gives links, but won't let me search links for specific text


Problem description

I am trying to pull specific data out of the link provided below. When I run the code, it gives me all of the href links as expected, but when I test further for the same string using the contains syntax, it comes back empty.

I've read the docs, as well as DevHints, and everywhere I look, the "contains" syntax is the recommended way to capture what I'm looking for when all I know is that the string will be present, but not where or how.

I'm trying to build a scraper to help a lot of people who were recently laid off find new work, so any assistance is greatly appreciated.

Code:

from lxml import html, etree
import requests

page = requests.get('https://ea.gr8people.com/index.gp?method=cappportal.showPortalSearch&sysLayoutID=123')

# print(page.content)

tree = html.fromstring(page.content)

print(tree)
# Select All Nodes

AllNodes = tree.xpath("//*")

# Select Only hyperlink nodes

AllHyperLinkNodes = tree.xpath("//*/a")

# Iterate through all Node Links

for node in AllHyperLinkNodes:
        print(node.values())

print("======================================================================================================================")

# select using a condition 'contains'
# NodeThatContains = tree.xpath('//td[@class="search-results-column-left"]/text()')
NodeThatContains = tree.xpath('//*/a[contains(text(),"opportunityid")]')

for node in NodeThatContains:
        print(node.values())

# Print the link that 'contains' the text
# print(NodeThatContains[0].values())

Recommended answer

BeautifulSoup-based solution

from bs4 import BeautifulSoup
import requests

page = requests.get('https://ea.gr8people.com/index.gp?method=cappportal.showPortalSearch&sysLayoutID=123').content

soup = BeautifulSoup(page, 'html.parser')
links = soup.find_all('a')
links = [a for a in links if a.attrs.get('href') and 'opportunityid' in a.attrs.get('href')]
print('-- opportunities --')
for idx, link in enumerate(links):
    print('{}) {}'.format(idx, link))

Output

-- opportunities --
0) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=154761&amp;opportunityid=154761">
                                        2D Capture Artist - 6 month contract
                                    </a>
1) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=154426&amp;opportunityid=154426">
                                        Accounting Supervisor
                                    </a>
2) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=152147&amp;opportunityid=152147">
                                        Advanced Analyst
                                    </a>
3) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=153395&amp;opportunityid=153395">
                                        Advanced UX Researcher
                                    </a>
4) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=151309&amp;opportunityid=151309">
                                        AI Engineer
                                    </a>
5) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=150468&amp;opportunityid=150468">
                                        AI Scientist
                                    </a>
6) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=151310&amp;opportunityid=151310">
                                        AI Scientist - NLP Focus
                                    </a>
7) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=153351&amp;opportunityid=153351">
                                        AI Software Engineer (Apex Legends)
                                    </a>
8) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=152737&amp;opportunityid=152737">
                                        AI Software Engineer (Frostbite)
                                    </a>
9) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=154764&amp;opportunityid=154764">
                                        Analyste Qualité Sénior / Senior Quality Analyst
                                    </a>
10) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=153948&amp;opportunityid=153948">
                                        Animator 1
                                    </a>
11) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=151353&amp;opportunityid=151353">
                                        Applications Agreement Analyst
                                    </a>
12) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=154668&amp;opportunityid=154668">
                                        AR Analyst I
                                    </a>
13) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=153609&amp;opportunityid=153609">
                                        AR Specialist
                                    </a>
14) <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;inp1541=&amp;inp1375=154773&amp;opportunityid=154773">
                                        Artiste Audio / Audio Artist
                                    </a>
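The output above also suggests why the original XPath came back empty: "opportunityid" appears only in each link's href attribute, while text() matches the visible link text (the job title). If you would rather stay with lxml, a minimal sketch is to test the attribute with contains(@href, ...) instead. The sample HTML and the find_opportunity_links helper below are hypothetical, written only to mimic the markup shown in the output; the base URL is the page URL from the question.

```python
from urllib.parse import urljoin

from lxml import html

# Hypothetical sample that mimics the markup shown in the output above.
SAMPLE_HTML = """
<html><body>
  <a href="index.gp?method=cappportal.showPortalSearch">Search</a>
  <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;opportunityid=154761">
      2D Capture Artist - 6 month contract
  </a>
  <a href="index.gp?method=cappportal.showJob&amp;layoutid=2092&amp;opportunityid=154426">
      Accounting Supervisor
  </a>
  <a>No href here</a>
</body></html>
"""

# Page URL taken from the question; relative hrefs are resolved against it.
PAGE_URL = (
    "https://ea.gr8people.com/index.gp"
    "?method=cappportal.showPortalSearch&sysLayoutID=123"
)


def find_opportunity_links(page_source, base_url=PAGE_URL):
    """Return (title, absolute_url) pairs for links whose href mentions opportunityid."""
    tree = html.fromstring(page_source)
    # Match on the href attribute, not on text(): the visible text is the job title.
    nodes = tree.xpath('//a[contains(@href, "opportunityid")]')
    return [
        (node.text_content().strip(), urljoin(base_url, node.get("href")))
        for node in nodes
    ]


if __name__ == "__main__":
    for title, url in find_opportunity_links(SAMPLE_HTML):
        print(title, "->", url)
```

In the original script, passing page.content to the helper in place of SAMPLE_HTML should give the same kind of result, assuming the live page still uses this markup.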
