Following the information using Scrapy in nested div and span tags

Problem description

I am trying to build a web crawler with Scrapy in Python that extracts the information Google shows on the right-hand side when you run a search, for example:

I want to extract the information in the box on the right side.

The link is: search in google

The source code: source code

Part of the HTML code is:

<div class="g rhsvw kno-kp mnr-c g-blk" lang="es-419" data-hveid="CAoQAA" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQjh8oAHoECAoQAA">
    <div class="kp-blk knowledge-panel Wnoohf OJXvsb" data-hveid="CAoQAQ" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQww0oAHoECAoQAQ">
        <div class="xpdopen">
            <div class="ifM9O">
                <div>
                    <div></div>
                </div>
                <div data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ_xd6BAgKEAI">
                    <div class="kp-header" lang="es-419" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ3z56BAgEEAA">
                        <div lang="es-419">
                            <h2 class="bNg8Rb">Resultado del Gráfico de conocimiento
                            </h2>
                        </div>
                        <div class="kp-hc">
                            <div class="NFQFxe Hhmu2e viOShc LKPcQc mod" data-md="16" lang="es-419" style="clear:none" data-hveid="CAQQAQ" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQhygoADAbegQIBBAB">
                                <!--m-->
                                <div class="Ftghae iirjIb">
                                    <div class="rsir2d">
                                        <kno-share-button>
                                            <div jsaction="r._HouY4r6utk" data-rtid="iHUQypqXTr0Q" jsl="$t t-dhmk9MkDbvI;$x 0;" class="r-iHUQypqXTr0Q" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ-YABKAAwG3oECAQQAg"><span class="JP8rKe r8U5xb z1asCe Fp7My" aria-label="Compartir" role="button" tabindex="0"><svg focusable="false" xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M18 16.08c-.76 0-1.44.3-1.96.77L8.91 12.7c.05-.23.09-.46.09-.7s-.04-.47-.09-.7l7.05-4.11c.54.5 1.25.81 2.04.81 1.66 0 3-1.34 3-3s-1.34-3-3-3-3 1.34-3 3c0 .24.04.47.09.7L8.04 9.81C7.5 9.31 6.79 9 6 9c-1.66 0-3 1.34-3 3s1.34 3 3 3c.79 0 1.5-.31 2.04-.81l7.12 4.16c-.05.21-.08.43-.08.65 0 1.61 1.31 2.92 2.92 2.92 1.61 0 2.92-1.31 2.92-2.92s-1.31-2.92-2.92-2.92z"></path></svg></span>
                                                <div style="display:none" class="iHUQypqXTr0Q-YbcQq9Khf_8 r-im11Tgib5Xfc" jsaction="dg_dismissed:r.-FPnppROon0;kno_shr_close_button_clicked:r.giXQqEBMb3E" data-rtid="im11Tgib5Xfc" jsl="$t t-7hzFN84w9_k;$x 0;" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ2poBMBt6BAgEEAM">
                                                    <g-dialog class="im11Tgib5Xfc-0078sLar460 r-iuKAMqdareQ0" data-id="_RWTdXKfnLs_EswXNnaCQDw4" jsaction="dg_reg_content:r.J_j78ao4uyM" data-rtid="iuKAMqdareQ0" jsl="$t t-cuCqGEujB5w;$x 0;">
                                                        <div class="iuKAMqdareQ0-oPwtUFSp9U8" id="_RWTdXKfnLs_EswXNnaCQDw4" jsaction="dg_close:r.99yxp2ZuQP0;r.nUlQmbHCUts" data-rtid="iuKAMqdareQ0" jsl="$x 4;"></div>
                                                    </g-dialog>
                                                </div>
                                                <div style="display:none" class="iHUQypqXTr0Q--9_AnHJXi80" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQhc0CMBt6BAgEEAk"></div>
                                            </div>
                                        </kno-share-button>
                                    </div>
                                    <div class="SPZz6b">
                                        <div class="kno-ecr-pt kno-fb-ctx gsmt" data-local-attribute="d3bn" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ3B0oATAbegQIBBAK"><span>La Cuarta</span></div>
                                        <div class="wwUB2c kno-fb-ctx"><span data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQ2kooAjAbegQIBBAL">Periódico</span></div>
                                    </div>
                                </div>
                                <!--n-->
                            </div><i class="GdltXd r-i5fJ88MOldfA" style="display:none" jsl="$t t-izLg50Mkmp4;$x 0;"></i></div>
                    </div>
                    <div class="SALvLe farUxc mJ2Mod">
                        <div class="i4J0ge">
                            <div class="mod" data-md="50" lang="es-419" style="clear:none" data-hveid="CAUQAA" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQkCkwHHoECAUQAA">
                                <!--m-->
                                <div class="PZPZlf hb8SAc kno-fb-ctx" data-attrid="description" data-hveid="CAUQAQ" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQziAoADAcegQIBRAB">
                                    <div jsl="$t t-oF0h478wPRI;$x 0;" class="r-igZyUtaLvb3g">
                                        <div class="kno-rdesc r-iNUajC5fIXTY" jsaction="sngtp:r.Eddvt4h-GI8;tp_btn:r.Eddvt4h-GI8" data-rtid="iNUajC5fIXTY" jsl="$t t-JgTEvN6zUII;$x 0;">
                                            <div>
                                                <h3 class="bNg8Rb">Descripción</h3><span>La Cuarta es un periódico chileno de circulación nacional diaria, editado por el consorcio Copesa. Su primer número fue publicado el 13 de noviembre de 1984. Su eslogan hasta 2017 fue El diario popular.</span><span><span> </span><a class="q ruhjFe NJLBac fl" href="https://es.wikipedia.org/wiki/La_Cuarta" data-ved="2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQmhMwHHoECAUQAg" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://es.wikipedia.org/wiki/La_Cuarta&amp;ved=2ahUKEwjnnabakqDiAhVP4qwKHc0OCPIQmhMwHHoECAUQAg">Wikipedia</a></span>
                                            </div>
                                        </div>
                                    </div>
                                </div>

I saw that the information I want is nested inside many div tags and ends up as the text of a span tag, so I tried the following:

response.xpath("//div[@class='kno-rdesc']")
response.xpath("//div[@class='mod']")
response.xpath("//div[@class='i4J0ge']")

I only get empty results. I even tried following each of the tags like this:

response.xpath("//div//div//div//div//div//div//div//div//div//span")

But I still can't get to the information I want.

Solution

A positional XPath is not always a good way to get data. Such expressions often break when the DOM changes, and the DOM can even change on every page load.

Also, use these modules with Scrapy when crawling well-known websites:

  1. scrapy-rotating-proxies
  2. scrapy-user-agents

Otherwise Google detects your request as a robot request and blocks the page load.
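As a rough sketch of how the two packages above are usually switched on, the following `settings.py` fragment registers their downloader middlewares. The middleware paths and priorities follow each project's README as far as I know, so check them against the versions you install. The proxy addresses are placeholders and are not taken from the question.

```python
# settings.py (sketch): enable the two packages listed in the answer.
# Install first: pip install scrapy-user-agents scrapy-rotating-proxies

# Placeholder proxies (assumption): replace with proxies you actually control.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Turn off Scrapy's built-in User-Agent middleware...
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # ...so scrapy-user-agents can set a random User-Agent per request.
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
    # scrapy-rotating-proxies: rotate through the proxy list and detect bans.
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

Even with these in place there is no guarantee that Google returns the same markup a browser shows, so it is worth checking that the response body actually contains the panel before debugging the selectors.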

A better way is to find elements on the page by their class and id.

(Note: you have to check that the class and id do not change on every load or with every query.)
