Scraping with Beautiful Soup: Why won't the get_text method return the text of this element?

Question

Lately I've been working on a project in Python that involves scraping a few websites for proxies. The problem I'm running into is that when I try to scrape a certain well-known proxy site, Beautiful Soup doesn't do what I expect when I ask it to find the IPs in the table of proxies. I attempt to scrape the IP for each proxy, and I get output like this when I use Beautiful Soup's .get_text() method on the appropriate element:

...

.UbZT{display:none}
.f5fa{display:inline}
.Glj2{display:none}
.cUce{display:inline}
.zjUZ{display:none}
.GzLS{display:inline}
98120169.117.186373161218218.83839393101138154165203242 

...

Here's the element that I'm trying to parse (the td tag which contains the IP):

<td><span><style>
.lLXJ{display:none}
.qRCB{display:inline}
.qC69{display:none}
.V0zO{display:inline}
</style><span style="display: inline">190</span><span class="V0zO">.</span><span 
style="display:none">2</span><div style="display:none">20</div><span 
style="display:none">51</span><span style="display:none">56</span><div 
style="display:none">56</div><span style="display:none">61</span><span 
class="lLXJ">61</span><div style="display:none">61</div><span 
class="qC69">110</span><div 
style="display:none">110</div><span style="display:none">135</span><div 
style="display:none">135</div><span class="V0zO">221</span><span 
style="display:none">234</span><div style="display:none">234</div><span class="147">.
</span><span style="display: inline">29</span><div style="display:none">44</div><span 
style="display:none">228</span><span></span><span class="qC69">248</span>.<span 
style="display:none">7</span><span></span><span style="display:none">44</span><span 
class="qC69">44</span><span class="qC69">80</span><span></span><span 
style="display:none">85</span><span class="lLXJ">85</span><div 
style="display:none">85</div><span class="qC69">100</span><div 
style="display:none">100</div><span></span><span class="qC69">130</span><div 
style="display:none">130</div><div style="display:none">168</div>212<span 
style="display:none">230</span><span class="qC69">230</span><div 
style="display:none">230</div></span></td>  

The actual text of this element is simply the IP for the proxy.

Here's the snippet of my code:

# Assumed imports: the original snippet did not show them.
import requests
from bs4 import BeautifulSoup as Soup

# Hide My Ass
pages = ['https://www.hidemyass.com/proxy-list']
num_found = 0

for page in pages:
    hidemyass = Soup(requests.get(page).text)
    rows = hidemyass.find_all(lambda tag: tag.name == 'tr' and tag.has_attr('class'))
    for row in rows:
        fields = row.find_all('td')
        # get ip, port, and protocol for proxy
        ip = fields[1].get_text()            # <-- Here's the above td element
        port = fields[2].get_text()
        protocol = fields[6].get_text().lower()
        # store proxy in database (db is the asker's own object, not shown here)
        db.add_proxy({'ip': ip, 'port': port, 'protocol': protocol})
        num_found += 1

Is there a correct way to parse this element so that the output won't get jumbled up like this? It seems intuitive that Beautiful Soup's .get_text() method would return exactly the text that is visible on the site, but I suppose that's not true. Thanks for any help or advice.

Recommended answer

BeautifulSoup cannot distinguish visible text from other text in the HTML markup: get_text() concatenates every text node under the element, including the contents of the <style> tag and of elements hidden with CSS. This particular website does a very good job of obfuscating its markup, which makes scraping the page more complex. You can try to work out which text is visible, but it's not that easy: many irrelevant elements are inserted and hidden either directly via an inline style attribute or via a class defined in the <style> block. Some of the IP parts are in spans, and some are not inside any tag at all.

One workaround is to use Selenium, which drives a real browser and can return only the visible text of an element. For example, this code prints all the IPs in the proxy table:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
try:
    browser.get('https://www.hidemyass.com/proxy-list')

    rows = browser.find_elements(By.XPATH, '//table[@id="listtable"]//tr')
    for row in rows[1:]:  # skip the first row (presumably the header)
        cells = row.find_elements(By.TAG_NAME, 'td')
        print(cells[1].text)  # .text returns only the rendered (visible) text
finally:
    browser.quit()
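
If running a browser is not an option, the hiding rules described above can be approximated by hand. The following is only a minimal sketch, not the answerer's method: it uses Python's standard-library html.parser so it runs without extra installs, and the names VisibleText, visible_text, HIDDEN_RULE and SAMPLE are hypothetical. SAMPLE is an abbreviated version of the td element from the question. The sketch assumes hiding is done only through `display:none`, either inline or in a single-class rule inside the `<style>` block; a site that changes its obfuscation would need different rules.

```python
import re
from html.parser import HTMLParser

# Matches single-class rules such as ".lLXJ{display:none}" and captures the class name.
HIDDEN_RULE = re.compile(r"\.([\w-]+)\s*\{\s*display\s*:\s*none\s*;?\s*\}")
# Matches an inline style that hides the element.
INLINE_NONE = re.compile(r"display\s*:\s*none")
# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class VisibleText(HTMLParser):
    """Collect text that is not inside <style> or a display:none element."""

    def __init__(self, hidden_classes):
        super().__init__()
        self.hidden_classes = hidden_classes
        self.open_tags = []  # stack of (tag name, does this tag hide its content?)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        hides = (
            tag == "style"
            or INLINE_NONE.search(attrs.get("style") or "") is not None
            or any(name in self.hidden_classes for name in classes)
        )
        self.open_tags.append((tag, hides))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Keep the text only if no enclosing element hides it.
        if not any(hides for _, hides in self.open_tags):
            self.parts.append(data)


def visible_text(html):
    """Return the text a browser would show, with all whitespace removed."""
    hidden_classes = set(HIDDEN_RULE.findall(html))
    parser = VisibleText(hidden_classes)
    parser.feed(html)
    parser.close()
    return "".join("".join(parser.parts).split())


# Abbreviated from the td element in the question.
SAMPLE = (
    '<td><span><style>\n'
    '.lLXJ{display:none}\n'
    '.qC69{display:none}\n'
    '.V0zO{display:inline}\n'
    '</style>'
    '<span style="display: inline">190</span><span class="V0zO">.</span>'
    '<span style="display:none">2</span><div style="display:none">20</div>'
    '<span class="lLXJ">61</span><span class="V0zO">221</span>'
    '<span class="147">.\n</span><span style="display: inline">29</span>'
    '<span></span><span class="qC69">248</span>.'
    '<div style="display:none">168</div>212'
    '</span></td>'
)

if __name__ == "__main__":
    print(visible_text(SAMPLE))  # 190.221.29.212
```

The same filtering idea carries over to Beautiful Soup: collect the text nodes of the cell and skip any whose ancestors are a style tag or match one of the hiding rules. Unlike Selenium, this approach never evaluates real CSS, so it only works while the page sticks to these simple rules.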

See also:

  • BeautifulSoup Grab Visible Webpage Text

Hope that helps.
