Scraping with Beautiful Soup: Why won't the get_text method return the text of this element?
Problem description
Lately I've been working on a project in Python that involves scraping a few websites for proxies. The problem I'm running into is that when I try to scrape a certain well-known proxy site, Beautiful Soup doesn't do what I expect when I ask it to find the IPs in the table of proxies. I attempt to scrape the IP for each proxy, and I get output like this when I use Beautiful Soup's .get_text() method on the appropriate element:
...
.UbZT{display:none}
.f5fa{display:inline}
.Glj2{display:none}
.cUce{display:inline}
.zjUZ{display:none}
.GzLS{display:inline}
98120169.117.186373161218218.83839393101138154165203242
...
Here's the element that I'm trying to parse (the td tag which contains the IP):
<td><span><style>
.lLXJ{display:none}
.qRCB{display:inline}
.qC69{display:none}
.V0zO{display:inline}
</style><span style="display: inline">190</span><span class="V0zO">.</span><span
style="display:none">2</span><div style="display:none">20</div><span
style="display:none">51</span><span style="display:none">56</span><div
style="display:none">56</div><span style="display:none">61</span><span
class="lLXJ">61</span><div style="display:none">61</div><span
class="qC69">110</span><div
style="display:none">110</div><span style="display:none">135</span><div
style="display:none">135</div><span class="V0zO">221</span><span
style="display:none">234</span><div style="display:none">234</div><span class="147">.
</span><span style="display: inline">29</span><div style="display:none">44</div><span
style="display:none">228</span><span></span><span class="qC69">248</span>.<span
style="display:none">7</span><span></span><span style="display:none">44</span><span
class="qC69">44</span><span class="qC69">80</span><span></span><span
style="display:none">85</span><span class="lLXJ">85</span><div
style="display:none">85</div><span class="qC69">100</span><div
style="display:none">100</div><span></span><span class="qC69">130</span><div
style="display:none">130</div><div style="display:none">168</div>212<span
style="display:none">230</span><span class="qC69">230</span><div
style="display:none">230</div></span></td>
The actual text of this element is simply the IP for the proxy.
Here's the snippet of my code:
# Hide My Ass
import requests
from bs4 import BeautifulSoup as Soup  # assumed alias; the original snippet omits its imports

pages = ['https://www.hidemyass.com/proxy-list']
num_found = 0  # assumed initial value; defined outside the original snippet
for page in pages:
    hidemyass = Soup(requests.get(page).text)
    rows = hidemyass.find_all(lambda tag: tag.name == 'tr' and tag.has_attr('class'))
    for row in rows:
        fields = row.find_all('td')
        # get ip, port, and protocol for proxy
        ip = fields[1].get_text()  # <-- Here's the above td element
        port = fields[2].get_text()
        protocol = fields[6].get_text().lower()
        # store proxy in database (db is defined elsewhere in the project)
        db.add_proxy({'ip': ip, 'port': port, 'protocol': protocol})
        num_found += 1
Is there a correct way to parse this element so that the output won't get jumbled up like this? It seems intuitive that Beautiful Soup's .get_text() method would return exactly the text that is visible on the site, but I suppose that's not true. Thanks for any help or advice.
Recommended answer
BeautifulSoup cannot distinguish visible text from other text in the HTML markup. This particular website does a very good job of obfuscating its markup, which makes scraping the page more complex. You can try to work out which text is visible, but it's not that easy: a lot of irrelevant elements are inserted, and they can be hidden either directly via an inline style or via a class. Some of the IP parts are in spans, and some of them are not inside any tag at all.
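As a rough illustration of what working out the visible text by hand involves, here is a minimal sketch using BeautifulSoup alone. It assumes the page hides decoy elements in only two ways, both seen in the td element from the question: class rules of the form `.name{display:none}` inside an inline `<style>` block, and inline `style="display:none"` attributes. The helper name `visible_text` is hypothetical, and a site that hides content by other means (external stylesheets, JavaScript, different CSS properties) would need more than this.

```python
import re
from bs4 import BeautifulSoup

HIDDEN_RULE = re.compile(r"\.([\w-]+)\s*\{\s*display\s*:\s*none")
HIDDEN_INLINE = re.compile(r"display\s*:\s*none")


def visible_text(cell_html):
    """Approximate the rendered text of an obfuscated cell (hypothetical helper)."""
    soup = BeautifulSoup(cell_html, "html.parser")

    # Collect class names that an inline <style> block declares as display:none,
    # then drop the <style> element so its CSS never ends up in the text.
    hidden_classes = set()
    for style in soup.find_all("style"):
        hidden_classes.update(HIDDEN_RULE.findall(style.string or ""))
        style.extract()

    # Remove every element hidden by class or by an inline style attribute.
    for tag in soup.find_all(True):
        classes = tag.get("class") or []
        inline = tag.get("style") or ""
        if hidden_classes.intersection(classes) or HIDDEN_INLINE.search(inline):
            tag.extract()

    # What is left is the text a browser would show; strip layout whitespace.
    return "".join(soup.get_text().split())
```

Applied to the td element shown in the question, the remaining text nodes should join into a single dotted IP address. The function would be called as `visible_text(str(fields[1]))` in place of `fields[1].get_text()`.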
One workaround would be to use Selenium, which can grab only the visible text from an element. For example, this code will print all the IPs in the particular table:
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get('https://www.hidemyass.com/proxy-list')

rows = browser.find_elements(By.XPATH, '//table[@id="listtable"]//tr')
for row in rows[1:]:  # skip the first (header) row
    cells = row.find_elements(By.TAG_NAME, 'td')
    print(cells[1].text)  # .text returns only the rendered, visible text

browser.close()
See also:
- BeautifulSoup Grab Visible Webpage Text
Hope that helps.