Python Web Scraping (Beautiful Soup, Selenium and PhantomJS): Only scraping part of full page


Question

Hello, I am having trouble trying to scrape data from a website for modeling purposes (fantsylabs dotcom). I'm just a hack, so forgive my ignorance of comp sci lingo. What I'm trying to accomplish is...


  1. Use selenium to log in to the website and navigate to the page with data.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

## Initialize and load the web page
url = "website url"
driver = webdriver.Firefox()
driver.get(url)
time.sleep(3)

## Fill out the forms and log in to the site
username = driver.find_element_by_name('input')
password = driver.find_element_by_name('password')
username.send_keys('username')
password.send_keys('password')
login_attempt = driver.find_element_by_class_name("pull-right")
login_attempt.click()

## Find and open the page with the data that I wish to scrape
link = driver.find_element_by_partial_link_text('Player Models')
link.click()
time.sleep(10)

## UPDATED CODE TO TRY AND SCROLL DOWN TO LOAD ALL THE DYNAMIC DATA
scroll = driver.find_element_by_class_name("ag-body-viewport")
driver.execute_script("arguments[0].scrollIntoView();", scroll)

## Allow time for the full page to load the lazy way, then pass to BeautifulSoup
time.sleep(10)
html2 = driver.page_source  ## page_source is already unicode

soup = BeautifulSoup(html2, "lxml")
div = soup.find_all('div', {'class': 'ag-pinned-cols-container'})
## continue to scrape what I want


This process works in that it logs in, navigates to the correct page, and, once the page finishes dynamically loading (about 30 seconds), passes it to BeautifulSoup. I see 300+ instances in the table that I want to scrape... however, the bs4 scraper only spits out about 30 of the 300 instances. From my own research it seems this could be an issue with the data loading dynamically via JavaScript, so that only what has been pushed into the HTML is parsed by bs4? (Using Python requests.get to parse html code that does not load at once)

It may be hard for anyone offering advice to reproduce my example without creating a profile on the website, but would using PhantomJS to initialize the browser be all that is needed to "grab" all instances in order to capture all the desired data?

    driver = webdriver.PhantomJS() ##instead of webdriver.Firefox()
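One PhantomJS caveat that often bites in cases like this: it defaults to a tiny viewport (roughly 400×300), which can change what a lazy-loading page renders. Setting an explicit window size is a cheap precaution:

    driver = webdriver.PhantomJS()
    driver.set_window_size(1920, 1080)  ## PhantomJS's default viewport is very small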

Any thoughts or experiences will be appreciated, as I've never had to deal with dynamic pages/scraping JavaScript, if that is what I am running into.

UPDATED AFTER Alec's response:

Below is a screenshot of the targeted data (highlighted in blue). You can see the scroll bar on the right of the image, embedded within the page. I have also provided a view of the page source code at this container.

[screenshot of the targeted grid and its scroll bar]

I have modified the original code that I provided to attempt to scroll down to the bottom and fully load the page, but it fails to perform this action. When I set the driver to Firefox(), I can see that the page moves down via the outer scroll bar but not within the targeted container. I hope this makes sense.
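For reference, scrollIntoView only scrolls the page until the container itself is visible; it does not move the grid's internal scrollbar. A minimal sketch of scrolling inside the container instead (assuming the same .ag-body-viewport element found above):

    ## Scroll the grid's own viewport rather than the outer page
    scroll = driver.find_element_by_class_name("ag-body-viewport")
    driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight;", scroll)
    time.sleep(2)  ## give the grid time to render the newly revealed rows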

Thanks again for any advice/guidance.

Recommended Answer

It's not easy to answer since there is no way for us to reproduce the problem.

One problem is that lxml does not handle this specific HTML particularly well, and you may need to try changing the parser:

soup = BeautifulSoup(html2, "html.parser")
soup = BeautifulSoup(html2, "html5lib")


Also, there might be no need for BeautifulSoup in the first place; you can locate elements with selenium in a lot of different ways. For example, in this case:

for div in driver.find_elements_by_css_selector(".ag-pinned-cols-container"):
    # do smth with 'div'
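As a hedged illustration of the "do smth" step, here is one way to pull the visible text of each grid cell (the .ag-cell class is an assumption based on ag-Grid's usual markup, not taken from the question):

    for container in driver.find_elements_by_css_selector(".ag-pinned-cols-container"):
        for cell in container.find_elements_by_css_selector(".ag-cell"):  ## assumed cell class
            print(cell.text)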


It may also be that the data is dynamically loaded when you scroll the page to the bottom. In this case, you may need to keep scrolling until you see the desired amount of data or no more new data loads on scroll (a sketch follows the list below). Here are relevant threads with sample solutions:

  • Scrolling web page using selenium python webdriver
  • Scroll down to bottom of infinite page with PhantomJS in Python
  • Slow scrolling down the page using Selenium
  • Stop the Scroll in Dynamic Page with Selenium in Python
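
To make the scroll-until-loaded idea concrete, here is a minimal sketch adapted from the pattern in those threads; the .ag-body-viewport and .ag-row selectors are assumptions based on the question's markup and ag-Grid conventions, not verified against the site:

    ## assumes `driver` and `time` from the question's snippets above
    viewport = driver.find_element_by_class_name("ag-body-viewport")
    last_count = -1
    while True:
        rows = driver.find_elements_by_css_selector(".ag-pinned-cols-container .ag-row")
        if len(rows) == last_count:
            break  ## no new rows appeared on the last scroll; assume fully loaded
        last_count = len(rows)
        ## move the grid's internal scrollbar to the bottom
        driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight;", viewport)
        time.sleep(2)  ## give the grid time to fetch and render new rows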
