What type of HTML table is this and what type of web scraping techniques can you use?


Problem description


I am trying to extract the data at this link, http://www.rchsd.org/doctors/index.htm?strt=0&ln=&fn=&sp=&grp=&loc=&lng=&gen=, with R, but it is rather difficult.

I notice that the url link does not change whenever I click on a page number. Is this table created with JavaScript? Is the table created by some external source and how can I get access to it? Also, is there a technical name for this type of table?

Also, for anyone who knows web scraping with R or any other program, how would you extract all the data from this table? I tried using the following code in R to extract the data, but I get NULL. How would you address this issue?

library(XML)  # provides htmlParse() and readHTMLTable()
mps <- "http://www.va.gov/providerinfo/SANDIEGO/index.asp?servicesearch=&specialtysearch=&gendersearch=&sort=&currentPage=1"
mps.doc <- htmlParse(mps)
mps.tabs <- readHTMLTable(mps.doc)  # this is the call that gives me NULL

Also, if you cannot answer the second half of my question, that is okay. I mainly want to know the answer to the first half.

Solution

Answer revised with 3 different techniques, all based on .ajax() and YQL.

Technique 1

Reference HTML: http://doctors.ucsd.edu/?index=1

For the first part of your question, the table at the URL you provided is a standard HTML Table Model variety. To create that table, the website uses an XML file to populate its rows and columns with data, including the photo of the doctor.

To keep the servers happy, not all of the data from the XML file is loaded into the browser. Only a limited set of results is shown, with the option to proceed to the next page.

This is also true for the URL you mentioned in the comments section (i.e. http://doctors.ucsd.edu/?index=1), where the visitor can select 10, 25 or 50 results from the webpage's Results Per Page dropdown menu. The browser's address bar will then show the requested number, for example &setsize=25.

Although you may want to data scrape that reference URL, it's best not to, since you already have the XML file with all the data you need. It's less work to access it directly!

Reference XML: http://www.rchsd.org/api/physdir/

The second part of your question is easy enough, since the XML file is readily available. This time around, when you data scrape that reference XML file, it returns the information you're looking for quickly and in a very readable form.

I've limited the request to 5 results for testing purposes in both data scraping queries above, but you can increase that to a larger sample size. The amount of extra webpage data in the 1st example would require XPATH to map out nodes, plus extra processing to use that data.

I've prepared a detailed jsFiddle which should explain a lot of your questions about this process. In it, I explain how to use YQL, .ajax(), and the link for the XML File.


Reference Example:

$.ajax({
    type: 'GET',
    url: 'http://query.yahooapis.com/v1/public/yql?q=SELECT%20phys%20FROM%20xml%20WHERE%20url%3D%22http%3A%2F%2Fwww.rchsd.org%2Fapi%2Fphysdir%2F%22%20LIMIT%205',
    dataType: 'xml',
    success: function(data) {
        var dataResults = $(data).find('results');
        console.log(dataResults);
    }
});
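The long url value in the request above is just the YQL endpoint followed by a URL-encoded YQL statement in the q parameter. As a minimal sketch, the statement can be read back with the standard URL class (the variable names exampleUrl and decodedStatement are only illustrative and not part of the original jsFiddle):

```javascript
// Decode the q parameter of the request URL used in the reference example.
const exampleUrl =
  'http://query.yahooapis.com/v1/public/yql?q=SELECT%20phys%20FROM%20xml%20WHERE%20url%3D%22http%3A%2F%2Fwww.rchsd.org%2Fapi%2Fphysdir%2F%22%20LIMIT%205';

// searchParams.get() percent-decodes the value for us.
const decodedStatement = new URL(exampleUrl).searchParams.get('q');

console.log(decodedStatement);
// SELECT phys FROM xml WHERE url="http://www.rchsd.org/api/physdir/" LIMIT 5
```

Reading the statement this way makes it easier to see what to change, for example the LIMIT value, before encoding it again.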

Reference Tutorial:
jsFiddle Data Scraping XML Demo (See below for jsFiddle HTML Demo)


Technique 2

EDIT: Returning to original reference HTML: http://doctors.ucsd.edu/?index=1

The last thing I wrote in the first section is actually not true, as you don't necessarily have all the data you need. While you could create your own Google Map location data from the doctor's physical address in the XML file, that information is already available to use.

It also turns out that this URL contains a uniquely formatted thumbnail image and, when available, a Doctors Information section.

So what follows is a rewritten jsFiddle that shows how to data scrape that HTML webpage. You'll note in this new jsFiddle that the YQL statement is no longer SELECT phys FROM xml, since we are now dealing with an HTML document. Also, we are going to use the wildcard * and not the tag name phys in that YQL statement. It will then be SELECT * FROM html

As you remember from the 1st data scraping method above, too much data was returned from that request. I'll explain how to add an XPATH to that YQL statement so you only get the desired data back.

Where to start you ask? At that website in your browser! I'll use Firefox to continue.

First, let's force 5 results to be returned in our tests. To do this, change Results Per Page to 25, then in the browser's address bar change the 25 to 5 in the &setsize= query parameter. Hit Enter on your keyboard to apply the change.

Using the webpage's Additional Search Criteria, Show More Specialties, Location, and Sort results by: options will also modify the address bar and further customize the URL to use.

For our demo, we need just 1 additional customization: Sort results by: Last Name A-Z. Reload the webpage if need be, and to be sure, our customized URL should look like:

http://doctors.ucsd.edu/?sortby=familyName&sortDirection=asc&setsize=5
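The same customized URL can also be assembled in code rather than by editing the address bar. This is a minimal sketch using the standard URL and URLSearchParams APIs. The parameter names come from the URL above, while the variable name customUrl is only illustrative:

```javascript
// Build the customized search URL from its query parameters.
const customUrl = new URL('http://doctors.ucsd.edu/');
customUrl.searchParams.set('sortby', 'familyName');   // Sort results by: Last Name
customUrl.searchParams.set('sortDirection', 'asc');   // A-Z
customUrl.searchParams.set('setsize', '5');           // results per page

console.log(customUrl.toString());
// http://doctors.ucsd.edu/?sortby=familyName&sortDirection=asc&setsize=5
```

Changing the setsize value here has the same effect as editing it by hand in the address bar.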

Now that the webpage is populated with our requested 5 results, we need to see how the layout is supporting those items.

Use Firefox Inspect Element tool via right-click on your mouse to view and learn the table layout structure. Soon, you will see that all the results returned are enclosed in a unique class name.

Here is a screenshot of using Firefox to illustrate:

When you pop up the HTML Panel via the icon at the bottom of the Inspect Element tools (to the right of the Inspect Element icon), you can see the layout for a single doctor's box:

In the photo above, you can visually traverse up the DOM to see that the main classname resultsList belongs to the div holding the requested 5 results. That classname can be used, but a more refined classname to use is resultsListProvider, which each returned item carries.

You now have the required information to construct a YQL Statement to use. First, here's the minimum we will use to get started:

SELECT * FROM html WHERE url="http://doctors.ucsd.edu/?sortby=familyName&sortDirection=asc&setsize=5"

The above really will not do, since it returns too much non-essential webpage data. That's why we used Inspect Element to discover what's really important. With that in mind, we will use XPATH to access the part of the webpage we need via the classname resultsListProvider.

xpath="//div[@class='resultsListProvider']"

Now we can combine both parts using AND to create the Final YQL Statement that we can data scrape:

SELECT * FROM html WHERE url="http://doctors.ucsd.edu/?sortby=familyName&sortDirection=asc&setsize=5" AND xpath="//div[@class='resultsListProvider']"

The Final YQL Statement above will now provide usable results to work with in a new jsFiddle I've created, which has updated comments to reflect these changes. If you need to, you can combine the XML file and HTML URL methods to satisfy your data scraping requirements, as each method provides content that the other may lack.
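To send the Final YQL Statement with .ajax(), it has to be URL-encoded and appended to the YQL endpoint as the q parameter, as in the Technique 1 example. The following is a minimal sketch of that step. The helper name buildYqlUrl is hypothetical and is not taken from the jsFiddle:

```javascript
// Endpoint taken from the Technique 1 reference example.
const YQL_ENDPOINT = 'http://query.yahooapis.com/v1/public/yql';

// Hypothetical helper: combine a page URL and an XPATH into a YQL
// statement, then URL-encode it into a request URL.
function buildYqlUrl(pageUrl, xpath) {
  const statement =
    `SELECT * FROM html WHERE url="${pageUrl}" AND xpath="${xpath}"`;
  const requestUrl = `${YQL_ENDPOINT}?q=${encodeURIComponent(statement)}`;
  return { statement, requestUrl };
}

const { statement, requestUrl } = buildYqlUrl(
  'http://doctors.ucsd.edu/?sortby=familyName&sortDirection=asc&setsize=5',
  "//div[@class='resultsListProvider']"
);

console.log(statement);
console.log(requestUrl); // pass this as the url option of $.ajax()
```

Keeping the statement readable and encoding it in one place avoids hand-editing the percent-encoded string whenever the page URL or XPATH changes.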

Reminder: Some data might be directly rendered when the webpage loads or when using YQL Rest State query. That means your dynamic data might be based on their dynamic data. Oh my!

Reference Tutorial:

jsFiddle Data Scraping HTML Demo (See above for jsFiddle XML Demo)


Technique 3

EDIT 2: Using HTML Directly

jsFiddle Data Scraping HTML Demo: Clone That Webpage

The latest edit shows how to use the original webpage's style sheets (which is optional, you can create your own), but requests the Ajax data differently by using the dataType attribute. This approach places the exact markup on the local webpage, including any classnames or ids that come with it.

jsFiddle Screenshot:
