Can I scrape the raw data from highcharts.js?


Question


I want to scrape data from a site that displays a graph using highcharts.js. I have already parsed all the intermediate pages to reach the page I am interested in. However, that last page, the one that displays the dataset, renders the graph with highcharts.js, and it seems nearly impossible to access the raw data.

I use Python 3.5 with BeautifulSoup.

Is it still possible to parse it? If so, how can I scrape it?

Answer

The data is in a script tag. You can get the script tag using bs4 and a regex. You could also extract the data with a regex, but I like using js2xml to parse the JavaScript into an XML tree:

from bs4 import BeautifulSoup
import requests
import re
import js2xml

soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser")
script = soup.find("script", text=re.compile("Highcharts.Chart")).text
# script = soup.find("script", text=re.compile("precipchartcontainer")).text if you want precipitation data
parsed = js2xml.parse(script)
print(js2xml.pretty_print(parsed))  # print() call form, since the question uses Python 3.5

That gives you:

<program>
  <functioncall>
    <function>
      <identifier name="$"/>
    </function>
    <arguments>
      <funcexpr>
        <identifier/>
        <parameters/>
        <body>
          <var name="chart"/>
          <functioncall>
            <function>
              <dotaccessor>
                <object>
                  <functioncall>
                    <function>
                      <identifier name="$"/>
                    </function>
                    <arguments>
                      <identifier name="document"/>
                    </arguments>
                  </functioncall>
                </object>
                <property>
                  <identifier name="ready"/>
                </property>
              </dotaccessor>
            </function>
            <arguments>
              <funcexpr>
                <identifier/>
                <parameters/>
                <body>
                  <assign operator="=">
                    <left>
                      <identifier name="chart"/>
                    </left>
                    <right>
                      <new>
                        <dotaccessor>
                          <object>
                            <identifier name="Highcharts"/>
                          </object>
                          <property>
                            <identifier name="Chart"/>
                          </property>
                        </dotaccessor>
                        <arguments>
                          <object>
                            <property name="chart">
                              <object>
                                <property name="renderTo">
                                  <string>tempchartcontainer</string>
                                </property>
                                <property name="type">
                                  <string>spline</string>
                                </property>
                              </object>
                            </property>
                            <property name="credits">
                              <object>
                                <property name="enabled">
                                  <boolean>false</boolean>
                                </property>
                              </object>
                            </property>
                            <property name="colors">
                              <array>
                                <string>#FF8533</string>
                                <string>#4572A7</string>
                              </array>
                            </property>
                            <property name="title">
                              <object>
                                <property name="text">
                                  <string>Average Temperature (°c) Graph for Brussels</string>
                                </property>
                              </object>
                            </property>
                            <property name="xAxis">
                              <object>
                                <property name="categories">
                                  <array>
                                    <string>January</string>
                                    <string>February</string>
                                    <string>March</string>
                                    <string>April</string>
                                    <string>May</string>
                                    <string>June</string>
                                    <string>July</string>
                                    <string>August</string>
                                    <string>September</string>
                                    <string>October</string>
                                    <string>November</string>
                                    <string>December</string>
                                  </array>
                                </property>
                                <property name="labels">
                                  <object>
                                    <property name="rotation">
                                      <number value="270"/>
                                    </property>
                                    <property name="y">
                                      <number value="40"/>
                                    </property>
                                  </object>
                                </property>
                              </object>
                            </property>
                            <property name="yAxis">
                              <object>
                                <property name="title">
                                  <object>
                                    <property name="text">
                                      <string>Temperature (°c)</string>
                                    </property>
                                  </object>
                                </property>
                              </object>
                            </property>
                            <property name="tooltip">
                              <object>
                                <property name="enabled">
                                  <boolean>true</boolean>
                                </property>
                              </object>
                            </property>
                            <property name="plotOptions">
                              <object>
                                <property name="spline">
                                  <object>
                                    <property name="dataLabels">
                                      <object>
                                        <property name="enabled">
                                          <boolean>true</boolean>
                                        </property>
                                      </object>
                                    </property>
                                    <property name="enableMouseTracking">
                                      <boolean>false</boolean>
                                    </property>
                                  </object>
                                </property>
                              </object>
                            </property>
                            <property name="series">
                              <array>
                                <object>
                                  <property name="name">
                                    <string>Average High Temp (°c)</string>
                                  </property>
                                  <property name="color">
                                    <string>#FF8533</string>
                                  </property>
                                  <property name="data">
                                    <array>
                                      <number value="6"/>
                                      <number value="8"/>
                                      <number value="11"/>
                                      <number value="14"/>
                                      <number value="19"/>
                                      <number value="21"/>
                                      <number value="23"/>
                                      <number value="23"/>
                                      <number value="19"/>
                                      <number value="15"/>
                                      <number value="9"/>
                                      <number value="6"/>
                                    </array>
                                  </property>
                                </object>
                                <object>
                                  <property name="name">
                                    <string>Average Low Temp (°c)</string>
                                  </property>
                                  <property name="color">
                                    <string>#4572A7</string>
                                  </property>
                                  <property name="data">
                                    <array>
                                      <number value="2"/>
                                      <number value="2"/>
                                      <number value="4"/>
                                      <number value="6"/>
                                      <number value="10"/>
                                      <number value="12"/>
                                      <number value="14"/>
                                      <number value="14"/>
                                      <number value="11"/>
                                      <number value="8"/>
                                      <number value="5"/>
                                      <number value="2"/>
                                    </array>
                                  </property>
                                </object>
                              </array>
                            </property>
                          </object>
                        </arguments>
                      </new>
                    </right>
                  </assign>
                </body>
              </funcexpr>
            </arguments>
          </functioncall>
        </body>
      </funcexpr>
    </arguments>
  </functioncall>
</program>
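
To make the shape of that tree more concrete, here is a minimal sketch that pairs each series name with its values. It is not part of the original answer: it uses the standard library's `xml.etree.ElementTree` as a stand-in for the lxml tree that js2xml returns, and it runs on a trimmed, hand-copied excerpt (two months per series) rather than on the live page. The helper name `series_by_name` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the tree printed above (assumption: only the
# <array> under the "series" property, two values per series).
XML = """
<array>
  <object>
    <property name="name"><string>Average High Temp (°c)</string></property>
    <property name="data">
      <array><number value="6"/><number value="8"/></array>
    </property>
  </object>
  <object>
    <property name="name"><string>Average Low Temp (°c)</string></property>
    <property name="data">
      <array><number value="2"/><number value="2"/></array>
    </property>
  </object>
</array>
""".strip()


def series_by_name(root):
    """Map each series name to its list of integer values."""
    result = {}
    for obj in root.iter("object"):
        name = obj.find("property[@name='name']/string")
        data = obj.find("property[@name='data']/array")
        if name is None or data is None:
            continue  # an <object> that is not a series entry
        result[name.text] = [int(n.get("value")) for n in data.findall("number")]
    return result


series = series_by_name(ET.fromstring(XML))
print(series)
```

Objects that lack both a `name` and a `data` property (for example the `Highcharts` object inside the `dotaccessor`) are skipped, so the same loop should also cope with the full tree, although that has only been reasoned about here, not run against the live page.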

So to get all the data:

In [28]: from bs4 import BeautifulSoup  
In [29]: import requests
In [30]: import re    
In [31]: import js2xml    
In [32]: from itertools import repeat    
In [33]: from pprint import pprint as pp
In [34]: soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser")

In [35]: script = soup.find("script", text=re.compile("Highcharts.Chart")).text

In [36]: parsed = js2xml.parse(script)

In [37]: data = [d.xpath(".//array/number/@value") for d in parsed.xpath("//property[@name='data']")]

In [38]: categories = parsed.xpath("//property[@name='categories']//string/text()")

In [39]: output = list(zip(repeat(categories), data))
In [40]: pp(output)
[(['January',
   'February',
   'March',
   'April',
   'May',
   'June',
   'July',
   'August',
   'September',
   'October',
   'November',
   'December'],
  ['6', '8', '11', '14', '19', '21', '23', '23', '19', '15', '9', '6']),
 (['January',
   'February',
   'March',
   'April',
   'May',
   'June',
   'July',
   'August',
   'September',
   'October',
   'November',
   'December'],
  ['2', '2', '4', '6', '10', '12', '14', '14', '11', '8', '5', '2'])]
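
Note that the values come back as strings, because they are read from the `value` attributes. As a possible follow-up (a sketch, not part of the original answer), the output can be turned into month-keyed dictionaries of integers. The `labels` list is a hypothetical addition; it assumes the series appear in the same order as on the page (high first, then low).

```python
# The output shown above, copied in as literals so the sketch runs offline.
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
high = ['6', '8', '11', '14', '19', '21', '23', '23', '19', '15', '9', '6']
low = ['2', '2', '4', '6', '10', '12', '14', '14', '11', '8', '5', '2']
output = [(months, high), (months, low)]

# Hypothetical labels, one per series, in page order.
labels = ["high", "low"]

# Build {label: {month: int_value}}; int() assumes whole-number data,
# use float() instead if a page reports decimals.
table = {
    label: dict(zip(cats, map(int, values)))
    for label, (cats, values) in zip(labels, output)
}

print(table["high"]["July"])    # 23
print(table["low"]["January"])  # 2
```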

As I said, you could just use a regex, but I find js2xml more reliable, since stray whitespace and similar formatting differences won't break it.
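
For comparison, the regex route might look roughly like the sketch below. It is an illustration only: the `SCRIPT` string is a simplified, hypothetical stand-in for the page's script, and the pattern handles only flat arrays of plain numbers. Nested `[x, y]` pairs, `null` placeholders written differently, or a `data` key inside a string would need a more careful pattern, which is the kind of fragility the parser-based approach avoids.

```python
import json
import re

# Hypothetical, simplified script body; the real script is longer.
SCRIPT = """
chart = new Highcharts.Chart({
    xAxis: { categories: ['January', 'February', 'March'] },
    series: [
        { name: 'Average High Temp', data: [6, 8, 11] },
        { name: 'Average Low Temp', data: [2,2 ,  4] }
    ]
});
"""

# Capture the bracketed list after each `data:` key.
# \s* tolerates irregular spacing around the colon.
DATA_RE = re.compile(r"\bdata\s*:\s*(\[[^\]]*\])")

# A flat list of numbers is valid JSON, so json.loads can parse it.
series = [json.loads(match) for match in DATA_RE.findall(SCRIPT)]
print(series)  # [[6, 8, 11], [2, 2, 4]]
```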
