Reading a particular line from a webpage in Python


Problem description

In my code I'm trying to get the first line of text from a webpage into a variable in Python. At the moment I'm using urlopen to get the whole page for each link I want to read. How do I read only the first line of words on the webpage?

My code:

import urllib2
import numpy as np
line_number = 10
id = (np.arange(1,5))
for n in id:
    link =  urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()

I want to extract the words "Old car" from the following html code of the webpage:

<html>
    <head>
        <link rel="stylesheet">
        <style>
            .norm { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
            .norm:Visited { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
            .norm:Hover { font-family: arial; font-size: 8.5pt; color : #000000; text-decoration : underline; }
        </style>
    </head>
    <body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>

Solution

Use XPath. It's exactly what we need.

XPath, the XML Path Language, is a query language for selecting nodes from an XML document.

The lxml Python library will help us with this. It's one of many. Libxml2, ElementTree, and PyXML are some of the options. There are many, many, many libraries to do this type of thing.

Using XPath

Something like the following, based on your existing code, will work:

import urllib2
import numpy as np
from lxml import html
line_number = 10
id = (np.arange(1,5))
for n in id:
    link =  urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()
    tree = html.fromstring(l)
    print tree.xpath("//b/text()")[0]

The XPath query //b/text() basically says "get the text from the <b> elements on a page". The tree.xpath function call returns a list with one entry per match, in document order, and we select the first one using [0]. Easy.
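
To make the list behaviour concrete, here is a minimal sketch. It is written for Python 3 (the snippets in this answer are Python 2), and the names SAMPLE and first_bold_text are made up for illustration. It prints the full list the query returns and guards against a page with no <b> element, where [0] would otherwise raise IndexError.

```python
from lxml import html

# Trimmed-down version of the page from the question (illustrative only).
SAMPLE = """<html>
    <body>
<b>Old car</b><br>
ID: 02910<br>
<p><b>CDS</b></p>
    </body>
</html>"""


def first_bold_text(markup):
    """Return the text of the first <b> element, or None if there is none."""
    tree = html.fromstring(markup)
    matches = tree.xpath("//b/text()")  # list of text nodes, in document order
    return matches[0].strip() if matches else None


# The query matches every <b> on the page, not just the first one.
all_bold = html.fromstring(SAMPLE).xpath("//b/text()")
print(all_bold)

print(first_bold_text(SAMPLE))
print(first_bold_text("<html><body><p>no bold here</p></body></html>"))
```

Returning None instead of indexing blindly is just one choice; raising a descriptive error would work equally well if a missing <b> should stop the script.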

An aside about Requests

The Requests library is the state-of-the-art when it comes to reading webpages in code. It may save you some headaches later.

The complete program might look like this:

from lxml import html
import requests

for nn in range(1, 6):
    page = requests.get("http://www.cv.edu/id=%d" % nn)
    tree = html.fromstring(page.text)
    print tree.xpath("//b/text()")[0]

Caveats

The URLs didn't work for me, so you might have to tinker a bit. The concept is sound, though.
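
Since some of the pages may fail to load, it can help to make the fetch step tolerant of errors. The following is a hedged Python 3 sketch, not a definitive implementation: fetch_first_bold and first_bold_text are hypothetical helper names, and the 10-second timeout is an arbitrary assumption.

```python
import requests
from lxml import html


def first_bold_text(markup):
    """Return the text of the first <b> element, or None if there is none."""
    tree = html.fromstring(markup)
    matches = tree.xpath("//b/text()")
    return matches[0].strip() if matches else None


def fetch_first_bold(url, timeout=10):
    """Fetch url and return its first <b> text, or None if the fetch fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, bad URLs and HTTP error codes.
        print("could not fetch {}: {}".format(url, exc))
        return None
    if not response.text.strip():
        return None  # lxml refuses to parse an empty document
    return first_bold_text(response.text)
```

With this in place, the loop from the program above becomes a matter of calling fetch_first_bold("http://www.cv.edu/id=%d" % nn) for each nn in range(1, 6) and skipping any None results.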

Reading from the webpages aside, you can use the following to test the XPath:

from lxml import html

tree = html.fromstring("""<html>
    <head>
        <link rel="stylesheet">
    </head>
    <body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>""")

print tree.xpath("//b/text()")[0] # "Old car"
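
If installing lxml is not an option, the same "first <b> text" lookup can be approximated with Python 3's standard library. Note that this swaps XPath for an event-based parser (html.parser), so it is a different technique from the one described in this answer; FirstBoldParser and first_bold_text are hypothetical names used only for this sketch.

```python
from html.parser import HTMLParser


class FirstBoldParser(HTMLParser):
    """Collects the text inside the first <b>...</b> pair it encounters."""

    def __init__(self):
        super().__init__()
        self._in_bold = False
        self._chunks = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting if no <b> has been completed yet.
        if tag == "b" and self.result is None:
            self._in_bold = True

    def handle_data(self, data):
        if self._in_bold:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "b" and self._in_bold:
            self.result = "".join(self._chunks).strip()
            self._in_bold = False


def first_bold_text(markup):
    """Return the text of the first <b> element, or None if there is none."""
    parser = FirstBoldParser()
    parser.feed(markup)
    parser.close()
    return parser.result


SAMPLE = """<html>
    <body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>"""

print(first_bold_text(SAMPLE))
```

The parser tolerates the unclosed tags in the sample page, which is why it is used here rather than an XML parser.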
