Counting HTML images with Python

Problem description

I need some feedback on how to count HTML images with Python 3.01 after extracting them; maybe my regular expression is not used properly.

Here is my code:

import re, os
import urllib.request
def get_image(url):
  url = 'http://www.google.com'
  total = 0
  try:
    f = urllib.request.urlopen(url)
    for line in f.readline():
      line = re.compile('<img.*?src="(.*?)">')
      if total > 0:
        x = line.count(total)
        total += x
        print('Images total:', total)

  except:
    pass

Recommended answer

A few points about your code:

  1. It's much easier to use a dedicated HTML parsing library to parse your pages (that's the Python way). I personally prefer Beautiful Soup.
  2. You're overwriting your line variable in the loop.
  3. total will always be 0 with your current logic.
  4. There's no need to compile your RE, as the re module caches compiled patterns.
  5. You're discarding your exception, so you get no clues about what's going on in the code!
  6. There could be other attributes on the <img> tags, so your regex is a little basic. Also, use the re.findall() method to catch multiple instances on the same line.
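
As a rough illustration of the first point, the sketch below counts <img> tags with a parser instead of a regular expression. It uses html.parser.HTMLParser from the standard library in place of Beautiful Soup, purely so that it runs without extra installs. The names ImgCounter and count_images are made up for this example.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> start tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.total = 0

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag names, so <IMG> is counted as well.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            self.total += 1


def count_images(html):
    """Return the number of <img> tags in an HTML string."""
    counter = ImgCounter()
    counter.feed(html)
    counter.close()
    return counter.total


sample = '<p><img src="a.png" alt="A"><IMG class="x" src="b.png"/></p>'
print(count_images(sample))
```

Because the parser handles attributes, letter case and self-closing tags itself, none of those details has to be encoded in a pattern.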

Changing your code around a little, I get:

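import re
from urllib.request import urlopen

def get_image(url):
    """Print the number of <img> tags found at url."""
    total = 0
    # readlines() returns one bytes object per line of the response
    page = urlopen(url).readlines()

    for line in page:
        # str() on bytes gives its repr (b'...'), which is enough for matching;
        # findall() catches several tags on the same line
        hit = re.findall('<img.*?>', str(line))
        total += len(hit)

    print('{0} Images total: {1}'.format(url, total))

get_image("http://google.com")
get_image("http://flickr.com")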
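
The point about discarded exceptions can be sketched in the same way. The variant below is only one possible shape, not the answer's own code: it reports a failed fetch instead of silently passing, and it searches the whole page at once so that tags spanning several lines are counted too. The function name count_images_in_page, the choice of exceptions to catch and the UTF-8 decoding are all assumptions made for illustration.

```python
import re
from urllib.error import URLError
from urllib.request import urlopen


def count_images_in_page(url):
    """Return the number of <img> tags at url, or None if the fetch fails."""
    try:
        with urlopen(url) as response:
            # Decoding as UTF-8 is an assumption; undecodable bytes are replaced.
            html = response.read().decode("utf-8", errors="replace")
    except (URLError, ValueError) as exc:
        # URLError covers network and HTTP errors; ValueError covers
        # strings that urlopen does not recognise as a URL.
        print("Could not fetch {0}: {1}".format(url, exc))
        return None
    # [^>]* also matches newlines, so multi-line tags are found.
    return len(re.findall(r"<img\b[^>]*>", html, flags=re.IGNORECASE))
```

Returning None on failure keeps "the page has no images" (0) distinct from "the page could not be read".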
