Grab image links from an HTML website using PowerShell


Question

I'd like to download some image galleries in bulk. The images are offered for free, with no permission needed. For the life of me, I cannot get it to work. This is what I have so far. What $pattern prints is the whole HTML line, not just the image link. Are there any pointers you can give me? The loop is set to run only once for testing purposes; eventually it will go through all the pages, which are numbered sequentially.

# Variables
$i=1        # Webpage Counter
$j=1        # Image Counter
$rootDir = "http://website.com/sport/galleries/"
$saveDir = "C:\Users\user\Desktop\"
$webpagetxt = "C:\Users\user\Desktop\page.txt"
$links = "C:\Users\user\Desktop\links.txt"
$regex = "http://website.com/galleries/[0-9]*/[^\.]*.JPG"

# Create folder to download to
#New-Item -Name SiouxSportsGalleries -ItemType directory

# Start Web Client
$client = New-Object System.Net.WebClient

# Main loop to get image links and download
For($i=10; $i -le 10; $i++){

    # Download source code of the web page.
    $url = $rootDir+$i+'.htm'
    $webclient = new-object System.Net.WebClient
    $webpage = $webclient.DownloadString($url)
    $webpage > "$webpagetxt"

    # Parse web page and find image link.
    $pattern = Get-Content $webpagetxt | Select-String -pattern $regex -Allmatches
    echo "This is the link" $pattern
    #$pattern > $links

}


Recommended answer

You need to extract the value that matched. Select-String returns MatchInfo objects, and when you echo one, what happens is $pattern.ToString(). ToString() returns the line, not the matched value. This will return only the links:

Get-Content $webpagetxt | Select-String -pattern $regex -Allmatches | % { $_.Matches | % { $_.Value } }
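
To see the difference between the matched line and the matched value without touching the network, the same pipeline can be run on a small in-memory string. This is only a sketch: the sample markup and file names are made up for illustration, and the regex is the one from the question with the dots escaped.

```powershell
# Hypothetical sample markup, invented for illustration only.
$sampleHtml = @'
<a href="http://website.com/galleries/12/first.JPG"><img src="thumb1.gif"></a>
<a href="http://website.com/galleries/12/second.JPG"><img src="thumb2.gif"></a>
'@

# Regex from the question, with the literal dots escaped.
$regex = 'http://website\.com/galleries/[0-9]*/[^\.]*\.JPG'

# Select-String emits one MatchInfo object per matching line.
$matchInfos = $sampleHtml -split "`n" | Select-String -Pattern $regex -AllMatches

# Line holds the full input line that contained a match ...
$wholeLines = $matchInfos | ForEach-Object { $_.Line }

# ... while Matches holds the individual regex hits on that line.
$imageLinks = $matchInfos | ForEach-Object { $_.Matches } | ForEach-Object { $_.Value }

$imageLinks
```

Here $wholeLines contains the complete anchor tags, while $imageLinks contains only the two URLs.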

By the way, instead of saving the web page and reopening it with Get-Content, you can simply split the string on line breaks to get an array (if that was the only reason you saved it). :-)

$webpage -split "`n" | Select-String -pattern $regex -Allmatches | % { $_.Matches | % { $_.Value } }
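
If the page is already in memory as one string, the split can be skipped as well. The sketch below swaps Select-String for the .NET [regex]::Matches method, which is a different technique from the one in this answer; the sample string is a made-up stand-in for $webpage.

```powershell
# Stand-in for the $webpage string returned by DownloadString (made-up content).
$webpage = '<img src="http://website.com/galleries/7/a.JPG"> text <img src="http://website.com/galleries/7/b.JPG">'

# Regex from the question, with the literal dots escaped.
$regex = 'http://website\.com/galleries/[0-9]*/[^\.]*\.JPG'

# [regex]::Matches scans the whole string and returns every hit,
# so two links on the same line are both found.
$imageLinks = [regex]::Matches($webpage, $regex) | ForEach-Object { $_.Value }

$imageLinks
```

Unlike Select-String, [regex]::Matches is case-sensitive by default; passing [System.Text.RegularExpressions.RegexOptions]::IgnoreCase as a third argument would also match a lowercase .jpg extension.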

EDIT: To download the images, you could extend it with another foreach loop:

$rootDir = "http://website.com/sport/galleries/"
$saveDir = "C:\Users\user\Desktop\"
$webpage -split "`n" | Select-String -pattern $regex -Allmatches | % { $_.Matches | % { $_.Value } } | % {
    # Get local path by swapping the URL prefix for the save directory
    $local = $_.Replace($rootDir, $saveDir)
    # Create an empty file (and any missing parent folders)
    $file = New-Item $local -ItemType file -Force
    # Download, reusing the WebClient created in the question's loop
    $webclient.DownloadFile($_, $file.FullName)
}
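
The Replace call above only yields a usable local path when each link starts with $rootDir, and as written the regex in the question matches links under /galleries/ rather than /sport/galleries/. One possible alternative is to keep only the file name of each link. Get-LocalImagePath is a hypothetical helper, not part of the original answer, and putting every image into one folder is an assumption that may cause name clashes between galleries.

```powershell
# Hypothetical helper: maps an image URL to a file path inside $SaveDir,
# keeping only the file name taken from the URL.
function Get-LocalImagePath {
    param(
        [Parameter(Mandatory)] [string] $Url,
        [Parameter(Mandatory)] [string] $SaveDir
    )
    # The last URI segment is the file name, e.g. "photo1.JPG".
    $fileName = ([System.Uri] $Url).Segments[-1]
    # Path.Combine only builds the string; it does not touch the disk.
    [System.IO.Path]::Combine($SaveDir, $fileName)
}

# Example with made-up values; no download happens here.
$saveDir = [System.IO.Path]::GetTempPath()
$local = Get-LocalImagePath -Url 'http://website.com/galleries/12/photo1.JPG' -SaveDir $saveDir
$local
```

Inside the download loop, the result could then be handed to $webclient.DownloadFile($_, $local) in place of the New-Item step.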
