How do you get table data from a website after you log in using PowerShell?

Question

My company wants me to grab data from their internal website, organize it, and send it to a database. The data is displayed in tables that you navigate to within the site. I want to pull the fields into a file or into memory for further processing.

So far, I can log into the site in PowerShell by getting the submit login button's ID and passing my username/password. I'm able to use the Navigate method to change to the appropriate page within the site. However, running Invoke-WebRequest on the new page, as well as using Net.WebClient on the new page, returns the content of the site's original login screen (I know because nothing from the table makes it into the returned values, regardless of the commands I use). The commented-out code is what I've tried previously.

Here is the code, minus the values of my ID, password, and site link:

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
$ie = New-Object -ComObject 'internetExplorer.Application'
$ie.Visible= $true # Make it visible
$username="myid"
$password="mypw"
$ie.Navigate("https://webpage.com/index.jsp")
While ($ie.Busy -eq $true) {Start-Sleep -Seconds 3;}
$usernamefield = $ie.document.getElementByID('login')
$usernamefield.value = "$username"
$passwordfield = $ie.document.getElementByID('password')
$passwordfield.value = "$password"
$Link = $ie.document.getElementByID('SubmitLogin')
$Link.click()
$url = "https://webpage.com/home.pa#%5BT1%2CM181%5D"
$ie.Navigate($url) 
While ($ie.Busy -eq $true) {Start-Sleep -Seconds 3;}
$doc = $ie.document
$web = New-Object Net.WebClient
$web.DownloadString($url)
#$r = Invoke-WebRequest $url
#$r.Forms.fields | get-member
#$InnerText = $r.AllElements | 
#    Where-Object {$_.tagName -ne "TD" -and $_.innerText -ne $null} | 
#    Select -ExpandProperty innerText
#write-host $InnerText
#$r.AllElements|Where-Object {$_.InnerHtml -like "*=*"} 

#$doc = $ie.Document
#$doc.getElementByID("ext-element-7") | % {
#    if ($_.id -ne $null){
#        write-host $_.id
#    }
#}
$ie.Quit()

Recommended answer

I obviously don't have your page and can't confirm that the body of the sign-in POST contains the fields login and password, so that will require some trial and error on your part. As a mini-example, if you open the Network tab of your browser's dev tools and filter by POST, you can observe how your login page signs you in. When I open reddit to sign in, it sends a POST to https://www.reddit.com/login with a body containing username and password key/value pairs (both plaintext). This action sets up my browser session to persist my login.
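If the dev tools are not handy, the login page's own markup can give a first hint about which field names the POST body might need. The sketch below is only an illustration: Get-InputFieldName is a hypothetical helper, the sample markup is made up, and the assumption that the name attributes match the element IDs from the question (login, password) still has to be checked against the real page.

```powershell
# Hypothetical helper: list the name attribute of every <input> tag in an HTML string.
# A regex is a rough tool for HTML; it is used here only for a quick look.
function Get-InputFieldName {
    param([Parameter(Mandatory)][string]$Html)

    $pattern = '<input\b[^>]*?\bname\s*=\s*["'']([^"'']+)["'']'
    [regex]::Matches($Html, $pattern, 'IgnoreCase') |
        ForEach-Object { $_.Groups[1].Value }
}

# Made-up sample markup standing in for the real login page (an assumption).
$sampleLoginPage = @'
<form action="/index.jsp" method="post">
  <input type="text" id="login" name="login">
  <input type="password" id="password" name="password">
  <input type="submit" id="SubmitLogin" value="Sign in">
</form>
'@

$fieldNames = @(Get-InputFieldName -Html $sampleLoginPage)
$fieldNames
```

Inputs without a name attribute (such as the submit button in the sample) are skipped, since they would not appear in a form post anyway.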

Here's a code example that uses the HtmlAgilityPack library to interact with the resulting page as if it were XML.

Enable TLS 1.2:

[System.Net.ServicePointManager]::SecurityProtocol =
    [System.Net.ServicePointManager]::SecurityProtocol -bor [System.Net.SecurityProtocolType]::Tls12

Set up your web session:

$iwrParams = @{
    'Uri'             = 'https://webpage.com/index.jsp'
    'Method'          = 'POST'
    'Body'            = @{
        'login'    = $username
        'password' = $password
    }
    'SessionVariable' = 'session'
    # avoids cases where IE has not been opened
    'UseBasicParsing' = $true
}
# don't care about response - only here to initialize the session
$null = Invoke-WebRequest @iwrParams
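One way to see whether the sign-in actually took is to look at the cookies the session picked up. The following is a minimal sketch, not part of the original answer: Get-CookieName is a hypothetical helper, and the JSESSIONID cookie is a made-up stand-in so that the example runs without touching the real site.

```powershell
# Hypothetical helper: names of the cookies a container would send to a given URI.
function Get-CookieName {
    param(
        [Parameter(Mandatory)][System.Net.CookieContainer]$Container,
        [Parameter(Mandatory)][uri]$Uri
    )
    $Container.GetCookies($Uri) | ForEach-Object { $_.Name }
}

# Stand-in container holding a made-up cookie (an assumption, for offline use).
$demo = [System.Net.CookieContainer]::new()
$demo.Add([System.Net.Cookie]::new('JSESSIONID', 'abc123', '/', 'webpage.com'))

$cookieNames = @(Get-CookieName -Container $demo -Uri 'https://webpage.com/index.jsp')
$cookieNames
```

With the real session, the container would be $session.Cookies. If no cookie shows up after the POST, the body field names or the login URL probably need another look.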

Get the protected page content:

$iwrParams = @{
    'Uri'             = 'https://webpage.com/home.pa#%5BT1%2CM181%5D'
    'WebSession'      = $session
    'UseBasicParsing' = $true
}
$output = (Invoke-WebRequest @iwrParams).Content
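Since the original problem was getting the login screen back instead of the table, it may be worth checking $output before parsing it. This is a rough heuristic under the assumption that only the login page contains a password input. Test-LooksLikeLoginPage is a hypothetical name and both sample strings are made up.

```powershell
# Hypothetical check: does this HTML still contain a password input?
# -match is case-insensitive by default, so TYPE="PASSWORD" is caught as well.
function Test-LooksLikeLoginPage {
    param([Parameter(Mandatory)][string]$Html)
    $Html -match '<input\b[^>]*\btype\s*=\s*["'']password["'']'
}

# Made-up samples standing in for the two possible responses.
$loginHtml = '<form><input type="password" name="password"></form>'
$tableHtml = '<table><tr><td>a=1</td></tr></table>'

$loginResult = Test-LooksLikeLoginPage -Html $loginHtml
$tableResult = Test-LooksLikeLoginPage -Html $tableHtml
$loginResult, $tableResult
```

Against the real response this would be called as Test-LooksLikeLoginPage -Html $output, and a $true result suggests the session was not accepted.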

Download/add HtmlAgilityPack:

if (-not (Test-Path -Path "$PSScriptRoot\HtmlAgilityPack.dll" -PathType Leaf))
{
    Invoke-WebRequest -Uri https://www.nuget.org/api/v2/package/HtmlAgilityPack -OutFile "$PSScriptRoot\html.zip"
    Expand-Archive -Path "$PSScriptRoot\html.zip" -DestinationPath "$PSScriptRoot\html" -Force
    Copy-Item -Path "$PSScriptRoot\html\lib\netstandard2.0\HtmlAgilityPack.dll" -Destination "$PSScriptRoot\"
    Remove-Item -Path "$PSScriptRoot\html", "$PSScriptRoot\html.zip" -Recurse -Force
}

Add-Type -Path "$PSScriptRoot\HtmlAgilityPack.dll"
$html = [HtmlAgilityPack.HtmlDocument]::new()

Load/parse the page content:

$html.LoadHtml($output)

# do stuff with output.
$html.DocumentNode.SelectNodes('//*/text()').Text.Where{$PSItem -like '*=*'}
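Because the goal in the question is to send table data on to a database, it can help to turn the extracted cell text into objects with named properties. The sketch below assumes the cell text has already been pulled out into arrays, one per row. ConvertTo-RowObject, the header names, and the sample rows are all hypothetical, since the layout of the real table is unknown.

```powershell
# Hypothetical helper: pair each row's cell text with a list of column names.
function ConvertTo-RowObject {
    param(
        [Parameter(Mandatory)][string[]]$Header,
        [Parameter(Mandatory)][object[]]$Rows
    )
    foreach ($cells in $Rows) {
        $record = [ordered]@{}
        for ($i = 0; $i -lt $Header.Count; $i++) {
            # Missing trailing cells become $null instead of raising an error.
            $record[$Header[$i]] = if ($i -lt $cells.Count) { "$($cells[$i])".Trim() } else { $null }
        }
        [pscustomobject]$record
    }
}

# Made-up data standing in for text extracted from the page.
$header = 'Name', 'Value'
$rows = @(
    @('alpha ', '1'),
    @('beta', ' 2')
)

$records = @(ConvertTo-RowObject -Header $header -Rows $rows)
$records | ConvertTo-Csv -NoTypeInformation
```

With HtmlAgilityPack, the row arrays could plausibly be built from each tr node's td children and their InnerText, but the exact XPath depends on the page. The resulting objects can be piped to Export-Csv or handed to whatever database layer is in use.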


Footnote

I assumed in the code that you are executing from a script, where $PSScriptRoot will be populated. If it's being run interactively, you can use the $pwd automatic variable instead (a carry-over from *nix: print working directory). This code requires PSv5+.
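A small fallback lets the same code work in both situations. This is a sketch only, and $baseDir and $dllPath are names introduced here for illustration.

```powershell
# $PSScriptRoot is empty at an interactive prompt, so fall back to the current location.
$baseDir = if ($PSScriptRoot) { $PSScriptRoot } else { $PWD.Path }

# Build the DLL path from whichever base directory was chosen.
$dllPath = Join-Path -Path $baseDir -ChildPath 'HtmlAgilityPack.dll'
$dllPath
```

The $PSScriptRoot references in the download step could then be swapped for $baseDir.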
