How do I loop through a page's HTML and return all substrings that begin with substring A and end with substring B?


Problem description

I'm more familiar with SQL, so I thought I would reach out for help using C#.

My objective is to call a C# script from a SQL Server SSIS package. The script should parse a webpage for downloadable links that start and end with two substrings that I know will not change.

The webpage is here: PatentsView Data Download

I'd like to find every instance in the HTML that starts with "http://www.patentsview.org/data/" and ends with ".tsv.zip". For the moment this is my main challenge. The next challenges will be 1) saving these as variables or something of the sort in SSIS, 2) downloading them, 3) unzipping them, and 4) loading them into a SQL Server database. I'm mainly focused on parsing the HTML at this point, though.

Does anyone have input on how to do this? Please keep in mind that I have never used C# before, but I have a moderate amount of experience coding in other languages.

Best
Nico

What I have tried:

I have tried using third-party SSIS components, but I believe using script tasks is the best way.

Solution

Scraping a web page is fraught with danger, because the page format could change at any time in the future and break your package. That said, HTML is a markup language closely related to XML, so pulling the links out of it is usually straightforward, although real pages are often not well-formed XML and a strict XML parser may reject them. There are also libraries available, such as Html Agility Pack (HAP), that can make your parsing life much easier.
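As a rough illustration of the parsing step, here is a minimal sketch that uses a regular expression from the .NET base class library rather than Html Agility Pack. The class and method names (`LinkExtractor`, `ExtractLinks`) are made up for this example, and it assumes the links appear in the page source as plain text without whitespace or quote characters inside them. Treat it as a starting point, not a robust HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LinkExtractor
{
    // Returns each distinct substring of html that starts with prefix
    // and ends with suffix, in the order it first appears.
    public static List<string> ExtractLinks(string html, string prefix, string suffix)
    {
        // Regex.Escape makes characters such as '.' match literally.
        // The middle part matches lazily and stops at whitespace, quotes
        // and angle brackets, so a match cannot run past one attribute value.
        string pattern = Regex.Escape(prefix) + @"[^\s""'<>]*?" + Regex.Escape(suffix);

        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var links = new List<string>();

        foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            // HashSet.Add returns false for a value that was already seen,
            // so a link repeated on the page is only reported once.
            if (seen.Add(match.Value))
            {
                links.Add(match.Value);
            }
        }

        return links;
    }
}
```

In a script task you would first read the page source into a string (for example with `WebClient.DownloadString`) and then call `LinkExtractor.ExtractLinks(html, "http://www.patentsview.org/data/", ".tsv.zip")`. If the page layout turns out to be messier than assumed here, selecting the `href` attributes with Html Agility Pack is likely the safer route.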

Once you have scraped the file URLs, download and unzip the files in the script task, then create an importer package to load the data into your database.
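The download and unzip steps could look roughly like the sketch below. The names `ZipDownloader`, `LocalFileName` and `DownloadAndExtract` are invented for this example, and the working folder is whatever path you pass in. It assumes .NET Framework 4.5 or later, where the script project may need references to `System.Net.Http` and `System.IO.Compression.FileSystem`.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Net.Http;

// Hypothetical helper, not part of any library.
public static class ZipDownloader
{
    // One shared client is enough for a batch of downloads.
    private static readonly HttpClient Client = new HttpClient();

    // Uses the last path segment of the URL as the local file name.
    public static string LocalFileName(string url)
    {
        return Path.GetFileName(new Uri(url).LocalPath);
    }

    // Downloads one archive into workDir, extracts it into a subfolder
    // named after the archive, and returns that subfolder's path.
    public static string DownloadAndExtract(string url, string workDir)
    {
        Directory.CreateDirectory(workDir);
        string zipPath = Path.Combine(workDir, LocalFileName(url));

        // Stream the response to disk instead of buffering it in memory,
        // since the archives may be large.
        using (Stream remote = Client.GetStreamAsync(url).GetAwaiter().GetResult())
        using (FileStream local = File.Create(zipPath))
        {
            remote.CopyTo(local);
        }

        // ExtractToDirectory throws if a file with the same name already
        // exists in the target folder, so clear it first when re-running.
        string extractDir = Path.Combine(workDir, Path.GetFileNameWithoutExtension(zipPath));
        ZipFile.ExtractToDirectory(zipPath, extractDir);
        return extractDir;
    }
}
```

Calling `DownloadAndExtract` once per scraped URL leaves one folder of `.tsv` files per archive, which a data flow task or a bulk insert can then pick up. Error handling and retries are left out to keep the sketch short.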

