How can I extract a URL with non-English characters from a string?

Question

Here's a simple script that takes an anchor tag with a German URL in it, and extracts the URL:

# encoding: utf-8

require 'uri'

url = URI.extract('<a href="http://www.example.com/wp content/uploads/2012/01/München.jpg">München</a>')

puts url

Output:

http://www.example.com/wp-content/uploads/2012/01/M

The extract method stops at the ü. How can I get it to work with non-English letters? I'm using ruby-1.9.3-p0.

Recommended answer

Ruby's built-in URI is useful for some things, but it's not the best choice when dealing with international characters or IDNA addresses. For that I recommend using the Addressable gem.

This is some cleaned-up IRB output:

require 'addressable/uri'
url = 'http://www.example.com/wp content/uploads/2012/01/München.jpg'
uri = Addressable::URI.parse(url)

Here's what Ruby knows now:

#<Addressable::URI:0x102c1ca20
    @uri_string = nil,
    @validation_deferred = false,
    attr_accessor :authority = nil,
    attr_accessor :host = "www.example.com",
    attr_accessor :path = "/wp content/uploads/2012/01/München.jpg",
    attr_accessor :scheme = "http",
    attr_reader :hash = nil,
    attr_reader :normalized_host = nil,
    attr_reader :normalized_path = nil,
    attr_reader :normalized_scheme = nil
>

Looking at the path, you can see it as is, or normalized to what it should be:

1.9.2-p290 :004 > uri.path            # => "/wp content/uploads/2012/01/München.jpg"
1.9.2-p290 :005 > uri.normalized_path # => "/wp%20content/uploads/2012/01/M%C3%BCnchen.jpg"
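
To make the difference between the two paths concrete, here is a rough, stdlib-only sketch of the kind of percent-encoding that produces the second form. This is not Addressable's actual algorithm (which handles more cases); the helper name `percent_encode_path` is made up for illustration, and the set of characters left untouched is a simplifying assumption.

```ruby
# Hypothetical helper, for illustration only: percent-encode every character
# in a path that is not an RFC 3986 "unreserved" character or a slash.
# Assumption: the input is a UTF-8 string that is not already percent-encoded
# (an existing "%" would be encoded again as "%25").
def percent_encode_path(path)
  path.gsub(/[^A-Za-z0-9\-._~\/]/) do |char|
    # A non-ASCII character such as "ü" is two bytes in UTF-8,
    # so it becomes two escape sequences.
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

puts percent_encode_path('/wp content/uploads/2012/01/München.jpg')
# prints: /wp%20content/uploads/2012/01/M%C3%BCnchen.jpg
```

For real URLs, Addressable's `normalized_path` remains the better choice; the sketch only shows why the space becomes `%20` and the ü becomes `%C3%BC`.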

Addressable really should be selected to replace Ruby's URI considering how the internet is moving to more complex URIs and mixed Unicode characters.

Now, getting at the string is easy too, but depends on how much text you have to look through.

If you have a full HTML document, your best bet is to use Nokogiri to parse the HTML and extract the href attributes from the <a> tags. This is where to start for a single <a>:

require 'nokogiri'
html = '<a href="http://www.example.com/wp content/uploads/2012/01/München.jpg">München</a>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)

doc.at('a')['href'] # => "http://www.example.com/wp content/uploads/2012/01/München.jpg"

Parsing using DocumentFragment avoids wrapping the fragment in the usual <html><body> tags. For a full document you'd want to use:

doc = Nokogiri::HTML.parse(html)

Here's the difference between the two:

irb(main):006:0> Nokogiri::HTML::DocumentFragment.parse(html).to_html
=> "<a href=\"http://www.example.com/wp%20content/uploads/2012/01/M%C3%BCnchen.jpg\">München</a>"

compared to:

irb(main):007:0> Nokogiri::HTML.parse(html).to_html
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"http://www.example.com/wp%20content/uploads/2012/01/M%C3%BCnchen.jpg\">München</a></body></html>\n"

So, use the second for a full HTML document, and for a small, partial chunk, use the first.

To scan an entire document, extracting all the hrefs, use:

hrefs = doc.search('a').map{ |a| a['href'] }

If you only have small strings like you show in your example, you can consider using a simple regex to isolate the needed href:

html[/href="([^"]+)"/, 1]
=> "http://www.example.com/wp content/uploads/2012/01/München.jpg"
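
The regex above returns only the first match. If a short string holds several anchors, String#scan with the same pattern collects them all. The sample string and its URLs below are made up for illustration, and the pattern still assumes double-quoted href values, so treat it as a sketch for small, well-behaved snippets rather than a general HTML parser.

```ruby
# Hypothetical sample input: two anchors in one short string.
html = '<a href="http://www.example.com/München.jpg">München</a> ' \
       '<a href="http://www.example.com/Köln.jpg">Köln</a>'

# scan returns one array per match (one element per capture group),
# so flatten turns the result into a plain list of href values.
hrefs = html.scan(/href="([^"]+)"/).flatten

puts hrefs
```

The non-ASCII characters pass through untouched because the pattern only stops at the closing double quote.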
