How to read only URLs from a txt file in MATLAB


Question

I have a text file containing multiple URLs along with other information about each URL. How can I read the txt file and save only the URLs in an array so that I can download them? I want to use

C = textscan(fileId, formatspec);

What should I specify in formatspec as the format for a URL?

Recommended answer

This is not a job for textscan; you should use regular expressions for this. In MATLAB, regexes are described here. For URLs, also refer here or here for examples in other languages.

Here's an example in MATLAB:

% Example input; in practice these strings would be read from the txt file
% (through textscan or something similar)
str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};


% find URLs    
C = regexpi(str, ...
    ['((http|https|ftp|file)://|www\.|ftp\.)',...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'], 'match');

C{:}

Result:

ans = 
    'http://www.example.com/index.php?query=test&otherStuf=info'
ans = 
    'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
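The strings in the example above are hard-coded. A minimal sketch of the same idea applied to an actual file could look like the following. The file name and its two lines are made up for illustration, and `fileread` is swapped in for textscan as one convenient way to get the text into memory; the pattern is the one from the answer.

```matlab
% Hypothetical input file, written here only so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'pre-URL garbage http://www.example.com/index.php?query=test more stuff');
fprintf(fid, '%s\n', 'other stuff ftp://localhost/home/awesomeContent.py 1 2 3 misleading://');
fclose(fid);

% Read the whole file into a single character vector.
txt = fileread(fname);

% Same pattern as in the answer above.
pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
           '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];

% With a char input, regexpi returns one flat cell array of matches.
urls = regexpi(txt, pattern, 'match');

% Clean up the temporary file.
delete(fname);
```

Because the input is one character vector rather than a cell array of lines, `urls` comes back as a single flat cell array, which is probably the most convenient shape for looping over when downloading (for example with `websave`, if your MATLAB release has it).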

Note that this regex requires the protocol to be included, or a leading www. or ftp. prefix. Something like example.com/universal_remote.cgi?redirect= is NOT matched.
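A quick way to see this limitation is to run the pattern on two otherwise identical strings, one with and one without the leading www. prefix. The two test strings below are invented for illustration.

```matlab
% Same pattern as in the answer above.
pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
           '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];

% Illustrative inputs (not from the original question).
noPrefix   = 'see example.com/universal_remote.cgi?redirect= for details';
withPrefix = 'see www.example.com/universal_remote.cgi?redirect= for details';

m1 = regexpi(noPrefix,   pattern, 'match');  % no protocol, no www./ftp. prefix
m2 = regexpi(withPrefix, pattern, 'match');  % leading www. is enough to match

foundWithoutPrefix = ~isempty(m1);
foundWithPrefix    = ~isempty(m2);
```

If bare domains like this appear in your file, the pattern would need to be loosened, at the cost of more false positives on ordinary words containing dots.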

You could go on and make the regex cover more and more cases. Eventually, however, you will run into the most important conclusion (as made here, for example, which is where I got my regex from): given the full definition of what precisely constitutes a valid URL, no single regex can match every valid URL. That is, there are valid URLs you can dream up that are not captured by any of the regexes shown.

But please keep in mind that this last statement is more theoretical than practical: those non-matchable URLs are valid but not often encountered in practice :) In other words, if your URLs have a pretty standard form, you're pretty much covered by the regex I gave you.

Now, I fooled around a bit with the Java suggestion by pm89. As I suspected, it is an order of magnitude slower than just a regex, since you introduce another "layer of goo" to the code (in my timings, it was about 40x slower, excluding the imports). Here's my version:

import java.net.URL;
import java.net.MalformedURLException;

str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};


% Attempt to convert each item into an URL.  
for ii = 1:numel(str)    
    cc = textscan(str{ii}, '%s');
    for jj = 1:numel(cc{1})
        try
            url = java.net.URL(cc{1}{jj})

        catch ME
            % rethrow any non-url related errors
            if isempty(regexpi(ME.message, 'MalformedURLException'))
                rethrow(ME);
            end

        end
    end
end

Result:

url =
    'http://www.example.com/index.php?query=test&otherStuf=info'
url =
    'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'

I'm not too familiar with java.net.URL, but apparently it is also unable to find URLs without a leading protocol or standard domain (e.g., example.com/path/to/page).

This snippet can undoubtedly be improved upon, but I would urge you to consider why you'd want to, given that this solution is longer, inherently slower and far uglier :)
