Extract URL domain root in Google Sheets
Question
In a table, I have a list of full URLs like:
https://www.example.com/page-1/product-x?utm-source=google
Objective: I want to extract only the domain name part of the URL.
I am using the following formula:
=REGEXEXTRACT(A1;"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)")
The regex works fine when I test it:
https://www.example.com/
However, in Google Sheets it displays:
example.com
- Why does the same regex give a different result?
- How can I correct it in Google Sheets?
Recommended answer
You can fix the pattern by removing the capturing group (i.e. here, ([^:\/\n?]+) => [^:\/\n?]+) or by converting the capturing group to a non-capturing one (i.e. ([^:\/\n?]+) => (?:[^:\/\n?]+)):
=REGEXEXTRACT(A1;"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?[^:/\n?]+")
=REGEXEXTRACT(A1;"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?(?:[^:/\n?]+)")
Notes:
- If the regex contains capturing group(s), REGEXEXTRACT returns the captured value(s).
- If there are no capturing groups in the regex, the function returns the whole match value only.
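The two rules above can be checked outside Sheets. Below is a minimal Python sketch, assuming that Python's re module behaves like RE2 for these particular patterns (it is a different engine). The helper regexextract is hypothetical: it only imitates the behaviour described in the notes, it is not a Google API, and it returns just the first captured group.

```python
import re

def regexextract(text, pattern):
    """Hypothetical stand-in for REGEXEXTRACT: first capture group, else whole match."""
    match = re.search(pattern, text)
    if match is None:
        return None  # assumption: Sheets reports an error here instead of a value
    # If the pattern has at least one capturing group, return the first
    # captured value; otherwise return the whole match.
    return match.group(1) if match.re.groups else match.group(0)

url = "https://www.example.com/page-1/product-x?utm-source=google"

with_group = r"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?([^:/\n?]+)"
without_group = r"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?[^:/\n?]+"

print(regexextract(url, with_group))     # example.com
print(regexextract(url, without_group))  # https://www.example.com
```

The only difference between the two patterns is the pair of parentheses around the last character class, which is what switches the result from the captured host to the whole match.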
Note that you do not need to escape / forward slashes in RE2 regexps, since in Google Sheets the pattern is written as a string literal rather than between / delimiters.
The pattern may be reduced to ^(?:https?://)?[^:/\n?]+, which optionally matches http:// or https://, and then matches one or more chars other than :, /, newline, or ?.
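To see what the reduced pattern keeps and drops, here is a short Python sketch. It assumes Python's re module matches the same text as RE2 for this pattern; domain_root and the sample URLs other than the one from the question are made up for illustration.

```python
import re

# Reduced pattern from the answer: an optional scheme, then everything up to
# the first ':', '/', newline, or '?'.
SIMPLIFIED = re.compile(r"^(?:https?://)?[^:/\n?]+")

def domain_root(url):
    """Return the scheme-plus-host prefix of url, or None if nothing matches."""
    match = SIMPLIFIED.match(url)
    return match.group(0) if match else None

samples = [
    "https://www.example.com/page-1/product-x?utm-source=google",
    "http://example.com:8080/path",  # hypothetical URL with a port
    "www.example.com?x=1",           # hypothetical URL without a scheme
]
for sample in samples:
    print(sample, "->", domain_root(sample))
```

Because the reduced pattern has no www or user-info handling, a leading www. stays in the result, and the match stops before a port number since : is excluded by the character class.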
See the RE2 regex demo.