同时在Xpath中转义双引号和单引号 [英] Simultaneously escape double and single quotes in Xpath

查看:371
本文介绍了同时在Xpath中转义双引号和单引号的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

类似于如何在xpath中处理单引号,我想转义单引号。区别在于,我不能排除双引号也可能出现在目标字符串中的可能性。

Similar to How to deal with single quote in xpath, I want to escape single quotes. The difference is that I can't exclude the possibility that a double quote might also appear in the target string.

目标:

使用Xpath(在R中)同时转义双引号和单引号。目标元素应该用作变量,而不是像现有答案之一那样被硬编码。 (它应该是一个变量,因为我事先不知道内容,它可以有单引号,双引号或两者都有。)

Escape double and single quotes simultaneously with Xpath (in R). The target element should be used as a variable and not be hard coded like in one of the existing answers. (It should be a variable, because I am unaware of the content beforehand, it could have single quotes, double quotes or both).

Works:

library(rvest)
library(magrittr)
html <- "<div>1</div><div>Father's son</div>"
target <- "Father's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
{xml_nodeset (1)}
[1] <div>Father's son</div>

不起作用:

html <- "<div>1</div><div>Fat\"her's son</div>"
target <- "Fat\"her's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
{xml_nodeset (0)}
Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
  Invalid expression [1207]



更新



Update

Non-R answers that I could try to "translate to R" are very welcome.

推荐答案

这里的关键是意识到使用xml2,您可以使用html转义字符写回经过解析的html。此功能可以解决问题。它比需要的时间更长,因为我已经添加了注释和一些类型检查/转换逻辑。

The key here is realising that with xml2 you can write back into the parsed html with html-escaped characters. This function will do the trick. It's longer than it needs to be because I've included comments and some type checking / converting logic.

contains_text <- function(node_set, find_this)
{
  # Ensure we have a nodeset
  if(all(class(node_set) == c("xml_document", "xml_node")))
    node_set %<>% xml_children()

  if(class(node_set) != "xml_nodeset")
    stop("contains_text requires an xml_nodeset or xml_document.")

  # Get all leaf nodes
  node_set %<>% xml_nodes(xpath = "//*[not(*)]")

  # HTML escape the target string
  find_this %<>% {gsub("\"", "&quot;", .)}

  # Extract, HTML escape and replace the nodes
  lapply(node_set, function(node) xml_text(node) %<>% {gsub("\"", "&quot;", .)})

  # Now we can define the xpath and extract our target nodes
  xpath <- paste0("//*[contains(text(), \"", find_this, "\")]")
  new_nodes <- html_nodes(node_set, xpath = xpath)

  # Since the underlying xml_document is passed by pointer internally,
  # we should unescape any text to leave it unaltered
  xml_text(node_set) %<>% {gsub("&quot;", "\"", .)}
  return(new_nodes)
}

现在:

library(rvest)
library(xml2)

html %>% xml2::read_html() %>% contains_text(target)
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
html %>% xml2::read_html() %>% contains_text(target) %>% xml_text()
#> [1] "Fat\"her's son"






附录


ADDENDUM

这是另一种方法,它是@Alejandro建议的方法的实现,但允许任意目标。它具有使xml文档保持不变的优点,并且比上述方法要快一些,但是涉及xml库应该防止的字符串解析。它的工作原理是获取目标,在每个 '之后将其分割,然后将每个片段括在相反的类型中引用其中包含的内容,然后再将所有内容与逗号一起粘贴并插入到XPath concatenate 函数中。

This is an alternative method, which is an implementation of the method suggested by @Alejandro but allows arbitrary targets. It has the merit of leaving the xml document untouched, and is a little faster than the above method, but involves the kind of string parsing that an xml library is supposed to prevent. It works by taking the target, splitting it after each " and ', then enclosing each fragment in the opposite type of quote to the one it contains before pasting them all back together with commas and inserting them into an XPath concatenate function.

library(stringr)

safe_xpath <- function(target)
{
  target                                 %<>%
  str_replace_all("\"", "&quot;&break;") %>%
  str_replace_all("'", "&apo;&break;")   %>%
  str_split("&break;")                   %>%
  unlist()

  safe_pieces    <- grep("(&quot;)|(&apo;)", target, invert = TRUE)
  contain_quotes <- grep("&quot;", target)
  contain_apo    <- grep("&apo;", target)

  if(length(safe_pieces) > 0) 
      target[safe_pieces] <- paste0("\"", target[safe_pieces], "\"")

  if(length(contain_quotes) > 0)
  {
    target[contain_quotes] <- paste0("'", target[contain_quotes], "'")
    target[contain_quotes] <- gsub("&quot;", "\"", target[contain_quotes])
  }

  if(length(contain_apo) > 0)
  {
    target[contain_apo] <- paste0("\"", target[contain_apo], "\"")
    target[contain_apo] <- gsub("&apo;", "'", target[contain_apo])
  }

  fragment <- paste0(target, collapse = ",")
  return(paste0("//*[contains(text(),concat(", fragment, "))]"))
}

现在我们可以生成有效的这样的xpath:

Now we can generate a valid xpath like this:

safe_xpath(target)
#> [1] "//*[contains(text(),concat('Fat\"',\"her'\",\"s son\"))]"

这样

html %>% xml2::read_html() %>% html_nodes(xpath = safe_xpath(target))
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>

这篇关于同时在Xpath中转义双引号和单引号的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆