Check for multiple words in string match for text search in R
Question
At present I have code that works for a single-word search. Can we search for multiple words and write the matched words to a data frame? (For clarification, please refer to this post.) This is akrun's solution, which works for one word. Here is the code:
library(pdftools)
library(tesseract)
library(stringi)  # provides stri_detect_fixed()

All_files <- Sys.glob("*.pdf")
v1 <- numeric(length(All_files))  # first matching page per file (0 = no match)
word <- "school"
df <- data.frame()
Status <- "Present"
for (i in seq_along(All_files)) {
  file_name <- All_files[i]
  cnt <- pdf_info(All_files[i])$pages
  print(cnt)
  for (j in seq_len(cnt)) {
    # Convert one page to TIFF and run OCR on it
    img_file <- pdftools::pdf_convert(All_files[i], format = 'tiff', pages = j, dpi = 400)
    text <- ocr(img_file)
    ocr_text <- capture.output(cat(text))
    check <- sapply(ocr_text, paste, collapse = "")
    # NOTE: `path` is not defined in this snippet; it must be set beforehand
    junk <- dir(path = paste0(path, "/tiff"), pattern = "tiff")
    file.remove(junk)
    br <- if (length(which(stri_detect_fixed(tolower(check), tolower(word)))) <= 0) "Not Present" else "Present"
    print(br)
    if (br == "Present") {
      v1[i] <- j
      break
    }
  }
  Status <- if (v1[i] == 0) "Not Present" else "Present"
  pages <- if (v1[i] == 0) "-" else
    paste0(tools::file_path_sans_ext(basename(file_name)), "_", v1[i])
  words <- if (v1[i] == 0) "-" else word
  df <- rbind(df, cbind(file_name = basename(file_name),
                        Status, pages = pages, words = words))
}
Here we are searching for only one word, "school". Can we search for multiple words, such as "school", "gym" and "swimming pool"?
Expected output
fileName   Status       Page     Words          TEXT
test.pdf   Present      test_1   gym            I go gym regularly
test.pdf   Present      test_3   school         Here is the next school
test1.pdf  Present      test1_4  swimming pool  In swimming pool
test1.pdf  Present      test1_7  gym            next to Gold gym
test2.pdf  Not Present  -        -
fileName = the name of the file
Status = "Present" if any word is found, else "Not Present"
Page = the suffixes "_1", "_3" give the page number on which the word was found; for example, "gym" was found on page "test_1" and "school" on page "test_3".
Words = the words that were found; for example, only "gym" and "school" were found, on pages 1 and 3 of test.pdf, and only "swimming pool" and "gym" were found, on pages 4 and 7 of test1.pdf.
TEXT = the text in which the word was found
Any suggestions would be helpful.
Thanks
Recommended answer
You go through every PDF in your directory with the outer loop. Then you go through all pages of the PDF and extract the text in the inner loop. For every document, you want to check whether at least one page contains "school", "gym" or "swimming pool". The values you want returned are:
- a vector with one element per PDF document, containing either "Present" or "Not present".
- three vectors of strings, containing information on which word occurs where.
Right?
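To make those return values concrete, here is a minimal sketch that runs the check on hand-written page texts instead of OCR output. The sample vector `pages_text` and the helper name `find_words` are illustrative assumptions and do not appear in the original code.

```r
# Hypothetical page texts for one document (one element per page);
# in the real workflow these would come from ocr().
pages_text <- c("i go gym regularly", "nothing here", "here is the next school")
strings <- c("school", "gym", "swimming pool")

# For each search string, return the indices of the pages on which it occurs.
# fixed = TRUE treats the string literally rather than as a regular expression.
find_words <- function(pages_text, strings) {
  hits <- lapply(strings, function(s) grep(s, pages_text, fixed = TRUE))
  names(hits) <- strings
  hits
}

hits <- find_words(pages_text, strings)

# Document-level status and the subset of words that were found
status <- if (any(lengths(hits) > 0)) "Present" else "Not present"
found <- names(hits)[lengths(hits) > 0]
```

Because grep() returns page indices, the same result gives both the "which word" and the "which page" information.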
You can skip a couple of steps in your loop, especially when converting PDFs to TIFFs and reading the text from them with ocr:
library(pdftools)
library(tesseract)

all_files <- Sys.glob("*.pdf")
strings <- c("school", "gym", "swimming pool")

# Read text from pdfs: one character vector per file, one element per page
texts <- lapply(all_files, function(x) {
  img_file <- pdf_convert(x, format = "tiff", dpi = 400)
  tolower(ocr(img_file))
})

# Check for presence of each word in strings
pages <- words <- vector("list", length(texts))
for (d in seq_along(texts)) {
  for (w in seq_along(strings)) {
    intermed <- grep(strings[w], texts[[d]])  # indices of matching pages
    words[[d]] <- c(words[[d]], strings[w][length(intermed) > 0])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}

# Organize data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))
Page <- Map(paste0, fileName, "_", pages, collapse = ", ")
Page[lengths(pages) == 0] <- "-"  # files with no matching page
Page <- t(data.frame(Page))
Words <- sapply(words, paste0, collapse = ", ")
Status <- ifelse(sapply(Words, nchar) > 0, "Present", "Not present")
data.frame(row.names = fileName, Status = Status, Page = Page, Words = Words)
#       Status   Page                                    Words
# pdf1  Present  pdf1_1, pdf1_2                          gym, swimming pool
# pdf2  Present  pdf2_2, pdf2_5, pdf2_8, pdf2_3, pdf2_6  school, gym, swimming pool
It's not as readable as I'd like it to be, probably because the small requirements on the output format call for minor intermediate steps that make the code look a bit chaotic. It works well, though.
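If one row per match is preferred, as in the expected output in the question, the matching step could be written against a list of page texts. The sketch below is only an illustration under assumptions: the OCR text is taken to be already available as a named list (`texts` here is made-up sample data), the helper name `match_table` is hypothetical, and the TEXT column simply holds the whole page text, since the question does not say how much context is wanted.

```r
# Build one row per (file, page, word) match; files with no match get one "-" row.
match_table <- function(texts, strings) {
  rows <- list()
  for (f in names(texts)) {
    base <- tools::file_path_sans_ext(basename(f))
    file_rows <- list()
    for (p in seq_along(texts[[f]])) {
      page <- tolower(texts[[f]][p])
      for (s in strings) {
        # Literal, case-insensitive match of the search string on this page
        if (grepl(tolower(s), page, fixed = TRUE)) {
          file_rows[[length(file_rows) + 1]] <- data.frame(
            fileName = f, Status = "Present",
            Page = paste0(base, "_", p), Words = s,
            TEXT = texts[[f]][p], stringsAsFactors = FALSE)
        }
      }
    }
    if (length(file_rows) == 0) {
      file_rows[[1]] <- data.frame(
        fileName = f, Status = "Not Present",
        Page = "-", Words = "-", TEXT = "", stringsAsFactors = FALSE)
    }
    rows <- c(rows, file_rows)
  }
  do.call(rbind, rows)
}

# Hypothetical sample data standing in for OCR output (one element per page)
texts <- list(
  "test.pdf"  = c("I go gym regularly", "", "Here is the next school"),
  "test2.pdf" = c("nothing relevant")
)
res <- match_table(texts, c("school", "gym", "swimming pool"))
print(res)
```

Growing a list and calling rbind once at the end keeps the loop simple; for many files, a vectorised approach may be preferable.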