How do I process a huge text file consisting of millions of sentences and get the appropriate sentences that contain a given word as input?


Problem description

I used a Trie data structure to fetch sentences when given any word as input.
It works at a good speed for a text file of a few GB. The problem is that when I have a huge text file, the time complexity grows and fetching sentences takes much longer.

So I decided to split the given huge text file into 26 small files, where each file contains the sentences starting with one letter of the alphabet.

Let's take the sentences below as an example.

file1 consists of only the sentences starting with the letter C:

cricket is one of my favorite sports

file2 consists of only the sentences starting with the letter I:

i am watching cricket

file3 consists of only the sentences starting with the letter P:

playing cricket is my hobby

file4 consists of only the sentences starting with the letter W:

we dont miss ipl matches anytime

Now, the problem is this: if I give a word such as "cricket",
it will display the sentences from file1, since that file has all the sentences starting with the letter C. But file2 and file3 also have the word "cricket" in their sentences.
How do I solve this problem?
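
For reference, a minimal sketch of the split-into-26-files step described above, assuming Python (the directory and file names are made up for this example):

import os
import string

def split_into_letter_files(sentences, out_dir="buckets"):
    # Append each sentence to a file named after its first letter.
    os.makedirs(out_dir, exist_ok=True)
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        first = sentence[0].lower()
        # Sentences not starting with a-z go into a catch-all bucket.
        name = first if first in string.ascii_lowercase else "other"
        with open(os.path.join(out_dir, f"file_{name}.txt"), "a", encoding="utf-8") as f:
            f.write(sentence + "\n")

split_into_letter_files([
    "cricket is one of my favorite sports",
    "i am watching cricket",
    "playing cricket is my hobby",
    "we dont miss ipl matches anytime",
])
# Searching only file_c.txt for "cricket" misses the sentences that ended up
# in file_i.txt and file_p.txt, which is exactly the problem described above.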

What I have tried:

Let's take a few sentences that are in a file, say:

i am watching cricket. playing cricket is my hobby. cricket is one of my favorite sports. we dont miss ipl matches anytime.

Now we have to separate these sentences; I used the dot (full stop) to split them.
It showed something like this:

i am watching cricket
playing cricket is my hobby
cricket is one of my favorite sports
we dont miss ipl matches anytime
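
A minimal sketch of that dot-based splitting, assuming Python (it reproduces the list above):

text = ("i am watching cricket. playing cricket is my hobby. "
        "cricket is one of my favorite sports. we dont miss ipl matches anytime.")

# Split on the full stop, then drop empty fragments and surrounding spaces.
sentences = [s.strip() for s in text.split(".") if s.strip()]
for s in sentences:
    print(s)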

Solution

The way I would do it is different: I'd create a "mapping" file which contains indexes into the big file. Probably two sets of indexes.
The first set would be the indexes and lengths of each sentence (plus a "Sentence ID" value to individually identify each sentence).
The second set would be each word. Or rather, each different word, with the Sentence ID it appears in and the offset within the sentence.
So entries might look like this:
the     1,0;1,93;3,0;3,116;4,0;5,37;5,72;5,90
way     1,4
i       1,8
...

in the sentences that start this reply.
When you want to find a word, you just look in the "Words" mapping table, and it tells you immediately whether it is in the text, which sentence it is in, and the offset within the sentence where it's located.
Since the number of words in any language is fairly small (the average active vocabulary of a native English speaker is 20,000 words), this should be a lot quicker and easier to work with than parsing the whole text each time, and it only needs to be created (or modified) when your input text file changes.
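
A minimal sketch of this mapping idea, assuming Python and keeping both index sets as in-memory dictionaries (the on-disk file format is left out):

from collections import defaultdict

def build_indexes(sentences):
    # First set: sentence ID -> (offset of the sentence in the big file, length).
    sentence_index = {}
    # Second set: word -> list of (sentence ID, offset of the word within the sentence).
    word_index = defaultdict(list)
    file_offset = 0
    for sid, sentence in enumerate(sentences, start=1):
        sentence_index[sid] = (file_offset, len(sentence))
        pos = 0
        for word in sentence.split():
            pos = sentence.find(word, pos)
            word_index[word.lower()].append((sid, pos))
            pos += len(word)
        file_offset += len(sentence) + 1   # +1 for the separator
    return sentence_index, word_index

sentence_index, word_index = build_indexes([
    "i am watching cricket",
    "playing cricket is my hobby",
    "cricket is one of my favorite sports",
    "we dont miss ipl matches anytime",
])
print(word_index["cricket"])   # [(1, 14), (2, 8), (3, 0)] -- every sentence containing the word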



I guess you have indexed a file, a set of files, or anything that is consistent with your needs, the index being there to find answers fast.
Quote:


The problem is that when I have a huge text file, the time complexity grows and fetching sentences takes much longer.

This sentence translates to "I designed a complicated workaround to avoid hitting a design flaw in my search routine". You should explain why you have a time complexity problem with huge files.

By building the index with a set of information for each word ({Word, FileName, WordOffsetInFile} or {Word, FileName, SentenceOffsetInFile}), the time complexity to retrieve a sentence is basically the sentence length.
I won't detail the various techniques used to compress the index here.
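
As an illustration of that retrieval cost, a minimal sketch assuming Python; only the lookup side is shown, and the index contents and file offset are made-up examples:

def fetch_sentences(word, index):
    # index maps a word to a list of (filename, byte offset of the sentence).
    results = []
    for filename, offset in index.get(word.lower(), []):
        with open(filename, "rb") as f:
            f.seek(offset)                      # jump straight to the sentence
            results.append(f.readline().decode("utf-8").rstrip("\n"))
    return results

# Hypothetical index entry: "cricket" occurs in a sentence stored at byte 22 of big.txt.
index = {"cricket": [("big.txt", 22)]}
# fetch_sentences("cricket", index) does work roughly proportional to the sentence length.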

Quote:

How do I solve this problem?

To solve your problem, you need to explain why you have a time complexity problem, and probably show the related code.
My solution: don't split the huge file, fix the flaw in your code.

