Hashset handling to avoid stuck in loop during iteration


Problem Description


I'm working on an image mining project, and I used a HashSet instead of an array to avoid adding duplicate URLs while gathering them. I reached the point in the code where I iterate the HashSet that contains the main URLs; within the iteration I download the page of each main URL, add the URLs found there to the HashSet, and go on. During the iteration I should exclude every scanned URL, and also exclude (remove) every URL that ends with jpg, until the count of the URL HashSet reaches 0. The problem is that I hit endless looping in this iteration, where I may get a URL (let's call it X):


1. I scan the page of URL X
2. I get all URLs of page X (by applying filters)
3. I add the URLs to the HashSet using UnionWith
4. I remove the scanned URL X

The problem arises when one of those URLs, Y, brings X back again when it is scanned.


Shall I use a Dictionary, with the key marked as "scanned"? I will try it and post the result here; sorry, it came to my mind only after I posted the question...


I managed to solve it for one URL, but it seems this happens with other URLs too, generating a loop. So how do I handle the HashSet to avoid duplicates even after removing the links? I hope my point is clear.

    while (URL_Can.Count != 0)
    {
        tempURL = URL_Can.First();

        if (tempURL.EndsWith("jpg"))
        {
            URL_CanToSave.Add(tempURL);
            URL_Can.Remove(tempURL);
        }
        else
        {
            if (ExtractUrlsfromLink(client, tempURL, filterlink1).Contains(toAvoidLoopinLinks))
            {
                URL_Can.Remove(tempURL);
                URL_Can.Remove(toAvoidLoopinLinks);
            }
            else
            {
                URL_Can.UnionWith(ExtractUrlsfromLink(client, tempURL, filterlink1));
                URL_Can.UnionWith(ExtractUrlsfromLink(client, tempURL, filterlink2));
                URL_Can.Remove(tempURL);

                richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
            }
        }

        toAvoidLoopinLinks = tempURL;
    }
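The loop above never terminates because the HashSet forgets a URL the moment it is removed: when page Y links back to an already-processed X, UnionWith happily re-adds X. A minimal sketch (Python, with a hypothetical two-page link graph; the names are illustrative, not the asker's real code) shows how keeping a separate set of already-seen URLs fixes this:

```python
# Hypothetical link graph: X links to Y, and Y links back to X -- the
# exact cycle described in the question.
links = {"X": ["Y"], "Y": ["X"]}

def crawl_with_seen(start):
    pending = {start}
    seen = set()          # remembers URLs even after they leave `pending`
    order = []
    while pending:
        url = pending.pop()
        seen.add(url)
        order.append(url)
        # only enqueue URLs never seen before, so Y cannot re-add X
        pending.update(u for u in links.get(url, []) if u not in seen)
    return order

print(crawl_with_seen("X"))  # → ['X', 'Y']: each URL is scanned exactly once
```

The single-variable `toAvoidLoopinLinks` trick only remembers the most recently scanned URL, which is why it fixed one URL but not the general case; a `seen` set remembers all of them.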

Answer


Thanks to all. I managed to solve this issue by using a Dictionary instead of a HashSet, using the key to hold the URL and the value to hold an int: 1 if the URL has been scanned, 0 if it has not been processed yet. My code is below. I used another Dictionary, URL_CanToSave, to hold the URLs that end with jpg (my target). The while loop runs until all the URLs of the website run out, based on the filter string variables you use to parse the URLs.


So, to break the loop, you can specify the number of image URLs to collect in URL_CanToSave.

    return Task.Factory.StartNew(() =>
    {
        try
        {
            string tempURL;
            int i = 0;

            // The Dictionary value is 1 or 0 (1 means scanned, 0 means not yet).
            // Iterate until all Dictionary keys are scanned, or break in the
            // middle based on how many image URLs you collected in the other Dictionary.
            while (URL_Can.Values.Where(value => value.Equals(0)).Any())
            {
                // Take one key and put it in a temp variable.
                tempURL = URL_Can.ElementAt(i).Key;

                // Check if it ends with the target file extension, in this case an image file.
                if (tempURL.EndsWith("jpg"))
                {
                    URL_CanToSave.Add(tempURL, 0);
                    URL_Can.Remove(tempURL);
                }
                // If it is not an image, download the page behind the URL and keep analyzing.
                else
                {
                    // Only process the URL if it was not scanned before.
                    if (URL_Can[tempURL] != 1)
                    {
                        // This looks a little complex. Add2Dic is a process that adds
                        // entries to a Dictionary without adding a key that already
                        // exists (solving the main problem!). ExtractUrlsfromLink is
                        // another process that downloads the document string of the URL,
                        // analyzes it, and returns a Dictionary with all links found;
                        // you can add or remove filter strings based on your analysis.
                        Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink1), URL_Can, false);
                        Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink2), URL_Can, false);

                        URL_Can[tempURL] = 1;  // mark it as a scanned link

                        richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
                    }
                }

                statusStrip1.PerformSafely(() => toolStripProgressBar1.PerformStep());

                // The other trick: keep this iteration going until it has scanned all gathered links.
                i++;
                if (i >= URL_Can.Count) { i = 0; }

                if (URL_CanToSave.Count >= 150) { break; }
            }

            richTextBox2.PerformSafely(() => richTextBox2.Clear());
            textBox1.PerformSafely(() => textBox1.Text = URL_Can.Count.ToString());

            return ProcessCompleted = true;
        }
        catch (Exception aih)
        {
            MessageBox.Show(aih.Message);
            return ProcessCompleted = false;
        }
    });
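The scheme in the answer — a dictionary mapping each URL to a scanned flag, plus an add-without-overwrite helper in the spirit of the Add2Dic mentioned above — can be sketched as follows (Python, with an illustrative in-memory link graph; the real code downloads and parses pages instead):

```python
# add_new plays the role of Add2Dic: it never re-adds an existing key,
# so a URL already flagged as scanned (1) can never be reset to pending (0).
def add_new(found, url_can):
    for url in found:
        if url not in url_can:     # existing keys keep their scanned flag
            url_can[url] = 0

# Hypothetical site: X links to Y, Y links back to X (the looping case).
links = {"X": ["Y"], "Y": ["X"]}

url_can = {"X": 0}                 # URL -> 0 (pending) / 1 (scanned)
scans = 0

while any(flag == 0 for flag in url_can.values()):
    url = next(u for u, flag in url_can.items() if flag == 0)
    add_new(links.get(url, []), url_can)
    url_can[url] = 1               # mark as scanned; never removed, never re-added
    scans += 1

print(scans)  # → 2: X and Y are each scanned once, despite the X <-> Y cycle
```

Because scanned URLs stay in the dictionary with a flag instead of being removed, re-discovering them is harmless, which is exactly what the HashSet version could not guarantee.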

