Python/Django: Extract and append only new links


Question

I am putting together a project using Python 2.7 and Django 1.5 on Windows 7. I have the following view:

views.py:

import re
import urllib2
import urlparse

from bs4 import BeautifulSoup
from django.core.paginator import Paginator, EmptyPage, PageNotAnInteger
from django.shortcuts import render_to_response

def foo():
    site = "http://www.foo.com/portal/jobs"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    for tag in soup.find_all('a', href=True):
        tag['href'] = urlparse.urljoin('http://www.businessghana.com/portal/', tag['href'])
    return map(str, soup.find_all('a', href=re.compile('.getJobInfo')))

def example():
    site = "http://example.com"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    return map(str, soup.find_all('a', href=re.compile('.display-job')))

foo_links = foo()
example_links = example()

def all_links():
    return (foo_links + example_links)

def display_links(request):
    name = all_links()
    paginator = Paginator(name, 25)
    page = request.GET.get('page')
    try:
        name = paginator.page(page)
    except PageNotAnInteger:
        name = paginator.page(1)
    except EmptyPage:
        name = paginator.page(paginator.num_pages)

    return render_to_response('jobs.html', {'name' : name})    

My template looks like this:

<ol>
{% for link in name %}
  <li> {{ link|safe }}</li>
{% endfor %}
 </ol>
 <div class="pagination">
<span class= "step-links">
    {% if name.has_previous %}
        <a href="?page={{ name.previous_page_number }}">Previous</a>
    {% endif %}

    <span class="current">
        Page {{ name.number }} of {{ name.paginator.num_pages }}.
    </span>

    {% if name.has_next %}
        <a href="?page={{ name.next_page_number }}">next</a>
    {% endif %}
</span>
 </div>

Right now, as my code stands, every time I run it, it scrapes all the links on the front pages of the selected sites and presents them paginated afresh. I don't think it's a good idea for the script to read/write all the previously extracted links over again, so I would like to check for and append only new links. I would like to save the previously scraped links so that, over the course of say a week, all the links that have appeared on the front pages of these sites will be available on my site as older pages.

This is my first programming project, and I don't know how to incorporate this logic into my code.

Update:

My model looks like this:

from django.db import models

class jobLinks(models.Model):
    links = models.URLField()
    pub_date = models.DateTimeField('date retrieved')

    def __unicode__(self):
        return self.links

Any help/pointers/references will be greatly appreciated.

regards,
Max

Answer

I'd recommend building a URL table with a date field to sort by, so that your most recent URLs are listed first, which is what you described trying to do with the pagination. Your URL table might look like this:

models.py:

class URL_Table(models.Model):
    date = models.DateField(auto_now_add=True)  # the correct kwarg is auto_now_add, not auto_add_now
    url = models.URLField()

You can sort by date descending like so, and link this to your views in views.py:

urls = URL_Table.objects.order_by('-date')

You can then reference this table to see whether a URL already exists. If it is a new URL, save it to the table.

You could also override get() in your views.py view function to do something when the page loads, or build a custom model method that performs URL maintenance only when the URLs are more than one week old, using django.utils.timezone or Python's datetime.datetime.
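Stripped of the ORM, that "older than one week" check is plain datetime arithmetic. A minimal sketch (the function name and 7-day threshold are illustrative, not part of the original code):

```python
import datetime

def is_older_than_a_week(pub_date, now=None):
    """Return True if a stored link's timestamp is more than 7 days old."""
    now = now or datetime.datetime.now()
    return (now - pub_date) > datetime.timedelta(days=7)
```

In a real Django view you would pass django.utils.timezone.now() for `now` so the comparison stays timezone-aware.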

Update:

If you want to check for links that are already saved to your table and only save the new ones, fetch all of your stored links and compare your new links against them. You can choose to show only links created in the last week using a timedelta. So I would recommend two functions here.

Use this function to check for new links and save only the new ones:

from django.utils import timezone  # for timestamping new rows

def save_new_links(all_links):
    # compare against the stored URL strings, not model instances
    current_links = set(jobLinks.objects.values_list('links', flat=True))
    for link in all_links:
        if link not in current_links:
            jobLinks.objects.create(links=link, pub_date=timezone.now())

Then fetch all links from the last week using a timedelta:

def this_weeks_links():
    # filter against a concrete cutoff datetime; a bare timedelta can't be compared to pub_date
    week_ago = timezone.now() - datetime.timedelta(days=7)
    return jobLinks.objects.filter(pub_date__gte=week_ago)

Then insert these functions into your view to (1) only save the new links, and (2) only display, on your first page, links saved in the last week.
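The combined logic — append only unseen URLs, then show the last week's worth — can be sketched without the ORM as plain Python over (url, timestamp) pairs (the names here are illustrative, not part of the original code):

```python
import datetime

def merge_new_links(stored, scraped, now=None):
    """Append only URLs not already stored, timestamping each new one."""
    now = now or datetime.datetime.now()
    known = {url for url, _ in stored}
    for url in scraped:
        if url not in known:
            stored.append((url, now))
            known.add(url)
    return stored

def last_weeks_links(stored, now=None):
    """Return URLs whose timestamp falls within the last 7 days."""
    now = now or datetime.datetime.now()
    cutoff = now - datetime.timedelta(days=7)
    return [url for url, ts in stored if ts >= cutoff]
```

With the ORM, merge_new_links plays the role of save_new_links above, and last_weeks_links the role of the pub_date__gte filter.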

Good luck!
