It's all about the logic: findall posts & the corresponding threads - on vBulletin


Problem Description


Main Goal

At the end we want to have all the threads (and discourses) our demo user is involved in.

(Note: This means that we should keep in mind a nice presentation of the gathered results.)

Details

To work out the logic that enables us to use this technique on any vBulletin board running version 3.8.x, we chose a demo page [which is only an example with an open board - visible to anybody without registration].

There is no interest in gathering the data as such: the main interest is to find out the logic of getting the full discourses one user of a (vBulletin) board is involved in. We have to start from the posts and go to the threads... in order to get the full conversations.

For testing purposes we chose a free board with open access to its structure. So here we want to create a minimal reproducible example (cf. https://stackoverflow.com/help/minimal-reproducible-example). The URL doesn't matter, just the content of the HTML; I made the example as small as possible while still exhibiting the problem I am aiming to solve.

Starting Point

We take a vBulletin board (version 3.8.x) as an example - see the page https://forums.sagetv.com/forums/ (note: no login necessary!).

...then choose one single author (user) of this board - just pick one, for example https://forums.sagetv.com/forums/member.php?u=4793 (we may pick any other).

Look for "show all statistics":

Total Posts
    Total Posts: 4,406
    Posts Per Day: 0.78
    Find all posts by nyplayer
    Find all threads started by nyplayer

...and then you get a starting point: the page with the postings, "Find all posts by nyplayer" - https://forums.sagetv.com/forums/search.php?searchid=15505533

and now we have a page that shows results 1 to 25 of xyz postings:

"Showing results 1 to 25 of xyz postings"

...and now we need to pick all the posts - and besides that: the whole thread in which the user nyplayer is one poster among others. At the end we get all the threads and discourses our example user nyplayer is involved in.

Notice the difference: we are aiming for all the discourses nyplayer (our demo user) is involved in. We are not aiming to get (only) the threads he started. This little condition makes the task a bit trickier to solve - and besides that: I guess we need to find a good method to store and display the gathered data. Perhaps CSV is a good idea, so that we can work with the results...

Getting everything includes skipping from page to page: having worked through the first page, where we gathered all the threads nyplayer is involved in, we can go ahead to the next page - and the next... until we reach the end of the pages on which postings of nyplayer are displayed.
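(A rough, untested sketch of that page loop - the pp and page parameters and the "Page 1 of N" pagination cell are taken from the board's own result pages; the searchid value is only a placeholder:)

import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://forums.sagetv.com/forums/search.php'
payload = {'searchid': '15505533', 'pp': '25'}  # placeholder searchid, see above

soup = BeautifulSoup(requests.get(url, headers=headers, params=payload).text, 'html.parser')
# the pagination cell reads e.g. "Page 1 of 177" - the last number is the page count
pager = soup.find('td', {'class': 'vbmenu_control'}, text=re.compile('^Page 1 of'))
total_pages = int(pager.text.split('of ')[-1]) if pager else 1

for page in range(2, total_pages + 1):  # page 1 is already in hand
    payload['page'] = str(page)
    soup = BeautifulSoup(requests.get(url, headers=headers, params=payload).text, 'html.parser')
    # ...collect the thread links of this page here...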

Note: I have added some images to illustrate the two main tasks to solve: a. gathering all the threads of a certain author - note, as mentioned above, this is more than only getting the threads he has started. Main goal: getting the whole (!) threads a certain author is involved in. This of course includes going through all the pages (see the attached images). That's just all.

A starting point could be the overview of postings of any user of this example board... - from there we could derive the general logic...

...and best would be to fetch all the threads.

Task

So the job is:

  1. first of all, we need to find all the threads that contain postings of our certain user; and once we have the threads (discourses) of the first page...
  2. ...skip to the next page - and the next, and the next

Main goal: at the end we have all the threads (and discourses) our demo user is involved in, working out the logic so that we can use this technique on any vBulletin running version 3.8.x. As mentioned above, it is all about the logic: so here we just want to create a minimal reproducible example (cf. https://stackoverflow.com/help/minimal-reproducible-example). The URL doesn't matter, just the content of the HTML; I made the example as small as possible while still exhibiting the problem I am aiming to solve.

The coding parts that I have found so far:

Steps

  1. gathering the postings and the threads (!) our demo user is involved in

threads = soup.find_all('tr', attrs={'class' : 'alt1'})

Each search-result row is a tr element whose cells are td class="alt1"; a rough completion follows below.
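(Extending that line a little: each of those rows links to its thread via an href containing t=<threadId>. A rough, untested sketch for collecting the distinct thread ids:)

thread_ids = set()
for row in soup.find_all('tr', attrs={'class': 'alt1'}):
    for a in row.find_all('a', href=True):
        if 't=' in a['href']:  # e.g. showthread.php?t=12345
            thread_ids.add(a['href'].split('t=')[-1])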

  2. Showing results 1 to 25 of xyz postings

Posted by xyz means that we have got a certain amount of (for example) xyz postings (situated in various threads):

 td class="vbmenu_control"
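(The counts can be read from that very cell; a small untested sketch - I am assuming the "Showing results x to y of z" line sits in such a td, as the page source suggests:)

import re

# e.g. "Showing results 1 to 25 of 4406" -> 4406
control = soup.find('td', {'class': 'vbmenu_control'},
                    text=re.compile('^Showing results'))
if control is not None:
    total_posts = int(control.text.replace(',', '').split(' of ')[-1].split()[0])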

Finally, the last question is: how to store all the gathered data? Which format should be used?! That's another question... but I am pretty sure some nice techniques are available here...

Looking forward to hearing from you!

Solution

See if this gets you started.

You can make some functions that pull out the thread ids using the post ids, then iterate through the thread-id pages and parse the data. I'm not really going to spend too much more time on this. You could possibly use the comments in the html as well to pull out some of the sections, but I think this is more or less the thought process you are looking for.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re


headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'}
userId = 4793


def get_user_stats(userId):
    # Scrape the statistics block on the member page into a {name: value} dict.
    url = 'https://forums.sagetv.com/forums/member.php'
    payload = {'u':f'{userId}'}
    
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    stats_data = {}
    stats = soup.find('fieldset',{'class':"statistics_group"}).find_all('li')
    for each in stats:
        values = each.text.replace(',','').split(':')
        if len(values) == 2:
            key, value = ''.join(values[0].split()), float(values[1])
            stats_data[key] = value
    return stats_data


def get_searchId(userId):
    # 'Find all posts by user' creates a session-bound searchid; pull it out of
    # the pagination links on the first page of results.
    url = 'https://forums.sagetv.com/forums/search.php'
    payload = {'do':'finduser',
               'u':f'{userId}'}
    
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    searchId = soup.find('td',{'class':'vbmenu_control'}, text=re.compile("^Page 1 of")).find_next('a')['href'].split('searchid=')[-1].split('&')[0]
    return searchId

def get_page_threadIds(threadId_list, soup):
    # Each post table on a search-result page links to its thread via an
    # href containing 't=<threadId>'.
    postIds = soup.find_all('table',{'id':re.compile("^post")})
    for each in postIds:
        a_s = each.find_all('a', href=True)
        for alpha in a_s:
            if 't=' in alpha['href']:
                threadId = alpha['href'].split('t=')[-1]
                if threadId not in threadId_list:
                    threadId_list.append(threadId)
    return threadId_list


def get_all_threadIds(searchId):
    # Walk every page of the search results and collect the distinct thread ids.
    threadId_list = []
    url = 'https://forums.sagetv.com/forums/search.php'
    payload = {'searchid':'%s' %searchId,
               'pp':'200'}

    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    total_pages = int(soup.find('td',{'class':'vbmenu_control'}, text=re.compile("^Page 1 of")).text.split('of ')[-1])
    
    threadId_list = get_page_threadIds(threadId_list, soup)
    for page in range(2, total_pages+1):
        payload.update({'page': '%s' %page})
        response = requests.get(url, headers=headers, params=payload)
        soup = BeautifulSoup(response.text, 'html.parser')
        # get_page_threadIds mutates threadId_list in place; using '+=' with its
        # own return value would append the list to itself on every page
        get_page_threadIds(threadId_list, soup)
    return list(set(threadId_list))
        
        
        
stats = get_user_stats(userId)
searchId = get_searchId(userId)
threadId_list = get_all_threadIds(searchId)   



# For every thread the user appears in, walk all of its pages and capture every post.
rows = []
for threadId in threadId_list:
    url = 'https://forums.sagetv.com/forums/showthread.php'
    payload = {'t':'%s' %threadId,
               'pp':'40',
               'page':'1'}
    
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    try:
        total_pages = int(soup.find('td',{'class':'vbmenu_control'}, text=re.compile("^Page 1 of")).text.split('of ')[-1])
    except AttributeError:
        # single-page threads have no 'Page 1 of N' pagination cell
        total_pages = 1
    
    for page in range(1,total_pages+1):
        payload.update({'page':'%s' %page})
        response = requests.get(url, headers=headers, params=payload)
        soup = BeautifulSoup(response.text, 'html.parser')
        discussion = soup.find('td',{'class':'navbar'}).text.strip()
        posts = soup.find_all('table',{'id':re.compile("^post")})
        for post in posts:
            dateStr = post.find('td',{'class':'thead'}).text.split()
            postNo = dateStr[0]
            dateStr = ' '.join(dateStr[1:])
            
            postername = post.find('a',{'class':'bigusername'}).text
            joinDate = post.find('div', text=re.compile("^Join Date:")).text.split('Join Date:')[-1].strip()
            try:
                location = post.find('div', text=re.compile("^Location:")).text.split('Location:')[-1].strip()
            except AttributeError:
                # not every poster fills in a location
                location = 'N/A'
            postNum = post.find('div', text=re.compile(".*Posts:")).text.split('Posts:')[-1].replace(',','').strip()
            message = post.find('div',{'id':re.compile("^post_message_")}).text.strip()
            
            row = {'date':dateStr,
                   'postNumber':postNo,
                   'poster':postername,
                   'joinDate':joinDate,
                   'location':location,
                   'number of posts':postNum,
                   'thread':discussion,
                   'thread id':threadId,
                   'message':message}
            rows.append(row)
            
        print('Collected: %s - Page %s of %s' % (discussion, page, total_pages))


df = pd.DataFrame(rows)


print (stats)
print(df)
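As for the last question above (how to store the gathered data): once everything sits in the DataFrame, writing it out as CSV takes one line - a minimal sketch, with an arbitrary filename:

df.to_csv('nyplayer_posts.csv', index=False)
# read it back later to keep working with the results:
# df = pd.read_csv('nyplayer_posts.csv')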

