It's all about the logic: find all posts & the corresponding threads - on vBulletin


Question



Main Goal

At the end we have all the threads (and discourses) our demo-user is involved in.

(Note: This means that we should keep in mind a nice presentation of the gathered results.)

Details

We want to work out the logic that enables us to use this technique on any vBulletin board running version 3.8.x. We choose a demo-page (which is only an example of an open board - visible to anybody without registration).

There is no interest in gathering the data itself: the main interest is to find out the logic: getting the full discourses one user of a board (vBulletin) is involved in. We have to start from the posts, and go to the threads... in order to get the full conversations.
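The post-to-thread step described here can be sketched in a few lines. Note: the URL shape is an assumption modelled on vBulletin 3.8 `showthread.php?t=...&p=...` post links; a real board should be checked against its own markup.

```python
from urllib.parse import urlparse, parse_qs

def thread_id_from_post_link(href):
    """Map a post link back to the thread that contains it (t= parameter)."""
    query = parse_qs(urlparse(href).query)
    return query.get('t', [None])[0]

# A post in the search results typically links to its thread like this:
post_href = 'showthread.php?t=12345&p=67890#post67890'
print(thread_id_from_post_link(post_href))  # -> '12345'
```

This is the core of "start from posts, go to threads": every post link carries the id of its parent thread.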

For testing purposes we choose a free board - with open access to the structure. So here we want to create a minimal reproducible example (MRE, cf. https://stackoverflow.com/help/minimal-reproducible-example). The URL doesn't matter, just the content of the HTML. I just made the example as small as possible while still exhibiting the problem I am aiming to solve.

Starting Point

We take a vBulletin (version 3.8.x) as an example-board - see the page: https://forums.sagetv.com/forums/ (note - no login necessary!)

...then choose one single author (user) of this board - just pick one as an example: https://forums.sagetv.com/forums/member.php?u=4793 (we may pick any other)

look for: show all statistics:

Total Posts
    Total Posts: 4,406
    Posts Per Day: 0.78
    Find all posts by nyplayer
    Find all threads started by nyplayer
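Those statistics can be read into a dict straight from the profile page. A minimal sketch, assuming the vBulletin 3.8 markup seen on the demo board (statistics as `<li>` items inside a `fieldset class="statistics_group"`):

```python
from bs4 import BeautifulSoup

# Stand-in HTML for the profile statistics block (markup is an assumption
# modelled on the demo board's member.php page):
html = '''
<fieldset class="statistics_group">
  <ul>
    <li>Total Posts: 4,406</li>
    <li>Posts Per Day: 0.78</li>
  </ul>
</fieldset>
'''
soup = BeautifulSoup(html, 'html.parser')

stats = {}
for li in soup.find('fieldset', {'class': 'statistics_group'}).find_all('li'):
    key, _, value = li.text.partition(':')
    if value:
        # drop thousands separators before converting
        stats[key.strip()] = float(value.replace(',', '').strip())

print(stats)  # -> {'Total Posts': 4406.0, 'Posts Per Day': 0.78}
```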

and then you get a starting point - with the page of the postings: "Find all posts by nyplayer" - https://forums.sagetv.com/forums/search.php?searchid=15505533

and now we have a page that is showing results 1 to 25 of xyz postings

"Showing results 1 to 25 of xyz postings"

...and now we need to pick all the posts - and besides that: the whole thread, in which the user nyplayer is one poster among others. At the end we get all the threads and discourses our example-user nyplayer is involved in.

Notice the difference: we are aiming for all the discourses nyplayer (our demo-user) is involved in. We're not aiming to get (only) the threads he started. This little condition makes the task a bit trickier - and besides that: I guess we need to find a good method to store and display the data we gathered. Perhaps CSV is a good idea - so that we can work with the results...

Getting everything includes skipping from page to page: after having worked through the first pages, where we gathered all the threads nyplayer is involved in, we can go ahead to the next page - and the next... until we reach the end of the pages where postings of nyplayer are displayed.
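The page-to-page part boils down to reading the pager cell once and looping. A minimal sketch, assuming the `<td class="vbmenu_control">Page 1 of N</td>` pager cell that vBulletin 3.8 renders on the demo board:

```python
import re
from bs4 import BeautifulSoup

# Stand-in HTML for the pager cell (assumed vBulletin 3.8 markup):
html = '<td class="vbmenu_control">Page 1 of 7</td>'
soup = BeautifulSoup(html, 'html.parser')

pager = soup.find('td', {'class': 'vbmenu_control'},
                  string=re.compile(r'^Page \d+ of \d+'))
# A missing pager cell means there is only one page of results.
total_pages = int(pager.text.split('of')[-1]) if pager else 1

# Pages 2..total_pages would then be fetched with a &page=N URL parameter.
print(total_pages)  # -> 7
```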

note: I have added some images to illustrate the two main tasks to solve. a. gathering all the threads of a certain author - note: as mentioned above, this is more than only getting the threads he has started. Main goal: getting the whole (!) threads a certain author is involved in. This of course includes going through all the pages (see the attached images). That's just all.

Starting point could be the overview of postings of any user of this example-board... from there we could derive the general logic...

and best would be to fetch all the threads.

Task

So the job is:

  1. first of all: we need to find all the threads (that contain postings of our certain user); and once we have the threads (discourses) of the first page, then we have
  2. to skip to the next page - and the next and the next

Main goal: at the end we have all the threads (and discourses) our demo-user is involved in. We want to work out the logic that enables us to use this technique on any vBulletin running version 3.8.x. As mentioned above: it is all about the logic. So here we just want to create a minimal reproducible example (cf. https://stackoverflow.com/help/minimal-reproducible-example). The URL doesn't matter, just the content of the HTML. I just made the example as small as possible while still exhibiting the problem I am aiming to solve.

The coding parts that I have found so far:

Steps

  1. gathering the postings and the threads(!) our demo-user is involved in

 threads = soup.find_all('tr', attrs={'class': 'alt1'})

tr 
td class="alt1"
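That `alt1` fragment can be turned into a working extraction step. A minimal sketch; the row markup below is an assumption modelled on vBulletin 3.8 search-result rows, where each result row carries `class="alt1"` cells and a `showthread.php?t=...` link:

```python
from bs4 import BeautifulSoup

# Stand-in HTML for two search-result rows (assumed markup):
html = '''
<tr class="alt1">
  <td class="alt1"><a href="showthread.php?t=111">Thread one</a></td>
</tr>
<tr class="alt1">
  <td class="alt1"><a href="showthread.php?t=222">Thread two</a></td>
</tr>
'''
soup = BeautifulSoup(html, 'html.parser')

thread_ids = []
for row in soup.find_all('tr', attrs={'class': 'alt1'}):
    # href=True skips named anchors that carry no link
    for a in row.find_all('a', href=True):
        if 't=' in a['href']:
            thread_ids.append(a['href'].split('t=')[-1].split('&')[0])

print(thread_ids)  # -> ['111', '222']
```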

  1. Showing results 1 to 25 of xyz postings

xyz postings means: that we've got a certain amount of (for example) xyz postings (situated in various threads):

 td class="vbmenu_control"
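The "Showing results 1 to 25 of xyz" counter can also be parsed to work out how many pages must be visited. A minimal sketch; the header string and the total of 4,406 are assumptions taken from the demo user's statistics:

```python
import re

# Stand-in results header (assumed wording from the search page):
header = 'Showing results 1 to 25 of 4,406'

m = re.search(r'Showing results (\d+) to (\d+) of ([\d,]+)', header)
first, last = int(m.group(1)), int(m.group(2))
total = int(m.group(3).replace(',', ''))  # strip thousands separator

per_page = last - first + 1
pages = -(-total // per_page)  # ceiling division

print(total, pages)  # -> 4406 177
```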

Finally - the last question is: how to store all the gathered data? Which format should be used?! That's another question... but I am pretty sure that some nice techniques are available here...

Looking forward to hearing from you!

Solution

See if this gets you started.

You can make some functions that pull out the thread ids using the post ids, then iterate through the thread id pages and parse the data. I'm not really going to spend too much more time on this. You could possibly use the comments in the html as well to pull out some of the sections, but I think this is more or less the thought process you are looking for.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re


headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'}
userId = 4793


def get_user_stats(userId):
    url = 'https://forums.sagetv.com/forums/member.php'
    payload = {'u':f'{userId}'}
    
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    stats_data = {}
    stats = soup.find('fieldset',{'class':"statistics_group"}).find_all('li')
    for each in stats:
        values = each.text.replace(',','').split(':')
        if len(values) == 2:
            key, value = ''.join(values[0].split()), float(values[1])
            stats_data[key] = value
    return stats_data


def get_searchId(userId):
    url = 'https://forums.sagetv.com/forums/search.php'
    payload = {'do':'finduser',
               'u':f'{userId}'}    
    
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    searchId = soup.find('td',{'class':'vbmenu_control'}, text=re.compile("^Page 1 of")).find_next('a')['href'].split('searchid=')[-1].split('&')[0]
    return searchId

def get_page_threadIds(threadId_list, soup):
    # Each search result is a <table id="post...">; any link containing
    # 't=' points at the thread the post belongs to.
    postIds = soup.find_all('table',{'id':re.compile("^post")})
    for each in postIds:
        for alpha in each.find_all('a', href=True):
            if 't=' in alpha['href']:
                threadId = alpha['href'].split('t=')[-1].split('&')[0]
                if threadId not in threadId_list:
                    threadId_list.append(threadId)
    return threadId_list


def get_all_threadIds(searchId):
    threadId_list = []
    url = 'https://forums.sagetv.com/forums/search.php'
    payload = {'searchid':'%s' %searchId,
               'pp':'200'}

    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    total_pages = int(soup.find('td',{'class':'vbmenu_control'}, text=re.compile("^Page 1 of")).text.split('of ')[-1])
    
    threadId_list = get_page_threadIds(threadId_list, soup)
    for page in range(2, total_pages+1):
        payload.update({'page': '%s' %page})
        response = requests.get(url, headers=headers, params=payload)
        soup = BeautifulSoup(response.text, 'html.parser')
        threadId_list = get_page_threadIds(threadId_list, soup)
    return list(set(threadId_list))
        
        
        
stats = get_user_stats(userId)
searchId = get_searchId(userId)
threadId_list = get_all_threadIds(searchId)   



rows = []
for threadId in threadId_list:
    url = 'https://forums.sagetv.com/forums/showthread.php'
    payload = {'t':'%s' %threadId,
               'pp':'40',
               'page':'1'}
    
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    try:
        total_pages = int(soup.find('td',{'class':'vbmenu_control'}, text=re.compile("^Page 1 of")).text.split('of ')[-1])
    except (AttributeError, ValueError):  # no pager cell: single-page thread
        total_pages = 1
    
    for page in range(1,total_pages+1):
        payload.update({'page':'%s' %page})
        response = requests.get(url, headers=headers, params=payload)
        soup = BeautifulSoup(response.text, 'html.parser')
        discussion = soup.find('td',{'class':'navbar'}).text.strip()
        posts = soup.find_all('table',{'id':re.compile("^post")})
        for post in posts:
            dateStr = post.find('td',{'class':'thead'}).text.split()
            postNo = dateStr[0]
            dateStr = ' '.join(dateStr[1:])
            
            postername = post.find('a',{'class':'bigusername'}).text
            joinDate = post.find('div', text=re.compile("^Join Date:")).text.split('Join Date:')[-1].strip()
            try:
                location = post.find('div', text=re.compile("^Location:")).text.split('Location:')[-1].strip()
            except AttributeError:  # some posters have no Location field
                location = 'N/A'
            postNum = post.find('div', text=re.compile(".*Posts:")).text.split('Posts:')[-1].replace(',','').strip()
            message = post.find('div',{'id':re.compile("^post_message_")}).text.strip()
            
            row = {'date':dateStr,
                   'postNumber':postNo,
                   'poster':postername,
                   'joinDate':joinDate,
                   'location':location,
                   'number of posts':postNum,
                   'thread':discussion,
                   'thread id':threadId,
                   'message':message}
            rows.append(row)
            
        print('Collected: %s - Page %s of %s' % (discussion, page, total_pages))


df = pd.DataFrame(rows)


print (stats)
print(df)

