Scrape articles from WSJ with requests, curl and BeautifulSoup


Question

I'm a paying WSJ subscriber and I'm trying to scrape articles for an NLP project. I thought I was keeping the login session alive:

import requests

rs = requests.session()
login_url="https://sso.accounts.dowjones.com/login?client=5hssEAdMy0mJTICnJNvC9TXEw3Va7jfO&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&scope=openid%20idp_id&response_type=code&nonce=18091b1f-2c73-4a93-ab10-77b0d4d4f9d3&connection=DJldap&ui_locales=en-us-x-wsj-3&mg=prod%2Faccounts-wsj&state=NfljSw-Gz-TnT_I6kLjnTa2yxy8akTui#!/signin" 
payload={
    "username":"xxx@email",
    "password":"myPassword",
}
result = rs.post(
    login_url, 
    data = payload, 
    headers = dict(referer=login_url)
)

The article I want to parse:

r = rs.get('https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y')

Then I found that the returned HTML is still the non-member version.

I also tried another approach, using curl to save the cookies after logging in:

curl -c cookies.txt -I "https://www.wsj.com"
curl -v cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y" > test.html

The result was the same.

I'm not very familiar with how authentication works behind the browser. Can someone explain why both approaches above fail, and how I should fix them to reach my goal? Thank you very much.

Recommended answer

Your attempts failed because the site logs you in with OAuth 2.0, which is not basic authentication. A single POST of the credentials to the login page does not establish a session; several calls are involved.

What happens here is:

  • Some information is generated server side when the login URL https://accounts.wsj.com/login is called: connection and client_id.
  • When the username/password is submitted, the URL https://sso.accounts.dowjones.com/usernamepassword/login is called. It needs the connection and client_id from the previous step plus some static OAuth2 parameters: scope, response_type and redirect_uri.
  • The response to that login call contains an auto-submitting form with three parameters: wa, wresult and wctx (wresult is a JWT). The form posts to https://sso.accounts.dowjones.com/login/callback, which returns a URL carrying a code parameter, such as code=AjKK8g0pZZfvYpju.
  • The URL https://accounts.wsj.com/auth/sso/login?code=AjKK8g0pZZfvYpju is then called, which sets the cookies for a valid user session.
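The first step can be illustrated in Python. This is only a minimal sketch, assuming the redirect's Location header carries connection and client_id as query parameters, as described in the list; the function name and the sample URL with its values are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def oauth_params_from_redirect(location: str) -> dict:
    """Return the connection and client_id query parameters of a redirect URL."""
    query = parse_qs(urlsplit(location).query)
    # parse_qs maps every name to a list of values; keep the first one
    return {name: query[name][0] for name in ("connection", "client_id")}


# Hypothetical Location value; the real parameters are generated server side.
sample = ("https://sso.accounts.dowjones.com/login"
          "?client_id=EXAMPLECLIENT&connection=DJldap&protocol=oauth2")
print(oauth_params_from_redirect(sample))
```

Parsing the query string with urllib.parse avoids depending on the order in which the parameters appear in the URL.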

Using curl, grep, pup and jq:

username="user@gmail.com"
password="YourPassword"

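# step 1: fetch only the response headers of the login page; the redirect carries the generated parameters
login_url=$(curl -s -I "https://accounts.wsj.com/login")
# -i: curl may print the header name in lowercase (for example over HTTP/2)
connection=$(echo "$login_url" | grep -ioP "Location:\s+.*connection=\K(\w+)")
client_id=$(echo "$login_url" | grep -ioP "Location:\s+.*client_id=\K(\w+)")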

#connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}')
#client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}')

rm -f cookies.txt

IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
      --data-urlencode "username=$username" \
      --data-urlencode "password=$password" \
      --data-urlencode "connection=$connection" \
      --data-urlencode "client_id=$client_id" \
      --data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')

# wctx comes back with double quotes escaped as &#34;; restore the literal quotes
wctx=$(echo "$wctx" | sed 's/&#34;/"/g')

code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
     --data-urlencode "wa=$wa" \
     --data-urlencode "wresult=$wresult" \
     --data-urlencode "wctx=$wctx" | grep -ioP "Location:\s+\K(\S*)")

curl -s -c cookies.txt "$code_url"

# fetch the article, sending the session cookies stored in cookies.txt
curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
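
Since the question uses Python, the same flow could be sketched as follows. This is not a verified implementation: the endpoints and parameters are copied from the shell script above, the names InputCollector, form_fields and login are made up for this example, and session is assumed to behave like a requests.Session. The hidden form is read with the standard-library HTMLParser, which already decodes entities such as &#34; in attribute values, so no separate unescaping step is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs


class InputCollector(HTMLParser):
    """Collect the name/value pair of every <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if attributes.get("name"):
                self.fields[attributes["name"]] = attributes.get("value") or ""


def form_fields(html):
    """Return the inputs of the auto-submitting form (wa, wresult, wctx) as a dict."""
    collector = InputCollector()
    collector.feed(html)
    return collector.fields


def login(session, username, password):
    """Walk through the login steps; `session` is assumed to act like requests.Session."""
    # Step 1: the login page redirects to the SSO host; read the generated parameters
    first = session.get("https://accounts.wsj.com/login", allow_redirects=False)
    query = parse_qs(urlsplit(first.headers["Location"]).query)

    # Step 2: submit the credentials together with the OAuth2 parameters
    reply = session.post(
        "https://sso.accounts.dowjones.com/usernamepassword/login",
        data={
            "username": username,
            "password": password,
            "connection": query["connection"][0],
            "client_id": query["client_id"][0],
            "scope": "openid idp_id",
            "tenant": "sso",
            "response_type": "code",
            "protocol": "oauth2",
            "redirect_uri": "https://accounts.wsj.com/auth/sso/login",
        },
    )

    # Steps 3 and 4: post the hidden form; a requests session follows the
    # redirect to the ...?code=... URL by default and stores the cookies
    session.post(
        "https://sso.accounts.dowjones.com/login/callback",
        data=form_fields(reply.text),
    )
    return session
```

With the third-party requests package installed, calling login(requests.Session(), username, password) and then session.get(article_url) on the returned session would be the equivalent of the last curl call, and the response text can be handed to BeautifulSoup for parsing.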
