How to log on to my wsj account from the Linux terminal (using curl, OAuth 2.0)


Question

I'm a paid member of wsj and I want to log on to my wsj account from the Linux terminal so that I can write code to scrape some articles for my NLP research. I won't release the data in any form.

My approach is based on a previous answer to "Scrap articles form wsj by requests, CURL and BeautifulSoup". The code from that answer worked at the time but no longer does, apparently because wsj has adopted a different OAuth 2.0 flow. First, I can no longer obtain connection by running the login_url step. I suspect this is the bottleneck, since connection is a mandatory field for the next step.

Another thing I noticed is that a state parameter is used. I don't know how to use this field. After running

curl -s 'https://sso.accounts.dowjones.com/authorize?scope=openid+idp_id+roles+email+given_name+family_name+djid+djUsername+djStatus+trackid+tags+prts&client_id=XXXXXXX&response_type=code&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&state=https://www.wsj.com&username=XXXXXX&password=XXXXXX'

it does return "Found. Redirecting to /login?state=XXXX....", but I am not sure how to use the state parameter after this step.
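For illustration, here is a minimal sketch of pulling the state value out of a redirect target like the one above, using only the Python standard library. The helper name `extract_state` and the sample redirect string are made up for this example, and whether the value can simply be replayed in later requests on this service is an assumption to verify.

```python
from urllib.parse import urlsplit, parse_qs


def extract_state(location):
    """Return the `state` query parameter of a redirect target, or None."""
    query = urlsplit(location).query
    values = parse_qs(query).get("state")
    return values[0] if values else None


# Hypothetical redirect target, shaped like the "/login?state=..." response above.
sample_location = "/login?state=abc123&client=XXXXXXX"
print(extract_state(sample_location))
```

The same function can be applied to the value of a Location header captured with `curl -s -I`.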

Some references I used are:

https://developer.dowjones.com/site/global/develop/authentication/index.gsp#2-exchanging-the-authorization-code-for-authn-tokens-98
https://oauth.net/2/

username="user@gmail.com"
password="YourPassword"

login_url=$(curl -s -I "https://accounts.wsj.com/login")
connection=$(echo "$login_url" | grep -oP "Location:\s+.*connection=\K(\w+)")
client_id=$(echo "$login_url" | grep -oP "Location:\s+.*client_id=\K(\w+)")

#connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}')
#client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}')

rm -f cookies.txt

IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
      --data-urlencode "username=$username" \
      --data-urlencode "password=$password" \
      --data-urlencode "connection=$connection" \
      --data-urlencode "client_id=$client_id" \
      --data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')

# replace double quote ""
wctx=$(echo "$wctx" | sed 's/&#34;/"/g')

code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
     --data-urlencode "wa=$wa" \
     --data-urlencode "wresult=$wresult" \
     --data-urlencode "wctx=$wctx" | grep -oP "Location:\s+\K(\S*)")

curl -s -c cookies.txt "$code_url"

# here call your URL loading cookies.txt
curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"

Recommended answer

The /usernamepassword/login request needs a few more parameters: state and nonce. Also, the connection field no longer seems to be present in the Location header; it is hardcoded in a JS file.

The credential details are embedded as Base64-encoded JSON inside a script tag on https://accounts.wsj.com/login.
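As a minimal sketch of that decoding step, the snippet below finds a `Base64.decode('...')` call in a page, decodes the payload, and parses it as JSON. The page here is synthetic and built inside the example, and the helper name `decode_embedded_config` is hypothetical; the field names (`clientID`, `internalOptions.state`, `internalOptions.nonce`) are the ones used in the scripts in this answer, but the real page layout may differ.

```python
import base64
import json
import re


def decode_embedded_config(page_html):
    """Find Base64.decode('<payload>') in the page and return the decoded JSON, or None."""
    match = re.search(r"Base64\.decode\('([^']*)'", page_html)
    if match is None:
        return None
    return json.loads(base64.b64decode(match.group(1)))


# Synthetic page built for this example only.
payload = {"clientID": "demo-client", "internalOptions": {"state": "s1", "nonce": "n1"}}
encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
page = "<script>var config = JSON.parse(Base64.decode('" + encoded + "'));</script>"

config = decode_embedded_config(page)
print(config["clientID"], config["internalOptions"]["state"], config["internalOptions"]["nonce"])
```

Returning None when the pattern is absent makes a changed page layout easy to detect before any login request is sent.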

You can update the bash script as follows. It uses curl, jq, sed and pup:

#!/bin/bash

username="your_email@gmail.com"
password="your_password"
base_url="https://accounts.wsj.com"

rm -f cookies.txt

login_page=$(curl -s -L -c cookies.txt "$base_url/login")
jspage=$(echo "$login_page" | pup 'script attr{src}' | grep "app-min")
connection=$(curl -s "$base_url$jspage" | sed -rn "s/.*connection:\s*\"(\w+)\".*/\1/p" | head -1)

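# extract state, nonce and client id from the Base64-encoded JSON in the login page
credentials=$(echo "$login_page" | \
       sed -rn "s/.*Base64\.decode\('(.*)'.*/\1/p" | \
       base64 -d | \
       jq -r '.internalOptions.state, .internalOptions.nonce, .clientID')

# $credentials is left unquoted on purpose: word splitting joins the three lines into one for read
read state nonce clientID < <(echo $credentials)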

echo "state:      $state"
echo "nonce:      $nonce"
echo "client_id:  $clientID"
echo "connection: $connection"

login_result=$(curl -s  -b cookies.txt -c cookies.txt 'https://sso.accounts.dowjones.com/usernamepassword/login' \
      --data-urlencode "username=$username" \
      --data-urlencode "password=$password" \
      --data-urlencode "connection=$connection" \
      --data-urlencode "client_id=$clientID" \
      --data-urlencode "state=$state" \
      --data-urlencode "nonce=$nonce" \
      --data-urlencode "scope=openid idp_id roles email given_name family_name djid djUsername djStatus trackid tags prts" \
      --data 'tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | \
      pup 'input json{}' | jq -r '.[] | .value')

read wa wresult wctx < <(echo $login_result)

wctx=$(echo "$wctx" | sed 's/&#34;/"/g') #replace double quote ""

echo "wa:      $wa"
echo "wresult: $wresult"
echo "wctx:    $wctx"

callback=$(curl -s -b cookies.txt -c cookies.txt -L 'https://sso.accounts.dowjones.com/login/callback' \
     --data-urlencode "wa=$wa" \
     --data-urlencode "wresult=$wresult" \
     --data-urlencode "wctx=$wctx")

#try this one to get an article, your username should be embedded in the page as logged in user
#curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"

But this bash script is painful to maintain, so I'd recommend using a Python script like this:

import requests
from bs4 import BeautifulSoup
import re
import base64
import json

username="your_email@gmail.com"
password="your_password"
base_url="https://accounts.wsj.com"

session = requests.Session()
r = session.get("{}/login".format(base_url))
soup = BeautifulSoup(r.text, "html.parser")
jscript = [ 
    t.get("src") 
    for t in soup.find_all("script") 
    if t.get("src") is not None and "app-min" in t.get("src")
][0]

credentials_search = re.search(r"Base64\.decode\('(.*)'", r.text, re.IGNORECASE)
base64_decoded = base64.b64decode(credentials_search.group(1))
credentials = json.loads(base64_decoded)

print("client_id : {}".format(credentials["clientID"]))
print("state     : {}".format(credentials["internalOptions"]["state"]))
print("nonce     : {}".format(credentials["internalOptions"]["nonce"]))
print("scope     : {}".format(credentials["internalOptions"]["scope"]))

r = session.get("{}{}".format(base_url, jscript))

connection_search = re.search(r'connection:\s*"(\w+)"', r.text, re.IGNORECASE)
connection = connection_search.group(1)

r = session.post(
    'https://sso.accounts.dowjones.com/usernamepassword/login',
    data = {
        "username": username,
        "password": password,
        "connection": connection,
        "client_id": credentials["clientID"],
        "state": credentials["internalOptions"]["state"],
        "nonce": credentials["internalOptions"]["nonce"],
        "scope": credentials["internalOptions"]["scope"],
        "tenant": "sso",
        "response_type": "code",
        "protocol": "oauth2",
        "redirect_uri": "https://accounts.wsj.com/auth/sso/login"
    })
soup = BeautifulSoup(r.text, "html.parser")

login_result = dict([ 
    (t.get("name"), t.get("value")) 
    for t in soup.find_all('input') 
    if t.get("name") is not None
])

r = session.post(
    'https://sso.accounts.dowjones.com/login/callback',
    data = login_result)

#check connected user
r = session.get("https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y")
username_search = re.search(r'"firstName":\s*"(\w+)",', r.text, re.IGNORECASE)
print("connected user : " + username_search.group(1))
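The login step above depends on the response containing a form with hidden inputs (wa, wresult and wctx in the bash version). Below is a minimal sketch of collecting those inputs with only the standard library and failing loudly when they are missing, which usually points to a rejected login or a changed page. The class `InputCollector`, the function `collect_inputs`, and the sample HTML are invented for this example.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs from <input> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            # HTMLParser already unescapes entities such as &#34; in attribute values.
            self.fields[name] = attributes.get("value") or ""


def collect_inputs(page_html, required=("wa", "wresult", "wctx")):
    """Return all named inputs, raising ValueError if a required one is absent."""
    parser = InputCollector()
    parser.feed(page_html)
    missing = [name for name in required if name not in parser.fields]
    if missing:
        raise ValueError("login response is missing fields: " + ", ".join(missing))
    return parser.fields


# Invented sample response, shaped like a form with three hidden inputs.
sample = (
    '<form><input type="hidden" name="wa" value="wsignin1.0">'
    '<input type="hidden" name="wresult" value="token">'
    '<input type="hidden" name="wctx" value="{&#34;strategy&#34;:&#34;auth0&#34;}"></form>'
)
fields = collect_inputs(sample)
print(fields["wctx"])
```

Because the parser unescapes attribute values, the manual `&#34;` replacement done with sed in the bash version is not needed here.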
