Scrape articles from WSJ with requests, cURL and BeautifulSoup

Views: 121 | Published: 2020/10/13 2:04:54 | Tags: curl web-crawler python-requests


Question

I'm a paid member of WSJ and I tried to scrape articles for my NLP project. I thought I had kept the session alive.

import requests

rs = requests.Session()
login_url="https://sso.accounts.dowjones.com/login?client=5hssEAdMy0mJTICnJNvC9TXEw3Va7jfO&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&scope=openid%20idp_id&response_type=code&nonce=18091b1f-2c73-4a93-ab10-77b0d4d4f9d3&connection=DJldap&ui_locales=en-us-x-wsj-3&mg=prod%2Faccounts-wsj&state=NfljSw-Gz-TnT_I6kLjnTa2yxy8akTui#!/signin" 
payload={
    "username":"xxx@email",
    "password":"myPassword",
}
result = rs.post(
    login_url, 
    data = payload, 
    headers = dict(referer=login_url)
)

The article I want to parse:

r = rs.get('https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y')

Then I found that the HTML was still the non-member version.

I also tried another method, using cURL to save the cookies after logging in:

curl -c cookies.txt -I "https://www.wsj.com"
curl -v cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y" > test.html

The result was the same.

I'm not very familiar with how authentication works behind the browser. Can someone explain why both methods above failed and how I should fix them to reach my goal? Thank you very much.

Recommended answer

Your attempts failed because the protocol used is OAuth 2.0. This is not basic authentication.

What happens here is:

Some information is generated server side when the login URL https://accounts.wsj.com/login is called: connection and client_id.
When the username and password are submitted, the URL https://sso.accounts.dowjones.com/usernamepassword/login is called. It needs some parameters: the previous connection and client_id, plus some static OAuth2 parameters (scope, response_type, redirect_uri).
The previous login call returns a response containing a form that auto-submits. This form has three parameters: wa, wresult and wctx (wresult is a JWT). The form posts to https://sso.accounts.dowjones.com/login/callback to retrieve a URL with a code parameter, such as code=AjKK8g0pZZfvYpju.
The URL https://accounts.wsj.com/auth/sso/login?code=AjKK8g0pZZfvYpju is then called, which returns the cookies for a valid user session.
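In the first and last steps, the interesting values travel in the query string of a URL (the redirect target for connection and client_id, and the final URL for code). As a minimal sketch of how they could be pulled out in Python, assuming the redirect URL has already been read from the Location header: the helper name query_params and the sample redirect URL below are hypothetical, made up for illustration, and only the code URL is taken from the answer.

```python
from urllib.parse import urlsplit, parse_qs


def query_params(url):
    """Return the query-string parameters of a URL as a flat dict.

    parse_qs yields a list per parameter; only the first value is kept,
    which is enough for single-valued parameters like the ones used here.
    """
    query = urlsplit(url).query
    return {name: values[0] for name, values in parse_qs(query).items()}


# Hypothetical redirect target (the real one is generated server side).
location = (
    "https://sso.accounts.dowjones.com/authorize"
    "?connection=DJldap&client_id=EXAMPLE_CLIENT_ID"
    "&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin"
    "&scope=openid%20idp_id&response_type=code"
)
params = query_params(location)
print(params["connection"], params["client_id"])

# The code parameter from the last step is read the same way.
final_url = "https://accounts.wsj.com/auth/sso/login?code=AjKK8g0pZZfvYpju"
print(query_params(final_url)["code"])
```

Percent-encoded values such as redirect_uri come back decoded, so they can be passed straight to a library that encodes form data itself.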

The following script uses curl, grep, pup and jq:

username="user@gmail.com"
password="YourPassword"

login_url=$(curl -s -I "https://accounts.wsj.com/login")
connection=$(echo "$login_url" | grep -oP "Location:\s+.*connection=\K(\w+)")
client_id=$(echo "$login_url" | grep -oP "Location:\s+.*client_id=\K(\w+)")

#connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}')
#client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}')

rm -f cookies.txt

IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
      --data-urlencode "username=$username" \
      --data-urlencode "password=$password" \
      --data-urlencode "connection=$connection" \
      --data-urlencode "client_id=$client_id" \
      --data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')

# decode the HTML entity &#34; back into a double quote (")
wctx=$(echo "$wctx" | sed 's/&#34;/"/g')

code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
     --data-urlencode "wa=$wa" \
     --data-urlencode "wresult=$wresult" \
     --data-urlencode "wctx=$wctx" | grep -oP "Location:\s+\K(\S*)")

curl -s -c cookies.txt "$code_url"

# from here on, request any URL you need, loading the cookies from cookies.txt
curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
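Since the question started from Python, the same sequence of calls might be sketched as below. This is an untested port of the shell script, not a verified client: the names InputCollector, form_fields and wsj_login are hypothetical, the session argument is assumed to behave like a requests.Session, and the endpoints and parameters are simply copied from the script, so they may have changed since. The standard library's HTMLParser stands in for pup and jq, and it decodes entities such as &#34; in attribute values on its own, so the sed step has no counterpart here.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs


class InputCollector(HTMLParser):
    """Collect name -> value for every named <input> tag in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if "name" in attributes:
                self.fields[attributes["name"]] = attributes.get("value") or ""


def form_fields(html):
    """Return the named inputs of the auto-submitting form as a dict."""
    parser = InputCollector()
    parser.feed(html)
    return parser.fields


def wsj_login(session, username, password):
    """Run the login sequence on a requests.Session-like object (assumed API)."""
    # The login URL redirects; connection and client_id sit in the
    # query string of the Location header.
    first = session.head("https://accounts.wsj.com/login", allow_redirects=False)
    query = parse_qs(urlsplit(first.headers["Location"]).query)

    # Submit the credentials together with the static OAuth2 parameters.
    form_page = session.post(
        "https://sso.accounts.dowjones.com/usernamepassword/login",
        data={
            "username": username,
            "password": password,
            "connection": query["connection"][0],
            "client_id": query["client_id"][0],
            "scope": "openid idp_id",
            "tenant": "sso",
            "response_type": "code",
            "protocol": "oauth2",
            "redirect_uri": "https://accounts.wsj.com/auth/sso/login",
        },
    )

    # Replay the auto-submitting form (wa, wresult, wctx) without following
    # the redirect, so the URL carrying the code parameter can be read.
    callback = session.post(
        "https://sso.accounts.dowjones.com/login/callback",
        data=form_fields(form_page.text),
        allow_redirects=False,
    )

    # Fetching the code URL is what sets the cookies of the user session.
    return session.get(callback.headers["Location"])
```

With the requests library installed, one would create session = requests.Session(), call wsj_login(session, username, password), and then fetch the article with session.get(article_url); the session object plays the role of cookies.txt. Passing the session in as an argument also makes it possible to exercise the function against a stub without touching the network.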
