Scrape articles from wsj by requests, CURL and BeautifulSoup
Question
I'm a paid member of wsj, and I tried to scrape articles for my NLP project. I thought I had kept the session:
import requests

rs = requests.session()
login_url = "https://sso.accounts.dowjones.com/login?client=5hssEAdMy0mJTICnJNvC9TXEw3Va7jfO&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&scope=openid%20idp_id&response_type=code&nonce=18091b1f-2c73-4a93-ab10-77b0d4d4f9d3&connection=DJldap&ui_locales=en-us-x-wsj-3&mg=prod%2Faccounts-wsj&state=NfljSw-Gz-TnT_I6kLjnTa2yxy8akTui#!/signin"
payload = {
    "username": "xxx@email",
    "password": "myPassword",
}
result = rs.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url),
)
The article I want to parse:
r = rs.get('https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y')
Then I found the html was still the one served to non-members.
I also tried another method, using CURL to save the cookies after I log in:
curl -c cookies.txt -I "https://www.wsj.com"
curl -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y" > test.html
The result was the same.
I'm not very familiar with how the authentication mechanism works behind the browser. Can someone explain why both of the methods above failed, and how I should fix them to reach my goal? Thank you very much.
Answer
Your attempts failed because the protocol used is OAuth 2.0. This is not basic authentication.
Here is what happens:
- When the login URL https://accounts.wsj.com/login is called, some information is generated server side: connection & client_id.
- When submitting username/password, the URL https://sso.accounts.dowjones.com/usernamepassword/login is called, which needs some parameters: the previous connection & client_id, plus some static OAuth2 parameters (scope, response_type, redirect_uri).
- The response to that login call contains a form which auto-submits. This form has 3 params: wa, wresult and wctx (wresult is a JWT). Submitting this form to https://sso.accounts.dowjones.com/login/callback retrieves a URL with a code param like code=AjKK8g0pZZfvYpju.
- The URL https://accounts.wsj.com/auth/sso/login?code=AjKK8g0pZZfvYpju is called, which retrieves the cookies for a valid user session.
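Since the question uses Python, the same four steps can also be sketched with requests and BeautifulSoup. This is a rough sketch under the flow described above, not a tested implementation: the endpoints and parameter names are taken from the steps, and parse_auto_submit_form is a helper name invented here. Sites change their SSO details, so expect to adapt it.

```python
import requests
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

def parse_auto_submit_form(html):
    """Collect the hidden inputs (wa, wresult, wctx) of the auto-submitted form."""
    soup = BeautifulSoup(html, "html.parser")
    return {inp["name"]: inp.get("value", "")
            for inp in soup.find_all("input")
            if inp.get("name")}

def wsj_login(username, password):
    s = requests.Session()
    # Step 1: the login URL redirects; connection & client_id are in the Location header
    r = s.get("https://accounts.wsj.com/login", allow_redirects=False)
    qs = parse_qs(urlparse(r.headers["Location"]).query)
    # Step 2: submit credentials together with the static OAuth2 parameters
    r = s.post("https://sso.accounts.dowjones.com/usernamepassword/login", data={
        "username": username,
        "password": password,
        "connection": qs["connection"][0],
        "client_id": qs["client_id"][0],
        "scope": "openid idp_id",
        "tenant": "sso",
        "response_type": "code",
        "protocol": "oauth2",
        "redirect_uri": "https://accounts.wsj.com/auth/sso/login",
    })
    # Step 3: re-post the auto-submitted form to the callback endpoint
    form = parse_auto_submit_form(r.text)
    r = s.post("https://sso.accounts.dowjones.com/login/callback",
               data=form, allow_redirects=False)
    # Step 4: follow the Location carrying code=... to obtain the session cookies
    s.get(r.headers["Location"])
    return s
```

After a successful run, `wsj_login(user, pw)` returns a session whose cookie jar should let subsequent `s.get(article_url)` calls see the member version of the page.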
A bash script which uses curl, grep, pup and jq:
username="user@gmail.com"
password="YourPassword"
login_url=$(curl -s -I "https://accounts.wsj.com/login")
connection=$(echo "$login_url" | grep -oP "Location:\s+.*connection=\K(\w+)")
client_id=$(echo "$login_url" | grep -oP "Location:\s+.*client_id=\K(\w+)")
#connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}')
#client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}')
rm -f cookies.txt
IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
--data-urlencode "username=$username" \
--data-urlencode "password=$password" \
--data-urlencode "connection=$connection" \
--data-urlencode "client_id=$client_id" \
--data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')
# replace HTML-encoded double quotes (&#34;) in wctx
wctx=$(echo "$wctx" | sed 's/&#34;/"/g')
code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
--data-urlencode "wa=$wa" \
--data-urlencode "wresult=$wresult" \
--data-urlencode "wctx=$wctx" | grep -oP "Location:\s+\K(\S*)")
curl -s -c cookies.txt "$code_url"
# here call your URL loading cookies.txt
curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
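Once cookies.txt exists, the final curl call can also be replaced by Python, so scraping continues with requests and BeautifulSoup. curl writes cookies in the Netscape format, which the stdlib http.cookiejar can read. A minimal sketch, assuming the cookies.txt produced by the script above (session_from_curl_cookies is a helper name invented here):

```python
import http.cookiejar
import requests

def session_from_curl_cookies(path="cookies.txt"):
    """Build a requests session preloaded with cookies saved by curl -c."""
    jar = http.cookiejar.MozillaCookieJar(path)
    # keep session cookies and expired entries instead of silently dropping them
    jar.load(ignore_discard=True, ignore_expires=True)
    s = requests.Session()
    s.cookies.update(jar)
    return s
```

Usage: `s = session_from_curl_cookies()` then `s.get(article_url).text` goes straight into BeautifulSoup.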