Scraping WeChat Official Account articles with Fiddler and Python
Published: 2020-10-19
Publisher: 葵宇科技
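Analysis
1. Preparation: set up Fiddler to capture traffic (setup guides are easy to find online).
2. Open the desktop WeChat client and find the Official Account you want to scrape.
Open the account's page, then start Fiddler and leave WeChat on this screen.
With Fiddler running, click the button in WeChat (screenshot omitted).
Fiddler will fill up with captured requests. Keep scrolling down in WeChat until the target request appears in Fiddler (screenshot omitted).
Open the Raw tab on the right side of Fiddler, find the link in the request, and open it.
The link opens a page in the browser (screenshot omitted).
3. In the browser, choose Inspect Element and switch to the Network tab.
As you scroll the page, a request that returns JSON appears.
Open it to see the response.
This is the data we want to scrape.
4. Still in step 3, keep scrolling the page and compare the request links.
Only offset changes, growing by 10 each time. The JSON response from step 3 holds exactly 10 entries, so each change in offset corresponds to a new batch of 10 articles.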
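The offset pattern from step 4 can be sketched as a small URL builder. This is a minimal illustration, not the author's code: `build_page_url` and `BASE_PARAMS` are hypothetical names, and the session-specific parameters (`uin`, `key`, `pass_ticket`, `appmsg_token`) are left out and would have to be copied from your own captured request.

```python
from urllib.parse import urlencode

# Endpoint observed in the captured request
BASE_URL = 'https://mp.weixin.qq.com/mp/profile_ext'

# Hypothetical subset of the query parameters; the session-specific ones
# (uin, key, pass_ticket, appmsg_token) must come from your own capture
BASE_PARAMS = {
    'action': 'getmsg',
    '__biz': 'MzA3Nzc4MzY2NA==',
    'f': 'json',
    'count': 10,
}


def build_page_url(page, page_size=10):
    """Return the URL for a zero-based page; offset grows by page_size per page."""
    params = dict(BASE_PARAMS, offset=page * page_size, count=page_size)
    return BASE_URL + '?' + urlencode(params)


urls = [build_page_url(p) for p in range(3)]
for u in urls:
    print(u)
```

Building the query with `urlencode` keeps the escaping in one place, so only the page number changes between requests.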
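Pitfalls
1. Do not send requests to WeChat too frequently, or the account will be blocked from opening article pages for 24 hours.
2. The link captured with Fiddler expires; in testing it stayed valid for about 30 minutes.
3. Converting the scraped URLs to PDF requires the pdfkit library (usage guides are easy to find online).
The code used to test the link lifetime: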
import time

import requests

# Record when the test starts so the link lifetime can be computed at the end
start = time.time()
# Link captured with Fiddler; key, pass_ticket and appmsg_token are session-specific
url = 'https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MzA5NjEwNjE0OQ==&f=json&offset=11&count=10&is_ok=1&scene=124&uin=MTczNjk0NDEwMw%3D%3D&key=827f3335bef33e45717c17a835620ed3e7c540ab72a526ab5b053adcaa860be393c02f9ac5dcd1f29e45d6568788ca024b2aef3a0ff57fea9324a750ff257637fdba0690f8531315bdfca09cb3b9face1b1a5eb7efd9a8fc4f6948dd63e5930be4109b6de50b4efea8dc446012adf7ea5d58ee9ee75620ef9b1d7086201a78dc&pass_ticket=jN5PzMHo4SdLo6xWe8i%2FvQ6x87AEnKHHtwMkpl%2FuH6TKwnoBj%2F01J3thBdOHmMTM&wxtoken=&appmsg_token=1075_HtI4fzr7%252F2AwFEgwfox68YcvhRovjfzSy9-Knw~~&x5=0&f=json'
while True:
    r = requests.get(url)
    if len(r.text) > 400:
        # A long response still carries article data; check again in 3 minutes
        time.sleep(180)
    else:
        # A short response means the link has expired
        print("Valid for " + str(int((time.time() - start) / 60)) + " minutes")
        break
Result: the link stayed valid for about 30 minutes.
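The length check in the test code is a heuristic. A slightly more explicit variant, sketched here as an assumption rather than the author's method, treats a response as valid only if it parses as JSON and contains the `general_msg_list` field that the scraper reads later. `looks_valid` is a hypothetical helper name, and the sample strings are made up for illustration.

```python
import json


def looks_valid(body):
    """Return True if body is JSON and carries a general_msg_list field."""
    try:
        data = json.loads(body)
    except ValueError:
        # Not JSON at all (for example an HTML error page)
        return False
    return isinstance(data, dict) and 'general_msg_list' in data


# Made-up sample bodies, not real WeChat responses
print(looks_valid('{"general_msg_list": "{\\"list\\": []}"}'))
print(looks_valid('{"errmsg": "expired"}'))
print(looks_valid('<html>blocked</html>'))
```

Checking for the field the parser depends on avoids tying the test to a byte count that may differ between accounts.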
Writing the code
1. Build the headers from the parameters of the captured request (screenshot omitted).
import random
import re

import requests

# Pool of User-Agent strings; one is picked at random each run
User_Agent = [
'Mozilla/5.0 CK={} (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; KTXN)',
'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0'
]
# Request headers; the referer and cookie values come from the captured session
headers = {}
headers['connection'] = 'keep-alive'
headers['host'] = 'mp.weixin.qq.com'
headers['User-Agent'] = random.choice(User_Agent)
headers['referer'] = 'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzA3Nzc4MzY2NA=='
headers['cookie'] = 'pgv_pvi=2150888448; RK=qBo4z5fmcP; ptcz=6b4b4ae9eb01daeffff996302197842b93afa3c6501532cce8850cdda5a37855; pgv_pvid=2824430422; pac_uid=0_5e87d63ce838c; tvfe_boss_uuid=bfebe1f857002eaf; XWINDEXGREY=0; Hm_lvt_dde6ba2851f3db0ddc415ce0f895822e=1588923704,1589040012; ua_id=MgD7dxr2iLcmCup2AAAAAC3Cx-Yr2sX1OGi-UfyQTzk=; Hm_lvt_dde6ba2851f3db0ddc415ce0f895822e=1588923704,1589040012; _ga=GA1.2.807773159.1593618327; iip=0; wxuin=1736944103; lang=zh_CN; pass_ticket=jN5PzMHo4SdLo6xWe8i/vQ6x87AEnKHHtwMkpl/uH6TKwnoBj/01J3thBdOHmMTM; devicetype=android-28; version=2700113f; wap_sid2=COfTnrwGEooBeV9ITnlBb3FjWk9LLUh6bmhCWE9XZk5oWEFoUXZxdUhkYTlTMWxUYVVvYjNSaGdmN3NiVTBIb29BejBRVjFicFFHV0phQzBWSE5FLUJpREw2a3pUVUROM0RjejlnaTN4eUQtZXJkQWpOb2QxbXR1SWl2bnU0Y2l0ME0zNkFGS3hvRnJSRVNBQUF+MJKUkfoF='
import json
import time

import pdfkit
import requests


# Parse the JSON response for one page of the article list
def parse_page(url):
    r = requests.get(url, headers=headers)
    # general_msg_list is itself a JSON-encoded string, so it is decoded a second time
    a = json.loads(r.text)['general_msg_list']
    lists = json.loads(a)['list']
    # Append each article's link and title to an HTML file
    with open('python.html', 'a', encoding='utf-8') as f:
        for i in lists:
            try:
                info = i['app_msg_ext_info']
                # The JSON stores title and URL in two forms: when the top-level
                # title is empty, use the first multi_app_msg_item_list entry
                if len(info['title']) == 0:
                    info = info['multi_app_msg_item_list'][0]
                # Strip punctuation that causes trouble in file names
                title = re.sub(r'[!,,\?\\\/:<>&$\*\|@#]', '', info['title'])
                link = info['content_url']
                digest = info['digest']
                article = {
                    'title': title,
                    'link': link,
                    'digest': digest
                }
                f.write('<a href=' + link + '>' + title + '</a>' + '<br>')
                save_pdf(link, title)
                print(article)
                print("*" * 30)
                time.sleep(1)
            except Exception:
                # Skip entries that lack the expected fields or fail to convert
                continue


# Save the page at the given URL as a PDF
def save_pdf(url, title):
    # Path to the local wkhtmltopdf executable, which pdfkit requires
    config = pdfkit.configuration(wkhtmltopdf=r"D:\tesseract\wkhtmltopdf\bin\wkhtmltopdf.exe")
    pdfkit.from_url(url, 'E:1/' + title + '.pdf', configuration=config)


# Scrape multiple pages
def main():
    for i in range(1, 100):
        print("*" * 30)
        print('Page %d' % i)
        print("*" * 30)
        # offset={}0 turns page i into offset i*10 (10 articles per page)
        url = 'https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MzA3Nzc4MzY2NA==&f=json&offset={}0&count=10&is_ok=1&scene=124&uin=MTczNjk0NDEwMw%3D%3D&key=827f3335bef33e450aa3cb8e6088b5eed9e86a3b7a9023c61ce26ed85c0fae456a59262f47906011422679031b89df59a0ded071e96ceb9f39d6226f284762ccb2b9d755a17d2047b09cc00a9bf44e23f3ce6f33e8744deb4c69caa7c7c9226316825095c58ecbfa010e4219651e8eeb45c0370d6f04637e301a7b08a89e966f&pass_ticket=pCWEQrqZBNyT5N91MECA49xCvslYFAsMinBKcBCJHXd32k4pqEAaJtOjqUXajp0R&wxtoken=&appmsg_token=1079_uwFRe2sGJ6uaIdrzGCAnHbgqeLCpi7WbQqak7g~~&x5=0&f=json'.format(i)
        parse_page(url)
        # Sleep 10 to 19 seconds after each page so WeChat does not block the account
        time.sleep(10 + int(random.random() * 10))


if __name__ == '__main__':
    # Empty the HTML file before scraping starts
    with open('python.html', 'w', encoding='utf-8') as f:
        f.write('')
    main()
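The double decoding and the title fallback in `parse_page` can be tried offline without touching WeChat. The sketch below is an illustration under assumptions: `extract_article` is a hypothetical helper, and `sample` is a made-up payload that only mimics the field names used in the scraper (`general_msg_list`, `list`, `app_msg_ext_info`, `multi_app_msg_item_list`); real responses carry more fields.

```python
import json
import re


def extract_article(entry):
    """Return title, link and digest for one list entry, or None if absent."""
    info = entry.get('app_msg_ext_info')
    if not info:
        return None
    # Fall back to the first sub-item when the top-level title is empty
    if not info.get('title') and info.get('multi_app_msg_item_list'):
        info = info['multi_app_msg_item_list'][0]
    # Character class taken from parse_page; strips awkward punctuation
    title = re.sub(r'[!,\?\\\/:<>&$\*\|@#]', '', info.get('title', ''))
    return {
        'title': title,
        'link': info.get('content_url', ''),
        'digest': info.get('digest', ''),
    }


# Made-up payload: the inner list is serialized to a string first, mimicking
# the nested encoding, so it has to be decoded twice
inner = {'list': [
    {'app_msg_ext_info': {'title': 'What is Python?',
                          'content_url': 'http://example.com/a',
                          'digest': 'intro'}},
    {'app_msg_ext_info': {'title': '', 'content_url': '', 'digest': '',
                          'multi_app_msg_item_list': [
                              {'title': 'Part 1: setup',
                               'content_url': 'http://example.com/b',
                               'digest': 'setup'}]}},
    {'other': 1},
]}
sample = json.dumps({'general_msg_list': json.dumps(inner)})

entries = json.loads(json.loads(sample)['general_msg_list'])['list']
articles = [a for a in map(extract_article, entries) if a is not None]
for a in articles:
    print(a)
```

Returning None for entries without `app_msg_ext_info` makes the skip explicit instead of relying on an exception handler.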