Weibo Hot Topic Web Crawler - Monitoring Public Sentiment in China

Web crawling is dangerous

First and foremost, do not touch web crawling unless you are 100% sure that you want to do this.

Similar topic:

https://nobugs.dev/2019/07/14/webscrawler/

Website: enlightent

This time

This time we use the requests and json packages to fetch the hot topics on Weibo, a Chinese site similar to Twitter with a real-time trending list, which makes it a great way to monitor public sentiment in China.
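The core pattern is simple: issue a GET request and parse the JSON body. A minimal sketch using the same endpoint as the full script below (the timeid value here is just an example snapshot):

import requests
import json

url = "https://www.eecso.com/test/weibo/apis/currentitems.php?timeid=77594"
response = requests.get(url, verify=False)  # verify=False mirrors the full script below
items = json.loads(response.text)           # each item: [topic, first-seen time, last-seen time]
print(items[:3])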

Code Implementation

import requests
import json


headers = {
    'charset': "utf-8",
    'Accept-Encoding': "gzip",
    'referer': "https://servicewechat.com/wx90ae92bbd13ec629/11/page-frame.html",
    'content-type': "application/x-www-form-urlencoded",
    'User-Agent': "Mozilla/5.0 (Linux; Android 9; Redmi Note 7 Build/PKQ1.180904.001; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/68.0.3440.91 Mobile Safari/537.36 MicroMessenger/7.0.3.1400(0x2700033B) Process/appbrand0 NetType/WIFI Language/zh_CN",
    'Host': "www.eecso.com",
    'Connection': "keep-alive",
    'cache-control': "no-cache",
    'Origin': 'https://www.weibotop.cn',
}


# Write the CSV header: time, rank, topic, first-seen time, last-seen time
with open('微博热搜.csv', 'w', encoding='gbk') as f:
    f.write('时间,排名,热搜内容,上榜时间,最后时间\n')

timeid = 77594
dateUrl = "https://www.eecso.com/test/weibo/apis/getlatest.php?timeid={}"
contentUrl = "https://www.eecso.com/test/weibo/apis/currentitems.php?timeid={}"
n = 1
days = 42       # how many days of data to fetch, counting back from 2020/2/10
interval = 720  # set to 1 to crawl every record (the site logs one every 2 minutes); 24 * 30 = 720 records per day

while True:
    dateResponse = requests.request("GET", dateUrl.format(timeid), headers=headers, verify=False)
    contentResponse = requests.request("GET", contentUrl.format(timeid), headers=headers, verify=False)
    # 77594 is the timeid for 2020/2/10 12:00; one day spans 720 timeids
    timeid = 77594 - interval * n
    print(timeid)
    n += 1
    dateJson = json.loads(dateResponse.text)
    json_obj = json.loads(contentResponse.text)
    # print(dateJson)

    for index, item in enumerate(json_obj):
        date = dateJson[1]     # timestamp of this snapshot
        rank = str(index + 1)  # position on the hot-search list
        hotTopic = item[0]
        onTime = item[1]       # when the topic first entered the list
        lastTime = item[2]     # when it was last seen on the list
        save_res = date + "," + rank + "," + hotTopic + ',' + onTime + ',' + lastTime + '\n'
        # Note: a topic containing a comma will break this hand-rolled CSV row
        with open('微博热搜.csv', 'a', encoding='gbk', errors='ignore') as f:
            f.write(save_res)

    if n > days:
        break
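Since the script hard-codes timeid 77594 for 2020/2/10 12:00 and the site records one snapshot every two minutes (720 timeids per day), the timeid for any other moment follows from simple arithmetic. A hypothetical helper, assuming the two-minute cadence holds without gaps:

from datetime import datetime

ANCHOR_TIMEID = 77594
ANCHOR_TIME = datetime(2020, 2, 10, 12, 0)

def timeid_for(moment: datetime) -> int:
    # One timeid every 2 minutes; negative offsets reach back in time.
    delta_minutes = (moment - ANCHOR_TIME).total_seconds() / 60
    return ANCHOR_TIMEID + round(delta_minutes / 2)

print(timeid_for(datetime(2020, 2, 9, 12, 0)))  # one day earlier: 77594 - 720 = 76874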
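One fragility worth noting: concatenating fields with commas corrupts the CSV whenever a topic itself contains a comma. A safer variant of the save step, sketched with the standard csv module (save_row is a hypothetical helper; the file name and field order match the script above):

import csv

def save_row(date, rank, topic, on_time, last_time, path='微博热搜.csv'):
    # csv.writer quotes any field that contains a comma, so topic text survives intact
    with open(path, 'a', encoding='gbk', errors='ignore', newline='') as f:
        csv.writer(f).writerow([date, rank, topic, on_time, last_time])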