Discovering Web Crawlers

Web crawlers are dangerous

First and foremost, do not touch web crawling unless you are 100% sure that you want to do this.

Beginning

The target website is enlightent, a third-party data website that has data on my company and its components.

We need to do some background research first. The problems I encountered are listed below:

  • Need to log in with a WeChat account by scanning a QR code
  • Simulate clicks (with the selenium package)
  • It’s a dynamic website, so you have to wait for its content to load (with the time package)
  • Write the results into MySQL (with the PyMySQL package); see the import sketch right after this list
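
For reference, a minimal sketch of how these packages are imported (this is just the standard import style for each package, not the full script):

import time                      # fixed waits so the dynamic page can finish loading
import pymysql                   # writing the scraped rows into MySQL
from selenium import webdriver   # driving Chrome and simulating clicks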

Process

The first problem cannot be solved programmatically because of WeChat’s security, so the QR code has to be scanned by hand.
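
What the script can do is open the login page and then pause until the QR code has been scanned manually. A minimal sketch (the URL is a placeholder, not the real login address):

driver.get("https://www.example.com/login")   # placeholder: put the real login URL here
input("Scan the WeChat QR code in the browser, then press Enter to continue...")
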
Second problem: Simulating clicks with Selenium

Step 1: Find the pattern in the HTML. In Chrome, just press Ctrl+U (view source) or Ctrl+Shift+I (DevTools). It takes patience to find the element you want, and if you pick the wrong pattern you will not get the information you want.

Step 2: Choose the locator: find_element_by_xpath or find_element_by_class_name. The tricky point is that when only one element carries the class, locating by class name is fine, but when two or more elements share it, Selenium returns only the first match. As a result, I chose XPath.
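
To illustrate the difference, a small sketch (the class name "day" is taken from the datepicker used later; the XPath here is illustrative, not the real one):

# Returns only the FIRST element that carries the class, even if many cells share it
first_day = driver.find_element_by_class_name("day")

# The plural form returns every match, so you could index into the list instead
all_days = driver.find_elements_by_class_name("day")

# An XPath can pin down exactly one node; the real script uses full /html/body/... paths
target_cell = driver.find_element_by_xpath("//table/tbody/tr[2]/td[3]")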

Step 3: Install a ChromeDriver that matches your Chrome version. Be sure to download it into /anaconda3/lib/site-packages (or somewhere else Selenium can find it).
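
A minimal setup sketch in the Selenium 3 style used throughout this post; the driver path is an example, adjust it to wherever you put ChromeDriver:

from selenium import webdriver

# Point Selenium at the downloaded driver explicitly (path is an example)
driver = webdriver.Chrome(executable_path="/anaconda3/lib/site-packages/chromedriver")
driver.get("https://www.example.com")   # placeholder for the target page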

# `driver`, `l` and `start_date` are defined earlier in the script:
# `l` selects the target month (the span index in the datepicker) and
# `start_date` is an offset into the calendar grid (row = start_date // 7, column = start_date % 7)
# Choose the daily ranking
time.sleep(5)
driver.find_elements_by_id("rank-date-btn")[0].click()
# Switch the datepicker to the year/month view
time.sleep(1)
driver.find_element_by_class_name("datepicker-switch").click()
# Choose the target month
time.sleep(1)
# driver.find_element_by_class_name("month").click()
month_url = "/html/body/div[2]/div[3]/div[1]/div[1]/div[1]/div[2]/div[2]/div[2]/div[1]/div/div/div/div[2]/table/tbody/tr/td/span[%s]" % (l+1)
driver.find_element_by_xpath(month_url).click()
# Choose the target date
time.sleep(1)
# day = driver.find_element_by_class_name("day").click()
xpath = "/html/body/div[2]/div[3]/div[1]/div[1]/div[1]/div[2]/div[2]/div[2]/div[1]/div/div/div/div/table/tbody/tr[%d]/td[%d]" % ((start_date) // 7 + 1, (start_date) % 7 + 1)
driver.find_element_by_xpath(xpath).click()
# Confirm the selection
time.sleep(1)
driver.find_element_by_id("choose-rank").click()
time.sleep(1)
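
The fixed time.sleep() calls work, but they are fragile: too short and the element is not loaded yet, too long and the scrape slows down. A hedged alternative sketch using Selenium's explicit waits (same element id as above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the confirm button to become clickable, then click it
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "choose-rank"))).click()
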
Third problem: Data processing
# string_list is a list of raw HTML fragments, one per ranking row, built earlier in the script
album_separately = string_list[j][string_list[j].find('data-name='):string_list[j].find('data-channeltype="tv"')]
if j != 0:
    # Album name comes from the data-name attribute
    album.append(album_separately.replace('data-name=', '').replace('"', ''))
    # Predicted play-times percentage
    percentage_separately = string_list[j][string_list[j].find('<td class="sort rank-playTimesPredicted active" style=""><span>'):string_list[j].find('</span></td><td class="rank-playTimes" style="">')]
    percentage.append(percentage_separately.replace('<td class="sort rank-playTimesPredicted active" style=""><span>', ''))
    # Actual play times; if the slice looks too long, retry with a different end marker
    click_separately = string_list[j][string_list[j].find('</td><td class="rank-playTimes" style=""><span>'):string_list[j].find('</span><span class="star-playtimes">')]
    if len(click_separately) >= 10:
        click_separately = string_list[j][string_list[j].find('</td><td class="rank-playTimes" style=""><span>'):string_list[j].find('</span></td><td class="rank-average m-change" style="">')]
    click.append(click_separately.replace('</td><td class="rank-playTimes" style=""><span>', '').replace('</span><span class="star-playtimes">', ''))
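
Slicing raw HTML strings works, but it breaks as soon as the markup changes. A hedged alternative sketch that reads the same cells through Selenium instead; the class names come from the snippets above, while the row selector and the location of the data-name attribute are assumptions about the page structure:

album, percentage, click = [], [], []
rows = driver.find_elements_by_css_selector("table tbody tr")
for row in rows[1:]:   # skip the header row, as the j != 0 check does above
    name_cell = row.find_element_by_css_selector("[data-name]")
    album.append(name_cell.get_attribute("data-name"))
    percentage.append(row.find_element_by_class_name("rank-playTimesPredicted").text)
    # .text may include the star-playtimes suffix; trim it if it does
    click.append(row.find_element_by_class_name("rank-playTimes").text)
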
Fourth problem: Log into your MySQL database and drive it from Python!
db = DB('your database')   # DB is a small helper class (a PyMySQL-based sketch is given below)
db.insert(dataset)         # dataset holds the rows scraped above
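
DB here is presumably a small helper class rather than anything PyMySQL ships. A hedged sketch of what such a wrapper could look like; the connection parameters, table name and column names are all assumptions, not the real schema:

import pymysql

class DB:
    def __init__(self, database):
        # Connection details are placeholders; fill in your own host/user/password
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="your_password",
                                    database=database, charset="utf8mb4")

    def insert(self, dataset):
        # dataset is assumed to be a list of (album, percentage, click) tuples
        with self.conn.cursor() as cursor:
            cursor.executemany(
                "INSERT INTO daily_rank (album, percentage, click) VALUES (%s, %s, %s)",
                dataset,
            )
        self.conn.commit()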

Take care! Be sure to add time.sleep() calls (or explicit waits) whenever you do this, so the dynamic content has time to load!