Basic web crawler

Writing a web crawler in Python is basically straightforward. The basic version is the program below; from there, it's up to you to build on it.

#-*- coding: utf-8 -*-

from __future__ import print_function
from bs4 import BeautifulSoup

import requests

url = "https://www.flyvair.com/zh/"
# fetch the page; the response object holds the headers and the body
req = requests.get(url)
req.encoding = 'utf-8'
# keep only the body of the response as plain text
plain_text = req.text

# print(plain_text)
soup = BeautifulSoup(plain_text, 'html.parser')

print(soup.prettify())

# Do something you want
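As a small sketch of that last step, the parsed soup can be queried with find_all. The snippet below runs BeautifulSoup on an inline HTML fragment (a stand-in for plain_text, so it works offline; the tags and paths in it are made up for illustration) and pulls out every link and image:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# A small inline fragment standing in for the fetched plain_text,
# so this sketch runs without hitting the network.
html = """
<html><body>
  <a href="/zh/news">News</a>
  <a href="/zh/booking">Booking</a>
  <img src="/img/logo.png" alt="logo">
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Collect the href of every <a> tag and the src of every <img> tag.
links = [a.get('href') for a in soup.find_all('a')]
images = [img.get('src') for img in soup.find_all('img')]

print(links)   # ['/zh/news', '/zh/booking']
print(images)  # ['/img/logo.png']
```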

Scraping with Python Selenium and PhantomJS

The crawler above runs into a problem: many sites nowadays are dynamic, so pages rendered by JavaScript cannot be fetched this way.

So here we use Selenium and PhantomJS to fetch dynamic pages.

PhantomJS

PhantomJS is a server-side JavaScript API built on WebKit. It can render the web without a browser, and it natively supports various web standards: DOM handling, JavaScript, CSS selectors, JSON, Canvas, and Scalable Vector Graphics (SVG). PhantomJS mainly uses JavaScript and CoffeeScript to drive WebKit's individual modules, such as the CSS selector engine, SVG, and HTTP networking. (Note that PhantomJS development has since been suspended; recent Selenium releases drive headless Chrome or Firefox instead.)

Selenium

Selenium WebDriver can simulate essentially *all* of the actions a person performs in a browser. It is a toolset for automated web testing.

scrape.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://www.flyvair.com/zh/'


class WebScraper(object):

    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)  # set the browser window size at startup

    def scrape(self):
        self.driver.get(link)
        sleep(1)  # crude wait for the JavaScript-rendered content to load
        s = BeautifulSoup(self.driver.page_source, 'html.parser')
        tmp = s.find_all('img')

        for r in tmp:
            print(r)
            print("")

        self.driver.quit()

if __name__ == '__main__':
    scraper = WebScraper()
    scraper.scrape()
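The loop above only prints the raw <img> tags, and their src attributes are often relative. As a follow-up sketch (run on an inline HTML fragment standing in for self.driver.page_source, so it needs no browser; the image paths are made up), the scraped tags could be resolved into absolute URLs with the standard library's urljoin:

```python
# -*- coding: utf-8 -*-
from urllib.parse import urljoin

from bs4 import BeautifulSoup

link = 'https://www.flyvair.com/zh/'

# Inline stand-in for self.driver.page_source, so this sketch
# runs without starting a browser.
page_source = """
<html><body>
  <img src="/assets/banner.jpg">
  <img src="promo/summer.png">
</body></html>
"""

soup = BeautifulSoup(page_source, 'html.parser')

# Resolve every relative src against the page URL.
urls = [urljoin(link, img.get('src')) for img in soup.find_all('img')]

for u in urls:
    print(u)
# https://www.flyvair.com/assets/banner.jpg
# https://www.flyvair.com/zh/promo/summer.png
```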