Web Scraping (0) Building Scrapers

September 13, 2018


1. Installing bs4:

Beautiful Soup is a library that parses HTML and XML into a navigable tree. Install it with pip3:

$ pip3 install beautifulsoup4
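A quick way to confirm the install worked is to import the package and print its version. Note that the package is installed as beautifulsoup4 but imported as bs4. This is only a sanity-check sketch; the variable name installed_version is chosen here for illustration.

```python
import bs4

# The distribution is named "beautifulsoup4", but the importable module is "bs4".
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```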

2. Hello Scraper:

First we call urlopen("https://en.wikipedia.org/wiki/Ward_Cunningham") to get an HTML response, then pass that HTML to BeautifulSoup for parsing.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://en.wikipedia.org/wiki/Ward_Cunningham")

# Pick one parser. "html.parser" ships with Python; "lxml" and "html5lib"
# must be installed separately (see section 3).
# bsObj = BeautifulSoup(html.read(), features="html.parser")
# bsObj = BeautifulSoup(html.read(), features="html5lib")
bsObj = BeautifulSoup(html.read(), features="lxml")

print(bsObj.head.title)
print(bsObj.title)

headings = bsObj.find_all(["h1", "h2"])  # a list matches any of the given tag names
for heading in headings:
    print(heading)

print(bsObj.prettify())  # print whole doc

bsObj is an object holding the parsed document tree; we can read the data of any HTML element through it.
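As a rough illustration of reading element data from the parsed object, the sketch below pulls text and attributes out of a small document. The HTML string, the example.com URLs, and the variable names are all made up for this example so that it runs offline; it uses the built-in "html.parser" so no extra install is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used in place of a live page so the example runs offline.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="top">Main Heading</h1>
    <p>See <a href="https://example.com/a">link A</a> and
       <a href="https://example.com/b">link B</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, features="html.parser")

# Text inside the first <h1> tag.
heading_text = soup.h1.get_text()

# Attributes are read with dictionary-style access.
heading_id = soup.h1["id"]

# find_all returns every matching tag; collect each link's href attribute.
links = [a["href"] for a in soup.find_all("a")]

print(heading_text)
print(heading_id)
print(links)
```

get_text() returns the text inside a tag, while square brackets read an attribute. Indexing an attribute that does not exist raises KeyError; tag.get("href") returns None instead.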

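bsObj.head.title and bsObj.title return the same tag: accessing a tag name as an attribute searches the whole subtree for the first match, so you do not have to walk down the tree level by level. (Note: title is a child of head; you can confirm this in the browser's developer tools with F12.)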
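The equivalence can be checked on a small document. This is a minimal sketch with a made-up HTML string and illustrative variable names; it uses the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document; <title> lives inside <head>.
sample_html = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>Hello</p></body></html>"
)
soup = BeautifulSoup(sample_html, features="html.parser")

via_head = soup.head.title    # explicit path: html -> head -> title
direct = soup.title           # shortcut: first <title> anywhere in the tree
via_find = soup.find("title") # the same lookup written as a method call

# All three expressions refer to the same tag in the parsed tree.
same_tag = via_head == direct == via_find

print(direct)
print(same_tag)
```

The shortcut only ever returns the first match. When a page may contain several tags with the same name, find_all is the safer choice.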

3. Parsers:

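According to the official documentation [1], the choice of parser affects both how leniently malformed HTML is handled and how fast parsing runs.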

Additional parsers can be installed with pip3:

$ pip3 install lxml
$ pip3 install html5lib
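To see how the parsers differ in practice, the sketch below feeds the same invalid fragment to each one and prints the tree it builds. The fragment "<a></p>" is the one used in the Beautiful Soup documentation's parser comparison; the loop, the results dictionary, and the "(not installed)" fallback are assumptions added here so the example still runs when lxml or html5lib is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: a closing </p> with no opening <p>.
broken_html = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_html, features=parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    shown = output if output is not None else "(not installed)"
    print(f"{parser:12} {shown}")
```

Each parser repairs the broken markup differently, so the same selector can return different results depending on the parser. Naming the parser explicitly, as the scraper above does with features=, keeps the behavior the same across machines.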

Reference:

[1] https://www.crummy.com/software/BeautifulSoup/bs4/doc/
