Web Scraping (2) Navigating Trees
1. BeautifulSoup Objects:
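- BeautifulSoup objects: the object returned by BeautifulSoup(), representing the whole parsed document
- Tag objects: what you get by calling find() (a single Tag) or findAll() (a list of Tags) on a BeautifulSoup object; they have a get_text() method
- NavigableString objects: the text inside a tag; in older bs4 releases these have no get_text() method
- Comment objects: the text of an HTML comment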
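To make the four kinds of objects concrete, here is a minimal sketch that parses a tiny made-up snippet and checks the type of each piece. The snippet and the variable names are only for illustration, and it assumes the built-in "html.parser" backend so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical one-line document: a tag holding some text and a comment.
doc = BeautifulSoup("<p>hello<!-- a note --></p>", "html.parser")

tag = doc.find("p")      # find() returns a Tag
text = tag.contents[0]   # the text inside the tag is a NavigableString
note = tag.contents[1]   # the comment is a Comment

for obj in (doc, tag, text, note):
    print(type(obj).__name__)
```

Comment is a subclass of NavigableString, so isinstance(note, NavigableString) is also true; check for Comment first when the two need to be told apart.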
2. Setup:
Define a simple HTML document:
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<div id="content" class="my-body">
<h3>heading 1</h3>
<ol>
<li>list 1</li>
<li>list 2</li>
<li>list 3</li>
</ol>
<h3>heading 2</h3>
</div>
</body>
</html>
"""
bsObj = BeautifulSoup(html, features="lxml")
3. Children:
Iterate over the direct children of a tag:
for child in bsObj.find("div").children:
    print(child)
Get the text content of each child tag:
for child in bsObj.find("ol").children:
    print(child.string)
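One thing the loops above run into is that .children also yields the newline strings sitting between tags. A minimal sketch of one way to skip them, assuming a shortened copy of the list from the setup and the built-in "html.parser" backend (the names soup, all_children and items are just for this example):

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the <ol> from the setup section.
html = """
<ol>
<li>list 1</li>
<li>list 2</li>
<li>list 3</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Every direct child, including the "\n" strings between the <li> tags.
all_children = list(soup.find("ol").children)

# Keep only Tag children to drop the whitespace-only strings.
items = [child for child in all_children if isinstance(child, Tag)]
texts = [item.string for item in items]

print(len(all_children), len(items))
print(texts)
```

Filtering with isinstance(child, Tag) is only one option; checking child.name is not None would work in a similar way.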
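4. Siblings:
Iterate over the siblings that come after a tag:
for sibling in bsObj.find("li").next_siblings:
    print(sibling)
Iterate over the siblings that come before a tag:
# string= is the current name of the argument; older bs4 releases call it text=
for sibling in bsObj.find("li", string="list 3").previous_siblings:
    print(sibling)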
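A small sketch of what the two sibling iterators return, assuming a whitespace-free copy of the list so that only tags show up, and the built-in "html.parser" backend. The variable names are made up for this example, and string= assumes bs4 4.4 or later.

```python
from bs4 import BeautifulSoup, Tag

# Same list as in the setup, written on one line to avoid "\n" siblings.
html = "<ol><li>list 1</li><li>list 2</li><li>list 3</li></ol>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")
last = soup.find("li", string="list 3")

# next_siblings walks forward; the starting tag itself is not included.
after_first = [s.string for s in first.next_siblings if isinstance(s, Tag)]

# previous_siblings walks backward, so the nearest sibling comes first.
before_last = [s.string for s in last.previous_siblings if isinstance(s, Tag)]

print(after_first)
print(before_last)
```

Note the order of before_last: it starts with the closest sibling ("list 2") and ends with the farthest ("list 1").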
5. Parent:
bsObj.find("h3").parent
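As a quick check of what .parent gives back, the sketch below uses a trimmed copy of the setup document and the built-in "html.parser" backend. It also touches .parents, which is not used in the notes above but is the plural counterpart that keeps walking upward; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the setup document.
html = (
    '<div id="content" class="my-body">'
    "<h3>heading 1</h3><ol><li>list 1</li></ol>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# .parent is the tag that directly contains the <h3>.
parent = soup.find("h3").parent
print(parent.name, parent["id"])

# .parents yields each ancestor in turn, nearest first.
ancestors = [p.name for p in soup.find("li").parents]
print(ancestors)
```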