Web Scraping (2) Navigating Trees

1. BeautifulSoup Objects:

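  • BeautifulSoup objects: the object returned by BeautifulSoup()
  • Tag objects: what find() returns (and what the list returned by findAll() contains) when called on a BeautifulSoup object; they provide a get_text() method
  • NavigableString objects: the text inside a tag; these are str subclasses and are used directly as strings rather than through get_text()
  • Comment object: the text of an HTML comment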
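
To see the four object types side by side, here is a minimal sketch. The one-line snippet and the variable names are made up for illustration, and it assumes the built-in "html.parser" backend so that no extra package is needed:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet: one tag holding a text node and a comment node
snippet = "<p>hello<!-- a note --></p>"
soup = BeautifulSoup(snippet, "html.parser")

p_tag = soup.find("p")                      # a Tag
text_node, comment_node = p_tag.contents    # a NavigableString and a Comment

kinds = [
    type(soup).__name__,
    type(p_tag).__name__,
    type(text_node).__name__,
    type(comment_node).__name__,
]
print(kinds)
```

Comment is itself a subclass of NavigableString, so code that wants only visible text usually has to check for Comment explicitly.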

2. Setup:

Define a simple HTML document:

from bs4 import BeautifulSoup

html = """
<html>
<head></head>
<body>
  <div id="content" class="my-body">
    <h3>heading 1</h3>
      <ol>
        <li>list 1</li>
        <li>list 2</li>
        <li>list 3</li>
      </ol>
    <h3>heading 2</h3>
  </div>
</body>
</html>
"""

# features="lxml" requires the lxml package; "html.parser" is the built-in alternative
bsObj = BeautifulSoup(html, features="lxml")


3. Children:

for child in bsObj.find("div").children:
    print(child)

Get the text content of each child:

for child in bsObj.find("ol").children:
    print(child.string)
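
Note that .children also yields the whitespace between tags as NavigableString nodes, so the loops above print blank lines as well. A minimal sketch of keeping only the tag children, assuming a trimmed copy of the HTML above and the built-in "html.parser" backend (the names soup, all_children and tag_children are illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the example document (assumption: same structure as above)
html = """
<div id="content" class="my-body">
  <h3>heading 1</h3>
  <ol>
    <li>list 1</li>
    <li>list 2</li>
    <li>list 3</li>
  </ol>
  <h3>heading 2</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

all_children = list(ol.children)                                 # tags plus whitespace text nodes
tag_children = [c for c in all_children if isinstance(c, Tag)]   # tags only
item_texts = [c.get_text() for c in tag_children]

print(len(all_children), len(tag_children))
print(item_texts)
```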


4. Siblings:

Iterate forward over the following siblings:

for sibling in bsObj.find("li").next_siblings:
    print(sibling)

Iterate backward over the preceding siblings:

# string= is the current name of the older text= argument
for sibling in bsObj.find("li", string="list 3").previous_siblings:
    print(sibling)
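
As with children, the sibling iterators also return whitespace text nodes. One way to get only sibling tags is find_next_siblings() and find_previous_siblings(). This is a sketch under the same assumptions as before (trimmed HTML, "html.parser" backend, illustrative variable names):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: just the list from the example document
html = """
<ol>
  <li>list 1</li>
  <li>list 2</li>
  <li>list 3</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")
# Only <li> tags after the first one, in document order
later = [li.get_text() for li in first_li.find_next_siblings("li")]

last_li = soup.find("li", string="list 3")
# Only <li> tags before the last one, nearest sibling first
earlier = [li.get_text() for li in last_li.find_previous_siblings("li")]

print(later)
print(earlier)
```

The backward search walks away from the starting tag, so its result is in reverse document order.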


5. Parent:

# Print the parent of the first <h3>, i.e. the <div id="content"> element
print(bsObj.find("h3").parent)
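
Besides .parent, the .parents iterator walks all the way up to the document root. A minimal sketch, again assuming a trimmed copy of the example HTML and the built-in "html.parser" backend (variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example document (assumption: no <html>/<body> wrapper)
html = """
<div id="content" class="my-body">
  <h3>heading 1</h3>
  <ol>
    <li>list 1</li>
  </ol>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Direct parent of the first <h3>
parent_id = soup.find("h3").parent.get("id")

# Every ancestor of the <li>, nearest first; the root object is named "[document]"
parent_names = [p.name for p in soup.find("li").parents]

print(parent_id)
print(parent_names)
```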

