Python NLP (3) Trie

查找的資料結構及複雜度:

  1. list: O(n)
  2. tree: O(log(n))
  3. dictionary: O(1)

通常查找會採用 tree 資料結構在空間及時間複雜度上取得平衡。

1. Node:

tree 結構的基礎是節點 (node)。節點通常會有 children 和 value 兩個屬性,children 代表節點底下的子節點, value 代表該節點底下紀錄的值。

class Node:

    def __init__(self, value) -> None:
        self._children = {}
        self._value = value
    
    def _add_child(self, char, value, overwrite=False):
        child = self._children.get(char)
        if child is None:
            child = Node(value) # 建立並新增節點
            self._children[char] = child
        elif overwrite:
            child._value = value
        return child

2. Trie:

建立 tree 並實現 __getitem__ 和 __setitem__ 方法。

__getitem__ 會在 tree 中以 key 的每個 char 向下搜尋返回最後一個節點所存放的值。

__setitem__ 會在 tree 中以 key 的每個 char 向下新增節點並設定最後一個節點的值。

class Trie(Node):

    def __init__(self) -> None:
        super().__init__(None)

    def __contains__(self, key):
        return self[key] is not None

    def __getitem__(self, key):
        state = self
        for char in key:
            state = state._children.get(char)
            if state is None:
                return None
        return state._value

    def __setitem__(self, key, value):
        state = self
        for i, char in enumerate(key):
            if i < len(key) - 1:
                state = state._add_child(char, None, False)
            else:
                state = state._add_child(char, value, True)

3. 建 tree 查找測試:

CRUD 測試:

if __name__ == "__main__":
    trie = Trie()
   
    # create
    trie["節點"] = "node"
    trie["節日"] = "holiday"
    trie["節約用水"] = "save water"
    trie["節慶"] = "festival"
    trie["寵物"] = "pet"
    assert "節日" in trie
   
    # read
    print("寵物: ", trie["寵物"])
    assert trie["寵物"] == "pet"

    # update
    print("trie['節日']: ", trie['節日'])
    trie["節日"] = "festival"
    print("trie['節日']: ", trie['節日'])
    assert trie["節日"] == "festival"
    
    # delete
    trie["節慶"] = None
    assert "節慶" not in trie

Reference:

1. 自然語言處理入門

留言

熱門文章