Python NLP (2): Simple Segmentation Algorithms
There are four simple dictionary-matching/segmentation algorithms:
- fully segment
- forward segment
- backward segment
- bidirectional segment
1. Prepare the dictionary:
We first load a dictionary, which we will later use for word segmentation. Note that `JClass` and `HanLP` come from the `pyhanlp` package.

```python
from pyhanlp import *  # provides JClass and HanLP

def load_dict():
    IOUtil = JClass("com.hankcs.hanlp.corpus.io.IOUtil")
    core_file = HanLP.Config.CoreDictionaryPath
    print("Core file: ", core_file)
    mini_file = core_file.replace("txt", "mini.txt")
    print("Mini file: ", mini_file)
    dic = IOUtil.loadDictionary([mini_file])
    return dic
```
2. Fully segment:
Fully segment extracts every possible word that appears in the sentence. Note that the inner `range` must end at `len(text) + 1`; otherwise substrings ending at the last character are never checked.

```python
def full_segment(text, dic):
    word_list = []
    for i in range(len(text)):
        # the end index must reach len(text) + 1 so that substrings
        # ending at the last character are also checked
        for j in range(i + 1, len(text) + 1):
            word = text[i:j]
            if word in dic:
                word_list.append(word)
    return word_list
```

Testing fully segment:

```python
if __name__ == "__main__":
    dic = load_dict()
    tsent = "商品和服務"
    ssent = HanLP.convertToSimplifiedChinese(tsent)
    print(full_segment(ssent, dic))
```

The output is as follows.

['商', '商品', '品', '和', '和服', '服', '服务', '务']
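The algorithm itself does not depend on HanLP: a minimal, dependency-free sketch with a plain Python set as the dictionary (a toy dictionary assumed here for illustration, not HanLP's mini dictionary) behaves the same way:

```python
def full_segment(text, dic):
    """Enumerate every substring of text that appears in the dictionary."""
    word_list = []
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            word = text[i:j]
            if word in dic:
                word_list.append(word)
    return word_list

# toy dictionary (an assumption for illustration only)
toy_dic = {"商", "商品", "品", "和", "和服", "服", "服务", "务"}
print(full_segment("商品和服务", toy_dic))
# ['商', '商品', '品', '和', '和服', '服', '服务', '务']
```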
3. Forward segment:
Forward segment (forward maximum matching) scans the text from left to right and, at each position, keeps the longest word found in the dictionary:

```python
def forward_segment(text, dic):
    word_list = []
    head = 0
    while head < len(text):
        longest_word = text[head]
        for tail in range(head + 1, len(text) + 1):
            word = text[head:tail]
            if word in dic:
                longest_word = word  # a longer match replaces the shorter one
        word_list.append(longest_word)
        head += len(longest_word)
    return word_list
```
Testing forward segment:

```python
if __name__ == "__main__":
    dic = load_dict()
    tsent = "就讀北京大學"
    ssent = HanLP.convertToSimplifiedChinese(tsent)
    print(ssent)
    print(forward_segment(ssent, dic))
```
The output is as follows.
['就读', '北京大学']
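The greedy longest-match strategy can fail. A minimal sketch with a toy dictionary (an assumption for illustration, not HanLP's mini dictionary) shows forward matching mis-segmenting "研究生命起源" ("study the origin of life") because it greedily takes "研究生" ("graduate student"):

```python
def forward_segment(text, dic):
    """Greedy left-to-right longest match against the dictionary."""
    word_list = []
    head = 0
    while head < len(text):
        longest_word = text[head]
        for tail in range(head + 1, len(text) + 1):
            word = text[head:tail]
            if word in dic:
                longest_word = word
        word_list.append(longest_word)
        head += len(longest_word)
    return word_list

# toy dictionary (an assumption for illustration only)
toy_dic = {"研究", "研究生", "生命", "命", "起源"}
print(forward_segment("研究生命起源", toy_dic))
# ['研究生', '命', '起源'] -- greedily takes the longer "研究生"
```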
4. Backward segment:
Backward segment is similar to forward segment, but it scans the sentence from right to left, again keeping the longest dictionary word at each position:

```python
def backward_segment(text, dic):
    word_list = []
    tail = len(text) - 1
    while tail >= 0:
        longest_word = text[tail]
        for head in range(tail, -1, -1):
            word = text[head:tail + 1]
            if word in dic:
                longest_word = word
        word_list.insert(0, longest_word)
        tail -= len(longest_word)
    return word_list
```
Testing backward segment:

```python
if __name__ == "__main__":
    dic = load_dict()
    tsent = "研究生命起源"
    ssent = HanLP.convertToSimplifiedChinese(tsent)
    print(backward_segment(ssent, dic))
```

The output is as follows.
['研究', '生命', '起源']
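Backward matching handles "研究生命起源" correctly, but it has its own failure cases. A sketch with a toy dictionary (an assumption for illustration, not HanLP's mini dictionary) shows it mis-segmenting "项目的研究" ("the project's research") because, scanning from the right, it takes "目的" ("purpose") and strands "项":

```python
def backward_segment(text, dic):
    """Greedy right-to-left longest match against the dictionary."""
    word_list = []
    tail = len(text) - 1
    while tail >= 0:
        longest_word = text[tail]
        for head in range(tail, -1, -1):
            word = text[head:tail + 1]
            if word in dic:
                longest_word = word
        word_list.insert(0, longest_word)
        tail -= len(longest_word)
    return word_list

# toy dictionary (an assumption for illustration only)
toy_dic = {"项目", "目的", "的", "研究"}
print(backward_segment("项目的研究", toy_dic))
# ['项', '目的', '研究'] -- "目的" wrongly wins over "项目" + "的"
```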
5. Bidirectional segment:
Bidirectional segment runs both forward segment and backward segment, then picks whichever result forms words better: first the one with fewer words overall and, on a tie, the one with fewer single-character words.

```python
def count_single_char(word_list):
    """Count the single-character words in the list."""
    return sum(1 for word in word_list if len(word) == 1)

def bidirectional_segment(text, dic):
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
    if len(f) < len(b):
        return f
    if len(b) < len(f):
        return b
    if count_single_char(f) < count_single_char(b):
        return f
    return b
```
Testing bidirectional segment:

```python
if __name__ == "__main__":
    dic = load_dict()
    tsent_list = ["項目的研究", "商品和服務", "研究生命起源",
                  "當下雨天地面積水", "結婚的和尚未結婚的",
                  "歡迎新老師生前來就餐"]
    for tsent in tsent_list:
        ssent = HanLP.convertToSimplifiedChinese(tsent)
        print(bidirectional_segment(ssent, dic))
```

The output is as follows. Note that bidirectional matching still fails on several of these sentences; rule-based matching has inherent limits.

['项', '目的', '研究']
['商品', '和', '服务']
['研究', '生命', '起源']
['当下', '雨', '天', '地面', '积水']
['结婚', '的', '和', '尚未', '结婚', '的']
['欢', '迎新', '老', '师生', '前来', '就餐']
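The tie-breaking rules can be traced end to end with a dependency-free sketch. Here "研究生命起源" yields three words in both directions, so the single-character count decides: forward has one single-character word ("命"), backward has none, so the backward result wins. The toy dictionary is an assumption for illustration, not HanLP's mini dictionary:

```python
def forward_segment(text, dic):
    word_list = []
    head = 0
    while head < len(text):
        longest_word = text[head]
        for tail in range(head + 1, len(text) + 1):
            word = text[head:tail]
            if word in dic:
                longest_word = word
        word_list.append(longest_word)
        head += len(longest_word)
    return word_list

def backward_segment(text, dic):
    word_list = []
    tail = len(text) - 1
    while tail >= 0:
        longest_word = text[tail]
        for head in range(tail, -1, -1):
            word = text[head:tail + 1]
            if word in dic:
                longest_word = word
        word_list.insert(0, longest_word)
        tail -= len(longest_word)
    return word_list

def count_single_char(word_list):
    return sum(1 for word in word_list if len(word) == 1)

def bidirectional_segment(text, dic):
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
    if len(f) < len(b):
        return f
    if len(b) < len(f):
        return b
    if count_single_char(f) < count_single_char(b):
        return f
    return b  # tie goes to the backward result

# toy dictionary (an assumption for illustration only)
toy_dic = {"研究", "研究生", "生命", "命", "起源"}
print(forward_segment("研究生命起源", toy_dic))        # ['研究生', '命', '起源']
print(backward_segment("研究生命起源", toy_dic))       # ['研究', '生命', '起源']
print(bidirectional_segment("研究生命起源", toy_dic))  # ['研究', '生命', '起源']
```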
Reference:
1. 自然語言處理入門