Python NLP (2): Simple Segmentation Algorithms

There are four simple dictionary-based word matching/segmentation algorithms:

  1. fully segment
  2. forward segment
  3. backward segment
  4. bidirectional segment

1. Prepare the dictionary:

First we load a dictionary. The segmenters below use it to decide whether a substring is a word.

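# Assumes the pyhanlp package, which provides HanLP and JClass
from pyhanlp import HanLP, JClass

def load_dict():
    """Load HanLP's mini core dictionary."""
    IOUtil = JClass("com.hankcs.hanlp.corpus.io.IOUtil")
    core_file = HanLP.Config.CoreDictionaryPath
    print("Core file: ", core_file)
    # Swap the .txt suffix for .mini.txt to point at the smaller dictionary
    mini_file = core_file.replace(".txt", ".mini.txt")
    print("Mini file: ", mini_file)
    dic = IOUtil.loadDictionary([mini_file])
    return dic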
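
To try the segmenters below without installing HanLP, any container that supports the `in` operator can stand in for the dictionary. A minimal sketch, assuming a hand-picked word list (`TOY_WORDS`, `load_toy_dict` and `toy_dic` are hypothetical names, not part of HanLP):

```python
# Hypothetical stand-in for load_dict(); the word list is made up for illustration.
TOY_WORDS = ["商", "商品", "品", "和", "和服", "服", "服务", "务"]

def load_toy_dict():
    """Return a set so that `word in dic` is a fast membership test."""
    return set(TOY_WORDS)

toy_dic = load_toy_dict()
print("商品" in toy_dic)   # a listed word
print("品和" in toy_dic)   # not a listed word
```

A set is enough here because the segmenters only ever ask whether a substring is a known word.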

2. Fully segment:

Fully segment extracts every substring of the sentence that appears in the dictionary.

def full_segment(text, dic):
    word_list = []
    for i in range(len(text)):
        # j is an exclusive end index, so it has to reach len(text)
        for j in range(i+1, len(text)+1):
            word = text[i:j]
            if word in dic:
                word_list.append(word)
    return word_list

Test fully segment:

if __name__ == "__main__":
    dic = load_dict()
    tsent = "商品和服務"
    ssent = HanLP.convertToSimplifiedChinese(tsent)
    print(full_segment(ssent, dic))

The result is as follows.

['商', '商品', '品', '和', '和服', '服', '服务', '务']
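
Because fully segment reports every dictionary hit, its output contains overlapping words. A small sketch that also records where each hit starts and ends makes the overlaps visible. It assumes a toy Python set as the dictionary; `full_segment_spans` and `toy_dic` are illustrative names, not from HanLP.

```python
def full_segment_spans(text, dic):
    """Return (start, end, word) for every substring of text found in dic."""
    spans = []
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):  # end index is exclusive
            word = text[i:j]
            if word in dic:
                spans.append((i, j, word))
    return spans

# Toy dictionary, assumed for illustration only.
toy_dic = {"商品", "和", "和服", "服务"}

for start, end, word in full_segment_spans("商品和服务", toy_dic):
    print(start, end, word)
```

With this toy dictionary, 和服 covers positions 2 to 4 and 服务 covers 3 to 5, so the two hits share one character. That overlap is the ambiguity the remaining algorithms have to resolve.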

3. Forward segment:

Forward segment scans the text from left to right and, at each position, keeps the longest dictionary word that starts there:

def forward_segment(text, dic):
    word_list = []
    head = 0
    while head < len(text):
        longest_word = text[head]
        for tail in range(head+1, len(text)+1):
            word = text[head:tail]
            if word in dic:
                longest_word = word  # every later match is longer, so keep it
        word_list.append(longest_word)
        head += len(longest_word)
    return word_list

Test forward segment:

if __name__ == "__main__":
    dic = load_dict()
    tsent = "就讀北京大學"
    ssent = HanLP.convertToSimplifiedChinese(tsent)
    print(ssent)
    print(forward_segment(ssent, dic))

The result is as follows.

就读北京大学
['就读', '北京大学']
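
The loop above tests every substring that starts at `head`. A common variation, sketched here under the assumption that the dictionary is a plain Python set, tries the longest candidate first and stops at the first hit. `forward_segment_longest_first` and `toy_dic` are hypothetical names.

```python
def forward_segment_longest_first(text, dic):
    """Forward longest matching that checks long candidates before short ones."""
    # No dictionary word is longer than max_len, so longer candidates are skipped.
    max_len = max((len(w) for w in dic), default=1)
    word_list = []
    head = 0
    while head < len(text):
        longest_word = text[head]  # fall back to a single character
        upper = min(len(text), head + max_len)
        for tail in range(upper, head + 1, -1):
            word = text[head:tail]
            if word in dic:
                longest_word = word
                break  # the first hit is the longest one
        word_list.append(longest_word)
        head += len(longest_word)
    return word_list

# Toy dictionary, assumed for illustration only.
toy_dic = {"就读", "北京", "大学", "北京大学"}
print(forward_segment_longest_first("就读北京大学", toy_dic))
```

Bounding the window by the longest dictionary word keeps the inner loop short on long sentences, at the cost of one pass over the dictionary to find `max_len`.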

4. Backward segment:

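Backward segment mirrors forward segment: we scan from the end of the sentence toward the front and, at each position, keep the longest dictionary word that ends there: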

def backward_segment(text, dic):
    word_list = []
    tail = len(text) - 1
    
    while tail >= 0:
        longest_word = text[tail]
        for head in range(tail, -1, -1):
            word = text[head:tail+1]
            if word in dic:
                longest_word = word
        word_list.insert(0, longest_word)
        tail -= len(longest_word)
    return word_list

Test backward segment:

if __name__ == "__main__":
    dic = load_dict()
    tsent = "研究生命起源"
    ssent = HanLP.convertToSimplifiedChinese(tsent)
    print(backward_segment(ssent, dic))

The result is as follows.

['研究', '生命', '起源']
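
Scan direction can change the answer. The sketch below, which assumes a toy Python set in place of the HanLP dictionary (`toy_dic`, `forward` and `backward` are illustrative names), runs both directions on the same text to show how they can disagree.

```python
def forward(text, dic):
    """Longest match scanning left to right."""
    words, head = [], 0
    while head < len(text):
        longest = text[head]
        for tail in range(head + 1, len(text) + 1):
            if text[head:tail] in dic:
                longest = text[head:tail]
        words.append(longest)
        head += len(longest)
    return words

def backward(text, dic):
    """Longest match scanning right to left."""
    words, tail = [], len(text)  # tail is an exclusive end index
    while tail > 0:
        longest = text[tail - 1]
        for head in range(tail - 1, -1, -1):
            if text[head:tail] in dic:
                longest = text[head:tail]
        words.insert(0, longest)
        tail -= len(longest)
    return words

# Toy dictionary, assumed for illustration only.
toy_dic = {"研究", "研究生", "生命", "起源"}
text = "研究生命起源"
print(forward(text, toy_dic))
print(backward(text, toy_dic))
```

With this toy dictionary, the forward scan greedily takes 研究生 and leaves 命 stranded as a single character, while the backward scan recovers 研究, 生命 and 起源.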

5. Bidirectional segment:

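Bidirectional segment runs both forward segment and backward segment and keeps the result that forms words best: the one with fewer words or, if the word counts are equal, the one with fewer single-character words. If the two are still tied, the backward result is returned.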

def count_single_char(word_list):
    """Count the single-character words in the list."""
    return sum(1 for word in word_list if len(word) == 1)

def bidirectional_segment(text, dic):
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
   
    if len(f) < len(b):
        return f

    if len(b) < len(f):
        return b

    if count_single_char(f) < count_single_char(b):
        return f
    return b

Test bidirectional segment:

if __name__ == "__main__":
    dic = load_dict()
    tsent_list = ["項目的研究", "商品和服務", "研究生命起源", "當下雨天地面積水", "結婚的和尚未結婚的", "歡迎新老師生前來就餐"]
    for tsent in tsent_list: 
        ssent = HanLP.convertToSimplifiedChinese(tsent)
        print(bidirectional_segment(ssent, dic))

The result is as follows.

['项', '目的', '研究']
['商品', '和', '服务']
['研究', '生命', '起源']
['当下', '语', '天', '地面', '积水']
['结婚', '的', '和', '尚未', '结婚', '的']
['欢', '迎新', '老', '师生', '前来', '就餐']
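
The selection rule can be exercised on its own, without a dictionary. The sketch below assumes two candidate segmentations are already available; `pick_segmentation` is a hypothetical helper that mirrors the comparisons in `bidirectional_segment`, and the candidate lists are made-up inputs for illustration rather than HanLP output.

```python
def count_single_char(word_list):
    """Count the single-character words in the list."""
    return sum(1 for word in word_list if len(word) == 1)

def pick_segmentation(f, b):
    """Prefer fewer words, then fewer single-character words, then b."""
    if len(f) != len(b):
        return f if len(f) < len(b) else b
    if count_single_char(f) < count_single_char(b):
        return f
    return b

# Rule 1: different word counts, the shorter list wins.
print(pick_segmentation(["北京大学"], ["北京", "大学"]))
# Rule 2: same word count, fewer single-character words wins.
print(pick_segmentation(["当下", "雨天"], ["当", "下雨天"]))
# Rule 3: complete tie, the second (backward) candidate is returned.
print(pick_segmentation(["项目", "的", "研究"], ["项", "目的", "研究"]))
```

The third case shows the limit of the heuristic: when both candidates have the same word count and the same number of single-character words, the rule falls back to the backward result whether or not it is the better reading.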

Reference:

1. 自然語言處理入門 (Introduction to Natural Language Processing)
