Watch 1 Star 3 Fork 1

ZFJ_张福杰 / DFAFilterDemoObjective-C

Join us
Explore and code with more than 2 million developers,Free private repositories !:)
Sign up
This repository doesn't specify license. Without author's permission, this code is only for learning and cannot be used for other purposes.
OC、Swift、Python spread retract

Clone or download
zhangfujie authored //
Cancel
Notice: Creating folder will generate an empty file .keep, because not support in Git
Loading...
README.md

在这里插入图片描述

前言

前段时间,公司的IM SDK想做敏感词过滤,但是后端的小伙伴《比较忙》,在开产品需求会的时候想把敏感词过滤放到前端,让iOS、安卓自己搞,但是前端小伙伴写了一个方法来检测一段文本,耗时【一两秒】钟而且比较耗CPU,这样肯定不行的,最后后端小伙伴妥协了,把敏感词过滤放到后端了。

一般的思路可能是遍历敏感词库,然后把一段文字的敏感词过滤掉,但是针对比较大的词库时(比如我们的敏感词库10万),这样非常耗时和耗内存,在电脑上还能跑跑,但是在手机上分分钟钟被系统杀死掉,这样肯定是不行的,这里就用到一种DFA算法。

但是使用了DFA算法,十万的敏感词库过滤一句话只需要【0.434510秒】!

2019-10-23 14:34:08.230918+0800 DFAFilterDemo[4728:4650502] message == 小明骂小王是个王八蛋,小王骂小明是个王八羔子!
2019-10-23 14:34:08.232972+0800 DFAFilterDemo[4728:4650502] result == 小明骂小王是个***,小王骂小明是个王八羔子!
2019-10-23 14:34:08.316380+0800 DFAFilterDemo[4728:4650502] 总共耗时: 0.434510 

DFA算法

简介

何谓DFA,它的全称是Deterministic Finite Automaton,即确定有穷自动机;其特征为:有一个有限状态集合和一些从一个状态通向另一个状态的边,每条边上标记有一个符号,其中一个状态是初态,某些状态是终态。但不同于不确定的有限自动机,DFA中不会有从同一状态出发的两条边标志有相同的符号;DFA算法的核心是建立了以敏感词为基础的许多敏感词树。

描述

我们先把敏感词库拆分解析成一个”敏感词树“,我们以敏感词”王八蛋“和”王八羔子“为例: 在这里插入图片描述 拆成的敏感词树如下: 在这里插入图片描述

代码

OC代码

//
//  DFAFilter.m
//  DFAFilterDemo
//
//  Created by 张福杰 on 2019/10/22.
//  Copyright © 2019 张福杰. All rights reserved.
//

#import "DFAFilter.h"

@interface DFAFilter ()

@property (nonatomic,strong) NSMutableDictionary *keyword_chains;
@property (nonatomic,  copy) NSString *delimit;

@end

@implementation DFAFilter

- (instancetype)init{
    if(self == [super init]){
        _delimit = @"\x00";
    }
    return self;
}

/// 读取解析敏感词
- (void)parseSensitiveWords:(NSString *)path{
    if(path == nil){
        path = [[NSBundle mainBundle] pathForResource:@"sensitive_words" ofType:@"txt"];
    }
    NSString *content = [[NSString alloc] initWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
    NSArray *keyWordList = [content componentsSeparatedByString:@","];
    for (NSString *keyword in keyWordList) {
        [self addSensitiveWords:keyword];
    }
    
    NSLog(@"keyword_chains == %@",self.keyword_chains);
}

- (void)addSensitiveWords:(NSString *)keyword{
    keyword = keyword.lowercaseString;
    keyword = [keyword stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    
    NSMutableDictionary *node = self.keyword_chains;
    for (int i = 0; i < keyword.length; i ++) {
        NSString *word = [keyword substringWithRange:NSMakeRange(i, 1)];
        if (node[word] == nil) {
            node[word] = [NSMutableDictionary dictionary];
        }
        node = node[word];
    }
    //敏感词最后一个字符标识
    [node setValue:@0 forKey:self.delimit];
}

- (NSString *)filterSensitiveWords:(NSString *)message replaceKey:(NSString *)replaceKey{
    replaceKey = replaceKey == nil ? @"*" : replaceKey;
    message = message.lowercaseString;
    NSMutableArray *retArray = [[NSMutableArray alloc] init];
    NSInteger start = 0;
    while (start < message.length) {
        NSMutableDictionary *level = self.keyword_chains.mutableCopy;
        NSInteger step_ins = 0;
        NSString *message_chars = [message substringWithRange:NSMakeRange(start, message.length - start)];
        for(int i = 0; i < message_chars.length; i++){
            NSString *chars_i = [message_chars substringWithRange:NSMakeRange(i, 1)];
            if([level.allKeys containsObject:chars_i]){
                step_ins += 1;
                NSDictionary *level_char_dict = level[chars_i];
                if(![level_char_dict.allKeys containsObject:self.delimit]){
                    level = level_char_dict.mutableCopy;
                }else{
                    NSMutableString *ret_str = [[NSMutableString alloc] init];
                    for(int i = 0; i < step_ins; i++){
                         [ret_str appendString:replaceKey];
                    }
                    [retArray addObject:ret_str];
                    start += (step_ins - 1);
                    break;
                }
            }else{
                [retArray addObject:[NSString stringWithFormat:@"%C",[message characterAtIndex:start]]];
                break;
            }
        }
        start ++;
    }
    return [retArray componentsJoinedByString:@""];
}

- (NSMutableDictionary *)keyword_chains{
    if(_keyword_chains == nil){
        _keyword_chains = [[NSMutableDictionary alloc] initWithDictionary:@{}];
    }
    return _keyword_chains;
}

@end

Swift代码

//
//  DFAFilter.swift
//  DFAFilterDemo
//
//  Created by 张福杰 on 2019/10/23.
//  Copyright © 2019 张福杰. All rights reserved.
//

import UIKit

class DFAFilter: NSObject {
    lazy var keyword_chains: NSMutableDictionary = {
        let dict = NSMutableDictionary()
        return dict
    }()
    
    lazy var delimit: String = "\\x00";
    
    /// 读取敏感词
    func parseSensitiveWords() -> Void {
        let path = Bundle.main.path(forResource: "sensitive_words", ofType: "txt");
        let url = URL(fileURLWithPath: path!)
        do {
            let data = try Data(contentsOf: url)
            let content: String = String(data: data, encoding: String.Encoding.utf8)!
            let keyWordList = content.components(separatedBy: ",")
            for keyword in keyWordList {
                addSensitiveWords(keyword)
            }
            
        } catch let error as Error? {
            print(error?.localizedDescription as Any)
        }
    }
    
    /// 添加敏感词到敏感词树
    func addSensitiveWords(_ keyword: String) -> Void {
        let keyword: String = keyword.lowercased().trimmingCharacters(in: .whitespacesAndNewlines)
        var node = self.keyword_chains
        if keyword.count <= 0{
            return
        }
        
        for index in 0...keyword.count - 1 {
            let index0 = keyword.index(keyword.startIndex, offsetBy: index)
            let index1 = keyword.index(keyword.startIndex, offsetBy: index + 1)
            let word = keyword[index0..<index1]
            if node[word] == nil{
                node[word] = NSMutableDictionary()
            }
            node = node[word] as! NSMutableDictionary
        }
        node[self.delimit] = 0
    }
    
    /// 开始过滤敏感词
    func filterSensitiveWords(_ message: String, replaceKey: String) -> String {
        let replaceKey = replaceKey.count > 0 ? replaceKey : "*"
        let message = message.lowercased()
        let retArray: NSMutableArray = NSMutableArray()
        var start = 0
        while start < message.count {
            var level: NSMutableDictionary = self.keyword_chains.mutableCopy() as! NSMutableDictionary
            var step_ins = 0
            let message_chars = getChar(message, startIndex: start, endIndex: message.count)
            for index in 0...message_chars.count - 1 {
                let chars_i = getChar(message_chars, startIndex: index, endIndex: index + 1)
                if level[chars_i] != nil{
                    step_ins += 1
                    let level_char_dict: NSDictionary = level[chars_i] as! NSDictionary
                    if level_char_dict[self.delimit] == nil{
                        level = level_char_dict.mutableCopy() as! NSMutableDictionary
                    }else{
                        var ret_str = ""
                        for _ in 0...step_ins - 1 {
                            ret_str += replaceKey
                        }
                        retArray.add(ret_str)
                        start += (step_ins - 1)
                        break
                    }
                }else{
                    let word = getChar(message, startIndex: start, endIndex: start + 1)
                    retArray.add(word)
                    break
                }
            }
            start += 1
        }

        return retArray.componentsJoined(by: "")
    }
    
    func getChar(_ message: String, startIndex: NSInteger, endIndex: NSInteger) -> String {
        let index0 = message.index(message.startIndex, offsetBy: startIndex)
        let index1 = message.index(message.startIndex, offsetBy: endIndex)
        let word = message[index0..<index1]
        return String(word)
    }

}

Python代码

# -*- coding: utf-8 -*-
# @Author: zhangfujie
# @Date:   2019/10/22
# @Last Modified by:   zhangfujie
# @Last Modified time: 2019/10/22
# @ ---------- DFA过滤算 ---------- 
import time
time1 = time.time()

class DFAFilter(object):
    """DFA过滤算法"""
    def __init__(self):
        super(DFAFilter, self).__init__()
        self.keyword_chains = {}
        self.delimit = '\x00'

    # 读取解析敏感词
    def parseSensitiveWords(self, path):
        ropen = open(path,'r')
        text = ropen.read()
        keyWordList = text.split(',')
        for keyword in keyWordList:
            self.addSensitiveWords(str(keyword).strip())

    # 生成敏感词树
    def addSensitiveWords(self, keyword):
        keyword = keyword.lower()
        chars = keyword.strip()
        if not chars:
            return
        level = self.keyword_chains
        for i in range(len(chars)):
            if chars[i] in level:
                level = level[chars[i]]
            else:
                if not isinstance(level, dict):
                    break
                for j in range(i, len(chars)):
                    level[chars[j]] = {}

                    last_level, last_char = level, chars[j]

                    level = level[chars[j]]
                last_level[last_char] = {self.delimit: 0}
                break
            if i == len(chars) - 1:
                level[self.delimit] = 0

    # 过滤敏感词
    def filterSensitiveWords(self, message, repl="*"):
        message = message.lower()
        ret = []
        start = 0
        while start < len(message):
            level = self.keyword_chains
            step_ins = 0
            message_chars = message[start:]
            for char in message_chars:
                if char in level:
                    step_ins += 1
                    if self.delimit not in level[char]:
                        level = level[char]
                    else:
                        ret.append(repl * step_ins)
                        start += step_ins - 1
                        break
                else:
                    ret.append(message[start])
                    break
            start += 1

        return ''.join(ret)


if __name__ == "__main__":
    gfw = DFAFilter()
    gfw.parseSensitiveWords('shieldwords.txt')
    text = "小明骂小王是个王八蛋,小王骂小明是个王八羔子!"
    result = gfw.filterSensitiveWords(text)
    print(result)
    time2 = time.time()
    print('总共耗时:' + str(time2 - time1) + 's')    

结束语

demo下载地址: https://gitee.com/zfj1128/DFAFilterDemo 过往大佬喜欢的给个小星星吧!

欢迎各位大神提出宝贵的意见和建议,也欢迎大家进群交流365152048!

CSDN博客 https://zfj1128.blog.csdn.net
GITEE主页 https://gitee.com/zfj1128
GITHUB主页 https://github.com/zfjsyqk

Comments ( 0 )

Sign in for post a comment

Objective-C
1
https://gitee.com/zfj1128/DFAFilterDemo.git
git@gitee.com:zfj1128/DFAFilterDemo.git
zfj1128
DFAFilterDemo
DFAFilterDemo
master

Help Search