狮子的魂 (lionsoul) / jcseg

iphone4s is segmented as iphone4s / iphone

Completed
Created on 2014-03-18 16:51

Expected segmentation: iphone4s, or iphone / 4s.

I have a question: how is this segmentation result produced?

Part of my configuration is as follows:

# Jcseg function
#maximum match length. (5-7)
jcseg.maxlen=5

#recognize chinese names. (1 to enable it and 0 to disable it)
jcseg.icnname=1

#maximum chinese word number of english chinese mixed word.
jcseg.mixcnlen=2

#maximum length for pair punctuation text.
jcseg.pptmaxlen=15

#maximum length for chinese last name adron.
jcseg.cnmaxlnadron=1

#Whether to clear the stopwords. (set 1 to clear stopwords and 0 to keep them)
jcseg.clearstopword=0

#Whether to convert chinese numerics to arabic numbers. (set 1 to enable it and 0 to disable it)
# like '\u4E09\u4E07' to 30000.
jcseg.cnnumtoarabic=0

#Whether to convert chinese fractions to arabic fractions.
jcseg.cnfratoarabic=0

#Whether to keep unrecognized words. (set 1 to keep unrecognized words and 0 to clear them)
jcseg.keepunregword=1

#Whether to start the secondary segmentation for complex english words.
jcseg.ensencondseg = 1

#min length of the secondary simple token. (better larger than 1)
jcseg.stokenminlen = 2

#threshold for chinese name recognition.
# better not change it before you know what you are doing.
jcseg.nsthreshold=1000000

#The punctuation marks that will be kept inside a token. (Not at the end of the token).
jcseg.keeppunctuations=@%.&+

Comments (5)

@CH Set jcseg.stokenminlen=1 and test again: for iphone4s you will get iphone/ 4/ s/ iphone4s

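jcseg has always assumed that a combination of digits and letters carries a specific meaning, so it does not split it. That hurts the retrieval hit rate, however, so a secondary segmentation is performed as well, which yields: iphone4s/ iphone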

If you want the result iphone4s/ iphone/ 4s, the only way at the moment is to add 4s as a synonym of iphone in lex-en.lex.

A later version will consider implementing the feature you describe.
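The effect of jcseg.stokenminlen on the secondary segmentation can be illustrated with a small stand-alone sketch. This is not Jcseg's actual code: the class and method names are hypothetical, and the sketch simply assumes that a mixed token is cut wherever it switches between digits and non-digits, that parts shorter than the minimum length are dropped, and that the whole token is always kept. The order of the output tokens is not meant to match Jcseg's.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Jcseg.
final class SecondarySplitSketch {

    // Split a mixed token into runs of digits and runs of non-digits,
    // keeping only the runs that are at least minLen characters long.
    static List<String> secondarySplit(String token, int minLen) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= token.length(); i++) {
            // A run ends at the end of the token or where digit/non-digit flips.
            boolean boundary = i == token.length()
                    || Character.isDigit(token.charAt(i)) != Character.isDigit(token.charAt(i - 1));
            if (boundary) {
                if (i - start >= minLen) {
                    parts.add(token.substring(start, i));
                }
                start = i;
            }
        }
        return parts;
    }

    // The whole token is kept; the surviving sub-tokens are appended after it.
    static List<String> segment(String token, int minLen) {
        List<String> out = new ArrayList<>();
        out.add(token);
        out.addAll(secondarySplit(token, minLen));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segment("iphone4s", 2)); // [iphone4s, iphone]
        System.out.println(segment("iphone4s", 1)); // [iphone4s, iphone, 4, s]
    }
}
```

Under these assumptions, a minimum length of 2 discards the one-character parts "4" and "s", which is consistent with the iphone4s/ iphone result reported in this issue, while a minimum length of 1 keeps them.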

OK, I tested it. It is indeed caused by the secondary segmentation.

I plan to go through the source code soon.

Thanks again for sharing the code.

@CH If you do not need the secondary segmentation, you can turn it off by setting jcseg.ensencondseg = 0 in jcseg.properties.

Sure. You are also welcome to join the development and contribute.
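For anyone checking their own jcseg.properties, the following sketch reads the switch with the standard java.util.Properties parser. It only shows how a line such as "jcseg.ensencondseg = 0" is parsed (spaces around "=" are ignored); it does not use or reproduce Jcseg's own configuration loader, and the class and method names are hypothetical. Treating a missing key as "off" is an assumption of this sketch, not documented Jcseg behaviour.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration only; this is not how Jcseg loads its config.
final class SecondarySegFlagSketch {

    // Returns true when the properties text sets jcseg.ensencondseg to 1.
    // A missing key is treated as "off" here (an assumption of this sketch).
    static boolean secondarySegEnabled(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("jcseg.ensencondseg");
        return value != null && "1".equals(value.trim());
    }

    public static void main(String[] args) {
        String config = "#secondary segmentation switch\n"
                + "jcseg.ensencondseg = 0\n"
                + "jcseg.stokenminlen = 2\n";
        System.out.println(secondarySegEnabled(config)); // false
    }
}
```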

Status changed to Closed
