# simtrie **Repository Path**: MathMiner/simtrie ## Basic Information - **Project Name**: simtrie - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2024-01-11 - **Last Updated**: 2024-01-11 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # simtrie `simtrie` is a library for fast, highly configurable approximate string similarity searches. Here's a simple example: ``` import timeit from nltk.corpus import wordnet as wn lemmas = list(set(i for i in wn.words())) print('creating set of %d words' % len(lemmas)) >> creating set of 147306 words s = simtrie.Set(lemmas) def search(): return list(s.similar("bookish", 2)) print(timeit.timeit(stmt=search, number=1)) >> 0.00041486499958409695 print(search()) >> [('blockish', 2.0), ('bookie', 2.0), ('booking', 2.0), ('bookish', 0.0), ('boorish', 1.0), ('boxfish', 2.0), ('boyish', 2.0), ('foolish', 2.0), ('goodish', 2.0), ('monkish', 2.0), ('moorish', 2.0)] ``` `simtrie` allows you to fine-tune searches using custom weighted metrics: ``` metric = simtrie.Metric( (('c', None), 1.9), # deletion cost (('ab', 'ba'), 1.5) # transpose cost ) s.similar("bookish", 2, metric, allow_transpose=True) ``` Some of simtrie's features: * Stores string sets and dicts in ram using a prefix tree * Fast, configurable similarity searches over large sets * Pythonic API similar to regular `set` and `dict` * Supports transpose, split and union weights Note: binary data files are not portable between machine architectures (they are either little or big endian). # Credits `simtrie` is a fork of https://github.com/pytries/DAWG. Its internal data structure is a very clever C++ implementation of a DAFSA by Susumu Yata. Various test cases and ideas were taken from the super clean implementation at https://github.com/infoscout/weighted-levenshtein/. # Similar Projects * https://github.com/wolfgarbe/SymSpell # License Python code is licensed under the MIT License. Bundled `dawgdic`_ C++ library and C++ extensions for simtrie are licensed under the BSD license.