# phrase-analysis-py

**Repository Path**: gxlfqy/phrase-analysis-py

## Basic Information

- **Project Name**: phrase-analysis-py
- **Description**: Lexical analysis in Python, with quickly customizable rules and JSON-based configuration
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2019-06-02
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# phrase-analysis-py

## Introduction

- A general-purpose lexical analyzer
- Performs lexical analysis in Python
- Rules can be customized quickly
- Configured with JSON

## Purpose

**A simple configuration is enough to perform lexical analysis of other languages.**

## IDE

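This project was written in `PyCharm`.

It was developed on `Linux`, so on `Windows` the line endings of the `config.json` files may not display correctly. This does not affect how the program runs.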

## Features

1. Lexical analysis for multiple languages
2. Locating lexical errors

## Lexical configuration

Each subdirectory of the `config` directory contains a `config.json` file that holds one lexical configuration.

Example:

```json
{
    "commit": "(?<!:)\\/\\/.*|\\/\\*(\\s|.)*?\\*\/",
    "phrase": [
        {
            "word": "$",
            "type": 0,
            "code": "FINISH",
            "description": "end marker"
        },
        {
            "word": "-?\\d+",
            "regex": true,
            "type": 11,
            "code": "INT",
            "description": "integer"
        }
    ]
}
```

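1. The `commit` property holds the regular expression that matches comments.
2. The `phrase` property holds the lexical rules.
   - The `word` property is required (type: string).
   - Literal text matching
     - `word` contains the text to recognize
     - `regex` is `false` or omitted
   - Regular expression matching
     - `word` contains the regular expression to recognize
     - `regex` is `true`
3. Regular expressions must not contain `^` or `$`.
4. Rules are tried in priority order: earlier entries take precedence over later ones, so in some cases the order must not be swapped.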
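The constraints listed above can be checked before a configuration is used. The following is a minimal sketch, not part of the project: `check_config` is a hypothetical helper, the `ID` rule is invented purely to trigger a warning, and the `^`/`$` test is deliberately crude (it also flags escaped occurrences).

```python
import json
import re


def check_config(cnf):
    """Return a list of problems found in a parsed lexical configuration."""
    problems = []
    for i, rule in enumerate(cnf.get("phrase", [])):
        word = rule.get("word")
        # `word` is required and must be a string
        if not isinstance(word, str):
            problems.append(f"rule {i}: word must be a string")
            continue
        if rule.get("regex") is True:
            # Crude check: flags any '^' or '$', even escaped ones
            if "^" in word or "$" in word:
                problems.append(f"rule {i}: regex must not contain ^ or $")
            try:
                re.compile(word)
            except re.error as exc:
                problems.append(f"rule {i}: invalid regex ({exc})")
    return problems


# The third rule is a made-up example that violates the "no ^ or $" constraint
config_text = r'''
{
    "phrase": [
        {"word": "$", "type": 0, "code": "FINISH"},
        {"word": "-?\\d+", "regex": true, "type": 11, "code": "INT"},
        {"word": "^\\w+$", "regex": true, "type": 10, "code": "ID"}
    ]
}
'''
print(check_config(json.loads(config_text)))
```

The literal `$` in the first rule is accepted because the restriction applies only to rules with `regex` set to `true`.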

```json
{ 
    "word": ">=",
    "type": 111,
    "code": "GE"
},
{ 
    "word": ">",
    "type": 110,
    "code": "GT"
},
{ 
    "word": "=",
    "type": 112,
    "code": "EQ"
}
```

Mind the priority: `>=` must be recognized before `>`. Otherwise `>=` is split into `>` followed by `=`.
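To see why the order matters, here is a minimal sketch of first-match-wins tokenizing with literal rules only. The `tokenize` helper is hypothetical and much simpler than the project's `create_token`; only the three rules are taken from the fragment above.

```python
def tokenize(text, rules):
    """Return the codes of the tokens in text, trying rules in list order."""
    codes = []
    while text:
        for rule in rules:
            if text.startswith(rule["word"]):
                codes.append(rule["code"])
                text = text[len(rule["word"]):]
                break
        else:
            # No rule matched the start of the remaining text
            raise ValueError(f"no rule matches {text!r}")
    return codes


GE = {"word": ">=", "type": 111, "code": "GE"}
GT = {"word": ">", "type": 110, "code": "GT"}
EQ = {"word": "=", "type": 112, "code": "EQ"}

print(tokenize(">=", [GE, GT, EQ]))  # ['GE']
print(tokenize(">=", [GT, GE, EQ]))  # ['GT', 'EQ']
```

With `>` listed first, the longer operator is never reached, which is why longer literals should come before their prefixes.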

## How to run

1. Open the project in `PyCharm`
2. Run `src/main.py`

## Core code

The core lives in the files `create_token.py` and `syntax_check.py`.

Part of `create_token.py`:

```python
import re

import numpy as np


def create_token(full_text, cnf):
    """
    Build the token table.
    :param full_text: source code to analyze
    :param cnf: lexical configuration (the parsed config.json)
    :return: token table, or None if there is nothing to analyze
    """
    # Validate the arguments
    if not full_text or not cnf:
        return None

    # Validity mask: a bool array as long as full_text, where True marks a
    # character that survives stripping. Together with end_index it maps a
    # lexical error back to its position in the original source.
    full_text_bool = np.ones(len(full_text), dtype=bool)

    # Mark every comment as invalid
    commit = cnf.get('commit')
    if isinstance(commit, str):
        for match in re.finditer(commit, full_text):
            start, end = match.span()
            full_text_bool[start:end] = False

    # Mark all whitespace (' ', '\t', '\r', '\n', ...) as invalid
    for match in re.finditer(r'\s+', full_text):
        start, end = match.span()
        full_text_bool[start:end] = False

    # Keep full_text intact; strip comments and whitespace from a working copy
    text = full_text
    if isinstance(commit, str):
        text = re.sub(commit, '', text)
    text = re.sub(r'\s+', '', text)

    # Nothing left after stripping
    if text == '':
        return None

    end_index = 0       # how many characters of the stripped text are consumed
    token_blk = []      # generated token table
    phrase_rules = cnf.get('phrase', [])
    if not isinstance(phrase_rules, list) or len(phrase_rules) == 0:
        return None

    # Drop malformed rules up front; removing items from a list while
    # iterating over it would skip the rule that follows each removal.
    phrase_rules = [r for r in phrase_rules if isinstance(r.get('word'), str)]

    while text:
        for rule in phrase_rules:
            word = rule['word']
            if rule.get('regex') is True:
                # Does the text start with a match for this regex?
                ret = re.match(word, text)
                if ret is None:
                    continue
                # The matched string becomes the recognized word
                word = ret.group()
            elif not text.startswith(word):
                continue

            # An empty match consumes nothing and would loop forever
            if not word:
                continue

            # Advance past the recognized word
            text = text[len(word):]
            end_index += len(word)

            # Build the token entry, copying the attributes the rule defines
            tmp = {'word': word}
            for attr in ['type', 'code', 'description']:
                ret = rule.get(attr)
                if ret is not None:
                    tmp[attr] = ret

            token_blk.append(tmp)
            break
        else:
            # No rule matched: raise a lexical error that carries the position
            # in the original source (creat_PhraseError is not shown here)
            raise creat_PhraseError(full_text, full_text_bool, end_index)

    return token_blk
```
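The excerpt does not show how the validity mask and `end_index` are turned into a source position. One plausible approach, sketched below under that assumption, is to look up the `end_index`-th surviving character in the original text and count newlines before it. `locate_error` is a hypothetical name, not the project's `creat_PhraseError`, and the 1-based line and column numbering is a choice made for this sketch.

```python
def locate_error(full_text, valid_mask, end_index):
    """Map an offset in the stripped text back to (line, column) in full_text."""
    # Indices in full_text of the characters that survived stripping
    valid_positions = [i for i, ok in enumerate(valid_mask) if ok]
    pos = valid_positions[end_index]
    # Lines and columns are 1-based
    line = full_text.count("\n", 0, pos) + 1
    column = pos - (full_text.rfind("\n", 0, pos) + 1) + 1
    return line, column


source = "A := 1;\nB := 2；\n"
# Stand-in for the mask built by create_token (no comments in this sample)
mask = [not ch.isspace() for ch in source]
# Stripped text is "A:=1;B:=2；"; the full-width semicolon sits at offset 9
print(locate_error(source, mask, 9))  # (2, 7)
```

The mask is a plain list here to keep the sketch dependency-free; a NumPy bool array can be passed in the same way.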

## Test data and results

### Normal run

Contents of the parsed file:

![1561878196654](assets/1561878196654.png)

The resulting `Token` table:

```python
['BEGIN', 'ID', 'COL_EQ', 'INT', 'SEMI', 'ID', 'COL_EQ', 'INT', 'SEMI', 'IF', 'LP', 'ID', 'GT', 'ID', 'RP', 'THEN', 'ID', 'COL_EQ', 'ID', 'ADD', 'ID', 'SEMI', 'ELSE', 'ID', 'COL_EQ', 'INT', 'MUL', 'ID', 'SEMI', 'ID', 'COL_EQ', 'ID', 'ADD', 'INT', 'SEMI', 'IF', 'LP', 'INT', 'LT', 'ID', 'RP', 'THEN', 'IF', 'LP', 'ID', 'NE', 'ID', 'RP', 'THEN', 'ID', 'COL_EQ', 'ID', 'ADD', 'ID', 'MUL', 'INT', 'SEMI', 'ELSE', 'ID', 'COL_EQ', 'INT', 'SEMI', 'ELSE', 'BEGIN', 'ID', 'COL_EQ', 'INT', 'MUL', 'INT', 'MUL', 'INT', 'SEMI', 'ID', 'COL_EQ', 'INT', 'SEMI', 'END', 'END', 'FINISH']
```

Syntax analysis output:

```
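Recognized a code block
Recognized an assignment statement
Recognized an assignment statement
Recognized an if structure
Recognized an expression
Recognized an assignment statement
Recognized an expression
Recognized an assignment statement
Recognized an expression
Recognized an assignment statement
Recognized an expression
Recognized an if structure
Recognized an expression
Recognized an if structure
Recognized an expression
Recognized an assignment statement
Recognized an expression
Recognized an expression
Recognized an assignment statement
Recognized a code block
Recognized an assignment statement
Recognized an expression
Recognized an expression
Recognized an assignment statement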
```

### Error analysis

#### Lexical error

Change a semicolon from half-width (`;`) to full-width (`；`).

Error message:

```
Wrong in line: 2, column: 13

    A := 200；          /*Program Passed1;Var A,B,B3Y,C12A1,DD : integer;*/
```

#### Syntax error

Delete the `THEN` that follows an `IF`.

Syntax analysis output:

```
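Recognized a code block
Recognized an assignment statement
Recognized an assignment statement
Recognized an if structure
Recognized an expression
Recognized an assignment statement
Recognized an expression
Recognized an assignment statement
Recognized an expression
Recognized an assignment statement
Recognized an expression
Recognized an if structure
Recognized an expression
if structure error: missing THEN keyword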
```
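`syntax_check.py` is not reproduced in this README, so the following is only an illustrative sketch of how a missing `THEN` could be detected from the token codes. It assumes the `IF LP ... RP THEN` shape seen in the token table above; `check_if_then` is a hypothetical helper, not the project's parser.

```python
def check_if_then(codes):
    """Return the index of the first IF lacking THEN, or None if all are fine."""
    for i, code in enumerate(codes):
        if code != "IF":
            continue
        depth = 0
        for j in range(i + 1, len(codes)):
            if codes[j] == "LP":
                depth += 1
            elif codes[j] == "RP":
                depth -= 1
            if depth == 0:
                # Either the RP closing the condition, or a token other
                # than LP directly after IF
                break
        else:
            return i  # the condition never closes
        if codes[j] != "RP" or j + 1 >= len(codes) or codes[j + 1] != "THEN":
            return i
    return None


ok = ["BEGIN", "IF", "LP", "ID", "GT", "ID", "RP", "THEN",
      "ID", "COL_EQ", "INT", "SEMI", "END"]
bad = [c for c in ok if c != "THEN"]  # same program with THEN deleted

print(check_if_then(ok))   # None
print(check_if_then(bad))  # 1
```

A full parser would report the error while descending through the grammar, as the output above suggests; this scan only isolates the one rule being violated.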