openEuler某发行版报Page cache invalidation failure on direct I/O的问题

**1.问题描述** 
1）在MySQL5.7升级到8.0的过程中，现场操作人员先将MySQL5.7的数据压缩打包，通过scp传输到MySQL8.0上，解压到指定目录后启动MySQL服务，启动时发生core。core信息如下所示：
```
2021-08-20T17:16:37.999786+08:00 0 [ERROR] [MY-013622] [InnoDB] [FATAL] fsync() returned EIO, aborting.

2021-08-20T17:16:37.999838+08:00 0 [ERROR] [MY-013183] [InnoDB] Assertion failure: ut0ut.cc:563 thread 281432561479920
......
09:16:37 UTC - mysqld got signal 6 ;
......
/data/app/mysql/bin/mysqld() [0x1fda9cc]
/data/app/mysql/bin/mysqld(os_file_flush_func(int)+0x1d4) [0x1ed2394]
/data/app/mysql/bin/mysqld(Fil_shard::space_flush(unsigned int)+0x324) [0x20fc3fc]
```
随后重新启动，可以正常启动。在MySQL后续的执行过程中，也会发生同样的core,底层堆栈与上面的堆栈一致。
2）LINUX中对系统级函数fsync的解释如下：
```
 fsync, fdatasync - synchronize a file's in-core state with storage device
 
 ERRORS：
      EIO    An error occurred during synchronization.  This error may relate to data written to some other file descriptor on the same file.  Since Linux
4.13, errors from write-back will be reported to all file descriptors that might have written  the  data  which  triggered  the  error.   Some filesystems (e.g., NFS) keep close track of which data came through which file descriptor, and give more precise reporting.  Other filesystems (e.g., most local filesystems) will report errors to all file descriptors that were open on the file when the error was recorded.
```
调用fsync()后，如果数据同步时发生错误将会返回EIO错误符号
3）现场配置的innodb_flush_method为O_DIRECT，使用这样的方式flush会先用directio打开文件，再调用fsync()刷入data和log文件
 **2.问题初步分析** 
1）分析MySQL打印上述堆栈的原因
在MySQL8.0版本中，启动和运行时如果遇到需要刷盘的操作（如flush等）将通过os_file_flush_func调用系统函数fsync()进行数据同步操作，当该操作失败时，文件系统返回"EIO"错误到上层，MySQL捕捉到后停止当前操作进入abort流程，关闭数据库。在MySQL5.7.23及之前的版本中，遇到磁盘返回EIO将sleep并重试，在此之后的版本中直接abort，MySQL中的报错代码如下所示：
```
case EIO:
	ib::fatal(ER_IB_MSG_1358) << "fsync() returned EIO, aborting.";
	break;
```
这样做的原因是MySQL将系统底层报告的EIO错误视为一个严重的错误，不再进行重试，并且重试可能给数据带来风险。（参考链接：MySQL Bugs: #90296: Hard error should be report when fsync() return EIO.）
2）分析系统级日志
IP为84.158.101.5的机器（简称05）第一次错误如下：
`2021-08-20T12:46:18.237074+08:00 3 [ERROR] [MY-013622] [InnoDB] [FATAL] fsync() returned EIO, aborting`
05的系统message在这个时间点对应的日志如下：
```
Aug 20 12:46:18 pwghcpaasnap87f kernel: [179741.758405] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
Aug 20 12:46:18 pwghcpaasnap87f kernel: [179741.758412] File: /data/mysql_data/data/#innodb_temp/temp_10.ibt PID: 1774380 Comm: kworker/7:0
```
可以看到在MySQL尝试调用fsync时kernel发出了警告：direct I/O faulure，说明此次的IO失败发生在系统函数fsync()调用底层文件系统资源写入IO时产生的，报错的原因可能是buffer io与direct io冲突。
分析后续几个core的日志，得到MySQL的err日志与Linux的message的对应关系与上面一致。
```
【05】MySQL： 2021-08-20T17:16:37.999786+08:00 0 [ERROR] [MY-013622] [InnoDB] [FATAL] fsync() returned EIO, aborting.
【05】message:  Aug 20 17:16:38 pwghcpaasnap87f kernel: [195961.584395] Page cache invalidation failure on direct I/O.  Possible data corruption due to 					          collision with buffered I/O!
					
【12】Mysql:    2021-08-20T17:15:28.330552+08:00 0 [ERROR] [MY-013622] [InnoDB] [FATAL] fsync() returned EIO, aborting.
【12】message:  Aug 20 17:15:28 pwghcpaasnap87e kernel: [270478.923232] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
```
根据以上信息查找Linux内核中的代码，报错部分如下所示：
```
void dio_warn_stale_pagecache(struct file *filp) {
......  
  errseq_set(&inode->i_mapping->wb_err, -EIO);
    if (__ratelimit(&_rs)) {
      path = file_path(filp, pathname, sizeof(pathname));
      if (IS_ERR(path))
        path = "(unknown)";
      pr_crit("Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!\n");
      pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid,
        current->comm);
```
当dio因为page cache invalidation执行失败时，调用dio_warn_stale_pagecache()函数进入报错流程，先设置err为EIO，再打印buffer冲突信息，这与现场的现象相一致。
 **3.问题复现** 
从上述代码中可以确定的是，Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!的错误信息一定会导致direct IO发生EIO的错误，利用这样的思路尝试复现IO冲突的现象。
使用FIO工具通过两个fio进程通过不同的IO类型对同一文件进行压测，fio循环压测脚本如下：
```
#!bin/sh
while :
do
        fio -filename=/home/test_disk/test  -direct=1 -buffered=0 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=16k -size=1G -numjobs=10 -runtime=600 -group_reporting -name=test_r_w
done
#!bin/sh
while :
do
        fio -filename=/home/test_disk/test  -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=16k -size=1G -numjobs=10 -runtime=600 -group_reporting -name=test_r_w
done
```
详细复现过程如下：

| 操作                                                                                                            | 结果                         |
|---------------------------------------------------------------------------------------------------------------|----------------------------|
|  **不重启** 直接umout磁盘，再重新mount ，rm -rf /home/test_disk/* 清除目录文件，执行direct io方式的压测脚本，再执行buffer io方式压测脚本                  | 未复现                        |
|  **不重启** 格式化磁盘mkfs.ext3 /dev/sda，mount磁盘到测试目录，rm -rf /home/test_disk/* 清除目录文件，执行direct io方式的压测脚本，再执行buffer io方式压测脚本 | 未复现                        |
|  **重启** ，umout磁盘，再重新mount ，rm -rf /home/test_disk/* 清除目录文件，执行direct io方式的压测脚本，再执行buffer io方式压测脚本                    |  **复现**                          |
|  **重启** ，格式化磁盘mkfs.ext3 /dev/sda，mount磁盘到测试目录，rm -rf /home/test_disk/* 清除目录文件，执行direct io方式的压测脚本，再执行buffer io方式压测脚本 |  **复现**                          |
|  **重启** ，rm -rf /home/test_disk/* 清除目录文件，仅启动多个direct io方式的脚本，观察message日志                                            | 未复现                        |
|  **重启** ，rm -rf /home/test_disk/* 清除目录文件，执行direct  io脚本3分钟，再执行buffer io脚本                                           | 三分钟内未发现错误， **启动buffer io后发现错误**  |
|  **重启** ，切换文件系统为xfs，重复上述步骤                                                                                          |  **复现**                          |
|  **重启** ，切换文件调度算法，重复上述步骤                                                                                            |  **mq-deadline bfq none 均复现**    |

从上表中发现，message中发现IO冲突错误都是在重启系统之后发现的，且单纯的direct io并不会发生IO冲突的问题。
   使用所内的x86的RHEL系统机器重复上述测试步骤，测试结果如下：

| 操作                                                                                          | 结果  |
|--------------------------------------------------------------------|-----|
|  **重启** ，rm -rf /home/test_disk/* 清除目录文件，执行direct io方式的压测脚本，再执行buffer io方式压测脚本|  未复现                          |
|  **重启** ，格式化磁盘，mount磁盘到测试目录，rm -rf /home/test_disk/* 清除目录文件，执行direct io方式的压测脚本，再执行buffer io方式压测脚本|  未复现                          |
|  **重启** ，rm -rf /home/test_disk/* 清除目录文件，执行direct io脚本等待3分钟，再执行buffer io脚本|  未复                          现                          |

总结上面两个表格得到复现步骤如下：
```
1. reboot 重启机器
2. rm -rf /home/test_disk/*   清除目录下的其他文件防止干扰
3. 先启动direct io方式的压测脚本，观察/var/log/message，未发现错误
4. 启动buffer io 方式的压测脚本，观察/var/log/message，发现错误日志
该方法必须要重启系统才能再次复现，每次启动后的第一次测试必然出现，若不重启从第三步开始再次测试则不立刻复现，需要等待一定的周期后才能再次在message中观察到错误日志（测试机器上是24小时为周期）。并且这样的冲突仅在ARM+KylinV10上出现，在X86+RHEL上不出现，长时间压测也不会出现。
```
 **4 . 分析结果** 
1. 发生IO冲突和文件系统没有直接的联系，目前在多个文件系统格式上都能复现冲突。
2. 发生IO冲突时一定会导致EIO问题，这个关联由Linux内核代码决定
3. 系统中多个不同方式的IO写一个文件会导致IO冲突的发生 ，多个相同的IO方式写一个文件不会发生冲突

另外，还有一些问题需要专家配合分析一下：
1. 系统中是否存在缓存对IO冲突进行处理，重启的过程也是一个初始化和清理缓存的过程，上述复现方式需要重启才能立刻复现
2. KylinV10系统中是否对文件系统调度有优化，相同的操作在RHEL7.4(内核版本Linux version 3.10.0-693.el7.x86_64 )无法复现相关冲突
3. Kylin内部是否有类似于定时器一样的错误处理操作，在测试过程中发现如果不重启系统，需要等待24小时后才能在message中发现冲突日志(期间脚本一直运行，观察4天均相隔24小时)。

GVP openEuler/kernel
关闭

内容风险标识

评论 (19)

GVPopenEuler/kernel关闭

内容风险标识