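The disks are attached to the system through a RAID controller (RAID driver: megaraid). After the read-ahead parameter (read_ahead_kb) of block device sdb is modified, writing data to the block device with dd causes the read-ahead parameter to change again.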
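To observe the symptom programmatically, the attribute can be read before and after the dd write. The following is a minimal sketch, assuming the attribute lives at the usual sysfs location /sys/block/sdb/queue/read_ahead_kb; the helper names read_ra_kb and write_ra_kb are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Reads a decimal value (in KB) from a sysfs-style attribute file.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 * Hypothetical helper, not part of any existing tool.
 */
static long read_ra_kb(const char *attr_path)
{
    assert(attr_path != NULL);

    FILE *f = fopen(attr_path, "r");
    if (!f)
        return -1;

    long kb = -1;
    if (fscanf(f, "%ld", &kb) != 1)
        kb = -1;
    fclose(f);
    return kb;
}

/*
 * Writes a decimal value (in KB) to a sysfs-style attribute file.
 * Returns 0 on success and -1 on failure.
 */
static int write_ra_kb(const char *attr_path, long kb)
{
    assert(attr_path != NULL);

    FILE *f = fopen(attr_path, "w");
    if (!f)
        return -1;

    int ok = fprintf(f, "%ld\n", kb) > 0;
    if (fclose(f) != 0)   /* sysfs reports store errors at flush/close time */
        ok = 0;
    return ok ? 0 : -1;
}
```

A typical check would call write_ra_kb("/sys/block/sdb/queue/read_ahead_kb", 4096) as root, run the dd write, and then compare the result of read_ra_kb() on the same path with 4096.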
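Problem summary
The read-ahead parameter changes because the systemd-udevd service places an inotify watch on sd block devices. As soon as a write occurs, the on_inotify handler is triggered and re-reads the partition information, and re-reading the partition information resets the read-ahead parameter (the reset happens in the sd_revalidate_disk function). Changing the rule by which sd_revalidate_disk modifies the read-ahead parameter fixes the problem.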
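Problem analysis
Customer feedback: UOS 1032 behaves normally, while UOS 1040 and Kylin 25.2 show the problem, although all three use the same RAID driver version.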
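Given the customer feedback, and because the read-ahead parameter has no obvious connection to the RAID driver, the RAID driver is ruled out. The verification approach is to first determine who modifies the read-ahead parameter: a kernel module or an upper-layer application.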
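Upper-layer applications: after manually setting the read-ahead parameter, write permission on the attribute was removed with chmod 444. The parameter still changed, which rules out an upper-layer application actively writing it.
Kernel modules: locate where the read-ahead parameter read_ahead_kb is handled in the kernel source.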
static struct queue_sysfs_entry queue_ra_entry = {
    .attr = {.name = "read_ahead_kb", .mode = 0644 },
    .show = queue_ra_show,
    .store = queue_ra_store,
};

static ssize_t
queue_ra_store(struct request_queue *q, const char *page, size_t count)
{
    unsigned long ra_kb;
    ssize_t ret = queue_var_store(&ra_kb, page, count);

    if (ret < 0)
        return ret;

    q->backing_dev_info->ra_pages = ra_kb >> (PAGE_SHIFT - 10);

    return ret;
}
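The read-ahead parameter is stored in q->backing_dev_info->ra_pages. Debug output was added at the places that write this field in order to find out who changes it. The debug output shows that the sd_revalidate_disk function modifies the read-ahead parameter.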
static int sd_revalidate_disk(struct gendisk *disk)
{
    ... ...
    q->backing_dev_info->ra_pages = max_t(unsigned long, VM_MAX_READAHEAD,
                                          ra_kb) * 1024 / PAGE_SIZE;
    set_capacity(disk, logical_to_sectors(sdp, sdkp->capacity));
    sd_config_write_same(sdkp);
    kfree(buffer);

 out:
    return 0;
}
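The two assignments to ra_pages use different arithmetic, which is why a value written through sysfs does not survive a revalidation. The sketch below reproduces both calculations in user space. It assumes VM_MAX_READAHEAD is 128 (KB), the historical mainline value; the value used by this vendor kernel is not shown in the article. The function names are hypothetical.

```c
#include <assert.h>

/* Assumption: VM_MAX_READAHEAD expressed in KB; not confirmed for this kernel. */
#define VM_MAX_READAHEAD_KB 128UL

/* Same arithmetic as queue_ra_store(): ra_kb >> (PAGE_SHIFT - 10). */
static unsigned long ra_kb_to_pages(unsigned long ra_kb, unsigned int page_shift)
{
    assert(page_shift >= 10);   /* a page must hold at least 1 KB */
    return ra_kb >> (page_shift - 10);
}

/*
 * Same arithmetic as the assignment in sd_revalidate_disk():
 * max_t(unsigned long, VM_MAX_READAHEAD, ra_kb) * 1024 / PAGE_SIZE.
 * The result depends only on the driver's ra_kb, never on the value
 * the user stored earlier.
 */
static unsigned long revalidate_ra_pages(unsigned long ra_kb, unsigned int page_shift)
{
    assert(page_shift >= 10);
    unsigned long kb = ra_kb > VM_MAX_READAHEAD_KB ? ra_kb : VM_MAX_READAHEAD_KB;
    return kb * 1024 / (1UL << page_shift);
}
```

With 4 KB pages (page_shift = 12), writing 4096 to read_ahead_kb stores 1024 pages, whereas a later revalidation with a small ra_kb stores only 32 pages, which reads back as 128 KB.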
The stack trace shows that sd_revalidate_disk is reached through a system call (el0_svc is the entry point for system calls issued from user space).
Dec 8 13:13:32 localhost kernel: [ 352.426379] CPU: 79 PID: 1739 Comm: systemd-udevd Not tainted 4.19.90-25.2.v2101.gfb012.ky10.aarch64 #1
Dec 8 13:13:32 localhost kernel: [ 352.436164] Hardware name: Unisyue Technologies Co., Ltd. UNIS Server R3810 G5/RS41M2C9S, BIOS KL4.1.60 12/02/2021
Dec 8 13:13:32 localhost kernel: [ 352.446896] Call trace:
Dec 8 13:13:32 localhost kernel: [ 352.449777] dump_backtrace+0x0/0x170
Dec 8 13:13:32 localhost kernel: [ 352.453856] show_stack+0x24/0x30
Dec 8 13:13:32 localhost kernel: [ 352.457598] dump_stack+0xa4/0xe8
Dec 8 13:13:32 localhost kernel: [ 352.461338] sd_revalidate_disk+0x3a4/0x1300
Dec 8 13:13:32 localhost kernel: [ 352.466026] rescan_partitions+0xac/0x3b8
Dec 8 13:13:32 localhost kernel: [ 352.470451] __blkdev_reread_part+0x60/0x88
Dec 8 13:13:32 localhost kernel: [ 352.475048] blkdev_reread_part+0x2c/0x48
Dec 8 13:13:32 localhost kernel: [ 352.479472] blkdev_ioctl+0x498/0xb88
Dec 8 13:13:32 localhost kernel: [ 352.483556] block_ioctl+0x50/0x68
Dec 8 13:13:32 localhost kernel: [ 352.487376] do_vfs_ioctl+0xb0/0x898
Dec 8 13:13:32 localhost kernel: [ 352.491366] ksys_ioctl+0x8c/0xa0
Dec 8 13:13:32 localhost kernel: [ 352.495099] __arm64_sys_ioctl+0x28/0x98
Dec 8 13:13:32 localhost kernel: [ 352.499439] el0_svc_common+0x78/0x130
Dec 8 13:13:32 localhost kernel: [ 352.503603] el0_svc_handler+0x38/0x78
Dec 8 13:13:32 localhost kernel: [ 352.507768] el0_svc+0x8/0x1b0
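The Comm field shows that the calling process is systemd-udevd. Reading the systemd-udevd source shows that it places an inotify watch on sd block devices. When a write happens (more precisely, when a device node that was opened for writing is closed, which raises IN_CLOSE_WRITE), the on_inotify handler is triggered and re-reads the partition information.
BLKRRPART: re-read the partition table.
The call chain of on_inotify is: on_inotify --> synthesize_change --> ioctl(fd, BLKRRPART, 0)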
udevd.c:

static int synthesize_change(sd_device *dev) {
    ... ...
    if (streq_ptr("block", subsystem) &&
        streq_ptr("disk", devtype) &&
        !startswith(sysname, "dm-")) {
        _cleanup_(sd_device_enumerator_unrefp) sd_device_enumerator *e = NULL;
        bool part_table_read = false, has_partitions = false;
        sd_device *d;
        int fd;

        fd = open(devname, O_RDONLY|O_CLOEXEC|O_NOFOLLOW|O_NONBLOCK);
        if (fd >= 0) {
            r = flock(fd, LOCK_EX|LOCK_NB);
            if (r >= 0)
                r = ioctl(fd, BLKRRPART, 0);

            close(fd);
            if (r >= 0)
                part_table_read = true;
        }
    ... ...
}

static int on_inotify(sd_event_source *s, int fd, uint32_t revents, void *userdata) {
    ... ...
    FOREACH_INOTIFY_EVENT(e, buffer, l) {
        _cleanup_(sd_device_unrefp) sd_device *dev = NULL;
        const char *devnode;

        if (udev_watch_lookup(e->wd, &dev) <= 0)
            continue;

        if (sd_device_get_devname(dev, &devnode) < 0)
            continue;

        log_device_debug(dev, "Inotify event: %x for %s", e->mask, devnode);
        if (e->mask & IN_CLOSE_WRITE)
            synthesize_change(dev);
        else if (e->mask & IN_IGNORED)
            udev_watch_end(dev);
    }

    return 1;
}

static int main_loop(Manager *manager) {
    ... ...
    r = sd_event_add_io(manager->event, &manager->inotify_event, manager->fd_inotify, EPOLLIN, on_inotify, manager);
    if (r < 0)
        return log_error_errno(r, "Failed to create inotify event source: %m");
    ... ...
}
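Tracing the commit history of sd_revalidate_disk further shows that the ra_pages assignment was added in commit 64cf457219acf8e3524530af064784f5677682fe. Its purpose is to use the OPTIMAL TRANSFER LENGTH from the disk's VPD information to adjust read_ahead_kb.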
Patch that introduced the problem
The code in sd_revalidate_disk that modifies the read-ahead parameter comes from commit 64cf457219acf8e3524530af064784f5677682fe:
From 64cf457219acf8e3524530af064784f5677682fe Mon Sep 17 00:00:00 2001
From: huhai
Date: Tue, 3 Mar 2020 16:17:42 +0800
Subject: [PATCH] KYLIN: block/sd: incrase read_ahead_kb for FC-SAN
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Use the OPTIMAL TRANSFER LENGTH from the disk's VPD information to adjust read_ahead_kb

Signed-off-by: huhai
Signed-off-by: Jackie Liu
---
 drivers/scsi/sd.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index f9d02f638c43..88958c7e5330 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3130,7 +3130,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	struct request_queue *q = sdkp->disk->queue;
 	sector_t old_capacity = sdkp->capacity;
 	unsigned char *buffer;
-	unsigned int dev_max, rw_max;
+	unsigned int dev_max, rw_max, ra_kb;
 
 	SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp,
 				      "sd_revalidate_disk\n"));
@@ -3199,9 +3199,12 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
 		q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks);
 		rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks);
-	} else
+		ra_kb = sdkp->opt_xfer_blocks;
+	} else {
 		rw_max = min_not_zero(logical_to_sectors(sdp, dev_max),
 				      (sector_t)BLK_DEF_MAX_SECTORS);
+		ra_kb = VM_MAX_READAHEAD;
+	}
 
 	/* Do not exceed controller limit */
 	rw_max = min(rw_max, queue_max_hw_sectors(q));
@@ -3217,6 +3220,8 @@ static int sd_revalidate_disk(struct gendisk *disk)
 
 	sdkp->first_scan = 0;
 
+	q->backing_dev_info->ra_pages = max_t(unsigned long, VM_MAX_READAHEAD,
+			ra_kb) * 1024 / PAGE_SIZE;
 	set_capacity(disk, logical_to_sectors(sdp, sdkp->capacity));
 	sd_config_write_same(sdkp);
 	kfree(buffer);
-- 
2.23.0
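Solution
Modify the kernel code so that the optimal read-ahead value is configured only once, on the first scan during disk initialization. Later re-reads of the partition information no longer modify the read-ahead parameter.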
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7914f304255d..8e9d9d3065df 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3235,10 +3235,11 @@ static int sd_revalidate_disk(struct gendisk *disk)
 			q->limits.max_sectors > q->limits.max_hw_sectors)
 		q->limits.max_sectors = rw_max;
 
-	sdkp->first_scan = 0;
+	if (sdkp->first_scan)
+		q->backing_dev_info->ra_pages = max_t(unsigned long, VM_MAX_READAHEAD,
+				ra_kb) * 1024 / PAGE_SIZE;
 
-	q->backing_dev_info->ra_pages = max_t(unsigned long, VM_MAX_READAHEAD,
-			ra_kb) * 1024 / PAGE_SIZE;
+	sdkp->first_scan = 0;
 	set_capacity(disk, logical_to_sectors(sdp, sdkp->capacity));
 	sd_config_write_same(sdkp);
 	kfree(buffer);
-- 
2.23.0