0%

使用block_dump观察block io

Linux提供了一种机制,可以用来dump出block io详情,即/proc/sys/vm/block_dump。

如何使用 (1)

1
2
3
4
5
6
7
8
9
10
11
12
13
# echo 1 > /proc/sys/vm/block_dump

read/write some files

# dmesg
...
[10489963.254731] kworker/u65:0(119773): WRITE block 3017556368 on sdm (136 sectors)
[10489963.254763] kworker/u65:0(119773): WRITE block 3017556360 on sdm (8 sectors)
[10489963.367301] prometheus(142509): dirtied inode 177997980 (00000050) on sdm
[10489963.367312] prometheus(142509): dirtied inode 177997980 (00000050) on sdm
...

# echo 0 > /proc/sys/vm/block_dump

原理 (2)

在当前的linux内核中(下面以kernel-3.19.8为例)有两个地方检察block_dump,若为true,则打印相应的block io信息。信息的格式如第1节所示。

block/blk-core.c

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
void submit_bio(int rw, struct bio *bio)
{
...

if (unlikely(block_dump)) {
char b[BDEVNAME_SIZE];
printk(KERN_DEBUG "%s(%d): %s block %Lu on %s (%u sectors)\n",
current->comm, task_pid_nr(current),
(rw & WRITE) ? "WRITE" : "READ",
(unsigned long long)bio->bi_iter.bi_sector,
bdevname(bio->bi_bdev, b),
count);
}

...
}

这段代码输入形如:

1
[10489963.254731] kworker/u65:0(119773): WRITE block 3017556368 on sdm (136 sectors)

[10489963.254731] kworker/u65:0current->comm,即struct task_struct的comm字段,注释上说”executable name excluding path”;
119773task_pid_nr(current),即线程id;
WRITE表明当前block io是写操作;
3017556368bio->bi_iter.bi_sector,这个值是struct biostruct bvec_iter bi_iter中的bi_sector,表示当前block io的第一个sector号;
sdmbdevname(bio->bi_bdev, b),设备名;
136是count,当前block io的sector数;

也就是说,当前block io是写sdm从3017556368号sector开始的136个sector;发起线程是kworker/u65:0(119773)。

fs/fs-writeback.c

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
void __mark_inode_dirty(struct inode *inode, int flags)
{
...

if (unlikely(block_dump))
block_dump___mark_inode_dirty(inode);

...
}

static noinline void block_dump___mark_inode_dirty(struct inode *inode)
{
if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
struct dentry *dentry;
const char *name = "?";

dentry = d_find_alias(inode);
if (dentry) {
spin_lock(&dentry->d_lock);
name = (const char *) dentry->d_name.name;
}
printk(KERN_DEBUG
"%s(%d): dirtied inode %lu (%s) on %s\n",
current->comm, task_pid_nr(current), inode->i_ino,
name, inode->i_sb->s_id);
if (dentry) {
spin_unlock(&dentry->d_lock);
dput(dentry);
}
}
}

这段代码输入形如:

1
[10489963.367301] prometheus(142509): dirtied inode 177997980 (00000050) on sdm

177997980inode->i_ino,即inode号;
00000050是文件名;
sdm是设备名;

rsyslog (3)

打开block_dump会产生很多KERN_DEBUG级别的日志。若系统的rsyslog配置的级别包含KERN_DEBUG,这些log将会写入/var/log/messages,可能影响系统性能。

1
2
# vim /etc/rsyslog.conf
*.info;*.debug;mail.none;authpriv.none;cron.none /var/log/messages

这时可以暂时停掉rsyslog:

1
systemctl stop rsyslog.service

小结 (4)

这个机制,使我们很容易的拿到generic block层的输入,以及writeback时的dirty inode。

写的不错,有赏!