
Linux read ahead

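An application can use posix_fadvise() to tell the kernel how it intends to access a file, so that the kernel can choose an I/O strategy for best performance. As the name suggests, this is only advice or an expectation; the kernel does not promise to follow it. The available modes are POSIX_FADV_NORMAL, POSIX_FADV_SEQUENTIAL, POSIX_FADV_RANDOM, POSIX_FADV_NOREUSE, POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED. This article looks at how they behave.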
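
For readers who have not used the call, here is a minimal sketch of how an application might pass such a hint. The helper name open_with_advice() is hypothetical and not part of the article's test setup; it only assumes a POSIX system that provides posix_fadvise() in <fcntl.h>.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes posix_fadvise() in <fcntl.h> */
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the article): open `path` read-only and
 * apply `advice` to the whole file. Returns the fd, or -1 on failure.
 */
int open_with_advice(const char *path, int advice)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* offset 0 with len 0 means "from the start to the end of the file".
     * posix_fadvise() returns 0 or an error number; it does not set errno. */
    if (posix_fadvise(fd, 0, 0, advice) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would then read from the returned descriptor as usual, for example after open_with_advice(path, POSIX_FADV_RANDOM). The hint applies to the open file description, which is why the kernel code quoted later stores it on the struct file.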

POSIX_FADV_RANDOM (1)

Start with a test:

bs (KiB)   read_ahead_kb (KiB)   max_sectors_kb (KiB)   avgrq-sz (KiB)
16         8                     4                      4
16         8                     64                     8
16         32                    64                     16

Notes:

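  • The test is driven by fio: a sequential read with fadvise_hint=random (to isolate the effect of POSIX_FADV_RANDOM); the file system is ext4; bs is as shown in the table;
  • read_ahead_kb: /sys/block/sda/queue/read_ahead_kb;
  • max_sectors_kb: /sys/block/sda/queue/max_sectors_kb;
  • avgrq-sz: the request size observed with iostat.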

The result is:

$$ AvgRqSz = min(bs,ReadAheadKB,MaxSectorsKB) $$
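
The relation can be checked against the three rows of the table with a few lines of C. This is only a sketch of the observed rule; the function name expected_avgrq_kib() is made up for illustration and nothing here is kernel code.

```c
/* Smallest of three sizes (all in KiB). */
static unsigned int min3(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a < b ? a : b;
    return m < c ? m : c;
}

/*
 * Hypothetical helper: request size (KiB) predicted by the observed rule
 * AvgRqSz = min(bs, ReadAheadKB, MaxSectorsKB) under POSIX_FADV_RANDOM.
 */
unsigned int expected_avgrq_kib(unsigned int bs_kib,
                                unsigned int read_ahead_kib,
                                unsigned int max_sectors_kib)
{
    return min3(bs_kib, read_ahead_kib, max_sectors_kib);
}
```

Feeding in the table rows (16, 8, 4), (16, 8, 64) and (16, 32, 64) gives 4, 8 and 16 KiB, matching the avgrq-sz column.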

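Some documentation says POSIX_FADV_RANDOM disables read ahead. But if read ahead were disabled, reads would be issued page by page (see do_generic_file_read() in mm/filemap.c, linux 3.19.8), i.e. avgrq-sz = 4 KiB (the page size is usually 4 KiB). That is not what the test shows. Why?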

patch-187024 gives the answer: older kernels did read page by page, and because that was inefficient the behavior was changed:

This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor
performance: a 16K read will be carried out in 4 _sync_ 1-page reads.

Now ra_pages=0 (read ahead fully disabled, page-by-page reads) is used only where multi-page reads do not help or should be avoided:

- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened

So POSIX_FADV_RANDOM no longer means "disable read ahead"; it means "disable the heuristic algorithm" and faithfully submit the application's read IO:

POSIX_FADV_RANDOM actually want a different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submit read IO for whatever application requests.

So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.

That is, the kernel submits requests of the size the application asks for (bs), still capped by read_ahead_kb and max_sectors_kb (the smallest of the three wins).

The diff of this patch:

linux 3.19.8 mm/fadvise.c:fadvise64_64:

     switch (advice) {
     case POSIX_FADV_NORMAL:
         file->f_ra.ra_pages = bdi->ra_pages;
+        spin_lock(&file->f_lock);
+        file->f_flags &= ~FMODE_RANDOM;
+        spin_unlock(&file->f_lock);
         break;
     case POSIX_FADV_RANDOM:
-        file->f_ra.ra_pages = 0;
+        spin_lock(&file->f_lock);
+        file->f_flags |= FMODE_RANDOM;
+        spin_unlock(&file->f_lock);
         break;
     case POSIX_FADV_SEQUENTIAL:
         file->f_ra.ra_pages = bdi->ra_pages * 2;
+        spin_lock(&file->f_lock);
+        file->f_flags &= ~FMODE_RANDOM;
+        spin_unlock(&file->f_lock);
         break;
     case POSIX_FADV_WILLNEED:

linux 3.19.8 mm/readahead.c:page_cache_sync_readahead():

     if (!ra->ra_pages)
         return;

+    /* be dumb */
+    if (filp->f_mode & FMODE_RANDOM) {
+        force_page_cache_readahead(mapping, filp, offset, req_size);
+        return;
+    }
+
     /* do read-ahead */
     ondemand_readahead(mapping, ra, filp, false, offset, req_size);

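In other words, POSIX_FADV_RANDOM no longer sets ra_pages=0; it sets FMODE_RANDOM instead (a flag newly introduced by this patch). page_cache_sync_readahead() therefore no longer returns immediately (an immediate return leads to page-by-page reads); it checks FMODE_RANDOM and performs a forced read ahead.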
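
The control flow of the two hunks can be mimicked in user space to make the three outcomes explicit. The following is a toy model only: struct toy_file, the TOY_ADV_* constants and both functions are invented for illustration and merely mirror the shape of the quoted kernel code.

```c
/* Which branch a synchronous read would take in the toy model. */
enum ra_path {
    RA_NONE,      /* ra_pages == 0: early return, page-by-page reads */
    RA_FORCED,    /* FMODE_RANDOM: read exactly what was requested   */
    RA_HEURISTIC  /* default: on-demand (heuristic) read ahead       */
};

/* Hypothetical stand-ins for the POSIX_FADV_* advice values. */
enum toy_advice { TOY_ADV_NORMAL, TOY_ADV_RANDOM, TOY_ADV_SEQUENTIAL };

/* Hypothetical stand-in for the fields of struct file touched by the patch. */
struct toy_file {
    unsigned long ra_pages;  /* read-ahead window, in pages      */
    int random;              /* models the FMODE_RANDOM flag bit */
};

/* Mirrors the patched switch in fadvise64_64(). */
void toy_fadvise(struct toy_file *f, enum toy_advice advice,
                 unsigned long bdi_ra_pages)
{
    switch (advice) {
    case TOY_ADV_NORMAL:
        f->ra_pages = bdi_ra_pages;
        f->random = 0;
        break;
    case TOY_ADV_RANDOM:
        f->random = 1;       /* ra_pages is left untouched */
        break;
    case TOY_ADV_SEQUENTIAL:
        f->ra_pages = bdi_ra_pages * 2;
        f->random = 0;
        break;
    }
}

/* Mirrors the branch order in page_cache_sync_readahead(). */
enum ra_path toy_sync_readahead_path(const struct toy_file *f)
{
    if (!f->ra_pages)
        return RA_NONE;
    if (f->random)
        return RA_FORCED;
    return RA_HEURISTIC;
}
```

With a window of 32 pages, TOY_ADV_RANDOM yields RA_FORCED, whereas the pre-patch behavior (ra_pages set to 0) would yield RA_NONE, which is the page-by-page case the patch set out to remove.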

POSIX_FADV_SEQUENTIAL (2)
