diff --git a/_posts/2018-08-25-13-16-05-lwn-23732-object-based-reverse-mapping-vm.md b/_posts/2018-08-25-13-16-05-lwn-23732-object-based-reverse-mapping-vm.md index 65a07ea8a79504c214493b557313143fb886f3bf..af32647a4f52125259c101a1ba5d3387f301c2ad 100644 --- a/_posts/2018-08-25-13-16-05-lwn-23732-object-based-reverse-mapping-vm.md +++ b/_posts/2018-08-25-13-16-05-lwn-23732-object-based-reverse-mapping-vm.md @@ -17,7 +17,7 @@ tags: > 原文:[The object-based reverse-mapping VM](https://lwn.net/Articles/23732/) > 原创:By corbet @ Feb. 25, 2003 -> 翻译:By [unicornx](https://github.com/unicornx) of [TinyLab.org][1] +> 翻译:By [unicornx](https://github.com/unicornx) > 校对:By [Wen Yang](https://github.com/w-simon) > The reverse-mapping VM (RMAP) was merged into 2.5 to solve a specific problem: there was no easy way for the kernel to find out which page tables referred to a given physical page. Certain activities - swapping being at the top of the list - require making changes to all relevant page tables. You simply can not swap a page to disk until all of the page table entries pointing to it have been invalidated. The 2.4 kernel handles swapping by scanning through the page tables, one process at a time, and invalidating entries for pages that look like suitable victims. If it happens to find all of the page table entries in time, the page can then be evicted to disk. @@ -30,7 +30,7 @@ tags: > Now a new technique, as embodied in [this patch](https://lwn.net/Articles/23584/) by Dave McCracken, has been proposed. This approach, called "object-based reverse mapping," is based on the realization that, in some cases at least, there are other paths from a `struct page` to a page table entry. If those paths can be used, the full RMAP overhead is unnecessary and can be cut out. -Dave McCracken 提交的[补丁](https://lwn.net/Articles/23584/)提出了一种新的解决方法。这种被称之为 “基于对象的反向映射” ("object-based reverse mapping",译者注,下文直接使用 object-based RMAP,不再翻译)的方法至少说明,我们可以找到新的方法,从 `struct page` 找到映射该物理页的页表条目。如果该方法可行的话,将显著解决 RMAP 的巨大开销问题。 +Dave McCracken 提交的[补丁][1] 提出了一种新的解决方法。这种被称之为 “基于对象的反向映射” ("object-based reverse mapping",译者注,下文直接使用 object-based RMAP,不再翻译)的方法至少说明,我们可以找到新的方法,从 `struct page` 找到映射该物理页的页表条目。如果该方法可行的话,将显著解决 RMAP 的巨大开销问题。 > By one reckoning, there are two basic types of user-mode page in a Linux system. ***Anonymous*** pages are just plain memory, the kind a process would get from `malloc()`. Most other pages are ***file-backed*** in some way; this means that, behind the scenes, the contents of that page are associated with a file somewhere in the system. File-backed pages include program code and files mapped in with `mmap()`. For these pages, it is possible to find their page table entries without using RMAP entries. To see how, let us refer to the following low-quality graphic, the result of your editor's nonexistent drawing skills: @@ -52,10 +52,13 @@ object-based RMAP 补丁没有更改匿名页的处理方式,因为对于匿 > Martin Bligh has posted [some initial benchmarks](https://lwn.net/Articles/23740/) showing some moderate improvement in the all-important kernel compilation test. The object-based approach does seem to help with some of the worst RMAP performance regressions. Andrew Morton [pointed out](https://lwn.net/Articles/23742/) a worst-case performance scenario for this approach, but it is not clear how big a problem it would really be. Andrew has included this patch in his [2.5.62-mm3](https://lwn.net/Articles/23567/) tree. -Martin Bligh 发布了[一些初步的基准测试结果](https://lwn.net/Articles/23740/),对于一些重要的内核编译版本的测试结果显示,情况有了一定的改善。在性能回归测试中可以看到,基于对象的方法确实有助于改进原来最差情况下反向映射的执行效果。Andrew Morton [指出](https://lwn.net/Articles/23742/)了基于这种方法可能会碰到的一种最差的情况,但目前尚不清楚实际运行中它究竟会带来多大的影响。无论如何,Andrew 已在他维护的 [2.5.62-mm3](https://lwn.net/Articles/23567/) 版本中加入了这个补丁。 +Martin Bligh 发布了[一些初步的基准测试结果][2],对于一些重要的内核编译版本的测试结果显示,情况有了一定的改善。在性能回归测试中可以看到,基于对象的方法确实有助于改进原来最差情况下反向映射的执行效果。Andrew Morton [指出][3] 了基于这种方法可能会碰到的一种最差的情况,但目前尚不清楚实际运行中它究竟会带来多大的影响。无论如何,Andrew 已在他维护的 [2.5.62-mm3][4] 版本中加入了这个补丁。 > Assuming that this patch goes in (it's late in the development process, but that hasn't stopped Linus from taking rather more disruptive VM patches before...), one might wonder if a complete object-based implementation might follow. The answer is "probably not." Anonymous pages tend to be private to individual processes, so there is no long chain of reverse mappings to manage in any case. So even if such pages came to look like file-backed pages (as could happen, say, with a rework of the swapping code), there isn't necessarily much to be gained from the object-based approach. 假定这个补丁会被内核主线所采纳(从目前的开发阶段来看是有点晚,但根据以往的经验,虽然这些补丁在改动上比较激进,但并不排除 Linus 同志仍会将它们继续合入虚拟内存子系统),人们可能会推测后面是否会有一个基于对象技术的更全面的实现。但答案是 “可能不会”。匿名页对于各个进程来说往往是私有的,因此一般情况下不会存在需要管理很多反向映射项的问题。因此,即使可以让这些页(指匿名页)看起来和文件映射页一样工作(这是可能的,例如,通过重新设计页交换部分的代码),但由于匿名页并不能从基于对象的方法上获得好处,所以进一步的统一也没有必要。 -[1]: http://tinylab.org +[1]: https://lwn.net/Articles/23584/ +[2]: https://lwn.net/Articles/23740/ +[3]: https://lwn.net/Articles/23742/ +[4]: https://lwn.net/Articles/23567/ diff --git a/_posts/2018-08-29-15-53-18-lwn-75198-vm-ii-return-of-objrmap.md b/_posts/2018-08-29-15-53-18-lwn-75198-vm-ii-return-of-objrmap.md index 40ff6a46d01e1b0bbb4b5249e4e100ea45decddf..2006034ffd251f77052929a2117d07b4843fbfbf 100644 --- a/_posts/2018-08-29-15-53-18-lwn-75198-vm-ii-return-of-objrmap.md +++ b/_posts/2018-08-29-15-53-18-lwn-75198-vm-ii-return-of-objrmap.md @@ -17,7 +17,7 @@ tags: > 原文:[Virtual Memory II: the return of objrmap](https://lwn.net/Articles/75198/) > 原创:By corbet @ Mar. 10, 2004 -> 翻译:By [unicornx](https://github.com/unicornx) of [TinyLab.org][1] +> 翻译:By [unicornx](https://github.com/unicornx) > 校对:By [Wen Yang](https://github.com/w-simon) > Andrea Arcangeli not only wants to make the Linux kernel scale to and beyond 32GB of memory on 32-bit processors; he seems to be in a real hurry. There are, it would seem, customers waiting for a 2.6-based distribution which can run in such environments. @@ -26,7 +26,7 @@ tags: > For Andrea, the real culprit in the exhaustion of low memory is clear: it's the reverse-mapping virtual memory ("rmap") code. The rmap code was first described on this page [in January, 2002](http://lwn.net/2002/0124/kernel.php3); its purpose is to make it easier for the kernel to free memory when swapping is required. To that end, rmap maintains, for each physical page in the system, a chain of reverse pointers; each pointer indicates a page table which has a reference for that page. By following the rmap chains, the kernel can quickly find all mappings for a given page, unmap them, and swap the page out. -对于 Andrea 来说,真正导致低端内存被耗尽的罪魁祸首是虚拟内存中实现的反向映射代码(译者注,反向映射,英文是 reverse-mapping,简称 “rmap”,下文用 rmap 指代早期内核版本中的反向映射技术,和 objrmap 相对)。 在 [2002 年 1 月]((http://lwn.net/2002/0124/kernel.php3)) 首次给大家介绍了 rmap;引入它的目的是为了方便内核在执行页交换(swap)时释放内存。rmap 为系统中的每个物理页维护一个链表用于保存反向映射指针;每个指针指向映射该物理页的一个页表。在给定一个物理页后,通过遍历其 rmap 链表,内核可以快速查找到映射该物理页的所有进程,逐个取消这些映射关系后就可以将该物理页交换出来(page out)。 +对于 Andrea 来说,真正导致低端内存被耗尽的罪魁祸首是虚拟内存中实现的反向映射代码(译者注,反向映射,英文是 reverse-mapping,简称 “rmap”,下文用 rmap 指代早期内核版本中的反向映射技术,和 objrmap 相对)。 在 [2002 年 1 月][1] 首次给大家介绍了 rmap;引入它的目的是为了方便内核在执行页交换(swap)时释放内存。rmap 为系统中的每个物理页维护一个链表用于保存反向映射指针;每个指针指向映射该物理页的一个页表。在给定一个物理页后,通过遍历其 rmap 链表,内核可以快速查找到映射该物理页的所有进程,逐个取消这些映射关系后就可以将该物理页交换出来(page out)。 > The rmap code solved some real performance problems in the kernel's virtual memory subsystem, but it, too has a cost. Every one of those reverse mapping entries consumes memory - low memory in particular. Much effort has gone into reducing the memory cost of the rmap chains, but the simple fact remains: as the amount of memory (and the number of processes using that memory) goes up, the rmap chains will consume larger amounts of low memory. Eliminating the rmap overhead would go a long way toward allowing the kernel to scale to larger systems. Of course, one wants to eliminate this overhead while not losing the benefits that rmap brings. @@ -34,15 +34,15 @@ rmap 技术解决了内核虚拟内存子系统中的性能问题,但它也是 > Andrea's approach is to bring back and extend the object-based reverse mapping patches. The initial object-based patch was created by Dave McCracken; LWN [covered this patch](http://lwn.net/Articles/23732/) a year ago. Essentially, this patch eliminates the rmap chains for memory which maps a file by following pointers "the long way around" and searching candidate virtual memory areas (VMAs). Andrea has [updated this patch](https://lwn.net/Articles/74812/) and fixed some bugs, but the core of the patch remains the same; see last year's description for the details. -Andrea 参考了原先的基于对象的反向映射(object-based reverse mapping)补丁并基于该补丁做了改进。这个补丁最初是由 Dave McCracken 提交的;LWN 一年前[为大家介绍过](/lwn-23732)。这个补丁最主要的优点,是针对文件映射使用的物理页,消除了 rmap 对内存的巨大需求,但代价是它需要通过 “更复杂” 的方式反向查找到映射该物理页的页表项,这其中还包括需要搜索关联的虚拟内存区域(virtual memory area,简称 VMA)。Andrea [对该补丁进行了修改](https://lwn.net/Articles/74812/)并修复了一些错误,但补丁的核心思想仍然保持不变;有关其核心思想可以参阅[去年的详细介绍](/lwn-23732-object-based-reverse-mapping-vm)。 +Andrea 参考了原先的基于对象的反向映射(object-based reverse mapping)补丁并基于该补丁做了改进。这个补丁最初是由 Dave McCracken 提交的;LWN 一年前[为大家介绍过][2]。这个补丁最主要的优点,是针对文件映射使用的物理页,消除了 rmap 对内存的巨大需求,但代价是它需要通过 “更复杂” 的方式反向查找到映射该物理页的页表项,这其中还包括需要搜索关联的虚拟内存区域(virtual memory area,简称 VMA)。Andrea [对该补丁进行了修改][3] 并修复了一些错误,但补丁的核心思想仍然保持不变;有关其核心思想可以参阅[去年的详细介绍][2]。 > [Last week](https://lwn.net/Articles/73100/), we raised the possibility that the virtual memory subsystem could see fundamental changes in the course of the 2.6 "stable" series. This week, Linus [confirmed that possibility](https://lwn.net/Articles/75217/) in response to Andrea's object-based reverse mapping patch: > I certainly prefer this to the 4:4 horrors. So it sounds worth it to put it into -mm if everybody else is ok with it. -[上周](https://lwn.net/Articles/73100/),我们提出是否有可能在 2.6 的 “稳定”版本系列中看到这个重大改变。本周,Linus [确认了这种可能性](https://lwn.net/Articles/75217/)并提到了 Andrea 的基于对象的反向映射补丁: +[上周][4],我们提出是否有可能在 2.6 的 “稳定”版本系列中看到这个重大改变。本周,Linus [确认了这种可能性][5] 并提到了 Andrea 的基于对象的反向映射补丁: - 相对于 “4:4”(译者注,指 [4G/4G 补丁](http://lwn.net/Articles/39925/)),我更倾向于合入这个补丁(译者注,指 Andrea 的基于对象的反向映射补丁)。如果其他人都觉得没问题的话,我将把它合入 “-mm” 代码版本库。 + 相对于 “4:4”(译者注,指 [4G/4G 补丁][6]),我更倾向于合入这个补丁(译者注,指 Andrea 的基于对象的反向映射补丁)。如果其他人都觉得没问题的话,我将把它合入 “-mm” 代码版本库。 > Assuming this work goes forward, it has the usual implications for the stable kernel. Even assuming that it stays in the -mm tree for some time, its inclusion into 2.6 is likely to destabilize things for a few releases until all of the obscure bugs are shaken out. @@ -54,7 +54,7 @@ Dave McCracken 提交的补丁起初只解决了部分问题。它解决了那 > To that end, Andrea has posted [another patch](https://lwn.net/Articles/75098/) (in preliminary form) which provides object-based reverse mapping for anonymous memory as well. It works, essentially, by replacing the rmap chain with a pointer to a chain of virtual memory area (VMA) structures. -为此,Andrea 提交了[另一个补丁](https://lwn.net/Articles/75098/)(目前还处于原型状态),它为匿名内存也提供了基于对象的反向映射机制。它本质上是用虚拟内存区域(VMA)链表替换了 rmap 所使用的针对每个物理页所维护的反向映射链表。 +为此,Andrea 提交了[另一个补丁][7](目前还处于原型状态),它为匿名内存也提供了基于对象的反向映射机制。它本质上是用虚拟内存区域(VMA)链表替换了 rmap 所使用的针对每个物理页所维护的反向映射链表。 > Anonymous pages are always created in response to a request for memory from a single process; as a result, they are never shared at creation time. Given that, there is no need for a new anonymous page to have a chain of reverse mappings; we know that there can be only a single mapping. Andrea's patch adds a union to `struct page` which includes the existing `mapping` pointer (for non-anonymous memory) and adds a couple of new ones. One of those is simply called `vma`, and it points to the (single) VMA structure pointing to the page. So if a process has several non-shared, anonymous pages in the same virtual memory area, the structure looks somewhat like this: @@ -82,10 +82,18 @@ Dave McCracken 提交的补丁起初只解决了部分问题。它解决了那 > This approach does incur a greater computational cost. Freeing a page requires scanning multiple VMAs which may or may not contain references to the page under consideration. This cost will increase with the number of processes sharing a memory region. Ingo Molnar, who is fond of O(1) solutions, [is nervous](https://lwn.net/Articles/75225/) about object-based schemes for this reason. According to Ingo, losing the possibility of creating an O(1) page unmapping scheme is a heavy cost to pay for the prize of making large amounts of memory work on obsolete hardware. -这种方法确实会产生更大的计算成本。释放物理页需要扫描多个 VMA,这些 VMA 可能映射了该物理页,也可能没有。查找的成本将随共享内存区的进程的数量增加而增加。更倾向于 `O(1)` 解决方案的 Ingo Molnar 针对该场景下的 objrmap 方案表达了他的[担忧](https://lwn.net/Articles/75225/)。根据 Ingo 的说法,在取消页面映射处理过程中,仅仅是为了在过时的机器上支持大容量内存就放弃 `O(1)` 的算法实在是得不偿失。 +这种方法确实会产生更大的计算成本。释放物理页需要扫描多个 VMA,这些 VMA 可能映射了该物理页,也可能没有。查找的成本将随共享内存区的进程的数量增加而增加。更倾向于 `O(1)` 解决方案的 Ingo Molnar 针对该场景下的 objrmap 方案表达了他的[担忧][8]。根据 Ingo 的说法,在取消页面映射处理过程中,仅仅是为了在过时的机器上支持大容量内存就放弃 `O(1)` 的算法实在是得不偿失。 > The solution that Ingo would like to see, instead, is to reduce the per-page memory overhead by reducing the number of pages. The means to that end is [page clustering](https://lwn.net/Articles/23785/) - grouping adjacent hardware pages into larger virtual pages. Page clustering would reduce rmap overhead, and reduce the size of the main kernel memory map as well. The available page clustering patch is even more intrusive than object-based reverse mapping, however; it seems seriously unlikely to be considered for 2.6. -相反,Ingo 建议的解决方案是通过减少物理页的数量来减少每页的内存开销。解决的方案是[对页面进行合并(page clustering)](https://lwn.net/Articles/23785/),即将相邻的物理页面组合为更大的虚拟页面。合并物理页后会减少 rmap 的开销,同时也会减少内核中内存映射表的大小。然而,相对于 objrmap 补丁,“page clustering” 补丁的修改过于激进,看上去不太可能为 2.6 版本所接受。 - -[1]: http://tinylab.org +相反,Ingo 建议的解决方案是通过减少物理页的数量来减少每页的内存开销。解决的方案是[对页面进行合并(page clustering)][9],即将相邻的物理页面组合为更大的虚拟页面。合并物理页后会减少 rmap 的开销,同时也会减少内核中内存映射表的大小。然而,相对于 objrmap 补丁,“page clustering” 补丁的修改过于激进,看上去不太可能为 2.6 版本所接受。 + +[1]: http://lwn.net/2002/0124/kernel.php3 +[2]: /lwn-23732 +[3]: https://lwn.net/Articles/74812/ +[4]: https://lwn.net/Articles/73100/ +[5]: https://lwn.net/Articles/75217/ +[6]: https://lwn.net/Articles/39925/ +[7]: https://lwn.net/Articles/75098/ +[8]: https://lwn.net/Articles/75225/ +[9]: https://lwn.net/Articles/23785/ \ No newline at end of file diff --git a/_posts/2018-09-01-08-43-22-lwn-383162-case-of-overly-anonymous-anon_vma.md b/_posts/2018-09-01-08-43-22-lwn-383162-case-of-overly-anonymous-anon_vma.md index 3ebaa242b12622520d2fb12188964d3e1bd56dee..005754880321f62d79c0c9fadf183f9782a7b35f 100644 --- a/_posts/2018-09-01-08-43-22-lwn-383162-case-of-overly-anonymous-anon_vma.md +++ b/_posts/2018-09-01-08-43-22-lwn-383162-case-of-overly-anonymous-anon_vma.md @@ -17,24 +17,24 @@ tags: > 原文:[The case of the overly anonymous anon_vma](https://lwn.net/Articles/383162/) > 原创:By corbet @ Apr. 13, 2010 -> 翻译:By [unicornx](https://github.com/unicornx) of [TinyLab.org][1] +> 翻译:By [unicornx](https://github.com/unicornx) > 校对:By [Wen Yang](https://github.com/w-simon) > During the stabilization phase of the kernel development cycle, the -rc releases typically happen about once every week. [2.6.34-rc4](http://lwn.net/Articles/383198/) is a clear exception to that rule, coming nearly two weeks after the preceding -rc3 release. The holdup in this case was a nasty regression which occupied a number of kernel developers nearly full time for days. The hunt for this bug is a classic story of what can happen when the code gets too complex. -在内核开发周期的集成阶段,“-rc” 版本通常每周发布一次。但在上一个 “-rc3” 版本发布后经过了将近整整两周的时间,新版本 2.6.34-rc4 才姗姗来迟。背后的具体原因是为了定位一个令人头痛的 bug,以及 bug 解决后执行了一次全面的回归测试,这耗费了众多内核开发人员的大量时间。整个过程称得上是一个经典的案例,它告诉我们,当代码过于复杂时究竟会发生些什么。下面就给大家介绍一下这个故事。 +在内核开发周期的集成阶段,“-rc” 版本通常每周发布一次。但在上一个 “-rc3” 版本发布后经过了将近整整两周的时间,新版本 [2.6.34-rc4][1] 才姗姗来迟。背后的具体原因是为了定位一个令人头痛的 bug,以及 bug 解决后执行了一次全面的回归测试,这耗费了众多内核开发人员的大量时间。整个过程称得上是一个经典的案例,它告诉我们,当代码过于复杂时究竟会发生些什么。下面就给大家介绍一下这个故事。 > Sending email to linux-kernel can be an intimidating prospect for a number of reasons, one of which being that one never knows when a massive thread - involving hundreds of messages copied back to the original sender - might result. Borislav Petkov's [2.6.34-rc3 bug report](https://lwn.net/Articles/383163/) was one such posting. In this case, though, the ensuing thread was in no way inflammatory; it represents, instead, some of the most intensive head-scratching which has been seen on the list for a while. -给 linux-kernel (译者注:内核开发的邮件列表)发送电子邮件的结果可能会超出预期,原因有很多,其中一个原因是说不定就会收到海量的(数百封)的邮件回复。Borislav Petkov 发送的的[有关 2.6.34-rc3 版本测试的错误报告](https://lwn.net/Articles/383163/)就是这样一个帖子。当然这些回复绝对不是针对他个人的,这只是说明社区的确碰到了一个非常令人头痛的问题。 +给 linux-kernel (译者注:内核开发的邮件列表)发送电子邮件的结果可能会超出预期,原因有很多,其中一个原因是说不定就会收到海量的(数百封)的邮件回复。Borislav Petkov 发送的的[有关 2.6.34-rc3 版本测试的错误报告][2] 就是这样一个帖子。当然这些回复绝对不是针对他个人的,这只是说明社区的确碰到了一个非常令人头痛的问题。 > The bug, as reported by Borislav, was a null pointer dereference which would happen reasonably reliably after hibernating (and restarting) the system. It was quickly recognized as being the same as [another bug report](https://bugzilla.kernel.org/show_bug.cgi?id=15680) filed the same day by Steinar H. Gunderson, though this one did not involve hibernation. The common thread was null pointer dereferences provoked by memory pressure. The offending patch was [identified by Linus](https://lwn.net/Articles/383165/) almost immediately; it's worth taking a look at what that patch did. -Borislav 报告的这个错误是有关一个空指针异常,该异常在系统休眠(并重新启动)后必现。很快它被认定与 Steinar H. Gunderson 在同一天提交的另一份错误报告是同一件事情,尽管另一份报告并未涉及系统休眠。这两个错误报告相同的部分都涉及在内存紧张时会导致空指针异常。Linus 几乎立即就[发现了](https://lwn.net/Articles/383165/)导致问题的补丁; 我们一起来看看那个补丁做了什么。 +Borislav 报告的这个错误是有关一个空指针异常,该异常在系统休眠(并重新启动)后必现。很快它被认定与 Steinar H. Gunderson 在同一天提交的 [另一份错误报告][3] 是同一件事情,尽管另一份报告并未涉及系统休眠。这两个错误报告相同的部分都涉及在内存紧张时会导致空指针异常。Linus 几乎立即就[发现了][4] 导致问题的补丁; 我们一起来看看那个补丁做了什么。 > Way back in 2004, LWN [covered the addition of the anon_vma code](http://lwn.net/Articles/75198/); this patch was controversial at the time because the upcoming 2.6.7 kernel was still expected to be an old-style "stable, no new features" release. This patch, a 40-part series which fundamentally reworked the virtual memory subsystem, was not seen as stable material, despite Linus's [attempt](http://lwn.net/Articles/86718/) to characterize it as an "implementation detail." Still, over time, this code has proved solid and has not been changed significantly since - until now. -早在 2004 年,LWN 就[介绍了有关内核增加 `anon_vma` 的事](/lwn-75198);这个补丁在当时是有争议的,因为当时准备合入该补丁的内核版本 2.6.7 按计划其发布目标是 “稳定,不引入新功能”。尽管 Linus [试图](http://lwn.net/Articles/86718/) 将该补丁描述为 “只是实现细节上的改变” ,但实际情况是该补丁集包含了 40 个补丁修改,从根本上改造了虚拟内存子系统,对内核的稳定有很大的影响。不过,随着时间的推移,这段代码已经被证明是可靠的,并且从那以后也一直没有大的改变。 +早在 2004 年,LWN 就[介绍了有关内核增加 `anon_vma` 的事][5];这个补丁在当时是有争议的,因为当时准备合入该补丁的内核版本 2.6.7 按计划其发布目标是 “稳定,不引入新功能”。尽管 Linus [试图][6] 将该补丁描述为 “只是实现细节上的改变” ,但实际情况是该补丁集包含了 40 个补丁修改,从根本上改造了虚拟内存子系统,对内核的稳定有很大的影响。不过,随着时间的推移,这段代码已经被证明是可靠的,并且从那以后也一直没有大的改变。 > The problem solved by anon_vma was that of locating all `vm_area_struct` (VMA) structures which reference a given anonymous (heap or stack memory) page. Anonymous pages are not normally shared between processes, but every call to `fork()` will cause all such pages to be shared between the parent and the new child; that sharing will only be broken when one of the processes writes to the page, causing a copy-on-write (COW) operation to take place. Many pages are never written, so the kernel must be able to locate multiple VMAs which reference a given anonymous page. Otherwise, it would not be able to unmap the page, meaning that the page could not be swapped out. @@ -54,7 +54,7 @@ Borislav 报告的这个错误是有关一个空指针异常,该异常在系 > In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. -这个解决方案在扩展性上远远超过上个版本(译者注,指 2.6 早期所使用的反向映射技术),但随着硬件和应用的发展,其不足之处开始逐渐显现。这导致 Rik van Riel 开始着手解决其性能问题,编写了[这个补丁](http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5beb49305251e5669852ed541e8e2f2f7696c53e),并将其合入了 2.6.34。下面是 Rik 描述这个问题的原话: +这个解决方案在扩展性上远远超过上个版本(译者注,指 2.6 早期所使用的反向映射技术),但随着硬件和应用的发展,其不足之处开始逐渐显现。这导致 Rik van Riel 开始着手解决其性能问题,编写了[这个补丁][7],并将其合入了 2.6.34。下面是 Rik 描述这个问题的原话: 假设一个父进程其 VMA 映射了 1000 个物理页,而该父进程派生(fork)了 1000 个子进程,当这 1000 个子进程对每个匿名页都发生了写入操作(COWed),这将导致系统中存在一百万个匿名页,并且这一百万个匿名页全都指向同一个 anon_vma(译者注,在该场景下这个 anon_vma 所管理的 VMA 链表上实际会有 1001 项(包括父进程),具体参考上图),当我们从任一个匿名页出发寻找其对应的进程(即 VMA)时会发现遍历的这个链表很长但实际对应它的只有一项,也就是说整个搜索算法的时间复杂度是 O(N)的。 @@ -115,7 +115,7 @@ Rik 的解决方案是为每个进程创建一个 `anon_vma` 结构,并将它 > Linus was clearly beginning to [wonder](https://lwn.net/Articles/383170/) when it might all end: "Three independent bugs found and fixed, and still no joy?" He repeatedly considered just reverting the change outright, but he was reluctant to do so; the solution seemed so tantalizingly close. Eventually he [developed another hypothesis](https://lwn.net/Articles/383171/) which seemed plausible. An anonymous page shared between parent and child would initially point to the parent's `anon_vma`: -Linus 显然开始[怀疑](https://lwn.net/Articles/383170/)这事情何时才会了结:“虽然我们发现并修复了三个毫无关系的错误,可是为什么一点也感觉不到快乐呢?” 他反复考虑是否需要彻底回退版本,但他实在不情愿这么做;离最终的解决似乎总是只有一步之遥。最终,他[提出了另一个看似合理的假设](https://lwn.net/Articles/383171/)。考虑如下场景,最初父进程和子进程之间共享的匿名页指向父进程的 `anon_vma`: +Linus 显然开始[怀疑][8] 这事情何时才会了结:“虽然我们发现并修复了三个毫无关系的错误,可是为什么一点也感觉不到快乐呢?” 他反复考虑是否需要彻底回退版本,但他实在不情愿这么做;离最终的解决似乎总是只有一步之遥。最终,他[提出了另一个看似合理的假设][9]。考虑如下场景,最初父进程和子进程之间共享的匿名页指向父进程的 `anon_vma`: ![AV Chain](https://static.lwn.net/images/ns/kernel/avchain5.png) @@ -137,6 +137,16 @@ Linus 显然开始[怀疑](https://lwn.net/Articles/383170/)这事情何时才 > The fix is straightforward; when linking an existing page to an `anon_vma` structure, the kernel needs to pick the one which is highest in the process hierarchy; that guarantees that the `anon_vma` will not go away prematurely. [Early testing](https://lwn.net/Articles/383172/) suggests that the problem has indeed been fixed. In the process, three other problems have been fixed and Linus has come to understand a tricky bit of code which, if he has his way, will soon gain some improved documentation. In other words, it would appear to be an outcome worth waiting for. -修复很简单; 当将一个物理页关联到一个 `anon_vma` 结构体时,内核应该选择进程派生层次中层次最高的那个(译者注,以上面的例子为例,即父进程的 AV);这保证了 `anon_vma` 不会过早被删除。 [早期测试](https://lwn.net/Articles/383172/)表明问题确实已经得到了解决。在整个过程中,不仅顺带解决了其他三个问题,Linus 还亲自理解和分析了一些棘手的代码,如果按照他的方式,将很快改进一些相关文档。换句话说,这一番折腾还是值得的。 +修复很简单; 当将一个物理页关联到一个 `anon_vma` 结构体时,内核应该选择进程派生层次中层次最高的那个(译者注,以上面的例子为例,即父进程的 AV);这保证了 `anon_vma` 不会过早被删除。 [早期测试][10] 表明问题确实已经得到了解决。在整个过程中,不仅顺带解决了其他三个问题,Linus 还亲自理解和分析了一些棘手的代码,如果按照他的方式,将很快改进一些相关文档。换句话说,这一番折腾还是值得的。 + +[1]: https://lwn.net/Articles/383198/ +[2]: https://lwn.net/Articles/383163/ +[3]: https://bugzilla.kernel.org/show_bug.cgi?id=15680 +[4]: https://lwn.net/Articles/383165/ +[5]: /lwn-75198 +[6]: https://lwn.net/Articles/86718/ +[7]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5beb49305251e5669852ed541e8e2f2f7696c53e +[8]: https://lwn.net/Articles/383170/ +[9]: https://lwn.net/Articles/383171/ +[10]: https://lwn.net/Articles/383172/ -[1]: http://tinylab.org