Linux Page Allocation and the Buddy System

The Buddy System

As a Linux system runs, pages are continuously allocated and freed and physical memory gradually becomes fragmented. When an application later needs a large allocation, it becomes hard to find a suitable range of contiguous memory, the search takes longer, and system performance suffers. The kernel therefore uses the buddy system to mitigate page fragmentation.

The basic idea of the buddy system is to manage free pages as blocks of contiguous pages: 2^N contiguous pages form one block, which is linked onto the list for that order. The zone structure contains a free_area array of length MAX_ORDER: free_area[0] manages blocks of 2^0 pages, free_area[1] manages blocks of 2^1 pages, and so on, up to free_area[MAX_ORDER-1], which manages blocks of 2^(MAX_ORDER-1) pages. On allocation, the allocator finds the list whose block size is the smallest one not less than the requested page count. If that list has a free block, the block is taken; one part satisfies the request, and the remainder is split into power-of-two pieces and hung back onto the lists of smaller orders. This may require several splits until all leftover pages are back on a list, and the splitting always tries to return the largest possible pieces. Freeing works in reverse: the freed block is inserted into the appropriate list, and if it can be merged with its neighbouring buddy it is combined into a higher-order block and moved to the higher-order list; this repeats until no further merge is possible. In effect, freeing gradually reassembles fragmented pages into large contiguous blocks.

#define MAX_ORDER 11
 
struct zone {
    struct free_area    free_area[MAX_ORDER];   /* free lists for blocks of different orders */
};

struct free_area {
    struct list_head        free_list[MIGRATE_TYPES];
    unsigned long           nr_free;
};
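
The split behaviour described above can be modelled with a small user-space sketch (purely illustrative and greatly simplified: it tracks only a per-order free-block count and ignores struct page, migrate types and locking):

#include <stdio.h>

#define TOY_MAX_ORDER 11

/* Number of free blocks of each order; a stand-in for free_area[].nr_free. */
static unsigned long toy_free[TOY_MAX_ORDER];

/*
 * Allocate a block of 2^order pages: find the first non-empty list at or
 * above 'order', then split the block, pushing the unused halves back onto
 * the lower-order lists (the same idea as expand() in mm/page_alloc.c).
 */
static int toy_alloc(unsigned int order)
{
    unsigned int cur;

    for (cur = order; cur < TOY_MAX_ORDER; cur++) {
        if (!toy_free[cur])
            continue;
        toy_free[cur]--;                 /* take one block of order 'cur' */
        while (cur > order) {            /* split until it matches the request */
            cur--;
            toy_free[cur]++;             /* one half goes back as a free buddy */
            printf("split: put one order-%u block back on the free list\n", cur);
        }
        return 0;                        /* the remaining half is the allocation */
    }
    return -1;                           /* nothing large enough is free */
}

int main(void)
{
    toy_free[10] = 1;                    /* start with a single 2^10-page block */
    if (toy_alloc(2) == 0)               /* ask for 2^2 = 4 pages */
        printf("allocated an order-2 block\n");
    return 0;
}
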
Buddy system memory blocks

At boot time the kernel builds the physical pages described by the memory configuration into the buddy system, and it tries to group pages into the largest possible blocks, linking them into the highest-order list. This keeps fragmentation at a minimum right after boot.
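
As a rough illustration of that idea (a hypothetical user-space sketch, not the kernel's actual early free path), a contiguous pfn range can be carved into the largest naturally aligned power-of-two blocks, so that most of it lands on the highest-order list:

#include <stdio.h>

#define TOY_MAX_ORDER 11

/* Carve [start, end) into the largest naturally aligned power-of-two blocks. */
static void carve_range(unsigned long start, unsigned long end)
{
    while (start < end) {
        unsigned int order = TOY_MAX_ORDER - 1;

        /* shrink the order until the block is aligned and fits in the range */
        while ((start & ((1UL << order) - 1)) || start + (1UL << order) > end)
            order--;

        printf("pfn %lu: free one order-%u block (%lu pages)\n",
               start, order, 1UL << order);
        start += 1UL << order;
    }
}

int main(void)
{
    carve_range(3, 2600);   /* an arbitrary, slightly misaligned pfn range */
    return 0;
}
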

Memory migration types

The buddy system is one of the best-engineered parts of the kernel, but it has shortcomings. First, it can only reduce fragmentation passively, by merging blocks when memory is freed; it has no policy for avoiding fragmentation at allocation time. Second, in the extreme case only a small amount of memory may be allocated while most pages are free, yet the allocated pages are scattered all over the place, so the free memory cannot be assembled into large power-of-two blocks. An obvious idea is to compact memory at a suitable moment, gathering the scattered pages together by re-allocating or releasing them; this is indeed possible, and it is what led to the notion of page mobility. The analogy is a file system: disk storage also fragments, and dedicated tools exist to defragment it. Of course, not every page can be moved or released; it depends on how each page is being used.

Looking at struct free_area, the modern buddy system does not simply hang blocks on MAX_ORDER doubly linked lists: each order has MIGRATE_TYPES lists. The kernel divides allocations into several migrate types, unmovable, movable, reclaimable and so on; this article only discusses the first three.

enum {
        MIGRATE_UNMOVABLE,
        MIGRATE_MOVABLE,
        MIGRATE_RECLAIMABLE,
        MIGRATE_PCPTYPES,       /* the number of types on the pcp lists */
        MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
        /*
         * MIGRATE_CMA migration type is designed to mimic the way
         * ZONE_MOVABLE works.  Only movable pages can be allocated
         * from MIGRATE_CMA pageblocks and page allocator never
         * implicitly change migration type of MIGRATE_CMA pageblock.
         *
         * The way to use it is to change migratetype of a range of
         * pageblocks to MIGRATE_CMA which can be done by
         * __free_pageblock_cma() function.  What is important though
         * is that a range of pageblocks must be aligned to
         * MAX_ORDER_NR_PAGES should biggest page be bigger then
         * a single pageblock.
         */
        MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
        MIGRATE_ISOLATE,        /* can't allocate from here */
#endif
        MIGRATE_TYPES
};
  • Movable (MIGRATE_MOVABLE): most memory allocated to applications is of this type. User space accesses physical memory entirely through MMU address translation, so where a given linear address is mapped is completely transparent to the application. The kernel can therefore, at a suitable time, allocate a new page, copy the contents of the old page over, and update the process page tables to re-map the address, compacting memory without affecting the running application at all.
  • Unmovable (MIGRATE_UNMOVABLE): memory occupied by core kernel data structures and code belongs to this type. Once allocated it cannot be moved, only freed explicitly.
  • Reclaimable (MIGRATE_RECLAIMABLE): application text segments and the page cache of open files largely fall into this type. The key property is that their contents can be restored from disk or another backing store, so when memory is badly fragmented or running low these pages can simply be released and re-read from the backing data when needed again.

The caller indicates the intended mobility of the pages at allocation time, so pages of different types are allocated from different regions; this prevents long-lived UNMOVABLE pages from breaking up otherwise contiguous free memory. When an allocation fails, the kernel first compacts memory, mainly by moving the contents of physical pages and updating their mappings and by releasing reclaimable pages, actively rearranging pages so that physical memory becomes contiguous again.

[Figure: buddy-system movable page management]

Movable memory allocation

Since the kernel divides pages into different types managed on separate lists, when does it decide which list a page belongs to? As memmap_init_zone shows, at boot all memory starts out as movable. During kernel initialization pages are allocated continuously, and most of them are unmovable and permanent; at that stage the system is very likely to hand out contiguous memory in the largest possible blocks, and the movable pageblocks those allocations come from are re-marked as unmovable. So once initialization has finished, a batch of unmovable memory exists without having created much fragmentation. Reclaimable memory, on the other hand, is mostly produced while application processes run.

/*
 * Initially all pages are reserved - free ones are freed
 * up by free_all_bootmem() once the early boot process is
 * done. Non-atomic initialization, single-pass.
 */
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
        unsigned long start_pfn, enum memmap_context context)
{
    /*  ......  */
    for (pfn = start_pfn; pfn < end_pfn; pfn++) {
        /*  ......  */
not_early:
        if (!(pfn & (pageblock_nr_pages - 1))) {
            struct page *page = pfn_to_page(pfn);

            __init_single_page(page, pfn, zone, nid);
            set_pageblock_migratetype(page, MIGRATE_MOVABLE);
        } else {
            __init_single_pfn(pfn, zone, nid);
        }
    }
}

Similar to allocating pages from a zone, when the current node does not have enough pages the allocator falls back to other zones according to the priority order given by the zonelists in pg_data_t. MIGRATE_TYPES works the same way: a fallbacks array defines, for each migrate type, the order in which other types are tried when its own free lists are exhausted.

/*
 * This array describes the order lists are fallen back to when
 * the free lists for the desirable migrate type are depleted
 */
static int fallbacks[MIGRATE_TYPES][4] = {
    //  fallback order when allocating unmovable pages fails
    [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
    //  fallback order when allocating reclaimable pages fails
    [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
    //  fallback order when allocating movable pages fails
    [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
#ifdef CONFIG_CMA
    [MIGRATE_CMA]     = { MIGRATE_TYPES }, /* Never used */
#endif
#ifdef CONFIG_MEMORY_ISOLATION
    [MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
#endif
};
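
A hypothetical sketch of how such a table is consumed (much simpler than the kernel's real __rmqueue_fallback(), which also steals whole pageblocks): try the requested type first, then walk its fallback row until MIGRATE_TYPES terminates the search.

#include <stdio.h>

enum { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_TYPES };

/* Same shape as the kernel's fallbacks[] table, trimmed to three types. */
static const int fallbacks[MIGRATE_TYPES][3] = {
    [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
    [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
    [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
};

/* Pretend only the movable list currently has free blocks. */
static int list_has_free(int mt)
{
    return mt == MIGRATE_MOVABLE;
}

static int pick_migratetype(int requested)
{
    int i, mt;

    if (list_has_free(requested))
        return requested;
    for (i = 0; (mt = fallbacks[requested][i]) != MIGRATE_TYPES; i++)
        if (list_has_free(mt))
            return mt;
    return -1;   /* nothing free anywhere */
}

int main(void)
{
    printf("unmovable request served from type %d\n",
           pick_migratetype(MIGRATE_UNMOVABLE));   /* falls back to MIGRATE_MOVABLE (1) */
    return 0;
}
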

The global variables and helper functions for mobility grouping are always compiled into the kernel, but grouping only makes sense when there is enough memory to spread across the per-migrate-type lists. Since each list should hold a reasonable amount of memory, the kernel needs a definition of "reasonable"; this is provided by the two globals pageblock_order and pageblock_nr_pages. The former is the allocation order the kernel considers "large", and the latter is the number of pages that order corresponds to. If none of the migrate-type lists can hold a reasonably large contiguous block, mobility grouping brings no benefit, so the kernel disables the feature when too little memory is available: build_all_zonelists checks the watermarks, and if the total number of physical pages above the high watermark across all zones is less than pageblock_nr_pages * MIGRATE_TYPES, grouping by mobility is disabled and the global page_group_by_mobility_disabled is set to 1; otherwise it is set to 0. With grouping disabled, every allocation is treated as unmovable.

Each zone has a pageblock_flags field pointing to a bitmap whose size depends on the number of pageblocks the zone manages. NR_PAGEBLOCK_BITS bits are recorded per pageblock and store which migrate type the block belongs to, so that on free the pages can easily be returned to the right buddy list.

#define PB_migratetype_bits 3
/* Bit indices that affect a whole block of pages */
enum pageblock_bits {
    PB_migrate,
    PB_migrate_end = PB_migrate + PB_migratetype_bits - 1,
            /* 3 bits required for migrate types */
    PB_migrate_skip,/* If set the block is skipped by compaction */

    /*
     * Assume the bits will always align on a word. If this assumption
     * changes then get/set pageblock needs updating.
     */
    NR_PAGEBLOCK_BITS
};
// mm/page_alloc.c 
void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) 
{ 
 if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES)) 
    page_group_by_mobility_disabled = 1; 
 else 
    page_group_by_mobility_disabled = 0; 
}

/* Convert GFP flags to their corresponding migrate type */
static inline int gfpflags_to_migratetype(gfp_t gfp_flags)
{
    WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);

    if (unlikely(page_group_by_mobility_disabled))
        return MIGRATE_UNMOVABLE;

    /* Group based on mobility */
    return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
        ((gfp_flags & __GFP_RECLAIMABLE) != 0);
}

void set_pageblock_migratetype(struct page *page, int migratetype)
{
    if (unlikely(page_group_by_mobility_disabled &&
             migratetype < MIGRATE_PCPTYPES))
        migratetype = MIGRATE_UNMOVABLE;

    set_pageblock_flags_group(page, (unsigned long)migratetype,
                    PB_migrate, PB_migrate_end);
}

#define get_pageblock_migratetype(page)                                 \
        get_pfnblock_flags_mask(page, page_to_pfn(page),                \
                        PB_migrate_end, MIGRATETYPE_MASK)
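
A toy user-space model of that bitmap may make the layout clearer (hypothetical; the real helpers additionally deal with word alignment, zone boundaries and atomic bit operations). It stores NR_PAGEBLOCK_BITS bits per pageblock and keeps the migrate type in the low 3 bits:

#include <stdio.h>

#define NR_PAGEBLOCK_BITS   4          /* bits recorded per pageblock */
#define MIGRATETYPE_MASK    0x7        /* low 3 bits hold the migrate type */
#define NR_PAGEBLOCKS       64

static unsigned long pageblock_flags[NR_PAGEBLOCKS * NR_PAGEBLOCK_BITS /
                                     (8 * sizeof(unsigned long))];

static void toy_set_migratetype(unsigned long block, unsigned long mt)
{
    unsigned long bit  = block * NR_PAGEBLOCK_BITS;
    unsigned long word = bit / (8 * sizeof(unsigned long));
    unsigned long off  = bit % (8 * sizeof(unsigned long));

    pageblock_flags[word] &= ~((unsigned long)MIGRATETYPE_MASK << off);
    pageblock_flags[word] |= (mt & MIGRATETYPE_MASK) << off;
}

static unsigned long toy_get_migratetype(unsigned long block)
{
    unsigned long bit  = block * NR_PAGEBLOCK_BITS;
    unsigned long word = bit / (8 * sizeof(unsigned long));
    unsigned long off  = bit % (8 * sizeof(unsigned long));

    return (pageblock_flags[word] >> off) & MIGRATETYPE_MASK;
}

int main(void)
{
    toy_set_migratetype(5, 2);   /* mark pageblock 5 as, say, MIGRATE_RECLAIMABLE */
    printf("pageblock 5 migratetype = %lu\n", toy_get_migratetype(5));
    return 0;
}
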

The kernel uses the two flags __GFP_MOVABLE and __GFP_RECLAIMABLE to distinguish migrate types; if neither is set, the allocation is treated as unmovable.

Memory compaction

Compaction runs in two modes, asynchronous and synchronous. When reclaim fails to satisfy an allocation, asynchronous compaction is started first; if it cannot produce memory that meets the request, direct reclaim is then used to try to free enough memory, and if enough contiguous memory still cannot be obtained, synchronous compaction is attempted as the last step.

Two migrate types are candidates for compaction, MIGRATE_RECLAIMABLE and MIGRATE_MOVABLE, and from the processing point of view there are two kinds of pages, anonymous and file-backed. Asynchronous compaction only migrates anonymous pages, while synchronous compaction also handles dirty file pages and will wait for pages that are under writeback. Asynchronous compaction is therefore relatively conservative: migrating an anonymous page only requires copying its contents to a new page, establishing the new mapping and tearing down the old one, which are all in-memory operations, whereas synchronous compaction may also involve I/O writeback and takes considerably longer.

Page allocation

Kernel page allocations go through struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order). This is the core entry point of the buddy allocator; every other memory-allocation function eventually ends up here.

  • nid: node ID; on a NUMA multi-processor system, each CPU/memory complex is a node
  • gfp_mask: allocation attribute mask
  • order: the order of the allocation, e.g. order 2 means 2^2 = 4 pages

The kernel also defines an alloc_pages macro that fills in the current node ID automatically when requesting pages.

#define alloc_pages(gfp_mask, order) \ 
 alloc_pages_node(numa_node_id(), gfp_mask, order)
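
As a usage sketch (a hypothetical kernel-code fragment, shown only to illustrate the API), a driver could grab four physically contiguous pages and later return them like this:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/errno.h>

static struct page *pg;
static void *buf;

static int grab_four_pages(void)
{
    /* order = 2, i.e. 2^2 = 4 contiguous pages from the local node */
    pg = alloc_pages(GFP_KERNEL, 2);
    if (!pg)
        return -ENOMEM;

    buf = page_address(pg);          /* kernel virtual address (lowmem pages only) */
    return 0;
}

static void drop_four_pages(void)
{
    if (pg)
        __free_pages(pg, 2);         /* the order must match the allocation */
}
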

The allocation mask is important: it determines the policy the buddy allocator follows. GFP is short for "get free page".

/* Plain integer GFP bitmasks. Do not use this directly. */
//  Zone modifiers: select which zone to allocate from first; if all three
//  below are clear, allocation starts from low memory
#define ___GFP_DMA              0x01u       /* allocate from ZONE_DMA */
#define ___GFP_HIGHMEM          0x02u       /* allocate from high memory */
#define ___GFP_DMA32            0x04u       /* allocate from ZONE_DMA32 */

//  Behaviour modifiers
#define ___GFP_MOVABLE          0x08u       /* page is movable */
#define ___GFP_RECLAIMABLE      0x10u       /* page is reclaimable */
#define ___GFP_HIGH             0x20u       /* may use emergency reserves? */
#define ___GFP_IO               0x40u       /* can start physical I/O? */
#define ___GFP_FS               0x80u       /* can call into the low-level filesystem? */
#define ___GFP_COLD             0x100u      /* cache-cold page preferred */
#define ___GFP_NOWARN           0x200u      /* suppress allocation-failure warnings */
#define ___GFP_REPEAT           0x400u      /* retry the allocation; may still fail */
#define ___GFP_NOFAIL           0x800u      /* keep retrying; must not fail */
#define ___GFP_NORETRY          0x1000u     /* do not retry; may fail */
#define ___GFP_MEMALLOC         0x2000u     /* may use the emergency reserves */
#define ___GFP_COMP             0x4000u     /* add compound-page metadata */
#define ___GFP_ZERO             0x8000u     /* return zeroed pages on success */
//  Type modifiers
#define ___GFP_NOMEMALLOC       0x10000u    /* never use the emergency reserves */
#define ___GFP_HARDWALL         0x20000u    /* only allocate on nodes the task's cpuset allows */
#define ___GFP_THISNODE         0x40000u    /* no fallback nodes, no placement policy */
#define ___GFP_ATOMIC           0x80000u    /* atomic allocation; must never sleep */
#define ___GFP_ACCOUNT          0x100000u
#define ___GFP_NOTRACK          0x200000u
#define ___GFP_DIRECT_RECLAIM   0x400000u
#define ___GFP_OTHER_NODE       0x800000u
#define ___GFP_WRITE            0x1000000u
#define ___GFP_KSWAPD_RECLAIM   0x2000000u
/*
 * Physical address zone modifiers (see linux/mmzone.h - low four bits)
 *
 * Do not put any conditional on these. If necessary modify the definitions
 * without the underscores and use them consistently. The definitions here may
 * be used in bit comparisons.
 */
#define __GFP_DMA       ((__force gfp_t)___GFP_DMA)
#define __GFP_HIGHMEM   ((__force gfp_t)___GFP_HIGHMEM)
#define __GFP_DMA32     ((__force gfp_t)___GFP_DMA32)
#define __GFP_MOVABLE   ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
#define GFP_ZONEMASK    (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)

/*
 * Page mobility and placement hints
 *
 * These flags provide hints about how mobile the page is. Pages with similar
 * mobility are placed within the same pageblocks to minimise problems due
 * to external fragmentation.
 *
 * __GFP_MOVABLE (also a zone modifier) indicates that the page can be
 *   moved by page migration during memory compaction or can be reclaimed.
 *
 * __GFP_RECLAIMABLE is used for slab allocations that specify
 *   SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.
 *
 * __GFP_WRITE indicates the caller intends to dirty the page. Where possible,
 *   these pages will be spread between local zones to avoid all the dirty
 *   pages being in one zone (fair zone allocation policy).
 *
 * __GFP_HARDWALL enforces the cpuset memory allocation policy.
 *
 * __GFP_THISNODE forces the allocation to be satisified from the requested
 *   node with no fallbacks or placement policy enforcements.
 *
 * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
 */
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE  ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT   ((__force gfp_t)___GFP_ACCOUNT)

/*
 * Watermark modifiers -- controls access to emergency reserves
 *
 * __GFP_HIGH indicates that the caller is high-priority and that granting
 *   the request is necessary before the system can make forward progress.
 *   For example, creating an IO context to clean pages.
 *
 * __GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
 *   high priority. Users are typically interrupt handlers. This may be
 *   used in conjunction with __GFP_HIGH
 *
 * __GFP_MEMALLOC allows access to all memory. This should only be used when
 *   the caller guarantees the allocation will allow more memory to be freed
 *   very shortly e.g. process exiting or swapping. Users either should
 *   be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
 *
 * __GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
 *   This takes precedence over the __GFP_MEMALLOC flag if both are set.
 */
#define __GFP_ATOMIC    ((__force gfp_t)___GFP_ATOMIC)
#define __GFP_HIGH  ((__force gfp_t)___GFP_HIGH)
#define __GFP_MEMALLOC  ((__force gfp_t)___GFP_MEMALLOC)
#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)

/*
 * Reclaim modifiers
 *
 * __GFP_IO can start physical IO.
 *
 * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
 *   allocator recursing into the filesystem which might already be holding
 *   locks.
 *
 * __GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim.
 *   This flag can be cleared to avoid unnecessary delays when a fallback
 *   option is available.
 *
 * __GFP_KSWAPD_RECLAIM indicates that the caller wants to wake kswapd when
 *   the low watermark is reached and have it reclaim pages until the high
 *   watermark is reached. A caller may wish to clear this flag when fallback
 *   options are available and the reclaim is likely to disrupt the system. The
 *   canonical example is THP allocation where a fallback is cheap but
 *   reclaim/compaction may cause indirect stalls.
 *
 * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
 *
 * The default allocator behavior depends on the request size. We have a concept
 * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
 * !costly allocations are too essential to fail so they are implicitly
 * non-failing by default (with some exceptions like OOM victims might fail so
 * the caller still has to check for failures) while costly requests try to be
 * not disruptive and back off even without invoking the OOM killer.
 * The following three modifiers might be used to override some of these
 * implicit rules
 *
 * __GFP_NORETRY: The VM implementation will try only very lightweight
 *   memory direct reclaim to get some memory under memory pressure (thus
 *   it can sleep). It will avoid disruptive actions like OOM killer. The
 *   caller must handle the failure which is quite likely to happen under
 *   heavy memory pressure. The flag is suitable when failure can easily be
 *   handled at small cost, such as reduced throughput
 *
 * __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
 *   procedures that have previously failed if there is some indication
 *   that progress has been made else where.  It can wait for other
 *   tasks to attempt high level approaches to freeing memory such as
 *   compaction (which removes fragmentation) and page-out.
 *   There is still a definite limit to the number of retries, but it is
 *   a larger limit than with __GFP_NORETRY.
 *   Allocations with this flag may fail, but only when there is
 *   genuinely little unused memory. While these allocations do not
 *   directly trigger the OOM killer, their failure indicates that
 *   the system is likely to need to use the OOM killer soon.  The
 *   caller must handle failure, but can reasonably do so by failing
 *   a higher-level request, or completing it only in a much less
 *   efficient manner.
 *   If the allocation does fail, and the caller is in a position to
 *   free some non-essential memory, doing so could benefit the system
 *   as a whole.
 *
 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 *   cannot handle allocation failures. The allocation could block
 *   indefinitely but will never return with failure. Testing for
 *   failure is pointless.
 *   New users should be evaluated carefully (and the flag should be
 *   used only when there is no reasonable failure policy) but it is
 *   definitely preferable to use the flag rather than opencode endless
 *   loop around allocator.
 *   Using this flag for costly allocations is _highly_ discouraged.
 */
#define __GFP_IO    ((__force gfp_t)___GFP_IO)
#define __GFP_FS    ((__force gfp_t)___GFP_FS)
#define __GFP_DIRECT_RECLAIM    ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
#define __GFP_KSWAPD_RECLAIM    ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
#define __GFP_RETRY_MAYFAIL ((__force gfp_t)___GFP_RETRY_MAYFAIL)
#define __GFP_NOFAIL    ((__force gfp_t)___GFP_NOFAIL)
#define __GFP_NORETRY   ((__force gfp_t)___GFP_NORETRY)

/*
 * Action modifiers
 *
 * __GFP_NOWARN suppresses allocation failure reports.
 *
 * __GFP_COMP address compound page metadata.
 *
 * __GFP_ZERO returns a zeroed page on success.
 */
#define __GFP_NOWARN    ((__force gfp_t)___GFP_NOWARN)
#define __GFP_COMP  ((__force gfp_t)___GFP_COMP)
#define __GFP_ZERO  ((__force gfp_t)___GFP_ZERO)

/* Disable lockdep for GFP context tracking */
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)

/* Room for N __GFP_FOO bits */
#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/*
 * Useful GFP flag combinations that are commonly used. It is recommended
 * that subsystems start with one of these combinations and then set/clear
 * __GFP_FOO flags as necessary.
 *
 * GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
 *   watermark is applied to allow access to "atomic reserves"
 *
 * GFP_KERNEL is typical for kernel-internal allocations. The caller requires
 *   ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
 *
 * GFP_KERNEL_ACCOUNT is the same as GFP_KERNEL, except the allocation is
 *   accounted to kmemcg.
 *
 * GFP_NOWAIT is for kernel allocations that should not stall for direct
 *   reclaim, start physical IO or use any filesystem callback.
 *
 * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
 *   that do not require the starting of any physical IO.
 *   Please try to avoid using this flag directly and instead use
 *   memalloc_noio_{save,restore} to mark the whole scope which cannot
 *   perform any IO with a short explanation why. All allocation requests
 *   will inherit GFP_NOIO implicitly.
 *
 * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
 *   Please try to avoid using this flag directly and instead use
 *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
 *   recurse into the FS layer with a short explanation why. All allocation
 *   requests will inherit GFP_NOFS implicitly.
 *
 * GFP_USER is for userspace allocations that also need to be directly
 *   accessibly by the kernel or hardware. It is typically used by hardware
 *   for buffers that are mapped to userspace (e.g. graphics) that hardware
 *   still must DMA to. cpuset limits are enforced for these allocations.
 *
 * GFP_DMA exists for historical reasons and should be avoided where possible.
 *   The flags indicates that the caller requires that the lowest zone be
 *   used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but
 *   it would require careful auditing as some users really require it and
 *   others use the flag to avoid lowmem reserves in ZONE_DMA and treat the
 *   lowest zone as a type of emergency reserve.
 *
 * GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit
 *   address.
 *
 * GFP_HIGHUSER is for userspace allocations that may be mapped to userspace,
 *   do not need to be directly accessible by the kernel but that cannot
 *   move once in use. An example may be a hardware allocation that maps
 *   data directly into userspace but has no addressing limitations.
 *
 * GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not
 *   need direct access to but can use kmap() when access is required. They
 *   are expected to be movable via page reclaim or page migration. Typically,
 *   pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
 *
 * GFP_TRANSHUGE and GFP_TRANSHUGE_LIGHT are used for THP allocations. They are
 *   compound allocations that will generally fail quickly if memory is not
 *   available and will not wake kswapd/kcompactd on failure. The _LIGHT
 *   version does not attempt reclaim/compaction at all and is by default used
 *   in page fault path, while the non-light is used by khugepaged.
 */
#define GFP_ATOMIC      (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
#define GFP_NOWAIT      (__GFP_KSWAPD_RECLAIM)
#define GFP_NOIO        (__GFP_RECLAIM)
#define GFP_NOFS        (__GFP_RECLAIM | __GFP_IO)
#define GFP_TEMPORARY   (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
                         __GFP_RECLAIMABLE)
#define GFP_USER        (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_DMA         __GFP_DMA
#define GFP_DMA32       __GFP_DMA32
#define GFP_HIGHUSER    (GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)
#define GFP_TRANSHUGE   ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                         __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
                         ~__GFP_RECLAIM)

/* Convert GFP flags to their corresponding migrate type */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
#define GFP_MOVABLE_SHIFT 3
  • GFP_ATOMIC: for atomic allocations that must never sleep; it may dip into the emergency reserves. Used in interrupt handlers, bottom halves, while holding spinlocks, and anywhere else sleeping is not allowed.
  • GFP_KERNEL: the normal allocation mode; it may block, so it cannot be used in non-sleepable contexts. The kernel tries hard to satisfy the request. This should be the default choice.
  • GFP_NOWAIT: similar to GFP_ATOMIC, except that the emergency pools are not used, which makes failure more likely.
  • GFP_NOIO: the allocation may block but will not start disk I/O. During allocation the kernel may normally start I/O to swap pages out to the swap partition when memory is tight; some contexts must not start I/O and therefore use this flag.
  • GFP_NOFS: may block and may start disk I/O if necessary, but must not call into the VFS. Filesystems typically forbid VFS operations while allocating pages, otherwise writing back dirty file pages to reclaim memory under severe pressure could deadlock.
  • GFP_USER: a normal allocation that may block; typically used for memory allocated on behalf of hardware and then mapped into user space.
  • GFP_HIGHUSER: an extension of GFP_USER, also for user space; it allows high memory that cannot be mapped directly by the kernel. User processes can use highmem pages without drawbacks, because user address space is always accessed through page tables anyway.
  • GFP_HIGHUSER_MOVABLE: used like GFP_HIGHUSER, but preferring pages from ZONE_MOVABLE.

GFP_KERNEL is by far the most common mask in the kernel. It allocates from low memory first, and the pages it returns are generally used for kernel management structures and caches.
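
For contrast, a hypothetical fragment showing how the calling context dictates the mask: GFP_KERNEL may sleep and is therefore only legal in process context, while code that cannot sleep has to use GFP_ATOMIC and cope with failure:

#include <linux/gfp.h>

/* process context: sleeping is fine, so GFP_KERNEL is the usual choice */
static struct page *grab_page_process_ctx(void)
{
    return alloc_pages(GFP_KERNEL, 0);
}

/* interrupt or spinlock context: must not sleep, so use GFP_ATOMIC
 * (may dip into the atomic reserves) and tolerate a NULL return */
static struct page *grab_page_atomic_ctx(void)
{
    return alloc_pages(GFP_ATOMIC, 0);
}
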

alloc_pages() eventually calls __alloc_pages_nodemask(), the core function of the buddy allocator.

[alloc_pages->alloc_pages_node->__alloc_pages->__alloc_pages_nodemask] 
/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
                            nodemask_t *nodemask)
{
    struct page *page;
    unsigned int alloc_flags = ALLOC_WMARK_LOW;
    gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
    struct alloc_context ac = { };

    gfp_mask &= gfp_allowed_mask;
    alloc_mask = gfp_mask;
    if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
        return NULL;

    finalise_ac(gfp_mask, order, &ac);

    /* First allocation attempt */
    page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
    if (likely(page))
        goto out;

    /*
     * Apply scoped allocation constraints. This is mainly about GFP_NOFS
     * resp. GFP_NOIO which has to be inherited for all allocation requests
     * from a particular context which has been marked by
     * memalloc_no{fs,io}_{save,restore}.
     */
    alloc_mask = current_gfp_context(gfp_mask);
    ac.spread_dirty_pages = false;

    /*
     * Restore the original nodemask if it was potentially replaced with
     * &cpuset_current_mems_allowed to optimize the fast-path attempt.
     */
    if (unlikely(ac.nodemask != nodemask))
        ac.nodemask = nodemask;

    page = __alloc_pages_slowpath(alloc_mask, order, &ac);

out:
    if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
        unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
        __free_pages(page, order);
        page = NULL;
    }

    trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);

    return page;
} 

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
        int preferred_nid, nodemask_t *nodemask,
        struct alloc_context *ac, gfp_t *alloc_mask,
        unsigned int *alloc_flags)
{
    ac->high_zoneidx = gfp_zone(gfp_mask);
    ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
    ac->nodemask = nodemask;
    ac->migratetype = gfpflags_to_migratetype(gfp_mask);

    if (cpusets_enabled()) {
        *alloc_mask |= __GFP_HARDWALL;
        if (!ac->nodemask)
            ac->nodemask = &cpuset_current_mems_allowed;
        else
            *alloc_flags |= ALLOC_CPUSET;
    }

    fs_reclaim_acquire(gfp_mask);
    fs_reclaim_release(gfp_mask);

    might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

    if (should_fail_alloc_page(gfp_mask, order))
        return false;

    if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
        *alloc_flags |= ALLOC_CMA;

    return true;
}

alloc_context is an intermediate structure that caches the values computed for this allocation. gfp_zone() works out which zone to try first; the macros it uses are listed below. On an embedded ARM system MAX_NR_ZONES is 3 (ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE), so ZONES_SHIFT is 2. gfp_zone(GFP_KERNEL) evaluates to 0, i.e. high_zoneidx is 0. pglist_data->zonelist->zoneref defines the priority order of the zones to allocate from, with earlier entries having higher priority. gfpflags_to_migratetype() extracts the MIGRATE_TYPES type of the request, which decides which zone->free_area->free_list the pages are taken from first; gfpflags_to_migratetype(GFP_KERNEL) evaluates to MIGRATE_UNMOVABLE.

static inline enum zone_type gfp_zone(gfp_t flags) 
{ 
    enum zone_type z; 
    int bit = (__force int) (flags & GFP_ZONEMASK); 
    z = (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
    ((1 << ZONES_SHIFT) - 1); 
    return z; 
}

#define GFP_ZONEMASK (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE) 
#define GFP_ZONE_TABLE ( \ 
 (ZONE_NORMAL << 0 * ZONES_SHIFT) \ 
 | (OPT_ZONE_DMA << ___GFP_DMA * ZONES_SHIFT) \ 
 | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * ZONES_SHIFT) \ 
 | (OPT_ZONE_DMA32 << ___GFP_DMA32 * ZONES_SHIFT) \ 
 | (ZONE_NORMAL << ___GFP_MOVABLE * ZONES_SHIFT) \ 
 | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * ZONES_SHIFT) \ 
 | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * ZONES_SHIFT) \ 
 | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * ZONES_SHIFT) \ 
) 
#if MAX_NR_ZONES < 2 
#define ZONES_SHIFT 0 
#elif MAX_NR_ZONES <= 2 
#define ZONES_SHIFT 1 
#elif MAX_NR_ZONES <= 4 
#define ZONES_SHIFT 2
#endif
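
To see the table lookup in action, here is a small user-space re-implementation under the assumptions used above (an ARM-style configuration without ZONE_DMA/ZONE_DMA32, MAX_NR_ZONES = 3, ZONES_SHIFT = 2); it only reproduces the arithmetic and is not kernel code:

#include <stdio.h>

/* assumed ARM-style config: no ZONE_DMA/ZONE_DMA32, HIGHMEM enabled */
enum { ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

#define ZONES_SHIFT        2
#define OPT_ZONE_DMA       ZONE_NORMAL      /* CONFIG_ZONE_DMA disabled */
#define OPT_ZONE_DMA32     ZONE_NORMAL      /* CONFIG_ZONE_DMA32 disabled */
#define OPT_ZONE_HIGHMEM   ZONE_HIGHMEM

#define ___GFP_DMA      0x01u
#define ___GFP_HIGHMEM  0x02u
#define ___GFP_DMA32    0x04u
#define ___GFP_MOVABLE  0x08u
#define GFP_ZONEMASK    (___GFP_DMA | ___GFP_HIGHMEM | ___GFP_DMA32 | ___GFP_MOVABLE)

#define GFP_ZONE_TABLE ( \
      ((unsigned long)ZONE_NORMAL      << 0 * ZONES_SHIFT)                                  \
    | ((unsigned long)OPT_ZONE_DMA     << ___GFP_DMA * ZONES_SHIFT)                         \
    | ((unsigned long)OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * ZONES_SHIFT)                     \
    | ((unsigned long)OPT_ZONE_DMA32   << ___GFP_DMA32 * ZONES_SHIFT)                       \
    | ((unsigned long)ZONE_NORMAL      << ___GFP_MOVABLE * ZONES_SHIFT)                     \
    | ((unsigned long)OPT_ZONE_DMA     << (___GFP_MOVABLE | ___GFP_DMA) * ZONES_SHIFT)      \
    | ((unsigned long)ZONE_MOVABLE     << (___GFP_MOVABLE | ___GFP_HIGHMEM) * ZONES_SHIFT)  \
    | ((unsigned long)OPT_ZONE_DMA32   << (___GFP_MOVABLE | ___GFP_DMA32) * ZONES_SHIFT))

static unsigned int toy_gfp_zone(unsigned int flags)
{
    unsigned int bit = flags & GFP_ZONEMASK;
    return (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) & ((1 << ZONES_SHIFT) - 1);
}

int main(void)
{
    /* GFP_KERNEL sets no zone bits, so it is modelled here as 0 */
    printf("GFP_KERNEL              -> zone %u\n", toy_gfp_zone(0));                  /* 0: ZONE_NORMAL */
    printf("__GFP_HIGHMEM           -> zone %u\n", toy_gfp_zone(___GFP_HIGHMEM));     /* 1: ZONE_HIGHMEM */
    printf("__GFP_HIGHMEM | MOVABLE -> zone %u\n",
           toy_gfp_zone(___GFP_HIGHMEM | ___GFP_MOVABLE));                            /* 2: ZONE_MOVABLE */
    return 0;
}
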

[__alloc_pages_nodemask->finalise_ac->first_zones_zonelist->next_zones_zonelist->__next_zones_zonelist]

first_zones_zonelist computes preferred_zoneref, the zone from which allocation will start, based on high_zoneidx. The method is simple: walk the zones in the order given by pglist_data->zonelist->zoneref and pick the first one whose index is less than or equal to high_zoneidx, also honouring nodemask, since the zonerefs may come from different nodes. The work is ultimately done by __next_zones_zonelist. gfp_zone(GFP_KERNEL) is 0, so GFP_KERNEL can only allocate from the zone with zone_idx 0. The zone types are defined in enum zone_type; an embedded ARM system defines ZONE_NORMAL, ZONE_HIGHMEM and ZONE_MOVABLE, laid out in memory in enum order, so zone_idx(zone) yields 0 for ZONE_NORMAL, 1 for ZONE_HIGHMEM and 2 for ZONE_MOVABLE. Hence the GFP_KERNEL mask only ever allocates pages from ZONE_NORMAL.

enum zone_type {
#ifdef CONFIG_ZONE_DMA
  ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
  ZONE_DMA32,
#endif
  ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
  ZONE_HIGHMEM,
#endif
  ZONE_MOVABLE,
  __MAX_NR_ZONES
};

/*
 * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
 */
#define zone_idx(zone)      ((zone) - (zone)->zone_pgdat->node_zones)

/* Returns the next zone at or below highest_zoneidx in a zonelist */
struct zoneref *__next_zones_zonelist(struct zoneref *z,
                    enum zone_type highest_zoneidx,
                    nodemask_t *nodes)
{
    /*
     * Find the next suitable zone to use for the allocation.
     * Only filter based on nodemask if it's set
     */
    if (unlikely(nodes == NULL))
        while (zonelist_zone_idx(z) > highest_zoneidx)
            z++;
    else
        while (zonelist_zone_idx(z) > highest_zoneidx ||
                (z->zone && !zref_in_nodemask(z, nodes)))
            z++;

    return z;
}
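
A condensed user-space analogue of that walk (hypothetical, ignoring the nodemask filtering): given zonerefs in preference order, skip every entry whose zone index is above highest_zoneidx:

#include <stdio.h>

struct toy_zoneref {
    const char *name;
    int zone_idx;       /* 0 = ZONE_NORMAL, 1 = ZONE_HIGHMEM, 2 = ZONE_MOVABLE */
};

/* preference order as a zonelist might lay it out: highest zone first */
static struct toy_zoneref zonelist[] = {
    { "ZONE_MOVABLE", 2 },
    { "ZONE_HIGHMEM", 1 },
    { "ZONE_NORMAL",  0 },
};

/* return the first zoneref whose index is <= highest_zoneidx */
static struct toy_zoneref *first_usable(struct toy_zoneref *z, int highest_zoneidx)
{
    while (z->zone_idx > highest_zoneidx)
        z++;
    return z;
}

int main(void)
{
    /* gfp_zone(GFP_KERNEL) == 0, so only ZONE_NORMAL qualifies */
    struct toy_zoneref *z = first_usable(zonelist, 0);
    printf("GFP_KERNEL starts allocating from %s\n", z->name);
    return 0;
}
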

[__alloc_pages_nodemask->get_page_from_freelist]

Once the preferred zone is known, get_page_from_freelist actually allocates pages from it. The function is one big loop that starts with the preferred zone:

  • If cpusets are enabled, check whether the current task is allowed to allocate from this zone; if not, skip it.
  • Check the zone's free-page watermark; if the level is below the configured threshold, skip the zone. The threshold is derived from zone->watermark, lowmem_reserve and the alloc_flags computed from gfp_mask. If the remaining free pages fall below it, give up on this zone; otherwise, for order == 0 the check succeeds immediately, while for order > 0 the buddy lists must also contain a suitably sized contiguous block. The kernel defines three watermarks, WMARK_MIN, WMARK_LOW and WMARK_HIGH, whose values are computed at initialization. The normal allocation path checks WMARK_LOW, while the kswapd reclaim thread works against WMARK_HIGH. The check is ultimately performed by __zone_watermark_ok.
  • If the watermark check passes, allocate from the buddy system; if not, try to reclaim memory first and then retry allocation from this zone.
  • If the zone really cannot satisfy the request, move on to the next zone in the zonelist with zone_idx <= highest_zoneidx and repeat the steps above.
#define for_next_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
    for (zone = z->zone;    \
        zone;                           \
        z = next_zones_zonelist(++z, highidx, nodemask),    \
            zone = zonelist_zone(z))

/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                        const struct alloc_context *ac)
{
    struct zoneref *z = ac->preferred_zoneref;
    struct zone *zone;
    struct pglist_data *last_pgdat_dirty_limit = NULL;

    /*
     * Scan zonelist, looking for a zone with enough free.
     * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
     */
    for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                ac->nodemask) {
        struct page *page;
        unsigned long mark;

        if (cpusets_enabled() &&
            (alloc_flags & ALLOC_CPUSET) &&
            !__cpuset_zone_allowed(zone, gfp_mask))
                continue;
        /*
         * When allocating a page cache page for writing, we
         * want to get it from a node that is within its dirty
         * limit, such that no single node holds more than its
         * proportional share of globally allowed dirty pages.
         * The dirty limits take into account the node's
         * lowmem reserves and high watermark so that kswapd
         * should be able to balance it without having to
         * write pages from its LRU list.
         *
         * XXX: For now, allow allocations to potentially
         * exceed the per-node dirty limit in the slowpath
         * (spread_dirty_pages unset) before going into reclaim,
         * which is important when on a NUMA setup the allowed
         * nodes are together not big enough to reach the
         * global limit.  The proper fix for these situations
         * will require awareness of nodes in the
         * dirty-throttling and the flusher threads.
         */
        if (ac->spread_dirty_pages) {
            if (last_pgdat_dirty_limit == zone->zone_pgdat)
                continue;

            if (!node_dirty_ok(zone->zone_pgdat)) {
                last_pgdat_dirty_limit = zone->zone_pgdat;
                continue;
            }
        }

        mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
        if (!zone_watermark_fast(zone, order, mark,
                       ac_classzone_idx(ac), alloc_flags)) {
            int ret;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
            /*
             * Watermark failed for this zone, but see if we can
             * grow this zone if it contains deferred pages.
             */
            if (static_branch_unlikely(&deferred_pages)) {
                if (_deferred_grow_zone(zone, order))
                    goto try_this_zone;
            }
#endif
            /* Checked here to keep the fast path fast */
            BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
            if (alloc_flags & ALLOC_NO_WATERMARKS)
                goto try_this_zone;

            if (node_reclaim_mode == 0 ||
                !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
                continue;

            ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
            switch (ret) {
            case NODE_RECLAIM_NOSCAN:
                /* did not scan */
                continue;
            case NODE_RECLAIM_FULL:
                /* scanned but unreclaimable */
                continue;
            default:
                /* did we reclaim enough */
                if (zone_watermark_ok(zone, order, mark,
                        ac_classzone_idx(ac), alloc_flags))
                    goto try_this_zone;

                continue;
            }
        }

try_this_zone:
        page = rmqueue(ac->preferred_zoneref->zone, zone, order,
                gfp_mask, alloc_flags, ac->migratetype);
        if (page) {
            prep_new_page(page, order, gfp_mask, alloc_flags);

            /*
             * If this is a high-order atomic allocation then check
             * if the pageblock should be reserved for the future
             */
            if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
                reserve_highatomic_pageblock(page, zone, order);

            return page;
        } else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
            /* Try again if zone has deferred pages */
            if (static_branch_unlikely(&deferred_pages)) {
                if (_deferred_grow_zone(zone, order))
                    goto try_this_zone;
            }
#endif
        }
    }

    return NULL;
}

/*
 * Return true if free base pages are above 'mark'. For high-order checks it
 * will return true of the order-0 watermark is reached and there is at least
 * one free page of a suitable size. Checking now avoids taking the zone lock
 * to check in the allocation paths if no pages are free.
 */
bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
             int classzone_idx, unsigned int alloc_flags,
             long free_pages)
{
    long min = mark;
    int o;
    const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));

    /* free_pages may go negative - that's OK */
    free_pages -= (1 << order) - 1;

    if (alloc_flags & ALLOC_HIGH)
        min -= min / 2;

    /*
     * If the caller does not have rights to ALLOC_HARDER then subtract
     * the high-atomic reserves. This will over-estimate the size of the
     * atomic reserve but it avoids a search.
     */
    if (likely(!alloc_harder)) {
        free_pages -= z->nr_reserved_highatomic;
    } else {
        /*
         * OOM victims can try even harder than normal ALLOC_HARDER
         * users on the grounds that it's definitely going to be in
         * the exit path shortly and free memory. Any allocation it
         * makes during the free path will be small and short-lived.
         */
        if (alloc_flags & ALLOC_OOM)
            min -= min / 2;
        else
            min -= min / 4;
    }


#ifdef CONFIG_CMA
    /* If allocation can't use CMA areas don't use free CMA pages */
    if (!(alloc_flags & ALLOC_CMA))
        free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

    /*
     * Check watermarks for an order-0 allocation request. If these
     * are not met, then a high-order request also cannot go ahead
     * even if a suitable page happened to be free.
     */
    if (free_pages <= min + z->lowmem_reserve[classzone_idx])
        return false;

    /* If this is an order-0 request then the watermark is fine */
    if (!order)
        return true;

    /* For a high-order request, check at least one suitable page is free */
    for (o = order; o < MAX_ORDER; o++) {
        struct free_area *area = &z->free_area[o];
        int mt;

        if (!area->nr_free)
            continue;

        for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
            if (!list_empty(&area->free_list[mt]))
                return true;
        }

#ifdef CONFIG_CMA
        if ((alloc_flags & ALLOC_CMA) &&
            !list_empty(&area->free_list[MIGRATE_CMA])) {
            return true;
        }
#endif
        if (alloc_harder &&
            !list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
            return true;
    }
    return false;
}
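
To make the arithmetic concrete, here is a small user-space mirror of the watermark threshold comparison with made-up numbers (the part before the high-order free-list scan; the real code additionally handles ALLOC_HIGH/ALLOC_HARDER and CMA pages):

#include <stdio.h>
#include <stdbool.h>

/*
 * Mirrors the threshold part of __zone_watermark_ok(): subtract the pages the
 * request would consume (minus one) and require the remainder to stay above
 * mark + lowmem_reserve[classzone_idx].
 */
static bool toy_watermark_ok(long free_pages, unsigned int order,
                             long mark, long lowmem_reserve)
{
    free_pages -= (1L << order) - 1;
    return free_pages > mark + lowmem_reserve;
}

int main(void)
{
    /* e.g. WMARK_LOW = 1024 pages, 256 pages reserved for lower zones */
    printf("%d\n", toy_watermark_ok(1500, 0, 1024, 256)); /* 1: 1500 > 1280 */
    printf("%d\n", toy_watermark_ok(1500, 3, 1024, 256)); /* 1: 1493 > 1280 */
    printf("%d\n", toy_watermark_ok(1200, 0, 1024, 256)); /* 0: 1200 <= 1280 */
    return 0;
}
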

[__alloc_pages_nodemask->get_page_from_freelist->rmqueue]

/*
 * Allocate a page from the given zone. Use pcplists for order-0 allocations.
 */
static inline
struct page *rmqueue(struct zone *preferred_zone,
            struct zone *zone, unsigned int order,
            gfp_t gfp_flags, unsigned int alloc_flags,
            int migratetype)
{
    unsigned long flags;
    struct page *page;

    if (likely(order == 0)) {
        page = rmqueue_pcplist(preferred_zone, zone, order,
                gfp_flags, migratetype);
        goto out;
    }

    /*
     * We most definitely don't want callers attempting to
     * allocate greater than order-1 page units with __GFP_NOFAIL.
     */
    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
    spin_lock_irqsave(&zone->lock, flags);

    do {
        page = NULL;
        if (alloc_flags & ALLOC_HARDER) {
            page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
            if (page)
                trace_mm_page_alloc_zone_locked(page, order, migratetype);
        }
        if (!page)
            page = __rmqueue(zone, order, migratetype);
    } while (page && check_new_pages(page, order));
    spin_unlock(&zone->lock);
    if (!page)
        goto failed;
    __mod_zone_freepage_state(zone, -(1 << order),
                  get_pcppage_migratetype(page));

    __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
    zone_statistics(preferred_zone, zone);
    local_irq_restore(flags);

out:
    VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
    return page;

failed:
    local_irq_restore(flags);
    return NULL;
}

/*
 * Go through the free lists for the given migratetype and remove
 * the smallest available page from the freelists
 */
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                        int migratetype)
{
    unsigned int current_order;
    struct free_area *area;
    struct page *page;

    /* Find a page of the appropriate size in the preferred list */
    for (current_order = order; current_order < MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]);
        page = list_first_entry_or_null(&area->free_list[migratetype],
                            struct page, lru);
        if (!page)
            continue;
        list_del(&page->lru);
        rmv_page_order(page);
        area->nr_free--;
        expand(zone, page, order, current_order, area, migratetype);
        set_pcppage_migratetype(page, migratetype);
        return page;
    }

    return NULL;
}

rmqueue is the function that actually allocates the memory:

  • It first calls rmqueue_pcplist to allocate order-0 pages from the per-CPU page cache.
  • Otherwise __rmqueue_smallest takes pages directly from the buddy lists; it simply scans the free lists for a suitable block and, if none is found, bails out without any complicated reclaim handling.
  • If __rmqueue_smallest fails, __rmqueue is called. It tries hard to find a suitable block for the requested migratetype, and if that type has nothing available it falls back to the other free lists in the order defined by the fallbacks table.

Once a suitable block has been found it is removed from the buddy system, and expand() carves out the requested size; the remainder is split into 2^N-sized pieces and returned to the buddy system, preferring to hand back the largest possible (highest-order) pieces.
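
A simplified user-space sketch of that splitting step (hypothetical; the kernel's expand() also records the page order and handles debug guard pages): at every step the unused upper half of the block goes back onto the next lower free list.

#include <stdio.h>

/*
 * Split a 2^high block starting at 'pfn' down to a 2^low allocation:
 * the upper half (starting at pfn + size) is returned to the free list of
 * the next lower order, and we keep working on the lower half.
 */
static void toy_expand(unsigned long pfn, unsigned int low, unsigned int high)
{
    unsigned long size = 1UL << high;

    while (high > low) {
        high--;
        size >>= 1;
        printf("return pfn [%lu, %lu) to the order-%u free list\n",
               pfn + size, pfn + 2 * size, high);
    }
    printf("pages [%lu, %lu) are handed to the caller\n", pfn, pfn + size);
}

int main(void)
{
    /* an order-5 block at pfn 4096 is used to satisfy an order-2 request */
    toy_expand(4096, 2, 5);
    return 0;
}
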

Back in __alloc_pages_nodemask: if get_page_from_freelist fails, __alloc_pages_slowpath continues the attempt. Depending on whether __GFP_KSWAPD_RECLAIM is set in gfp_mask, it wakes the kswapd kernel thread to reclaim pages and then calls get_page_from_freelist again; if that still fails it frees memory through direct reclaim and page migration and retries. If even that fails and __GFP_NOFAIL is set, the current task keeps retrying (and may block) until the allocation succeeds; otherwise the allocation fails and NULL is returned.

Freeing pages

The core entry point for freeing pages is free_pages(), which ultimately calls __free_pages(). __free_pages() distinguishes two cases: order 0 gets special treatment, while order greater than 0 goes through the normal path.

void __free_pages(struct page *page, unsigned int order)
{
    if (put_page_testzero(page)) {
        if (order == 0)
            free_unref_page(page);
        else
            __free_pages_ok(page, order);
    }
}

The core of freeing is returning the pages to the buddy system and checking whether the freed block can be merged with its neighbouring buddy. If so, the merged block is linked into the list of the next higher order, and the process repeats until no further merging is possible.

list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);

Finally the block is linked into the free list.
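
Finding the buddy is a simple XOR on the page frame number, and the merged block starts at the lower of the two pfns; a stand-alone version of the calculation done by __find_buddy_pfn() looks like this (illustrative only):

#include <stdio.h>

/* the buddy of the 2^order block at 'pfn' differs from it only in bit 'order' */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

int main(void)
{
    unsigned long pfn = 0x1A4;                 /* an order-2 block at pfn 0x1A4 */
    unsigned long buddy = find_buddy_pfn(pfn, 2);
    unsigned long combined = buddy & pfn;      /* start of the merged order-3 block */

    printf("block 0x%lx, buddy 0x%lx, merged order-3 block starts at 0x%lx\n",
           pfn, buddy, combined);
    return 0;
}
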

/*
 * Freeing function for a buddy system allocator.
 *
 * The concept of a buddy system is to maintain direct-mapped table
 * (containing bit values) for memory blocks of various "orders".
 * The bottom level table contains the map for the smallest allocatable
 * units of memory (here, pages), and each level above it describes
 * pairs of units from the levels below, hence, "buddies".
 * At a high level, all that happens here is marking the table entry
 * at the bottom level available, and propagating the changes upward
 * as necessary, plus some accounting needed to play nicely with other
 * parts of the VM system.
 * At each level, we keep a list of pages, which are heads of continuous
 * free pages of length of (1 << order) and marked with _mapcount
 * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
 * field.
 * So when we are allocating or freeing one, we can derive the state of the
 * other.  That is, if we allocate a small block, and both were
 * free, the remainder of the region must be split into blocks.
 * If a block is freed, and its buddy is also free, then this
 * triggers coalescing into a block of larger size.
 *
 * -- nyc
 */

static inline void __free_one_page(struct page *page,
        unsigned long pfn,
        struct zone *zone, unsigned int order,
        int migratetype)
{
    unsigned long combined_pfn;
    unsigned long uninitialized_var(buddy_pfn);
    struct page *buddy;
    unsigned int max_order;

    max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);

    VM_BUG_ON(!zone_is_initialized(zone));
    VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);

    VM_BUG_ON(migratetype == -1);
    if (likely(!is_migrate_isolate(migratetype)))
        __mod_zone_freepage_state(zone, 1 << order, migratetype);

    VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
    VM_BUG_ON_PAGE(bad_range(zone, page), page);

continue_merging:
    while (order < max_order - 1) {
        buddy_pfn = __find_buddy_pfn(pfn, order);
        buddy = page + (buddy_pfn - pfn);

        if (!pfn_valid_within(buddy_pfn))
            goto done_merging;
        if (!page_is_buddy(page, buddy, order))
            goto done_merging;
        /*
         * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
         * merge with it and move up one order.
         */
        if (page_is_guard(buddy)) {
            clear_page_guard(zone, buddy, order, migratetype);
        } else {
            list_del(&buddy->lru);
            zone->free_area[order].nr_free--;
            rmv_page_order(buddy);
        }
        combined_pfn = buddy_pfn & pfn;
        page = page + (combined_pfn - pfn);
        pfn = combined_pfn;
        order++;
    }
    if (max_order < MAX_ORDER) {
        /* If we are here, it means order is >= pageblock_order.
         * We want to prevent merge between freepages on isolate
         * pageblock and normal pageblock. Without this, pageblock
         * isolation could cause incorrect freepage or CMA accounting.
         *
         * We don't want to hit this code for the more frequent
         * low-order merging.
         */
        if (unlikely(has_isolate_pageblock(zone))) {
            int buddy_mt;

            buddy_pfn = __find_buddy_pfn(pfn, order);
            buddy = page + (buddy_pfn - pfn);
            buddy_mt = get_pageblock_migratetype(buddy);

            if (migratetype != buddy_mt
                    && (is_migrate_isolate(migratetype) ||
                        is_migrate_isolate(buddy_mt)))
                goto done_merging;
        }
        max_order++;
        goto continue_merging;
    }

done_merging:
    set_page_order(page, order);

    /*
     * If this is not the largest possible page, check if the buddy
     * of the next-highest order is free. If it is, it's possible
     * that pages are being freed that will coalesce soon. In case,
     * that is happening, add the free page to the tail of the list
     * so it's less likely to be used soon and more likely to be merged
     * as a higher order page
     */
    if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
        struct page *higher_page, *higher_buddy;
        combined_pfn = buddy_pfn & pfn;
        higher_page = page + (combined_pfn - pfn);
        buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
        higher_buddy = higher_page + (buddy_pfn - combined_pfn);
        if (pfn_valid_within(buddy_pfn) &&
            page_is_buddy(higher_page, higher_buddy, order + 1)) {
            list_add_tail(&page->lru,
                &zone->free_area[order].free_list[migratetype]);
            goto out;
        }
    }

    list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
out:
    zone->free_area[order].nr_free++;
}

For order 0 the kernel takes a shortcut. Each zone has a zone->pageset per-CPU variable of type struct per_cpu_pageset. When an order-0 page is freed it first goes onto the corresponding list in per_cpu_pages, which keeps pages on the CPU that last used them and avoids constantly thrashing the caches. The list cannot grow forever, though: the code checks the number of pages on it, and once it exceeds a threshold some of the pages are handed back to the buddy system proper, ultimately again through __free_one_page.

[free_unref_page_commit->free_pcppages_bulk->__free_one_page]

struct per_cpu_pages {
    int count;      /* number of pages in the list */
    int high;       /* high watermark, emptying needed */
    int batch;      /* chunk size for buddy add/remove */

    /* Lists of pages, one per migrate type stored on the pcp-lists */
    struct list_head lists[MIGRATE_PCPTYPES];
};
  • count is the number of pages currently held on this CPU's per_cpu_pages lists.
  • high is the watermark: when the number of cached pages exceeds it, pages are returned to the buddy system.
  • batch is how many pages are returned to the buddy system at a time.
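
A toy model of that trimming behaviour (hypothetical; the real path is free_unref_page_commit() parking pages on the pcp list and calling free_pcppages_bulk() when it grows too large):

#include <stdio.h>

struct toy_pcp {
    int count;    /* pages currently cached on this CPU */
    int high;     /* trim threshold */
    int batch;    /* how many pages to give back at a time */
};

/* stand-in for returning 'nr' pages to the buddy free lists */
static void give_back_to_buddy(int nr)
{
    printf("returning %d pages to the buddy system\n", nr);
}

/* free one order-0 page to the per-CPU cache, trimming it when it grows too big */
static void toy_free_order0(struct toy_pcp *pcp)
{
    pcp->count++;                      /* page is parked on the pcp list */
    if (pcp->count >= pcp->high) {
        give_back_to_buddy(pcp->batch);
        pcp->count -= pcp->batch;
    }
}

int main(void)
{
    struct toy_pcp pcp = { .count = 0, .high = 6, .batch = 3 };
    for (int i = 0; i < 10; i++)
        toy_free_order0(&pcp);
    printf("pages still cached on this CPU: %d\n", pcp.count);
    return 0;
}
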
/*
 * Frees a number of pages from the PCP lists
 * Assumes all pages on list are in same zone, and of same order.
 * count is the number of pages to free.
 *
 * If the zone was previously in an "all pages pinned" state then look to
 * see if this freeing clears that state.
 *
 * And clear the zone's pages_scanned counter, to hold off the "all pages are
 * pinned" detection logic.
 */
static void free_pcppages_bulk(struct zone *zone, int count,
                    struct per_cpu_pages *pcp)
{
    int migratetype = 0;
    int batch_free = 0;
    int prefetch_nr = 0;
    bool isolated_pageblocks;
    struct page *page, *tmp;
    LIST_HEAD(head);

    while (count) {
        struct list_head *list;

        /*
         * Remove pages from lists in a round-robin fashion. A
         * batch_free count is maintained that is incremented when an
         * empty list is encountered.  This is so more pages are freed
         * off fuller lists instead of spinning excessively around empty
         * lists
         */
        do {
            batch_free++;
            if (++migratetype == MIGRATE_PCPTYPES)
                migratetype = 0;
            list = &pcp->lists[migratetype];
        } while (list_empty(list));

        /* This is the only non-empty list. Free them all. */
        if (batch_free == MIGRATE_PCPTYPES)
            batch_free = count;

        do {
            page = list_last_entry(list, struct page, lru);
            /* must delete to avoid corrupting pcp list */
            list_del(&page->lru);
            pcp->count--;

            if (bulkfree_pcp_prepare(page))
                continue;

            list_add_tail(&page->lru, &head);

            /*
             * We are going to put the page back to the global
             * pool, prefetch its buddy to speed up later access
             * under zone->lock. It is believed the overhead of
             * an additional test and calculating buddy_pfn here
             * can be offset by reduced memory latency later. To
             * avoid excessive prefetching due to large count, only
             * prefetch buddy for the first pcp->batch nr of pages.
             */
            if (prefetch_nr++ < pcp->batch)
                prefetch_buddy(page);
        } while (--count && --batch_free && !list_empty(list));
    }

    spin_lock(&zone->lock);
    isolated_pageblocks = has_isolate_pageblock(zone);

    /*
     * Use safe version since after __free_one_page(),
     * page->lru.next will not point to original list.
     */
    list_for_each_entry_safe(page, tmp, &head, lru) {
        int mt = get_pcppage_migratetype(page);
        /* MIGRATE_ISOLATE page should not go to pcplists */
        VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
        /* Pageblock could have been isolated meanwhile */
        if (unlikely(isolated_pageblocks))
            mt = get_pageblock_migratetype(page);

        __free_one_page(page, page_to_pfn(page), zone, 0, mt);
        trace_mm_page_pcpu_drain(page, 0, mt);
    }
    spin_unlock(&zone->lock);
}

Some of the figures in this article are adapted from material found on the Internet; thanks to their original authors.
