Collective Communication Algorithm Selection in UCC

Posted on 2022-07-29, updated on 2025-03-13, category: UCC

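UCC (Unified Collective Communication) is a collective communication library within UCF (Unified Communication Framework) that offers a rich set of features and APIs. This post covers only how UCC selects collective communication algorithms. My understanding is still incomplete, so treat this as a placeholder; once I know UCC more thoroughly I plan to cover it in detail across several posts.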

UCC offers a fairly limited choice of algorithms for each collective operation. All available algorithms can be listed with the command ucc_info -A:

[rdma22@admin1 ~]$ ucc_info -A
cl/hier algorithms:
  Allreduce
    0 :              rab : intra-node reduce, followed by inter-node allreduce, followed by innode broadcast
    1 :       split_rail : intra-node reduce_scatter, followed by PPN concurrent  inter-node allreduces, followed by intra-node allgather
  Alltoall
    0 :       node_split : splitting alltoall into two concurrent a2av calls withing the node and outside of it
  Alltoallv
    0 :       node_split : splitting alltoallv into two concurrent a2av calls withing the node and outside of it

tl/ucp algorithms:
  Allgather
    0 :             ring : O(N) Ring
  Allgatherv
    0 :             ring : O(N) Ring
  Allreduce
    0 :          knomial : recursive knomial with arbitrary radix (optimized for latency)
    1 :      sra_knomial : recursive knomial scatter-reduce followed by knomial allgather (optimized for BW)
  Alltoall
    0 :         pairwise : pairwise two-sided implementation
    1 :         onesided : naive, linear one-sided implementation
  Alltoallv
    0 :         pairwise : O(N) pairwise exchange with adjustable number of outstanding sends/recvs
  Barrier
    0 :          knomial : recursive knomial with arbitrary radix
  Bcast
    0 :          knomial : bcast over knomial tree with arbitrary radix (optimized for latency)
    1 :      sag_knomial : recursive knomial scatter followed by knomial allgather (optimized for BW)
  Fanin
    0 :          knomial : fanin over knomial tree with arbitrary radix
  Fanout
    0 :          knomial : fanout over knomial tree with arbitrary radix
  Gather
    0 :          knomial : gather over knomial tree with arbitrary radix (optimized for latency)
  Reduce
    0 :          knomial : reduce over knomial tree with arbitrary radix (optimized for latency)
  Reduce_scatter
    0 :             ring : O(N) ring
  Reduce_scatterv
    0 :             ring : O(N) ring

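UCC's collectives are organized into two layers, the Transport Layer (TL) and the Collective Layer (CL), which split a collective operation into two parts: the "programming model" and the "communication the collective operation needs". The TL layer is essentially where the collective operations are implemented with different algorithms, with a separate implementation for each underlying backend (ucp, sharp, cuda, nccl). The CL layer builds on the operations implemented by the TLs to provide features such as hierarchical collectives. At present UCC has only one such module, Hier, plus a Basic module that offers the same basic functionality as the TLs.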

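UCC selects modules and algorithms through coll_score. As I understand it, a score is effectively the priority of a module (such as ucp, cuda, or nccl). A two-dimensional score_map[COLL_TYPE_NUM][MEM_TYPE_NUM] decides which module implements a given collective operation for a given memory type, and each module can additionally be given a policy for choosing the specific algorithm. I found some material in their GitHub repository, but I suspect my understanding is still not quite accurate: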
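To make the score-map idea concrete, here is a minimal sketch in C. It is not UCC's implementation: the enums, struct names, the candidate limit, and the numeric score 10 used in the usage note are all hypothetical, and a real selection would also need to account for message size and team size, which this sketch ignores. It only shows the lookup described above: for each (collective, memory type) pair, the component with the highest score wins, and a score of 0 means the component is disabled.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for UCC's collective and memory type enums. */
enum coll_type { COLL_ALLREDUCE, COLL_BCAST, COLL_TYPE_NUM };
enum mem_type  { MEM_HOST, MEM_CUDA, MEM_TYPE_NUM };

#define MAX_CANDIDATES 4
#define SCORE_INF      LONG_MAX   /* assumption: "inf" modeled as the largest score */

/* One candidate: a component name and its score (priority). */
struct candidate {
    const char *component;
    long        score;
};

/* Candidates registered for every (collective, memory type) pair. */
struct score_map {
    struct candidate cand[COLL_TYPE_NUM][MEM_TYPE_NUM][MAX_CANDIDATES];
    size_t           count[COLL_TYPE_NUM][MEM_TYPE_NUM];
};

/* Register a component for one pair. Returns 0 on success, -1 if the slot is full. */
static int score_map_add(struct score_map *m, enum coll_type c, enum mem_type t,
                         const char *component, long score)
{
    size_t n = m->count[c][t];
    if (n >= MAX_CANDIDATES) {
        return -1;
    }
    m->cand[c][t][n].component = component;
    m->cand[c][t][n].score     = score;
    m->count[c][t]             = n + 1;
    return 0;
}

/* Return the highest-scoring component for the pair, or NULL when nothing
 * is registered or every candidate has score 0 (disabled). */
static const char *score_map_lookup(const struct score_map *m,
                                    enum coll_type c, enum mem_type t)
{
    const char *best       = NULL;
    long        best_score = 0;
    for (size_t i = 0; i < m->count[c][t]; i++) {
        if (m->cand[c][t][i].score > best_score) {
            best_score = m->cand[c][t][i].score;
            best       = m->cand[c][t][i].component;
        }
    }
    return best;
}
```

For example, after registering "tl/ucp" with score 10 and "tl/nccl" with SCORE_INF for (COLL_ALLREDUCE, MEM_CUDA) in a zero-initialized map, score_map_lookup returns "tl/nccl" for that pair and NULL for any pair with no candidates.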

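At run time the selection policy can be set either through environment variables on the command line or through the API. Examples using environment variables (these also appear in the first link above):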

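UCC_TL_NCCL_TUNE=0 : do not use NCCL in the TL layer
UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0 : force NCCL for Allreduce on CUDA memory, and disable NCCL for all Alltoall operations
UCC_TL_UCP_TUNE=bcast:0-4K:cuda:0#bcast:65k-1M:[25-100]:cuda:inf : disable UCP for Bcast on CUDA memory with messages of 0 to 4 KB, and force UCP for Bcast on CUDA memory with messages of 65 KB to 1 MB when 25 to 100 processes take part
UCC_TL_UCP_TUNE=allreduce:0-4K:@0#allreduce:4K-inf:@sra_knomial : use algorithm 0 (knomial) for Allreduce with messages of 0 to 4 KB, and algorithm 1 (sra_knomial) for Allreduce with messages above 4 KB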
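The structure of these strings can be read off the examples: sections are separated by '#', and the fields inside a section by ':'. The sketch below splits a string that way and sorts each field into a slot using simple heuristics inferred only from the examples above (a leading '@' marks an algorithm, '[' a team-size range, a plain number or "inf" a score, and so on). It is an illustration of the syntax, not UCC's own parser; the struct and function names are hypothetical and the real grammar may accept more forms.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for one '#'-separated section of a TUNE string. */
struct tune_section {
    char coll[32];        /* collective name, e.g. "allreduce"            */
    char msg_range[32];   /* message size range, e.g. "0-4K"              */
    char team_range[32];  /* team size range with brackets, "[25-100]"    */
    char mem[16];         /* memory type, e.g. "cuda"                     */
    char score[16];       /* "0", another number, or "inf"                */
    char alg[32];         /* algorithm id or name, without the '@'        */
};

/* Copy n bytes of src into dst, truncating to fit, and NUL-terminate. */
static void copy_tok(char *dst, size_t dst_len, const char *src, size_t n)
{
    if (n >= dst_len) {
        n = dst_len - 1;
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* A score is "inf" or a run of digits only. */
static int is_score(const char *tok, size_t n)
{
    if (n == 3 && strncmp(tok, "inf", 3) == 0) {
        return 1;
    }
    for (size_t i = 0; i < n; i++) {
        if (!isdigit((unsigned char)tok[i])) {
            return 0;
        }
    }
    return n > 0;
}

/* Heuristic classification of one non-empty field (assumptions, see lead-in). */
static void classify(struct tune_section *s, const char *tok, size_t n, int index)
{
    if (tok[0] == '@') {
        copy_tok(s->alg, sizeof s->alg, tok + 1, n - 1);
    } else if (tok[0] == '[') {
        copy_tok(s->team_range, sizeof s->team_range, tok, n);
    } else if (is_score(tok, n)) {
        copy_tok(s->score, sizeof s->score, tok, n);
    } else if (isdigit((unsigned char)tok[0]) && memchr(tok, '-', n) != NULL) {
        copy_tok(s->msg_range, sizeof s->msg_range, tok, n);
    } else if (index == 0) {
        copy_tok(s->coll, sizeof s->coll, tok, n);
    } else {
        copy_tok(s->mem, sizeof s->mem, tok, n);
    }
}

/* Split tune on '#' and ':' and fill out[]. Returns the number of sections. */
static int tune_parse(const char *tune, struct tune_section *out, int max_sections)
{
    int         nsec = 0;
    const char *p    = tune;
    while (*p != '\0' && nsec < max_sections) {
        const char          *end   = p + strcspn(p, "#");
        struct tune_section *s     = &out[nsec];
        const char          *q     = p;
        int                  index = 0;
        memset(s, 0, sizeof *s);
        while (q < end) {
            size_t n = strcspn(q, ":#");   /* length of the current field */
            if (n > 0) {
                classify(s, q, n, index);
            }
            index++;
            q += n;
            if (q < end) {
                q++;                        /* skip the ':' separator */
            }
        }
        nsec++;
        p = (*end == '#') ? end + 1 : end;
    }
    return nsec;
}
```

Parsing the last example value above yields two sections, both for allreduce: the first with message range "0-4K" and algorithm "0", the second with message range "4K-inf" and algorithm "sra_knomial".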

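To set the policy through the API, call ucc_context_config_modify(). The first argument is the ucc_context_config handle. The second names the component and corresponds to the middle two fields of the environment variable name, for example "tl/ucp" or "tl/nccl". The third is the setting name, here "TUNE". The last is the value, in the same format as the environment variable's value.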

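/* UCC_CHECK is assumed to be a user-defined macro that fails on any status other than UCC_OK. */
UCC_CHECK(ucc_context_config_read(lib, NULL, &ctx_config));
/* Bcast on host memory: knomial for 0-4095 bytes, sag_knomial from 4096 bytes up. */
UCC_CHECK(ucc_context_config_modify(ctx_config, "tl/ucp", "TUNE", "Bcast:0-4095:Host:@knomial#Bcast:4096-inf:Host:@sag_knomial"));
UCC_CHECK(ucc_context_create(lib, &ctx_params, ctx_config, &ctx));
/* The config is released after the context has been created. */
ucc_context_config_release(ctx_config);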

I tried both methods and both work.
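The correspondence between the API arguments and the environment variable names can be written down as a small helper. This is only a sketch of the naming pattern visible in the examples in this post ("tl/ucp" plus "TUNE" matching UCC_TL_UCP_TUNE); the function name ucc_env_name is hypothetical and is not part of the UCC API, and other settings may not follow the same pattern.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Build "UCC_<COMPONENT>_<NAME>" from a component such as "tl/ucp" and a
 * setting name such as "TUNE": '/' becomes '_' and letters are upper-cased.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int ucc_env_name(const char *component, const char *name,
                        char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "UCC_%s_%s", component, name);
    if (n < 0 || (size_t)n >= out_len) {
        return -1;
    }
    for (char *p = out; *p != '\0'; p++) {
        *p = (*p == '/') ? '_' : (char)toupper((unsigned char)*p);
    }
    return 0;
}
```

Calling ucc_env_name("tl/ucp", "TUNE", buf, sizeof buf) fills buf with "UCC_TL_UCP_TUNE", which is the variable name used in the command-line examples earlier in this post.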