diff --git a/articles/20220701-cpu-design-part1-riscv-instruction.md b/articles/20220701-cpu-design-part1-riscv-instruction.md index 4d737f359a0835549ba444c9df7de03ac5b64680..84c5d221387e8209ddcd8c22b1d53c3d67bdb603 100644 --- a/articles/20220701-cpu-design-part1-riscv-instruction.md +++ b/articles/20220701-cpu-design-part1-riscv-instruction.md @@ -5,7 +5,7 @@ > Project: [RISC-V CPU Design](https://gitee.com/tinylab/riscv-linux)
> Sponsor: PLCT Lab, ISCAS -# RISC-V 指令集 +# RISC-V CPU 设计(1):RISC-V 指令集 为了设计出一款基于 RISC-V 指令集的 CPU,我们必须先对 RISC-V 指令集本身进行一定的了解。本文以 RV32 为主来做介绍。 diff --git a/articles/20220710-cpu-design-part1-riscv-privilleged-instruction.md b/articles/20220710-cpu-design-part1-riscv-privilleged-instruction.md index d0093b80ebd11c6168007d71e97f1dd6d57f3527..6712cbd93e3d2da72dfb5569a8e4e6b00beb8595 100644 --- a/articles/20220710-cpu-design-part1-riscv-privilleged-instruction.md +++ b/articles/20220710-cpu-design-part1-riscv-privilleged-instruction.md @@ -6,7 +6,7 @@ > Environment: [Linux Lab](https://tinylab.org/linux-lab)
> Sponsor: PLCT Lab, ISCAS -# RISC-V 特权指令架构 +# RISC-V CPU 设计(2):RISC-V 特权指令架构 RISC-V 的指令集架构 ISA 是由两大部分组成,分别是**非特权级 ISA** 和**特权级 ISA**。而正是因为**特权级 ISA** 的存在,才使得 RISC-V 可以在硬件层面(硬件线程)至多拥有 3 个不同的特权级模式,从而对不同的软件栈部件之间提供保护。 diff --git a/articles/20220722-digital-electronic-with-spinalhdl.md b/articles/20220722-cpu-design-digital-electronic-with-spinalhdl.md similarity index 99% rename from articles/20220722-digital-electronic-with-spinalhdl.md rename to articles/20220722-cpu-design-digital-electronic-with-spinalhdl.md index e1fc1553ab70924e7a1e9d0c125ed6f8852b1044..ae090503fe80a96a449ad2c835d2943cc4143757 100644 --- a/articles/20220722-digital-electronic-with-spinalhdl.md +++ b/articles/20220722-cpu-design-digital-electronic-with-spinalhdl.md @@ -6,7 +6,7 @@ > Proposal: [RISC-V CPU Design](https://gitee.com/tinylab/riscv-linux/issues/I5EIOA)
> Sponsor: PLCT Lab, ISCAS -# CPU 设计——数电基本知识与基于 Scala 的硬件设计框架 SpinalHDL +# RISC-V CPU 设计(3):数电基本知识与基于 Scala 的硬件设计框架 SpinalHDL ## 前言 diff --git a/articles/20220803-cpu-design-analysis-and-main-module-implement.md b/articles/20220803-cpu-design-analysis-and-main-module-implement.md index 87a8d0efec30b5fca90eff1c39e1446db5b41e1c..1d675b90cff68ef06ac38608a03d8d21711308c5 100644 --- a/articles/20220803-cpu-design-analysis-and-main-module-implement.md +++ b/articles/20220803-cpu-design-analysis-and-main-module-implement.md @@ -6,7 +6,7 @@ > Proposal: [RISC-V CPU Design](https://gitee.com/tinylab/riscv-linux/issues/I5EIOA)
> Sponsor: PLCT Lab, ISCAS -# RISC-V CPU 设计理论分析与主要模块的实现 +# RISC-V CPU 设计(4): RISC-V CPU 设计理论分析与主要模块的实现 ## 前言 @@ -779,7 +779,7 @@ object Register_fileSim { 本文部分图片来自参考资料(Wiki 和 RISC-V 手册等),感谢原作者的辛苦工作! [1]: https://gitee.com/tinylab/riscv-linux/blob/master/articles/20220701-cpu-design-part1-riscv-instruction.md -[2]: https://gitee.com/tinylab/riscv-linux/blob/master/articles/20220722-digital-electronic-with-spinalhdl.md +[2]: https://gitee.com/tinylab/riscv-linux/blob/master/articles/20220722-cpu-design-digital-electronic-with-spinalhdl.md [003]: https://github.com/SpinalHDL/SpinalTemplateSbt [004]: https://spinalhdl.github.io/SpinalDoc-RTD/master/index.html [005]: images/riscv_cpu_design/part2/mermaid-cpu-design-analysis-and-main-module-implement-1.png diff --git a/articles/20220816-cpu-design-module-board-test.md b/articles/20220816-cpu-design-module-board-test.md index 626bb6c4780e29d12add2b54bb9cd25bcdf3b6fd..2891da07e9bf348ea53293dcb067054b36078504 100644 --- a/articles/20220816-cpu-design-module-board-test.md +++ b/articles/20220816-cpu-design-module-board-test.md @@ -6,7 +6,7 @@ > Proposal: [RISC-V CPU Design](https://gitee.com/tinylab/riscv-linux/issues/I5EIOA)
> Sponsor: PLCT Lab, ISCAS -# RISC-V CPU 设计模块软件行为仿真与下板实验调试 +# RISC-V CPU 设计(5):RISC-V CPU 设计模块软件行为仿真与下板实验调试 ## 前言 diff --git a/articles/20220826-riscv-cpu-cotroller-module-design.md b/articles/20220826-cpu-design-riscv-cpu-controller-module-design.md similarity index 99% rename from articles/20220826-riscv-cpu-cotroller-module-design.md rename to articles/20220826-cpu-design-riscv-cpu-controller-module-design.md index 352ce886d1fecbcef0e361bf6cf03f39ad5e029c..733ce4bc318ddee65de807f43266147600b923e8 100644 --- a/articles/20220826-riscv-cpu-cotroller-module-design.md +++ b/articles/20220826-cpu-design-riscv-cpu-controller-module-design.md @@ -6,7 +6,7 @@ > Proposal: [RISC-V CPU Design](https://gitee.com/tinylab/riscv-linux/issues/I5EIOA)
> Sponsor: PLCT Lab, ISCAS -# RV64I CPU 控制器模块设计思路与实现 +# RISC-V CPU 设计(6): RV64I CPU 控制器模块设计思路与实现 ## 前言 diff --git a/articles/20230207-riscv-kvm-int-impl-2.md b/articles/20230207-riscv-kvm-int-impl-2.md index 7fb7d1822efa98f22d9a8db224dc949dd035e21c..a7f4f8c1fd4964196ceb60fb9f66527b46446802 100644 --- a/articles/20230207-riscv-kvm-int-impl-2.md +++ b/articles/20230207-riscv-kvm-int-impl-2.md @@ -1,723 +1,714 @@ -> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.1 - [tounix spaces tables images urls]
-> Author: XiakaiPan <13212017962@163.com>
-> Date: 20230109
-> Revisor: Walimis
-> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
-> Proposal: [RISC-V 虚拟化技术调研与分析](https://gitee.com/tinylab/riscv-linux/issues/I5E4VB)
-> Sponsor: PLCT Lab, ISCAS - -# RISC-V KVM 中断处理的实现(二) - -## 前言 - -本文对于 kvmtool 和 KVM 中的中断注入与处理,以及 MMIO 设备的注册与使用,结合代码进行了分析和解读,并主要以流程图的方式呈现其代码实现。 - -## 代码版本 - -| Software | Version | -|-------------------|------------------------------------------| -| [Linux Kernel][1] | 6.0-rc6 | -| [kvmtool][6] | e17d182ad3f797f01947fc234d95c96c050c534b | - -## KVM 异常处理 - -### RISC-V Trap 类型、编码及其关系 - -在 RISC-V 中,CSR `mcause` / `scause` / `vscause` 用于记录引发 Trap 的编码,Interrupt 和 Exception 的区分是通过 CSR 最高位作为标志位来实现的,当标志位为 1 时表示当前 Trap 为 Interrupt,为 0 时则是 Exception。 - -RISC-V 中的中断分为三类:软件中断、计时器中断和外部中断,来自不同特权级的各类中断具有各自的编码。Linux 中对这些中断编码如下: - -```cpp -// arch/riscv/include/asm/csr.h: line 66 -/* Exception cause high bit - is an interrupt if set */ -#define CAUSE_IRQ_FLAG (_AC(1, UL) << (__riscv_xlen - 1)) - -/* Interrupt causes (minus the high bit) */ -#define IRQ_S_SOFT 1 -#define IRQ_VS_SOFT 2 -#define IRQ_M_SOFT 3 -#define IRQ_S_TIMER 5 -#define IRQ_VS_TIMER 6 -#define IRQ_M_TIMER 7 -#define IRQ_S_EXT 9 -#define IRQ_VS_EXT 10 -#define IRQ_M_EXT 11 -#define IRQ_PMU_OVF 13 - -/* Exception causes */ -#define EXC_INST_MISALIGNED 0 -#define EXC_INST_ACCESS 1 -#define EXC_INST_ILLEGAL 2 -#define EXC_BREAKPOINT 3 -#define EXC_LOAD_ACCESS 5 -#define EXC_STORE_ACCESS 7 -#define EXC_SYSCALL 8 -#define EXC_HYPERVISOR_SYSCALL 9 -#define EXC_SUPERVISOR_SYSCALL 10 -#define EXC_INST_PAGE_FAULT 12 -#define EXC_LOAD_PAGE_FAULT 13 -#define EXC_STORE_PAGE_FAULT 15 -#define EXC_INST_GUEST_PAGE_FAULT 20 -#define EXC_LOAD_GUEST_PAGE_FAULT 21 -#define EXC_VIRTUAL_INST_FAULT 22 -#define EXC_STORE_GUEST_PAGE_FAULT 23 -``` - -中断标记前缀为 `IRQ`(Interrupt ReQuest),异常标记前缀为 `EXC`(EXCeption)。 - -### KVM 异常处理 - -KVM 内部处理的是来自于 Guest 的异常,具体来说包括三类: - -- 指令异常:对应 Guest 的虚拟指令异常 -- 内存异常:对应 Guest page-fault -- 环境调用:对应来自于 Guest 在 VS-mode 的 `ecall` 指令 - -详细代码分析参见 [此文][2]。 - -## KVM 虚拟化相关的中断处理 - -在 Linux 内核的 `arch/riscv/kvm` 目录下,实现了对 RISC-V 虚拟化扩展的支持,此节将分析其中有关中断处理的代码实现。据代码可知,KVM 的架构相关的实现中仅包括了 VS-mode 对应的一系列中断的处理,其它中断的处理机制见下一节中断控制器分析。 - 
-### 全局中断基准 - -如果仅支持 M-Mode,那么默认的中断使能(Interrupt Enable)、Trap 向量、中断请求均以 M-Mode 为基准: - -- CSR 使用 `mstatus`, `mie`, `mtvec`, `mcause` 等 -- 状态寄存器标志以 `mstatus` 的为准:`mstatus.mie`, `mstatus.mpie`, `mstatus.mpp` -- 中断编码均对应 M-Mode:`IRQ_M_SOFT/TIMER/EXT` - -否则,就以 S-Mode 为基准,如下方代码所示。 - -```cpp -// arch/riscv/include/asm/csr.h: line 300 -#ifdef CONFIG_RISCV_M_MODE -/* CSR */ -# define CSR_STATUS CSR_MSTATUS -# define CSR_IE CSR_MIE -# define CSR_TVEC CSR_MTVEC -# define CSR_SCRATCH CSR_MSCRATCH -# define CSR_EPC CSR_MEPC -# define CSR_CAUSE CSR_MCAUSE -# define CSR_TVAL CSR_MTVAL -# define CSR_IP CSR_MIP - -/* Status Register Flags */ -# define SR_IE SR_MIE -# define SR_PIE SR_MPIE -# define SR_PP SR_MPP - -/* Interrupt Cause */ -# define RV_IRQ_SOFT IRQ_M_SOFT -# define RV_IRQ_TIMER IRQ_M_TIMER -# define RV_IRQ_EXT IRQ_M_EXT -#else /* CONFIG_RISCV_M_MODE */ -# define CSR_STATUS CSR_SSTATUS -# define CSR_IE CSR_SIE -# define CSR_TVEC CSR_STVEC -# define CSR_SCRATCH CSR_SSCRATCH -# define CSR_EPC CSR_SEPC -# define CSR_CAUSE CSR_SCAUSE -# define CSR_TVAL CSR_STVAL -# define CSR_IP CSR_SIP - -# define SR_IE SR_SIE -# define SR_PIE SR_SPIE -# define SR_PP SR_SPP - -# define RV_IRQ_SOFT IRQ_S_SOFT -# define RV_IRQ_TIMER IRQ_S_TIMER -# define RV_IRQ_EXT IRQ_S_EXT -# define RV_IRQ_PMU IRQ_PMU_OVF -# define SIP_LCOFIP (_AC(0x1, UL) << IRQ_PMU_OVF) - -#endif /* !CONFIG_RISCV_M_MODE */ - -/* IE/IP (Supervisor/Machine Interrupt Enable/Pending) flags */ -#define IE_SIE (_AC(0x1, UL) << RV_IRQ_SOFT) -#define IE_TIE (_AC(0x1, UL) << RV_IRQ_TIMER) -#define IE_EIE (_AC(0x1, UL) << RV_IRQ_EXT) -``` - -M/S-Mode 的中断做统一处理,Guest 内部的 VS-Mode 中断将由 KVM 单独处理。下面将对三类中断的实现分别进行分析。 - -### VS-Mode 软件中断 - -所谓软件中断也称为 IPI(Inter-Processor Interrupt),即处理器间中断。对于 KVM 虚拟机来说,VS-mode 的软件中断是通过 SBI 进行处理的,如下图所示。 - -具体注入过程如下: -1. 某个发送 vCPU 通过在 VS-mode 调用 ecall,给另外一个接收 vCPU 发送 IPI 中断。 -2. 
此时触发发送 vCPU 所在 pCPU 的 HS-mode 异常,退出到 kvm_riscv_vcpu_exit 中,之后处理流程为:`kvm_riscv_vcpu_exit -> kvm_riscv_vcpu_sbi_ecall() -> sbi_ext->handler() -> kvm_sbi_ext_ipi_handler() -> kvm_riscv_vcpu_set_interrupt ()` -3. 最后 `kvm_riscv_vcpu_set_interrupt()` 函数,把 IPI 注入到接收 vCPU 的标志位上,`vcpu->arch.irqs_pending` 和 `vcpu->arch.irqs_pending_mask`,然后调用 `kvm_vcpu_kick()` 函数,提醒接收 vCPU,处理 IPI。实际上就是向接收 vCPU 所在的 pCPU 发送 HS-mode IPI(通过函数 `smp_send_reschedule()` 发送),让接收 vCPU 退出。 -4. 接收 vCPU 退出后,在重新进入运行前,会运行 `kvm_riscv_vcpu_flush_interrupts()` 函数,把 VS-level software interrupt 写入接收 vCPU 的 `vcpu->arch.guest_csr.hvip` 里,然后 `kvm_riscv_update_hvip()` 函数把 `vcpu->arch.guest_csr.hvip` 写入到 CSR_HVIP,即这个 pCPU 的 HVIP CSR 里。 -5. 接收 vCPU 运行到 VS-Mode 后,VS-level software interrupt 触发,由 VS-Mode 的 Guest OS 处理这个 IPI。 - -```mermaid -flowchart - -subgraph arch/riscv/include/asm/csr.h -isft[IRQ_VS_SOFT] -end - -subgraph arch/riscv/kvm/main.c -hwen[kvm_arch_hardware_enable] -end - -subgraph virt/kvm/kvm_main.c -startcpu[kvm_starting_cpu]-->hwennl -hwenall[hardware_enable_all]-->hwennl -mdl_init[module_kvm_init]--> -rv_init[riscv_kvm_init]--> -kvm_init[kvm_init]-->ops -kvm_exit[kvm_exit]-->ops -ops[kvm_syscore_ops]--> -resume[kvm_resume]-->hwennl -hwennl[hardware_enable_nolock]-->hwen - -vcpu[kvm_vcpu_ioctl]-->run - -dev_ioctl[kvm_dev_ioctl]--> -dev_create_vm[kvm_dev_ioctl_create_vm]--> -cvm[kvm_create_vm]-->hwenall -kvm_init-->startcpu - -kvm_compat[kvm_vcpu_compat_ioctl]-->vcpu - -exp_exit[EXPORT_SYMBOL_GPL]-->kvm_exit - -vm[kvm_vm_ioctl]--> -cvcpu[kvm_vm_ioctl_create_vcpu]--> -vcpu_fd[create_vcpu_fd]--> -fops[kvm_vcpu_fops]-->kvm_compat -end - -subgraph arch/riscv/kvm/vcpu_sbi_replace.c -ipi[kvm_sbi_ext_ipi_handler] -sbi_ipi[vcpu_sbi_ext_ipi]-->ipi -end - -subgraph arch/riscv/kvm/vcpu_sbi.c -ecall[kvm_riscv_vcpu_sbi_ecall]--> -sbi[sbi_ext]-->sbi_ipi - -sbi-->sbiv01 -end - -subgraph arch/riscv/kvm/vcpu.c -ustint[kvm_riscv_vcpu_unset_interrupt] -stint[kvm_riscv_vcpu_set_interrupt] 
-syncint[kvm_riscv_vcpu_sync_interrupts]-->isft -run[kvm_arch_vcpu_ioctl_run]-->syncint -end - -subgraph arch/riscv/kvm/vcpu_sbi_v01.c -sbiv01[vcpu_sbi_ext_v01]--> -v01[kvm_sbi_ext_v01_handler]-->stint -v01-->ustint -end - -ipi-->stint -stint-->isft -ustint-->isft -hwen-->isft - -subgraph arch/riscv/kvm/vcpu_exit.c -exit[kvm_riscv_vcpu_exit]-->ecall -end -``` - -([下载由 Mermaid 生成的 PNG 图片][007]) - -### VS-Mode 计时器中断 - -与 VS-mode 软件中断类似,vCPU 的计时器中断处理接口在 `arch/riscv/kvm/vcpu_timer.c` 中定义,而这些接口则是通过调用 `vcpu.c` 中统一的中断处理函数实现的(`kvm_riscv_vcpu_has/set/unset_interrupts`)。 - -```mermaid -flowchart LR - -subgraph arch/riscv/include/asm/csr.h -itimer[IRQ_VS_TIMER] -end - -subgraph arch/riscv/kvm/vcpu.c -ustint[kvm_riscv_vcpu_unset_interrupt]-->itimer -stint[kvm_riscv_vcpu_set_interrupt]-->itimer -hasint[kvm_riscv_vcpu_has_interrupts]-->itimer -end - -subgraph arch/riscv/kvm/vcpu_timer.c -expired[kvm_riscv_vcpu_hrtimer_expired]-->stint -update[kvm_riscv_vcpu_update_hrtimer]-->ustint -pending[kvm_riscv_vcpu_timer_pending]-->hasint - -init[kvm_riscv_vcpu_timer_init]-->expired -init-->update - -init--> -vstimer_expired[kvm_riscv_vcpu_vstimer_expired] - -init--> -vstimecmp_update[kvm_riscv_vcpu_update_vstimecmp] -end - -subgraph arch/riscv/kvm/vcpu.c -vcpu_create[kvm_arch_vcpu_create]-->init - -vcpu_pending[kvm_cpu_has_pending_timer]-->pending -end - -subgraph virt/kvm/kvm_main.c -check[kvm_vcpu_check_block]-->vcpu_pending -end - -``` - -([下载由 Mermaid 生成的 PNG 图片][008]) - -### VS-Mode 外部中断 - -#### KVM 中的 ioctl - -##### ioctl - -从 Kernel 到 VM:调用 `ioctl` 注册 KVM 虚拟机并为其申请资源。具体实现可以参见 [此文][15] 中有关 kvmtool 创建 VM 的部分。 - -kvmtool 作为用户态程序,对于 VM 的所有访问都是通过 `ioctl` 完成的,例如 `kvm_cpu__arch_init` 初始化 VM、vCPU 和内存: - -```cpp -struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, unsigned long cpu_id) -{ - // ... - /* 创建 vCPU */ - vcpu->vcpu_fd = ioctl(kvm->vm_fd, KVM_CREATE_VCPU, cpu_id); - - // ... - /* 获取 VM 的寄存器 */ - if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0) - - // ... 
-} -``` - -`ioctl` 函数自身定义如下: - -```cpp -#include <sys/ioctl.h> - -int ioctl(int fd, unsigned long request, ...); -``` - -##### kvm_*_ioctl - -从 VM 到 Kernel:VM 内部触发 IO 控制,调用 `kvm_*_ioctl` 进行处理 - -```mermaid -flowchart - -subgraph kvm -direction LR -i -e -fops -end - -subgraph i[kvm_*_ioctl] -vcpu/device/vm/dev -end - -i-->e - -subgraph fops[kvm_*_fops] -vcpu/device/vm/chardev -end - -subgraph e[elements_in_fops] -... -ui[unlocked_ioctl] -end - -e-->fops-->vfs - -vfs[vfs_ioctl]--> -ept(EXPORT_SYMBOL) - -vfs-->dvfs[do_vfs_ioctl] - -dvfs-->d3(SYSCALL_DEFINE3) -dvfs-->cd3(COMPAT_SYSCALL_DEFINE3) - -subgraph fs/ioctl -vfs -dvfs -end - -subgraph include/linux - -subgraph syscalls -d3 -cd3 -end - -subgraph export -ept -end - -end -``` - -([下载由 Mermaid 生成的 PNG 图片][009]) - -#### 外部中断 - -`kvm_vcpu_ioctl` 函数作为 `kvm_vcpu_fops.unlocked_ioctl` 在 KVM 初始化之时就已经被注册。当发生对 `/dev/kvm` 的 `ioctl` 调用时,就会通过如上节所述的 `vfs_ioctl` 方法调用 `filp->f_op->unlocked_ioctl` 即 `kvm_vcpu_ioctl` 进行处理。 - -KVM 内部与 VS-Mode 外部中断相关的调用如下图所示: - -```mermaid -flowchart LR - -subgraph arch/riscv/include/asm/csr.h -ext[IRQ_VS_EXT] -end - -subgraph arch/riscv/kvm/vcpu.c - -async[kvm_arch_vcpu_async_ioctl]--> -int[kvm_riscv_vcpu_set/unset_interrupt]-->ext -end - -subgraph virt/kvm/kvm_main.c -vcpu[kvm_vcpu_ioctl]-->async -end - -``` - -([下载由 Mermaid 生成的 PNG 图片][010]) - -([下载由 Mermaid 生成的 PNG 图片][009]) - -`kvm_arch_vcpu_async_ioctl` 内部实现依据具体的中断类型采取对应的操作: - -```cpp -// arch/riscv/kvm/vcpu.c: line 569 - -long kvm_arch_vcpu_async_ioctl(struct file *filp, - unsigned int ioctl, unsigned long arg) -{ - struct kvm_vcpu *vcpu = filp->private_data; - void __user *argp = (void __user *)arg; - - if (ioctl == KVM_INTERRUPT) { - struct kvm_interrupt irq; - - // 将用户态的由 argp 所指向的中断信息复制到 irq 中 - if (copy_from_user(&irq, argp, sizeof(irq))) - return -EFAULT; - - // 根据 irq 的中断操作类型,对指定的 vcpu 进行中断操作(set, unset) - if (irq.irq == KVM_INTERRUPT_SET) - return kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_VS_EXT); - else - return kvm_riscv_vcpu_unset_interrupt(vcpu, 
IRQ_VS_EXT); - } - - return -ENOIOCTLCMD; -} - -``` - -## RISC-V 中断在 Linux 中的实现 - -### Timer 驱动 - -参考 [此文][3] 对 RISC-V 计时器在 Linux 内核中的实现的分析,Linux Timer 的实现包含两个驱动文件: - -- 无 MMU 的 `drivers/clocksource/timer-riscv.c`:运行于 M-mode 下,可直接读取 `mtime` CSR 获取当前时间、通过 `mtimecmp` CSR 设置中断,考虑到虚拟化对于特权级的需求,该实现并不会在虚拟化系统中被调用。 -- 有 MMU 的 `drivers/clocksource/timer-clint.c`:支持 S-mode (S/HS/VS) 下的时钟访问,但因为权限问题,需要借助于 CSR 读写指令达成。在不支持 SSTC 扩展的情况下,需要通过 SBI 写入 `mtimecmp` 实现计时器中断。 - -在添加了虚拟化扩展之后,VS-mode 的计时器中断操作需要通过 SBI 进入 HS-mode 再进入 M-mode,访问 `htimedelta`,`mtimecmp` 等 CSR,开销较大。后续有望通过添加 [SSTC 扩展][4] 实现对 `vstimecmp` 的直接访问进而简化虚拟情况下的中断开销。 - -### 中断驱动与 PLIC 控制器 - -[这篇文章][5] 基于一个 RTC(Real Time Clock)例程分析了 RISC-V 中断的申请、产生、处理流程。 - -Linux 内核中涉及 RISC-V 中断相关的处理机制如下图所示,从左到右依次为 PLIC、INTC(INTerrupt Controller)和内核中断处理。 - -```mermaid -flowchart - -e[arch/riscv/kernel/entry.S]-->ghai - -subgraph kernel/irq/handle.c -ghai[generic_handle_arch_irq] -shi[set_handle_irq] -end - -subgraph kernel/softirq.c -ghai-->ie[irq_exit] -ghai-.->so[others] -end - -subgraph other -end - -ghai-.->other - -subgraph drivers/irqchip/irq-riscv-intc.c -direction -ii[IRQCHIP_DECLARE:riscv_intc_init]-->shi--> -rii[riscv_intc_irq] -idm[[intc_domain]] -end - -subgraph kernel/irq/irqdesc.c -ghdi[generic_handle_domain_irq] -end - -subgraph include/linux/irqdomain.h -al[irq_domain_add_linear] -end - -ii-->al-.return..->idm - -rii-->ghdi -idm-.arg..->ghdi - -subgraph drivers/irqchip/irq-sifive-plic.c -direction TB -epid[IRQCHIP_DECLARE: plic_edge_init]-->pei[plic_edge_init]-->pi -pid[IRQCHIP_DECLARE: plic_init]--> -tpi[__plic_init]-->pi[plic_init]-->phi[plic_handle_irq] -end -``` - -([下载由 Mermaid 生成的 PNG 图片][011]) - -([下载由 Mermaid 生成的 PNG 图片][010]) - -### 小结 - -结合本节和上一节中有关 Linux 以及 KVM 对 RISC-V 中断的分析可知,KVM 内实现了将虚拟机内部 VS-mode 的中断与外部中断处理控制器的绑定,同时实现了特定于 VS-mode 的中断处理功能,从而完成了对于 RISC-V 虚拟化的支持。 - -## MMIO 虚拟化 - -### KVM - -通过用户态程序(如 kvmtool)创建了 vCPU 之后,vcpu 内部就包含了 MMIO 相关的项,如下图所示。如此,便实现了虚拟机 MMIO 的管理。所以 Guest 的 MMIO 操作都是基于下图所示的数据结构实现的。 - 
-```mermaid -flowchart BT - -subgraph v[kvm_vcpu] - -subgraph va[kvm_vcpu_arch] -md[kvm_mmio_decode] -vao[other arch states, ...] -end - -subgraph r[kvm_run] -m[mmio] -ro[other run states, ...] -end - -end -``` - -([下载由 Mermaid 生成的 PNG 图片][012]) - -([下载由 Mermaid 生成的 PNG 图片][011]) - -mmio 在 Host 一端的注册与销毁如下图所示: - -```mermaid -flowchart LR -subgraph kvm_main.c -cv[kvm_create_vm] -cd[kvm_destroy_vm] -... -end - -subgraph coalseced_mmio.c -mi[kvm_coalesced_mmio_init] -mf[kvm_coalesced_mmio_free] -..., -end - -cv-->mi -cv-->mf -cd-->mf -``` - -([下载由 Mermaid 生成的 PNG 图片][013]) - -([下载由 Mermaid 生成的 PNG 图片][012]) - -KVM 中的 MMIO 的访存操作有如下三个对应处理函数: - -```cpp -// arch/riscv/include/asm/kvm_vcpu_insn.h: line 40 -int kvm_riscv_vcpu_mmio_load(struct kvm_vcpu *vcpu, struct kvm_run *run, - unsigned long fault_addr, - unsigned long htinst); -int kvm_riscv_vcpu_mmio_store(struct kvm_vcpu *vcpu, struct kvm_run *run, - unsigned long fault_addr, - unsigned long htinst); -int kvm_riscv_vcpu_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run); - -``` - -下图展示了 MMIO 访存操作的具体实现,可以发现 LAOD/STORE 操作最终是通过调用 IO 设备中注册好的读写函数来实现的: - -```mermaid -flowchart LR -subgraph vi[arch/riscv/kvm/vcpu_insn.c] -l[kvm_riscv_vcpu_mmio_load]-->r -s[kvm_riscv_vcpu_mmio_store]-->r -r[kvm_riscv_vcpu_mmio_return] -end - -subgraph m[virt/kvm/kvm_main.c] -rd[kvm_io_bus_read] -wr[kvm_io_bus_write] -end - -l-->rd -s-->wr - -subgraph dv[include/kvm/iodev.h] -subgraph iodev -subgraph ops -frd[*read] -fwr[*write] -end - -end - -end - -rd-.->frd -wr-.->fwr -``` - -([下载由 Mermaid 生成的 PNG 图片][014]) - -([下载由 Mermaid 生成的 PNG 图片][013]) - -### kvmtool 中断注入及 MMIO 创建 - -在 kvmtool 中 MMIO 是作为 VIRTIO 设备之一连带着中断处理函数一起被注册的。整个过程可以分为两个部分: - -- PLIC,设备树初始化 -- MMIO/PCI 等设备与 PLIC 以及中断处理函数的绑定 -- Console/Net 等设备与初始化时与 MMIO/PCI 设备的绑定 - -执行完整个 Console 的创建过程就完成了 Guest 的 PLIC、IRQ 与设备的绑定,即实现了虚拟机的中断注入机制与 MMIO 创建。 - -下图中左上的 `virtio_dev_init:virtio_console__init` 表示以 KVM 指定的方式初始化设备完成绑定。 - -右边 RISC-V 模块左下方的 `late_init:setup_fdt` 则表示包含有 PLIC 的设备树的初始化。 - 
-```mermaid -flowchart LR - -subgraph riscv -subgraph irq.c -il[kvm__irq_line] -it[kvm__irq_trigger] -end - -subgraph plic.c -pit[plic__irq_trig] -pnd[pci__generate_fdt_nodes] -end -il-->pit -it-->pit -subgraph fdt.c -li[late_init:setup_fdt] -end -li-->pnd -end - -subgraph virtio - -subgraph unified_devices -subgraph console.c -cdi[virtio_dev_init:virtio_console__init] -end -subgraph net.c -bdi[virtio_dev_init:virtio_net__init] -end -udo[other unified_devices, ...] -end - -cdi-->vi -bdi-->vi -udo-.->vi - -subgraph pci.c -pvq[virtio_pci__signal_vq]-->it -pvq-->il -pcfg[virtio_pci__signal_config]-->it -po[other functions, ...] -end - -subgraph mmio.c -vq[virtio_mmio_signal_vq]-->it -cfg[virtio_mmio_signal_config]-->it -mo[other functions, ...] -end - -pm[pci-modern.c]-->il -pl[pci-legacy.c]-->il - -subgraph core.c -vi[virtio_init: case VIRTIO_*] -cm[mmio] -cp[pci] -end - -end - -cm-.->mmio.c -cp-.->pci.c - -subgraph hw -i8[i8042.c]-->il -sr[serial.c]-->il -end - -``` - -([下载由 Mermaid 生成的 PNG 图片][016]) - -([下载由 Mermaid 生成的 PNG 图片][014]) - -## 总结 - -RISC-V 中断通过 PLIC,CLINT 等驱动和控制器来实现,KVM 模块对于虚拟化的支持体现在两方面,一方面是 KVM 实现了与 Guest 外部的中断控制相关联的 VS-mode 的中断处理,另一方面则是通过为用户态程序如 kvmtool 提供接口,支持了虚拟机内部的设备与中断处理函数的注册与绑定,也实现了虚拟机与内核态的绑定,这使得 Guest 的 MMIO 访存等操作顺利进行。 - -## 参考资料 - -- [Linux Kernel][1] -- [RISC-V 异常处理在 KVM 中的实现][2] -- [RISC-V timer 在 Linux 中的实现][3] -- [RISC-V SSTC Extension][4] -- [RISC-V 中断子系统分析——PLIC 中断处理][5] -- [kvmtool][6] - -[1]: https://www.kernel.org/ -[2]: 20221021-riscv-kvm-excp-impl.md -[3]: https://tinylab.org/riscv-timer/#kvm-vcpu_timerc -[4]: https://github.com/riscv/riscv-time-compare/releases/download/v0.5.4/Sstc.pdf -[5]: https://gitee.com/tinylab/riscv-linux/blob/master/articles/20220919-riscv-irq-analysis-part2-interrupt-handling-plic.md -[6]: https://git.kernel.org/pub/scm/linux/kernel/git/will/kvmtool.git -[007]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-1.png -[008]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-2.png 
-[009]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-3.png -[010]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-4.png -[011]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-5.png -[012]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-6.png -[013]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-7.png -[014]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-8.png -[15]: 20220802-kvm-user-app.md#kvmtool -[016]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-9.png +> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.2-rc1 - [refs]
+> Author: XiakaiPan <13212017962@163.com>
+> Date: 20230109
+> Revisor: Walimis
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Proposal: [RISC-V 虚拟化技术调研与分析](https://gitee.com/tinylab/riscv-linux/issues/I5E4VB)
+> Sponsor: PLCT Lab, ISCAS + +# RISC-V KVM 中断处理的实现(二) + +## 前言 + +本文对于 kvmtool 和 KVM 中的中断注入与处理,以及 MMIO 设备的注册与使用,结合代码进行了分析和解读,并主要以流程图的方式呈现其代码实现。 + +## 代码版本 + +| Software | Version | +|-------------------|------------------------------------------| +| [Linux Kernel][1] | 6.0-rc6 | +| [kvmtool][6] | e17d182ad3f797f01947fc234d95c96c050c534b | + +## KVM 异常处理 + +### RISC-V Trap 类型、编码及其关系 + +在 RISC-V 中,CSR `mcause` / `scause` / `vscause` 用于记录引发 Trap 的编码,Interrupt 和 Exception 的区分是通过 CSR 最高位作为标志位来实现的,当标志位为 1 时表示当前 Trap 为 Interrupt,为 0 时则是 Exception。 + +RISC-V 中的中断分为三类:软件中断、计时器中断和外部中断,来自不同特权级的各类中断具有各自的编码。Linux 中对这些中断编码如下: + +```cpp +// arch/riscv/include/asm/csr.h: line 66 +/* Exception cause high bit - is an interrupt if set */ +#define CAUSE_IRQ_FLAG (_AC(1, UL) << (__riscv_xlen - 1)) + +/* Interrupt causes (minus the high bit) */ +#define IRQ_S_SOFT 1 +#define IRQ_VS_SOFT 2 +#define IRQ_M_SOFT 3 +#define IRQ_S_TIMER 5 +#define IRQ_VS_TIMER 6 +#define IRQ_M_TIMER 7 +#define IRQ_S_EXT 9 +#define IRQ_VS_EXT 10 +#define IRQ_M_EXT 11 +#define IRQ_PMU_OVF 13 + +/* Exception causes */ +#define EXC_INST_MISALIGNED 0 +#define EXC_INST_ACCESS 1 +#define EXC_INST_ILLEGAL 2 +#define EXC_BREAKPOINT 3 +#define EXC_LOAD_ACCESS 5 +#define EXC_STORE_ACCESS 7 +#define EXC_SYSCALL 8 +#define EXC_HYPERVISOR_SYSCALL 9 +#define EXC_SUPERVISOR_SYSCALL 10 +#define EXC_INST_PAGE_FAULT 12 +#define EXC_LOAD_PAGE_FAULT 13 +#define EXC_STORE_PAGE_FAULT 15 +#define EXC_INST_GUEST_PAGE_FAULT 20 +#define EXC_LOAD_GUEST_PAGE_FAULT 21 +#define EXC_VIRTUAL_INST_FAULT 22 +#define EXC_STORE_GUEST_PAGE_FAULT 23 +``` + +中断标记前缀为 `IRQ`(Interrupt ReQuest),异常标记前缀为 `EXC`(EXCeption)。 + +### KVM 异常处理 + +KVM 内部处理的是来自于 Guest 的异常,具体来说包括三类: + +- 指令异常:对应 Guest 的虚拟指令异常 +- 内存异常:对应 Guest page-fault +- 环境调用:对应来自于 Guest 在 VS-mode 的 `ecall` 指令 + +详细代码分析参见 [此文][2]。 + +## KVM 虚拟化相关的中断处理 + +在 Linux 内核的 `arch/riscv/kvm` 目录下,实现了对 RISC-V 虚拟化扩展的支持,此节将分析其中有关中断处理的代码实现。据代码可知,KVM 的架构相关的实现中仅包括了 VS-mode 对应的一系列中断的处理,其它中断的处理机制见下一节中断控制器分析。 + 
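The trap encoding described earlier (the top bit of the cause CSR separates interrupts from exceptions, and the remaining bits carry the code) can be sketched in plain C as follows. This is only an illustration under stated assumptions, not kernel code: `XLEN` is fixed at 64, and `trap_is_interrupt`, `trap_code`, `trap_is_vs_interrupt` and `trap_decode_selfcheck` are hypothetical helper names. The `IRQ_*` and `EXC_*` values are copied from the `csr.h` excerpt quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a 64-bit hart, so the interrupt flag is bit XLEN - 1 = 63. */
#define XLEN 64
#define CAUSE_IRQ_FLAG (UINT64_C(1) << (XLEN - 1))

/* Values copied from the csr.h excerpt quoted in the article. */
#define IRQ_VS_SOFT 2
#define IRQ_VS_TIMER 6
#define IRQ_VS_EXT 10
#define EXC_SUPERVISOR_SYSCALL 10

/* True if the trap is an interrupt, false if it is an exception. */
static inline bool trap_is_interrupt(uint64_t cause)
{
	return (cause & CAUSE_IRQ_FLAG) != 0;
}

/* Cause code with the interrupt flag stripped. */
static inline uint64_t trap_code(uint64_t cause)
{
	return cause & ~CAUSE_IRQ_FLAG;
}

/* True only for the three VS-level interrupts (software, timer, external). */
static inline bool trap_is_vs_interrupt(uint64_t cause)
{
	if (!trap_is_interrupt(cause))
		return false;

	switch (trap_code(cause)) {
	case IRQ_VS_SOFT:
	case IRQ_VS_TIMER:
	case IRQ_VS_EXT:
		return true;
	default:
		return false;
	}
}

/* Hypothetical self-check that doubles as a usage example. */
static inline void trap_decode_selfcheck(void)
{
	/* Code 10 with the flag set is the VS-level external interrupt... */
	assert(trap_is_vs_interrupt(CAUSE_IRQ_FLAG | IRQ_VS_EXT));
	/* ...while the same code without the flag is an exception. */
	assert(!trap_is_interrupt(EXC_SUPERVISOR_SYSCALL));
	assert(trap_code(CAUSE_IRQ_FLAG | IRQ_VS_TIMER) == IRQ_VS_TIMER);
}
```

The self-check shows why the flag bit matters: code 10 means `IRQ_VS_EXT` when the flag is set and `EXC_SUPERVISOR_SYSCALL` when it is clear.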
+### 全局中断基准 + +如果仅支持 M-Mode,那么默认的中断使能(Interrupt Enable)、Trap 向量、中断请求均以 M-Mode 为基准: + +- CSR 使用 `mstatus`, `mie`, `mtvec`, `mcause` 等 +- 状态寄存器标志以 `mstatus` 的为准:`mstatus.mie`, `mstatus.mpie`, `mstatus.mpp` +- 中断编码均对应 M-Mode:`IRQ_M_SOFT/TIMER/EXT` + +否则,就以 S-Mode 为基准,如下方代码所示。 + +```cpp +// arch/riscv/include/asm/csr.h: line 300 +#ifdef CONFIG_RISCV_M_MODE +/* CSR */ +# define CSR_STATUS CSR_MSTATUS +# define CSR_IE CSR_MIE +# define CSR_TVEC CSR_MTVEC +# define CSR_SCRATCH CSR_MSCRATCH +# define CSR_EPC CSR_MEPC +# define CSR_CAUSE CSR_MCAUSE +# define CSR_TVAL CSR_MTVAL +# define CSR_IP CSR_MIP + +/* Status Register Flags */ +# define SR_IE SR_MIE +# define SR_PIE SR_MPIE +# define SR_PP SR_MPP + +/* Interrupt Cause */ +# define RV_IRQ_SOFT IRQ_M_SOFT +# define RV_IRQ_TIMER IRQ_M_TIMER +# define RV_IRQ_EXT IRQ_M_EXT +#else /* CONFIG_RISCV_M_MODE */ +# define CSR_STATUS CSR_SSTATUS +# define CSR_IE CSR_SIE +# define CSR_TVEC CSR_STVEC +# define CSR_SCRATCH CSR_SSCRATCH +# define CSR_EPC CSR_SEPC +# define CSR_CAUSE CSR_SCAUSE +# define CSR_TVAL CSR_STVAL +# define CSR_IP CSR_SIP + +# define SR_IE SR_SIE +# define SR_PIE SR_SPIE +# define SR_PP SR_SPP + +# define RV_IRQ_SOFT IRQ_S_SOFT +# define RV_IRQ_TIMER IRQ_S_TIMER +# define RV_IRQ_EXT IRQ_S_EXT +# define RV_IRQ_PMU IRQ_PMU_OVF +# define SIP_LCOFIP (_AC(0x1, UL) << IRQ_PMU_OVF) + +#endif /* !CONFIG_RISCV_M_MODE */ + +/* IE/IP (Supervisor/Machine Interrupt Enable/Pending) flags */ +#define IE_SIE (_AC(0x1, UL) << RV_IRQ_SOFT) +#define IE_TIE (_AC(0x1, UL) << RV_IRQ_TIMER) +#define IE_EIE (_AC(0x1, UL) << RV_IRQ_EXT) +``` + +M/S-Mode 的中断做统一处理,Guest 内部的 VS-Mode 中断将由 KVM 单独处理。下面将对三类中断的实现分别进行分析。 + +### VS-Mode 软件中断 + +所谓软件中断也称为 IPI(Inter-Processor Interrupt),即处理器间中断。对于 KVM 虚拟机来说,VS-mode 的软件中断是通过 SBI 进行处理的,如下图所示。 + +具体注入过程如下: +1. 某个发送 vCPU 通过在 VS-mode 调用 ecall,给另外一个接收 vCPU 发送 IPI 中断。 +2. 
此时触发发送 vCPU 所在 pCPU 的 HS-mode 异常,退出到 kvm_riscv_vcpu_exit 中,之后处理流程为:`kvm_riscv_vcpu_exit -> kvm_riscv_vcpu_sbi_ecall() -> sbi_ext->handler() -> kvm_sbi_ext_ipi_handler() -> kvm_riscv_vcpu_set_interrupt ()` +3. 最后 `kvm_riscv_vcpu_set_interrupt()` 函数,把 IPI 注入到接收 vCPU 的标志位上,`vcpu->arch.irqs_pending` 和 `vcpu->arch.irqs_pending_mask`,然后调用 `kvm_vcpu_kick()` 函数,提醒接收 vCPU,处理 IPI。实际上就是向接收 vCPU 所在的 pCPU 发送 HS-mode IPI(通过函数 `smp_send_reschedule()` 发送),让接收 vCPU 退出。 +4. 接收 vCPU 退出后,在重新进入运行前,会运行 `kvm_riscv_vcpu_flush_interrupts()` 函数,把 VS-level software interrupt 写入接收 vCPU 的 `vcpu->arch.guest_csr.hvip` 里,然后 `kvm_riscv_update_hvip()` 函数把 `vcpu->arch.guest_csr.hvip` 写入到 CSR_HVIP,即这个 pCPU 的 HVIP CSR 里。 +5. 接收 vCPU 运行到 VS-Mode 后,VS-level software interrupt 触发,由 VS-Mode 的 Guest OS 处理这个 IPI。 + +```mermaid +flowchart + +subgraph arch/riscv/include/asm/csr.h +isft[IRQ_VS_SOFT] +end + +subgraph arch/riscv/kvm/main.c +hwen[kvm_arch_hardware_enable] +end + +subgraph virt/kvm/kvm_main.c +startcpu[kvm_starting_cpu]-->hwennl +hwenall[hardware_enable_all]-->hwennl +mdl_init[module_kvm_init]--> +rv_init[riscv_kvm_init]--> +kvm_init[kvm_init]-->ops +kvm_exit[kvm_exit]-->ops +ops[kvm_syscore_ops]--> +resume[kvm_resume]-->hwennl +hwennl[hardware_enable_nolock]-->hwen + +vcpu[kvm_vcpu_ioctl]-->run + +dev_ioctl[kvm_dev_ioctl]--> +dev_create_vm[kvm_dev_ioctl_create_vm]--> +cvm[kvm_create_vm]-->hwenall +kvm_init-->startcpu + +kvm_compat[kvm_vcpu_compat_ioctl]-->vcpu + +exp_exit[EXPORT_SYMBOL_GPL]-->kvm_exit + +vm[kvm_vm_ioctl]--> +cvcpu[kvm_vm_ioctl_create_vcpu]--> +vcpu_fd[create_vcpu_fd]--> +fops[kvm_vcpu_fops]-->kvm_compat +end + +subgraph arch/riscv/kvm/vcpu_sbi_replace.c +ipi[kvm_sbi_ext_ipi_handler] +sbi_ipi[vcpu_sbi_ext_ipi]-->ipi +end + +subgraph arch/riscv/kvm/vcpu_sbi.c +ecall[kvm_riscv_vcpu_sbi_ecall]--> +sbi[sbi_ext]-->sbi_ipi + +sbi-->sbiv01 +end + +subgraph arch/riscv/kvm/vcpu.c +ustint[kvm_riscv_vcpu_unset_interrupt] +stint[kvm_riscv_vcpu_set_interrupt] 
+syncint[kvm_riscv_vcpu_sync_interrupts]-->isft +run[kvm_arch_vcpu_ioctl_run]-->syncint +end + +subgraph arch/riscv/kvm/vcpu_sbi_v01.c +sbiv01[vcpu_sbi_ext_v01]--> +v01[kvm_sbi_ext_v01_handler]-->stint +v01-->ustint +end + +ipi-->stint +stint-->isft +ustint-->isft +hwen-->isft + +subgraph arch/riscv/kvm/vcpu_exit.c +exit[kvm_riscv_vcpu_exit]-->ecall +end +``` + +([下载由 Mermaid 生成的 PNG 图片][007]) + +### VS-Mode 计时器中断 + +与 VS-mode 软件中断类似,vCPU 的计时器中断处理接口在 `arch/riscv/kvm/vcpu_timer.c` 中定义,而这些接口则是通过调用 `vcpu.c` 中统一的中断处理函数实现的(`kvm_riscv_vcpu_has/set/unset_interrupts`)。 + +```mermaid +flowchart LR + +subgraph arch/riscv/include/asm/csr.h +itimer[IRQ_VS_TIMER] +end + +subgraph arch/riscv/kvm/vcpu.c +ustint[kvm_riscv_vcpu_unset_interrupt]-->itimer +stint[kvm_riscv_vcpu_set_interrupt]-->itimer +hasint[kvm_riscv_vcpu_has_interrupts]-->itimer +end + +subgraph arch/riscv/kvm/vcpu_timer.c +expired[kvm_riscv_vcpu_hrtimer_expired]-->stint +update[kvm_riscv_vcpu_update_hrtimer]-->ustint +pending[kvm_riscv_vcpu_timer_pending]-->hasint + +init[kvm_riscv_vcpu_timer_init]-->expired +init-->update + +init--> +vstimer_expired[kvm_riscv_vcpu_vstimer_expired] + +init--> +vstimecmp_update[kvm_riscv_vcpu_update_vstimecmp] +end + +subgraph arch/riscv/kvm/vcpu.c +vcpu_create[kvm_arch_vcpu_create]-->init + +vcpu_pending[kvm_cpu_has_pending_timer]-->pending +end + +subgraph virt/kvm/kvm_main.c +check[kvm_vcpu_check_block]-->vcpu_pending +end + +``` + +([下载由 Mermaid 生成的 PNG 图片][008]) + +### VS-Mode 外部中断 + +#### KVM 中的 ioctl + +##### ioctl + +从 Kernel 到 VM:调用 `ioctl` 注册 KVM 虚拟机并为其申请资源。具体实现可以参见 [此文][15] 中有关 kvmtool 创建 VM 的部分。 + +kvmtool 作为用户态程序,对于 VM 的所有访问都是通过 `ioctl` 完成的,例如 `kvm_cpu__arch_init` 初始化 VM、vCPU 和内存: + +```cpp +struct kvm_cpu *kvm_cpu__arch_init(struct kvm *kvm, unsigned long cpu_id) +{ + // ... + /* 创建 vCPU */ + vcpu->vcpu_fd = ioctl(kvm->vm_fd, KVM_CREATE_VCPU, cpu_id); + + // ... + /* 获取 VM 的寄存器 */ + if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0) + + // ... 
+} +``` + +`ioctl` 函数自身定义如下: + +```cpp +#include <sys/ioctl.h> + +int ioctl(int fd, unsigned long request, ...); +``` + +##### kvm_*_ioctl + +从 VM 到 Kernel:VM 内部触发 IO 控制,调用 `kvm_*_ioctl` 进行处理 + +```mermaid +flowchart + +subgraph kvm +direction LR +i +e +fops +end + +subgraph i[kvm_*_ioctl] +vcpu/device/vm/dev +end + +i-->e + +subgraph fops[kvm_*_fops] +vcpu/device/vm/chardev +end + +subgraph e[elements_in_fops] +... +ui[unlocked_ioctl] +end + +e-->fops-->vfs + +vfs[vfs_ioctl]--> +ept(EXPORT_SYMBOL) + +vfs-->dvfs[do_vfs_ioctl] + +dvfs-->d3(SYSCALL_DEFINE3) +dvfs-->cd3(COMPAT_SYSCALL_DEFINE3) + +subgraph fs/ioctl +vfs +dvfs +end + +subgraph include/linux + +subgraph syscalls +d3 +cd3 +end + +subgraph export +ept +end + +end +``` + +([下载由 Mermaid 生成的 PNG 图片][009]) + +#### 外部中断 + +`kvm_vcpu_ioctl` 函数作为 `kvm_vcpu_fops.unlocked_ioctl` 在 KVM 初始化之时就已经被注册。当用户态对 vCPU 文件描述符(由 `KVM_CREATE_VCPU` 返回,而非 `/dev/kvm` 本身)发起 `ioctl` 调用时,就会通过如上节所述的 `vfs_ioctl` 方法调用 `filp->f_op->unlocked_ioctl` 即 `kvm_vcpu_ioctl` 进行处理。 + +KVM 内部与 VS-Mode 外部中断相关的调用如下图所示: + +```mermaid +flowchart LR + +subgraph arch/riscv/include/asm/csr.h +ext[IRQ_VS_EXT] +end + +subgraph arch/riscv/kvm/vcpu.c + +async[kvm_arch_vcpu_async_ioctl]--> +int[kvm_riscv_vcpu_set/unset_interrupt]-->ext +end + +subgraph virt/kvm/kvm_main.c +vcpu[kvm_vcpu_ioctl]-->async +end + +``` + +([下载由 Mermaid 生成的 PNG 图片][010]) + + +`kvm_arch_vcpu_async_ioctl` 内部实现依据具体的中断类型采取对应的操作: + +```cpp +// arch/riscv/kvm/vcpu.c: line 569 + +long kvm_arch_vcpu_async_ioctl(struct file *filp, + unsigned int ioctl, unsigned long arg) +{ + struct kvm_vcpu *vcpu = filp->private_data; + void __user *argp = (void __user *)arg; + + if (ioctl == KVM_INTERRUPT) { + struct kvm_interrupt irq; + + // 将用户态的由 argp 所指向的中断信息复制到 irq 中 + if (copy_from_user(&irq, argp, sizeof(irq))) + return -EFAULT; + + // 根据 irq 的中断操作类型,对指定的 vcpu 进行中断操作(set, unset) + if (irq.irq == KVM_INTERRUPT_SET) + return kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_VS_EXT); + else + return kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_VS_EXT); + } + + return 
-ENOIOCTLCMD; +} + +``` + +## RISC-V 中断在 Linux 中的实现 + +### Timer 驱动 + +参考 [此文][3] 对 RISC-V 计时器在 Linux 内核中的实现的分析,Linux Timer 的实现包含两个驱动文件: + +- 无 MMU 的 `drivers/clocksource/timer-clint.c`:运行于 M-mode 下,可直接读取 CLINT 中内存映射的 `mtime` 寄存器获取当前时间、通过 `mtimecmp` 寄存器设置中断,考虑到虚拟化对于特权级的需求,该实现并不会在虚拟化系统中被调用。 +- 有 MMU 的 `drivers/clocksource/timer-riscv.c`:支持 S-mode (S/HS/VS) 下的时钟访问,但因为权限问题,需要借助于 CSR 读写指令达成。在不支持 SSTC 扩展的情况下,需要通过 SBI 写入 `mtimecmp` 实现计时器中断。 + +在添加了虚拟化扩展之后,VS-mode 的计时器中断操作需要通过 SBI 进入 HS-mode 再进入 M-mode,访问 `htimedelta`,`mtimecmp` 等寄存器,开销较大。后续有望通过添加 [SSTC 扩展][4] 实现对 `vstimecmp` 的直接访问进而简化虚拟情况下的中断开销。 + +### 中断驱动与 PLIC 控制器 + +[这篇文章][5] 基于一个 RTC(Real Time Clock)例程分析了 RISC-V 中断的申请、产生、处理流程。 + +Linux 内核中涉及 RISC-V 中断相关的处理机制如下图所示,从左到右依次为 PLIC、INTC(INTerrupt Controller)和内核中断处理。 + +```mermaid +flowchart + +e[arch/riscv/kernel/entry.S]-->ghai + +subgraph kernel/irq/handle.c +ghai[generic_handle_arch_irq] +shi[set_handle_irq] +end + +subgraph kernel/softirq.c +ghai-->ie[irq_exit] +ghai-.->so[others] +end + +subgraph other +end + +ghai-.->other + +subgraph drivers/irqchip/irq-riscv-intc.c +direction +ii[IRQCHIP_DECLARE:riscv_intc_init]-->shi--> +rii[riscv_intc_irq] +idm[[intc_domain]] +end + +subgraph kernel/irq/irqdesc.c +ghdi[generic_handle_domain_irq] +end + +subgraph include/linux/irqdomain.h +al[irq_domain_add_linear] +end + +ii-->al-.return..->idm + +rii-->ghdi +idm-.arg..->ghdi + +subgraph drivers/irqchip/irq-sifive-plic.c +direction TB +epid[IRQCHIP_DECLARE: plic_edge_init]-->pei[plic_edge_init]-->pi +pid[IRQCHIP_DECLARE: plic_init]--> +tpi[__plic_init]-->pi[plic_init]-->phi[plic_handle_irq] +end +``` + +([下载由 Mermaid 生成的 PNG 图片][011]) + +### 小结 + +结合本节和上一节中有关 Linux 以及 KVM 对 RISC-V 中断的分析可知,KVM 内实现了将虚拟机内部 VS-mode 的中断与外部中断处理控制器的绑定,同时实现了特定于 VS-mode 的中断处理功能,从而完成了对于 RISC-V 虚拟化的支持。 + +## MMIO 虚拟化 + +### KVM + +通过用户态程序(如 kvmtool)创建了 vCPU 之后,vcpu 内部就包含了 MMIO 相关的项,如下图所示。如此,便实现了虚拟机 MMIO 的管理。所以 Guest 的 MMIO 操作都是基于下图所示的数据结构实现的。 + +```mermaid +flowchart BT + +subgraph v[kvm_vcpu] + +subgraph 
va[kvm_vcpu_arch] +md[kvm_mmio_decode] +vao[other arch states, ...] +end + +subgraph r[kvm_run] +m[mmio] +ro[other run states, ...] +end + +end +``` + +([下载由 Mermaid 生成的 PNG 图片][012]) + + +mmio 在 Host 一端的注册与销毁如下图所示: + +```mermaid +flowchart LR +subgraph kvm_main.c +cv[kvm_create_vm] +cd[kvm_destroy_vm] +... +end + +subgraph coalseced_mmio.c +mi[kvm_coalesced_mmio_init] +mf[kvm_coalesced_mmio_free] +..., +end + +cv-->mi +cv-->mf +cd-->mf +``` + +([下载由 Mermaid 生成的 PNG 图片][013]) + +KVM 中的 MMIO 的访存操作有如下三个对应处理函数: + +```cpp +// arch/riscv/include/asm/kvm_vcpu_insn.h: line 40 +int kvm_riscv_vcpu_mmio_load(struct kvm_vcpu *vcpu, struct kvm_run *run, + unsigned long fault_addr, + unsigned long htinst); +int kvm_riscv_vcpu_mmio_store(struct kvm_vcpu *vcpu, struct kvm_run *run, + unsigned long fault_addr, + unsigned long htinst); +int kvm_riscv_vcpu_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run); + +``` + +下图展示了 MMIO 访存操作的具体实现,可以发现 LAOD/STORE 操作最终是通过调用 IO 设备中注册好的读写函数来实现的: + +```mermaid +flowchart LR +subgraph vi[arch/riscv/kvm/vcpu_insn.c] +l[kvm_riscv_vcpu_mmio_load]-->r +s[kvm_riscv_vcpu_mmio_store]-->r +r[kvm_riscv_vcpu_mmio_return] +end + +subgraph m[virt/kvm/kvm_main.c] +rd[kvm_io_bus_read] +wr[kvm_io_bus_write] +end + +l-->rd +s-->wr + +subgraph dv[include/kvm/iodev.h] +subgraph iodev +subgraph ops +frd[*read] +fwr[*write] +end + +end + +end + +rd-.->frd +wr-.->fwr +``` + +([下载由 Mermaid 生成的 PNG 图片][014]) + +### kvmtool 中断注入及 MMIO 创建 + +在 kvmtool 中 MMIO 是作为 VIRTIO 设备之一连带着中断处理函数一起被注册的。整个过程可以分为两个部分: + +- PLIC,设备树初始化 +- MMIO/PCI 等设备与 PLIC 以及中断处理函数的绑定 +- Console/Net 等设备与初始化时与 MMIO/PCI 设备的绑定 + +执行完整个 Console 的创建过程就完成了 Guest 的 PLIC、IRQ 与设备的绑定,即实现了虚拟机的中断注入机制与 MMIO 创建。 + +下图中左上的 `virtio_dev_init:virtio_console__init` 表示以 KVM 指定的方式初始化设备完成绑定。 + +右边 RISC-V 模块左下方的 `late_init:setup_fdt` 则表示包含有 PLIC 的设备树的初始化。 + +```mermaid +flowchart LR + +subgraph riscv +subgraph irq.c +il[kvm__irq_line] +it[kvm__irq_trigger] +end + +subgraph plic.c +pit[plic__irq_trig] 
+pnd[pci__generate_fdt_nodes]
+end
+il-->pit
+it-->pit
+subgraph fdt.c
+li[late_init:setup_fdt]
+end
+li-->pnd
+end
+
+subgraph virtio
+
+subgraph unified_devices
+subgraph console.c
+cdi[virtio_dev_init:virtio_console__init]
+end
+subgraph net.c
+bdi[virtio_dev_init:virtio_net__init]
+end
+udo[other unified_devices, ...]
+end
+
+cdi-->vi
+bdi-->vi
+udo-.->vi
+
+subgraph pci.c
+pvq[virtio_pci__signal_vq]-->it
+pvq-->il
+pcfg[virtio_pci__signal_config]-->it
+po[other functions, ...]
+end
+
+subgraph mmio.c
+vq[virtio_mmio_signal_vq]-->it
+cfg[virtio_mmio_signal_config]-->it
+mo[other functions, ...]
+end
+
+pm[pci-modern.c]-->il
+pl[pci-legacy.c]-->il
+
+subgraph core.c
+vi[virtio_init: case VIRTIO_*]
+cm[mmio]
+cp[pci]
+end
+
+end
+
+cm-.->mmio.c
+cp-.->pci.c
+
+subgraph hw
+i8[i8042.c]-->il
+sr[serial.c]-->il
+end
+
+```
+
+([Download the PNG image generated by Mermaid][016])
+
+## Conclusion
+
+RISC-V interrupts are implemented through drivers and controllers such as the PLIC and the CLINT. KVM supports virtualization in two ways. First, KVM implements VS-mode interrupt handling that is tied to the interrupt control outside the guest. Second, by exposing interfaces to user-space programs such as kvmtool, it supports registering and binding the devices inside the virtual machine to their interrupt handling functions and ties the virtual machine to the kernel side, which allows guest operations such as MMIO accesses to proceed.
+
+## References
+
+- [Linux Kernel][1]
+- [riscv kvm user app][15]
+- [Implementation of RISC-V exception handling in KVM][2]
+- [Implementation of the RISC-V timer in Linux][3]
+- [RISC-V SSTC Extension][4]
+- [RISC-V interrupt subsystem analysis: PLIC interrupt handling][5]
+- [kvmtool][6]
+
+[1]: https://www.kernel.org/
+[2]: 20221021-riscv-kvm-excp-impl.md
+[3]: https://tinylab.org/riscv-timer/#kvm-vcpu_timerc
+[4]: https://github.com/riscv/riscv-time-compare/releases/download/v0.5.4/Sstc.pdf
+[5]: https://gitee.com/tinylab/riscv-linux/blob/master/articles/20220919-riscv-irq-analysis-part2-interrupt-handling-plic.md
+[6]: https://git.kernel.org/pub/scm/linux/kernel/git/will/kvmtool.git
+[007]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-1.png
+[008]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-2.png
+[009]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-3.png
+[010]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-4.png
+[011]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-5.png
+[012]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-6.png
+[013]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-7.png
+[014]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-8.png
+[15]: 20220802-riscv-kvm-user-app.md#kvmtool
+[016]: images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-9.png
diff --git a/articles/20230612-introduction-to-riscv-sbi.md b/articles/20230612-introduction-to-riscv-sbi.md
new file mode 100644
index 0000000000000000000000000000000000000000..d37bb75671ca7d3a4b3ad4aeea0d1fb0036f8cd3
--- /dev/null
+++ b/articles/20230612-introduction-to-riscv-sbi.md
@@ -0,0 +1,609 @@
+> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.2-rc1 - [spaces tables autocorrect]
+> Author: groot
+> Date: 2023/06/12
+> Revisor: Falcon
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Proposal: [RISC-V Linux 内核 SBI 调用技术分析](https://gitee.com/tinylab/riscv-linux/issues/I64YC4)
+> Sponsor: PLCT Lab, ISCAS
+
+# RISC-V SBI overview
+
+## Preface
+
+Among today's computer architectures, RISC-V is a rising star that has drawn wide attention and is broadly used and studied in academia and industry, both in China and abroad. The Supervisor Binary Interface (SBI), one of the RISC-V Execution Environment Interfaces (EEI), provides important technical support for using RISC-V in operating system kernels and similar software.
+
+This article first introduces the concept, purpose, and implementations of SBI, then focuses on the call flow inside OpenSBI and on how Linux interacts with the underlying SBI implementation. It aims to help readers understand the role SBI plays in the RISC-V architecture and the details of its implementation, and to improve the skills and hands-on ability of RISC-V developers.
+
+## Software versions
+
+| Software | Version  |
+|----------|----------|
+| Linux    | v6.4-rc5 |
+| OpenSBI  | v1.2     |
+| QEMU     | 8.0.2    |
+
+## RISC-V SBI
+
+### What is SBI?
+
+SBI stands for Supervisor Binary Interface. It is one of the RISC-V Execution Environment Interfaces (EEI), and its goal is to make software running in supervisor mode (S-mode or VS-mode) easy to port across RISC-V processors that implement different sets of extensions. The higher-privilege software that provides the SBI interface to supervisor-mode software is called an SBI implementation or Supervisor Execution Environment (SEE).
+
+### Why SBI is needed
+
+![sbi](images/introduction-to-riscv-sbi/sbi1.svg)
+
+Without SBI (right side of the figure above), an operating system kernel might need a different way to trigger M-mode actions on each RISC-V microarchitecture, depending on which extensions it implements. With SBI, as long as every such microarchitecture exposes the same upward-facing SBI interface, the operating system no longer has to care about microarchitectural details and can simply rely on the functions that the SBI interface provides, which greatly improves the portability of supervisor-mode software.
+This is an instance of one of the most important ideas in computing: abstraction. Hiding the concrete low-level implementation behind a uniform interface frees upper layers from low-level details and makes software much easier to develop.
+As an analogy, think of SBI as a phone charger connector. There used to be many kinds of charger connectors on the market; if A's phone and B's phone used different connectors, they could not borrow each other's chargers. If every charger uses a Type-C connector, that awkward situation no longer arises, which is far more convenient for users.
+
+### The role of SBI
+
+![sbi2](images/introduction-to-riscv-sbi/sbi2.svg)
+
+The first role of SBI was described at the beginning (left side of figure 1).
+In addition, as the figure above shows, an SBI implementation may also be a hypervisor running in hypervisor mode (HS-mode).
+Seen from the higher privilege level, the SBI implementation assigns physical execution units (harts) to supervisor-mode software.
+From the point of view of the SBI implementation, an S-mode hart is therefore called a virtual hart (figure 1). If the implementation is a hypervisor (figure 2), a virtual hart represents a VS-mode hart.
+
+### SBI Spec
+
+#### How to get the SBI Spec
+
+The SBI Specification is published by the RISC-V Foundation, which updates its GitHub repository with every release. The repository is linked below; the latest SBI Specification can be obtained from it.
+
+**[riscv-sbi-doc][001]**
+
+So far the release cycle of the SBI Specification is not fixed, but a stable version has generally appeared within a year of the previous one. Before each new version, the Foundation publishes rc (Release Candidate) versions; developers can contact the Foundation with comments and suggestions, which are taken into account before the stable version is finalized.
+
+#### Overview of SBI version changes
+
+The following summarizes the changes across SBI versions:
+
+##### Version 2.0-rc1
+
+- Added a generic description of shared memory physical address range parameters
+- Added the SBI debug console extension
+- Relaxed the counter width requirement on SBI PMU firmware counters
+- Added the `sbi_pmu_counter_fw_read_hi()` function to the SBI PMU extension
+- Reserved space for SBI implementation specific firmware events
+- Added the SBI system suspend extension
+- Added the SBI CPPC extension
+- Clarified that an SBI extension can be partially implemented only if it defines a mechanism to discover the implemented SBI functions
+- Added the error code `SBI_ERR_NO_SHMEM`
+- Added the SBI nested acceleration extension
+- Added a generic description of virtual harts
+- Added the SBI steal-time accounting extension
+- Added the SBI PMU snapshot extension
+
+##### Version 1.0
+
+- Improved the Introduction section of the SBI document
+- Removed all references to RV32
+- Updated the calling convention
+- Added an abbreviation table
+
+##### Version 0.3
+
+- Improved document styles and naming conventions
+- Added the SBI system reset extension
+- Improved the Introduction section of the SBI document
+- Improved the documentation of the SBI hart state management extension
+- Added the suspend function to the SBI hart state management extension
+- Added the performance monitoring unit extension
+- Stated that an SBI extension shall not be partially implemented
+
+##### Version 0.2
+
+- The whole of SBI v0.1 was moved into the legacy extension, which is now optional. Technically this is a backward-incompatible change, because the legacy extension became optional
+
+Note: this summary is current as of 2023/06/15; Version 2.0 had not yet been officially released.
+
+#### Extensions by SBI version
+
+| Extension / Version   | 0.2 | 0.3 | 1.0 | 2.0-rc1 |
+|-----------------------|-----|-----|-----|---------|
+| Legacy                | √   | √   | √   | √       |
+| Base                  |     | √   | √   | √       |
+| Timer                 |     | √   | √   | √       |
+| IPI                   |     | √   | √   | √       |
+| RFENCE                |     | √   | √   | √       |
+| HSM                   |     | √   | √   | √       |
+| System Reset          |     | √   | √   | √       |
+| PMU                   |     | √   | √   | √       |
+| Debug Console         |     |     |     | √       |
+| System Suspend        |     |     |     | √       |
+| CPPC                  |     |     |     | √       |
+| Steal-time Accounting |     |     |     | √       |
+| Nested Acceleration   |     |     |     | √       |
+| Experimental          |     | √   | √   | √       |
+| Vendor-Specific       |     | √   | √   | √       |
+| Firmware Specific     |     | √   | √   | √       |
+
+### SBI Implementations
+
+In principle, because the SBI Spec is open, anything that implements the functionality described in the Spec can be called an SBI implementation.
+The implementations currently recognized officially by RISC-V are:
+
+| Implementation ID | Name                       | Update       |
+|-------------------|----------------------------|:------------:|
+| 0                 | Berkeley Boot Loader (BBL) | Nov 1, 2020  |
+| 1                 | OpenSBI                    | Jun 14, 2023 |
+| 2                 | Xvisor                     | Dec 23, 2022 |
+| 3                 | KVM                        | Apr 21, 2023 |
+| 4                 | RustSBI                    | May 23, 2023 |
+| 5                 | Diosix                     | May 8, 2021  |
+| 6                 | Coffer                     | Mar 3, 2022  |
+| 7                 | Xen Project                |              |
+
+Note: Xen Project has only reserved a slot in the list of SBI implementations and does not provide actual support yet.
+
+## OpenSBI firmware code analysis
+
+### What is OpenSBI
+
+OpenSBI is a reference implementation of the RISC-V SBI Spec written in C. It was started by Western Digital and open-sourced in 2019.
+
+### Building OpenSBI
+
+> This assumes that QEMU and U-Boot are already installed. If you run into difficulties, see the related TinyLab documentation: [https://tinylab.org/riscv-linux](https://tinylab.org/riscv-linux)
+
+1. Download the OpenSBI source
+
+```sh
+git clone https://github.com/riscv-software-src/opensbi.git
+```
+
+2. Enter the OpenSBI directory
+
+```sh
+cd opensbi
+```
+
+3. Create a build directory and enter it
+
+```sh
+mkdir build
+cd build
+```
+
+4. Build
+
+```sh
+make -C $(pwd)/.. PLATFORM=generic CROSS_COMPILE=riscv64-linux-gnu- FW_PAYLOAD_PATH=path/to/u-boot.bin
+```
+
+### Booting OpenSBI
+
+1. Run the following command in the `qemu-opensbi` directory
+
+```sh
+qemu-system-riscv64 -M virt -m 256 -nographic -bios build/platform/generic/firmware/fw_payload.elf
+```
+
+2. Output
+
+![alt img](images/introduction-to-riscv-sbi/img.png)
+
+At this point OpenSBI has booted successfully and handed control to U-Boot.
+
+### OpenSBI source analysis
+
+We take `sbi_get_mvendorid()`, the Base Extension function that returns the hardware vendor ID, as an example and follow how a call to it is handled.
+
+#### OpenSBI trap handler
+
+First comes the definition of the trap handler entry, that is, the setup of `mtvec`. The code below sets `mtvec` to `_trap_handler`:
+
+```
+// opensbi/firmware/fw_base.S: 493
+
+/* Setup trap handler */
+lla a4, _trap_handler
+csrw CSR_MTVEC, a4
+```
+
+This installs the trap handler entry. Whenever an exception, an interrupt, or an environment call occurs while the system is running, the hardware jumps to the address of `_trap_handler`:
+
+```
+// opensbi/firmware/fw_base.S: 765
+
+_trap_handler:
+    TRAP_SAVE_AND_SETUP_SP_T0
+
+    TRAP_SAVE_MEPC_MSTATUS 0
+
+    TRAP_SAVE_GENERAL_REGS_EXCEPT_SP_T0
+
+    TRAP_CALL_C_ROUTINE
+
+_trap_exit:
+    TRAP_RESTORE_GENERAL_REGS_EXCEPT_A0_T0
+
+    TRAP_RESTORE_MEPC_MSTATUS 0
+
+    TRAP_RESTORE_A0_T0
+
+    mret
+```
+
+The macros before and after `TRAP_CALL_C_ROUTINE` save and restore state; `TRAP_CALL_C_ROUTINE` itself is the actual trap handling.
+
+```
+// opensbi/firmware/fw_base.S: 702
+
+.macro TRAP_CALL_C_ROUTINE
+    /* Call C routine */
+    add a0, sp, zero
+    call sbi_trap_handler
+.endm
+```
+
+So the trap is finally handled by calling the `sbi_trap_handler` function.
+
+#### OpenSBI ecall handling
+
+Continuing from above, once inside `sbi_trap_handler()`, find the part that handles the `ecall` instruction:
+
+```c
+// opensbi/lib/sbi/sbi_trap.c: 303
+
+case CAUSE_SUPERVISOR_ECALL:
+case CAUSE_MACHINE_ECALL:
+    rc = sbi_ecall_handler(regs);
+    msg = "ecall handler failed";
+    break;
+```
+
+Execution then enters `sbi_ecall_handler()`, where `sbi_ecall_find_extension()` checks whether the extension is supported. If it is, the callback registered earlier is invoked to handle the call; otherwise `SBI_ENOTSUPP` (SBI_ERR_NOT_SUPPORTED) is returned.
+
+```c
+// opensbi/lib/sbi/sbi_ecall.c: 108
+
+    ext = sbi_ecall_find_extension(extension_id);
+    if (ext && ext->handle) {
+        ret = ext->handle(extension_id, func_id,
+                          regs, &out_val, &trap);
+        if (extension_id >= SBI_EXT_0_1_SET_TIMER &&
+            extension_id <= SBI_EXT_0_1_SHUTDOWN)
+            is_0_1_spec = 1;
+    } else {
+        ret = SBI_ENOTSUPP;
+    }
+```
+
+#### A brief look at OpenSBI extension initialization
+
+The firmware entry code (`fw_base.S`) calls `sbi_init` to initialize OpenSBI:
+
+```
+// opensbi/firmware/fw_base.S: 519
+
+    /* Initialize SBI runtime */
+    call sbi_init
+```
+
+Execution then enters the `sbi_init()` function in `lib/sbi/sbi_init.c`, which performs a series of checks and then initializes each extension:
+
+```c
+// opensbi/lib/sbi/sbi_init.c: 264
+
+    rc = sbi_xxx_init(scratch, true);
+    if (rc) {
+        sbi_printf("%s: xxx init failed (error %d)\n",
+                   __func__, rc);
+        sbi_hart_hang();
+    }
+```
+
+Finally, each extension is registered in the `sbi_ecall_exts` list, which completes initialization.
+
+After this function has run and OpenSBI has been initialized successfully, the functions that OpenSBI provides can be called.
+
+#### OpenSBI call flow
+
+Now suppose the Linux kernel issues an `ecall` to OpenSBI whose `ext` is the ID of the extension that contains `sbi_get_mvendorid()`, namely `0x10`. OpenSBI enters its trap handler automatically; the handling up to this point was covered above and is not repeated here.
+
+What happens next is that `ret = ext->handle(extension_id, func_id, regs, &out_val, &trap);` calls the handler that matches the ext and fid, which here is `sbi_ecall_base_handler` in `lib/sbi/sbi_ecall_base.c`:
+
+```c
+// opensbi/lib/sbi/sbi_ecall_base.c: 56
+
+case SBI_EXT_BASE_GET_MVENDORID:
+    *out_val = csr_read(CSR_MVENDORID);
+    break;
+```
+
+It reads the value directly from the Machine Information Registers to obtain `mvendorid`.
+
+That ends the handling. The result is returned level by level, and `mvendorid` is carried back in the `a1` register.
+
+### How OpenSBI stays compatible with different SBI versions
+
+The design of the SBI Spec follows the RISC-V philosophy of modular extensions:
+
+1. The functions that an SBI implementation offers to S-mode are grouped into SBI extensions, which are the basic unit. To implement a function in an SBI implementation, every function of the extension it belongs to must be implemented.
+2. As with the RISC-V Spec, once a new version of the SBI Spec is officially released, the extensions defined in that version are frozen and can no longer be changed.
+
+Because every OpenSBI release implements all the extensions required by the SBI Spec, a newer OpenSBI is always compatible with earlier OpenSBI releases.
+
+OpenSBI also defines the supported extensions and functions in an include file:
+
+```c
+// opensbi/include/sbi/sbi_ecall_interface.h: 15
+
+/* SBI Extension IDs */
+#define SBI_EXT_0_1_SET_TIMER 0x0
+#define SBI_EXT_0_1_CONSOLE_PUTCHAR 0x1
+#define SBI_EXT_0_1_CONSOLE_GETCHAR 0x2
+#define SBI_EXT_0_1_CLEAR_IPI 0x3
+#define SBI_EXT_0_1_SEND_IPI 0x4
+#define SBI_EXT_0_1_REMOTE_FENCE_I 0x5
+#define SBI_EXT_0_1_REMOTE_SFENCE_VMA 0x6
+#define SBI_EXT_0_1_REMOTE_SFENCE_VMA_ASID 0x7
+#define SBI_EXT_0_1_SHUTDOWN 0x8
+#define SBI_EXT_BASE 0x10
+#define SBI_EXT_TIME 0x54494D45
+#define SBI_EXT_IPI 0x735049
+#define SBI_EXT_RFENCE 0x52464E43
+#define SBI_EXT_HSM 0x48534D
+#define SBI_EXT_SRST 0x53525354
+#define SBI_EXT_PMU 0x504D55
+#define SBI_EXT_DBCN 0x4442434E
+#define SBI_EXT_SUSP 0x53555350
+#define SBI_EXT_CPPC 0x43505043
+
+/* SBI function IDs for BASE extension */
+#define SBI_EXT_BASE_GET_SPEC_VERSION 0x0
+#define SBI_EXT_BASE_GET_IMP_ID 0x1
+#define SBI_EXT_BASE_GET_IMP_VERSION 0x2
+#define SBI_EXT_BASE_PROBE_EXT 0x3
+#define SBI_EXT_BASE_GET_MVENDORID 0x4
+#define SBI_EXT_BASE_GET_MARCHID 0x5
+#define SBI_EXT_BASE_GET_MIMPID 0x6
+
+/* SBI function IDs for TIME extension */
+#define SBI_EXT_TIME_SET_TIMER 0x0
+
+/* SBI function IDs for IPI extension */
+#define SBI_EXT_IPI_SEND_IPI 0x0
+
+/* SBI function IDs for RFENCE extension */
+#define SBI_EXT_RFENCE_REMOTE_FENCE_I 0x0
+#define SBI_EXT_RFENCE_REMOTE_SFENCE_VMA 0x1
+#define SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID 0x2
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID 0x3
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA 0x4
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA_ASID 0x5
+#define SBI_EXT_RFENCE_REMOTE_HFENCE_VVMA 0x6
+
+/* SBI function IDs for HSM extension */
+#define SBI_EXT_HSM_HART_START 0x0
+#define SBI_EXT_HSM_HART_STOP 0x1
+#define SBI_EXT_HSM_HART_GET_STATUS 0x2
+#define SBI_EXT_HSM_HART_SUSPEND 0x3
+
+#define SBI_HSM_STATE_STARTED 0x0
+#define SBI_HSM_STATE_STOPPED 0x1
+#define SBI_HSM_STATE_START_PENDING 0x2
+#define SBI_HSM_STATE_STOP_PENDING 0x3
+#define SBI_HSM_STATE_SUSPENDED 0x4
+#define SBI_HSM_STATE_SUSPEND_PENDING 0x5
+#define SBI_HSM_STATE_RESUME_PENDING 0x6
+
+#define SBI_HSM_SUSP_BASE_MASK 0x7fffffff
+#define SBI_HSM_SUSP_NON_RET_BIT 0x80000000
+#define SBI_HSM_SUSP_PLAT_BASE 0x10000000
+
+#define SBI_HSM_SUSPEND_RET_DEFAULT 0x00000000
+#define SBI_HSM_SUSPEND_RET_PLATFORM SBI_HSM_SUSP_PLAT_BASE
+#define SBI_HSM_SUSPEND_RET_LAST SBI_HSM_SUSP_BASE_MASK
+#define SBI_HSM_SUSPEND_NON_RET_DEFAULT SBI_HSM_SUSP_NON_RET_BIT
+#define SBI_HSM_SUSPEND_NON_RET_PLATFORM (SBI_HSM_SUSP_NON_RET_BIT | \
+                                          SBI_HSM_SUSP_PLAT_BASE)
+#define SBI_HSM_SUSPEND_NON_RET_LAST (SBI_HSM_SUSP_NON_RET_BIT | \
+                                      SBI_HSM_SUSP_BASE_MASK)
+
+/* SBI function IDs for SRST extension */
+#define SBI_EXT_SRST_RESET 0x0
+
+#define SBI_SRST_RESET_TYPE_SHUTDOWN 0x0
+#define SBI_SRST_RESET_TYPE_COLD_REBOOT 0x1
+#define SBI_SRST_RESET_TYPE_WARM_REBOOT 0x2
+#define SBI_SRST_RESET_TYPE_LAST SBI_SRST_RESET_TYPE_WARM_REBOOT
+
+#define SBI_SRST_RESET_REASON_NONE 0x0
+#define SBI_SRST_RESET_REASON_SYSFAIL 0x1
+
+/* SBI function IDs for PMU extension */
+#define SBI_EXT_PMU_NUM_COUNTERS 0x0
+#define SBI_EXT_PMU_COUNTER_GET_INFO 0x1
+#define SBI_EXT_PMU_COUNTER_CFG_MATCH 0x2
+#define SBI_EXT_PMU_COUNTER_START 0x3
+#define SBI_EXT_PMU_COUNTER_STOP 0x4
+#define SBI_EXT_PMU_COUNTER_FW_READ 0x5
+#define SBI_EXT_PMU_COUNTER_FW_READ_HI 0x6
+```
+
+If an extension is not supported, OpenSBI returns the corresponding "not supported" error code to tell the caller so.
+
+## Linux kernel SBI code analysis
+
+The following looks at the SBI code in the Linux (v6.4-rc5) source:
+
+### The ecall instruction
+
+The `ecall` instruction is used to make a request to the execution environment. Its effect depends on the privilege level in which it is executed: in user mode it raises an environment-call-from-U-mode exception, in supervisor mode an environment-call-from-S-mode exception, and in machine mode an environment-call-from-M-mode exception.
+
+### SBI code in the Linux kernel
+
+The Linux kernel uses the `ecall` instruction to make SBI calls. The code below is an excerpt from `arch/riscv/kernel/sbi.c`.
+The `sbi_ecall` function takes 8 parameters:
+
+- `ext`: SBI extension ID (EID)
+- `fid`: SBI function ID (FID)
+- `arg0-arg5`: arguments of the SBI function call
+
+```c
+// linux/arch/riscv/kernel/sbi.c: 25
+
+struct sbiret sbi_ecall(int ext, int fid, unsigned long arg0,
+                        unsigned long arg1, unsigned long arg2,
+                        unsigned long arg3, unsigned long arg4,
+                        unsigned long arg5)
+{
+    struct sbiret ret;
+
+    register uintptr_t a0 asm ("a0") = (uintptr_t)(arg0);
+    register uintptr_t a1 asm ("a1") = (uintptr_t)(arg1);
+    register uintptr_t a2 asm ("a2") = (uintptr_t)(arg2);
+    register uintptr_t a3 asm ("a3") = (uintptr_t)(arg3);
+    register uintptr_t a4 asm ("a4") = (uintptr_t)(arg4);
+    register uintptr_t a5 asm ("a5") = (uintptr_t)(arg5);
+    register uintptr_t a6 asm ("a6") = (uintptr_t)(fid);
+    register uintptr_t a7 asm ("a7") = (uintptr_t)(ext);
+    asm volatile ("ecall"
+                  : "+r" (a0), "+r" (a1)
+                  : "r" (a2), "r" (a3), "r" (a4), "r" (a5), "r" (a6), "r" (a7)
+                  : "memory");
+    ret.error = a0;
+    ret.value = a1;
+
+    return ret;
+}
+```
+
+A brief analysis of the code above:
+
+- Before `ecall` is executed, the extension ID is placed in `a7`, the function ID in `a6`, and the arguments in `a0`-`a5`; the SBI implementation later dispatches to a different handler depending on the extension ID
+- The `register` keyword asks the compiler to keep the variable that follows in a register
+- `asm ("ax")` binds the variable to the `ax` register
+- `asm volatile` embeds assembly code in the C code. `a0` and `a1` are passed to `ecall` as both input and output registers, while `a2`-`a7` are passed as input registers only
+- The `ecall` instruction returns two values in `a0` and `a1`; `sbi_ecall` hands them to its caller as the error code and the return value
+
+For example, a putchar function that prints one character to the system console is implemented with the following `sbi_ecall` call:
+
+```c
+// linux/arch/riscv/kernel/sbi.c: 101
+
+void sbi_console_putchar(int ch)
+{
+    sbi_ecall(SBI_EXT_0_1_CONSOLE_PUTCHAR, 0, ch, 0, 0, 0, 0, 0);
+}
+```
+
+Next, open `arch/riscv/include/asm/sbi.h` and look at the definitions:
+
+```c
+// linux/arch/riscv/include/asm/sbi.h: 14
+
+enum sbi_ext_id {
+#ifdef CONFIG_RISCV_SBI_V01
+    SBI_EXT_0_1_SET_TIMER = 0x0,
+    SBI_EXT_0_1_CONSOLE_PUTCHAR = 0x1,
+    SBI_EXT_0_1_CONSOLE_GETCHAR = 0x2,
+    SBI_EXT_0_1_CLEAR_IPI = 0x3,
+    SBI_EXT_0_1_SEND_IPI = 0x4,
+    SBI_EXT_0_1_REMOTE_FENCE_I = 0x5,
+    SBI_EXT_0_1_REMOTE_SFENCE_VMA = 0x6,
+    SBI_EXT_0_1_REMOTE_SFENCE_VMA_ASID = 0x7,
+    SBI_EXT_0_1_SHUTDOWN = 0x8,
+#endif
+    SBI_EXT_BASE = 0x10,
+    SBI_EXT_TIME = 0x54494D45,
+    SBI_EXT_IPI = 0x735049,
+    SBI_EXT_RFENCE = 0x52464E43,
+    SBI_EXT_HSM = 0x48534D,
+    SBI_EXT_SRST = 0x53525354,
+    SBI_EXT_PMU = 0x504D55,
+
+    /* Experimentals extensions must lie within this range */
+    SBI_EXT_EXPERIMENTAL_START = 0x08000000,
+    SBI_EXT_EXPERIMENTAL_END = 0x08FFFFFF,
+
+    /* Vendor extensions must lie within this range */
+    SBI_EXT_VENDOR_START = 0x09000000,
+    SBI_EXT_VENDOR_END = 0x09FFFFFF,
+};
+
+```
+
+We can see that `SBI_EXT_0_1_CONSOLE_PUTCHAR` is defined as `0x1`.
+
+### How Linux stays compatible with different SBI versions
+
+The default SBI version assumed by Linux is currently 0.1. When SBI 0.1 is in use (and `CONFIG_RISCV_SBI_V01` is enabled, as the listing shows), the following code in `arch/riscv/kernel/sbi.c` is used:
+
+```c
+// linux/arch/riscv/kernel/sbi.c: 101
+
+#ifdef CONFIG_RISCV_SBI_V01
+
+// With SBI 0.1 support, the functions below can be called
+...
+void sbi_console_putchar(int ch)
+{
+    sbi_ecall(SBI_EXT_0_1_CONSOLE_PUTCHAR, 0, ch, 0, 0, 0, 0, 0);
+}
+...
+...
+#else
+
+// Without SBI 0.1 support, the stubs only warn that the extension is not available in SBI x.x
+...
+static void __sbi_set_timer_v01(uint64_t stime_value)
+{
+    pr_warn("Timer extension is not available in SBI v%lu.%lu\n",
+            sbi_major_version(), sbi_minor_version());
+}
+...
+...
+#endif /* CONFIG_RISCV_SBI_V01 */
+```
+
+If a newer SBI version is supported, the code below the `#endif` can be used, for example:
+
+```c
+// linux/arch/riscv/kernel/sbi.c: 222
+
+static void __sbi_set_timer_v02(uint64_t stime_value)
+{
+#if __riscv_xlen == 32
+    sbi_ecall(SBI_EXT_TIME, SBI_EXT_TIME_SET_TIMER, stime_value,
+              stime_value >> 32, 0, 0, 0, 0);
+#else
+    sbi_ecall(SBI_EXT_TIME, SBI_EXT_TIME_SET_TIMER, stime_value, 0,
+              0, 0, 0, 0);
+#endif
+}
+```
+
+With the `#ifdef` and `#endif` preprocessor directives, Linux supports both SBI 0.1 and SBI 0.2.
+
+## Interaction between Linux and OpenSBI
+
+We use `sbi_console_putchar` as an example to briefly describe how Linux and OpenSBI interact, so that readers can form a more concrete picture of SBI.
+
+![sbi3](images/introduction-to-riscv-sbi/sbi3.svg)
+
+Take the C `printf()` function as an example of the interaction between Linux and OpenSBI:
+
+First, we call `printf()` (**①**), which traps into the kernel. The Linux kernel then calls its `sbi_ecall()` helper (**②**), which passes the EID, the FID, and the six arguments described earlier to OpenSBI (**③**). OpenSBI then performs the actual hardware operation.
+
+Finally, once the operation is complete, the result is returned level by level (**④ ⑤**), which completes the output to the console.
+
+## Summary
+
+This article introduced the concept, purpose, and implementations of SBI (Supervisor Binary Interface) on the RISC-V architecture, and showed how to build and boot the open-source implementation OpenSBI. SBI lets supervisor-mode software be ported easily across processors with different RISC-V microarchitectures. It abstracts the lower layers behind a uniform interface, so developers do not need to care about low-level details, which makes software much easier to develop. The article also analyzed how SBI calls are made in the Linux kernel, which helps in understanding how SBI and the Linux kernel interact.
+
+The article gives a concise description of SBI, intended to improve the reader's understanding of it and to serve as a guide for studying it further.
+
+## References
+
+- [Volume 1, Unprivileged Specification version 20191213][004]
+- [Volume 2, Privileged Specification version 20211203][003]
+- [RISC-V Supervisor Binary Interface Specification Version -v2.0-rc1, 2023-06-01: Draft][002]
+
+[001]: https://github.com/riscv-non-isa/riscv-sbi-doc
+[002]: https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/v2.0-rc1/riscv-sbi.pdf
+[003]: https://github.com/riscv/riscv-isa-manual/releases/download/Priv-v1.12/riscv-privileged-20211203.pdf
+[004]: https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf
diff --git a/articles/20230615-hugepaged-linear-mapping.md b/articles/20230615-hugepaged-linear-mapping.md
new file mode 100644
index 0000000000000000000000000000000000000000..b106df1493ad636a9b956c30a7e67474028c2d49
--- /dev/null
+++ b/articles/20230615-hugepaged-linear-mapping.md
@@ -0,0 +1,567 @@
+> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.1 - [tables]
+> Author: sugarfillet
+> Date: 2023/06/15
+> Revisor: Falcon falcon@tinylab.org
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Proposal: [RISC-V Linux SMP 技术调研与分析](https://gitee.com/tinylab/riscv-linux/issues/I5MU96)
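+> Sponsor: PLCT Lab, ISCAS
+
+# Understanding the huge-page linear mapping patch for RISC-V Linux
+
+## Preface
+
+Linux v6.4-rc1 brought several fairly important changes. One of them, commit 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the linear mapping"), adds huge-page mappings for the kernel linear address space in order to improve performance and reclaim some wasted memory. However, the patch also introduced quite a few problems, such as a panic during hibernation and a panic during UEFI boot.
+
+This article introduces the commit and its background, and analyzes the two problems it caused.
+
+**Notes**:
+
+- The Linux version used in this article is `Linux v6.4-rc1`
+- The abbreviations used in this article are explained below
+
+| Abbreviation | Full name                       | Notes                                                                                                                  |
+|--------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------|
+| PA           | Physical Address                | Physical address                                                                                                       |
+| VA           | Virtual Address                 | Virtual address                                                                                                        |
+| PPN          | Physical Page Number            | Physical page frame number; together with the page offset it forms a physical address                                  |
+| VPN          | Virtual Page Number             | Virtual page frame number; together with the page offset it forms a virtual address                                    |
+| PTE          | Page Table Entry                | Page table entry; holds the PPN of a leaf page or of a page directory. Also used to mean the last-level page table     |
+| PGD          | Page Global Directory           | Root (page) directory                                                                                                  |
+| P4D/PUD/PMD  | Page 4th/Upper/Middle Directory | Page directories at different levels; each paging scheme uses a different set of them, see below                       |
+
+## RISC-V address translation
+
+As the document "RISC-V-Reader-Chinese-v2p1.pdf" puts it:
+
+> RISC-V S-mode provides a conventional virtual memory system that divides memory into fixed-size pages for address translation and memory protection. When paging is enabled, most addresses (including load and store effective addresses and the address in the PC) are virtual addresses. To access physical memory, they must be translated into real physical addresses by traversing a high-radix tree known as the page table. A leaf node in the page table indicates whether the virtual address has been mapped to a real physical page and, if so, which privilege modes and which types of access may operate on that page. Accessing an unmapped page, or accessing a page with insufficient permissions, results in a page fault exception.
+
+The RISC-V MMU translates the virtual addresses used by instruction fetches, loads, and stores into physical addresses, and the whole translation is driven by the page table structure. RISC-V defines the supported paging schemes in the Mode field of the satp register. The schemes differ mainly in how many levels they use to map virtual addresses of a given width to physical addresses of a given width. For example, the Sv57 scheme uses a 5-level page table to translate a 57-bit virtual address into a 56-bit (44 + 12) physical address:
+
+| satp.mode | SXLEN(pte_len) | VPNs              | PPNs             | page level | page tables         |
+|-----------|----------------|-------------------|------------------|------------|---------------------|
+| Sv32      | 32             | 32 (10+10+12)     | 22 (12+10)       | 2          | PGD PTE             |
+| Sv39      | 64             | 39 (9+9+9+12)     | 44 (26+9+9)      | 3          | PGD PMD PTE         |
+| Sv48      | 64             | 48 (9+9+9+9+12)   | 44 (17+9+9+9)    | 4          | PGD PUD PMD PTE     |
+| Sv57      | 64             | 57 (9+9+9+9+9+12) | 44 (8+9+9+9+9+9) | 5          | PGD P4D PUD PMD PTE |
+
+### The MMU address translation process
+
+The "Virtual Address Translation Process" section of the RISC-V privileged specification describes the translation process for the Sv32 scheme in detail. Here we use Sv57, the default scheme in the QEMU environment, and the kernel load address as an example. Using the per-level page directory addresses kept by the early Linux virtual address mapping (early_p4d/early_pud/early_pmd), we observe how the MMU translates the virtual address `0xffffffff80000000` into its physical address:
+
+The following figure can be used as a reference for the text below:
+
+![sv57_address_trans.png](images/riscv-linear-mapping/sv57_address_trans.png)
+
+First, split the virtual address according to the Sv57 virtual address layout:
+
+```
+va=0xffffffff80000000
+
+0x[f] 111| 1[ff] | [ff] 1|111 [f] 10| 00 [0] 000|0 [00] |[000] // values inside [] are hexadecimal, values outside [] are individual bits; the same notation is used below
+> vpn4 vpn3 vpn2 vpn1 vpn0 offset
+```
+
+Get the root page table from satp.PPN: `early_pg_dir = satp.PPN << 12`, then fetch the page table entry of each level in turn as follows:
+
+1. offset = va.vpn4 (511); pte = early_pg_dir[offset] (0x0000000020380c01) // `01` is not leaf so pte.ppn <<12 => early_p4d
+2. offset = va.vpn3 (511); pte = early_p4d[offset] (0x0000000020380801) // pte.ppn <<12 => early_pud
+3. offset = va.vpn2 (510); pte = early_pud[offset] (0x0000000020381001) // pte.ppn <<12 => early_pmd
+4. offset = va.vpn1 (0); pte = early_pmd[offset] (0x00000000200800ef) // `ef` is leaf so
+
+Step 4 shows that the VA corresponds to the PTE at offset 0 in early_pmd, and that this entry is a leaf. Splitting this PTE according to the PTE format gives the layout below, where ppn[4:0] is the 44-bit physical page number. Shifting the PTE right by 10 bits yields that page frame number; shifting it left by 12 bits and filling in va.offset yields the 56-bit physical address. This physical address, `0x00000080200000`, is exactly the physical address at which the kernel is loaded, and the mapping is stored in the PMD.
+
+```
+pte=0x00000000200800ef
+
+N PBMT |Reserved PPN | RSW D A G U X W R V
+
+0x[00]0 | 000[0000020080] 00 | 00[ef] // pte
+> ppn[4:0] page bits
+
+pte.ppn[4:0] = (0x00000000200800ef >>10) = 0x [000]00080200 (len is 44)
+
+pa = 0x00000080200[000] (pte.ppn[4:0] <<12 | va.off)
+
+```
+
+```c
+(gdb) p/z early_pg_dir
+$31 = {{pgd = 0x0000000000000000} , {pgd = 0x00000000205c1401}, {
+ pgd = 0x0000000000000000} , {pgd = 0x0000000020380c01}} // 511
+(gdb) p/z early_p4d
+$35 = {{p4d = 0x0000000000000000} , {p4d = 0x0000000020380801}} // 511
+(gdb) p/z early_pud
+$36 = {{pud = 0x0000000000000000} , {pud = 0x0000000020381001}, {pud = 0x0000000000000000}} //510
+(gdb) p/z early_pmd
+$37 = {{pmd = 0x00000000200800ef}, {pmd = 0x00000000201000ef}, {pmd = 0x00000000201800ef}, {pmd = 0x00000000202000ef}, {pmd = 0x00000000202800ef}, {pmd = 0x00000000203000ef}, {pmd = 0x00000000203800ef}, {
+ pmd = 0x00000000204000ef}, {pmd = 0x00000000204800ef}, {pmd = 0x00000000205000ef}, {pmd = 0x00000000205800ef}, {pmd = 0x0000000000000000} } // 0
+```
+
+Based on this process, S-mode software that wants to enable virtual memory has to build the page tables and write the PFN of the root page table into the satp register. It also has to execute the `sfence.vma` instruction so that earlier page-table writes are ordered before subsequent address translations.
+
+### How Linux sets up page tables
+
+RISC-V Linux uses the `create_pgd_mapping()` function to create (root) page table mappings. It is used in the following situations:
+
+1. The early memory mapping in `setup_vm()` creates the early temporary page table, `early_pg_dir`, for the fix-mapping and the kernel image
+2. The late memory mapping in `setup_vm_final()`, performed after system memory has been discovered, creates the page table `swapper_pg_dir` for the kernel image and the linear address space
+3. EFI creates the runtime page table, `efi_mm`, for runtime memory
+4. During resume from hibernation, before switching to the page table saved in the hibernation image, `temp_pgtable_mapping()` is called to create a temporary page table
+
+We use the creation of the page table for the kernel linear addresses as the example. By the `setup_vm_final()` stage, system memory has already been added to `memblock.memory`, or reserved in `memblock.reserved`, based on the dtb or the UEFI memory map. `create_linear_mapping_page_table()` maps the usable physical memory in `memblock.memory` to the linear virtual memory region starting at `PAGE_OFFSET`, calling `create_pgd_mapping()` to create the page table, which is finally written to the `satp` register. The arguments passed to `create_pgd_mapping()` are:
+
+1. `swapper_pg_dir`: the root page table, later written to satp
+2. `va`: the virtual address to map
+3. `pa`: the physical address to map
+4. `map_size`: the size of the mapping. The `best_map_size()` function chooses it according to whether the physical address is aligned to PGDIR_SIZE/../PMD_SIZE (the UEFI boot panic discussed later is related to this)
+
+   For example, the physical address `0x00000080200000` is aligned to `PMD_SIZE`, so `PMD_SIZE` is returned, whereas for the physical address `0x00000080210000` `PAGE_SIZE` is returned
+
+5. `prot`: the protection bits of the PTE for this mapping
+
+   For example, the kernel text segment uses `PAGE_KERNEL_READ_EXEC`
+
+```c
+// arch/riscv/mm/init.c : 1248
+
+static void __init setup_vm_final(void)
+  create_linear_mapping_page_table();
+    /* Map all memory banks in the linear mapping */
+    for_each_mem_range(i, &start, &end) {
+        if (start >= end)
+            break;
+        if (start <= __pa(PAGE_OFFSET) &&
+            __pa(PAGE_OFFSET) < end)
+            start = __pa(PAGE_OFFSET);
+        if (end >= __pa(PAGE_OFFSET) + memory_limit)
+            end = __pa(PAGE_OFFSET) + memory_limit;
+
+        create_linear_mapping_range(start, end);
+    }
+
+  csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode); // write the PFN of the root page table to satp
+  local_flush_tlb_all();
+
+static void __init create_linear_mapping_range(phys_addr_t start,
+                                               phys_addr_t end)
+{
+    phys_addr_t pa;
+    uintptr_t va, map_size;
+
+    for (pa = start; pa < end; pa += map_size) {
+        va = (uintptr_t)__va(pa);
+        map_size = best_map_size(pa, end - pa);
+
+        create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
+                           pgprot_from_va(va));
+    }
+}
+```
+
+`create_pgd_mapping()` walks down the page table tree to establish a mapping. Roughly, it works as follows:
+
+1. `pgd_index(va)` gives the index of the VA in the PGD (the index is masked with `PTRS_PER_PGD - 1`, since a PGD holds `PTRS_PER_PGD` entries)
+2. If `map_size` is `PGDIR_SIZE`, the mapping is stored as a leaf PTE at that index in the PGD; otherwise it is stored in the next-level page directory
+3. Storing it in the next-level page directory
+   - If this PTE is empty, the function in the `pt_ops` structure that allocates a next-level page directory is called, and the result is saved in the current PTE. `pt_ops` uses a different allocation backend in each boot stage, see `pt_ops_set_{early,fixmap,late}`
+   - The (virtual) address of the next-level page directory is obtained, and the function that creates the next level, `create_pgd_next_mapping()`, is called; different paging schemes call different functions
+   - If the smallest `map_size`, `PAGE_SIZE`, is requested, `create_pte_mapping()` is finally called to store the PFN of the page's physical address in the PTE
+
+```c
+// arch/riscv/mm/init.c : 635
+
+void __init create_pgd_mapping(pgd_t *pgdp,
+                               uintptr_t va, phys_addr_t pa,
+                               phys_addr_t sz, pgprot_t prot)
+{
+    pgd_next_t *nextp;
+    phys_addr_t next_phys;
+    // Index of this virtual address in its page directory
+    uintptr_t pgd_idx = pgd_index(va); /// (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
+
+    if (sz == PGDIR_SIZE) { // store the mapping in the PGD itself
+        if (pgd_val(pgdp[pgd_idx]) == 0)
+            pgdp[pgd_idx] = pfn_pgd(PFN_DOWN(pa), prot);
+        return;
+    }
+
+    if (pgd_val(pgdp[pgd_idx]) == 0) { // create the next-level directory and save its physical address in this entry
+        next_phys = alloc_pgd_next(va); // e.g. pt_ops.alloc_pmd(__va)
+
+        pgdp[pgd_idx] = pfn_pgd(PFN_DOWN(next_phys), PAGE_TABLE);
+
+        nextp = get_pgd_next_virt(next_phys); // get the virtual address, e.g. pt_ops.get_pmd_virt
+
+        memset(nextp, 0, PAGE_SIZE);
+    } else {
+        next_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_idx])); // entry not empty: take the physical address from it
+        nextp = get_pgd_next_virt(next_phys);
+    }
+
+    create_pgd_next_mapping(nextp, va, pa, sz, prot); // map the address in the next-level directory
+    // which expands to, for example:
+    //   create_pmd_mapping((pmd_t *)__nextp, __va, __pa, __sz, __prot)
+    //     create_pte_mapping(ptep, va, pa, sz, prot)
+    //       ptep[pte_idx] = pfn_pte(PFN_DOWN(pa), prot);
+}
+```
+
+If the system supports the five-level scheme (Sv57) and `map_size` is `PAGE_SIZE`, the calls are `create_p4d_mapping() => create_pud_mapping() => create_pmd_mapping() => create_pte_mapping()`. With the four-level scheme (Sv48) and `map_size` equal to `PAGE_SIZE`, they are `create_pud_mapping() => create_pmd_mapping() => create_pte_mapping()`. With the three-level scheme (Sv39) and `map_size` equal to `PAGE_SIZE`, they are `create_pmd_mapping() => create_pte_mapping()`. With the 32-bit two-level scheme (Sv32), `create_pte_mapping()` is called directly.
+
+```c
+// arch/riscv/mm/init.c : 611
+
+#define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
+    (pgtable_l5_enabled ? \
+    create_p4d_mapping(__nextp, __va, __pa, __sz, __prot) : \
+    (pgtable_l4_enabled ? \
+    create_pud_mapping((pud_t *)__nextp, __va, __pa, __sz, __prot) : \
+    create_pmd_mapping((pmd_t *)__nextp, __va, __pa, __sz, __prot)))
+
+// Simplified outline of create_pmd_mapping()
+create_pmd_mapping(pmd_t *pmdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz, pgprot_t prot)
+  if (sz == PMD_SIZE)
+    pmdp[pmd_idx] = pfn_pmd(PFN_DOWN(pa), prot);
+  pte_phys = PFN_PHYS(_pmd_pfn(pmdp[pmd_idx]));
+  ptep = pt_ops.get_pte_virt(pte_phys);
+  create_pte_mapping(ptep, va, pa, sz, prot);
+    ptep[pte_idx] = pfn_pte(PFN_DOWN(pa), prot);
+
+```
+
+After this section, readers should have a reasonable understanding of MMU address translation and of how Linux creates page tables. Both are background for the commit discussed in this article. Let us continue.
+
+## riscv: Use PUD/P4D/PGD pages for the linear mapping
+
+> The patch itself is not reproduced here; see the commit id (3335068f8721) or the [mail][1]
+
+This patch was introduced in Linux v6.4-rc1 and brings two main changes:
+
+- kernel_map.va_pa_offset changes from `PAGE_OFFSET - kernel_map.phys_addr` to `PAGE_OFFSET - phys_ram_base`.
+
+  va_pa_offset is used to compute `__va()/__pa()`. For example, `__pa()` of a linear address is equivalent to `#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - kernel_map.va_pa_offset)`. As a result of this change, the physical addresses of the linear address space no longer have to start at `kernel_map.phys_addr`, `0x00000080200000` (as described earlier, for this address `map_size` is `PMD_SIZE`, so only 2M PMD mappings are possible). Larger pages can therefore be used for the mapping; for example, the address `0x00000000c0000000` can be mapped at the 1G page directory level. This can lead to better TLB performance.
+
+- `MIN_MEMBLOCK_ADDR` changes from `__pa(PAGE_OFFSET)` to `0`
+
+  This value defines the lower bound for DRAM discovery during early memory discovery. As a result of this change, the physical addresses below `0x00000080200000` can be used, which avoids wasting some memory.
+
+> Note: with this change, any __va/__pa operation performed before `setup_vm_final()` is meaningless. If your code does this, it may fail
+>
+> For convenience, the rest of this article refers to this patch as the "huge-page patch".
+
+## Hibernation panic
+
+I found this problem while analyzing the hibernation feature of Linux v6.4-rc1 (see the community's [three analysis articles][2] on hibernation). After some initial investigation, I sent a [Bug Report][3] mail to the community. This section goes through the problem systematically, following the mailing list discussion:
+
+> The steps to reproduce the problem and the system logs can be found in the mail
+
+Writing `disk` to `/sys/power/state` to trigger hibernation causes a panic. The key logs are:
+
+```sh
+[root@stage4 ~]# echo disk > /sys/power/state
+[ 448.600860] PM: hibernation: hibernation entry
+[ 448.633200] Filesystems sync: 0.023 seconds
+[ 448.637578] Freezing user space processes
+[ 448.642714] Freezing user space processes completed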
(elapsed 0.004 seconds) +[ 448.643801] OOM killer disabled. +[ 448.646150] PM: hibernation: Preallocating image memory +[ 450.448810] PM: hibernation: Allocated 57556 pages for snapshot +[ 450.449347] PM: hibernation: Allocated 230224 kbytes in 1.80 seconds (127.90 MB/s) +[ 450.449950] Freezing remaining freezable tasks +[ 450.453384] Freezing remaining freezable tasks completed (elapsed 0.003 seconds) +[ 450.498622] Disabling non-boot CPUs ... +[ 450.501293] CPU0 attaching NULL sched-domain. +[ 450.501879] CPU1 attaching NULL sched-domain. +[ 450.503247] CPU0 attaching NULL sched-domain. +[ 450.503624] root domain span: 0 (max cpu_capacity = 1024) +[ 450.514289] CPU1: off +[ 450.525152] PM: hibernation: Creating image: +[ 450.525152] PM: hibernation: Need to copy 56199 pages + +[ 450.525152] Oops - load access fault [#1] +[ 450.525152] Modules linked in: +[ 450.525152] CPU: 0 PID: 210 Comm: bash Not tainted 6.4.0-rc1-00004-gcce672326817 #18 +[ 450.525152] Hardware name: riscv-virtio,qemu (DT) +[ 450.525152] epc : swsusp_save+0x2ee/0x45a +[ 450.525152] ra : swsusp_save+0x2b2/0x45a +[ 450.525152] epc : ffffffff809ac404 ra : ffffffff809ac3c8 sp : ff200000007dbc20 +[ 450.525152] gp : ffffffff815dc700 tp : ff6000000346b000 t0 : 65626968203a4d50 +[ 450.525152] t1 : 0000000000080000 t2 : 7265626968203a4d s0 : ff200000007dbc90 +[ 450.525152] s1 : 0000000000000001 a0 : 0000000000000001 a1 : ff5fffff80000000 +[ 450.525152] a2 : ff60000000000000 a3 : 0000000000001000 a4 : 0000000000000000 +[ 450.525152] a5 : ff60000000000000 a6 : ffffffff815eb000 a7 : ffffffffffff8000 +[ 450.525152] s2 : ff6000000ac22000 s3 : ffffffff815dbf45 s4 : 000000000000db87 +[ 450.525152] s5 : 0000000100000000 s6 : 0004000000000000 s7 : 0040000000000000 +[ 450.525152] s8 : ffffffff815dbf44 s9 : ff1c000002000000 s10: 0000000000080000 +[ 450.525152] s11: ffffffff81082060 t3 : 0000000000078000 t4 : ffffffff815f20c7 +[ 450.525152] t5 : ffffffff815f20c8 t6 : ff200000007dba28 +[ 450.525152] status: 
0000000200000100 badaddr: ff60000000000000 cause: 0000000000000005 +[ 450.525152] [] swsusp_save+0x2ee/0x45a +[ 450.525152] [] swsusp_arch_suspend+0x4a/0x98 +[ 450.525152] [] hibernation_snapshot+0x1cc/0x3e2 +[ 450.525152] [] hibernate+0x14e/0x236 +[ 450.525152] [] state_store+0x6a/0x72 +[ 450.525152] [] kobj_attr_store+0xe/0x1a +[ 450.525152] [] sysfs_kf_write+0x32/0x3c +[ 450.525152] [] kernfs_fop_write_iter+0xfa/0x164 +[ 450.525152] [] vfs_write+0x27c/0x31e +[ 450.525152] [] ksys_write+0x68/0xda +[ 450.525152] [] sys_write+0x1a/0x22 +[ 450.525152] [] do_trap_ecall_u+0xc2/0xd6 +[ 450.525152] [] do_trap_ecall_u+0xc2/0xd6 +[ 450.525152] [] ret_from_exception+0x0/0x64 +[ 450.525152] Code: 8f91 8f95 87b3 40fc 8799 07b2 97ae 6685 8633 00e7 (620c) 0633 +[ 450.525152] ---[ end trace 0000000000000000 ]--- +``` + +The log shows that the faulting instruction is at `epc: swsusp_save+0x2ee/0x45a`. Disassembly shows that, inside `do_copy_page`, a load from the address held in register a2 raised a load access fault (visible both in the "Oops" line and in the scause value), which suggests an access to PMP-protected memory. The faulting virtual address, `0xff60000000000000`, is exactly `PAGE_OFFSET`, the start of the kernel linear mapping, so the next step is to find out which physical memory it maps to. + +```c + +1381 *dst++ = *src++; + 0xffffffff809ac400 <+738>: add a2,a5,a4 + 0xffffffff809ac404 <+742>: ld a1,0(a2) // 0xff60000000000000 + 0xffffffff809ac406 <+744>: add a2,s2,a4 + 0xffffffff809ac40a <+748>: addi a4,a4,8 + 0xffffffff809ac40c <+750>: sd a1,0(a2) +``` + +> How Linux discovers memory in this environment (how the memory described in the dtb ends up in memblock) + +During early memory discovery, `parse_dtb() => early_init_dt_scan_memory()` calls `memblock_add()` and records all system memory in `memblock.memory` (the range is bounded by `MIN_MEMBLOCK_ADDR` = 0). Before the final page table `swapper_pg_dir` is built, `early_init_fdt_scan_reserved_mem()` sets up reserved memory: if a node under "reserved-memory" has no "no-map" property, only `memblock_reserve()` is called, not `memblock_mark_nomap()`. The reserved-memory initialization shows up in the following log: + +```c +[ 0.000000] OF: fdt: Looking for usable-memory-range property...
+[ 0.000000] OF: fdt: Reserved memory: reserved region for node 'mmode_resv0@80000000': base 0x0000000080000000, size 0 MiB +[ 0.000000] OF: reserved mem: 0x0000000080000000..0x000000008003ffff (256 KiB) map non-reusable mmode_resv0@80000000 +``` + +This region, `0x0000000080000000..0x000000008003ffff`, is exactly the OpenSBI firmware memory (mmode_resv0@80000000). It is reported as "map" and "non-reusable", so it is present in both `memblock.memory` and `memblock.reserved` and carries no `MEMBLOCK_NOMAP` flag. Later, when `create_linear_mapping_page_table()` builds the linear mapping, `for_each_mem_range` maps this region as well. The kernel page-table dump tool, `ptdump`, shows that the start of the kernel linear mapping maps exactly onto the physical address of the OpenSBI firmware memory: + +```sh +# cat /sys/kernel/debug/kernel_page_tables +... +---[ Linear mapping ]--- +0xff60000000000000-0xff60000000200000 0x0000000080000000 2M PMD D A G . . W R V // firmware memory +0xff60000000200000-0xff60000000c00000 0x0000000080200000 10M PMD D A G . . . R V +0xff60000000c00000-0xff60000001000000 0x0000000080c00000 4M PMD D A G . . W R V +0xff60000001000000-0xff60000001600000 0x0000000081000000 6M PMD D A G . . . R V +0xff60000001600000-0xff60000040000000 0x0000000081600000 1002M PMD D A G . . W R V +0xff60000040000000-0xff60000100000000 0x00000000c0000000 3G PUD D A G . . W R V +---[ Modules/BPF mapping ]--- +---[ Kernel mapping ]--- +0xffffffff80000000-0xffffffff80a00000 0x0000000080200000 10M PMD D A G . X . R V +0xffffffff80a00000-0xffffffff80c00000 0x0000000080c00000 2M PMD D A G . . . R V +0xffffffff80c00000-0xffffffff80e00000 0x0000000080e00000 2M PMD D A G . . W R V +0xffffffff80e00000-0xffffffff81400000 0x0000000081000000 6M PMD D A G . . . R V +0xffffffff81400000-0xffffffff81800000 0x0000000081600000 4M PMD +``` + +> Why doesn't OpenSBI set "no-map" on this firmware memory in the dtb? + +OpenSBI v0.8 introduced commit 6966ad0abe70 ("platform/lib: Allow the OS to map the regions that are protected by PMP"). Since that commit, PMP-protected memory (such as firmware memory) no longer gets the "no-map" property by default and the OS is allowed to map it; a platform_override lets individual platforms (for example sifive,fu540) set "no-map" manually. The motivation is the same as for the huge-page patch described in the previous section: better TLB performance. + +With that, the cause of the panic is fairly clear: + +1.
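Since v0.8, OpenSBI no longer sets the "no-map" property on firmware memory by default +2. The huge-page patch from the previous section causes the OpenSBI firmware memory to be mapped into the linear address space +3. During hibernation, `swsusp_save()` copies the system's memory pages into the hibernation image; copying the firmware memory trips PMP protection, and the hart takes an access fault + +Given these three causes, the possible fixes are: + +1. OpenSBI goes back to setting "no-map" + + This costs TLB performance, and backward compatibility has to be considered + +2. Linux reverts the huge-page patch + + Reverting costs TLB performance (although the patch author has said that mapping the linear address space with huge pages does not actually improve performance) + +3. Skip firmware memory during hibernation + + Parse the "mmode_resv" node early in boot and call hibernation's `register_nosave_regions()` interface so that the region is never saved; see the [experimental patch][4] in the thread. This patch is not general, though: it cannot handle nodes other than "mmode_resv". + +4. Gate the hibernation option `ARCH_HIBERNATION_POSSIBLE` on `NONPORTABLE` + + Hibernation only works on systems with OpenSBI older than v0.8 (or another SBI implementation that does not let the OS map firmware memory). With the option gated on `NONPORTABLE`, users decide from their own system configuration whether to set `NONPORTABLE` and thereby enable or disable hibernation. This is implemented in commit ed309ce52218 ("RISC-V: mark hibernation as nonportable"). + +Personally I prefer option 1, although it will probably be hard to push through. Briefly, my view of the options: + +- Option 1: there is no strong need for the kernel to map firmware memory, so OpenSBI should mark the region "no-map"; backward compatibility can be handled through documentation +- Option 2: if mapping the linear address space with huge pages really brings no performance gain, the patch can be reverted +- Option 3: this is not a hibernation-only problem, and the mailing-list discussion of this option was slightly off from the start + - For example, a kernel module that accesses `PAGE_OFFSET` directly also crashes (that access does not go through the memory allocator, but that does not mean other components, such as [memory debugging stuff][5], will never do the same) +- Option 4: a workaround to keep v6.4 stable; users must judge their firmware environment themselves and enable or disable hibernation accordingly + +Finally, I will keep tracking kernel and OpenSBI changes related to this issue; let's wait and see how it plays out. + +## UEFI boot panic + +I found this problem while analyzing the Linux UEFI boot flow (see the community's [articles][6] on Linux UEFI). After some initial triage I sent a [Bug Report][7] to the mailing list. This section walks through the problem, following that thread: + +> The reproduction steps and the full system log are in the mail. + +The kernel panics during boot. The key log is: + +```sh +[ 0.000000] Unable to handle kernel paging request at virtual address ff6000007fdb1000 +[ 0.000000] Oops [#1] +[ 0.000000] Modules linked in: +[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.4.0-rc1-00007-g6966d7988c4f #65 +[ 0.000000] Hardware name: riscv-virtio,qemu (DT) +[ 0.000000] epc : __memset+0x60/0xfc +[ 0.000000] ra : memblock_alloc_try_nid+0x72/0x82 +[ 0.000000] epc : ffffffff8081d48c ra : ffffffff80a126e4 sp : ffffffff81403e80 +[ 0.000000] gp : ffffffff814fbb38 tp : ffffffff8140dac0 t0 : ff6000007fdb1000 +[ 0.000000] t1 : 0000000000000000 t2 : 5f6b636f6c626d65 s0 : ffffffff81403ec0 +[ 0.000000] s1 : 0000000000026000 a0 :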
ff6000007fdb1000 a1 : 0000000000000000 +[ 0.000000] a2 : 0000000000026000 a3 : ff6000007fdd7000 a4 : 0000000000000000 +[ 0.000000] a5 : ff5fffff7ffc0000 a6 : 0000000000000018 a7 : 0000000000000080 +[ 0.000000] s2 : ff6000007fdb1000 s3 : ffffffffffffffff s4 : 0000000000009e38 +[ 0.000000] s5 : ffffffffffffffff s6 : ff6000007fdd8000 s7 : 0000000000002000 +[ 0.000000] s8 : 00000000000071c8 s9 : 0000000000000000 s10: 0000000000000000 +[ 0.000000] s11: 0000000000000000 t3 : ffffffff80c0be40 t4 : ffffffff80c0be40 +[ 0.000000] t5 : ffffffff80c0bdb0 t6 : ffffffff80c0be40 +[ 0.000000] status: 0000000200000100 badaddr: ff6000007fdb1000 cause: 000000000000000f // Store/AMO page fault +[ 0.000000] [] __memset+0x60/0xfc +[ 0.000000] [] pcpu_embed_first_chunk+0x568/0x738 +[ 0.000000] [] setup_per_cpu_areas+0x22/0xb6 +[ 0.000000] [] start_kernel+0x1ce/0x57e +[ 0.000000] Code: 1007 82b3 40e2 0797 0000 8793 00e7 8305 97ba 8782 (b023) 00b2 +[ 0.000000] ---[ end trace 0000000000000000 ]--- +[ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task! +[ 0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]--- +``` + +This problem differs from the hibernation one in two ways: + +1. The exception type is different + + This is a page fault, an MMU exception that can only occur once paging is enabled, whereas the hibernation problem is an access fault, which is raised by PMA or PMP checks + +2.
Unlike in the hibernation case, firmware memory is not mapped into the linear address space in a UEFI environment + + For reserved firmware memory in a `reserved-memory` node without "no-map", EDK2 (RiscVVirt) records it as `EfiReservedMemoryType` (details below). During Linux UEFI initialization, `efi_init() => reserve_regions()` rebuilds memblock from the EFI memory map; `EfiReservedMemoryType` memory is not added to `memblock.memory`, and is only recorded in `memblock.reserved` later by `early_init_fdt_scan_reserved_mem()`, so it is never mapped into the linear address space. + +The faulting address `0xff6000007fdb1000` belongs to the linear address space. Tracing how the linear mapping is built produces the following log: + +``` +song # lowmem region: [0x0000000081800000 -- 0x00000000ffe3d000], va: 0xff6000007fbc0000, pa: 0x00000000ffc00000, map_size: 200000 ,pg: e7 +song # lowmem region: [0x0000000081800000 -- 0x00000000ffe3d000], va: 0xff6000007fdc0000, pa: 0x00000000ffe00000, map_size: 1000 ,pg: e7 +``` + +The faulting address lies in the range that the 2M PMD mapping at "va: 0xff6000007fbc0000", "pa: 0x00000000ffc00000" is meant to cover. Let's work through that mapping by hand: + +1. map_size = best_map_size(pa, end - pa) + + The mapping size is computed from the physical address and its distance to the end of DRAM. Since "0x00000000ffc00000" is `PMD_SIZE`-aligned, the mapping size is set to `PMD_SIZE` + +2. create_pgd_mapping(swapper_pg_dir, va, pa, map_size, pgprot_from_va(va)); + + This function is covered in detail in the earlier section on how Linux sets up page tables. It calls down to `create_pmd_mapping()`, which uses `pmd_index(va)` as the index and writes the PFN of the current physical address into the PMD entry. However, "va: 0xff6000007fbc0000" is not 2M-aligned, so after `pmd_index(va)` it ends up mapped at "va: 0xff6000007fa00000" in the final page table. The block below expands the address: + + ``` + va = 0xff6000007fbc0000 + + [ff6000007f]101|1[c0]|000 + vpn0 + + // the va after pmd_index(va) + + [ff6000007f]101|0[00]|000 + + real va : 0xff6000007fa00000 + + ``` + +The whole 2M PMD therefore really maps "va: [0xff6000007fa00000,0xff6000007fc00000)" => "pa: [0x00000000ffc00000,0x00000000ffe00000)". Between the end of that virtual range and the start of the next 4K PTE mapping, i.e. `[0xff6000007fc00000,0xff6000007fdc0000)`, there is a hole in the virtual address space. Any access to it causes a page fault, and the faulting address `0xff6000007fdb1000` lies exactly in that interval. + +To close this hole, the map-size calculation must also check that the virtual address is aligned to the candidate size, as in the code below. In this case va is not `PMD_SIZE`-aligned, so the mapping is done with `PAGE_SIZE`. The fix is implemented in riscv/fixes commit 49a0a3731596 ("riscv: Check the virtual alignment before choosing a map size"). + +```c +static uintptr_t __init best_map_size(phys_addr_t pa,
uintptr_t va, + phys_addr_t size) +{ + if (!(pa & (PGDIR_SIZE - 1)) && !(va & (PGDIR_SIZE - 1)) && size >= PGDIR_SIZE) + return PGDIR_SIZE; + + if (!(pa & (P4D_SIZE - 1)) && !(va & (P4D_SIZE - 1)) && size >= P4D_SIZE) + return P4D_SIZE; + + if (!(pa & (PUD_SIZE - 1)) && !(va & (PUD_SIZE - 1)) && size >= PUD_SIZE) + return PUD_SIZE; + + if (!(pa & (PMD_SIZE - 1)) && !(va & (PMD_SIZE - 1)) && size >= PMD_SIZE) + return PMD_SIZE; + + return PAGE_SIZE; +} + +``` + +The takeaway: when calling the `create_pxd_mapping()` interfaces to set up page tables, both the virtual address va and the physical address pa must be aligned to the mapping size map_size. + +The thread unexpectedly turned into a discussion of how memory regions marked "no-map" in a `reserved-memory` node should be recorded in UEFI: + +According to the DT specification (devicetree-specification v0.4-rc1, 3.5.4 "/reserved-memory and UEFI"), reserved memory with "no-map" is recorded as EfiReservedMemoryType, while other reserved memory goes into BootServiceData (which the OS reclaims after ExitBootServices). EDK2 (RiscVVirt), however, ignores the "no-map" property for firmware memory (mmode_resv0) and records it as EfiReservedMemoryType directly, keeping the OS away from it. The relevant code and log: + +```c +// edk2: OvmfPkg/RiscVVirt/Sec/Memory.c :200 + +MemoryPeimInitialization() + Node = fdt_path_offset (FdtPointer, "/reserved-memory/mmode_resv0"); + MmodeResvBase = fdt64_to_cpu (ReadUnaligned64 (RegProp)); + MmodeResvSize = fdt64_to_cpu (ReadUnaligned64 (RegProp + 1)); + InitializeRamRegions ( CurBase, CurSize, MmodeResvBase, MmodeResvSize); + AddReservedMemoryBaseSizeHob (MmodeResvBase, MmodeResvSize); + +// Linux log +[ 0.000000] efi: 0x000080000000-0x00008003ffff [Reserved | | | | | | | | | | | | | |UC] // EfiReservedMemoryType +[ 0.000000] memblock_reserve: [0x00000000f8fd1000-0x00000000f8fd1fff] efi_init+0x150/0x26c +[ 0.000000] memblock_reserve: [0x0000000080200000-0x00000000817fffff] paging_init+0xee/0x5ae +[ 0.000000] memblock_reserve: [0x00000000f2b61000-0x00000000f6760fff] reserve_initrd_mem+0x9a/0xfc +[ 0.000000] memblock_reserve: [0x0000000080000000-0x000000008003ffff] early_init_fdt_scan_reserved_mem+0x242/0x2c6 +[ 0.000000] OF: reserved mem: 0x0000000080000000..0x000000008003ffff (256 KiB) map non-reusable mmode_resv0@80000000 +``` +
+However, UEFI firmware that follows the DT specification (U-Boot, for example) keeps firmware memory without "no-map" in BootServiceData, where the OS reclaims and maps it, which leads to a problem similar to the hibernation panic. The only remedy is for OpenSBI to mark firmware memory "no-map", as Atish Patra put it: + +``` +Let's have a no-map set for the reserved memory set for the firmware. +The fallout would be anybody with kernel > 6.4 has to upgrade the firmware version that sets the no-map correctly +if they care about hibernation or EFI booting. + +OpenSBI v1.3 is planned this month anyway. +We can communicate the same to the rust-sbi project as well. +``` + +At the time of writing, that patch has already been submitted; see [platform/lib: Set no-map attribute on all PMP regions][8]. + +## Summary + +This article first covered the RISC-V MMU address-translation process and how the `create_pgd_mapping()` interface builds page tables, then described the huge-page patch in RISC-V Linux v6.4-rc1 and analyzed the two panics it triggered. To summarize: + +From OpenSBI v0.8 on, firmware memory no longer gets the "no-map" property by default (to improve TLB performance), so the OS may map it. The Linux huge-page patch lowered the bound for physical memory discovery, which put firmware memory into the linear address space, and hibernation hit an access fault when copying it. For now the kernel uses commit ed309ce52218 ("RISC-V: mark hibernation as nonportable") to gate hibernation behind `NONPORTABLE`, temporarily disabling it on OpenSBI versions that do not set "no-map" on firmware memory. + +When the huge-page patch calls `create_pgd_mapping()` to set up page tables, it computes the mapping size without checking virtual-address alignment. That leaves a hole in the virtual address space; accessing the hole raises a page fault and UEFI boot fails. This is fixed in riscv/fixes commit 49a0a3731596 ("riscv: Check the virtual alignment before choosing a map size"). + +The second problem also exposed another latent issue: with an OpenSBI that does not set "no-map" on firmware memory, combined with firmware that follows the DT specification (/reserved-memory and UEFI), the OS maps firmware memory, which leads to a problem similar to the hibernation panic. This eventually pushed OpenSBI to restore "no-map" on firmware memory, expected in OpenSBI v1.3. + +If you run RISC-V Linux v6.4-rc1 or later and hit either of these problems, you can downgrade OpenSBI to a version before v0.8 or apply this [patch][8] to your OpenSBI. + +## References + +- [RISC-V hibernation implementation analysis][2] +- [RISC-V Linux kernel UEFI boot process analysis][6] +- [Bug report: kernel paniced when system hibernates][3] +- [Bug report: kernel paniced while booting with UEFI][7] + +[1]: https://lore.kernel.org/r/20230324155421.271544-4-alexghiti@rivosinc.com +[2]: https://gitee.com/tinylab/riscv-linux/pulls/694 +[3]: https://lore.kernel.org/linux-riscv/CAAYs2=gQvkhTeioMmqRDVGjdtNF_vhB+vm_1dHJxPNi75YDQ_Q@mail.gmail.com/ +[4]:
https://lore.kernel.org/linux-riscv/CAAYs2=jEPQLwe83UDVFStLuei4C+8ZuHJ98_J13RhobpjkGBVw@mail.gmail.com/ +[5]: https://lore.kernel.org/linux-kernel/20230530080425.18612-1-alexghiti@rivosinc.com/ +[6]: https://gitee.com/tinylab/riscv-linux/pulls/660 +[7]: https://lore.kernel.org/linux-riscv/tencent_7C3B580B47C1B17C16488EC1@qq.com/ +[8]: https://github.com/riscv-software-src/opensbi/commit/8153b2622b08802cc542f30a1fcba407a5667ab9 diff --git a/articles/20230615-section-gc-part3.md b/articles/20230615-section-gc-part3.md new file mode 100644 index 0000000000000000000000000000000000000000..9e2534587e5a2525aeaed2c98c8d39cfa2bed60a --- /dev/null +++ b/articles/20230615-section-gc-part3.md @@ -0,0 +1,323 @@ +> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.2-rc1 - [spaces toc codeinline urls]
+> Author: 谭源
+> Date: 2023/06/15
+> Revisor: Falcon
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Sponsor: PLCT Lab, ISCAS + +# Section GC Analysis: How References Are Established + +## Overview + +The [previous article][001] described how the gold linker removes unused sections when the `--gc-sections` option is enabled. + +This article uses the source of the `ld.bfd` linker (the `ld` used by default) to explore how the linker establishes reference relationships. + +## Preparation + +### Downloading the code + +```bash +wget https://ftp.gnu.org/gnu/binutils/binutils-2.40.tar.gz +tar xvf binutils-2.40.tar.gz +cd binutils-2.40/ +``` + +Or clone the `binutils` repository: + +```bash +git clone https://mirrors.tuna.tsinghua.edu.cn/git/binutils-gdb.git +``` + +### Building + +```bash +make all-ld -j +``` + +The resulting `ld.bfd` linker is at `ld/ld-new`. + +### Setting up the debug environment + +Write a test program `test.c`: + +```c +int fun1() +{ + return 0; +} + +int fun2() +{ + return 0; +} + +int un_used(){ + return 0; +} + +int main(){ + fun1(); + fun2(); + return 0; +} +``` + +`fun1()` and `fun2()` are both called by `main()` and must be kept during GC; `un_used()` is never used and must be removed during GC. + +As in the previous article, we write a configuration file so that we can debug directly in VSCode. See the [previous article][001] for how to use it. + +```json +{ + "version": "0.2.0", + "configurations": [ + { + "name": "GDB BFD", + "type": "cppdbg", + "request": "launch", + "program": "${workspaceFolder}/ld/ld-new", + "args": [ + "--gc-sections", + "-dynamic-linker", + "/lib64/ld-linux-x86-64.so.2", + "-pie", + "/usr/lib/gcc/x86_64-pc-linux-gnu/13.1.1/../../../../lib/Scrt1.o", + "/usr/lib/gcc/x86_64-pc-linux-gnu/13.1.1/../../../../lib/crti.o", + "/usr/lib/gcc/x86_64-pc-linux-gnu/13.1.1/crtbeginS.o", + "-L/usr/lib/gcc/x86_64-pc-linux-gnu/13.1.1", + "-L/usr/lib/gcc/x86_64-pc-linux-gnu/13.1.1/../../../../lib", + "-L/lib/../lib", + "-L/usr/lib/../lib", + "-L/usr/lib/gcc/x86_64-pc-linux-gnu/13.1.1/../../..", + "test.o", + "-lgcc_s", + "-lc", + "-lgcc", + "/usr/lib/gcc/x86_64-pc-linux-gnu/13.1.1/crtendS.o", + "/usr/lib/gcc/x86_64-pc-linux-gnu/13.1.1/../../../../lib/crtn.o" + ], + "cwd": "${workspaceFolder}", + "setupCommands": [ + { + "description": "Enable pretty-printing for gdb", + "text": "-enable-pretty-printing" + } + ], + "stopAtEntry": false + } + ] +} +``` + +### Terminology + +- Symbol: a symbol usually refers to the name of a variable or function. In C, for example, when a function or variable is declared, the compiler records its name as a symbol. The symbol table is a data structure holding all symbols and their associated information; the linker relies on it to find and resolve references. + +- Relocation: relocation is a key step in compiling and linking. When the compiler compiles source code, it does not know where in memory each symbol will finally be placed. The object files it produces therefore contain placeholders that must be filled in with real addresses at link time, and these placeholders are what needs relocating. For example, when one function calls another, the compiler may not know the callee's final address at compile time, so it emits a placeholder. At link time the linker finds the callee's real address and patches the placeholder, completing the relocation. + +- Relocation Entry: when the assembler meets a reference to a target whose final location is unknown, it generates a relocation entry that tells the linker how to modify the reference when merging object files into an executable. + + ```C + typedef struct + { + Elf64_Addr r_offset; // Section offset of the reference to be modified + Elf64_Xword r_info; // Holds the symbol table index and the relocation type. + Elf64_Sxword r_addend; + } Elf64_Rela; + ``` + +## Function call chain analysis + +The `_bfd_elf_gc_mark()` function in `elflink.c` is, as its name suggests, what marks the sections that are in use. + +```C +bool +_bfd_elf_gc_mark (struct bfd_link_info *info, + asection *sec, + elf_gc_mark_hook_fn gc_mark_hook) +{ + bool ret; + asection *group_sec, *eh_frame; + + sec->gc_mark = 1; + + /* Mark all the sections in the group. */ + group_sec = elf_section_data (sec)->next_in_group; + if (group_sec && !group_sec->gc_mark) + if (!_bfd_elf_gc_mark (info, group_sec, gc_mark_hook)) + return false; + + /* Look through the section relocs.
*/ + ret = true; + eh_frame = elf_eh_frame_section (sec->owner); + if ((sec->flags & SEC_RELOC) != 0 + && sec->reloc_count > 0 + && sec != eh_frame) + { + struct elf_reloc_cookie cookie; + + if (!init_reloc_cookie_for_section (&cookie, info, sec)) + ret = false; + else + { + for (; cookie.rel < cookie.relend; cookie.rel++) + if (!_bfd_elf_gc_mark_reloc (info, sec, gc_mark_hook, &cookie)) + { + ret = false; + break; + } + fini_reloc_cookie_for_section (&cookie, sec); + } + } + + if (ret && eh_frame && elf_fde_list (sec)) + { + struct elf_reloc_cookie cookie; + + if (!init_reloc_cookie_for_section (&cookie, info, eh_frame)) + ret = false; + else + { + if (!_bfd_elf_gc_mark_fdes (info, sec, eh_frame, + gc_mark_hook, &cookie)) + ret = false; + fini_reloc_cookie_for_section (&cookie, eh_frame); + } + } + + eh_frame = elf_section_eh_frame_entry (sec); + if (ret && eh_frame && !eh_frame->gc_mark) + if (!_bfd_elf_gc_mark (info, eh_frame, gc_mark_hook)) + ret = false; + + return ret; +} +``` + +Let's set its logic aside for now and look at its call chain first. + +Set a breakpoint on this function and keep hitting continue until `sec.name` is `.text.main`. + +![image-20230615160546236](images/20230615-section-gc-part3/image-20230615160546236.png) + +The call stack at the lower left shows two `_bfd_elf_gc_mark()` frames, and `r_offset` is 10. + +Continuing from line 13829 enters `_bfd_elf_gc_mark_reloc()`, which in turn calls `_bfd_elf_gc_mark()` once more. + +![image-20230601152244101](images/20230615-section-gc-part3/image-20230601152244101.png) + +This pushes two more frames, so the stack now holds three `_bfd_elf_gc_mark()` frames. Clicking an entry under Call Stack on the left switches frames so you can inspect the values in each. + +| `frame` | `sec.name` | +|-----------|--------------| +| `frame 5` | `.text.fun1` | +| `frame 3` | `.text.main` | +| `frame 1` | `.text` | + +The table above lists `sec.name` in each frame, i.e. the section that frame is processing. It shows that the newly pushed frame is handling `.text.fun1`. + +![image-20230601153457712](images/20230615-section-gc-part3/image-20230601153457712.png) + +Once `frame 5` and `frame 4` return and execution resumes in `frame 3`, the for loop increments `cookie.rel` and moves on to the next reference of `.text.main`. From the figure above, that entry's `r_offset` is 20. `_bfd_elf_gc_mark_reloc()` is called here, and it again calls `_bfd_elf_gc_mark()` to handle this reference, pushing new frames and re-creating `frame 4` and `frame 5`. + +![image-20230601222533263](images/20230615-section-gc-part3/image-20230601222533263.png) + +The table below shows the call stack after `frame 5` is re-created. Unlike the previous table, `sec.name` in `frame 5` is now `.text.fun2`. + +| `frame` | `sec.name` | +|-----------|--------------| +| `frame 5` | `.text.fun2` | +| `frame 3` | `.text.main` | +| `frame 1` | `.text` | + +From this we can infer that the linker recursively scans the sections a section refers to: when a section is scanned, its gc_mark is set to 1 and its references are walked (each pushing new frames onto the call stack); the scan of that section ends only when those nested calls have returned and the for loop has finished. + +## Data structures and code walkthrough + +The sections referenced by the current section are walked by this piece of `_bfd_elf_gc_mark()`: + +```C + for (; cookie.rel < cookie.relend; cookie.rel++) + if (!_bfd_elf_gc_mark_reloc (info, sec, gc_mark_hook, &cookie)) + { + ret = false; + break; + } +``` + +`_bfd_elf_gc_mark()` calls `_bfd_elf_gc_mark_reloc()`. + +Here `cookie` is of type `elf_reloc_cookie`: + +```c +struct elf_reloc_cookie +{ + Elf_Internal_Rela *rels, *rel, *relend; // Relocation entries of the ELF file: rels is the start of the array, rel the entry currently being processed, relend the end + Elf_Internal_Sym *locsyms; // Local symbol table of the ELF file. + bfd *abfd; + size_t locsymcount; + size_t extsymoff; + struct elf_link_hash_entry **sym_hashes; + int r_sym_shift; + bool bad_symtab; +}; +``` + +So the loop walks all relocation entries from `cookie.rel` up to `cookie.relend`, calling `_bfd_elf_gc_mark_reloc` on each one. + +The table below shows the stack while `.text.fun2` is being processed: + +| `frame` | Function called | Section processed | +|-----------|----------------------------|--------------| +| `frame 5` | `_bfd_elf_gc_mark()` | `.text.fun2` | +| `frame 4` | `_bfd_elf_gc_mark_reloc()` | `.text.fun2` | +| `frame 3` | `_bfd_elf_gc_mark()` | `.text.main` | +| `frame 2` | `_bfd_elf_gc_mark_reloc()` | `.text.main` | +| `frame 1` | `_bfd_elf_gc_mark()` | `.text` | + +## Relocation entries in ELF + +From the analysis above, the linker learns which other sections a section refers to from its relocation entries, and those entries are stored in the ELF file itself. + +```bash +$readelf -r test.o + +Relocation section '.rela.text.main' at offset 0x278 contains 2 entries: + Offset Info Type Sym. Value Sym.
Name + Addend +00000000000a 000600000004 R_X86_64_PLT32 0000000000000000 fun1 - 4 +000000000014 000700000004 R_X86_64_PLT32 0000000000000000 fun2 - 4 + +Relocation section '.rela.eh_frame' at offset 0x2a8 contains 4 entries: + Offset Info Type Sym. Value Sym. Name + Addend +000000000020 000200000002 R_X86_64_PC32 0000000000000000 .text.fun1 + 0 +000000000040 000300000002 R_X86_64_PC32 0000000000000000 .text.fun2 + 0 +000000000060 000400000002 R_X86_64_PC32 0000000000000000 .text.un_used + 0 +000000000080 000500000002 R_X86_64_PC32 0000000000000000 .text.main + 0 +``` + +The output of this command gives the following table: + +| Sym. Name | Offset (hex) | Offset (decimal) | +|-----------|-----------------|---------------| +| `fun1` | 00000000000a | 10 | +| `fun2` | 000000000014 | 20 | + +These match the `r_offset` values of 10 and 20 seen in the call chain analysis, and `.rela.text.main` has no entry for `un_used`. This shows that the linker reads exactly this information to resolve reference relationships. + +## Summary + +By following the linker as it links a simple program, we analyzed at the source level how, with `--gc-sections` enabled, the linker determines which other function sections a function's section refers to. + +The linker extracts the reference information from the relocation entries in the ELF file. + +The linker does the same for global variables. The `-fdata-sections` option places each global variable in its own section (for example `.bss.<name>` for a zero-initialized variable). If `fun1()` uses a global variable named used, the `.bss.used` section is resolved while walking the references of `fun1()`. + +## References + +- Tiny Linux Kernel Project: Section Garbage Collection Patchset +- [Relocation - Computer Systems: A Programmer's Perspective (CSAPP)][003] +- [Symbols and Symbol Tables - Computer Systems: A Programmer's Perspective (CSAPP)][002] + +[001]: 20230526-section-gc-part2.md +[002]: https://hansimov.gitbook.io/csapp/part2/ch07-linking/7.5-symbols-and-symbol-tables +[003]: https://hansimov.gitbook.io/csapp/part2/ch07-linking/7.7-relocation diff --git a/articles/20230617-riscv-klibc-opt-summary.md b/articles/20230617-riscv-klibc-opt-summary.md new file mode 100644 index 0000000000000000000000000000000000000000..340704d105ced1dd5e0496607c3eda643bdd1899 --- /dev/null +++ b/articles/20230617-riscv-klibc-opt-summary.md @@ -0,0 +1,618 @@ +> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.1 - [urls pangu autocorrect]
+> Author: Jingqing 2351290287@qq.com
+> Date: 2023/6/17
+> Revisor: Falcon
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Proposal: [【老师提案】RISC-V Generic library routines and assembly 技术调研、分析与优化 · Issue #I64R6O · 泰晓科技/RISCV-Linux - Gitee.com](https://gitee.com/tinylab/riscv-linux/issues/I64R6O)
+> Sponsor: PLCT Lab, ISCAS + +# 近半年 RISC-V 内核库中 str 和 mem 函数的优化内容总结 + +## 简介 + +本文结合 简要梳理了一下 RISC-V Linux 内核库函数的优化演进情况,主要涉及 Memory, String 操作两大部分。 + +## Memory + +### riscv: optimized mem* functions + +[riscv: optimized mem* functions][002] + +该组 patchset 对各种 mem 相关操作函数进行了优化,以下逐个分析。 + +#### memcpy + +主要是由“直接逐字节复制”转变为“先对齐再按字复制”。 + +1. 如果仍未启用高效对齐访问 CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS,则先在不改变 dest 和 src 相对距离的情况下将 desc 对齐在字边界上。 +2. 如果 `distance==0` 说明 src 和 dest 两者已经对齐,直接进行(32 or 64 bits)字长复制。 +3. 如果 `distance !=0` 说明未对齐,按照差值逐字复制。 + +```c ++void *__memcpy(void *dest, const void *src, size_t count) ++{ ++ union const_types s = { .as_u8 = src }; ++ union types d = { .as_u8 = dest }; ++ int distance = 0; ++ ++ if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) { ++ if (count < MIN_THRESHOLD) ++ goto copy_remainder; ++ ++ /* Copy a byte at time until destination is aligned. */ ++ for (; d.as_uptr & WORD_MASK; count--) ++ *d.as_u8++ = *s.as_u8++; ++ ++ distance = s.as_uptr & WORD_MASK; ++ } ++ ++ if (distance) { ++ unsigned long last, next; ++ ++ /* ++ * s is distance bytes ahead of d, and d just reached ++ * the alignment boundary. Move s backward to word align it ++ * and shift data to compensate for distance, in order to do ++ * word-by-word copy. ++ */ ++ s.as_u8 -= distance; ++ ++ next = s.as_ulong[0]; ++ for (; count >= BYTES_LONG; count -= BYTES_LONG) { ++ last = next; ++ next = s.as_ulong[1]; ++ ++ d.as_ulong[0] = last >> (distance * 8) | ++ next << ((BYTES_LONG - distance) * 8); ++ ++ d.as_ulong++; ++ s.as_ulong++; ++ } ++ ++ /* Restore s with the original offset. */ ++ s.as_u8 += distance; ++ } else { ++ /* ++ * If the source and dest lower bits are the same, do a simple ++ * 32/64 bit wide copy. 
++ */ ++ for (; count >= BYTES_LONG; count -= BYTES_LONG) ++ *d.as_ulong++ = *s.as_ulong++; ++ } ++ ++copy_remainder: ++ while (count--) ++ *d.as_u8++ = *s.as_u8++; ++ ++ return dest; ++} ++EXPORT_SYMBOL(__memcpy); ++ ++void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy); ++EXPORT_SYMBOL(memcpy); +``` + +#### memmove + +如果 dest 和 src 不重叠或者 `dest src) { ++ const char *s = src + count; ++ char *tmp = dest + count; ++ ++ while (count--) ++ *--tmp = *--s; ++ } ++ return dest; ++} ++EXPORT_SYMBOL(__memmove); ++ ++void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove); ++EXPORT_SYMBOL(memmove); +``` + +#### memset + +旧 memset:永远一次一个字节地填充。安全但是效率低。 + +修改后:也是采用对齐机制,先按字节填充,等到和最大填充单位的倍数对齐时按最大填充单位填入。 + +```c ++void *__memset(void *s, int c, size_t count) ++{ ++ union types dest = { .as_u8 = s }; ++ ++ if (count >= MIN_THRESHOLD) { ++ unsigned long cu = (unsigned long)c; ++ ++ /* Compose an ulong with 'c' repeated 4/8 times */ ++#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER ++ cu *= 0x0101010101010101UL; ++#else ++ cu |= cu << 8; ++ cu |= cu << 16; ++ /* Suppress warning on 32 bit machines */ ++ cu |= (cu << 16) << 16; ++#endif ++ if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) { ++ /* ++ * Fill the buffer one byte at time until ++ * the destination is word aligned. 
++ */ ++ for (; count && dest.as_uptr & WORD_MASK; count--) ++ *dest.as_u8++ = c; ++ } ++ ++ /* Copy using the largest size allowed */ ++ for (; count >= BYTES_LONG; count -= BYTES_LONG) ++ *dest.as_ulong++ = cu; ++ } ++ ++ /* copy the remainder */ ++ while (count--) ++ *dest.as_u8++ = c; ++ ++ return s; ++} ++EXPORT_SYMBOL(__memset); ++ ++void *memset(void *s, int c, size_t count) __weak __alias(__memset); ++EXPORT_SYMBOL(memset); +``` + +### riscv: lib: optimize memcmp with ld insn + +[riscv: lib: optimize memcmp with ld insn][003] + +这笔优化发到了 v3, 但是 Maintainer 反馈了一些编译问题,没有看到作者提交新的版本。 + +这笔优化的核心代码和解读如下: + +旧代码: + +``` +sb a1, 0(t0) +addi t0, t0, 1 +bltu t0, a3, 5b +``` + +新代码: + +``` +/* fill head and tail with minimal branching */ +sb a1, 0(t0) +sb a1, -1(a3) +li a4, 2 +bgeu a4, a2, 6f + +sb a1, 1(t0) +sb a1, 2(t0) +sb a1, -2(a3) +sb a1, -3(a3) +li a4, 6 +bgeu a4, a2, 6f + +/* + * Adding additional detection to avoid + * redundant stores can lead + * to better performance + */ +sb a1, 3(t0) +sb a1, -4(a3) +li a4, 8 +bgeu a4, a2, 6f + +sb a1, 4(t0) +sb a1, -5(a3) +li a4, 10 +bgeu a4, a2, 6f + +sb a1, 5(t0) +sb a1, 6(t0) +sb a1, -6(a3) +sb a1, -7(a3) +li a4, 14 +bgeu a4, a2, 6f + +/* store the last byte */ +sb a1, 7(t0) +``` + +主要的改动如下: + +1. 将旧代码中的一行 `addi t0, t0, 1` 替换为一系列新的存储指令,用于填充头部和尾部。新代码中的存储指令是以一定的间隔连续存储数据。 +2. 添加了额外的条件检测和分支,以避免重复存储,这可能会提高性能。 +3. 添加了一行 `li a4, 2` 来设置一个常数,用于条件比较。 +4. 添加了 `6f` 标签,用于跳转到代码的结尾。 + +它的核心优化思路是用许多分支结构填充头尾,这样虽然可能有一部分存储冗余,但是因为并行存储,减少跳转次数,提高了效率。 + +### RISC-V: Apply Zicboz to clear_page and memset + +[RISC-V: Apply Zicboz to clear_page and memset][004] + +引入 Zicboz 扩展后,Zicboz 块大小的内存自然对齐。因此要对接收任意内存块地址和大小的 memset() 来清空内存的方法进行优化。 + +分析发现当输入的地址未对齐或者太小时,Zicboz 中的 memset 会显得效率低一些(多了几十条指令)。 + +1. 首先检查是否启用了 CONFIG_RISCV_ISA_ZICBOZ 来判断是否使用 Zicboz 扩展。如果不使用 Zicboz 扩展或者传入的参数不适合使用 Zicboz 扩展,则代码会跳转到.Ldo_memset 标签处执行内存清零的逻辑。 +2. 如果使用 Zicboz 扩展进行内存清零,代码会将地址和长度进行对齐,并使用 Zicboz 扩展的指令进行内存清零操作。 +3. 
在进行 Zicboz 扩展内存清零时,如果还有一些字节无法使用 Zicboz 扩展一次性清零,则会使用 Duff's 设备来处理剩余的字节。 + +```c ++#ifdef CONFIG_RISCV_ISA_ZICBOZ ++ ALT_ZICBOZ("j .Ldo_memset", "nop") ++ /* ++ * t1 will be the Zicboz block size. ++ * Zero means we're not using Zicboz, and we don't when a1 != 0 ++ */ ++ li t1, 0 ++ bnez a1, .Ldo_memset ++ la a3, riscv_cboz_block_size ++ lw t1, 0(a3) ++ ++ /* ++ * Round to nearest Zicboz block-aligned address ++ * greater than or equal to the start address. ++ */ ++ addi a3, t1, -1 ++ not t2, a3 /* t2 is Zicboz block size mask */ ++ add a3, t0, a3 ++ and t3, a3, t2 /* t3 is Zicboz block aligned start */ ++ ++ /* Did we go too far or not have at least one block? */ ++ add a3, a0, a2 ++ and a3, a3, t2 ++ bgtu a3, t3, .Ldo_zero ++ li t1, 0 ++ j .Ldo_memset ++ ++.Ldo_zero: ++ /* Use Duff for initial bytes if there are any */ ++ bne t3, t0, .Ldo_memset ++ ++.Ldo_zero2: ++ /* Calculate end address */ ++ and a3, a2, t2 ++ add a3, t0, a3 ++ sub a4, a3, t0 ++ ++.Lzero_loop: ++ CBO_ZERO(t0) ++ add t0, t0, t1 ++ bltu t0, a3, .Lzero_loop ++ li t1, 0 /* We're done with Zicboz */ ++ ++ sub a2, a2, a4 /* Update count */ ++ sltiu a3, a2, 16 ++ bnez a3, .Lfinish ++ ++ /* t0 is Zicboz block size aligned, so it must be SZREG aligned */ ++ j .Ldo_duff3 ++#endif ++ +``` + +### RISC-V: Optimize memset for data sizes less than 16 bytes + +[RISC-V: Optimize memset for data sizes less than 16 bytes][006] ... 
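As a rough illustration of the head/tail technique this patch applies to short buffers, the store sequence quoted earlier (`sb a1, 0(t0)`, `sb a1, -1(a3)`, `li a4, 2`, `bgeu a4, a2, 6f`, ...) can be mirrored in C. This is a sketch only: `small_fill` is a hypothetical name, not a kernel function, it assumes `n < 16`, and the kernel implementation itself is RISC-V assembly.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not kernel code): fill n bytes, n < 16, by writing
 * from both ends of the buffer. Some stores overlap, which trades a few
 * redundant writes for far fewer branches than a byte-by-byte loop.
 */
static void small_fill(unsigned char *s, unsigned char c, size_t n)
{
    if (n == 0)
        return;

    /* First and last byte: enough for n <= 2. */
    s[0] = c;
    s[n - 1] = c;
    if (n <= 2)
        return;

    /* Two more bytes from each end: enough for n <= 6. */
    s[1] = c;
    s[2] = c;
    s[n - 2] = c;
    s[n - 3] = c;
    if (n <= 6)
        return;

    /* One more from each end: enough for n <= 8. */
    s[3] = c;
    s[n - 4] = c;
    if (n <= 8)
        return;

    /* One more from each end: enough for n <= 10. */
    s[4] = c;
    s[n - 5] = c;
    if (n <= 10)
        return;

    /* Two more from each end: enough for n <= 14. */
    s[5] = c;
    s[6] = c;
    s[n - 6] = c;
    s[n - 7] = c;
    if (n <= 14)
        return;

    /* n == 15: the single byte left in the middle. */
    s[7] = c;
}
```

For `n = 5`, for instance, the function issues six stores and returns after the second size check; one byte is written twice, which is the redundancy accepted in exchange for fewer branches.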
+ +在上述 memset 优化的基础上继续进行。 + +大于等于 16 字节先对齐后按 16byte 倍数存储。对于尾部数据或小于 16 字节的数据,memset 使用字节存储,效率相对低。改进方案决定用许多分支结构填充头尾,这样虽然可能有一部分存储冗余,但是因为并行存储,减少跳转次数,提高了效率。 + +```c ++void *__memset(void *s, int c, size_t count) ++{ ++ union types dest = { .as_u8 = s }; ++ ++ if (count >= MIN_THRESHOLD) { ++ unsigned long cu = (unsigned long)c; ++ ++ /* Compose an ulong with 'c' repeated 4/8 times */ ++#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER ++ cu *= 0x0101010101010101UL; ++#else ++ cu |= cu << 8; ++ cu |= cu << 16; ++ /* Suppress warning on 32 bit machines */ ++ cu |= (cu << 16) << 16;//8bits 的 c 复制 4 次来构造 unsigned long 的 cu ++#endif ++ if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) { ++ /* ++ * Fill the buffer one byte at time until ++ * the destination is word aligned. ++ */ ++ for (; count && dest.as_uptr & WORD_MASK; count--) ++ *dest.as_u8++ = c;//逐字节填充对应地址中的值=c ++ } ++ ++ /* Copy using the largest size allowed */ ++ for (; count >= BYTES_LONG; count -= BYTES_LONG) ++ *dest.as_ulong++ = cu;//BYTES_LONG 的整数倍部分复制为 cu ++ } ++ ++ /* copy the remainder */ ++ while (count--) ++ *dest.as_u8++ = c;//剩余值全部设置为 c ++ ++ return s; ++} ++EXPORT_SYMBOL(__memset); ++ ++void *memset(void *s, int c, size_t count) __weak __alias(__memset); ++EXPORT_SYMBOL(memset); +``` + +## String + +### Zbb string optimizations + +[Zbb string optimizations][001] + +主要是为 zbb 提供了通用的一些字符串支持,后续特定用法优化拓展需要单独实现。 + +- 为 Zbb 系统添加了允许未对齐访问的 strcmp,strncmp,strlen 以及生成相应 makefile 文件。 + +- 用位域而不是数字代替 CPU 的补丁拓展 errata-id 的宏定义,简化。 + + ```c + -#define CPUFEATURE_SVPBMT 0 + -#define CPUFEATURE_ZICBOM 1 + -#define CPUFEATURE_ZBB 2 + +#define CPUFEATURE_SVPBMT (1 << 0) + +#define CPUFEATURE_ZICBOM (1 << 1) + +#define CPUFEATURE_ZBB (1 << 2) + ``` + +### Zbb+ fast-unaligned string optimization + +[Zbb + fast-unaligned string optimization][005] ... 
+ +添加多个 strcmp 变体用于快速比较非对齐访问。优先使用效率高的优化变体,在无法生效的情况下退回到通用情况。 + +```c ++static bool __init_or_module cpufeature_probe_fast_unaligned(unsigned int stage) ++{ ++ int cpu; ++ ++ if (stage == RISCV_ALTERNATIVES_EARLY_BOOT) ++ return false; ++ ++ for_each_possible_cpu(cpu) { ++ long perf = per_cpu(misaligned_access_speed, cpu); ++ ++ if (perf != RISCV_HWPROBE_MISALIGNED_FAST) ++ return false; ++ } ++ ++ return true; ++} ++ +``` + +#### strcmp_zbb + +检查两个字符串是否对齐到 SZREG 的边界。如果是,则以 SZREG 为单位比较两个字符串中的内容。如果不是,则按字节读取。 + +```c ++/* ++ * Variant of strcmp using the ZBB extension if available ++ */ ++#ifdef CONFIG_RISCV_ISA_ZBB ++strcmp_zbb: ++ ++.option push ++.option arch,+zbb ++ ++ /* ++ * Returns ++ * a0 - comparison result, value like strcmp ++ * ++ * Parameters ++ * a0 - string1 ++ * a1 - string2 ++ * ++ * Clobbers ++ * t0, t1, t2, t3, t4, t5 ++ */ ++ ++ or t2, a0, a1 ++ li t4, -1 ++ and t2, t2, SZREG-1 ++ bnez t2, 3f ++ ++ /* Main loop for aligned string. */ ++ .p2align 3 ++1: ++ REG_L t0, 0(a0) ++ REG_L t1, 0(a1) ++ orc.b t3, t0 ++ bne t3, t4, 2f ++ addi a0, a0, SZREG ++ addi a1, a1, SZREG ++ beq t0, t1, 1b ++ ++ /* ++ * Words don't match, and no null byte in the first ++ * word. Get bytes in big-endian order and compare. ++ */ ++#ifndef CONFIG_CPU_BIG_ENDIAN ++ rev8 t0, t0 ++ rev8 t1, t1 ++#endif ++ ++ /* Synthesize (t0 >= t1) ? 1 : -1 in a branchless sequence. */ ++ sltu a0, t0, t1 ++ neg a0, a0 ++ ori a0, a0, 1 ++ ret ++ ++2: ++ /* ++ * Found a null byte. ++ * If words don't match, fall back to simple loop. ++ */ ++ bne t0, t1, 3f ++ ++ /* Otherwise, strings are equal. */ ++ li a0, 0 ++ ret ++ ++ /* Simple loop for misaligned strings. 
*/ ++ .p2align 3 ++3: ++ lbu t0, 0(a0) ++ lbu t1, 0(a1) ++ addi a0, a0, 1 ++ addi a1, a1, 1 ++ bne t0, t1, 4f ++ bnez t0, 3b ++ ++4: ++ sub a0, t0, t1 ++ ret ++ ++.option pop ++#endif +``` + +#### strlen_zbb + +启用 CONFIG_RISCV_ISA_ZBB 的前提下,移位对齐字符后从头开始以 SZREG 为单位读取,并剔除第一个和最后一个机器字头尾的空字符。最后计算结果求和。 + +```c ++#ifdef CONFIG_RISCV_ISA_ZBB ++strlen_zbb: ++ ++#ifdef CONFIG_CPU_BIG_ENDIAN ++# define CZ clz ++# define SHIFT sll ++#else ++# define CZ ctz ++# define SHIFT srl ++#endif ++ ++.option push ++.option arch,+zbb ++ ++ /* ++ * Returns ++ * a0 - string length ++ * ++ * Parameters ++ * a0 - String to measure ++ * ++ * Clobbers ++ * t0, t1, t2, t3 ++ */ ++ ++ /* Number of irrelevant bytes in the first word. */ ++ andi t2, a0, SZREG-1 ++ ++ /* Align pointer. */ ++ andi t0, a0, -SZREG ++ ++ li t3, SZREG ++ sub t3, t3, t2 ++ slli t2, t2, 3 ++ ++ /* Get the first word. */ ++ REG_L t1, 0(t0) ++ ++ /* ++ * Shift away the partial data we loaded to remove the irrelevant bytes ++ * preceding the string with the effect of adding NUL bytes at the ++ * end of the string's first word. ++ */ ++ SHIFT t1, t1, t2 ++ ++ /* Convert non-NUL into 0xff and NUL into 0x00. */ ++ orc.b t1, t1 ++ ++ /* Convert non-NUL into 0x00 and NUL into 0xff. */ ++ not t1, t1 ++ ++ /* ++ * Search for the first set bit (corresponding to a NUL byte in the ++ * original chunk). ++ */ ++ CZ t1, t1 ++ ++ /* ++ * The first chunk is special: compare against the number ++ * of valid bytes in this chunk. ++ */ ++ srli a0, t1, 3 ++ bgtu t3, a0, 3f ++ ++ /* Prepare for the word comparison loop. */ ++ addi t2, t0, SZREG ++ li t3, -1 ++ ++ /* ++ * Our critical loop is 4 instructions and processes data in ++ * 4 byte or 8 byte chunks. ++ */ ++ .p2align 3 ++1: ++ REG_L t1, SZREG(t0) ++ addi t0, t0, SZREG ++ orc.b t1, t1 ++ beq t1, t3, 1b ++2: ++ not t1, t1 ++ CZ t1, t1 ++ ++ /* Get number of processed words. */ ++ sub t2, t0, t2 ++ ++ /* Add number of characters in the first word. 
*/ ++ add a0, a0, t2 ++ srli t1, t1, 3 ++ ++ /* Add number of characters in the last word. */ ++ add a0, a0, t1 ++3: ++ ret ++ ++.option pop ++#endif +``` + +## 总结 + +以上梳理了 memory 和 strcmp 相关优化代码,可以发现: + +memory 相关优化方法主要有两点:通过连续存储减少条件分支及其跳转次数,减少判断上的时间;以及通过对齐机制把内存操作函数拆为单位块的对齐部分和单独处理的非对齐部分,批量操作一定程度上提高效率。 + +string 对于 zbb 支持部分的函数优化,主要是先提供通用支持未对齐方式的字符串函数以及方便后续添加优化函数的框架,之后又提出了优化对齐方式下按 SZREG 块单位执行函数的优化方案。当优化方案不适用时再使用通用函数,以此优化部分情况下的 zbb 中 str 相关函数的使用效率。 + +接下来将按照 Memory, String, 数据运算,其他库函数等几个方面系统地展开对 RISC-V Linux 内核库函数的解读,敬请期待。 + +## 参考资料 + +- [Zbb string optimizations][001] +- [riscv: optimized mem* functions][002] +- [riscv: lib: optimize memcmp with ld insn][003] +- [RISC-V: Apply Zicboz to clear_page and memset][004] +- [Zbb+ fast-unaligned string optimization][005] +- [RISC-V: Optimize memset for data sizes less than 16 bytes][006] + +[001]: https://lore.kernel.org/all/20230113212301.3534711-1-heiko@sntech.de/ +[002]: https://lore.kernel.org/linux-riscv/20210929172234.31620-1-mcroce@linux.microsoft.com/ +[003]: https://lore.kernel.org/linux-riscv/20220906115359.173660-1-zouyipeng@huawei.com/ +[004]: https://lore.kernel.org/linux-riscv/20221027130247.31634-1-ajones@ventanamicro.com/ +[005]: https://lore.kernel.org/linux-riscv/20230113212351.3534769-1-heiko@sntech.de/ +[006]: https://lore.kernel.org/linux-riscv/20230511012604.3222-1-zhang_fei_0403@163.com/ diff --git a/articles/20230617-software-prefetch.md b/articles/20230617-software-prefetch.md new file mode 100644 index 0000000000000000000000000000000000000000..4611826099f031f33e59bfa630eb8dbe559d1c17 --- /dev/null +++ b/articles/20230617-software-prefetch.md @@ -0,0 +1,219 @@ +> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.1 - [tounix spaces codeinline tables urls pangu autocorrect]
+> Author: Kepontry
+> Date: 2023/6/17
+> Revisor: Falcon ; Walimis
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Proposal: [VisionFive 2 开发板软硬件评测及软件 gap 分析](https://gitee.com/tinylab/riscv-linux/issues/I64ESM)
+> Sponsor: PLCT Lab, ISCAS + +# 高速缓存的软件预取调优 + +## 前言 + +在 [上篇文章][002] 中介绍了缓存预取(Cache Prefetching),一种提前将指令或数据取到 Cache 中以提升性能的技术。根据预取数据的不同类型,缓存预取可以划分为指令预取和数据预取。而数据预取根据实现方式的不同,可以分为软件预取和硬件预取。上篇文章介绍了 RISC-V SoC JH7110 上的硬件预取,本篇文章将主要关注软件预取。 + +软件预取是由编译器或编程人员向程序中插入预取指令而实现的,这类指令类似于 load 指令,会将缓存行大小的数据从内存中取出并放入 Cache 中,但并不会因为等待访存结果而阻塞流水线。对于收到的软件预取指令,CPU 将在访存通路空闲时发送预取请求。而硬件预取由模式匹配策略而不是指令触发,当某一模式匹配成功时,由 CPU 中的预取器自动进行预取。 + +## X86 指令集的软件预取 + +### 预取指令介绍 + +下表中列举了 X86 指令集定义的软件预取指令,PREFETCHT0、PREFETCHT1 和 PREFETCHT2 指令可以指定预取到的 Cache 层级,而 PREFETCHNTA 指令针对的是短时间内不会被复用的数据,需要减少其对 Cache 的污染。需要注意的是,这些指令的实现与芯片架构相关,功能细节会有变化。此外,一些架构支持更多的预取指令,具体可以参考对应架构的手册。 + +将数据预取到离 CPU 越近的 Cache 层级,例如 L1,获得的性能收益越大,但预取地址错误或及时性差所带来的性能损失也越大。所以 L1 软件预取需要准确率高且预取及时性好,如果及时性不能保证且程序是访存密集型应用,应尽量避免将数据预取到 L1。 + +| 预取指令 | 作用 | +|-------------|--------------------------------------------------| +| PREFETCHT0 | 将数据预取到所有级别的高速缓存 | +| PREFETCHT1 | 将数据预取到除 L1 外所有级别的高速缓存 | +| PREFETCHT2 | 将数据预取到除 L1 和 L2 外所有级别的高速缓存 | +| PREFETCHNTA | 将数据预取到非临时缓冲结构中,以减少对 Cache 的污染 | + +在 X86 架构下,可以使用 `_mm_prefetch` 函数在代码中插入软件预取。该函数第一个参数为待预取数据的指针,第二个参数可以配置为 `_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2` 和 `_MM_HINT_NTA` 等,分别与上表中的预取指令一一对应。 + +### 软件预取实验 + +下面将以一个简单的链表遍历程序为例来说明软件预取的使用方式与效果。在该程序中,startCounter 和 getCounter 函数分别用于初始化定时器和获取计时值。createLinkedList、destroyLinkedList 和 traverseLinkedList 函数分别用于创建、销毁和遍历链表。链表的长度设置为 1M,分别统计加入软件预取前后的链表遍历耗时,比较性能差异。需要注意的是,结构体 Node 的末尾有一个 int 类型的数组,这是为了模拟结构体中的其它成员变量,设置为 14 是为了保证两个 Node 尽量位于不同的缓存行上。本次实验使用 `_mm_prefetch` 函数,预取链表的下一个元素所在的缓存行到 L2 和 L3 Cache 中。 + +```C +// linkedlist.c +#include <stdio.h> // printf +#include <stdlib.h> // malloc, free +#include <stdbool.h> // bool +#include <time.h> // clock_gettime +#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_T1 + +#define K 1000 +#define M (1000 * K) + +struct timespec time1, time2; +void startCounter() { + clock_gettime(CLOCK_REALTIME, &time1); +} + +double getCounter() { + clock_gettime(CLOCK_REALTIME, &time2); + return (time2.tv_sec - time1.tv_sec) + \ + (double)(time2.tv_nsec - time1.tv_nsec) / 1000000000; +} + +typedef struct Node { + int data; + struct Node* next; + int arr[14]; +} Node; + 
+Node* createLinkedList(int n) { + Node* head = NULL; + Node* prev = NULL; + + for (int i = 0; i < n; i++) { + Node* newNode = (Node*)malloc(sizeof(Node)); + newNode->data = i; + newNode->next = NULL; + if (prev != NULL) { + prev->next = newNode; + } else { + head = newNode; + } + prev = newNode; + } + return head; +} + +void destroyLinkedList(Node* head) { + Node* current = head; + + while (current != NULL) { + Node* next = current->next; + free(current); + current = next; + } +} + +void traverseLinkedList(Node* head, bool withPref) { + Node* current = head; + while (current != NULL) { + if(withPref) + _mm_prefetch(current->next, _MM_HINT_T1); + current = current->next; + } +} + +int main() { + int n = 1 * M; + Node* head = createLinkedList(n); + printf("Linked list length: %d.\n", n); + + startCounter(); + traverseLinkedList(head, true); + double resultWithPref = getCounter(); + printf("Time with prefetch: %f\n", resultWithPref); + + startCounter(); + traverseLinkedList(head, false); + double resultWithoutPref = getCounter(); + printf("Time without prefetch: %f\n", resultWithoutPref); + + destroyLinkedList(head); + return 0; +} +``` + +编译上述代码后,使用 objdump 命令可以查看程序的汇编指令。从输出中可以看出,预取指令 prefetcht1 表示将数据预取到 L2 和 L3 Cache 中,与前面设置的 `_MM_HINT_T1` 参数对应。 + +```shell +$ gcc linkedlist.c -o linkedlist +$ objdump -d linkedlist| grep prefetch + 1312: 0f 18 10 prefetcht1 (%rax) +``` + +运行编译后的程序,从结果中可以看出,添加软件预取后,链表的遍历耗时减少。 + +```shell +$ ./linkedlist +Linked list length: 1000000. 
+Time with prefetch: 0.007642 +Time without prefetch: 0.008526 +``` + +### 软件预取优缺点分析 + +软件预取的优点有:在源码级确定预取地址,能够处理更加复杂的访存相关;显式预取,编程人员可见等。 + +当访存地址存在明显规律时,例如按一定步幅递增(数组遍历),访存部件能够较为容易地发现规律,现代处理器一般能够通过硬件预取自动进行优化。但当规律不明显时,例如下一次访存的地址是当前的访存值加上一定的偏移量(链表遍历),硬件预取出于实现开销考虑,往往不具备发现这类规律的能力,软件预取在源码层面能够非常方便地获得预取地址,如上面例子所示。 + +软件预取的缺点为:不便于判断预取及时性,存在取指、译码等指令执行开销等。 + +预取及时性指的是预取数据进入 Cache 的时机不能过早或过晚。由于源码级缺乏运行时信息,编程人员或编译器很难准确判断及时性,往往只能通过比较预取指令放在不同位置的性能收益来进行判断。 + +## RISC-V 指令集的软件预取 + +### CMO 扩展介绍 + +RISC-V 指令集中的软件预取指令包含在缓存管理操作(Cache-management operation,CMO)指令中。该扩展指令集标准已被批准,包括了 Zicbom、Zicboz 和 Zicbop 扩展,最新版本为 [v1.0.1][004]。 + +* Zicbom 扩展定义了 cbo.inval、cbo.clean 和 cbo.flush 等缓存块管理指令。其中,cbo.inval 指令用于无效缓存行,cbo.clean 指令用于清除缓存行的脏位,如果脏位置位,则将缓存行写回内存,cbo.flush 指令则先对缓存行做 flush 操作,再做 invalidate 操作。 + +* Zicboz 扩展定义了 cbo.zero 指令,用于向缓存行中写 0。 + +* Zicbop 扩展定义了 prefetch.i、prefetch.r 和 prefetch.w 指令,分别用于指令读、数据读和数据写的预取。 + +由于该扩展指令集在 2021 年底才被批准,许多以前的 RISC-V CPU 并不支持该扩展。不过根据 [社区讨论][001],平头哥新推出的 C920 处理器核与 Intel 的 [Nios V 处理器核][007] 提供对 CMO 指令扩展的支持。 + +### 指定预取 Cache 层级的方式 + +RISC-V 在 Zihintntl 扩展指令集中提供了 NTL(Non-Temporal Locality)指令,表明目标指令(即下一条指令)的显式内存访问的时间局部性较差。这类指令是提供给处理器的提示,不影响体系结构状态,具体实现由微架构决定。微架构可以使根据 NTL 指令决定将数据分配到哪一级缓存。例如,在一种实现中,ntl.p1 实现为不在私有 L1 Cache 中为该数据分配缓存行,而应该在 L2 中分配。在另一个实现中,ntl.p1 实现为在 L1 中分配缓存行,但是会被尽快替换出去。 + +如下表所示,该扩展共提供 4 条指令,分别定义了目标指令在共享和私有 Cache、最内层和所有级别 Cache 中的局部性。 + +| 指令 | 作用 | +|----------|---------------------------------------------------| +| ntl.p1 | 目标指令在最内层私有 Cache 中没有表现出时间局部性 | +| ntl.pall | 目标指令在任何级别的私有 Cache 中没有表现出时间局部性 | +| ntl.s1 | 目标指令在最内层共享 Cache 中没有表现出时间局部性 | +| ntl.all | 目标指令在任何级别的共享 Cache 中没有表现出时间局部性 | + +如下表所示,对于不同内存架构,NTL 指令所对应的 Cache 层级也不一样。以私有 L1/L2,共享 L3 为例,ntl.p1 对应 L1,ntl.pall 对应 L2,ntl.s1 与 ntl.all 均对应 L3。更详细的介绍可以参见 [Zihintntl 扩展指令集文档][010]。 + +| 内存架构 | ntl.p1 | ntl.pall | ntl.s1 | ntl.all | +|--------------|--------|----------|--------|---------| +| 私有 L1,共享 L2 | L1 | L1 | L2 | L2 | +| 私有 L1,共享 L2/L3 | L1 | L1 | L2 | L3 | +| 私有 L1/L2,共享 L3 | L1 | L2 | 
L3 | L3 | + +NTL 指令可以影响除 Zicbom 扩展中的缓存管理指令之外的所有内存访问指令。例如在 “私有 L1/L2,共享 L3” 的内存架构中执行 cbo.zero 指令,如果前面执行过 ntl.pall 指令,则表示该缓存行应在 L3 中分配并清零,而不是在 L1 或 L2 中分配。因为根据上表,其在 L2 Cache 中的局部性差。 + +### 使用方法 + +LLVM 和 GCC 目前已经支持 CMO 扩展,以 Zicbop 扩展为例,相关 patch 如下。 + +* [[RISCV] Add support for llvm.prefetch to use Zicbop instructions][012] +* [[RISCV] Implement support for the Zicbop extension][011] +* [RISC-V: Cache Management Operation instructions][008] +* [RISC-V: Cache Management Operation instructions testcases][009] + +在 GCC 中,可以调用内置函数 `__builtin_prefetch (const void *addr[, rw[, locality]])` 进行预取,第一个参数 addr 为预取地址,第二个参数 rw 用 0 和 1 分别表示读和写,第三个参数 locality 用于指定局部性,从 0-3 局部性逐渐增加,与前面介绍的 NTL 指令思想类似。其中,第二、三个参数是可选的。 + +## 总结 + +本文简要介绍了软件预取的原理和优缺点,并分析了 X86 和 RISC-V 架构下的软件预取,此外还给出了 X86 架构下的程序优化案例,为大家实际编程提供参考。 + +## 参考资料 + +- [X86 架构预取内建函数][006] +- [RISC-V 近期批准的扩展指令集][005] +- [RISC-V CMO 扩展标准][003] + +[001]: https://forum.rvspace.org/t/milk-v-pioneer/2838 +[002]: https://gitee.com/tinylab/riscv-linux/blob/master/articles/20230509-vf2-hw-prefetch.md +[003]: https://github.com/riscv/riscv-CMOs +[004]: https://github.com/riscv/riscv-CMOs/blob/master/specifications/cmobase-v1.0.1.pdf +[005]: https://wiki.riscv.org/display/HOME/Recently+Ratified+Extensions +[006]: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=154,5184 +[007]: https://www.intel.com/content/www/us/en/docs/programmable/683632/23-1/data-cache.html +[008]: https://gcc.gnu.org/git?p=gcc.git;a=commit;h=3df3ca9014f94fe4af07444fea19b4ab29ba8e73 +[009]: https://gcc.gnu.org/git?p=gcc.git;a=commit;h=d44e471cf041d5a304f2b2bbc7d104fa17f0e9da +[010]: https://github.com/riscv/riscv-isa-manual/blob/main/src/zihintntl.adoc +[011]: https://reviews.llvm.org/D117433 +[012]: https://reviews.llvm.org/D152723 diff --git a/articles/20230626-rvsec-intro-part1.md b/articles/20230626-rvsec-intro-part1.md new file mode 100644 index 0000000000000000000000000000000000000000..7d28b86ac17d0af51b2944f1866f9ef5f16ab78d --- 
/dev/null +++ b/articles/20230626-rvsec-intro-part1.md @@ -0,0 +1,148 @@ +> Author: Mingde Ren
+> Date: 2023/06/26
+> Revisor: Falcon
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Proposal: [RISC-V Security 技术调研](https://gitee.com/tinylab/riscv-linux/issues/I5WEQX)
+> Sponsor: PLCT Lab, ISCAS + +# RISC-V 安全拓展调研(Part 1) + +## 前言 + +RISC-V 架构因其开源、低成本、高度可定制等特点,近年来受到众多大小厂商青睐,已经在物联网等领域投入生产使用,可预见的未来内还可能会被部署在云计算、移动端等场景中。在生产环境中,设备除了要提供足够的功能之外,还需要提供信息安全等保障。x86、Arm 等平台下已经有相对完善的可信计算框架(Intel SGX、Intel TDX、AMD SEV、Arm Trustzone、Arm CCA 等)。而 RISC-V 的 specification 目前只规定了一些安全特性(如 PMP),暂时还没有提供一套完整公认的可信计算框架。 + +针对 RISC-V 下的安全特性设计,学术界和工业界都进行了广泛的探索。但是由于 RISC-V 的高度可定制性,这些设计思路大多彼此独立,导致暂时并没有资料完整地总结介绍整个 RISC-V 生态下的安全特性设计。因此,本系列文章将从硬件特性、软件支持角度总结现存的学术界以及工业界中的 RISC-V 安全特性设计。 + +本文为系列文章的第一部分,将先介绍 RISC-V 官方 specification 中规定的硬件安全特性,并介绍基于 PMP 规范实现的可信计算框架 Keystone。 + +## RISC-V Specification 中的安全拓展 + +RISC-V 的各种拓展还在非常活跃的开发中,因此除了已经发布的 specification 外,还有许多值得参考的官方文档。可以在[这里](https://wiki.riscv.org/display/HOME/Recently+Ratified+Extensions)查看近期已经得到批准、但尚未合并到官方 specification 中的拓展。还在开发中,尚未完全确定并得到批准的拓展可以在[这里](https://wiki.riscv.org/display/HOME/Specification+Status)查看。其中与安全相关的有:已获得批准的 SMEPMP、SMSTATEEN,尚未获得批准的 SPMP、IOPMP、IOMMU 等。本文将主要介绍已经得到批准的安全拓展。 + +## PMP(物理内存保护,Physical Memory Protection) + +### PMP 介绍 + +PMP 是 RISC-V 架构 specification 中规定的一种硬件安全特性,用于对物理内存进行访问控制。使用 PMP 可以将物理内存划分为多个区域,并对每个区域分别设置读、写、可执行权限。处理器中每个核都有一个独立的 PMP 单元,用于限制核对物理内存的访问。此外,如果处理器支持虚拟地址,那么 PMP 同样会作用在 MMU 对物理内存的访问上,如果该核对任意一级页表所在的物理地址没有访问权限,那么这次地址翻译将会失败,并触发访问错误(access fault),陷入到 M-mode 中。 + +使用 PMP 可以提供深度防御(defend-in-depth),即使有漏洞的操作系统被攻破,攻击者所能造成的损害也会受限。例如,OpenSBI 默认使用一个 PMP 条目来禁止 S-mode 和 U-mode 的软件对自己所在物理内存的访问,从而保障自身安全。理论上我们也可以在内核启动后对代码段和只读数据段用 PMP 禁止写入操作来保证内核代码和只读数据的完整性。 + +PMP 规定了一系列 CSR(控制与状态寄存器,Control and Status Register)来划分物理内存区域和配置权限,这些 CSR 只能由 M-mode 的特权软件访问。这些 CSR 包括用于划分物理地址的 `pmpaddr` 寄存器和用于配置权限的 `pmpcfg` 寄存器。每个 `pmpaddr` 寄存器标注了一个物理地址,用于匹配一段 PMP 区域。每个 PMP 区域对应一个 8 比特的 `pmpcfg` 条目。在 RV32 中,每四个 `pmpcfg` 条目打包在一个 `pmpcfg` CSR 中(如下图),依次命名为 `pmpcfgN`,其中 N 为该寄存器的序数。 + +![RV32 PMPCFG](images/20230626-rvsec-intro-part1/RV32-pmpcfg-layout.png) + +而在 RV64 中,每八个 `pmpcfg` 条目被打包在一个 `pmpcfg` CSR 中。为了和 RV32 兼容,这些 CSR 命名会跳过奇数,仅使用偶数(如下图)。 + +![RV64 
PMPCFG](images/20230626-rvsec-intro-part1/RV64-pmpcfg-layout.png) + +在 20211203 版的 specification 中,RISC-V 核的实现可以提供 0、16 或 64 个 PMP 寄存器。注意:PMP 寄存器的数目可以是 0,意味着标准并没有要求所有的 RISC-V 核都实现 PMP 功能。此外,尽管 specification 里 PMP 寄存器可以至多有 64 个,但在实际电路中很难实现这么多的数量,现有的 RISC-V 开发板大多只支持 8 个 PMP 寄存器。 + +PMP 提供了三种方式来使用 `pmpaddr` 寄存器:NAPOT,NA4 和 TOR,其中 NA4 可以看成 NAPOT 的一种特例。三种方法中,TOR 的使用相对简单,它使用两个连续的 `pmpaddr` 寄存器分别标注一段物理内存的起始地址和终止地址。如下图所示,当 `pmpcfg[i]` 被设置为 TOR 模式时,其对应的 PMP 区域将由 `pmpaddr[i-1]` 和 `pmpaddr[i]` 标注,即图中的 `0x8080_0000` 到 `0x80C0_0000` 区域。注意这里的 i 并不限定奇偶数,并且当 i 为 0 时,将自动匹配 `0x0` 到 `pmpaddr[0]` 的区域。 + +![TOR 配对模式](images/20230626-rvsec-intro-part1/pmpaddr.png) + +为了节约 PMP 寄存器的使用数量,specification 中还约定了 NAPOT 和 NA4 两种地址匹配模式。这两种模式要求 PMP 区域的起始地址是按照区域大小对齐的,并且区域大小必须为 2 的幂次。我们可以继续使用上图所示的区域(`0x8080_0000` 到 `0x80C0_0000`)作为例子。这段区域大小为 `0x40_0000`,即 4 MiB,并且起始地址 `0x8080_0000` 是 4 MiB 对齐的,因此可以使用 NAPOT 模式进行地址匹配。配置方式为将 `pmpcfg[i]` 设置为 NAPOT 模式,并将 `pmpaddr[i]` 的高位设置为地址右移两比特(地址要求是 4 字节对齐的),低位设置为区域大小右移三比特后减一,即低位依次为一个 0 和连续的若干个 1:`0x2027_FFFF`(`(0x8080_0000 >> 2) | ((0x40_0000 >> 3) - 1)`)。更一般的配置方式如下图: + +![NAPOT 地址编码](images/20230626-rvsec-intro-part1/NAPOT-encoding.png) + +需要注意的是,尽管我们说 NA4 是 NAPOT 的特例,但是给定一个 `pmpaddr` 寄存器,我们无法判断这个寄存器的值是 NA4 模式下的一个地址,还是 NAPOT 模式中一个对齐的地址加上末尾的标识。因此 NA4 和 NAPOT 在 `pmpcfg` 寄存器的配置中要区分开来。此外,specification 中提到,RISC-V 设备不需要对上图中每个粒度都提供支持,粒度可以由厂商自行决定。如 D1 Nezha 开发板中,PMP 的最小粒度为 4K 内存页,小于此粒度的配置会对齐到此粒度。 + +刚刚提到每个 `pmpcfg` 条目包含八个比特,它们提供了对应的 PMP 区域的地址匹配模式、访问权限、配置锁定的功能,如下图: + +![PMPCFG 寄存器](images/20230626-rvsec-intro-part1/pmpcfg.png) + +其中 A 代表开关和刚刚介绍的三种地址匹配模式,对应关系如下: + +![PMPCFG A 比特控制三种地址匹配模式](images/20230626-rvsec-intro-part1/pmpcfg-a.png) + +其他位中,X、W、R 分别代表可执行、写、读权限,L 代表锁定。当 L 设置为 1 后,直到下次系统重置(如重启)之前,PMP 配置都无法被更改。并且当 L 未被设定时,PMP 只对 U/S-mode 生效;而设定 L 后,PMP 也会对 M-mode 生效,即此时无论机器处于哪种特权状态,违反 PMP 设定的物理内存访问都会触发访问错误。 + +此外,PMP 区域之间可以存在重叠,PMP 寄存器的序数越低则优先级越高,因此重叠区域的配置将以低序数 PMP 寄存器为准。 + +### 基于 PMP 实现的可信计算框架 + +Keystone 是一个软件 TEE(可信执行环境,Trusted Execution Environment)框架,RISC-V PMP 是 Keystone 唯一依赖的硬件安全特性。TEE 
通常提供与操作系统相隔离的区域用来执行需要隐私保护的应用程序,这些隔离区域被称为飞地(Enclave)。TEE 可以在操作系统内核被攻破的情况下保障飞地内程序的完整性和隐私性。Keystone 在 M-mode 实现了一个基于 OpenSBI 的安全监视器(Security Monitor),安全监视器向操作系统提供了用于创建、执行、终止、验证飞地等功能的 ABI: + +![Keystone: sm/src/sm-sbi.h](images/20230626-rvsec-intro-part1/keystone-sm-api.png) + +Keystone: sm/src/sm-sbi.h + +操作系统可以为飞地分配初始资源,以及初始配置,当配置完成后,安全监视器将会审查配置并生成签名,用户可以通过签名确认飞地初始状态的完整性。飞地进入执行状态之前,安全监视器会更新 PMP 配置,使得除了飞地所在核以外的所有的核都无法访问飞地访问区域;同时,飞地本身所在核将被配置为只能访问自己所拥有的 PMP 区域,如下图所示: + +![图源:Keystone, EuroSys’20](images/20230626-rvsec-intro-part1/keystone-pmp.png) + +图源:Keystone, EuroSys’20 + +图中可以看到飞地中除了需要隔离保护的机密区域外,还预留了一段不可信空间(U1),用于与操作系统通信(如将计算结果发送至操作系统等)。 + +为了简化可信应用的开发,Keystone 提供了对在飞地中执行静态链接的 ELF 文件的支持。但是 ELF 文件的执行依赖系统调用,如果让操作系统内核直接提供系统调用的支持,则会使飞地对操作系统建立不必要的信任,违背了提供飞地的初衷。为此,Keystone 为每个飞地提供了一个小型运行时(runtime),运行时中提供了关键系统调用的支持,并将其余的系统调用转发给内核,如下图: + +![图源:Keystone, EuroSys’20](images/20230626-rvsec-intro-part1/keystone-arch.png) + +图源:Keystone, EuroSys’20 + +Keystone 是第一个仅依赖 RISC-V PMP 实现的基于软件的 TEE,它已经展示出类似近来流行的安全虚拟机的设计。它展示了新思路的同时,留下了很多值得完善的地方: + +- Keystone 中飞地的数量受到 PMP 数量的限制,使得其无法直接在云计算等场景中应用; +- Keystone 的飞地无法直接进行 I/O 操作,而是依赖系统内核代理这部分系统调用,I/O 作为常见的攻击面,对操作系统的 I/O 依赖可能会对飞地安全造成威胁; + +## SMEPMP(为防止 M-mode 内存访问和执行而提供的 PMP 增强,PMP Enhancement for memory access and execution prevention on Machine mode) + +### 设计 SMEPMP 的动机 + +SMEPMP 是一个已经得到 RISC-V 社区批准的安全拓展,全称是为防止 M-mode 内存访问和执行提供的 PMP 增强(PMP Enhancement for memory access and execution prevention on Machine mode),从名称可以看出这个拓展是一个对 PMP 的增强。 + +RISC-V 标准中通过 `sstatus.SUM` 比特和页表项中的 U 比特,提供了对 SMAP(S-mode 内存访问预防,Supervisor Memory Access Prevention)和 SMEP(S-mode 内存执行预防,Supervisor Memory Execution Prevention)的支持:当 `sstatus.SUM` 被设置时,设定了 U 比特的页将不能被 S-mode 访问;设定了 U 比特的页中的代码永远不能被 S-mode 直接执行。SMAP 和 SMEP 的设计也是一种 defend-in-depth,可以用于避免一些巧妙构造的攻击:这些攻击并不直接篡改特权等级的内存,而是通过诱导特权等级去访问或执行本不应该访问或执行的非特权等级内存,来间接攻破特权等级的软件。 + +但是有很多 RISC-V 的设备中只有 U/M-mode,这在嵌入式设备中非常常见。对于这些设备,此前的 RISC-V 标准无法提供 SMAP/SMEP 特性。此外,在存在 S-mode 的设备中,此前标准也无法为 M-mode 对来自 S-mode 的攻击提供 SMAP/SMEP 
保护。这是因为此前的 SMAP/SMEP 是通过页表实现的,而 M-mode 会直接访问物理地址。设定有 L 比特的 PMP 是此前唯一可以限制 M-mode 访问的方式(如设定一段区域为只读),但是 PMP 做不到 M-mode 无法访问的同时让 U/S-mode 可以访问。SMEPMP 的设计就是为了给 M-mode 提供 SMAP/SMEP 支持。 + +### SMEPMP 的机制 + +SMEPMP 的运行机制相对 PMP 而言显得十分繁杂,通过图片可以比较清晰地梳理清楚。我们首先讨论当内存访问/执行的地址匹配到了某个 PMP 区域内的情况,如下图: + +![命中 PMP 区域时 SMEPMP 作用效果图](images/20230626-rvsec-intro-part1/smepmp-hit.png) + +`mseccfg` 是 SMEPMP 配置的核心 CSR,我们首先关注其中的 MML(M-mode 锁定,Machine Mode Lockdown)比特。MML 和原先 PMP 配置中的 L 比特决定了大部分情况。 + +- `MML=0` 时(图中左半边),所有设定和没有 SMEPMP 时一致(见前文),即当 L 未设置时,S/U-mode 的访问会根据 PMP 配置来管理(图中的 enforced);而 L 设置后,PMP 配置将对所有等级生效,并锁定。 +- `MML=1` 时(图中右半边,暂时忽略最后三行): + - 若 `L=0`,则 S/U-mode 的访问依据 PMP 配置管理,禁止 M-mode 的访问(图中的 denied),从而实现 SMAP/SMEP。 + - 若 `L=1`,则禁止 S/U-mode 的访问,PMP 配置对 M-mode 的访问生效。 + +尽管这样的设计一定程度上提供了 SMAP/SMEP,但是它相当不灵活,因为一旦 `MML` 和 `L` 都被设置为 1,就无法重新让 S/U-mode 访问该区域。为此,SMEPMP 在 `mseccfg` 中引入了 RLB(规则锁定绕过,Rule Locking Bypass)比特。当 `RLB=1` 时,上图中的锁定效果将被忽略。但是为了防止 L 比特本身失去意义,RLB 本身又加入了锁定特性:当存在至少一个 L 比特设定的 PMP 配置时,如果 RLB 被关闭(置为 0),则 RLB 将被锁定在关闭状态,此前带有 L 比特的 PMP 设置也将被锁定。需要注意的是,specification 中注明了 RLB 被设计为一个调试机制,或者为启动过程提供便利和优化的设置。在生产环境中,一旦系统启动完成,软件就不应该再依赖 RLB 特性,否则可能带来安全隐患。 + +除了灵活性外,刚刚的设计中一旦提供 SMAP/SMEP 特性后,就无法让一段内存在所有权限等级中共享,这会对系统性能造成影响,比如无法提供零拷贝等功能。SMEPMP 希望在提供 SMAP/SMEP 的同时,安全地在不同权限等级之间共享内存,这部分设计体现在上图中的后三行。共享内存功能的核心设计思路是写权限和执行权限永远不共存,这也是系统中常见的安全设计原则。依据这个原则,对照图片就可以很好地理解如何进行 SMEPMP 的配置了,此处我们不进行繁杂的列举。 + +上图展示了内存访问命中 PMP 区域时的情况,下面我们讨论未命中时的情况,如图: + +![未命中 PMP 区域时 SMEPMP 作用效果图](images/20230626-rvsec-intro-part1/smepmp-miss.png) + +这里 SMEPMP 为 `mseccfg` 引入了 MMWP(M-mode 白名单策略,Machine Mode Whitelist Policy)比特。`MML=0, MMWP=0` 对应了此前不启用 SMEPMP 的情况:未命中任何 PMP 区域时,M-mode 的访问被放行,S/U-mode 的访问被拒绝。当 `MMWP=1` 时,所有权限等级的访问都会被拒绝。当 `MML=1, MMWP=0` 时,S/U-mode 的访问会像此前一样直接被拒绝,而 M-mode 仅可以进行读写访问,不能执行。 + +### SMEPMP 的现有支持 + +在软件方面,目前最新版本的 QEMU(v8.0.2)中,已经提供了对 SMEPMP 的支持(v0.9.3),但实现的版本是获得社区批准之前的版本,并且名称依然保留了 EPMP(增强 PMP,Enhanced PMP)的旧称。获得批准版本的实现需要等待后续开发。此外,暂时没有看到 Linux 中有关于 SMEPMP 的支持。 + +在硬件方面,RISC-V 
社区官方的[博文](https://riscv.org/blog/2023/06/noel-v-processors-security-extensions-for-safe-and-secure-computing/)中表示,已经有硬件提供了 SMEPMP 的支持:[NOEL-V](https://www.gaisler.com/index.php/products/processors/noel-v) 处理器。但是在 NOEL-V 官方网站中,SMEPMP 仍处于开发路线中,暂未更新为已实现的特性,所以现有 NOEL-V 设备对 SMEPMP 的支持有待确认。 + +截至目前,未有使用 SMEPMP 的可信计算框架被提出。 + +## SMSTATEEN(状态启用拓展,State Enable Extension) + +隐蔽信道(covert channel)是一个安全领域的研究课题。它是指非特权程序之间通过一些无法被特权等级感知的手段进行通讯,比如某些寄存器中的位、设备的状态等等。这可能会使敏感信息在不知情的情况下被传出,也会对系统的安全造成一定隐患。SMSTATEEN 拓展是针对 RISC-V 各类拓展中提供的 CSR 这一潜在的隐蔽信道而设计的。简单来说,它统一提供了各种特性的开关,防止未启用的特性的 CSR 被用作隐蔽信道。此前 `mstatus.FS/VS` 等比特可以用于控制浮点、向量拓展的启用,但随着拓展变得繁多,使用 `mstatus` 管理所有拓展已经不现实,因此需要引入 SMSTATEEN 拓展。 + +SMSTATEEN 的机制和使用方式非常直接,它为 S/H/M-mode 每个特权等级提供了四个 `stateen` CSR,其中的比特位用于控制所有可选的拓展开关。具体的对应关系可以查看已获社区批准的 [specification](https://github.com/riscv/riscv-state-enable/releases/download/v1.0.0/Smstateen.pdf)。 + +与 SMEPMP 类似,NOEL-V 处理器和 QEMU 中提供了对 SMSTATEEN 的支持。暂时没有看到 Linux 及其他工作对此拓展的支持。 + +## 小结 + +本文介绍了 PMP、SMEPMP、SMSTATEEN 三个已获得批准的 RISC-V 安全拓展标准,并简要介绍了基于 PMP 实现的可信计算框架 Keystone。其中 PMP 是官方标准中的一个可选硬件拓展,SMEPMP 针对 PMP 进行了增强,提供了 SMAP/SMEP 保护能力,SMSTATEEN 相对较为简单,用于提供各类拓展的开关控制,以防止其被用于隐蔽信道。目前上游软件对 RISC-V 安全拓展的支持仍然较为初级,有待进一步开发。 + +## 参考资料 + +- [Recently Ratified RISC-V Extensions](https://wiki.riscv.org/display/HOME/Recently+Ratified+Extensions) +- [Extensions in Progress](https://wiki.riscv.org/display/HOME/Specification+Status) +- [NOEL-V Blog Post](https://riscv.org/blog/2023/06/noel-v-processors-security-extensions-for-safe-and-secure-computing/) +- [NOEL-V Webpage](https://www.gaisler.com/index.php/products/processors/noel-v) +- [SMSTATEEN Extension Specification](https://github.com/riscv/riscv-state-enable/releases/download/v1.0.0/Smstateen.pdf) \ No newline at end of file diff --git a/articles/20230701-qemu-system-decode-analyse.md b/articles/20230701-qemu-system-decode-analyse.md new file mode 100644 index 0000000000000000000000000000000000000000..8728ab04dbae76682d5144996a478f47fba9d3d1 --- /dev/null +++ 
b/articles/20230701-qemu-system-decode-analyse.md @@ -0,0 +1,257 @@ +> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.2-rc1 - [header urls refs pangu]
+> Author: jl-jiang
+> Date: 2023/07/01
+> Revisor: Bin Meng
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Proposal: [【老师提案】QEMU 系统模拟模式分析](https://gitee.com/tinylab/riscv-linux/issues/I61KIY)
+> Sponsor: PLCT Lab, ISCAS + +# QEMU 系统模式下指令解码模块简析:以 RISC-V 为例 + +## 前言 + +QEMU 是一个通用的、开源的模拟器,通过纯软件方式实现硬件的虚拟化,模拟外部硬件,为用户提供抽象、虚拟的硬件环境。QEMU 亦可藉由硬件虚拟化变身为虚拟机。 + +QEMU 支持以下两种方式进行模拟: + +- 用户模式(User Mode Emulation):在一种架构的 CPU 上运行为另一种架构的 CPU 编译的程序。在该模式下,QEMU 作为进程级虚拟机,只模拟系统调用前的用户态代码,系统调用进入内核后,由宿主机操作系统原生执行,QEMU 提供不同架构系统调用映射的转换。 +- 系统模拟(System Emulation):在该模式下,QEMU 作为系统级虚拟机提供运行客户机所需的完整环境,包括 CPU,内存和外围设备等。 + +## 概述 + +### QEMU 模拟的基本逻辑 + +QEMU 提供两种 CPU 模拟实现方式,一种基于架构无关的中间码实现,另一种基于硬件虚拟化技术实现。 + +第一种方式,QEMU 使用 `Tiny Code Generator`(下文简称 `TCG`)将客户机指令动态翻译为宿主机指令执行。这种方式的主要思想是使用纯软件的方法将客户机 CPU 指令先解码为架构无关的中间码(即 `Intermediate Representation`),然后再把中间码翻译为宿主机 CPU 指令执行。其中,由客户机 CPU 指令解码为中间码的过程被称为前端,由中间码翻译为宿主机 CPU 指令的过程被称为后端。以 RISC-V 为例,指令的前端解码逻辑位于 `target/riscv` 中,后端翻译逻辑位于 `tcg/` 中,外设及其他硬件模拟代码位于 `hw/riscv` 中。 + +第二种方式,基于硬件虚拟化技术实现,直接使用宿主机 CPU 执行客户机指令,可以达到接近真实机器的运行性能。 + +本文只关注系统模式下 TCG 方式的分析,不讨论基于硬件虚拟化技术的运行逻辑。 + +### QEMU 翻译执行过程 + +TCG 定义了一系列中间码,将已经翻译的代码以代码块的形式储存在 `Translation Block` 中,通过跳转指令将宿主机 CPU 的指令集和客户机 CPU 的指令集链接。下图以 RISC-V 为例,给出了 QEMU v8.0.0 系统模拟的翻译执行流程。 + +![img](images/qemu-system-decode-analyse/translation_and_execution_loop.svg) + +我们可以发现 TCG 前端解码和后端翻译都按照指令块的粒度进行,将一个客户机指令块翻译成中间码,然后把中间码翻译成宿主机 CPU 指令,整个过程动态执行。为了提高翻译效率,QEMU 将翻译成的宿主机 CPU 指令块做了缓存,即上文提到的 `Translation Block`,CPU 执行的时候,先在缓存中查找对应的 `TB`,如果查找成功就直接执行,否则进入翻译流程。 + +从更加抽象的视角来看,TCG 模式下所谓客户机 CPU 的运行,实际上就是根据指令不断改变客户机 CPU 的状态,即改变描述客户机 CPU 的状态的数据结构中的有关变量。因为实际的代码执行过程是在宿主机 CPU 上完成的,因此客户机 CPU 的指令必须被翻译为宿主机 CPU 指令才能被执行,才能改变客户机 CPU 的数据状态。 + +QEMU 为了解耦把客户机 CPU 指令先解码为中间码,中间码其实就是一组描述如何改变客户机 CPU 数据状态且架构无关的语句,所以目标 CPU 状态参数会被传入中间码描述语句。中间码实际上是改变客户机 CPU 状态的抽象的描述,对于部分难以抽象成一般描述的指令就用 `helper` 函数进行补充。例如 RV32FD 指令集中的 `fadd.d rd, rs1, rs2` 指令,表示双精度浮点加,将 `rs1` 和 `rs2` 寄存器中的双精度浮点数相加并将舍入后的结果送到 `rd` 寄存器。对于该指令的行为,TCG 中间码并没有合适的描述,因此 QEMU 使用 `helper_fadd_d` 函数根据情况直接调用模拟 FPU 解决问题。此外,在生成中间码时,TCG 前端使用 `tcg_gen_xxx` 函数描述具体某条客户机 CPU 指令对客户机 CPU 数据状态的改变。 + +翻译过程中 `gen_intermediate_code` 函数负责前端解码,把客户机的指令翻译成中间码。而 `tcg_gen_code` 负责后端翻译,将中间码翻译成宿主机 CPU 上的指令,其中 
`tcg_out_xxx` 函数执行具体的翻译工作。 + +## TCG 细节 + +### 前端解码 + +QEMU 定义了 `instruction pattern` 来描述客户机 CPU 指令,一个 `instruction pattern` 是指一组相同或相近的指令,RISC-V 架构的指令描述位于 `target/riscv` 目录下的 `insn16.decode`、`insn32.decode` 文件中。 + +QEMU 编译的时候会解析 `.decode` 文件,使用脚本 `scripts/decodetree.py` 生成对应指令描述函数的声明并存放于 `/libqemu-riscv64-softmmu.fa.p` 目录下的 `decode-insn32.c.inc` 和 `decode-insn16.c.inc` 文件中。在这两个文件中还定义两个较为关键的解码函数 `decode_insn32` 和 `decode_insn16`,QEMU 将客户机指令翻译成中间码的时候需要调用这两个解码函数。需要注意的是,脚本 `scripts/decodetree.py` 生成的只是 `trans_xxx` 函数的声明,其定义需要开发者实现,RISC-V 对应的实现位于 `target/riscv/insn_trans/` 目录中。 + +### Decode Tree + +每种 `instruction pattern` 都有固定位和固定掩码,它们的组合构成了模式匹配的条件: + +```c +(insn & fixedmask) == fixedbits +``` + +对于每种 `instruction pattern`,`scripts/decodetree.py` 脚本定义了具体描述形式,下面进行简要分析: + +- **Fields:** CPU 在解码的时候需要把指令中的特性 `field` 中的数据取出作为传入参数(寄存器编号,立即数,操作码等),`field` 描述一个指令编码中特定的字段,根据描述可以生成取对应字段的函数。 + + | Input | Generated code | + |-------------------------------------------|------------------------------------------------------------------------| + | %disp 0:s16 | sextract(i, 0, 16) | + | %imm9 16:6 10:3 | extract(i, 16, 6) << 3 \| extract(i, 10, 3) | + | %disp12 0:s1 1:1 2:10 | sextract(i, 0, 1) << 11 \| extract(i, 1, 1) << 10 \| extract(i, 2, 10) | + | %shimm8 5:s8 13:1 !function=expand_shimm8 | expand_shimm8(sextract(i, 5, 8) << 1 \| extract(i, 13, 1)) | + | %sz_imm 10:2 sz:3 !function=expand_sz_imm | expand_sz_imm(extract(i, 10, 2) << 3 \| extract(a->sz, 0, 3)) | + + 上表给出了一些例子,如第一行的 `%disp 0:s16` 表示指令编码第 0 位起的 16 位构成了一个带符号数,因此生成代码 `sextract(i, 0, 16)`,意即从指令 `i` 的编码第 0 位开始取 16 位解释为带符号数返回。还有第三行的 `%disp12 0:s1 1:1 2:10` 表示该立即数由三个部分拼接而成,因此生成的代码中就包含了相应的移位、拼接运算。由 `field` 定义所生成的函数就负责完成这种与从指令编码中取数有关的计算。 + +- **Argument Sets:** 定义数据结构。比如,`target/riscv/insn32.decode` 中定义的 `&b imm rs2 rs1` 在编译后的 `decode-insn32.c.inc` 中生成的数据结构如下,这个结构将作为 `trans_xxx` 函数的传入参数。 + + ```c + typedef struct { + int imm; + int rs2; + int rs1; + } arg_b; + ``` + +- **Formats:** 定义指令的格式,例如下面的例子是对一个 32-bit 指令编码的描述,其中 
`.` 表示一个 bit 位。 + + ```c + @opr ...... ra:5 rb:5 ... 0 ....... rc:5 + @opi ...... ra:5 lit:8 1 ....... rc:5 + ``` + +- **Patterns:** 用来定义具体指令。这里借助 RV32I 基础指令集中的 `lui` 指令进行详细分析: + + ```c + lui .................... ..... 0110111 @u + ``` + + 另外列出相关的 format、argument、field 的定义,以便分析: + + ```c + # Argument sets: + &u imm rd + # Formats 32: + @u .................... ..... ....... &u imm=%imm_u %rd + # Fields: + %rd 7:5 + # immediates: + %imm_u 12:s20 !function=ex_shift_12 + ``` + + 可以看到 `lui` 指令的操作码是 `0110111`,指令的格式定义是 `@u`,使用的参数定义是 `&u`。`&u` 就是 `trans_lui` 函数传入参数结构体中的变量定义,其中的变量名是 `imm` 和 `rd`。`imm` 实际的格式是 `%imm_u`,它是一个由指令编码 31-12 位定义的立即数,将这 20 位的数值左移 12 位即可得到最终结果;`rd` 实际的格式是 `%rd`,即从指令编码第 7 位起的 5 位(11-7 位),表示 `rd` 寄存器的标号。 + + 可以看到 `target/riscv/insn_trans/trans_rvi.c.inc` 中对应的 `trans_lui` 函数的实现如下: + + ```c + static bool trans_lui(DisasContext *ctx, arg_lui *a) + { + gen_set_gpri(ctx, a->rd, a->imm); + return true; + } + ``` + +### trans_xxx 函数 + +`trans_xxx` 函数负责将具体的客户机指令转换为中间码指令,若转换成功则返回 `true`,否则返回 `false`。下面以 RISC-V 架构的 `add` 指令为例进行分析。 + +如下是 `target/riscv/insn_trans/trans_rvi.c.inc` 文件中对 `add` 指令的模拟。 + +```c +static bool trans_add(DisasContext *ctx, arg_add *a) +{ + return gen_arith(ctx, a, EXT_NONE, tcg_gen_add_tl, tcg_gen_add2_tl); +} +``` + +函数 `gen_arith` 被定义在文件 `target/riscv/translate.c` 中: + +```c +static bool gen_arith(DisasContext *ctx, arg_r *a, DisasExtend ext, + void (*func)(TCGv, TCGv, TCGv), + void (*f128)(TCGv, TCGv, TCGv, TCGv, TCGv, TCGv)) +{ + TCGv dest = dest_gpr(ctx, a->rd); + TCGv src1 = get_gpr(ctx, a->rs1, ext); + TCGv src2 = get_gpr(ctx, a->rs2, ext); + + if (get_ol(ctx) < MXL_RV128) { + func(dest, src1, src2); + gen_set_gpr(ctx, a->rd, dest); + } else { + if (f128 == NULL) { + return false; + } + + TCGv src1h = get_gprh(ctx, a->rs1); + TCGv src2h = get_gprh(ctx, a->rs2); + TCGv desth = dest_gprh(ctx, a->rd); + + f128(dest, desth, src1, src1h, src2, src2h); + gen_set_gpr128(ctx, a->rd, dest, desth); + } + return true; +} +``` + +注意到函数中 `func` 
指向的函数是由 `trans_add` 传入的 `tcg_gen_add_tl` 函数,而此函数又在 `include/tcg/tcg-op.h` 中以宏定义的形式被定义为 `tcg_gen_add_i64` 或 `tcg_gen_add_i32` 函数。下面给出与之相关的立即数版本 `tcg_gen_addi_i64` 函数的定义,它在内部调用了 `tcg_gen_add_i64`: + +```c +void tcg_gen_addi_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2) +{ + if (arg2 == 0) { + tcg_gen_mov_i64(ret, arg1); + } else if (TCG_TARGET_REG_BITS == 64) { + tcg_gen_add_i64(ret, arg1, tcg_constant_i64(arg2)); + } else { + tcg_gen_add2_i32(TCGV_LOW(ret), TCGV_HIGH(ret), + TCGV_LOW(arg1), TCGV_HIGH(arg1), + tcg_constant_i32(arg2), tcg_constant_i32(arg2 >> 32)); + } +} +``` + +RISC-V 的 `add` 指令内容是从 CPU 的 `rs1` 和 `rs2` 寄存器中取操作数,相加后送入 `rd` 寄存器中。宏观上看,`gen_arith` 函数首先调用 `dest_gpr` 和 `get_gpr` 这两个寄存器操作封装函数获取 `rs1` 和 `rs2` 寄存器的值,并准备 `rd` 寄存器。然后通过 `func(dest, src1, src2)` 最终调用 `tcg_gen_add_i64`(RV32 下为 `tcg_gen_add_i32`)函数完成两数相加,最后使用 `gen_set_gpr` 将结果传送至 `rd` 寄存器,完成 `add` 指令解码。 + +接着,我们针对 `gen_set_gpr` 进行深入分析,以 RV32 指令为例,追踪该函数的调用链: + +![img](images/qemu-system-decode-analyse/gen_set_gpr.svg) + +分析上述调用链的参数可以发现最后生成了一条 `mov_i32 t0, t1` 指令,意思是将 `t1` 寄存器中的数移动到 `t0` 寄存器中。该指令先被挂到了一个链表里,此后的后端翻译会把这些指令翻译成宿主机指令。到这里,前端解码的逻辑就基本上打通了。还有最后一个问题需要解决:`cpu_gpr[reg_num]` 这个全局变量是如何索引到客户机 CPU 寄存器的? + +解决该问题的基本思路是,只要 TCG 前端和后端约定描述客户机 CPU 状态的数据结构相同,确保 `cpu_gpr[reg_num]` 指向的就是相关寄存器在这个数据结构中的位置即可,这一点在 `cpu_gpr[]` 数组的初始化过程中具体体现: + +```c +void riscv_translate_init(void) +{ + int i; + // ... + for (i = 1; i < 32; i++) { + cpu_gpr[i] = tcg_global_mem_new(cpu_env, + offsetof(CPURISCVState, gpr[i]), riscv_int_regnames[i]); + // ... + } + // ... 
+} +``` + +`cpu_gpr[]` 数组在初始化时调用 `tcg_global_mem_new` 函数在 TCG 上下文 `tcg_ctx` 中分配空间并返回其相对地址,而后端翻译时访问 `cpu_gpr[]` 数组就是在访问 TCG 上下文中描述寄存器的变量,这样 `cpu_gpr[reg_num]` 就在前端和后端之间建立了连接。 + +### 后端翻译 + +后端的代码主要负责将中间码翻译成宿主机指令,本质上就是根据中间码的描述使用宿主机指令来改变内存中表示的客户机 CPU 的数据结构以及客户机内存的状态。考虑以下两条 RISC-V 汇编指令: + +```assembly +addi sp,sp,-32 +sd s0,24(sp) +``` + +经过前端解码,可以得到以下中间码: + +```assembly +add_i64 x2/sp,x2/sp,$0xffffffffffffffe0 +add_i64 tmp4,x2/sp,$0x18 +qemu_st_i64 x8/s0,tmp4,leq,0 +``` + +注意到 `sd` 指令被翻译成了两条中间码,第一条 `add_i64` 是用来计算 `sd` 指令的目标地址,计算结果保存在 `tmp4` 这个虚拟寄存器里,第二条中间码把 `s0` 的值储存到虚拟寄存器 `tmp4` 描述的内存上。在中间码中,`x2/sp` 和 `x8/s0` 仍然是客户机 CPU 上寄存器的名字,但是逻辑上已经全部映射为 QEMU 虚拟寄存器。TCG 前端将 RISC-V 汇编指令解码为中间码和虚拟寄存器的表示,后端翻译则基于中间码和虚拟寄存器进行。再次审视上述两条指令,`addi` 的中间码表示要把客户机的 `sp` 寄存器加上 `-32`,`sd` 的中间码表示要将客户机的 `s0` 寄存器中的值送到 `sp` 寄存器加 24 后得到的地址处。对于这些中间码,在 ARM 架构的宿主机上可能被翻译为以下指令: + +```assembly +ldr x20, [x19, #0x10] +sub x20, x20, #0x20 +str x20, [x19, #0x10] +add x21, x20, #0x18 +ldr x22, [x19, #0x40] +str x22, [x21, xzr] +``` + +这段指令主要进行了以下操作:把客户机 CPU 的 `sp` 寄存器装载到宿主机 CPU 的 `x20` 寄存器,使用 `sub` 指令完成客户机 CPU `sp` 寄存器值的计算并进行更新;使用 `add` 指令计算客户机 CPU 的 `sd` 指令的目标地址并保存到宿主机 CPU 的 `x21` 寄存器,接着把客户机 CPU 的 `s0` 寄存器装载到宿主机 CPU 的 `x22` 寄存器,最后使用 `str` 指令更新目标地址处的值。 + +通过以上案例可以发现,TCG 后端主要完成三件事情:分配宿主机 CPU 寄存器、生成宿主机 CPU 指令以及宿主机 CPU 和客户机 CPU 之间的状态同步。其中,状态同步实际上通过两次映射完成:第一次是 TCG 前端解码时将客户机 CPU 寄存器映射为 QEMU 虚拟寄存器,第二次是 TCG 后端分配宿主机 CPU 寄存器时将 QEMU 虚拟寄存器映射为宿主机 CPU 的物理寄存器。 + +下面仍然以 `add` 指令为例,给出后端代码调用过程的详细分析: + +![img](images/qemu-system-decode-analyse/tcg_gen_code.svg) + +`tcg_gen_code` 是整个后端翻译的入口,负责寄存器和内存区域之间的同步逻辑并根据不同指令类型调用相关函数将中间码翻译为宿主机 CPU 指令。默认情况下,`tcg_gen_code` 会调用 `tcg_reg_alloc_op` 函数,该函数会生成用宿主机 CPU 指令描述的同步逻辑,存放在 `TB` 中,最后调用不同架构的开发者提供的 `tcg_out_op` 函数完成具体指令的翻译工作。针对 `add` 指令,最终会调用 `tcg_out32()` 函数,该函数负责将一个 32 位无符号整数 `v` 写入到指针 `s->code_ptr` 对应的内存位置,并根据目标平台的指令单元大小更新该指针的值。 + +## 总结 + +本文主要分析了 QEMU 系统模式下指令解码模块。QEMU 将客户机 CPU 指令解码为中间码,中间码是对指令如何改变客户机 CPU 数据状态的抽象描述,TCG 后端将中间码翻译为宿主机 CPU 指令,也就是将中间码所描述的对客户机 CPU 数据状态的更改用宿主机 CPU 
指令的形式进行描述,执行完成后就达到了模拟客户机 CPU 运行的效果。 + +至此,从前端到后端,从解码到翻译的逻辑链条就完整了。 + +## 参考资料 + +- [Decodetree Specification](https://www.qemu.org/docs/master/devel/decodetree.html) +- [TCG Intermediate Representation](https://www.qemu.org/docs/master/devel/tcg-ops.html) diff --git a/articles/20230724-riscv-klibc-analysis.md b/articles/20230724-riscv-klibc-analysis.md new file mode 100644 index 0000000000000000000000000000000000000000..3d076a42b84fb84922df43d1d826ad802c531671 --- /dev/null +++ b/articles/20230724-riscv-klibc-analysis.md @@ -0,0 +1,430 @@ +> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.2-rc1 - [urls refs]
+> Author: Jingqing3948 <2351290287@qq.com>
+> Date: 20230724
+> Revisor: Falcon
+> Project: [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Sponsor: PLCT Lab, ISCAS
+
+# kernel libc
+
+## Introduction
+
+This article introduces the kernel libc shipped on the lab disk: its structure, basic functionality, and how to test it. It also covers how to use the Linux kernel and klibc sources that are downloaded separately from the official Linux site.
+
+## What Is kernel libc
+
+Purpose: a small, stripped-down C library for the Linux kernel, intended for embedded use and early boot.
+
+It is lightweight, efficient, and reliable, keeps its dependencies on the OS and hardware to a minimum, and starts quickly.
+
+## kernel libc vs glibc
+
+glibc, a general-purpose library, is larger and more complete than kernel libc. kernel libc pays a price for being lightweight: it appears to be less well optimized than glibc. Compact code, however, matters a great deal in embedded systems.
+
+## kernel libc Directory Layout
+
+![1689861637506](images/riscv-klibc/20230724-klibc-file-structure.jpg)
+
+The following classification was summarized with ChatGPT:
+
+1. Kernel feature files
+
+- Self-test files for kernel locks (spinlocks, mutexes, read-write locks, and so on)
+
+  - locking-selftest-wlock-hardirq.h
+
+  - locking-selftest-wlock-softirq.h
+
+  - locking-selftest-mutex.h
+
+  - locking-selftest-rlock.h
+
+  - locking-selftest-rlock-hardirq.h
+
+  - locking-selftest-rlock-softirq.h
+
+  - locking-selftest-rsem.h
+
+  - locking-selftest-rtmutex.h
+
+  - locking-selftest-softirq.h
+
+  - locking-selftest-spin.h
+
+  - locking-selftest-spin-hardirq.h
+
+  - locking-selftest-spin-softirq.h
+
+  - locking-selftest.c
+
+- Early cpio support
+
+  - earlycpio.c
+
+- Fault injection (and its usercopy variant), used to simulate failures
+
+  - fault-inject.c
+
+  - fault-inject-usercopy.c
+
+- Undefined Behavior Sanitizer files
+
+  - ubsan.c
+
+  - Kconfig.ubsan
+
+  - ubsan.h
+
+- Kernel Address SANitizer files
+
+  - kasan-related files
+
+- Kernel Fence mechanism files
+
+  - kfence
+
+- Kernel Concurrency Sanitizer files
+
+  - kcsan
+
+2. File system and configuration files:
+
+- Files used by file systems or other modules.
+  - 842
+- Exception table files used for exception handling
+  - extable.c
+- Files for kernel boot configuration
+  - bootconfig.c
+- Files for handling command-line arguments
+  - cmdline.c
+  - cmdline_kunit.c
+- Files that generate CRC tables
+  - gen_crc*.c
+- KUnit test framework files
+  - Kconfig
+  - kunit
+- Files for late kernel initialization
+  - late_init.c
+- Files for kernel live patching
+  - livepatch
+- OID registry files
+  - oid_registry.c
+  - build_OID_registry
+
+3. Hardware and driver files
+
+- Flattened Device Tree (FDT) files
+  - fdt_addresses.c
+  - fdt.c
+  - fdt_empty_tree.c
+  - fdt_ro.c
+  - fdt_rw.c
+- Files for logical devices
+  - logic_*.c
+- PCI I/O mapping files
+  - pci_iomap.c
+- Files for hardware or data structures
+  - bch.c
+  - bsearch.c
+  - btree.c
+
+4. Network and protocol files:
+
+- Files for audit functionality
+  - audit.c
+- Non-maskable interrupt backtrace files
+  - nmi_backtrace.c
+- Files for text search
+  - textsearch.c
+- Virtual Dynamic Shared Object files
+  - vdso
+
+5. Data structure and algorithm files
+
+- Associative array files
+  - assoc_array.c
+- Radix tree files
+  - radix-tree.c
+- Bitfield test file
+  - bitfield_kunit.c
+- Files for testing kernel functionality
+  - Files whose names start with `test_`
+
+6. Other files:
+
+- Cryptography files
+  - crypto
+- Files for build IDs
+  - buildid.c
+- Math library files
+  - math
+- Memory-related files
+  - memcat_p.c
+- UUID files
+  - uuid.c
+- zlib compression library files
+  - zlib_*
+- Files for output formatting functions
+  - vsprintf.c
+
+## Testing a New Library Function
+
+First, let us get familiar with the test workflow.
+
+After adding a library function and writing a test file for it, the test is not run by compiling and executing that file directly. Instead, you edit the Makefile and Kconfig files and enable the test in kernel-menuconfig, so that it runs during `make boot`. The overall workflow is therefore:
+
+- Write the corresponding test_xxx.c file in the lib directory.
+- Add the build rule to the Makefile.
+- Add the configuration entry to Kconfig.
+- Enable the test case in menuconfig.
+
+Below we walk through the test of the printf functions as an example.
+
+First, look at vsprintf.c. The `vsnprintf` function formats its output into a character buffer:
+
+```c
+/**
+ * vsnprintf - Format a string and place it in a buffer
+ * @buf: The buffer to place the result into
+ * @size: The size of the buffer, including the trailing null space
+ * @fmt: The format string to use
+ * @args: Arguments for the format string
+ *
+ * This function generally follows C99 vsnprintf, but has some
+ * extensions and a few limitations:
+ *
+ * - ``%n`` is unsupported
+ * - ``%p*`` is handled by pointer()
+ *
+ * See pointer() or Documentation/core-api/printk-formats.rst for more
+ * extensive description.
+ *
+ * **Please update the documentation in both places when making changes**
+ *
+ * The return value is the number of characters which would
+ * be generated for the given input, excluding the trailing
+ * '\0', as per ISO C99. If you want to have the exact
+ * number of characters written into @buf as return value
+ * (not including the trailing '\0'), use vscnprintf(). If the
+ * return is greater than or equal to @size, the resulting
+ * string is truncated.
+ *
+ * If you're not already dealing with a va_list consider using snprintf().
+ */
+int vsnprintf(char *buf, size_t size, const char *fmt, va_list args);
+EXPORT_SYMBOL(vsnprintf);
+```
+
+Next, we take one part of test_printf.c, the string tests, for analysis. The file first wraps the checks in the `do_test` and `__test` functions:
+
+```c
+static int __printf(4, 0) __init
+do_test(int bufsize, const char *expect, int elen,
+        const char *fmt, va_list ap)
+{
+    va_list aq;
+    int ret, written;
+
+    total_tests++;
+
+    memset(alloced_buffer, FILL_CHAR, BUF_SIZE + 2*PAD_SIZE);
+    va_copy(aq, ap);
+    ret = vsnprintf(test_buffer, bufsize, fmt, aq);
+    va_end(aq);
+
+    if (ret != elen) { // does the returned length match the expected length?
+        pr_warn("vsnprintf(buf, %d, \"%s\", ...) returned %d, expected %d\n",
+            bufsize, fmt, ret, elen);
+        return 1;
+    }
+
+    if (memchr_inv(alloced_buffer, FILL_CHAR, PAD_SIZE)) { // was the padding before the buffer overwritten?
+        pr_warn("vsnprintf(buf, %d, \"%s\", ...) wrote before buffer\n", bufsize, fmt);
+        return 1;
+    }
+
+    if (!bufsize) { // bufsize == 0: nothing may be written to the buffer
+        if (memchr_inv(test_buffer, FILL_CHAR, BUF_SIZE + PAD_SIZE)) {
+            pr_warn("vsnprintf(buf, 0, \"%s\", ...) wrote to buffer\n",
+                fmt);
+            return 1;
+        }
+        return 0;
+    }
+
+    written = min(bufsize-1, elen);
+    if (test_buffer[written]) { // check that the output ends with '\0'
+        pr_warn("vsnprintf(buf, %d, \"%s\", ...) did not nul-terminate buffer\n",
+            bufsize, fmt);
+        return 1;
+    }
+
+    if (memchr_inv(test_buffer + written + 1, FILL_CHAR, BUF_SIZE + PAD_SIZE - (written + 1))) { // was anything written past the terminator?
+        pr_warn("vsnprintf(buf, %d, \"%s\", ...) wrote beyond the nul-terminator\n",
+            bufsize, fmt);
+        return 1;
+    }
+
+    if (memcmp(test_buffer, expect, written)) { // compare the buffer contents with the expected output
+        pr_warn("vsnprintf(buf, %d, \"%s\", ...) wrote '%s', expected '%.*s'\n",
+            bufsize, fmt, test_buffer, written, expect);
+        return 1;
+    }
+    return 0; // none of the checks failed: success
+}
+```
+
+```c
+static void __printf(3, 4) __init
+__test(const char *expect, int elen, const char *fmt, ...)
+{
+    va_list ap;
+    int rand;
+    char *p;
+
+    if (elen >= BUF_SIZE) { // expected output does not fit in the buffer: give up
+        pr_err("error in test suite: expected output length %d too long. Format was '%s'.\n",
+            elen, fmt);
+        failed_tests++;
+        return;
+    }
+
+    va_start(ap, fmt);
+
+    /*
+     * Every fmt+args is subjected to four tests: Three where we
+     * tell vsnprintf varying buffer sizes (plenty, not quite
+     * enough and 0), and then we also test that kvasprintf would
+     * be able to print it as expected.
+     */
+    failed_tests += do_test(BUF_SIZE, expect, elen, fmt, ap); // count the cases that fail
+    rand = 1 + prandom_u32_max(elen+1);
+    /* Since elen < BUF_SIZE, we have 1 <= rand <= BUF_SIZE. */
+    failed_tests += do_test(rand, expect, elen, fmt, ap); // random buffer size
+    failed_tests += do_test(0, expect, elen, fmt, ap); // zero buffer size
+
+    p = kvasprintf(GFP_KERNEL, fmt, ap);
+    if (p) {
+        total_tests++;
+        if (memcmp(p, expect, elen+1)) {
+            pr_warn("kvasprintf(..., \"%s\", ...) returned '%s', expected '%s'\n",
+                fmt, p, expect);
+            failed_tests++;
+        }
+        kfree(p);
+    }
+    va_end(ap);
+}
+```
+
+Each concrete test case then passes its arguments to `test`. The example below tests string printing.
+
+```c
+static void __init
+test_string(void)
+{
+    test("", "%s%.0s", "", "123"); // arg 1: expected output; arg 2: format string; remaining varargs: values for the conversion specifiers
+    test("ABCD|abc|123", "%s|%.3s|%.*s", "ABCD", "abcdef", 3, "123456");
+    test("1  |  2|3  |  4|5  ", "%-3s|%3s|%-*s|%*s|%*s", "1", "2", 3, "3", 3, "4", -3, "5");
+    test("1234      ", "%-10.4s", "123456");
+    test("      1234", "%10.4s", "123456");
+    /*
+     * POSIX and C99 say that a negative precision (which is only
+     * possible to pass via a * argument) should be treated as if
+     * the precision wasn't present, and that if the precision is
+     * omitted (as in %.s), the precision should be taken to be
+     * 0. However, the kernel's printf behave exactly opposite,
+     * treating a negative precision as 0 and treating an omitted
+     * precision specifier as if no precision was given.
+     *
+     * These test cases document the current behaviour; should
+     * anyone ever feel the need to follow the standards more
+     * closely, this can be revisited.
+     */
+    test("    ", "%4.*s", -5, "123456");
+    test("123456", "%.s", "123456");
+    test("a||", "%.s|%.0s|%.*s", "a", "b", 0, "c");
+    test("a  |   |   ", "%-3.s|%-3.0s|%-3.*s", "a", "b", 0, "c");
+}
+```
+
+The Makefile then contains the following line:
+
+```makefile
+obj-$(CONFIG_TEST_PRINTF) += test_printf.o
+```
+
+The `CONFIG_TEST_PRINTF` variable determines whether the test_printf object file is built.
+
+lib/Kconfig.debug then shows where in menuconfig the test entry is placed.
+
+```shell
+menu "Kernel hacking"
+config TEST_PRINTF
+    tristate "Test printf() family of functions at runtime"
+```
+
+Then, in the linux-lab directory, run `make kernel-menuconfig` and go to kernel-hacking -> Kernel Testing and Coverage -> Run time Testing, where the test entry for printf appears. Enable it:
+
+![image-20230720214745198](images/riscv-klibc/20230724-klibc-menuconfig.jpg)
+
+In other words, enabling "Test printf() family of functions at runtime" in menuconfig sets `TEST_PRINTF`, which in turn enables `CONFIG_TEST_PRINTF` in the Makefile. The object file built from test_printf.c is then compiled and linked in, and the test runs when the kernel runs.
+
+After configuration, rebuild the kernel with `make kernel` and run it with `make boot`. The debug output then includes:
+
+```shell
+test_printf: loaded.
+crng possibly not yet initialized. plain 'p' buffer contains "(____ptrval____)"
+test_printf: crng possibly not yet initialized. plain 'p' buffer contains "(____ptrval____)"
+test_printf: crng possibly not yet initialized. plain 'p' buffer contains "(____ptrval____)"
+test_printf: crng possibly not yet initialized. plain 'p' buffer contains "(____ptrval____)"
+test_printf: all 416 tests passed
+```
+
+![image-20230720215808605](images/riscv-klibc/20230724-klibc-test-vsprint-example.jpg)
+
+This output shows that the test module loaded successfully and passed its tests.
+
+## Testing Linux kernel klibc
+
+The klibc here is the klibc library downloaded and installed from the official Linux site, not the libc on the lab disk. The two are similar in many ways; for example, the overall test setup in both follows the KUnit-style Makefile, Kconfig, and menuconfig configuration framework.
+
+Linux kernel download: [The Linux Kernel Archives][001]
+
+klibc download: [https://mirrors.kernel.org/pub/linux/libs/klibc/][002]
+
+I downloaded klibc 2.0.12 and the Linux kernel 6.4.5 stable release.
+
+If you try to build klibc directly, it reports that some required header files are missing and asks you to run `make headers_install INSTALL_HDR_PATH=/...` to export them.
+
+What this means is that building klibc needs a set of header files that come from the Linux kernel. You have to run the suggested command in the Linux kernel source tree to export the required headers to the specified location.
+
+Then run `make install` in klibc to install the klibc library.
+
+```shell
+make clean
+make install
+make test # run the test cases in the klibc library this way
+```
+
+## Summary
+
+This article analyzed the directory layout of the klibc library on RISC-V, how to build and run it, and how to add and run test files. It also looked at how to build, run, and test the Linux kernel library encountered during testing.
+
+From this process we can also work backward to the steps for adding a new library function and testing it:
+
+- First, write the source file in the lib directory;
+- In lib/Makefile, tie the module to CONFIG_NAME with a line of roughly the form `obj-$(CONFIG_NAME) += XXX.o`, so that it is visible to Kconfig;
+- In lib/Kconfig, declare the type of CONFIG_NAME (for example, an optional bool) and the menu it belongs to, so that developers know where in menuconfig to enable it;
+- Enable the module in menuconfig, save the .config file, and rebuild with `make kernel`, so that the module is enabled on the next boot;
+- Writing the test file: the basic steps are the same as above. Write the file and add its entry to lib/Kconfig.debug. Test files are usually named test_xxx and placed under `make kernel-menuconfig` -> kernel-hacking -> Kernel Testing and Coverage -> Run time Testing;
+- After the next build, `make boot` runs the test file automatically as a boot-time test.
+
+This article only made a brief attempt at using the Linux kernel library. Later articles may explore the concrete usage of klibc and continue the analysis of kernel libc.
+
+## References
+
+- [https://www.kernel.org][001]
+- [pub/linux/libs/klibc][002]
+
+[001]: https://www.kernel.org/
+[002]: https://mirrors.kernel.org/pub/linux/libs/klibc/
diff --git a/articles/images/20230615-section-gc-part3/image-20230601152244101.png b/articles/images/20230615-section-gc-part3/image-20230601152244101.png
new file mode 100644
index
0000000000000000000000000000000000000000..566f28bda83643411122e078442c85713666097a Binary files /dev/null and b/articles/images/20230615-section-gc-part3/image-20230601152244101.png differ diff --git a/articles/images/20230615-section-gc-part3/image-20230601153457712.png b/articles/images/20230615-section-gc-part3/image-20230601153457712.png new file mode 100644 index 0000000000000000000000000000000000000000..26d8316a426790cdfd8cfac39e274c7c86906ff2 Binary files /dev/null and b/articles/images/20230615-section-gc-part3/image-20230601153457712.png differ diff --git a/articles/images/20230615-section-gc-part3/image-20230601222533263.png b/articles/images/20230615-section-gc-part3/image-20230601222533263.png new file mode 100644 index 0000000000000000000000000000000000000000..c50b437f9cb22e43d3c267618e9a1de89b7ad40f Binary files /dev/null and b/articles/images/20230615-section-gc-part3/image-20230601222533263.png differ diff --git a/articles/images/20230615-section-gc-part3/image-20230615160546236.png b/articles/images/20230615-section-gc-part3/image-20230615160546236.png new file mode 100644 index 0000000000000000000000000000000000000000..30341018368f3618a8af5774cc98fe68ce1660d8 Binary files /dev/null and b/articles/images/20230615-section-gc-part3/image-20230615160546236.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/NAPOT-encoding.png b/articles/images/20230626-rvsec-intro-part1/NAPOT-encoding.png new file mode 100644 index 0000000000000000000000000000000000000000..e96ce828613a5c78e659f02746618a48d8cecda7 Binary files /dev/null and b/articles/images/20230626-rvsec-intro-part1/NAPOT-encoding.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/RV32-pmpcfg-layout.png b/articles/images/20230626-rvsec-intro-part1/RV32-pmpcfg-layout.png new file mode 100644 index 0000000000000000000000000000000000000000..79e98139710c1f6f23b01bf3f3f71514577067a6 Binary files /dev/null and 
b/articles/images/20230626-rvsec-intro-part1/RV32-pmpcfg-layout.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/RV64-pmpcfg-layout.png b/articles/images/20230626-rvsec-intro-part1/RV64-pmpcfg-layout.png new file mode 100644 index 0000000000000000000000000000000000000000..52ea5f63d63352caa7cdb297643842047eff9cb7 Binary files /dev/null and b/articles/images/20230626-rvsec-intro-part1/RV64-pmpcfg-layout.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/keystone-arch.png b/articles/images/20230626-rvsec-intro-part1/keystone-arch.png new file mode 100644 index 0000000000000000000000000000000000000000..820e48a4fd911924d4c23c0e442e59c338dfd111 Binary files /dev/null and b/articles/images/20230626-rvsec-intro-part1/keystone-arch.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/keystone-pmp.png b/articles/images/20230626-rvsec-intro-part1/keystone-pmp.png new file mode 100644 index 0000000000000000000000000000000000000000..620e0f2bf00ad0d576ee72e6c7622332a5cda9a5 Binary files /dev/null and b/articles/images/20230626-rvsec-intro-part1/keystone-pmp.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/keystone-sm-api.png b/articles/images/20230626-rvsec-intro-part1/keystone-sm-api.png new file mode 100644 index 0000000000000000000000000000000000000000..62a0ab6ce0ed968322b7de17ce4b25479e064b54 Binary files /dev/null and b/articles/images/20230626-rvsec-intro-part1/keystone-sm-api.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/pmpaddr.png b/articles/images/20230626-rvsec-intro-part1/pmpaddr.png new file mode 100644 index 0000000000000000000000000000000000000000..e7304eefa43e08b909507d2721c711a227be48ab Binary files /dev/null and b/articles/images/20230626-rvsec-intro-part1/pmpaddr.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/pmpcfg-a.png b/articles/images/20230626-rvsec-intro-part1/pmpcfg-a.png new file mode 100644 index 
0000000000000000000000000000000000000000..1b8bdf166c194f6e181d4a9fd4ee2d677c1a6bf9 Binary files /dev/null and b/articles/images/20230626-rvsec-intro-part1/pmpcfg-a.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/pmpcfg.png b/articles/images/20230626-rvsec-intro-part1/pmpcfg.png new file mode 100644 index 0000000000000000000000000000000000000000..5a929922a3624f59edc5a2b1ee7892028c0a6142 Binary files /dev/null and b/articles/images/20230626-rvsec-intro-part1/pmpcfg.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/smepmp-hit.png b/articles/images/20230626-rvsec-intro-part1/smepmp-hit.png new file mode 100644 index 0000000000000000000000000000000000000000..ae02ff56156758a369e07858075c55218434f5ab Binary files /dev/null and b/articles/images/20230626-rvsec-intro-part1/smepmp-hit.png differ diff --git a/articles/images/20230626-rvsec-intro-part1/smepmp-miss.png b/articles/images/20230626-rvsec-intro-part1/smepmp-miss.png new file mode 100644 index 0000000000000000000000000000000000000000..51bee68806e31844f84ed6bc2adefcc37a8db4f3 Binary files /dev/null and b/articles/images/20230626-rvsec-intro-part1/smepmp-miss.png differ diff --git a/articles/images/introduction-to-riscv-sbi/img.png b/articles/images/introduction-to-riscv-sbi/img.png new file mode 100644 index 0000000000000000000000000000000000000000..4b45cad4b1cc606f80f3aa16149cf76103de4fd0 Binary files /dev/null and b/articles/images/introduction-to-riscv-sbi/img.png differ diff --git a/articles/images/introduction-to-riscv-sbi/sbi1.svg b/articles/images/introduction-to-riscv-sbi/sbi1.svg new file mode 100644 index 0000000000000000000000000000000000000000..cd2ca83dd9ff5a2d7d7a94d20f902398163c3230 --- /dev/null +++ b/articles/images/introduction-to-riscv-sbi/sbi1.svg @@ -0,0 +1 @@ +
[Figure sbi1.svg, text labels only: Applications (U-mode), Operating System kernel (S-mode), Platform Runtime Firmware (SEE) (M-mode); interface labels: System Calls, SBI]
\ No newline at end of file diff --git a/articles/images/introduction-to-riscv-sbi/sbi2.svg b/articles/images/introduction-to-riscv-sbi/sbi2.svg new file mode 100644 index 0000000000000000000000000000000000000000..10351c797460bd27014b76c2ee0af69d22f0370c --- /dev/null +++ b/articles/images/introduction-to-riscv-sbi/sbi2.svg @@ -0,0 +1 @@ +
[Figure sbi2.svg, text labels only: Virtualized World: Guest Applications (VU-mode), Guest Kernel (VS-mode); Host/Hypervisor world: Host Applications, Host Kernel/Hypervisor(SEE) (HS-mode), Platform Runtime Firmware (SEE) (M-mode); interface labels: System Calls, SBI]
\ No newline at end of file diff --git a/articles/images/introduction-to-riscv-sbi/sbi3.svg b/articles/images/introduction-to-riscv-sbi/sbi3.svg new file mode 100644 index 0000000000000000000000000000000000000000..f8b51d3e18aed21d921ca3d4770d0037d7e8be6d --- /dev/null +++ b/articles/images/introduction-to-riscv-sbi/sbi3.svg @@ -0,0 +1,4 @@ + + + +
[Figure sbi3.svg, text labels only: Application (U-mode), printf(); Linux kernel (S-mode), ecall (putchar); SBI Implementation (M-mode), Hello World; edge labels: complete; step numbers 1 to 5]
\ No newline at end of file diff --git a/articles/images/qemu-system-decode-analyse/gen_set_gpr.svg b/articles/images/qemu-system-decode-analyse/gen_set_gpr.svg new file mode 100644 index 0000000000000000000000000000000000000000..c458520a45f4b02868527643516723422ebbe805 --- /dev/null +++ b/articles/images/qemu-system-decode-analyse/gen_set_gpr.svg @@ -0,0 +1,4 @@ + + + +
[Figure gen_set_gpr.svg, text labels only. Call chain nodes:]
gen_set_gpr(ctx, a->rd, dest)
tcg_gen_ext32s_tl(cpu_gpr[reg_num], t)
tcg_gen_mov_i32(cpu_gpr[reg_num], t)
tcg_gen_op2_i32(INDEX_op_mov_i32, ret, arg)
tcg_gen_op2(INDEX_op_mov_i32, ret, arg)
[tcg_gen_op2 definition:]
void tcg_gen_op2(TCGOpcode opc, TCGArg a1, TCGArg a2)
{
    TCGOp *op = tcg_emit_op(opc, 2);
    op->args[0] = a1;
    op->args[1] = a2;
}
\ No newline at end of file diff --git a/articles/images/qemu-system-decode-analyse/tcg_gen_code.svg b/articles/images/qemu-system-decode-analyse/tcg_gen_code.svg new file mode 100644 index 0000000000000000000000000000000000000000..b0009c66a62e2cbe7542d613bbdfd488e3fa08da --- /dev/null +++ b/articles/images/qemu-system-decode-analyse/tcg_gen_code.svg @@ -0,0 +1,4 @@ + + + +
[Figure tcg_gen_code.svg, text labels only. Call chain nodes:]
tcg_gen_code()
tcg_reg_alloc_op()
tcg_out_op()
case INDEX_op_add_i64
tcg_out32()
[tcg_out32 definition:]
static __attribute__((unused)) inline void tcg_out32(TCGContext *s, uint32_t v)
{
    if (TCG_TARGET_INSN_UNIT_SIZE == 4) {
        *s->code_ptr++ = v;
    } else {
        tcg_insn_unit *p = s->code_ptr;
        memcpy(p, &v, sizeof(v));
        s->code_ptr = p + (4 / TCG_TARGET_INSN_UNIT_SIZE);
    }
}
\ No newline at end of file diff --git a/articles/images/qemu-system-decode-analyse/translation_and_execution_loop.svg b/articles/images/qemu-system-decode-analyse/translation_and_execution_loop.svg new file mode 100644 index 0000000000000000000000000000000000000000..d530d6f0b4785c7664e3697a326c5df9a3e5bb3d --- /dev/null +++ b/articles/images/qemu-system-decode-analyse/translation_and_execution_loop.svg @@ -0,0 +1,4 @@ + + + +
[Figure translation_and_execution_loop.svg, text labels only: qemu init, main(), tcg_cpus_exec(), cpu_exec(), cpu_exec_loop(), TB lookup; branch "Found": cpu_loop_exec_tb(), cpu_tb_exec(); branch "Null": tb_gen_code(), setjmp_gen_code(); TCG Front End: gen_intermediate_code(), translator_loop(), riscv_tr_translate_insn(), decode_opc(), decode_insn32(); TCG Back End: tcg_gen_code(), tcg_out_op(), tcg_out_xxx(), update buffer]
\ No newline at end of file diff --git a/articles/images/riscv-klibc/20230724-klibc-file-structure.jpg b/articles/images/riscv-klibc/20230724-klibc-file-structure.jpg new file mode 100644 index 0000000000000000000000000000000000000000..03537663d4926422a538475b21d1ef3269d417a8 Binary files /dev/null and b/articles/images/riscv-klibc/20230724-klibc-file-structure.jpg differ diff --git a/articles/images/riscv-klibc/20230724-klibc-menuconfig.jpg b/articles/images/riscv-klibc/20230724-klibc-menuconfig.jpg new file mode 100644 index 0000000000000000000000000000000000000000..90a8d9144932c180ebfb8a62eee5d62071fa988c Binary files /dev/null and b/articles/images/riscv-klibc/20230724-klibc-menuconfig.jpg differ diff --git a/articles/images/riscv-klibc/20230724-klibc-test-vsprint-example.jpg b/articles/images/riscv-klibc/20230724-klibc-test-vsprint-example.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b0b43c64607dc56d45927e8421d26d13477e58c4 Binary files /dev/null and b/articles/images/riscv-klibc/20230724-klibc-test-vsprint-example.jpg differ diff --git a/articles/images/riscv-linear-mapping/sv57_address_trans.png b/articles/images/riscv-linear-mapping/sv57_address_trans.png new file mode 100644 index 0000000000000000000000000000000000000000..1dbc0bcf63f89af983a10658ecb0755f07b96377 Binary files /dev/null and b/articles/images/riscv-linear-mapping/sv57_address_trans.png differ diff --git a/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-1.png b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-1.png new file mode 100644 index 0000000000000000000000000000000000000000..3849dca2a1be301b1b47ad3dfd67fccef38ad8bb Binary files /dev/null and b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-1.png differ diff --git a/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-2.png b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-2.png new file mode 
100644 index 0000000000000000000000000000000000000000..097a02c6aae6ef0ec1b8f6f05d85ec7ae8825ff2 Binary files /dev/null and b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-2.png differ diff --git a/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-3.png b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-3.png new file mode 100644 index 0000000000000000000000000000000000000000..313572acf6e177b2363941673d8580a27ec32188 Binary files /dev/null and b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-3.png differ diff --git a/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-4.png b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-4.png new file mode 100644 index 0000000000000000000000000000000000000000..66fffecb5f3f8fbe83a6fef87043f78675351978 Binary files /dev/null and b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-4.png differ diff --git a/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-5.png b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-5.png new file mode 100644 index 0000000000000000000000000000000000000000..3145430505ef5166bba51119dd95123d0c76aba1 Binary files /dev/null and b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-5.png differ diff --git a/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-6.png b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-6.png new file mode 100644 index 0000000000000000000000000000000000000000..6b22e93c96244fbdee282d05cce7cc7196d520fb Binary files /dev/null and b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-6.png differ diff --git a/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-7.png b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-7.png new file mode 100644 index 
0000000000000000000000000000000000000000..9040917b8050867e14f36485fb38be79bbba3aa2
Binary files /dev/null and b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-7.png differ
diff --git a/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-8.png b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-8.png
new file mode 100644
index 0000000000000000000000000000000000000000..9c6280c6562260b580025a31d7942f9598c829ae
Binary files /dev/null and b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-8.png differ
diff --git a/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-9.png b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-9.png
new file mode 100644
index 0000000000000000000000000000000000000000..7b895572cf3b7f76bd65d89e70984d0cf0d4de49
Binary files /dev/null and b/articles/images/riscv-riscv_kvm_int_impl_2/mermaid-riscv-kvm-int-impl-2-9.png differ
diff --git a/meeting/README.md b/meeting/README.md
index e62c5333d502b8068f3f7308e3afee1be2e2e62d..03f756a8c36894a636be529a6c505b6057265d62 100644
--- a/meeting/README.md
+++ b/meeting/README.md
@@ -11,6 +11,27 @@
 * Meeting time: 20:00 PM - 20:30 PM
 * Livestream time: 20:30 PM - 21:30 PM
 
+## Notes for Speakers
+
+1. About the outline
+
+   - Before sending the outline, have it reviewed by your mentor and by Teacher Wu; once both have confirmed it, send it to @晓怡
+   - Ideally, send the outline one week in advance; the latest deadline is the Wednesday of the week of your livestream
+
+2. PPT template for the livestream
+
+   - The [ppt](https://gitee.com/tinylab/riscv-linux/tree/master/ppt) directory contains a standard template; search for "template", and remember to download the latest version online
+   - You can also use the slides template in [Markdown Lab](https://gitee.com/tinylab/markdown-lab)
+
+3. Join the meeting room about ten minutes early
+
+   - The network needs to be tested when you join; if your computer's audio is poor, you can log in from a phone and use it as the audio input device
+   - See the next section for the meeting address
+
+4. Submit a slides PR
+
+   After the talk, please submit your slides to the ppt directory mentioned above
+
 ## Meeting Address
 
 Both the meeting and the livestream use Tencent Meeting; Bilibili livestreaming was added on 2022/07/23.
@@ -67,6 +88,26 @@ Linux Kernel Watch: timely analysis of the key
 
 ### Completed
 
+- 20230715: Introduction to SBI - Teacher Liu - Bilibili
+  - This session invited Teacher Liu to share an introduction to SBI.
+  - Video clip uploaded
+
+- 20230708: RISC-V Linux v6.4-rc linear mapping practise - Teacher Song - Bilibili @sugarfillet
+  - This session invited Teacher Song to share RISC-V Linux v6.4-rc linear mapping practise.
+  - Video clip uploaded
+
+- 20230701: The Evolution and Current State of Linux Package Managers - Teacher Zhang - Bilibili @IIE
+  - This session invited Teacher Zhang to share the evolution and current state of Linux package managers.
+  - Video clip uploaded
+
+- 20230624: Tinylab Community TSoC2023 Summer Internship Kickoff - Teacher Tan - Bilibili @Reset12138
+  - This session invited Teacher Tan to present the Tinylab Community TSoC2023 summer internship kickoff.
+  - Video clip uploaded
+
+- 20230617: QEMU debug & upstream practice - Teacher Wu - Bilibili @falcon
+  - This session invited Teacher Wu to share QEMU debug & upstream practice.
+  - Video clip uploaded
+
 - 20230610: What on Earth Is RISC-V Semihosting? - Teacher Meng - Bilibili @lbmeng
   - This session invited Teacher Meng to share what RISC-V Semihosting is.
   - Video clip uploaded
@@ -298,6 +339,55 @@ Linux Kernel Watch: timely analysis of the key
 
 ## Meeting Notes
 
+### 20230715: Week 68
+
+This session invited Teacher Liu to share an introduction to SBI.
+
+* @lbmeng: Knowledge Planet post: the dtc -@ option explained.
+* @falcon: updated progress on several issues; reviewed all PRs.
+* @tinylab: published this week's technical articles on multiple channels.
+
+### 20230708: Week 67
+
+This session invited Teacher Song to share RISC-V Linux v6.4-rc linear mapping practise.
+
+* @YJMSTR: Knowledge Planet post: naming conventions for RISC-V instruction set extensions.
+* @lbmeng: Knowledge Planet post: syntactic sugar for device tree overlays.
+* @falcon: updated progress on several issues; reviewed all PRs.
+* @tinylab: published this week's technical articles on multiple channels.
+
+### 20230701: Week 66
+
+This session invited Teacher Zhang to share the evolution and current state of Linux package managers.
+
+* @yooyoyo: added news [RISC-V Linux Kernel and Related Technology Updates][news].
+* @iOSDevLog: two Knowledge Planet posts.
+* @Kepontry: added !724 modify articles/20230617-software-prefetch.
+* @lbmeng: four Knowledge Planet posts.
+* @falcon: updated progress on several issues; reviewed all PRs.
+* @tinylab: published this week's technical articles on multiple channels.
+
+### 20230624: Week 65
+
+This session invited Teacher Tan to present the Tinylab Community TSoC2023 summer internship kickoff.
+
+* @yooyoyo: added news [RISC-V Linux Kernel and Related Technology Updates][news].
+* @Reset12138: Knowledge Planet post: how to debug GCC. Added !720 add article 20230615-section-gc-part3.
+* @groot00114: added !710 introduction-to-riscv-sbi, second submission of the article.
+* @Jingqing3948: added !715 Add 20230617-summary-of-optimization-content-for-str-and-mem-functions.
+* @lbmeng: added 719 ppt: Add riscv-semihosting.
+* @falcon: updated progress on several issues; reviewed all PRs.
+* @tinylab: published this week's technical articles on multiple channels.
+
+### 20230617: Week 64
+
+This session invited Teacher Wu to share QEMU debug & upstream practice.
+
+* @yooyoyo: added news [RISC-V Linux Kernel and Related Technology Updates][news].
+* @lbmeng: Knowledge Planet post: how to elegantly add an account for passwordless access to a remote server.
+* @falcon: updated progress on several issues; reviewed all PRs.
+* @tinylab: published this week's technical articles on multiple channels.
+
 ### 20230610: Week 63
 
 This session invited Teacher Meng to share what RISC-V Semihosting is.
diff --git a/news/README.md b/news/README.md
index e0c25fb38791951ca87c09cbafc21fefe0b31c63..ed30dfc3ea9aadaa37a84976ddda3db7d3e9ff5c 100644
--- a/news/README.md
+++ b/news/README.md
@@ -5,6 +5,4993 @@
 * [2022](2022.md)
 * [2023 - First Half](2023-1st-half.md)
 
+## 20230721: Issue 54
+
+### Kernel Updates
+
+#### RISC-V Architecture Support
+
+**[v1: bpf-next: bpf, riscv: use BPF prog pack allocator in BPF JIT](http://lore.kernel.org/linux-riscv/20230720154941.1504-1-puranjay12@gmail.com/)**
+
+> BPF programs currently consume a page each on RISCV. For systems with many BPF
+> programs, this adds significant pressure to instruction TLB. High iTLB pressure
+> usually causes slow down for the whole system.
+>
+
+**[v4: riscv: entry: set a0 = -ENOSYS only when syscall != -1](http://lore.kernel.org/linux-riscv/20230720140348.4716-1-CoelacanthusHex@gmail.com/)**
+
+> When we test seccomp with 6.4 kernel, we found errno has wrong value.
+> If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will
+> get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv: entry:
+> Save a0 prior syscall_enter_from_user_mode()").
+>
+
+**[v2: Add SiFive Private L2 cache and PMU driver](http://lore.kernel.org/linux-riscv/20230720135125.21240-1-eric.lin@sifive.com/)**
+
+> This patch series adds the SiFive Private L2 cache controller
+> driver and Performance Monitoring Unit (PMU) driver.
+> + +**[v1: riscv: add SBI SUSP extension support](http://lore.kernel.org/linux-riscv/tencent_B931BF1864B6AE8C674686ED9852ACFA0609@qq.com/)** + +> RISC-V SBI spec 2.0 [1] introduces System Suspend Extension which can be +> used to suspend the platform via SBI firmware. +> + +**[v1: Linux RISC-V IOMMU Support](http://lore.kernel.org/linux-riscv/cover.1689792825.git.tjeznach@rivosinc.com/)** + +> The RISC-V IOMMU specification is now ratified as-per the RISC-V international +> process [1]. The latest frozen specifcation can be found at: +> https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf +> + +**[GIT PULL: StarFive clock driver additions for v6.6](http://lore.kernel.org/linux-riscv/20230719-trough-frisk-40b92acb485a@spud/)** + +> Please pull some clock driver additions for StarFive. I've had these +> commits, other than a rebase to pick up R-b tags from Emil, out for LKP +> to have a look at for a few days and they've gotten a clean bill of +> health. Some of the dt-binding stuff "only" has a review from me, but +> since I am a dt-binding maintainer that's fine, although maybe not +> common knowledge yet. +> + +**[v2: gpio: sifive: Module support](http://lore.kernel.org/linux-riscv/20230719163446.1398961-1-samuel.holland@sifive.com/)** + +> With the call to of_irq_count() removed, the SiFive GPIO driver can be +> built as a module. This helps to minimize the size of a multiplatform +> kernel, and is required by some downstream distributions (Android GKI). +> + +**[v1: Risc-V Kvm Smstateen](http://lore.kernel.org/linux-riscv/20230719160316.4048022-1-mchitale@ventanamicro.com/)** + +> This series adds support to detect the Smstateen extension for both, the +> host and the guest vcpu. It also adds senvcfg and sstateen0 to the ONE_REG +> interface and the vcpu context save/restore. 
+> + +**[v6: Linux RISC-V AIA Support](http://lore.kernel.org/linux-riscv/20230719113542.2293295-1-apatel@ventanamicro.com/)** + +> The RISC-V AIA specification is now frozen as-per the RISC-V international +> process. The latest frozen specification can be found at: +> https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf +> + +**[v1: Refactoring Microchip PolarFire PCIe driver](http://lore.kernel.org/linux-riscv/20230719102057.22329-1-minda.chen@starfivetech.com/)** + +> This patchset final purpose is add PCIe driver for StarFive JH7110 SoC. +> JH7110 using PLDA XpressRICH PCIe IP. Microchip PolarFire Using the +> same IP and have commit their codes, which are mixed with PLDA +> controller codes and Microchip platform codes. +> + +**[v5: Add initialization of clock for StarFive JH7110 SoC](http://lore.kernel.org/linux-riscv/20230719092545.1961401-1-william.qiu@starfivetech.com/)** + +> This patchset adds initial rudimentary support for the StarFive +> Quad SPI controller driver. And this driver will be used in +> StarFive's VisionFive 2 board. In 6.4, the QSPI_AHB and QSPI_APB +> clocks changed from the default ON state to the default OFF state, +> so these clocks need to be enabled in the driver. At the same time, +> dts patch is added to this series. +> + +**[v3: riscv: Reduce ARCH_KMALLOC_MINALIGN to 8](http://lore.kernel.org/linux-riscv/20230718152214.2907-1-jszhang@kernel.org/)** + +> Currently, riscv defines ARCH_DMA_MINALIGN as L1_CACHE_BYTES, I.E +> 64Bytes, if CONFIG_RISCV_DMA_NONCOHERENT=y. To support unified kernel +> Image, usually we have to enable CONFIG_RISCV_DMA_NONCOHERENT, thus +> it brings some bad effects to coherent platforms: +> +> Firstly, it wastes memory, kmalloc-96, kmalloc-32, kmalloc-16 and +> kmalloc-8 slab caches don't exist any more, they are replaced with +> either kmalloc-128 or kmalloc-64. 
+> + +**[v1: asm-generic: ticket-lock: Optimize arch_spin_value_unlocked](http://lore.kernel.org/linux-riscv/20230719070001.795010-1-guoren@kernel.org/)** + +> Using arch_spinlock_is_locked would cause another unnecessary memory +> access to the contended value. Although it won't cause a significant +> performance gap in most architectures, the arch_spin_value_unlocked +> argument contains enough information. Thus, remove unnecessary +> atomic_read in arch_spin_value_unlocked(). +> + +**[v2: riscv: entry: set a0 prior to syscall_enter_from_user_mode](http://lore.kernel.org/linux-riscv/20230718162940.226118-1-CoelacanthusHex@gmail.com/)** + +> When we test seccomp with 6.4 kernel, we found errno has wrong value. +> If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will +> get ENOSYS instead. We got same result with 9c2598d43510 ("riscv: entry: Save a0 +> prior syscall_enter_from_user_mode()"). +> + +**[v1: riscv: Move the "Call Trace" to dump_backrace().](http://lore.kernel.org/linux-riscv/20230718023201.16018-1-minachou@andestech.com/)** + +> It would be appropriate to show "Call Trace" within the dump_backtrace +> function to ensure that some kernel dumps include this information. +> + +**[v2: usb: Explicitly include correct DT includes](http://lore.kernel.org/linux-riscv/20230718143027.1064731-1-robh@kernel.org/)** + +> The DT of_device.h and of_platform.h date back to the separate +> of_platform_bus_type before it as merged into the regular platform bus. +> As part of that merge prepping Arm DT support 13 years ago, they +> "temporarily" include each other. They also include platform_device.h +> and of.h. As a result, there's a pretty much random mix of those include +> files used throughout the tree. In order to detangle these headers and +> replace the implicit includes with struct declarations, users need to +> explicitly include the correct includes. 
+> + +**[v1: irqchip/sifive-plic: Avoid clearing the per-hart enable bits](http://lore.kernel.org/linux-riscv/20230717185841.1294425-1-samuel.holland@sifive.com/)** + +> Writes to the PLIC completion register are ignored if the enable bit for +> that (interrupt, hart) combination is cleared. This leaves the interrupt +> in a claimed state, preventing it from being triggered again. +> + +**[v11: KVM: guest_memfd() and per-page attributes](http://lore.kernel.org/linux-riscv/20230718234512.1690985-1-seanjc@google.com/)** + +> This is the next iteration of implementing fd-based (instead of vma-based) +> memory for KVM guests. If you want the full background of why we are doing +> this, please go read the v10 cover letter[1]. +> +> The biggest change from v10 is to implement the backing storage in KVM +> itself, and expose it via a KVM ioctl() instead of a "generic" syscall. +> See link[2] for details on why we pivoted to a KVM-specific approach. +> + +**[v1: riscv: dts: starfive: jh71x0: Add temperature sensor nodes and thermal-zones](http://lore.kernel.org/linux-riscv/20230718034937.92999-1-hal.feng@starfivetech.com/)** + +> These patches add temperature sensor nodes and thermal-zones for the +> StarFive JH71X0 SoC. I have tested them on the BeagleV Starlight board +> and StarFive VisionFive 1 / 2 board. Thanks. +> + +**[v1: clk: Explicitly include correct DT includes](http://lore.kernel.org/linux-riscv/20230714174342.4052882-1-robh@kernel.org/)** + +> The DT of_device.h and of_platform.h date back to the separate +> of_platform_bus_type before it as merged into the regular platform bus. +> As part of that merge prepping Arm DT support 13 years ago, they +> "temporarily" include each other. They also include platform_device.h +> and of.h. As a result, there's a pretty much random mix of those include +> files used throughout the tree. 
In order to detangle these headers and +> replace the implicit includes with struct declarations, users need to +> explicitly include the correct includes. +> + +**[v1: riscv: kernel: insert space before the open parenthesis '('](http://lore.kernel.org/linux-riscv/b90d162c4fb8062355634fb53b05173d@208suo.com/)** + +> Fix below checkpatch error: +> +> /riscv/kernel/smp.c:93:ERROR: space required before the open parenthesis +> '(' +> + +**[v7: Add PLL clocks driver and syscon for StarFive JH7110 SoC](http://lore.kernel.org/linux-riscv/20230717023040.78860-1-xingyu.wu@starfivetech.com/)** + +> This patch series is to add PLL clocks driver and providers by writing +> and reading syscon registers for the StarFive JH7110 RISC-V SoC. And add +> documentation and nodes to describe StarFive System Controller(syscon) +> Registers. This patch series is based on Linux 6.4. +> + +**[v2: riscv: support PREEMPT_DYNAMIC with static keys](http://lore.kernel.org/linux-riscv/20230716164925.1858-1-jszhang@kernel.org/)** + +> Currently, each architecture can support PREEMPT_DYNAMIC through +> either static calls or static keys. To support PREEMPT_DYNAMIC on +> riscv, we face three choices: +> +> only add static calls support to riscv +> As Mark pointed out in commit 99cf983cc8bc ("sched/preempt: Add +> PREEMPT_DYNAMIC using static keys"), static keys "...should have +> slightly lower overhead than non-inline static calls, as this +> effectively inlines each trampoline into the start of its callee. This +> may avoid redundant work, and may integrate better with CFI schemes." +> So even we add static calls(without inline static calls) to riscv, +> static keys is still a better choice. +> + +**[v1: riscv: Add HAVE_IOREMAP_PROT support](http://lore.kernel.org/linux-riscv/20230716152033.3713581-1-guoren@kernel.org/)** + +> Add pte_pgprot macro, then riscv could have HAVE_IOREMAP_PROT, +> which will enable generic_access_phys() code, it is useful for +> debug, eg, gdb. 
+> +> Because generic_access_phys() would call ioremap_prot()-> +> pgprot_nx() to disable executable attribute, add definition +> of pgprot_nx() for riscv. +> + +**[v1: Add support for Allwinner D1 CAN controllers](http://lore.kernel.org/linux-riscv/20230715112523.2533742-1-contact@jookia.org/)** + +> This patch series adds support for the Allwinner D1 CAN controllers. +> It requires adding a new device tree compatible and driver support to +> work around some hardware quirks. +> + +**[v9: Add support for Allwinner GPADC on D1/T113s/R329/T507 SoCs](http://lore.kernel.org/linux-riscv/20230715091816.3074375-1-bigunclemax@gmail.com/)** + +> This series adds support for general purpose ADC (GPADC) on new +> Allwinner's SoCs, such as D1, T113s, T507 and R329. The implemented driver +> provides basic functionality for getting ADC channels data. +> + +**[v1: bpf: riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework](http://lore.kernel.org/linux-riscv/20230715090137.2141358-1-pulehui@huaweicloud.com/)** + +> Commit 6724a76cff85 ("riscv: ftrace: Reduce the detour code size to +> half") optimizes the detour code size of kernel functions to half with +> T0 register and the upcoming DYNAMIC_FTRACE_WITH_DIRECT_CALLS of riscv +> is based on this optimization, we need to adapt riscv bpf trampoline +> based on this. One thing to do is to reduce detour code size of bpf +> programs, and the second is to deal with the return address after the +> execution of bpf trampoline. Meanwhile, add more comments and rename +> some variables to make more sense. The related tests have passed. +> + +**[v1: pwm: Constistenly name pwm_chip variables "chip"](http://lore.kernel.org/linux-riscv/20230714205623.2496590-1-u.kleine-koenig@pengutronix.de/)** + +> The first offenders I found were the core and the atmel-hlcdc driver. 
+> After I found these I optimistically assumed these were the only ones +> with the unusual names and send patches for these out individually +> before checking systematically. +> + +**[v1: soc: microchip: Explicitly include correct DT includes](http://lore.kernel.org/linux-riscv/20230714175139.4067685-1-robh@kernel.org/)** + +> The DT of_device.h and of_platform.h date back to the separate +> of_platform_bus_type before it as merged into the regular platform bus. +> As part of that merge prepping Arm DT support 13 years ago, they +> "temporarily" include each other. They also include platform_device.h +> and of.h. As a result, there's a pretty much random mix of those include +> files used throughout the tree. In order to detangle these headers and +> replace the implicit includes with struct declarations, users need to +> explicitly include the correct includes. +> + +**[v1: reset: Explicitly include correct DT includes](http://lore.kernel.org/linux-riscv/20230714174939.4063667-1-robh@kernel.org/)** + +> The DT of_device.h and of_platform.h date back to the separate +> of_platform_bus_type before it as merged into the regular platform bus. +> As part of that merge prepping Arm DT support 13 years ago, they +> "temporarily" include each other. They also include platform_device.h +> and of.h. As a result, there's a pretty much random mix of those include +> files used throughout the tree. In order to detangle these headers and +> replace the implicit includes with struct declarations, users need to +> explicitly include the correct includes. +> + +**[v6: RISC-V: mm: Make SV48 the default address space](http://lore.kernel.org/linux-riscv/20230714165508.94561-1-charlie@rivosinc.com/)** + +> Make sv48 the default address space for mmap as some applications +> currently depend on this assumption. Users can now select a +> desired address space using a non-zero hint address to mmap. 
Previously, +> requesting the default address space from mmap by passing zero as the hint +> address would result in using the largest address space possible. Some +> applications depend on empty bits in the virtual address space, like Go and +> Java, so this patch provides more flexibility for application developers. +> + +**[v1: Add ethernet nodes for StarFive JH7110 SoC](http://lore.kernel.org/linux-riscv/20230714104521.18751-1-samin.guo@starfivetech.com/)** + +> This series adds ethernet nodes for StarFive JH7110 RISC-V SoC, +> and has been tested on StarFive VisionFive-2 v1.2A and v1.3B SBC boards. +> +> The first patch adds ethernet nodes for jh7110 SoC, the second patch +> adds ethernet nodes for visionfive 2 SBCs. +> + +**[v4: RESEND: riscv: Introduce KASLR](http://lore.kernel.org/linux-riscv/20230713150800.120821-1-alexghiti@rivosinc.com/)** + +> The following KASLR implementation allows to randomize the kernel mapping: +> +> - virtually: we expect the bootloader to provide a seed in the device-tree +> - physically: only implemented in the EFI stub, it relies on the firmware to +> provide a seed using EFI_RNG_PROTOCOL. arm64 has a similar implementation +> hence the patch 3 factorizes KASLR related functions for riscv to take +> advantage. +> + +**[v4: riscv: Introduce KASLR](http://lore.kernel.org/linux-riscv/20230713133401.116506-1-alexghiti@rivosinc.com/)** + +> The following KASLR implementation allows to randomize the kernel mapping: +> +> - virtually: we expect the bootloader to provide a seed in the device-tree +> - physically: only implemented in the EFI stub, it relies on the firmware to +> provide a seed using EFI_RNG_PROTOCOL. arm64 has a similar implementation +> hence the patch 3 factorizes KASLR related functions for riscv to take +> advantage. 
+> + +#### Process Scheduling + +**[v1: sched/debug: Print tgid in sched_show_task()](http://lore.kernel.org/lkml/20230720080516.1515297-1-yajun.deng@linux.dev/)** + +> Multiple blocked tasks are printed when the system hangs. They may have +> the same parent pid, but belong to different task groups. +> +> Printing tgid lets users better know whether these tasks are from the same +> task group or not. +> + +**[v9: sched/fair: Scan cluster before scanning LLC in wake-up path](http://lore.kernel.org/lkml/20230719092838.2302-1-yangyicong@huawei.com/)** + +> This is the follow-up work to support cluster scheduler. Previously +> we have added cluster level in the scheduler for both ARM64[1] and +> X86[2] to support load balance between clusters to bring more memory +> bandwidth and decrease cache contention. This patchset, on the other +> hand, takes care of wake-up path by giving CPUs within the same cluster +> a try before scanning the whole LLC to benefit those tasks communicating +> with each other. +> + +**[v2: sched: Optimize in_task() and in_interrupt() a bit](http://lore.kernel.org/lkml/453f675efb082e08068736bf69293d48ff3129a7.1689641959.git.fthain@linux-m68k.org/)** + +> Except on x86, preempt_count is always accessed with READ_ONCE. +> Repeated invocations in macros like irq_count() produce repeated loads. +> These redundant instructions appear in various fast paths. In the one +> shown below, for example, irq_count() is evaluated during kernel entry +> if !tick_nohz_full_cpu(smp_processor_id()). +> + +**[v2: sched/core: Use empty mask to reset cpumasks in sched_setaffinity()](http://lore.kernel.org/lkml/20230717180243.3607603-1-longman@redhat.com/)** + +> Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested +> cpumask"), user provided CPU affinity via sched_setaffinity(2) is +> preserved even if the task is being moved to a different cpuset. 
However, +> that affinity is also being inherited by any subsequently created child +> processes which may not want or be aware of that affinity. +> + +**[v1: sched/fair: Add SMT4 group_smt_balance handling](http://lore.kernel.org/lkml/20230717145823.1531759-1-sshegde@linux.vnet.ibm.com/)** + +> For SMT4, any group with more than 2 tasks will be marked as +> group_smt_balance. Retain the behaviour of group_has_spare by marking +> the busiest group as the group which has the least number of idle_cpus. +> + +**[GIT PULL: sched/urgent for v6.5-rc2](http://lore.kernel.org/lkml/20230716183726.GEZLQ45tOt9L548BJ4@fat_crate.local/)** + +> please pull two urgent scheduler fixes for 6.5. +> + +**[v1: sched: Rename DIE domain](http://lore.kernel.org/lkml/20230712141056.GI3100107@hirez.programming.kicks-ass.net/)** + +> Thomas just tripped over the x86 topology setup creating a 'DIE' domain +> for the package mask :-) +> +> Since these names are SCHED_DEBUG only, rename them. +> I don't think anybody *should* be relying on this, but who knows. +> + +**[v2: sched: Implement shared runqueue in CFS](http://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/)** + +> This is v2 of the shared wakequeue (now called shared runqueue) +> patchset. The following are changes from the RFC v1 patchset +> (https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/). +> + +**[v1: net: sched: Replace strlcpy with strscpy](http://lore.kernel.org/lkml/20230710030711.812898-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). 
+> + +**[[PATCH AUTOSEL 4.14] sched/fair: Don't balance task to its current running CPU](http://lore.kernel.org/lkml/20230709150618.512785-1-sashal@kernel.org/)** + +> The new_dst_cpu is chosen from the env->dst_grpmask. Currently it +> contains CPUs in sched_group_span() and if we have overlapped groups it's +> possible to run into this case. This patch makes env->dst_grpmask of +> group_balance_mask() which exclude any CPUs from the busiest group and +> solve the issue. For balancing in a domain with no overlapped groups +> the behaviour keeps same as before. +> + +#### Memory Management + +**[v2: context_tracking,x86: Defer some IPIs until a user->kernel transition](http://lore.kernel.org/linux-mm/20230720163056.2564824-1-vschneid@redhat.com/)** + +> The heart of this series is the thought that while we cannot remove NOHZ_FULL +> CPUs from the list of CPUs targeted by these IPIs, they may not have to execute +> the callbacks immediately. Anything that only affects kernelspace can wait +> until the next user->kernel transition, providing it can be executed "early +> enough" in the entry code. +> + +**[v3: Convert several functions in page_io.c to use a folio](http://lore.kernel.org/linux-mm/20230720130147.4071649-1-zhangpeng362@huawei.com/)** + +> This patch series converts several functions in page_io.c to use a +> folio, which can remove several implicit calls to compound_head(). +> + +**[v3: Optimize large folio interaction with deferred split](http://lore.kernel.org/linux-mm/20230720112955.643283-1-ryan.roberts@arm.com/)** + +> [Sending v3 to replace yesterday's v2 after Yu Zhou's feedback] +> +> This is v3 of a small series in support of my work to enable the use of large +> folios for anonymous memory (known as "FLEXIBLE_THP" or "LARGE_ANON_FOLIO") [1]. +> It first makes it possible to add large, non-pmd-mappable folios to the deferred +> split queue. 
Then it modifies zap_pte_range() to batch-remove spans of +> physically contiguous pages from the rmap, which means that in the common case, +> we elide the need to ever put the folio on the deferred split queue, thus +> reducing lock contention and improving performance. +> + +**[v4: mm/slub: Optimize slub memory usage](http://lore.kernel.org/linux-mm/20230720102337.2069722-1-jaypatel@linux.ibm.com/)** + +> In the current implementation of the slub memory allocator, the slab +> order selection process follows these criteria: +> +> 1) Determine the minimum order required to serve the minimum number of +> objects (min_objects). This calculation is based on the formula (order +> = min_objects * object_size / PAGE_SIZE). +> 2) If the minimum order is greater than the maximum allowed order +> (slub_max_order), set slub_max_order as the order for this slab. +> 3) If the minimum order is less than the slub_max_order, iterate +> through a loop from minimum order to slub_max_order and check if the +> condition (rem <= slab_size / fract_leftover) holds true. Here, +> slab_size is calculated as (PAGE_SIZE << order), rem is (slab_size % +> object_size), and fract_leftover can have values of 16, 8, or 4. If +> the condition is true, select that order for the slab. +> + +**[v3: Invalidate secondary IOMMU TLB on permission upgrade](http://lore.kernel.org/linux-mm/cover.b24362332ec6099bc8db4e8e06a67545c653291d.1689842332.git-series.apopple@nvidia.com/)** + +> The main change is to move secondary TLB invalidation mmu notifier +> callbacks into the architecture specific TLB flushing functions. This +> makes secondary TLB invalidation mostly match CPU invalidation while +> still allowing efficient range based invalidations based on the +> existing TLB batching code. 
+> + +**[v2: mm: use memmap_on_memory semantics for dax/kmem](http://lore.kernel.org/linux-mm/20230720-vv-kmem_memmap-v2-0-88bdaab34993@intel.com/)** + +> The dax/kmem driver can potentially hot-add large amounts of memory +> originating from CXL memory expanders, or NVDIMMs, or other 'device +> memories'. There is a chance there isn't enough regular system memory +> available to fit the memmap for this new memory. It's therefore +> desirable, if all other conditions are met, for the kmem managed memory +> to place its memmap on the newly added memory itself. +> + +**[v1: memory recharging for offline memcgs](http://lore.kernel.org/linux-mm/20230720070825.992023-1-yosryahmed@google.com/)** + +> This patch series implements the proposal in LSF/MM/BPF 2023 conference +> for reducing offline/zombie memcgs by memory recharging [1]. The main +> + +**[v1: shmem: add support for user extended attributes](http://lore.kernel.org/linux-mm/20230720065430.2178136-1-ovt@google.com/)** + +> User extended attributes are not enabled in tmpfs because +> the size of the value is not limited and the memory allocated +> for it is not counted against any limit. Malicious +> non-privileged user can exhaust kernel memory by creating +> user.* extended attribute with very large value. +> + +**[v1: mm,memblock: reset memblock.reserved to system init state to prevent UAF](http://lore.kernel.org/linux-mm/20230719154137.732d8525@imladris.surriel.com/)** + +> The memblock_discard function frees the memblock.reserved.regions +> array, which is good. +> +> However, if a subsequent memblock_free (or memblock_phys_free) comes +> in later, from for example ima_free_kexec_buffer, that will result in +> a use after free bug in memblock_isolate_range. +> + +**[v2: mm/hugetlb: get rid of page_hstate()](http://lore.kernel.org/linux-mm/20230719184145.301911-1-sidhartha.kumar@oracle.com/)** + +> Converts the last page_hstate() user to use folio_hstate() so +> page_hstate() can be safely removed. 
+> + +**[v1: mm: memcg: use rstat for non-hierarchical stats](http://lore.kernel.org/linux-mm/20230719174613.3062124-1-yosryahmed@google.com/)** + +> Currently, memcg uses rstat to maintain hierarchical stats. The rstat +> framework keeps track of which cgroups have updates on which cpus. +> +> For non-hierarchical stats, as memcg moved to rstat, they are no longer +> readily available as counters. Instead, the percpu counters for a given +> stat need to be summed to get the non-hierarchical stat value. This +> causes a performance regression when reading non-hierarchical stats on +> kernels where memcg moved to using rstat. This is especially visible +> when reading memory.stat on cgroup v1. There are also some code paths +> internal to the kernel that read such non-hierarchical stats. +> + +**[v2: mm: convert to vma_is_initial_heap/stack()](http://lore.kernel.org/linux-mm/20230719075127.47736-1-wangkefeng.wang@huawei.com/)** + +> Add vma_is_initial_stack() and vma_is_initial_heap() helper and use +> them to simplify code. +> + +**[v1: mm: hugetlb_vmemmap: use PageCompound() instead of PageReserved()](http://lore.kernel.org/linux-mm/20230719063132.37676-1-songmuchun@bytedance.com/)** + +> The check of PageReserved() is easy to be broken in the future, PageCompound() +> is more stable to check if the page should be split. +> + +**[v1: udmabuf: Replace pages when there is FALLOC_FL_PUNCH_HOLE in memfd](http://lore.kernel.org/linux-mm/20230718082858.1570809-1-vivek.kasireddy@intel.com/)** + +> This patch series attempts to solve the coherency problem seen when +> a hole is punched in the region(s) of the mapping (associated with +> the memfd) that overlaps with pages registered with a udmabuf fd. 
+> + +**[v2: udmabuf: Add back support for mapping hugetlb pages (v2)](http://lore.kernel.org/linux-mm/20230718082605.1570740-1-vivek.kasireddy@intel.com/)** + +> The first patch ensures that the mappings needed for handling mmap +> operation would be managed by using the pfn instead of struct page. +> The second patch restores support for mapping hugetlb pages where +> subpages of a hugepage are not directly used anymore (main reason +> for revert) and instead the hugetlb pages and the relevant offsets +> are used to populate the scatterlist for dma-buf export and for +> mmap operation. +> + +**[v3: mm: kfence: allocate kfence_metadata at runtime](http://lore.kernel.org/linux-mm/20230718073019.52513-1-zhangpeng.00@bytedance.com/)** + +> kfence_metadata is currently a static array. For the purpose of allocating +> scalable __kfence_pool, we first change it to runtime allocation of +> metadata. Since the size of an object of kfence_metadata is 1160 bytes, we +> can save at least 72 pages (with default 256 objects) without enabling +> kfence. +> + +**[v1: add page_ext_data to get client data in page_ext](http://lore.kernel.org/linux-mm/20230718145812.1991717-1-shikemeng@huaweicloud.com/)** + +> Current client get data from page_ext by adding offset which is auto +> generated in page_ext core and expose the data layout design inside +> page_ext core. This series adds a page_ext_data to hide offset from +> client. Thanks! 
+> + +**[v1: mm/damon/core-test: Initialise context before test in damon_test_set_attrs()](http://lore.kernel.org/linux-mm/20230718052811.1065173-1-feng.tang@intel.com/)** + +> Running kunit test for 6.5-rc1 hits one bug: +> +> ok 10 damon_test_update_monitoring_result +> + +**[v4: Add support for memmap on memory feature on ppc64](http://lore.kernel.org/linux-mm/20230718024409.95742-1-aneesh.kumar@linux.ibm.com/)** + +> This patch series update memmap on memory feature to fall back to +> memmap allocation outside the memory block if the alignment rules are +> not met. This makes the feature more useful on architectures like +> ppc64 where alignment rules are different with 64K page size. +> + +**[v5: Add support for DAX vmemmap optimization for ppc64](http://lore.kernel.org/linux-mm/20230718022934.90447-1-aneesh.kumar@linux.ibm.com/)** + +> This patch series implements changes required to support DAX vmemmap +> optimization for ppc64. The vmemmap optimization is only enabled with radix MMU +> translation and 1GB PUD mapping with 64K page size. The patch series also split +> hugetlb vmemmap optimization as a separate Kconfig variable so that +> architectures can enable DAX vmemmap optimization without enabling hugetlb +> vmemmap optimization. This should enable architectures like arm64 to enable DAX +> vmemmap optimization while they can't enable hugetlb vmemmap optimization. More +> details of the same are in patch "mm/vmemmap optimization: Split hugetlb and +> devdax vmemmap optimization" +> + +**[v1: 5.15.y: mm/damon/ops-common: atomically test and clear young on ptes and pmds](http://lore.kernel.org/linux-mm/20230717193008.122040-1-sj@kernel.org/)** + +> commit c11d34fa139e4b0fb4249a30f37b178353533fa1 upstream. +> +> It is racy to non-atomically read a pte, then clear the young bit, then +> write it back as this could discard dirty information. Further, it is bad +> practice to directly set a pte entry within a table. 
Instead clearing +> young must go through the arch-provided helper, +> ptep_test_and_clear_young() to ensure it is modified atomically and to +> give the arch code visibility and allow it to check (and potentially +> modify) the operation. +> + +#### File Systems + +**[v1: Various Rust bindings for files](http://lore.kernel.org/linux-fsdevel/20230720152820.3566078-1-aliceryhl@google.com/)** + +> This contains bindings for various file related things that binder needs +> to use. +> +> I would especially like feedback on the SAFETY comments. Particularly, +> the safety comments in patch 4 and 5 are non-trivial. For example: +> + +**[v1: vboxsf: Use flexible arrays for trailing string member](http://lore.kernel.org/linux-fsdevel/20230720151458.never.673-kees@kernel.org/)** + +> The declaration of struct shfl_string used trailing fake flexible arrays +> for the string member. This was tripping FORTIFY_SOURCE since commit +> df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3"). Replace the +> utf8 and utf16 members with actual flexible arrays, drop the unused ucs2 +> member, and retain a 2 byte padding to keep the structure size the same. +> + +**[v1: fs/nls: make load_nls() take a const parameter](http://lore.kernel.org/linux-fsdevel/20230720063414.2546451-1-wentao@uniontech.com/)** + +> load_nls() take a char * parameter, use it to find nls module in list or +> construct the module name to load it. +> +> This change make load_nls() take a const parameter, so we don't need do +> some cast like this: +> +> ses->local_nls = load_nls((char *)ctx->local_nls->charset); +> +> Also remove the cast in cifs code. +> + +**[v1: fstests: add helper to canonicalize devices used to enable persistent disks](http://lore.kernel.org/linux-fsdevel/20230720061727.2363548-1-mcgrof@kernel.org/)** + +> The filesystem configuration file does not allow you to use symlinks to +> devices given the existing sanity checks verify that the target end +> device matches the source. 
+> +> Using a symlink is desirable if you want to enable persistent tests +> across reboots. For example you may want to use /dev/disk/by-id/nvme-eui.* +> so to ensure that the same drives are used even after reboot. This +> is very useful if you are testing for example with a virtualized +> environment and are using PCIe passthrough with other qemu NVMe drives +> with one or many NVMe drives. +> + +**[v3: Support negative dentries on case-insensitive ext4 and f2fs](http://lore.kernel.org/linux-fsdevel/20230719221918.8937-1-krisman@suse.de/)** + +> V3 applies the fixes suggested by Eric Biggers (thank you for your +> review!). Changelog inlined in the patches. +> +> Retested with xfstests for ext4 and f2fs. +> +> cover letter from v1. +> +> This patchset enables negative dentries for case-insensitive directories +> in ext4/f2fs. It solves the corner cases for this feature, including +> those already tested by fstests (generic/556). It also solves an +> existing bug with the existing implementation where old negative +> dentries are left behind after a directory conversion to +> case-insensitive. +> + +**[v1: nfsd: inherit required unset default acls from effective set](http://lore.kernel.org/linux-fsdevel/20230719-nfsd-acl-v1-1-eb0faf3d2917@kernel.org/)** + +> A well-formed NFSv4 ACL will always contain OWNER@/GROUP@/EVERYONE@ +> ACEs, but there is no requirement for inheritable entries for those +> entities. POSIX ACLs must always have owner/group/other entries, even for a +> default ACL. +> + +**[v1: fs: export emergency_sync](http://lore.kernel.org/linux-fsdevel/20230718214540.1.I763efc30c57dcc0284d81f704ef581cded8960c8@changeid/)** + +> emergency_sync forces a filesystem sync in emergency situations. +> Export this function so it can be used by modules. 
+> + +**[v4: io_uring getdents](http://lore.kernel.org/linux-fsdevel/20230718132112.461218-1-hao.xu@linux.dev/)** + +> This series introduce getdents64 to io_uring, the code logic is similar +> with the synchronized version's. It first try nowait issue, and offload +> it to io-wq threads if the first try fails. +> + +**[v2: xarray: Document necessary flag in alloc functions](http://lore.kernel.org/linux-fsdevel/20230718072533.4305-2-pstanner@redhat.com/)** + +> Adds a new line to the docstrings of functions wrapping __xa_alloc() and +> __xa_alloc_cyclic(), informing about the necessity of flag XA_FLAGS_ALLOC +> being set previously. +> +> The documentation so far says that functions wrapping __xa_alloc() and +> __xa_alloc_cyclic() are supposed to return either -ENOMEM or -EBUSY in +> case of an error. If the xarray has been initialized without the flag +> XA_FLAGS_ALLOC, however, they fail with a different, undocumented error +> code. +> + +**[GIT PULL: Create large folios in iomap buffered write path](http://lore.kernel.org/linux-fsdevel/ZLVrEkVU2YCneoXR@casper.infradead.org/)** + +> The following changes since commit 5b8d6e8539498e8b2fa67fbcce3fe87834d44a7a: +> +> Merge tag 'xtensa-20230716' of https://github.com/jcmvbkbc/linux-xtensa (2023-07-16 14:12:49 -0700) +> + +**[v1: fs/filesystems.c: ERROR: "(foo*)" should be "(foo *)"](http://lore.kernel.org/linux-fsdevel/a456720721d2f8fc33bb0befbe2ad115@208suo.com/)** + +> Fix five occurrences of the checkpatch.pl error: +> ERROR: "(foo*)" should be "(foo *)" +> + +**[v2: fs/address_space: add alignment padding for i_map and i_mmap_rwsem to mitigate a false sharing.](http://lore.kernel.org/linux-fsdevel/20230716145450.20108-1-lipeng.zhu@intel.com/)** + +> When running UnixBench/Shell Scripts, we observed high false sharing +> for accessing i_mmap against i_mmap_rwsem. 
+> + +**[v1: fs: inode: return proper errno on bmap()](http://lore.kernel.org/linux-fsdevel/20230715060217.1469690-1-lsahn@wewakecorp.com/)** + +> It is better to return -EOPNOTSUPP instead of -EINVAL, which means that +> the argument is an inappropriate value. That doesn't make sense in the +> case where a file system doesn't support the bmap operation. +> +> -EINVAL could cause confusion from the userspace perspective. +> + +**[v1: exfat: release s_lock before calling dir_emit()](http://lore.kernel.org/linux-fsdevel/20230714084354.1959951-1-sj1557.seo@samsung.com/)** + +> WARNING: possible circular locking dependency detected +> 6.4.0-next-20230707-syzkaller #0 Not tainted +> syz-executor330/5073 is trying to acquire lock: +> ffff8880218527a0 (&mm->mmap_lock){++++}-{3:3}, at: mmap_read_lock_killable include/linux/mmap_lock.h:151 [inline] +> ffff8880218527a0 (&mm->mmap_lock){++++}-{3:3}, at: get_mmap_lock_carefully mm/memory.c:5293 [inline] +> ffff8880218527a0 (&mm->mmap_lock){++++}-{3:3}, at: lock_mm_and_find_vma+0x369/0x510 mm/memory.c:5344 +> but task is already holding lock: +> ffff888019f760e0 (&sbi->s_lock){+.+.}-{3:3}, at: exfat_iterate+0x117/0xb50 fs/exfat/dir.c:232 +> + +**[[fstests PATCH] generic: add a test for multigrain timestamps](http://lore.kernel.org/linux-fsdevel/20230713230939.367068-1-jlayton@kernel.org/)** + +> Ensure that the mtime and ctime apparently change, even when there are +> multiple writes in quick succession. Older kernels didn't do this, but +> there are patches in flight that should help ensure it in the future. +> + +**[v5: fs: implement multigrain timestamps](http://lore.kernel.org/linux-fsdevel/20230713-mgctime-v5-0-9eb795d2ae37@kernel.org/)** + +> The VFS always uses coarse-grained timestamps when updating the +> ctime and mtime after a change. This has the benefit of allowing +> filesystems to optimize away a lot of metadata updates, down to around 1 +> per jiffy, even when a file is under heavy writes. 
+> + +**[v2: procfs: block chmod on /proc/thread-self/comm](http://lore.kernel.org/linux-fsdevel/20230713141001.27046-1-cyphar@cyphar.com/)** + +> Due to an oversight in commit 1b3044e39a89 ("procfs: fix pthread +> cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD, +> chmod operations on /proc/thread-self/comm were no longer blocked as +> they are on almost all other procfs files. +> + +**[v4: RESEND: shmem: Add user and group quota support for tmpfs](http://lore.kernel.org/linux-fsdevel/20230713134848.249779-1-cem@kernel.org/)** + +> This is a resend of the quota support for tmpfs. This has been rebased on +> Linus's TOT as of today. These patches conflicted with Luis Chamberlain's series that +> adds the 'noswap' mount option to tmpfs; there was no code change since the +> previous version, other than moving the implementation of quota options 'after' +> 'noswap'. +> + +**[v1: exfat: check if filename entries exceeds max filename length](http://lore.kernel.org/linux-fsdevel/20230713130310.8445-1-linkinjeon@kernel.org/)** + +> exfat_extract_uni_name copies characters from a given file name entry into +> the 'uniname' variable. This variable is actually defined on the stack of +> the exfat_readdir() function. According to the definition of +> the 'exfat_uni_name' type, the file name should be limited to 255 characters +> (+ null terminator space), but the exfat_get_uniname_from_ext_entry() +> function can write more characters because there is no check for whether the filename +> entries exceed the max filename length. This patch adds a check so that +> filename characters are not copied once the max filename length is exceeded. +> + +**[v1: fs: proc: Add error checking for d_hash_and_lookup()](http://lore.kernel.org/linux-fsdevel/20230713113303.6512-1-machel@vivo.com/)** + +> In case of failure, d_hash_and_lookup() returns NULL or an error +> pointer. The proc_fill_cache() needs to add the handling of the +> error pointer returned by d_hash_and_lookup(). 
+> + +**[v25: Implement IOCTL to get and optionally clear info about PTEs](http://lore.kernel.org/linux-fsdevel/20230713101415.108875-1-usama.anjum@collabora.com/)** + +> *Changes in v25*: +> - Do proper filtering on hole as well (hole got missed earlier) +> + +**[v1: eventfd: simplify signal helpers](http://lore.kernel.org/linux-fsdevel/20230713-vfs-eventfd-signal-v1-0-7fda6c5d212b@kernel.org/)** + +> This simplifies the eventfd_signal() and eventfd_signal_mask() helpers +> by removing the count argument which is effectively unused. +> + +**[v1: More filesystem folio conversions for 6.6](http://lore.kernel.org/linux-fsdevel/20230713035512.4139457-1-willy@infradead.org/)** + +> Remove the only spots in affs which actually use a struct page; there +> are a few places where one is mentioned, but it's part of the interface. +> + +#### Network Devices + +**[v3: net-next: vsock/virtio/vhost: MSG_ZEROCOPY preparations](http://lore.kernel.org/netdev/20230720214245.457298-1-AVKrasnov@sberdevices.ru/)** + +> This patchset is the first of three parts of another big patchset for +> MSG_ZEROCOPY flag support: +> https://lore.kernel.org/netdev/20230701063947.3422088-1-AVKrasnov@sberdevices.ru/ +> + +**[GIT PULL: Networking for v6.5-rc3](http://lore.kernel.org/netdev/20230720214559.163647-1-kuba@kernel.org/)** + +> The following changes since commit b1983d427a53911ea71ba621d4bf994ae22b1536: +> +> Merge tag 'net-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2023-07-13 14:21:22 -0700) +> + +**[v5: bpf-next: Support defragmenting IPv(4|6) packets in BPF](http://lore.kernel.org/netdev/cover.1689884827.git.dxu@dxuuu.xyz/)** + +> In the context of a middlebox, fragmented packets are tricky to handle. +> The full 5-tuple of a packet is often only available in the first +> fragment which makes enforcing consistent policy difficult. 
There are +> really only two stateless options, neither of which is very nice: +> +> Enforce policy on first fragment and accept all subsequent fragments. +> This works but may let in certain attacks or allow data exfiltration. +> + +**[v1: iproute2: bridge/mdb.c: include limits.h](http://lore.kernel.org/netdev/20230720203726.2316251-1-tgamblin@baylibre.com/)** + +> Include limits.h in bridge/mdb.c to fix this issue. This change is based +> on one in Alpine Linux, but the author there had no plans to submit: +> https://git.alpinelinux.org/aports/commit/main/iproute2/include.patch?id=bd46efb8a8da54948639cebcfa5b37bd608f1069 +> + +**[v4: net-next: ionic: add FLR support](http://lore.kernel.org/netdev/20230720190816.15577-1-shannon.nelson@amd.com/)** + +> Add support for handling and recovering from a PCI FLR event. +> This patchset first moves some code around to make it usable +> from multiple paths, then adds the PCI error handler callbacks +> for reset_prepare and reset_done. +> + +**[v1: net-next: page_pool: add a lockdep check for recycling in hardirq](http://lore.kernel.org/netdev/20230720173752.2038136-1-kuba@kernel.org/)** + +> Page pool use in hardirq is prohibited, add debug checks +> to catch misuses. IIRC we previously discussed using +> DEBUG_NET_WARN_ON_ONCE() for this, but there were concerns +> that people will have DEBUG_NET enabled in perf testing. +> I don't think anyone enables lockdep in perf testing, +> so use lockdep to avoid pushback and arguing :) +> + +**[v3: bpf-next: bpf, xdp: Add tracepoint to xdp attaching failure](http://lore.kernel.org/netdev/20230720155228.5708-1-hffilwlqm@gmail.com/)** + +> This series introduces a new tracepoint in bpf_xdp_link_attach(). With +> this tracepoint, the error message is captured when an error occurs in +> dev_xdp_attach(), e.g. invalid attaching flags. 
+> + +**[v6: bpf-next: Add SO_REUSEPORT support for TC bpf_sk_assign](http://lore.kernel.org/netdev/20230720-so-reuseport-v6-0-7021b683cdae@isovalent.com/)** + +> We want to replace iptables TPROXY with a BPF program at TC ingress. +> To make this work in all cases we need to assign a SO_REUSEPORT socket +> to an skb, which is currently prohibited. This series adds support for +> such sockets to bpf_sk_assign. +> + +**[v1: net-next: net: dsa: microchip: provide Wake on LAN support](http://lore.kernel.org/netdev/20230720132556.57562-1-o.rempel@pengutronix.de/)** + +> This series of patches provides Wake on LAN support for the KSZ9477 +> family of switches. It was tested on KSZ8565 Switch with PME pin +> attached to an external PMIC. +> + +**[v2: net-next: devlink: introduce dump selector attr and use it for per-instance dumps](http://lore.kernel.org/netdev/20230720121829.566974-1-jiri@resnulli.us/)** + +> For SFs, one devlink instance per SF is created. There might be +> thousands of these on a single host. When a user needs to know the port +> handle for a specific SF, they need to dump all devlink ports on the host, +> which does not scale well. +> + +**[v5: Add motorcomm phy pad-driver-strength-cfg support](http://lore.kernel.org/netdev/20230720111509.21843-1-samin.guo@starfivetech.com/)** + +> The motorcomm phy (YT8531) supports the ability to adjust the drive +> strength of the rx_clk/rx_data, and the default strength may not be +> suitable for all boards. So add configurable options to better match +> the boards (e.g. StarFive VisionFive 2). +> + +**[v1: net-next: genetlink: add explicit ordering break check for split ops](http://lore.kernel.org/netdev/20230720111354.562242-1-jiri@resnulli.us/)** + +> Currently, if cmd in the split ops array is of lower value than the +> previous one, genl_validate_ops() continues to do the checks as if +> the values are equal. This may result in a non-obvious WARN_ON() hit in +> these checks. 
+> + +**[v2: net: vxlan: calculate correct header length for GPE](http://lore.kernel.org/netdev/544e8c6d0f48af2be49809877c05c0445c0b0c0b.1689843872.git.jbenc@redhat.com/)** + +> VXLAN-GPE does not add an extra inner Ethernet header. Take that into +> account when calculating header length. +> +> This causes problems in skb_tunnel_check_pmtu, where incorrect PMTU is +> cached. +> + +**[v4: net-next: virtio-net: don't busy poll for cvq command](http://lore.kernel.org/netdev/20230720083839.481487-1-jasowang@redhat.com/)** + +> The code used to busy poll for cvq command which turns out to have +> several side effects: +> +> 1) infinite poll for buggy devices +> 2) bad interaction with scheduler +> +> So this series tries to use cond_resched() in the waiting loop. Before +> doing this we need first make sure the cvq command is not executed in +> atomic environment, so we need first convert rx mode handling to a +> workqueue. +> + +**[v2: net-next: eth: bnxt: handle invalid Tx completions more gracefully](http://lore.kernel.org/netdev/20230720010440.1967136-1-kuba@kernel.org/)** + +> bnxt trusts the events generated by the device which may lead to kernel +> crashes. These are extremely rare but they do happen. For a while +> I thought crashing may be intentional, because device reporting invalid +> completions should never happen, and having a core dump could be useful +> if it does. But in practice I haven't found any clues in the core dumps, +> and panic_on_warn exists. +> + +**[v1: net-next: net: Use sockaddr_storage for getsockopt(SO_PEERNAME).](http://lore.kernel.org/netdev/20230720005456.88770-1-kuniyu@amazon.com/)** + +> Commit df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3") started +> applying strict rules to standard string functions. 
+> + +**[v1: net: phy: prevent stale pointer dereference in phy_init()](http://lore.kernel.org/netdev/20230720000231.1939689-1-vladimir.oltean@nxp.com/)** + +> mdio_bus_init() and phy_driver_register() both have error paths, and if +> those are ever hit, ethtool will have a stale pointer to the +> phy_ethtool_phy_ops stub structure, which references memory from a +> module that failed to load (phylib). +> + +**[v1: net: tcp: add missing annotations](http://lore.kernel.org/netdev/20230719212857.3943972-1-edumazet@google.com/)** + +> This series was inspired by one syzbot (KCSAN) report. +> +> do_tcp_getsockopt() does not lock the socket, we need to +> annotate most of the reads there (and other places as well). +> + +**[v8: net/tcp: Add TCP-AO support](http://lore.kernel.org/netdev/20230719202631.472019-1-dima@arista.com/)** + +> This is version 8 of TCP-AO support. I base it on master and there +> weren't any conflicts on my tentative merge to linux-next. +> +> The good news is that all pre-required patches have merged to +> Torvalds'/master. Thanks to Herbert, crypto clone-tfm just works on +> master for all TCP-AO supported algorithms. + +#### Security Hardening + +**[v1: kunit: Add test attributes API](http://lore.kernel.org/linux-hardening/20230719222338.259684-1-rmoar@google.com/)** + +> This patch series adds a test attributes framework to KUnit. +> +> There has been interest in filtering out "slow" KUnit tests. Most notably, +> a new config, CONFIG_MEMCPY_SLOW_KUNIT_TEST, has been added to exclude a +> particularly slow memcpy test +> (https://lore.kernel.org/all/20230118200653.give.574-kees@kernel.org/). +> + +**[v1: HotBPF: Prevent Kernel Heap-based Exploitation](http://lore.kernel.org/linux-hardening/20230719155032.4972-1-wzc@smail.nju.edu.cn/)** + +> Request for Comments, a hot eBPF patch to prevent kernel heap exploitation. +> +> SLUB exploitation poses a significant threat to kernel security. 
The exploitation +> takes advantage of the fact that kernel objects share `kmalloc` slub caches. +> This sharing makes it possible to overlap vulnerable objects that +> introduce corruption with other objects that contain sensitive data. +> To mitigate this, we introduce HotBPF. +> + +**[v1: next: fs: omfs: Use flexible-array member in struct omfs_extent](http://lore.kernel.org/linux-hardening/ZLGodUeD307GlINN@work/)** + +> Memory for 'struct omfs_extent' and a 'e_extent_count' number of extent +> entries is indirectly allocated through 'bh->b_data', which is a pointer +> to data within the page. This implies that the member 'e_entry' +> (which is the start of extent entries) functions more like an array than +> a single object of type 'struct omfs_extent_entry'. +> + +**[v5: Randomized slab caches for kmalloc()](http://lore.kernel.org/linux-hardening/20230714064422.3305234-1-gongruiqi@huaweicloud.com/)** + +> When exploiting memory vulnerabilities, "heap spraying" is a common +> technique targeting those related to dynamic memory allocation (i.e. the +> "heap"), and it plays an important role in a successful exploitation. +> Basically, it is to overwrite the memory area of a vulnerable object by +> triggering allocation in other subsystems or modules and therefore +> getting a reference to the targeted memory location. It's usable on +> various types of vulnerability, including use after free (UAF), heap +> out-of-bound write, etc. +> + +**[v2: igc: Ignore AER reset when device is suspended](http://lore.kernel.org/linux-hardening/20230714050541.2765246-1-kai.heng.feng@canonical.com/)** + +> The issue is that the PTM requests are sent before the driver resumes the +> device. Since the issue can also be observed on Windows, it's quite +> likely a firmware/hardware limitation. +> +> So avoid resetting the device if it's not resumed. Once the device is +> fully resumed, the device can work normally. 
+> + +**[v1: tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structure](http://lore.kernel.org/linux-hardening/20230713092605.2ddb9788@rorschach.local.home/)** + +> For backward compatibility, older tooling expects to see the kernel_stack +> event with a "caller" field that is a fixed size array of 8 addresses. The +> code now supports more than 8 with an added "size" field that states the +> real number of entries. But the "caller" field still just looks like a +> fixed size to user space. +> + +**[v2: ACPI: APEI: Use ERST timeout for slow devices](http://lore.kernel.org/linux-hardening/20230712223448.145079-1-jeshuas@nvidia.com/)** + +> Slow devices such as flash may not meet the default 1ms timeout value, +> so use the ERST max execution time value that they provide as the +> timeout if it is larger. +> + +**[v3: pstore: Replace crypto API compression with zlib calls](http://lore.kernel.org/linux-hardening/20230712162332.2670437-1-ardb@kernel.org/)** + +> The pstore layer implements support for compression of kernel log +> output, using a variety of compression algorithms provided by the +> [deprecated] crypto API 'comp' interface. +> +> This appears to have been somebody's pet project rather than a solution +> to a real problem: the original deflate compression is reasonably fast, +> compresses well and is comparatively small in terms of code footprint, +> and so the flexibility that the crypto API integration provides does +> little more than complicate the code for no reason. +> + +**[v1: libxfs: Redefine 1-element arrays as flexible arrays](http://lore.kernel.org/linux-hardening/20230711222025.never.220-kees@kernel.org/)** + +> To allow for code bases that include libxfs (e.g. the Linux kernel) and +> build with strict flexible array handling (-fstrict-flex-arrays=3), +> FORTIFY_SOURCE, and/or UBSAN bounds checking, redefine the remaining +> 1-element trailing arrays as true flexible arrays, but without changing +> their structure sizes. 
This is done via a union to retain a single element +> (named "legacy_padding"). As not all distro headers may yet have the +> UAPI stddef.h __DECLARE_FLEX_ARRAY macro, include it explicitly in +> platform_defs.h.in. +> + +**[v1: wifi: mwifiex: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230710030625.812707-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +#### Asynchronous IO + +**[v1: io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wq](http://lore.kernel.org/io-uring/363d8e40-6acc-57bd-feb1-4dbd50e15c31@kernel.dk/)** + +> io-wq assumes that an issue is blocking, but it may not be if the +> request type has asked for a non-blocking attempt. If we get +> -EAGAIN for that case, then we need to treat it as a final result +> and not retry or arm poll for it. +> + +**[v1: io_uring: Use io_schedule* in cqring wait](http://lore.kernel.org/io-uring/20230718194920.1472184-2-axboe@kernel.dk/)** + +> I observed poor performance of io_uring compared to synchronous IO. That +> turns out to be caused by deeper CPU idle states entered with io_uring, +> due to io_uring using plain schedule(), whereas synchronous IO uses +> io_schedule(). +> + +**[v1: io_uring: don't audit the capability check in io_uring_create()](http://lore.kernel.org/io-uring/20230718115607.65652-1-omosnace@redhat.com/)** + +> The check being unconditional may lead to unwanted denials reported by +> LSMs when a process has the capability granted by DAC, but denied by an +> LSM. In the case of SELinux such denials are a problem, since they can't +> be effectively filtered out via the policy and when not silenced, they +> produce noise that may hide a true problem or an attack. 
+> + +**[v1: io_uring: Redefined the meaning of io_alloc_async_data's return value](http://lore.kernel.org/io-uring/20230710090957.10463-1-luhongfei@vivo.com/)** + +> Usually, successful memory allocation returns true and failure returns false, +> which is more in line with the intuitive perception of most people. So it +> is necessary to redefine the meaning of io_alloc_async_data's return value. +> + +#### Rust For Linux + +**[v1: rust: kunit: Support KUnit tests with a user-space like syntax](http://lore.kernel.org/rust-for-linux/20230720-rustbind-v1-0-c80db349e3b5@google.com/)** + +> This series was originally written by José Expósito, and can be found +> here: +> https://github.com/Rust-for-Linux/linux/pull/950 +> +> Add support for writing KUnit tests in Rust. While Rust doctests are +> already converted to KUnit tests and run, they're really better suited +> for examples, rather than as first-class unit tests. +> + +**[v1: rust: doctests: Use tabs for indentation in generated C code](http://lore.kernel.org/rust-for-linux/20230720062939.2411889-1-davidgow@google.com/)** + +> While Rust uses 4 spaces for indentation, we should use tabs in the +> generated C code. This does result in some scary-looking tab characters +> in a .rs file, but they're in a string literal, so shouldn't make +> anything complain too much. +> + +**[v2: Quality of life improvements for pin-init](http://lore.kernel.org/rust-for-linux/20230719141918.543938-1-benno.lossin@proton.me/)** + +> This patch series adds several improvements to the pin-init api: +> - a derive macro for the `Zeroable` trait, +> - makes hygiene of fields in initializers behave like normal struct +> initializers would behave, +> - prevent stackoverflow without optimizations +> - add `..Zeroable::zeroed()` syntax to zero missing fields. 
+> - support arbitrary paths in initializer macros +> + +**[v1: kbuild: rust: avoid creating temporary files](http://lore.kernel.org/rust-for-linux/20230718055235.1050223-1-ojeda@kernel.org/)** + +> `rustc` outputs by default the temporary files (i.e. the ones saved +> by `-Csave-temps`, such as `*.rcgu*` files) in the current working +> directory when `-o` and `--out-dir` are not given (even if +> `--emit=x=path` is given, i.e. it does not use those for temporaries). +> + +**[v1: rust: init: Implement Zeroable::zeroed()](http://lore.kernel.org/rust-for-linux/20230714-zeroed-v1-1-494d6820d61b@asahilina.net/)** + +> By analogy to Default::default(), this just returns the zeroed +> representation of the type directly. init::zeroed() is the version that +> returns an initializer. +> + +**[v2: Rust abstractions for Crypto API](http://lore.kernel.org/rust-for-linux/20230710102225.155019-1-fujita.tomonori@gmail.com/)** + +> This patchset adds minimum Rust abstractions for Crypto API; message +> digest and random number generator. +> + +**[v1: rust: add improved version of `ForeignOwnable::borrow_mut`](http://lore.kernel.org/rust-for-linux/20230710074642.683831-1-aliceryhl@google.com/)** + +> Previously, the `ForeignOwnable` trait had a method called `borrow_mut` +> that was intended to provide mutable access to the inner value. However, +> the method accidentally made it possible to change the address of the +> object being modified, which usually isn't what we want. (And when we +> want that, it can be done by calling `from_foreign` and `into_foreign`, +> like how the old `borrow_mut` was implemented.) +> + +**[v2: Rust abstractions for network device drivers](http://lore.kernel.org/rust-for-linux/20230710073703.147351-1-fujita.tomonori@gmail.com/)** + +> This patchset adds minimum Rust abstractions for network device +> drivers and an example of a Rust network device driver, a simpler +> version of drivers/net/dummy.c. 
+> + +#### BPF + +**[v1: bpf: bpf/memalloc: Allow non-atomic alloc_bulk](http://lore.kernel.org/bpf/cover.1689885610.git.zhuyifei@google.com/)** + +> This series attempts to add ways where the allocation could occur +> non-atomically, allowing the allocator to take mutexes, perform IO, +> and/or sleep. +> + +**[v1: dwarves: dwarves: detect BTF kinds supported by kernel](http://lore.kernel.org/bpf/20230720201443.224040-1-alan.maguire@oracle.com/)** + +> When a newer pahole is run on an older kernel, it often knows about BTF +> kinds that the kernel does not support, and adds them to the BTF +> representation. This is a problem because the BTF generated is then +> embedded in the kernel image. When it is later read - possibly by +> a different older toolchain or by the kernel directly - it is not usable. +> + +**[v3: bpf-next: bpf: Support new insns from cpu v4](http://lore.kernel.org/bpf/20230720000103.99949-1-yhs@fb.com/)** + +> This patch set added kernel support for insns proposed in [1] except +> BPF_ST which already has full kernel support. Beside the above proposed +> insns, LLVM will generate BPF_ST insn as well under -mcpu=v4 ([2]). +> + +**[v2: bpf-next: selftests/bpf: improve ringbuf benchmark output](http://lore.kernel.org/bpf/20230719201533.176702-1-awerner32@gmail.com/)** + +> The ringbuf benchmarks print headers for each section of benchmarks. +> The naming conventions lead a user of the benchmarks to some confusion. +> This change is a cosmetic update to the output of that benchmark; no +> changes were made to what the script actually executes. +> + +**[v3: bpf-next: XDP metadata via kfuncs for ice](http://lore.kernel.org/bpf/20230719183734.21681-1-larysa.zaremba@intel.com/)** + +> This series introduces XDP hints via kfuncs [0] to the ice driver. 
+> +> Series brings the following existing hints to the ice driver: +> - HW timestamp +> - RX hash with type +> + +**[v1: bpf-next: bpf: sync tools/ uapi header with](http://lore.kernel.org/bpf/20230719162257.20818-1-alan.maguire@oracle.com/)** + +> Seeing the following: +> +> Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h' +> +> ...so sync tools version missing some list_node/rb_tree fields. +> + +**[v6: bpf-next: BPF link support for tc BPF programs](http://lore.kernel.org/bpf/20230719140858.13224-1-daniel@iogearbox.net/)** + +> This series adds BPF link support for tc BPF programs. We initially +> presented the motivation, related work and design at last year's LPC +> conference in the networking & BPF track [0], and a recent update on +> our progress of the rework during this year's LSF/MM/BPF summit [1]. +> The main changes are in first two patches and the last two have an +> extensive batch of test cases we developed along with it, please see +> individual patches for details. We tested this series with tc-testing +> selftest suite as well as BPF CI/selftests. Thanks! +> + +**[v7: bpf-next: xsk: multi-buffer support](http://lore.kernel.org/bpf/20230719132421.584801-1-maciej.fijalkowski@intel.com/)** + +> This series of patches add multi-buffer support for AF_XDP. XDP and +> various NIC drivers already have support for multi-buffer packets. With +> this patch set, programs using AF_XDP sockets can now also receive and +> transmit multi-buffer packets both in copy as well as zero-copy mode. +> ZC multi-buffer implementation is based on ice driver. 
+> + +**[v1: net-next: page_pool: split types and declarations from page_pool.h](http://lore.kernel.org/bpf/20230719121339.63331-1-linyunsheng@huawei.com/)** + +> Split types and pure function declarations from page_pool.h +> and add them in page_pool_types.h, so that C sources can +> include page_pool.h and headers should generally only include +> page_pool_types.h as suggested by Jakub. +> + +**[v1: bpf-next: bpf, x86: initialize the variable "first_off" in save_args()](http://lore.kernel.org/bpf/20230719110330.2007949-1-imagedong@tencent.com/)** + +> As Dan Carpenter reported, the variable "first_off" which is passed to +> clean_stack_garbage() in save_args() can be uninitialized, which can +> cause runtime warnings with KMEMsan. Therefore, init it with 0. +> + +**[v2: bpf-next: allow bpf_map_sum_elem_count for all program types](http://lore.kernel.org/bpf/20230719092952.41202-1-aspsk@isovalent.com/)** + +> This series is a follow up to the recent change [1] which added +> per-cpu insert/delete statistics for maps. The bpf_map_sum_elem_count +> kfunc presented in the original series was only available to tracing +> programs, so let's make it available to all. +> + +**[v12: vhost: virtio core prepares for AF_XDP](http://lore.kernel.org/bpf/20230719040422.126357-1-xuanzhuo@linux.alibaba.com/)** + +> So rethinking this, firstly, we can support premapped-dma only for devices with +> VIRTIO_F_ACCESS_PLATFORM. In the case of af-xdp, if the users want to use it, +> they have to update the device to support VIRTIO_F_RING_RESET, and they can also +> enable the device's VIRTIO_F_ACCESS_PLATFORM feature. +> + +**[v2: net: bpf: do not return NET_XMIT_xxx values on bpf_redirect](http://lore.kernel.org/bpf/ZLdY6JkWRccunvu0@debian.debian/)** + +> skb_do_redirect returns error codes from both the rx and tx paths. The +> tx path codes are special, e.g. NET_XMIT_CN: they are non-negative, and +> can conflict with LWTUNNEL_XMIT_xxx values. 
Directly returning such code +> can cause unexpected behavior. We found at least one bug that will panic +> the kernel through KASAN report when we are redirecting packets to a +> down or carrier-down device at lwt xmit hook: +> +> https://gist.github.com/zhaiyan920/8fbac245b261fe316a7ef04c9b1eba48 +> + +**[v5: net-next: virtio/vsock: support datagrams](http://lore.kernel.org/bpf/20230413-b4-vsock-dgram-v5-0-581bd37fdb26@bytedance.com/)** + +> This series introduces support for datagrams to virtio/vsock. +> +> It is a spin-off (and smaller version) of this series from the summer: +> https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/ +> +> Please note that this is an RFC and should not be merged until +> associated changes are made to the virtio specification, which will +> follow after discussion from this series. +> + +**[v1: bpf-next: bpf, net: Introduce skb_pointer_if_linear().](http://lore.kernel.org/bpf/20230718234021.43640-1-alexei.starovoitov@gmail.com/)** + +> Network drivers always call skb_header_pointer() with non-null buffer. +> Remove !buffer check to prevent accidental misuse of skb_header_pointer(). +> Introduce skb_pointer_if_linear() instead. +> + +**[v1: V2,net-next: net: mana: Add page pool for RX buffers](http://lore.kernel.org/bpf/1689716837-22859-1-git-send-email-haiyangz@microsoft.com/)** + +> Add page pool for RX buffers for faster buffer cycle and reduce CPU +> usage. +> +> The standard page pool API is used. +> + +**[v1: bpf: lwt: do not return NET_XMIT_xxx values on bpf_redirect](http://lore.kernel.org/bpf/ZLbYdpWC8zt9EJtq@debian.debian/)** + +> skb_do_redirect handles returns error code from both rx and tx path. +> The tx path codes are special, e.g. NET_XMIT_CN: they are +> non-negative, and can conflict with LWTUNNEL_XMIT_xxx values. Directly +> returning such code can cause unexpected behavior. 
+> + +**[v5: bpf-next: bpf: Force to MPTCP](http://lore.kernel.org/bpf/3076188eb88cca9151a2d12b50ba1e870b11ce09.1689693294.git.geliang.tang@suse.com/)** + +> As is described in the "How to use MPTCP?" section in MPTCP wiki [1]: +> +> "Your app can create sockets with IPPROTO_MPTCP as the proto: +> ( socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP); ). Legacy apps can be +> forced to create and use MPTCP sockets instead of TCP ones via the +> mptcpize command bundled with the mptcpd daemon." +> + +**[v2: bpf-next: BPF Refcount followups 2: owner field](http://lore.kernel.org/bpf/20230718083813.3416104-1-davemarchevsky@fb.com/)** + +> This series adds an 'owner' field to bpf_{list,rb}_node structs, to be +> used by the runtime to determine whether insertion or removal operations +> are valid in shared ownership scenarios. Both the races which the series +> fixes and the fix itself are inspired by Kumar's suggestions in [0]. +> + +**[v1: net: igc: Prevent garbled TX queue with XDP ZEROCOPY](http://lore.kernel.org/bpf/20230717175444.3217831-1-anthony.l.nguyen@intel.com/)** + +> In normal operation, each populated queue item has +> next_to_watch pointing to the last TX desc of the packet, +> while each cleaned item has it set to 0. In particular, +> next_to_use that points to the next (necessarily clean) +> item to use has next_to_watch set to 0. +> + +**[v2: net-next: virtio_net: add per queue interrupt coalescing support](http://lore.kernel.org/bpf/20230717143037.21858-1-gavinl@nvidia.com/)** + +> Currently, coalescing parameters are grouped for all transmit and receive +> virtqueues. This patch series add support to set or get the parameters for +> a specified virtqueue. +> +> When the traffic between virtqueues is unbalanced, for example, one virtqueue +> is busy and another virtqueue is idle, then it will be very useful to +> control coalescing parameters at the virtqueue granularity. 
+> + +### Related Technology Updates + +#### Qemu + +**[v5: for-8.2: riscv: add 'max' CPU, deprecate 'any'](http://lore.kernel.org/qemu-devel/20230720171933.404398-1-dbarboza@ventanamicro.com/)** + +> I'm sending this new version based on another observation I made during +> another follow-up work (I'll post it shortly). +> +> 'mmu' and 'pmp' aren't really extensions in the most traditional sense, +> they're more like features. So, in patch 1, I moved both to the new +> riscv_cpu_options array. +> + +**[v1: target/riscv: add missing riscv,isa strings](http://lore.kernel.org/qemu-devel/20230720132424.371132-1-dbarboza@ventanamicro.com/)** + +> Found these 2 instances while working in more 8.2 material. +> +> I believe both are safe for freeze but I won't lose my sleep if we +> decide to postpone it. +> + +**[v1: QEMU RISC-V IOMMU Support](http://lore.kernel.org/qemu-devel/cover.1689819031.git.tjeznach@rivosinc.com/)** + +> This series introduces a RISC-V IOMMU device emulation implementation with two stage +> address translation logic, device and process translation context mapping and queue +> interfaces, along with riscv/virt machine bindings (patch 5) and memory attributes +> extensions for PASID support (patch 3,4). +> +> This series is based on incremental patches created during RISC-V International IOMMU +> Task Group discussions and specification development process, with original series +> available in the maintainer's repository branch [2].
+> + +**[v1: riscv-to-apply queue](http://lore.kernel.org/qemu-devel/20230719044538.2013401-1-alistair.francis@wdc.com/)** + +> The following changes since commit 361d5397355276e3007825cc17217c1e4d4320f7: +> +> Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into staging (2023-07-17 15:49:27 +0100) +> +> are available in the Git repository at: +> +> https://github.com/alistair23/qemu.git tags/pull-riscv-to-apply-20230719-1 +> +> for you to fetch changes up to 32be32509987fbe42cf5c2fd3cea3c2ad6eae179: +> +> target/riscv: Fix LMUL check to use VLEN (2023-07-19 14:37:26 +1000) +> + +**[v1: risc-v: Add ISA extension smcntrpmf support](http://lore.kernel.org/qemu-devel/cover.1689631639.git.kaiwenx@rivosinc.com/)** + +> This patch series adds the support for RISC-V ISA extension smcntrpmf (cycle and +> privilege mode filtering) [1]. QEMU only calculates dummy cycles and +> instructions, so there is no actual means to stop the icount in QEMU. Therefore, +> this series only add the read/write behavior of the relevant CSRs such that the +> implemented firmware support [2] can work without causing unnecessary illegal +> instruction exceptions. +> +> [1] https://github.com/riscv/riscv-smcntrpmf +> [2] https://github.com/rivosinc/opensbi/tree/dev/kaiwenx/smcntrpmf_upstream +> + +**[v1: target/riscv: Clearing the CSR values at reset and syncing the MPSTATE with the host](http://lore.kernel.org/qemu-devel/20230718130317.12545-1-18622748025@163.com/)** + +> Fix the guest reboot error when using KVM +> There are two issues when rebooting a guest using KVM +> 1. When the guest initiates a reboot the host is unable to stop the vcpu +> 2. When running a SMP guest the qemu monitor system_reset causes a vcpu crash +> +> This can be fixed by clearing the CSR values at reset and syncing the +> MPSTATE with the host. 
+> + +**[v1: for-8.2: target/riscv: add zicntr and zihpm flags](http://lore.kernel.org/qemu-devel/20230717215419.124258-1-dbarboza@ventanamicro.com/)** + +> I decided to include flags for both timer/counter extensions to make it +> easier for us later on when dealing with the RVA22 profile (which +> includes both). +> +> The features were already implemented by Atish Patra some time ago, but +> back then these 2 extensions weren't introduced yet. This means that, +> aside from extra stuff in riscv,isa FDT no other functional changes were +> made. +> + +**[v1: target/riscv/cpu.c: check priv_ver before auto-enable zca/zcd/zcf](http://lore.kernel.org/qemu-devel/20230717154141.60898-1-dbarboza@ventanamicro.com/)** + +> Commit bd30559568 made changes in how we're checking and disabling +> extensions based on env->priv_ver. One of the changes was to move the +> extension disablement code to the end of realize(), being able to +> disable extensions after we've auto-enabled some of them. +> + +**[v3: for-8.2: target/riscv: add 'max' CPU, deprecate](http://lore.kernel.org/qemu-devel/20230714174311.672359-1-dbarboza@ventanamicro.com/)** + +> This version has changes suggested in v2. The most significant change is +> the deprecation of the 'any' CPU in patch 8. +> +> The reasoning behind it is that Alistair mentioned that the 'any' CPU +> intended to work like the newly added 'max' CPU, so we're better of +> removing the 'any' CPU since it'll be out of place. We can't just +> remove the CPU out of the gate so we'll have to make it do with +> deprecation first. +> + +**[v6: Add RISC-V KVM AIA Support](http://lore.kernel.org/qemu-devel/20230714084429.22349-1-yongxuan.wang@sifive.com/)** + +> This series adds support for KVM AIA in RISC-V architecture. 
+> +> In order to test these patches, we require Linux with KVM AIA support which can +> be found in the riscv_kvm_aia_hwaccel_v1 branch at +> https://github.com/avpatel/linux.git +> + +**[v2: for-8.2: target/riscv: add 'max' CPU type](http://lore.kernel.org/qemu-devel/20230712205748.446931-1-dbarboza@ventanamicro.com/)** + +> This second version has smalls tweak in patch 6 that I found out +> missing while chatting with Conor in the v1 review. +> + +**[riscv kvm breakage](http://lore.kernel.org/qemu-devel/629afcc2-ffed-c081-9564-7faa6defc1f4@linaro.org/)** + +> This breakage crept in while cross-riscv64-system was otherwise broken in configure: +> +> https://gitlab.com/qemu-project/qemu/-/jobs/4633277557#L4165 +> + +**[v3: target/riscv: Add Zihintntl extension ISA string to DTS](http://lore.kernel.org/qemu-devel/20230711070402.5846-1-jason.chien@sifive.com/)** + +> In v2, I rebased the patch on +> https://github.com/alistair23/qemu/tree/riscv-to-apply.next +> However, I forgot to add "Reviewed-by" in v2, so I add them in v3. +> + +**[v8: riscv: Add support for the Zfa extension](http://lore.kernel.org/qemu-devel/20230710071243.282464-1-christoph.muellner@vrull.eu/)** + +> Since QEMU does not support the RISC-V quad-precision floating-point +> ISA extension (Q), this patch does not include the instructions that +> depend on this extension. All other instructions are included in this +> patch. +> + +#### Buildroot + +**[boot/edk2: bump to version edk2-stable202305](http://lore.kernel.org/buildroot/20230713181301.6CE2486F7D@busybox.osuosl.org/)** + +> The main motivation of this bump is the RISC-V QEMU Virt support +> introduced in edk2-stable202302 (not yet supported in Buildroot). +> + +#### U-Boot + +**[v7: Add StarFive JH7110 PCIe drvier support](http://lore.kernel.org/u-boot/20230720112333.9255-1-minda.chen@starfivetech.com/)** + +> These PCIe series patches are based on the JH7110 RISC-V SoC and VisionFive V2 board. 
+> +> The PCIe driver depends on gpio, pinctrl, clk and reset driver to do init. +> The PCIe dts configuation includes all these setting. +> +> The PCIe drivers codes has been tested on the VisionFive V2 boards. +> The test devices includes M.2 NVMe SSD and Realtek 8169 Ethernet adapter. +> + +**[Pull request: u-boot-spi/master](http://lore.kernel.org/u-boot/20230713163628.1763568-1-jagan@amarulasolutions.com/)** + +> The following changes since commit bf5152d0108683bbaabf9d7a7988f61649fc33f4: +> +> Merge branch 'master' of https://source.denx.de/u-boot/custodians/u-boot-riscv (2023-07-12 13:10:04 -0400) +> +> are available in the Git repository at: +> +> https://source.denx.de/u-boot/custodians/u-boot-spi master +> +> for you to fetch changes up to 4a31e145217cecc3d421f96eafcd2cfd9c670929: +> +> mtd: spi-nor: Add support for w25q256jwm (2023-07-13 14:17:40 +0530) +> + +**[Please pull u-boot-marvell/master](http://lore.kernel.org/u-boot/6bebb605-7a3c-0281-d12d-cda1721492fe@denx.de/)** + +> please pull the following Marvell MVEBU related patches into master: +> +> - mvebu: Thecus: Misc enhancement and cleanup (Tony) +> - mvebu: Add AC5X Allied Telesis x240 board support incl NAND +> controller enhancements for this SoC (Chris) +> +> Here the Azure build, without any issues: +> +> https://dev.azure.com/sr0718/u-boot/_build/results?buildId=305&view=results +> + +**[v2: riscv: Initial support for Lichee PI 4A board](http://lore.kernel.org/u-boot/20230708112435.23583-1-dlan@gentoo.org/)** + +> Sipeed's Lichee PI 4A board is based on T-HEAD's TH1520 SoC which consists of +> quad core XuanTie C910 CPU, plus one C906 CPU and one E902 CPU. +> +> In this series, we add a basic device tree, including UART CPU, PLIC, make it +> capable of running into a serial console. 
+> + +## 20230709: Issue 53 + +### Kernel Updates + +#### RISC-V Architecture Support + +**[v3: RISC-V: archrandom support](http://lore.kernel.org/linux-riscv/20230709115549.2666557-1-sameo@rivosinc.com/)** + +> This patchset adds support for the archrandom API to the RISC-V +> architecture. +> +> The ratified crypto scalar extensions provide entropy bits via the seed +> CSR, as exposed by the Zkr extension. +> + +**[v1: riscv: support PREEMPT_DYNAMIC with static keys](http://lore.kernel.org/linux-riscv/20230709101653.720-1-jszhang@kernel.org/)** + +> Currently, each architecture can support PREEMPT_DYNAMIC through +> either static calls or static keys. To support PREEMPT_DYNAMIC on +> riscv, we face three choices: +> +> 1. only add static calls support to riscv +> As Mark pointed out in commit 99cf983cc8bc ("sched/preempt: Add +> PREEMPT_DYNAMIC using static keys"), static keys "...should have +> slightly lower overhead than non-inline static calls, as this +> effectively inlines each trampoline into the start of its callee. This +> may avoid redundant work, and may integrate better with CFI schemes." +> So even we add static calls(without inline static calls) to riscv, +> static keys is still a better choice. +> +> 2. add static calls and inline static calls to riscv +> Per my understanding, inline static calls requires objtool support +> which is not easy. +> + +**[v4: RISC-V: mm: Make SV48 the default address space](http://lore.kernel.org/linux-riscv/20230708011156.2697409-1-charlie@rivosinc.com/)** + +> Make sv48 the default address space for mmap as some applications +> currently depend on this assumption. Also enable users to select +> desired address space using a non-zero hint address to mmap. Previous +> kernel changes caused Java and other applications to be broken on sv57 +> which this patch fixes.
+> + +**[v2: module: Ignore RISC-V mapping symbols too](http://lore.kernel.org/linux-riscv/20230707160051.2305-2-palmer@rivosinc.com/)** + +> RISC-V has an extended form of mapping symbols that we use to encode +> the ISA when it changes in the middle of an ELF. This trips up modpost +> as a build failure, I haven't yet verified it yet but I believe the +> kallsyms difference should result in stacks looking sane again. +> + +**[GIT PULL: RISC-V Patches for the 6.5 Merge Window, Part 2](http://lore.kernel.org/linux-riscv/mhng-4bd23a4e-dd7c-4f62-90c8-804c137c2621@palmer-ri-x1c9/)** + +> merged tag 'riscv-for-linus-6.5-mw1' +> The following changes since commit 533925cb760431cb496a8c965cfd765a1a21d37e: +> +> Merge tag 'riscv-for-linus-6.5-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux (2023-06-30 09:37:26 -0700) +> +> are available in the Git repository at: +> +> git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git tags/riscv-for-linus-6.5-mw2 +> + +**[v6: tools/nolibc: add a new syscall helper](http://lore.kernel.org/linux-riscv/cover.1688739492.git.falcon@tinylab.org/)** + +> Here is the v6 of the __sysret series [1], applies your suggestions. +> additionally, the sbrk() also uses the __sysret helper. +> + +**[v1: RISC-V: Support querying vendor extensions](http://lore.kernel.org/linux-riscv/20230705-thead_vendor_extensions-v1-0-ad6915349c4d@rivosinc.com/)** + +> Introduce extensible method of querying vendor extensions. Keys above +> 1UL<<63 passed into the riscv_hwprobe syscall are reserved for vendor +> extensions. The appropriate vendor is resolved using the discovered +> mvendorid. Vendor specific code is then entered which determines how to +> respond to the input hwprobe key. 
+> + +**[v2: RISC-V: Show accurate per-hart isa in /proc/cpuinfo](http://lore.kernel.org/linux-riscv/20230705172931.1099183-1-evan@rivosinc.com/)** + +> In /proc/cpuinfo, most of the information we show for each processor is +> specific to that hart: marchid, mvendorid, mimpid, processor, hart, +> compatible, and the mmu size. But the ISA string gets filtered through a +> lowest common denominator mask, so that if one CPU is missing an ISA +> extension, no CPUs will show it. +> + +**[v3: Obtain SMBIOS and ACPI entry from FFI](http://lore.kernel.org/linux-riscv/20230705114251.661-1-cuiyunhui@bytedance.com/)** + +> Here's version 3 of patch series. +> + +**[v1: RISC-V: KVM: provide UAPI for host SATP mode](http://lore.kernel.org/linux-riscv/20230705091535.237765-1-dbarboza@ventanamicro.com/)** + +> KVM userspaces need to be aware of the host SATP to allow them to +> advertise it back to the guest OS. +> +> Since this information is used to build the guest FDT we can't wait for +> the SATP reg to be readable. We just need to read the SATP mode, thus +> we can use the existing 'satp_mode' global that represents the SATP reg +> with MODE set and both ASID and PPN cleared. E.g. for a 32 bit host +> running with sv32 satp_mode is 0x80000000, for a 64 bit host running +> sv57 satp_mode is 0xa000000000000000, and so on. +> + +**[v7: -next: support allocating crashkernel above 4G explicitly on riscv](http://lore.kernel.org/linux-riscv/20230704212327.1687310-1-chenjiahao16@huawei.com/)** + +> On riscv, the current crash kernel allocation logic is trying to +> allocate within 32bit addressable memory region by default, if +> failed, try to allocate without 4G restriction. +> +> In need of saving DMA zone memory while allocating a relatively large +> crash kernel region, allocating the reserved memory top down in +> high memory, without overlapping the DMA zone, is a mature solution. +> Hence this patchset introduces the parameter option crashkernel=X,[high,low].
+> + +#### Asynchronous I/O + +**[v1: io_uring: A new function has been defined to make get/put exist in pairs](http://lore.kernel.org/io-uring/20230706093208.6072-1-luhongfei@vivo.com/)** + +> A new function called io_put_task_refs has been defined for pairing +> with io_get_task_refs. +> +> In io_submit_sqes(), when req is not fully sent(i.e. left != 0), it is +> necessary to call the io_put_task_refs() to recover the current process's +> cached_refs and pair it with the io_get_task_refs(), which is easy to +> understand and looks more regular. +> + +### Related Technology Updates + +#### Qemu + +**[v3: target/riscv: improve code accuracy and](http://lore.kernel.org/qemu-devel/20230708091055.38505-1-reaperlu@hust.edu.cn/)** + +> I'm so sorry. As a newcomer, I'm not familiar with the patch mechanism. I mistakenly added the reviewer's "Reviewed-by" line into the wrong commit, So I have resent this patchset +> + +**[v1: target/riscv KVM_RISCV_SET_TIMER macro is not configured correctly](http://lore.kernel.org/qemu-devel/20230707032306.4606-1-gaoshanliukou@163.com/)** + +> Should set/get riscv all reg timer,i.e, time/compare/frequency/state. +> + +**[v2: riscv: Generate devicetree only after machine initialization is complete](http://lore.kernel.org/qemu-devel/20230706035937.1870483-1-linux@roeck-us.net/)** + +> If the devicetree is created before machine initialization is complete, +> it misses dynamic devices. Specifically, the tpm device is not added +> to the devicetree file and is therefore not instantiated in Linux. +> Load/create devicetree in virt_machine_done() to solve the problem. +> + +**[v1: riscv: add config for asid size](http://lore.kernel.org/qemu-devel/20230705105838.68806-1-ben.dooks@codethink.co.uk/)** + +> Add a config to the cpu state to control the size of the ASID area +> in the SATP CSR to enable testing with smaller than the default (which +> is currently maximum for both rv32 and rv64). It also adds the ability +> to stop the ASID feature by using 0 to disable it.
+> + +#### U-Boot + +**[v2: riscv: Initial support for Lichee PI 4A board](http://lore.kernel.org/u-boot/20230708112435.23583-1-dlan@gentoo.org/)** + +> Sipeed's Lichee PI 4A board is based on T-HEAD's TH1520 SoC which consists of +> quad core XuanTie C910 CPU, plus one C906 CPU and one E902 CPU. +> +> In this series, we add a basic device tree, including UART CPU, PLIC, make it +> capable of running into a serial console. +> +> Please note that, we rely on pre shipped vendor u-boot which run in M-Mode to +> chain load this mainline u-boot either via eMMC storage or from tftp, thus the +> pinctrl and clock setting are not implemented in this series, which certainly +> can be improved later accordingly. +> + +**[v1: riscv: (visionfive2:) device tree binding for riscv_timer](http://lore.kernel.org/u-boot/20230707135333.GA30112@lst.de/)** + +> following the existing device tree binding[1], here is a draft to use it +> in drivers/timer/riscv_timer.c. This would also fix the regression we see +> with commit 55171aedda8 ("dm: Emit the arch_cpu_init_dm() even only +> before relocation"), at least on the VisionFive2, as sketched out below. +> The device tree addition suits the Linux kernel driver +> + +**[v1: u-boot-riscv/riscv-for-next](http://lore.kernel.org/u-boot/ZKabX3HI7USoCEEt@ubuntu01/)** + +> The following changes since commit e80f4079b3a3db0961b73fa7a96e6c90242d8d25: +> +> Merge tag 'v2023.07-rc6' into next (2023-07-05 11:28:55 -0400) +> +> are available in the Git repository at: +> +> https://source.denx.de/u-boot/custodians/u-boot-riscv.git riscv-for-next +> + +## 20230705: Issue 52 + +### Kernel Updates + +#### RISC-V Architecture Support + +**[v7: -next: support allocating crashkernel above 4G explicitly on riscv](http://lore.kernel.org/linux-riscv/20230704212327.1687310-1-chenjiahao16@huawei.com/)** + +> On riscv, the current crash kernel allocation logic is trying to +> allocate within 32bit addressable memory region by default, if +> failed, try to allocate without 4G restriction.
+> + +**[v1: riscv: Start of DRAM should at least be aligned on PMD size for the direct mapping](http://lore.kernel.org/linux-riscv/20230704121837.248976-1-alexghiti@rivosinc.com/)** + +> So that we do not end up mapping the whole linear mapping using 4K +> pages, which is slow at boot time, and also very likely at runtime. +> +> So make sure we align the start of DRAM on a PMD boundary. +> + +**[v4: Add initialization of clock for StarFive JH7110 SoC](http://lore.kernel.org/linux-riscv/20230704091948.85247-4-william.qiu@starfivetech.com/)** + +> This patchset adds initial rudimentary support for the StarFive +> Quad SPI controller driver. And this driver will be used in +> StarFive's VisionFive 2 board. In 6.4, the QSPI_AHB and QSPI_APB +> clocks changed from the default ON state to the default OFF state, +> so these clocks need to be enabled in the driver. At the same time, +> dts patch is added to this series. +> + +**[v1: Add SPI module for StarFive JH7110 SoC](http://lore.kernel.org/linux-riscv/20230704091948.85247-1-william.qiu@starfivetech.com/)** + +> This patchset adds initial rudimentary support for the StarFive +> SPI controller. And this driver will be used in StarFive's +> VisionFive 2 board. The first patch constrains minItems of clocks +> for JH7110 SPI and Patch 2 adds support for StarFive JH7110 SPI. +> + +**[v6: Add PLL clocks driver and syscon for StarFive JH7110 SoC](http://lore.kernel.org/linux-riscv/20230704064610.292603-1-xingyu.wu@starfivetech.com/)** + +> This patch series is to add PLL clocks driver and providers by writing +> and reading syscon registers for the StarFive JH7110 RISC-V SoC. And add +> documentation and nodes to describe StarFive System Controller(syscon) +> Registers. This patch series is based on Linux 6.4.
+> + +**[v4: riscv: Allow userspace to directly access perf counters](http://lore.kernel.org/linux-riscv/20230703124647.215952-1-alexghiti@rivosinc.com/)** + +> riscv used to allow direct access to cycle/time/instret counters, +> bypassing the perf framework, this patchset intends to allow the user to +> mmap any counter when accessed through perf. But we can't break the +> existing behaviour so we introduce a sysctl perf_user_access like arm64 +> does, which defaults to the legacy mode described above. +> + +**[v3: RISC-V: Probe DT extension support using riscv,isa-extensions & riscv,isa-base](http://lore.kernel.org/linux-riscv/20230703-repayment-vocalist-e4f3eeac2b2a@wendy/)** + +> Based on my latest iteration of deprecating riscv,isa [1], here's an +> implementation of the new properties for Linux. The first few patches, +> up to "RISC-V: split riscv_fill_hwcap() in 3", are all prep work that +> further tames some of the extension related code, on top of my already +> applied series that cleans up the ISA string parser. +> Perhaps "RISC-V: shunt isa_ext_arr to cpufeature.c" is a bit gratuitous, +> but I figured a bit of coalescing of extension related data structures +> would be a good idea. Note that riscv,isa will still be used in the +> absence of the new properties. Palmer suggested adding a Kconfig option +> to turn off the fallback for DT, which I have gone and done. It's locked +> behind the NONPORTABLE option for good reason. +> + +**[v1: riscv: optimize ELF relocation function in riscv](http://lore.kernel.org/linux-riscv/1688355132-62933-1-git-send-email-lixiaoyun@binary-semi.com/)** + +> The patch can optimize the running times of insmod command by modify ELF +> relocation function. +> In the 5.10 and latest kernel, when install the riscv ELF drivers which +> contains multiple symbol table items to be relocated, kernel takes a lot +> of time to execute the relocation. For example, we install a 3+MB driver +> need 180+s. 
+> We focus on the riscv architecture handle R_RISCV_HI20 and R_RISCV_LO20 +> type items relocation function in the arch\riscv\kernel\module.c and +> find that there are two-loops in the function. If we modify the begin +> number in the second for-loops iteration, we could save significant time +> for installation. We install the same 3+MB driver could just need 2s. +> + +**[v10: Add non-coherent DMA support for AX45MP](http://lore.kernel.org/linux-riscv/20230702203429.237615-1-prabhakar.mahadev-lad.rj@bp.renesas.com/)** + +> On the Andes AX45MP core, cache coherency is a specification option so it +> may not be supported. In this case DMA will fail. To get around with this +> issue this patch series does the below: +> +> 1] Andes alternative ports is implemented as errata which checks if the +> IOCP is missing and only then applies to CMO errata. One vendor specific +> SBI EXT (ANDES_SBI_EXT_IOCP_SW_WORKAROUND) is implemented as part of +> errata. +> + +**[v5: dt-bindings: riscv: deprecate riscv,isa](http://lore.kernel.org/linux-riscv/20230702-eats-scorebook-c951f170d29f@spud/)** + +> When the RISC-V dt-bindings were accepted upstream in Linux, the base +> ISA etc had yet to be ratified. By the ratification of the base ISA, +> incompatible changes had snuck into the specifications - for example the +> Zicsr and Zifencei extensions were spun out of the base ISA. +> + +**[v5: RISCV: Add KVM_GET_REG_LIST API](http://lore.kernel.org/linux-riscv/cover.1688010022.git.haibo1.xu@intel.com/)** + +> KVM_GET_REG_LIST will dump all register IDs that are available to +> KVM_GET/SET_ONE_REG and It's very useful to identify some platform +> regression issue during VM migration. 
+> + +**[GIT PULL: RISC-V Patches for the 6.5 Merge Window, Part 1](http://lore.kernel.org/linux-riscv/mhng-ebcc1b82-5dd0-4f2d-824e-8d9250374abf@palmer-ri-x1c9/)** + +> The following changes since commit ac9a78681b921877518763ba0e89202254349d1b: +> +> Linux 6.4-rc1 (2023-05-07 13:34:35 -0700) +> +> are available in the Git repository at: +> +> git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git tags/riscv-for-linus-6.5-mw1 +> + +**[v1: Add missing pins for RZ/Five SoC](http://lore.kernel.org/linux-riscv/20230630120433.49529-1-prabhakar.mahadev-lad.rj@bp.renesas.com/)** + +> This patch series intends to incorporate the absent port pins P19 to P28, +> which are exclusively available on the RZ/Five SoC. +> + +**[v2: riscv: Add BUG_ON() for no cpu nodes in devicetree](http://lore.kernel.org/linux-riscv/20230630105938.1377262-1-suagrfillet@gmail.com/)** + +> When only the ACPI tables are passed to kernel, the tiny devictree created +> by EFI Stub doesn't provide cpu nodes. +> + +**[v1: riscv: KCFI support](http://lore.kernel.org/linux-riscv/20230629234244.1752366-8-samitolvanen@google.com/)** + +> This series adds KCFI support for RISC-V. KCFI is a fine-grained +> forward-edge control-flow integrity scheme supported in Clang >=16, +> which ensures indirect calls in instrumented code can only branch to +> functions whose type matches the function pointer type, thus making +> code reuse attacks more difficult. +> + +**[v1: RISC-V: Provide a more helpful error message on invalid ISA strings](http://lore.kernel.org/linux-riscv/20230629223502.1924-1-palmer@rivosinc.com/)** + +> This adds a warning for the cases where the ISA string isn't valid. It's still +> above the BUG_ON cut, but hopefully it's at least a bit easier for users. 
+> + +**[v4: riscv: Discard vector state on syscalls](http://lore.kernel.org/linux-riscv/20230629142228.1125715-1-bjorn@kernel.org/)** + +> The RISC-V vector specification states: +> Executing a system call causes all caller-saved vector registers +> (v0-v31, vl, vtype) and vstart to become unspecified. +> +> The vector registers are set to all 1s, vill is set (invalid), and the +> vector status is set to Dirty. +> + +**[v1: arch,fbdev: Move screen_info into arch/](http://lore.kernel.org/linux-riscv/20230629121952.10559-1-tzimmermann@suse.de/)** + +> The variables screen_info and edid_info provide information about +> the system's screen, and possibly EDID data of the connected display. +> Both are defined and set by architecture code. But both variables are +> declared in non-arch header files. Dependencies are at best loosely +> tracked. To resolve this, move the global state screen_info and its +> companion edid_info into arch/. Only declare them on architectures +> that define them. List dependencies on the variables in the Kconfig +> files. Also clean up the callers. +> + +**[v1: riscv: BUG_ON() for no cpu nodes in setup_smp](http://lore.kernel.org/linux-riscv/20230629105839.1160895-1-suagrfillet@gmail.com/)** + +> When booting with ACPI tables, the tiny devicetree created by +> EFI Stub doesn't provide cpu nodes. +> +> In setup_smp(), of_parse_and_init_cpus() will bug on !found_boot_cpu +> if acpi_disabled. That's unclear, so bug for no cpu nodes before +> of_parse_and_init_cpus(). +> + +**[v8: Add JH7110 USB PHY driver support](http://lore.kernel.org/linux-riscv/20230629075115.11934-1-minda.chen@starfivetech.com/)** + +> This patchset adds USB and PCIe PHY for the StarFive JH7110 SoC. +> The patch has been tested on the VisionFive 2 board.
+> + +**[v1: RISC-V: Document the ISA string parsing rules for ACPI](http://lore.kernel.org/linux-riscv/20230629031705.15575-1-palmer@rivosinc.com/)** + +> We've had a ton of issues around the ISA string parsing rules elsewhere +> in RISC-V, so let's at least be clear about what the rules are so we can +> try and avoid more issues. +> + +**[v1: tools/nolibc: shrink arch support](http://lore.kernel.org/linux-riscv/cover.1687976753.git.falcon@tinylab.org/)** + +> This patchset further improves porting of nolibc to new architectures, +> it is based on our previous v5 sysret helper series [1]. +> +> It mainly shrinks the assembly _start by moving most of its operations +> to a C version of _start_c() function. and also, it removes the old +> sys_stat() support by using the sys_statx() instead and therefore, +> removes all of the arch specific sys_stat_struct. +> + +**[v2: RISC-V: archrandom support](http://lore.kernel.org/linux-riscv/20230628131442.3022772-1-sameo@rivosinc.com/)** + +> This patchset adds support for the archrandom API to the RISC-V +> architecture. +> +> The ratified crypto scalar extensions provide entropy bits via the seed +> CSR, as exposed by the Zkr extension. +> + +**[v5: tools/nolibc: add a new syscall helper](http://lore.kernel.org/linux-riscv/cover.1687957589.git.falcon@tinylab.org/)** + +> It mainly applies the core part of suggestions from Thomas (Many thanks) +> and cleans up the multiple whitespaces issues reported by +> scripts/checkpatch.pl. +> + +**[v1: riscv: sigcontext: Correct the comment of sigreturn](http://lore.kernel.org/linux-riscv/20230628091213.2908149-1-guoren@kernel.org/)** + +> The real-time signals enlarged the sigset_t type, and most architectures +> have changed to using rt_sigreturn as the only way. The riscv is one of +> them, and there is no sys_sigreturn in it. Only some old architecture +> preserved sys_sigreturn as part of the historical burden. 
+> + +**[GIT PULL: RISC-V: make ARCH_THEAD preclude XIP_KERNEL](http://lore.kernel.org/linux-riscv/20230628-left-attractor-94b7bd5fbb83@wendy/)** + +> Randy reported build errors in linux-next where XIP_KERNEL was enabled. +> ARCH_THEAD requires alternatives to support the non-standard ISA +> extensions used by the THEAD cores, which are mutually exclusive with +> XIP kernels. Clone the dependency list from the Allwinner entry, since +> Allwinner's D1 uses T-Head cores with the same non-standard extensions. +> + +**[v1: Make SV39 the default address space](http://lore.kernel.org/linux-riscv/20230627222152.177716-1-charlie@rivosinc.com/)** + +> Make sv39 the default address space for mmap as some applications +> currently depend on this assumption. The RISC-V specification enforces +> that bits outside of the virtual address range are not used, so +> restricting the size of the default address space as such should be +> temporary. A hint address passed to mmap will cause the largest address +> space that fits entirely into the hint to be used. If the hint is less +> than or equal to 1<<38, a 39-bit address will be used. After an address +> space is completely full, the next smallest address space will be used. +> + +**[v3: Add support for Allwinner PWM on D1/T113s/R329 SoCs](http://lore.kernel.org/linux-riscv/20230627082334.1253020-1-privatesub2@gmail.com/)** + +> This series adds support for PWM controller on new +> Allwinner's SoCs, such as D1, T113s and R329. The implemented driver +> provides basic functionality for control PWM channels. +> + +#### Process Scheduling + +**[v3: sched/core: introduce sched_core_idle_cpu()](http://lore.kernel.org/lkml/1688011324-42406-1-git-send-email-CruzZhao@linux.alibaba.com/)** + +> As core scheduling introduced, a new state of idle is defined as +> force idle, running idle task but nr_running greater than zero.
+> + +**[v1: sched/core: Use empty mask to reset cpumasks in sched_setaffinity()](http://lore.kernel.org/lkml/20230628211637.1679348-1-longman@redhat.com/)** + +> Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested +> cpumask"), user provided CPU affinity via sched_setaffinity(2) is +> preserved even if the task is being moved to a different cpuset. However, +> that affinity is also being inherited by any subsequently created child +> processes which may not want or be aware of that affinity. +> + +**[v3: Sched/fair: Block nohz tick_stop when cfs bandwidth in use](http://lore.kernel.org/lkml/20230628190227.894195-1-pauld@redhat.com/)** + +> CFS bandwidth limits and NOHZ full don't play well together. Tasks +> can easily run well past their quotas before a remote tick does +> accounting. This leads to long, multi-period stalls before such +> tasks can run again. Currently, when presented with these conflicting +> requirements the scheduler is favoring nohz_full and letting the tick +> be stopped. However, nohz tick stopping is already best-effort, there +> are a number of conditions that can prevent it, whereas cfs runtime +> bandwidth is expected to be enforced. +> + +#### Memory Management + +**[v3: MDWE without inheritance](http://lore.kernel.org/linux-mm/20230704153630.1591122-1-revest@chromium.org/)** + +> Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags +> current with a flag that prevents pages that were previously not executable from +> becoming executable. +> This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK) +> + +**[v2: mm/slub: refactor freelist to use custom type](http://lore.kernel.org/linux-mm/20230704135834.3884421-1-matteorizzo@google.com/)** + +> Currently the SLUB code represents encoded freelist entries as "void*". +> That's misleading, those things are encoded under +> CONFIG_SLAB_FREELIST_HARDENED so that they're not actually dereferencable.
+> + +**[v1: block: Make blkdev_get_by_*() return handle](http://lore.kernel.org/linux-mm/20230629165206.383-1-jack@suse.cz/)** + +> this patch series implements the idea of blkdev_get_by_*() calls returning +> bdev_handle which is then passed to blkdev_put() [1]. This makes the get +> and put calls for bdevs more obviously matching and allows us to propagate +> context from get to put without having to modify all the users (again!). +> In particular I need to propagate used open flags to blkdev_put() to be able +> count writeable opens and add support for blocking writes to mounted block +> devices. I'll send that series separately. +> + +**[v1: mm: memory-failure: add missing set_mce_nospec() for memory_failure()](http://lore.kernel.org/linux-mm/20230704121948.1331846-1-linmiaohe@huawei.com/)** + +> If memory_failure() succeeds to hwpoison a page, the set_mce_nospec() is +> expected to be called to prevent speculative access to the page by marking +> it not-present. Add such missing call to set_mce_nospec() in async memory +> failure handling scene. +> + +**[v1: mm: page_alloc: avoid false page outside zone error info](http://lore.kernel.org/linux-mm/20230704111823.940331-1-linmiaohe@huawei.com/)** + +> If pfn is outside zone boundaries in the first round, ret will be set +> to 1. But if pfn is changed to inside the zone boundaries in zone span +> seqretry path, ret is still set to 1 leading to false page outside zone +> error info. +> + +**[v3: Documentation: admin-guide: correct "it's" to possessive "its"](http://lore.kernel.org/linux-mm/20230703232024.8069-1-rdunlap@infradead.org/)** + +> Correct 2 uses of "it's" to the possessive "its" as needed. +> + +**[v2: variable-order, large folios for anonymous memory](http://lore.kernel.org/linux-mm/20230703135330.1865927-1-ryan.roberts@arm.com/)** + +> This is v2 of a series to implement variable order, large folios for anonymous +> memory. 
The objective of this is to improve performance by allocating larger +> chunks of memory during anonymous page faults. See [1] for background. +> + +**[[PATCH v10 rebased on v6.4 00/25] DEPT(Dependency Tracker)](http://lore.kernel.org/linux-mm/20230703094752.79269-1-byungchul@sk.com/)** + +> From now on, I can work on LKML again! I'm wondering if DEPT has been +> helping kernel debugging well even though it's a form of patches yet. +> + +**[v1: mm: make MEMFD_CREATE into a selectable config option](http://lore.kernel.org/linux-mm/20230630-config-memfd-v1-1-9acc3ae38b5a@weissschuh.net/)** + +> The memfd_create() syscall, enabled by CONFIG_MEMFD_CREATE, is useful on +> its own even when not required by CONFIG_TMPFS or CONFIG_HUGETLBFS. +> +> Split it into its own proper bool option that can be enabled by users. +> + +**[v2: Documentation: mm/memfd: vm.memfd_noexec](http://lore.kernel.org/linux-mm/20230629233454.4166842-1-jeffxu@google.com/)** + +> Add documentation for sysctl vm.memfd_noexec +> +> Link:https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5a8MaQY3Jw@mail.gmail.com/T/ +> + +**[v2: mm/slub: disable slab merging in the default configuration](http://lore.kernel.org/linux-mm/20230629221910.359711-1-julian.pidancet@oracle.com/)** + +> Make CONFIG_SLAB_MERGE_DEFAULT default to n unless CONFIG_SLUB_TINY is +> enabled. Benefits of slab merging is limited on systems that are not +> memory constrained: the memory overhead is low and evidence of its +> effect on cache hotness is hard to come by. +> + +**[v25: crash: Kernel handling of CPU and memory hot un/plug](http://lore.kernel.org/linux-mm/20230629192119.6613-1-eric.devolder@oracle.com/)** + +> This series is dependent upon "refactor Kconfig to consolidate +> KEXEC and CRASH options". 
+> https://lore.kernel.org/lkml/20230626161332.183214-1-eric.devolder@oracle.com/ +> +> Once the kdump service is loaded, if changes to CPUs or memory occur, +> either by hot un/plug or off/onlining, the crash elfcorehdr must also +> be updated. +> + +**[v1: mm: Always downgrade mmap_lock if requested](http://lore.kernel.org/linux-mm/20230629191414.1215929-1-willy@infradead.org/)** + +> Now that stack growth must always hold the mmap_lock for write, we can +> always downgrade the mmap_lock to read and safely unmap pages from the +> page table, even if we're next to a stack. +> + +**[v1: writeback: Account the number of pages written back](http://lore.kernel.org/linux-mm/20230628185548.981888-1-willy@infradead.org/)** + +> nr_to_write is a count of pages, so we need to decrease it by the number +> of pages in the folio we just wrote, not by 1. Most callers specify +> either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes() +> might end up writing 512x as many pages as it asked for. +> + +**[v24: crash: Kernel handling of CPU and memory hot un/plug](http://lore.kernel.org/linux-mm/20230628185215.40707-1-eric.devolder@oracle.com/)** + +> This series is dependent upon "refactor Kconfig to consolidate +> KEXEC and CRASH options". +> https://lore.kernel.org/lkml/20230626161332.183214-1-eric.devolder@oracle.com/ +> +> Once the kdump service is loaded, if changes to CPUs or memory occur, +> either by hot un/plug or off/onlining, the crash elfcorehdr must also +> be updated. +> + +**[v1: fs/address_space: add alignment padding for i_map and i_mmap_rwsem to mitigate a false sharing.](http://lore.kernel.org/linux-mm/20230628105624.150352-1-lipeng.zhu@intel.com/)** + +> When running UnixBench/Shell Scripts, we observed high false sharing +> for accessing i_mmap against i_mmap_rwsem. +> +> UnixBench/Shell Scripts are typical load/execute command test scenarios, +> the i_mmap will be accessed frequently to insert/remove vma_interval_tree. 
+> Meanwhile, the i_mmap_rwsem is frequently loaded. Unfortunately, they are +> in the same cacheline. +> + +**[v2: mm/slub: Optimize slub memory usage](http://lore.kernel.org/linux-mm/20230628095740.589893-1-jaypatel@linux.ibm.com/)** + +> In the previous version [1], we were able to reduce slub memory +> wastage, but the total memory was also increasing so to solve +> this problem I have modified the patch as follows: +> +> 1) If min_objects * object_size > PAGE_ALLOC_COSTLY_ORDER, then it +> will return with PAGE_ALLOC_COSTLY_ORDER. +> 2) Similarly, if min_objects * object_size < PAGE_SIZE, then it will +> return with slub_min_order. +> 3) Additionally, I changed slub_max_order to 2. There is no specific +> reason for using the value 2, but it provided the best results in +> terms of performance without any noticeable impact. +> + +#### File Systems + +**[v2: 0/6: block: Add config option to not allow writing to mounted devices](http://lore.kernel.org/linux-fsdevel/20230704122727.17096-1-jack@suse.cz/)** + +> This is the second version of the patches to add config option to not allow writing +> to mounted block devices. For motivation why this is interesting see patch 1/6. +> I've been testing the patches more extensively this time and I've found a couple +> of things that get broken by disallowing writes to mounted block devices: +> 1) Bind mounts get broken because get_tree_bdev() / mount_bdev() first try to +> claim the bdev before searching whether it is already mounted. Patch 6 +> reworks the mount code to avoid this problem. +> 2) btrfs mounting is likely having the same problem as 1). It should be fixable +> AFAICS but for now I've left it alone until we settle on the rest of the +> series. +> 3) "mount -o loop" gets broken because util-linux keeps the loop device open +> read-write when attempting to mount it. Hopefully fixable within util-linux. 
+> 4) resize2fs online resizing gets broken because it tries to open the block +> device read-write only to call resizing ioctl. Trivial to fix within +> e2fsprogs. +> + +**[v1: block: Make blkdev_get_by_*() return handle](http://lore.kernel.org/linux-fsdevel/20230629165206.383-1-jack@suse.cz/)** + +> this patch series implements the idea of blkdev_get_by_*() calls returning +> bdev_handle which is then passed to blkdev_put() [1]. This makes the get +> and put calls for bdevs more obviously matching and allows us to propagate +> context from get to put without having to modify all the users (again!). +> In particular I need to propagate used open flags to blkdev_put() to be able +> count writeable opens and add support for blocking writes to mounted block +> devices. I'll send that series separately. +> + +**[v5: fanotify accounting for fs/splice.c](http://lore.kernel.org/linux-fsdevel/cover.1688393619.git.nabijaczleweli@nabijaczleweli.xyz/)** + +> Previously: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u +> +> In short: +> * most read/write APIs generate ACCESS/MODIFY for the read/written file(s) +> * except the [vm]splice/tee family +> (actually, since 6.4, splice itself /does/ generate events but only +> for the non-pipes being spliced from/to; this commit is Fixes:ed) +> * userspace that registers (i|fa)notify on pipes usually relies on it +> actually working (coreutils tail -f is the primo example) +> * it's sub-optimal when someone with a magic syscall can fill up a +> pipe simultaneously ensuring it will never get serviced +> + +**[[PATCH v10 rebased on v6.4 00/25] DEPT(Dependency Tracker)](http://lore.kernel.org/linux-fsdevel/20230703094752.79269-1-byungchul@sk.com/)** + +> From now on, I can work on LKML again! I'm wondering if DEPT has been +> helping kernel debugging well even though it's a form of patches yet. 
+> + +**[GIT PULL: iomap: new code for 6.5](http://lore.kernel.org/linux-fsdevel/168831482682.535407.9162875426107097138.stg-ugh@frogsfrogsfrogs/)** + +> Please pull this branch with changes for iomap for 6.5-rc1. +> +> As usual, I did a test-merge with the main upstream branch as of a few +> minutes ago, and didn't see any conflicts. Please let me know if you +> encounter any problems. +> + +**[v1: proc: proc_setattr for /proc/$PID/net](http://lore.kernel.org/linux-fsdevel/20230630140609.263790-1-falcon@tinylab.org/)** + +> Just applied your patchset on v6.4, and then: +> +> - revert the 1st patch: 'selftests/nolibc: drop test chmod_net' manually +> +> - do the 'run' test of nolibc on arm/vexpress-a9 +> + +**[v3: fuse: add a new fuse init flag to relax restrictions in no cache mode](http://lore.kernel.org/linux-fsdevel/20230630094602.230573-1-hao.xu@linux.dev/)** + +> Patch 1 is a fix for private mmap in FOPEN_DIRECT_IO mode +> This is added here together since the latter two depend on it. +> Patch 2 is the main dish +> Patch 3 is to maintain direct write logic for shared mmap in FOPEN_DIRECT_IO mode +> + +**[v1: fs: Optimize unixbench's file copy test](http://lore.kernel.org/linux-fsdevel/1688117303-8294-1-git-send-email-zenghongling@kylinos.cn/)** + +> The iomap_set_range_uptodate function checks if the file is a private +> mapping, and if it is, it needs to do something about it. UnixBench's +> file copy tests are mostly shared mappings, such a check would reduce +> file copy scores, so we added the unlikely macro for optimization. +> The score of file copy can be improved after branch optimization. +> + +**[v1: fanotify: disallow mount/sb marks on kernel internal pseudo fs](http://lore.kernel.org/linux-fsdevel/20230629042044.25723-1-amir73il@gmail.com/)** + +> Hopefully, nobody is trying to abuse mount/sb marks for watching all +> anonymous pipes/inodes. 
+> +> I cannot think of a good reason to allow this - it looks like an +> oversight that dated back to the original fanotify API. +> + +**[GIT PULL: sysctl changes for v6.5-rc1](http://lore.kernel.org/linux-fsdevel/ZJx62RvS9TwjUUCi@bombadil.infradead.org/)** + +> The following changes since commit f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6: +> +> Linux 6.4-rc2 (2023-05-14 12:51:40 -0700) +> +> are available in the Git repository at: +> +> git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ tags/v6.5-rc1-sysctl-next +> +> for you to fetch changes up to 2f2665c13af4895b26761107c2f637c2f112d8e9: +> +> sysctl: replace child with an enumeration (2023-06-18 02:32:54 -0700) +> + +#### Network Devices + +**[v3: net: nfp: clean mc addresses in application firmware when closing port](http://lore.kernel.org/netdev/20230705052818.7122-1-louis.peens@corigine.com/)** + +> When moving devices from one namespace to another, mc addresses are +> cleaned in software while not removed from application firmware. Thus +> the mc addresses remain and will cause a resource leak. +> + +**[v2: iwl-net: ice: prevent call trace during reload](http://lore.kernel.org/netdev/20230705040510.906029-1-michal.swiatkowski@linux.intel.com/)** + +> Calling ethtool during reload can lead to call trace, because VSI isn't +> configured for some time, but netdev is alive. +> +> To fix it add rtnl lock for VSI deconfig and config. Set ::num_q_vectors +> to 0 after freeing and add a check for ::tx/rx_rings in ring related +> ethtool ops. +> +> Add proper unroll of filters in ice_start_eth(). +> + +**[v1: net: octeontx2-af: Promisc enable/disable through mbox](http://lore.kernel.org/netdev/20230705033813.2744357-1-rkannoth@marvell.com/)** + +> In Legacy silicon, promisc mode is only modified +> through CGX mbox messages. In CN10KB silicon, it is modified +> from CGX mbox and NIX. This breaks legacy application +> behaviour. Fix this by removing call from NIX. 
+> + +**[v2: vduse: add support for networking devices](http://lore.kernel.org/netdev/20230704164045.39119-1-maxime.coquelin@redhat.com/)** + +> This small series enables virtio-net device type in VDUSE. +> With it, basic operations have been tested, both with +> virtio-vdpa and vhost-vdpa using DPDK Vhost library series +> adding VDUSE support using split rings layout (merged in +> DPDK v23.07-rc1). +> + +**[v1: net: ftmac100: add multicast filtering possibility](http://lore.kernel.org/netdev/20230704154053.3475336-1-saproj@gmail.com/)** + +> If netdev_mc_count() is not zero and not IFF_ALLMULTI, filter +> incoming multicast packets. The chip has a Multicast Address Hash Table +> for allowed multicast addresses, so we fill it. +> + +**[v1: net: sched: Undo tcf_bind_filter in case of errors in set callbacks](http://lore.kernel.org/netdev/20230704151456.52334-1-victor@mojatatu.com/)** + +> Five different classifiers (fw, bpf, u32, matchall, and flower) are +> calling tcf_bind_filter in their callbacks, but weren't undoing it by +> calling tcf_unbind_filter if there was an error after binding. +> +> This patch set fixes all this by calling tcf_unbind_filter in such +> cases. +> + +**[v5: bpf-next: Add SO_REUSEPORT support for TC bpf_sk_assign](http://lore.kernel.org/netdev/20230613-so-reuseport-v5-0-f6686a0dbce0@isovalent.com/)** + +> We want to replace iptables TPROXY with a BPF program at TC ingress. +> To make this work in all cases we need to assign a SO_REUSEPORT socket +> to an skb, which is currently prohibited. This series adds support for +> such sockets to bpf_sk_assign. 
+> + +**[v1: resubmit: net: fec: Refactor: rename `adapter` to `fep`](http://lore.kernel.org/netdev/20230704114058.5785-1-csokas.bence@prolan.hu/)** + +> Rename local `struct fec_enet_private *adapter` to `fep` in `fec_ptp_gettime()` to match the rest of the driver +> + +**[v1: igb: Add support for AF_XDP zero-copy](http://lore.kernel.org/netdev/20230704095915.9750-1-sriram.yagnaraman@est.tech/)** + +> Disclaimer: My first patches to Intel drivers, implemented AF_XDP +> zero-copy feature which seemed to be missing for igb. Not sure if it was +> a conscious choice to not spend time implementing this for older +> devices, nevertheless I send them to the list for review. +> + +**[v1: net: phy: at803x: support qca8081 1G version chip](http://lore.kernel.org/netdev/20230704090016.7757-1-quic_luoj@quicinc.com/)** + +> This patch series add supporting qca8081 1G version chip, the 1G version +> chip can be identified by the register mmd7.0x901d bit0. +> + +**[v1: net-next: bnxt_en: use dev_consume_skb_any() in bnxt_tx_int](http://lore.kernel.org/netdev/20230704085236.9791-1-imagedong@tencent.com/)** + +> Replace dev_kfree_skb_any() with dev_consume_skb_any() in bnxt_tx_int() +> to clear the unnecessary noise of "kfree_skb" event. +> + +**[v2: net: dsa: SERDES support for mv88e632x family](http://lore.kernel.org/netdev/20230704065916.132486-1-michael.haener@siemens.com/)** + +> This patch series brings SERDES support for the mv88e632x family. 
+> + +**[v1: can: j1939: prevent deadlock by changing j1939_socks_lock to rwlock](http://lore.kernel.org/netdev/20230704064710.3189-1-astrajoan@yahoo.com/)** + +> The following 3 locks would race against each other, causing the +> deadlock situation in the Syzbot bug report: +> +> - j1939_socks_lock +> - active_session_list_lock +> - sk_session_queue_lock +> +> A reasonable fix is to change j1939_socks_lock to an rwlock, since in +> the rare situations where a write lock is required for the linked list +> that j1939_socks_lock is protecting, the code does not attempt to +> acquire any more locks. This would break the circular lock dependency, +> where, for example, the current thread already locks j1939_socks_lock +> and attempts to acquire sk_session_queue_lock, and at the same time, +> another thread attempts to acquire j1939_socks_lock while holding +> sk_session_queue_lock. +> + +**[v2: bpf-next: XDP metadata via kfuncs for ice](http://lore.kernel.org/netdev/20230703181226.19380-1-larysa.zaremba@intel.com/)** + +> This series introduces XDP hints via kfuncs [0] to the ice driver. +> +> Series brings the following existing hints to the ice driver: +> - HW timestamp +> - RX hash with type +> +> Series also introduces new hints and adds their implementation +> to ice and veth: +> - VLAN tag with protocol +> - Checksum level +> + +**[v1: net: Replace strlcpy with strscpy](http://lore.kernel.org/netdev/20230703175840.3706231-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> No return values were used, so direct replacement is safe. 
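
The difference between `strlcpy()` and `strscpy()` described in the entry above can be illustrated outside the kernel. The following is only a userspace sketch: `model_strlcpy()`, `model_strscpy()` and `strscpy_demo()` are hypothetical names introduced here, not kernel functions, and the model only assumes the documented behaviour that `strscpy()` reports truncation with `-E2BIG` and never reads more than the destination size from the source.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t (POSIX) */

/* Hypothetical userspace model of strlcpy(): always measures the whole source. */
static size_t model_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);	/* reads src up to its NUL, however long */

	if (size > 0) {
		size_t n = (len >= size) ? size - 1 : len;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;			/* length of src, not of the copy */
}

/* Hypothetical userspace model of strscpy(): reads at most `size` bytes of src. */
static ssize_t model_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;

	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* src[i] is still within the first `size` bytes of src. */
	return (src[i] == '\0') ? (ssize_t)i : -E2BIG;
}

/* Usage: the same truncating copy, seen through both return conventions. */
void strscpy_demo(void)
{
	char buf[4];

	/* The strlcpy model returns the source length; truncation is implicit. */
	assert(model_strlcpy(buf, "abcdef", sizeof(buf)) == 6);
	assert(strcmp(buf, "abc") == 0);

	/* The strscpy model copies the same bytes but reports truncation. */
	assert(model_strscpy(buf, "abcdef", sizeof(buf)) == -E2BIG);
	assert(strcmp(buf, "abc") == 0);

	/* When the source fits, the number of copied characters is returned. */
	assert(model_strscpy(buf, "ab", sizeof(buf)) == 2);
}
```

Because the two return conventions differ, a mechanical replacement is only safe where the return value is ignored, which is the condition the patch description states.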
+> + +**[v1: bpf, net: Allow setting SO_TIMESTAMPING* from BPF](http://lore.kernel.org/netdev/20230703175048.151683-1-jthinz@mailbox.tu-berlin.de/)** + +> BPF applications, e.g., a TCP congestion control, might benefit from +> precise packet timestamps. These timestamps are already available in +> __sk_buff and bpf_sock_ops, but could not be requested: A BPF program +> was not allowed to set SO_TIMESTAMPING* on a socket. This change enables +> BPF programs to actively request the generation of timestamps from a +> stream socket. +> + +**[v1: bpf-next: xsk: honor SO_BINDTODEVICE on bind](http://lore.kernel.org/netdev/20230703175329.3259672-1-i.maximets@ovn.org/)** + +> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability. +> A privileged process might create the socket and pass it to a +> non-privileged process for later use. However, that process will be +> able to bind the socket to any network interface. Even though it will +> not be able to receive any traffic without modification of the BPF map, +> the situation is not ideal. +> + +**[v3: octeontx2-pf: Add additional check for MCAM rules](http://lore.kernel.org/netdev/20230703170054.2152662-1-sumang@marvell.com/)** + +> Due to hardware limitation, MCAM drop rule with +> ether_type == 802.1Q and vlan_id == 0 is not supported. Hence rejecting +> such rules. +> + +**[v1: netconsole: Append kernel version to message](http://lore.kernel.org/netdev/20230703154155.3460313-1-leitao@debian.org/)** + +> Create a new netconsole Kconfig option that prepends the kernel version in +> the netconsole message. This is useful to map kernel messages to kernel +> version in a simple way, i.e., without checking somewhere which kernel +> version the host that sent the message is using. 
+> + +**[v2: nf: netfilter: conntrack: Avoid nf_ct_helper_hash uses after free](http://lore.kernel.org/netdev/20230703145216.1096265-1-revest@chromium.org/)** + +> If nf_conntrack_init_start() fails (for example due to a +> register_nf_conntrack_bpf() failure), the nf_conntrack_helper_fini() +> clean-up path frees the nf_ct_helper_hash map. +> + +**[v1: vdpa: reject F_ENABLE_AFTER_DRIVER_OK if backend does not support it](http://lore.kernel.org/netdev/20230703142218.362549-1-eperezma@redhat.com/)** + +> With the current code it is accepted as long as userland send it. +> +> Although userland should not set a feature flag that has not been +> offered to it with VHOST_GET_BACKEND_FEATURES, the current code will not +> complain for it. +> + +**[v1: Add a driver for the Marvell 88Q2110 PHY](http://lore.kernel.org/netdev/20230703124440.391970-1-eichest@gmail.com/)** + +> Add support for 1000BASE-T1 to the phy_device driver and add a first +> + +**[[net PATCH] octeontx2-af: Install TC filter rules in hardware based on priority](http://lore.kernel.org/netdev/20230703120536.2148918-1-sumang@marvell.com/)** + +> As of today, hardware does not support installing tc filter +> rules based on priority. This patch fixes the issue and installs +> the hardware rules based on priority. The final hardware rules +> will not be dependent on rule installation order, it will be strictly +> priority based, same as software. +> + +**[v1: net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX](http://lore.kernel.org/netdev/20230703110842.590282-1-linma@zju.edu.cn/)** + +> The attribute TCA_PEDIT_PARMS_EX is not included in pedit_policy and +> one malicious user could fake a TCA_PEDIT_PARMS_EX whose length is +> smaller than the intended sizeof(struct tc_pedit). Hence, the +> dereference in tcf_pedit_init() could access dirty heap data. 
+> + +**[[net PATCH V2] octeontx2-pf: Add additional check for MCAM rules.](http://lore.kernel.org/netdev/20230703095600.2048397-1-sumang@marvell.com/)** + +> Due to hardware limitation, MCAM drop rule with +> ether_type == 802.1Q and vlan_id == 0 is not supported. Hence rejecting +> such rules. +> + +**[v1: I3C MCTP net driver](http://lore.kernel.org/netdev/20230703053048.275709-1-matt@codeconstruct.com.au/)** + +> This series adds an I3C transport for the kernel's MCTP network +> protocol. MCTP is a communication protocol between system components +> (BMCs, drives, NICs etc), with higher level protocols such as NVMe-MI or +> PLDM built on top of it (in userspace). It runs over various transports +> such as I2C, PCIe, or I3C. +> + +**[v4: wifi:mac80211: Replace the ternary conditional operator with conditional-statements](http://lore.kernel.org/netdev/20230703030200.1067-1-youkangren@vivo.com/)** + +> Replacing ternary conditional operators with conditional statements +> ensures proper expression of meaning while making it easier for +> the compiler to generate code. +> + +**[v5: vsock: MSG_ZEROCOPY flag support](http://lore.kernel.org/netdev/20230701063947.3422088-1-AVKrasnov@sberdevices.ru/)** + +> Difference with copy way is not significant. During packet allocation, +> non-linear skb is created and filled with pinned user pages. +> There are also some updates for vhost and guest parts of transport - in +> both cases i've added handling of non-linear skb for virtio part. vhost +> copies data from such skb to the guest's rx virtio buffers. In the guest, +> virtio transport fills tx virtio queue with pages from skb. 
+> + +**[v5: vsock: enable setting SO_ZEROCOPY](http://lore.kernel.org/netdev/20230701062310.3397129-14-AVKrasnov@sberdevices.ru/)** + +> For AF_VSOCK, zerocopy tx mode depends on transport, so this option must +> be set in AF_VSOCK implementation where transport is accessible (if +> transport is not set during setting SO_ZEROCOPY: for example socket is +> not connected, then SO_ZEROCOPY will be enabled, but once transport will +> be assigned, support of this type of transmission will be checked). +> + +**[v1: selftests/net: Add xt_policy config for xfrm_policy test](http://lore.kernel.org/netdev/20230701044103.1096039-1-daniel.diaz@linaro.org/)** + +> This is because IPsec "policy" match support is not available +> to the kernel. +> +> This patch adds CONFIG_NETFILTER_XT_MATCH_POLICY as a module +> to the selftests/net/config file, so that `make +> kselftest-merge` can take this into consideration. +> + +**[v1: Add virtio_rtc module and related changes](http://lore.kernel.org/netdev/20230630171052.985577-1-peter.hilber@opensynergy.com/)** + +> This patch series adds the virtio_rtc module, and related bugfixes and +> small interface extensions. The virtio_rtc module implements a driver +> compatible with the proposed Virtio RTC device specification [1]. The +> Virtio RTC (Real Time Clock) device provides information about current +> time. The device can provide different clocks, e.g. for the UTC or TAI time +> standards, or for physical time elapsed since some past epoch. The driver +> can read the clocks with simple or more accurate methods. +> + +#### Security Hardening + +**[v1: pstore: Replace crypto API compression with zlib calls](http://lore.kernel.org/linux-hardening/20230704135211.2471371-1-ardb@kernel.org/)** + +> The pstore layer implements support for compression of kernel log +> output, using a variety of compression algorithms provided by the +> [deprecated] crypto API 'comp' interface. 
+> +> This appears to have been somebody's pet project rather than a solution +> to a real problem: the original deflate compression is reasonably fast, +> compressed well and is comparatively small in terms of code footprint, +> and so the flexibility that the crypto API integration provides does +> little more than complicate the code for no reason. +> + +**[v1: Revert "fortify: Allow KUnit test to build without FORTIFY"](http://lore.kernel.org/linux-hardening/20230703220210.never.615-kees@kernel.org/)** + +> The standard for KUnit is to not build tests at all when required +> functionality is missing, rather than doing test "skip". Restore this +> for the fortify tests, so that architectures without +> CONFIG_ARCH_HAS_FORTIFY_SOURCE do not emit unsolvable warnings. +> + +**[v1: wifi: mt76: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230703181256.3712079-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +**[v1: kobject: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230703180528.3709258-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). 
+> + +**[v1: kyber, blk-wbt: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230703172159.3668349-1-azeemshaikh38@gmail.com/)** + +> This patch series replaces strlcpy in the kyber and blk-wbt tracing subsystems wherever trivial +> replacement is possible, i.e return value from strlcpy is unused. The patches +> themselves are independent of each other and are applied to different subsystems. They are +> included as a series for ease of review. +> + +**[v1: perf: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230703165817.2840457-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> No return values were used, so direct replacement is safe. +> + +**[v1: next: media: venus: Use struct_size_t() helper in pkt_session_unset_buffers()](http://lore.kernel.org/linux-hardening/ZKBfoqSl61jfpO2r@work/)** + +> Prefer struct_size_t() over struct_size() when no pointer instance +> of the structure type is present. +> + +**[v2: pid: Replace struct pid 1-element array with flex-array](http://lore.kernel.org/linux-hardening/20230630180418.gonna.286-kees@kernel.org/)** + +> For pid namespaces, struct pid uses a dynamically sized array member, +> "numbers". This was implemented using the ancient 1-element fake flexible +> array, which has been deprecated for decades. Replace it with a C99 +> flexible array, refactor the array size calculations to use struct_size(), +> and address elements via indexes. Note that the static initializer (which +> defines a single element) works as-is, and requires no special handling. 
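
The one-element-array to flexible-array conversion described in the `struct pid` entry above follows a common pattern, sketched here in plain userspace C. Everything in the block is an illustrative assumption: `struct upid_model`, `struct pid_old`, `struct pid_new`, `pid_new_size()`, `pid_new_alloc()` and `flex_array_demo()` are hypothetical names, the size helper is only a simplified stand-in for the kernel's `struct_size()`, and the sketch assumes that a pid at namespace depth `level` needs `level + 1` entries.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-namespace entry. */
struct upid_model {
	int nr;
};

/* Deprecated idiom: a fake flexible array declared with one element. */
struct pid_old {
	unsigned int level;
	struct upid_model numbers[1];
};

/* C99 flexible array member: the array adds nothing to sizeof(struct pid_new). */
struct pid_new {
	unsigned int level;
	struct upid_model numbers[];
};

/*
 * Simplified model of a struct_size()-style helper: header plus `count`
 * trailing elements, saturating to SIZE_MAX instead of wrapping around.
 */
static size_t pid_new_size(size_t count)
{
	if (count > (SIZE_MAX - sizeof(struct pid_new)) / sizeof(struct upid_model))
		return SIZE_MAX;
	return sizeof(struct pid_new) + count * sizeof(struct upid_model);
}

/* Allocate a zeroed object with level + 1 entries (assumption stated above). */
static struct pid_new *pid_new_alloc(unsigned int level)
{
	struct pid_new *p = calloc(1, pid_new_size((size_t)level + 1));

	if (p)
		p->level = level;
	return p;
}

/* Usage: elements are addressed by index, with no "minus one" size arithmetic. */
void flex_array_demo(void)
{
	struct pid_new *p = pid_new_alloc(2);

	assert(p != NULL);
	p->numbers[2].nr = 42;
	assert(p->numbers[2].nr == 42);
	assert(pid_new_size(3) ==
	       sizeof(struct pid_new) + 3 * sizeof(struct upid_model));
	free(p);
}
```

With the one-element form, `sizeof(struct pid_old)` already counts one entry, so every size calculation has to subtract it again; the flexible array member removes that off-by-one hazard and lets bounds checkers see the real array length.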
+> + +**[[GIT PULL v2] flexible-array transformations for 6.5-rc1](http://lore.kernel.org/linux-hardening/ZJ8C4PtPrxr6LTA7@work/)** + +> The following changes since commit f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6: +> +> Linux 6.4-rc2 (2023-05-14 12:51:40 -0700) +> +> are available in the Git repository at: +> +> git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux.git tags/flex-array-transformations-6.5-rc1 +> + +**[v3: Add documentation for sysctl vm.memfd_noexec](http://lore.kernel.org/linux-hardening/20230630032535.625390-1-jeffxu@google.com/)** + +> Add documentation for sysctl vm.memfd_noexec +> +> Thanks to Dominique Martinet who reported this. +> see [1] for context. +> +> [1] https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5a8MaQY3Jw@mail.gmail.com/T/ +> + +**[v1: usb: ch9: Replace bmSublinkSpeedAttr 1-element array with flexible array](http://lore.kernel.org/linux-hardening/20230629190900.never.787-kees@kernel.org/)** + +> Since commit df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3"), +> UBSAN_BOUNDS no longer pretends 1-element arrays are unbounded. Walking +> bmSublinkSpeedAttr will trigger a warning, so make it a proper flexible +> array. Add a union to keep the struct size identical for userspace in +> case anything was depending on the old size. +> + +**[v1: next: scsi: aacraid: Replace one-element array with flexible-array member in struct user_sgmap](http://lore.kernel.org/linux-hardening/2ebb702f25c4764fb36ab29f4f40728e12b0e42b.1687974498.git.gustavoars@kernel.org/)** + +> Replace one-element array with flexible-array member in struct +> user_sgmap and refactor the rest of the code, accordingly. +> +> Issue found with the help of Coccinelle and audited and fixed, +> manually. +> +> This results in no differences in binary output. 
+> + +**[v1: next: scsi: aacraid: Use struct_size() helper in code related to struct sgmapraw](http://lore.kernel.org/linux-hardening/be2e5ecf1c4410ab419e2290341fbc8a0e2ba963.1687974498.git.gustavoars@kernel.org/)** + +> Prefer struct_size() over open-coded versions. +> + +**[v1: next: scsi: aacraid: Use struct_size() helper in aac_get_safw_ciss_luns()](http://lore.kernel.org/linux-hardening/cd80ea8f2446fe62ec15ffb0bbcecb69e0c342af.1687974498.git.gustavoars@kernel.org/)** + +> Prefer struct_size() over open-coded versions. +> +> This results in no differences in binary output. +> + +**[v1: next: scsi: aacraid: Replace one-element arrays with flexible-array members](http://lore.kernel.org/linux-hardening/cover.1687974498.git.gustavoars@kernel.org/)** + +> This series aims to replace one-element arrays with flexible-array +> members in multiple structures in drivers/scsi/aacraid/aacraid.h. +> +> This helps with the ongoing efforts to globally enable -Warray-bounds +> and get us closer to being able to tighten the FORTIFY_SOURCE routines +> on memcpy(). +> +> These issues were found with the help of Coccinelle and audited and fixed, +> manually. +> + +**[GIT PULL: flexible-array transformations for 6.5-rc1](http://lore.kernel.org/linux-hardening/ZJxZJDUDs1ry84Rc@work/)** + +> The following changes since commit f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6: +> +> Linux 6.4-rc2 (2023-05-14 12:51:40 -0700) +> +> are available in the Git repository at: +> +> git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux.git tags/flex-array-transformations-6.5-rc1 +> + +**[v1: pstore: ramoops: support pmsg size larger than kmalloc limitation](http://lore.kernel.org/linux-hardening/20230627202540.881909-2-yuxiaozhang@google.com/)** + +> Current pmsg implementation is using kmalloc for pmsg record buffer, +> which has max size limits based on page size. 
Currently, even if we +> allocate enough space with pmsg-size, pmsg will still fail if the +> file size is larger than what kmalloc allows. +> + +**[v4: Randomized slab caches for kmalloc()](http://lore.kernel.org/linux-hardening/20230626031835.2279738-1-gongruiqi@huaweicloud.com/)** + +> When exploiting memory vulnerabilities, "heap spraying" is a common +> technique targeting those related to dynamic memory allocation (i.e. the +> "heap"), and it plays an important role in a successful exploitation. +> Basically, it is to overwrite the memory area of a vulnerable object by +> triggering allocation in other subsystems or modules and therefore +> getting a reference to the targeted memory location. It's usable on +> various types of vulnerability, including use after free (UAF), heap +> out-of-bounds write, etc. +> + +#### Asynchronous I/O + +**[v3: Add a sysctl to disable io_uring system-wide](http://lore.kernel.org/io-uring/20230630151003.3622786-1-matteorizzo@google.com/)** + +> Over the last few years we've seen many critical vulnerabilities in +> io_uring[1] which could be exploited by an unprivileged process to gain +> control over the kernel. This patch introduces a new sysctl which disables +> the creation of new io_uring instances system-wide. +> + +**[v1: io_uring: Add {} to maintain consistency in code format](http://lore.kernel.org/io-uring/20230630062512.10724-1-luhongfei@vivo.com/)** + +> In io_issue_sqe, the if (ret == IOU_OK) branch uses {}, so to maintain code +> format consistency, it is better to add {} in the else branch. +> + +**[v4: io_uring: Add io_uring command support for sockets](http://lore.kernel.org/io-uring/20230627134424.2784797-1-leitao@debian.org/)** + +> Enable io_uring commands on network sockets. Create two new +> SOCKET_URING_OP commands that will operate on sockets. 
+> +> In order to call ioctl on sockets, use the file_operations->io_uring_cmd +> callbacks, and map it to a uring socket function, which handles the +> SOCKET_URING_OP accordingly, and calls socket ioctls. +> + +#### Rust For Linux + +**[v1: rust: types: make `Opaque` be `!Unpin`](http://lore.kernel.org/rust-for-linux/20230630150216.109789-1-benno.lossin@proton.me/)** + +> Adds a `PhantomPinned` field to `Opaque`. This removes the last Rust +> guarantee: the assumption that the type `T` can be freely moved. This is +> not the case for many types from the C side (e.g. if they contain a +> `struct list_head`). This change removes the need to add a +> `PhantomPinned` field manually to Rust structs that contain C structs +> which must not be moved. +> + +**[v1: rust: macros: add `paste!` proc macro](http://lore.kernel.org/rust-for-linux/20230628171108.1150742-1-gary@garyguo.net/)** + +> This macro provides a flexible way to concatenated identifiers together +> and it allows the resulting identifier to be used to declare new items, +> which `concat_idents!` does not allow. It also allows identifiers to be +> transformed before concatenated. +> + +**[v1: rust: build: Define MODULE macro iif the CONFIG_MODULES is enabled](http://lore.kernel.org/rust-for-linux/20230627121422.112246-1-wangrui@loongson.cn/)** + +> The LoongArch does not currently support modules when built with clang. 
+> A pre-processor error is expected on building modules, that's caused by: +> +> #if defined(MODULE) && defined(CONFIG_AS_HAS_EXPLICIT_RELOCS) +> # if __has_attribute(model) +> # define PER_CPU_ATTRIBUTES __attribute__((model("extreme"))) +> # else +> # error compiler support for the model attribute is necessary when a recent assembler is used +> # endif +> #endif +> + +**[v2: rust: alloc: Add realloc and alloc_zeroed to the GlobalAlloc impl](http://lore.kernel.org/rust-for-linux/20230625232528.89306-1-boqun.feng@gmail.com/)** + +> While there are default impls for these methods, using the respective C +> api's is faster. Currently neither the existing nor these new +> GlobalAlloc method implementations are actually called. Instead the +> __rust_* function defined below the GlobalAlloc impl are used. With +> rustc 1.71 these functions will be gone and all allocation calls will go +> through the GlobalAlloc implementation. +> + +**[v1: Rust device mapper abstractions](http://lore.kernel.org/rust-for-linux/20230625121657.3631109-1-changxian.cqs@antgroup.com/)** + +> This is a version of device mapper abstractions. Based on +> these, we also implement a linear target as a PoC. +> Any suggestions are welcomed, thanks! +> + +#### BPF + +**[v3: um: vector: Replace undo_user_init in old code with out_free_netdev](http://lore.kernel.org/bpf/20230704042942.3984-1-duminjie@vivo.com/)** + +> Thanks for your response and suggestions, +> I made some mistakes. This is a resubmitted patch. +> I got some errors with my local repository, +> so I lost the commit SHA-1 ID. +> + +**[v9: bpf-next: selftests/bpf: Add benchmark for bpf memory allocator](http://lore.kernel.org/bpf/20230704025039.938914-1-houtao@huaweicloud.com/)** + +> The benchmark could be used to compare the performance of hash map +> operations and the memory usage between different flavors of bpf memory +> allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). 
It also +> could be used to check the performance improvement or the memory saving +> provided by optimization. +> + +**[v1: bpf, net: Allow setting SO_TIMESTAMPING* from BPF](http://lore.kernel.org/bpf/20230703175048.151683-1-jthinz@mailbox.tu-berlin.de/)** + +> BPF applications, e.g., a TCP congestion control, might benefit from +> precise packet timestamps. These timestamps are already available in +> __sk_buff and bpf_sock_ops, but could not be requested: A BPF program +> was not allowed to set SO_TIMESTAMPING* on a socket. This change enables +> BPF programs to actively request the generation of timestamps from a +> stream socket. +> + +**[v1: x86/BPF: Add new BPF helper call bpf_rdtsc](http://lore.kernel.org/bpf/20230703105745.1314475-1-tero.kristo@linux.intel.com/)** + +> This patch series adds a new x86 arch specific BPF helper, bpf_rdtsc() +> which can be used for reading the hardware time stamp counter (TSC.) +> Currently the same counter is directly accessible from userspace +> (using RDTSC instruction), and kernel space using various rdtsc_*() +> APIs, however eBPF lacks the support. +> + +**[v1: fs: Add kfuncs to handle idmapped mounts](http://lore.kernel.org/bpf/c35fbb4cb0a3a9b4653f9a032698469d94ca6e9c.1688123230.git.legion@kernel.org/)** + +> Since the introduction of idmapped mounts, file handling has become +> somewhat more complicated. If the inode has been found through an +> idmapped mount the idmap of the vfsmount must be used to get proper +> i_uid / i_gid. This is important, for example, to correctly take into +> account idmapped files when caching, LSM or for an audit. +> + +**[[v3 PATCH bpf-next 0/6] bpf: add percpu stats for bpf_map](http://lore.kernel.org/bpf/20230630082516.16286-1-aspsk@isovalent.com/)** + +> This series adds a mechanism for maps to populate per-cpu counters on +> insertions/deletions. The sum of these counters can be accessed by a new kfunc +> from map iterator and tracing programs. 
+> + +**[v5: RFC: introduce page_pool_alloc() API](http://lore.kernel.org/bpf/20230629120226.14854-1-linyunsheng@huawei.com/)** + +> In [1] & [2] & [3], there are usecases for veth and virtio_net +> to use frag support in page pool to reduce memory usage, and it +> may request different frag size depending on the head/tail +> room space for xdp_frame/shinfo and mtu/packet size. When the +> requested frag size is large enough that a single page can not +> be split into more than one frag, using frag support only have +> performance penalty because of the extra frag count handling +> for frag support. +> + +**[v1: bpf-next: bpf: Support new insns from cpu v4](http://lore.kernel.org/bpf/20230629063715.1646832-1-yhs@fb.com/)** + +> This patch set added kernel support for insns proposed in [1] except +> BPF_ST which already has full kernel support. Beside the above proposed +> insns, LLVM will generate BPF_ST insn as well under -mcpu=v4 ([2]). +> +> The patchset implements interpreter and jit support for these new +> insns. It has minimum verifier support in order to pass bpf selftests. +> More work will be required to cover verification and other aspects +> (e.g. blinding, etc.). +> + +**[[PATCH RESEND v3 bpf-next 00/14] BPF token](http://lore.kernel.org/bpf/20230629051832.897119-1-andrii@kernel.org/)** + +> This patch set introduces new BPF object, BPF token, which allows to delegate +> a subset of BPF functionality from privileged system-wide daemon (e.g., +> systemd or any other container manager) to a *trusted* unprivileged +> application. Trust is the key here. This functionality is not about allowing +> unconditional unprivileged BPF usage. 
Establishing trust, though, is +> completely up to the discretion of respective privileged application that +> would create a BPF token, as different production setups can and do achieve it +> through a combination of different means (signing, LSM, code reviews, etc), +> and it's undesirable and infeasible for kernel to enforce any particular way +> of validating trustworthiness of particular process. +> + +**[v1: fprobe: Ensure running fprobe_exit_handler() finished before calling rethook_free()](http://lore.kernel.org/bpf/168796344232.46347.7947681068822514750.stgit@devnote2/)** + +> Ensure running fprobe_exit_handler() has finished before +> calling rethook_free() in the unregister_fprobe() so that caller can free +> the fprobe right after unregister_fprobe(). +> +> unregister_fprobe() ensured that all running fprobe_entry/exit_handler() +> have finished by calling unregister_ftrace_function() which synchronizes +> RCU. But commit 5f81018753df ("fprobe: Release rethook after the ftrace_ops +> is unregistered") changed to call rethook_free() after +> unregister_ftrace_function(). So call rethook_stop() to make rethook +> disabled before unregister_ftrace_function() and ensure it again. +> + +**[v8: bpf-next: bpf, x86: allow function arguments up to 12 for TRACING](http://lore.kernel.org/bpf/20230627115319.13128-1-imagedong@tencent.com/)** + +> Therefore, let's enhance it by increasing the function arguments count +> allowed in arch_prepare_bpf_trampoline(), for now, only x86_64. +> +> In the 1st patch, we save/restore regs with BPF_DW size to make the code +> in save_regs()/restore_regs() simpler. +> +> In the 2nd patch, we make arch_prepare_bpf_trampoline() support to copy +> function arguments in stack for x86 arch. Therefore, the maximum +> arguments can be up to MAX_BPF_FUNC_ARGS for FENTRY, FEXIT and +> MODIFY_RETURN. Meanwhile, we clean the potential garbage value when we +> copy the arguments on-stack. 
+> + +**[v1: bpf-next: Support defragmenting IPv(4|6) packets in BPF](http://lore.kernel.org/bpf/cover.1687819413.git.dxu@dxuuu.xyz/)** + +> In the context of a middlebox, fragmented packets are tricky to handle. +> The full 5-tuple of a packet is often only available in the first +> fragment which makes enforcing consistent policy difficult. +> So stateful tracking is the only sane option. RFC 8900 [0] calls this +> out as well in section 6.3: +> +> Middleboxes [...] should process IP fragments in a manner that is +> consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes +> must maintain state in order to achieve this goal. +> + +**[v1: Interest in additional endianness documentation](http://lore.kernel.org/bpf/CADx9qWgHCC4MML2d+mq25-aeTn+20qxjeTZSHMGPQrMq65a+bQ@mail.gmail.com/)** + +> Thank you to everyone in the community for building/working on such a +> great tool! I am helping build a userspace implementation of eBPF and +> following Dave's standardization process closely. 
+> + +### Related Technology Updates + +#### U-Boot + +**[u-boot compilation failure for Sifive unmatched board](http://lore.kernel.org/u-boot/CAK1XJzWofn5+OE7qCyf3nTb+hevULoAVss5dG2OzwbMh0C=YVA@mail.gmail.com/)** + +> This is Satish, compiling u-boot code based on the reference page: +> https://github.com/carlosedp/riscv-bringup/blob/master/unmatched/Readme.md#install-toolchain-to-build-kernel +> +> u-boot is failing with the following commit id & its tag is +> commit d637294e264adfeb29f390dfc393106fd4d41b17 (HEAD, tag: v2022.01) +> + +**[Pull request: u-boot-rockchip-20230629](http://lore.kernel.org/u-boot/20230629121342.72391-1-kever.yang@rock-chips.com/)** + +> Please pull the fixes for rockchip platform: +> - rockchip inno phy fix; +> - pinctrl driver in SPL abort in specific case; +> - fix IO port voltage for rock5b-rk3588 board; +> +> CI: +> https://source.denx.de/u-boot/custodians/u-boot-rockchip/-/pipelines/16732 +> + +**[Trying to boot JH7110 RISCV-V CPU from MMC](http://lore.kernel.org/u-boot/e6cd461d-151e-3557-f58a-6118836f8e6a@ruabmbua.dev/)** + +> I am trying to use upstream u-boot + opensbi, to boot my visionfive2 SBC +> I got from external SD card. +> + +**[v1: riscv: sifive: fu70: downclock CPU clock for stability](http://lore.kernel.org/u-boot/20230628081530.3184607-1-uwu@icenowy.me/)** + +> When building the package `rustc` for AOSC OS on HiFive Unmatched, +> random SIGSEGV prevents the package from getting correctly built. +> Downclocking the CPU PLL clock seems to allow rustc to be built, +> although taking much more time. +> + +## 20230625: Issue 51 + +### Kernel Updates + +#### RISC-V Architecture Support + +**[v1: Allwinner R329/D1/R528/T113s Dual/Quad SPI modes support](http://lore.kernel.org/linux-riscv/20230624131632.2972546-1-bigunclemax@gmail.com/)** + +> This series extends the previous https://lore.kernel.org/all/20230510081121.3463710-1-bigunclemax@gmail.com +> And adds support for Dual and Quad SPI modes for the listed SoCs. 
+> Both modes have been tested on the T113s and should work on +> other Allwinner SoCs that have a similar SPI controller. +> It may also work for previous SoCs that support Dual/Quad modes. +> Among them are the H6 and H616. +> + +**[v1: Add support to handle misaligned accesses in S-mode](http://lore.kernel.org/linux-riscv/20230624122049.7886-1-cleger@rivosinc.com/)** + +> Since commit 61cadb9 ("Provide new description of misaligned load/store +> behavior compatible with privileged architecture.") in the RISC-V ISA +> manual, it is stated that misaligned load/store might not be supported. +> However, the RISC-V kernel uABI describes that misaligned accesses are +> supported. In order to support that, this series adds support for S-mode +> handling of misaligned accesses, SBI call for misaligned trap delegation +> as well as prctl support for PR_SET_UNALIGN. +> + +**[v1: riscv: Select HAVE_ARCH_USERFAULTFD_MINOR](http://lore.kernel.org/linux-riscv/20230624060321.3401504-1-samuel.holland@sifive.com/)** + +> This allocates the VM flag needed to support the userfaultfd minor fault +> functionality. Because the flag bit is >= bit 32, it can only be enabled +> for 64-bit kernels. See commit 7677f7fd8be7 ("userfaultfd: add minor +> fault registration mode") for more information. +> + +**[v2: Add support for Allwinner PWM on D1/T113s/R329 SoCs](http://lore.kernel.org/linux-riscv/20230623150012.1201552-1-privatesub2@gmail.com/)** + +> This series adds support for PWM controller on new +> Allwinner's SoCs, such as D1, T113s and R329. The implemented driver +> provides basic functionality for controlling PWM channels. +> + +**[v5: Risc-V Svinval support](http://lore.kernel.org/linux-riscv/20230623123849.1425805-1-mchitale@ventanamicro.com/)** + +> This patch adds support for the Svinval extension as defined in the +> RISC-V Privileged specification. 
+> + +**[v4: RISCV: Add KVM_GET_REG_LIST API](http://lore.kernel.org/linux-riscv/cover.1687515463.git.haibo1.xu@intel.com/)** + +> KVM_GET_REG_LIST will dump all register IDs that are available to +> KVM_GET/SET_ONE_REG and It's very useful to identify some platform +> regression issue during VM migration. +> + +**[v2: RISC-V: T-Head vector handling](http://lore.kernel.org/linux-riscv/20230622231305.631331-1-heiko@sntech.de/)** + +> As is widely known the T-Head C9xx cores used for example in the +> Allwinner D1 implement an older non-ratified variant of the vector spec. +> +> While userspace will probably have a lot more problems implementing +> support for both, on the kernel side the needed changes are actually +> somewhat small'ish and can be handled via alternatives somewhat nicely. +> + +**[v5: Split ptdesc from struct page](http://lore.kernel.org/linux-riscv/20230622205745.79707-1-vishal.moola@gmail.com/)** + +> The MM subsystem is trying to shrink struct page. This patchset +> introduces a memory descriptor for page table tracking - struct ptdesc. +> +> This patchset introduces ptdesc, splits ptdesc from struct page, and +> converts many callers of page table constructor/destructors to use ptdescs. +> + +**[v1: riscv: Discard vector state on syscalls](http://lore.kernel.org/linux-riscv/20230622173613.30722-1-bjorn@kernel.org/)** + +> The RISC-V vector specification states: +> Executing a system call causes all caller-saved vector registers +> (v0-v31, vl, vtype) and vstart to become unspecified. 
+> + +**[GIT PULL: KVM/riscv changes for 6.5](http://lore.kernel.org/linux-riscv/CAAhSdy1iT=SbjSvv_7SDygSo0HhmgLjD-y+DU1_Q+6tnki7w+A@mail.gmail.com/)** + +> We have the following KVM RISC-V changes for 6.5: +> 1) Redirect AMO load/store misaligned traps to KVM guest +> 2) Trap-n-emulate AIA in-kernel irqchip for KVM guest +> 3) Svnapot support for KVM Guest +> + +**[Patch "riscv: Link with '-z norelro'" has been added to the 6.3-stable tree](http://lore.kernel.org/linux-riscv/2023062221-moving-eastward-9967@gregkh/)** + +> This is a note to let you know that I've just added the patch titled +> +> riscv: Link with '-z norelro' +> +> to the 6.3-stable tree which can be found at: +> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary +> +> The filename of the patch is: +> riscv-link-with-z-norelro.patch +> and it can be found in the queue-6.3 subdirectory. +> + +**[v1: RISC-V: make ARCH_THEAD preclude XIP_KERNEL](http://lore.kernel.org/linux-riscv/20230621-panorama-stuffing-f24b26546972@spud/)** + +> Randy reported build errors in linux-next where XIP_KERNEL was enabled. +> ARCH_THEAD requires alternatives to support the non-standard ISA +> extensions used by the THEAD cores, which are mutually exclusive with +> XIP kernels. Clone the dependency list from the Allwinner entry, since +> Allwinner's D1 uses T-Head cores with the same non-standard extensions. +> + +**[v1: 6.3: riscv: Link with '-z norelro'](http://lore.kernel.org/linux-riscv/20230620-6-3-fix-got-relro-error-lld-v1-1-f3e71ec912d1@kernel.org/)** + +> This patch fixes a stable only patch, so it has no direct upstream +> equivalent. 
+> +> After a stable only patch to explicitly handle the '.got' section to +> handle an orphan section warning from the linker, certain configurations +> error when linking with ld.lld, which enables relro by default: +> +> ld.lld: error: section: .got is not contiguous with other relro sections +> + +**[GIT PULL: RISC-V Devicetrees for v6.5 Part 2](http://lore.kernel.org/linux-riscv/20230620-fidelity-variety-60b47c889e31@spud/)** + +> Please pull a second part, if it is not too late for v6.5. +> This lot is based on top of v6.4-rc2, because Randy & Linus did a rejig +> of the MAINTAINERS file. As a result, the diff below includes what was +> in the previous PR. Wasn't sure if there was a request-pull incantation +> to exclude what was in PR #1 (I guess I'd have to do a local merge of my +> first PR & then use that as the base for the request-pull command?) +> + +**[v3: RISC-V: Document that V registers are clobbered on syscalls](http://lore.kernel.org/linux-riscv/20230619190142.26498-1-palmer@rivosinc.com/)** + +> This is included in the ISA manual, but it's pretty common for bits of +> the ISA manual that are actually ABI to change. So let's document it +> explicitly. +> + +**[v8: Add support for Allwinner GPADC on D1/T113s/R329/T507 SoCs](http://lore.kernel.org/linux-riscv/20230619154252.3951913-1-bigunclemax@gmail.com/)** + +> This series adds support for general purpose ADC (GPADC) on new +> Allwinner's SoCs, such as D1, T113s, T507 and R329. The implemented driver +> provides basic functionality for getting ADC channels data. +> + +**[v4: tools/nolibc: add a new syscall helper](http://lore.kernel.org/linux-riscv/cover.1687187451.git.falcon@tinylab.org/)** + +> Thanks very much for your kindly review. +> +> This is the revision of v3 "tools/nolibc: add a new syscall helper" [1], +> this mainly applies the suggestion from David in this reply [2] and +> rebased everything on the dev.2023.06.14a branch of linux-rcu [3]. 
+> + +**[v5: nolibc: add part2 of support for rv32](http://lore.kernel.org/linux-riscv/cover.1687176996.git.falcon@tinylab.org/)** + +> This is the revision of the v4 part2 of support for rv32 [1], this +> further split the generic KARCH code out of the old rv32 compile patch +> and also add kernel specific KARCH and nolibc specific NARCH for +> tools/include/nolibc/Makefile too. +> +> This is rebased on the dev.2023.06.14a branch of linux-rcu repo [2] with +> basic run-user and run tests. +> + +**[v7: Add JH7110 USB PHY driver support](http://lore.kernel.org/linux-riscv/20230619094759.21013-1-minda.chen@starfivetech.com/)** + +> This patchset adds USB and PCIe PHY for the StarFive JH7110 SoC. +> The patch has been tested on the VisionFive 2 board. +> + +**[v3: Add initialization of clock for StarFive JH7110 SoC](http://lore.kernel.org/linux-riscv/20230619083517.415597-1-william.qiu@starfivetech.com/)** + +> This patchset adds initial rudimentary support for the StarFive +> Quad SPI controller driver. And this driver will be used in +> StarFive's VisionFive 2 board. In 6.4, the QSPI_AHB and QSPI_APB +> clocks changed from the default ON state to the default OFF state, +> so these clocks need to be enabled in the driver.At the same time, +> dts patch is added to this series. +> + +**[v1: kdump: add generic functions to simplify crashkernel crashkernel in architecture](http://lore.kernel.org/linux-riscv/20230619055951.45620-1-bhe@redhat.com/)** + +> In the current arm64, crashkernel=,high support has been finished after +> several rounds of posting and careful reviewing. The code in arm64 which +> parses crashkernel kernel parameters firstly, then reserve memory can be +> a good example for other ARCH to refer to. 
+> + +**[v1: riscv: dts: sort makefile entries by directory](http://lore.kernel.org/linux-riscv/20230617-stimulant-untainted-3fa1955d386f@spud/)** + +> New additions to the list have tried to respect alphanumeric ordering, +> but the thing was out of order to start with. Sort it. +> + +**[v3: Add Sipeed Lichee Pi 4A RISC-V board support](http://lore.kernel.org/linux-riscv/20230617161529.2092-1-jszhang@kernel.org/)** + +> Sipeed's Lichee Pi 4A development board uses Lichee Module 4A core +> module which is powered by T-HEAD's TH1520 SoC. Add minimal device +> tree files for the core module and the development board. +> + +#### Process Scheduling + +**[v1: Sched/fair: Block nohz tick_stop when cfs bandwidth in use](http://lore.kernel.org/lkml/20230622132751.2900081-1-pauld@redhat.com/)** + +> CFS bandwidth limits and NOHZ full don't play well together. Tasks +> can easily run well past their quotas before a remote tick does +> accounting. This leads to long, multi-period stalls before such +> tasks can run again. Currently, when presented with these conflicting +> requirements the scheduler is favoring nohz_full and letting the tick +> be stopped. However, nohz tick stopping is already best-effort, there +> are a number of conditions that can prevent it, whereas cfs runtime +> bandwidth is expected to be enforced. +> + +**[v3: sched/isolation: add a workqueue parameter onto isolcpus to constrain unbound CPUs](http://lore.kernel.org/lkml/20230622032133.GA29012@didi-ThinkCentre-M930t-N000/)** + +> Motivation of doing this is to better improve boot times for devices when +> we want to prevent our workqueue works from running on some specific CPUs, +> i.e., some CPUs are busy with interrupts. +> + +**[v2: sched/cputime: Make IRQ time accounting configurable at boot time](http://lore.kernel.org/lkml/20230620141002.23914-1-bvanassche@acm.org/)** + +> IRQ time accounting reduces performance by 40% for some block storage +> workloads on Android. 
Despite this, some producers of Android devices +> want to keep IRQ time accounting enabled. +> + +#### Memory Management + +**[Re: v1: mm: vmscan: export func:shrink_slab](http://lore.kernel.org/linux-mm/TYZPR02MB55950D4E176AEEB2FF8EDF6DC621A@TYZPR02MB5595.apcprd02.prod.outlook.com/)** + +> >>> On 16.06.23 11:21, lipeifeng@oppo.com wrote: +> >>> +> >>> Some shrinkers during shrink_slab would enter synchronous-wait due +> >>> to lock or other reasons, which would cause kswapd or direct_reclaim +> >>> to be blocked. +> >>> +> >>> This patch exports shrink_slab so that it can be called in drivers +> >>> which can shrink memory independently. +> >>> + +**[v1: memblock: report failures when memblock_can_resize is not set](http://lore.kernel.org/linux-mm/20230624032607.921173-1-songshuaishuai@tinylab.org/)** + +> The callers of memblock_reserve() do not check the return value +> presuming that memblock_reserve() always succeeds, but there are +> cases where it may fail. +> +> Having numerous memblock reservations at early boot where +> memblock_can_resize is unset may exhaust the INIT_MEMBLOCK_REGIONS sized +> memblock.reserved regions array and an attempt to double this array via +> memblock_double_array() will fail and will return -1 to the caller. +> + +**[v1: memblock: Introduce memblock_reserve_node()](http://lore.kernel.org/linux-mm/20230624024622.2959376-1-yajun.deng@linux.dev/)** + +> It only returns address now in memblock_find_in_range_node(), we can add a +> parameter pointing to integer for node id of the range, which can be used +> to pass the node id to the new reserve region. 
+> + +**[v2: seqlock,mm: lockdep annotation + write_seqlock_irqsave()](http://lore.kernel.org/linux-mm/20230623171232.892937-1-bigeasy@linutronix.de/)** + +> This has been a single patch (2/2) but then it was pointed out that the +> lockdep annotation in seqlock needs to be adjusted to fully close the +> printk window so that there is no printing after the seq-lock has been +> acquired and before printk_deferred_enter() takes effect. +> + +**[v2: Improve hugetlbfs read on HWPOISON hugepages](http://lore.kernel.org/linux-mm/20230623164015.3431990-1-jiaqiyan@google.com/)** + +> Today, when hardware memory is corrupted in a hugetlb hugepage, the +> kernel leaves the hugepage in pagecache [1]; otherwise future mmap or +> read will be subject to silent data corruption. This is implemented by +> returning -EIO from hugetlb_read_iter immediately if the hugepage has +> HWPOISON flag set. +> + +**[v2: elf: correct note name comment](http://lore.kernel.org/linux-mm/455b22b986de4d3bc6d9bfd522378e442943de5f.1687499411.git.baruch@tkos.co.il/)** + +> NT_PRFPREG note is named "CORE". Correct the comment accordingly. +> + +**[v1: zsmalloc: small compaction improvements](http://lore.kernel.org/linux-mm/20230623044016.366793-1-senozhatsky@chromium.org/)** + +> A tiny series that can reduce the number of +> find_alloced_obj() invocations (which perform a linear +> scan of sub-page) during compaction. Inspired by Alexey +> Romanov's findings. +> + +**[v1: Transparent Contiguous PTEs for User Mappings](http://lore.kernel.org/linux-mm/20230622144210.2623299-1-ryan.roberts@arm.com/)** + +> This is a series to opportunistically and transparently use contpte mappings +> (set the contiguous bit in ptes) for user memory when those mappings meet the +> requirements. It is part of a wider effort to improve performance of the 4K +> kernel with the aim of approaching the performance of the 16K kernel, but +> without breaking compatibility and without the associated increase in memory. 
It +> also benefits the 16K and 64K kernels by enabling 2M THP, since this is the +> contpte size for those kernels. +> + +**[v1: use refcount+RCU method to implement lockless slab shrink](http://lore.kernel.org/linux-mm/20230622085335.77010-1-zhengqi.arch@bytedance.com/)** + +> We used to implement the lockless slab shrink with SRCU [1], but then kernel +> test robot reported -88.8% regression in stress-ng.ramfs.ops_per_sec test +> case [2], so we reverted it [3]. +> +> This patch series aims to re-implement the lockless slab shrink using the +> refcount+RCU method proposed by Dave Chinner [4]. +> + +**[v1: udmabuf: Add back support for mapping hugetlb pages](http://lore.kernel.org/linux-mm/20230622072710.3707315-1-vivek.kasireddy@intel.com/)** + +> The first patch ensures that the mappings needed for handling mmap +> operation would be managed by using the pfn instead of struct page. +> The second patch restores support for mapping hugetlb pages where +> subpages of a hugepage are not directly used anymore (main reason +> for revert) and instead the hugetlb pages and the relevant offsets +> are used to populate the scatterlist for dma-buf export and for +> mmap operation. +> + +**[v1: RESEND: elf: correct note name comment](http://lore.kernel.org/linux-mm/a7e56e9c0f821348a4c833ac07e7518f457cbdb8.1687413763.git.baruch@tkos.co.il/)** + +> Only the NT_PRFPREG note is named "LINUX". Correct the comment +> accordingly. +> + +**[v2: mm: working set reporting](http://lore.kernel.org/linux-mm/20230621180454.973862-1-yuanchu@google.com/)** + +> RFC v1: https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.com/ +> For background and interfaces, see the RFC v1 posting. 
+> + +**[v1: mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().](http://lore.kernel.org/linux-mm/20230621104034.HT6QnNkQ@linutronix.de/)** + +> __build_all_zonelists() acquires zonelist_update_seq by first disabling +> interrupts via local_irq_save() and then acquiring the seqlock with +> write_seqlock(). This is troublesome and leads to problems on +> PREEMPT_RT because the inner spinlock_t is now acquired with disabled +> interrupts. +> The API provides write_seqlock_irqsave() which does the right thing in +> one step. +> printk_deferred_enter() has to be invoked in non-migrate-able context to +> ensure that deferred printing is enabled and disabled on the same CPU. +> This is the case after zonelist_update_seq has been acquired. +> + +**[v3: mm/min_free_kbytes: modify min_free_kbytes calculation rules](http://lore.kernel.org/linux-mm/20230621092048.5242-1-liuq131@chinatelecom.cn/)** + +> The current calculation of min_free_kbytes only uses ZONE_DMA and +> ZONE_NORMAL pages, but the ZONE_MOVABLE zone->_watermark[WMARK_MIN] +> will also divide part of min_free_kbytes. This will cause the min +> watermark of ZONE_NORMAL to be too small in the presence of ZONE_MOVABLE. +> + +**[v1: mm: page_alloc: use the correct type of list for free pages](http://lore.kernel.org/linux-mm/7e7ab533247d40c0ea0373c18a6a48e5667f9e10.1687333557.git.baolin.wang@linux.alibaba.com/)** + +> Commit bf75f200569d ("mm/page_alloc: add page->buddy_list and page->pcp_list") +> introduces page->buddy_list and page->pcp_list as a union with page->lru, but +> did not change get_page_from_free_area() to use page->buddy_list to clarify +> the correct type of list for a free page. 
+> + +#### File Systems + +**[v1: proc: proc_setattr for /proc/$PID/net](http://lore.kernel.org/linux-fsdevel/20230624-proc-net-setattr-v1-0-73176812adee@weissschuh.net/)** + +> /proc/$PID/net currently allows the setting of file attributes, +> in contrast to other /proc/$PID/ files and directories. +> +> This would break the nolibc testsuite so the first patch in the series +> removes the offending testcase. +> The "fix" for nolibc-test is intentionally kept trivial as the series +> will most likely go through the filesystem tree and if conflicts arise, +> it is obvious how to resolve them. +> + +**[v1: pipe: Make a partially-satisfied blocking read wait for more](http://lore.kernel.org/linux-fsdevel/2730511.1687559668@warthog.procyon.org.uk/)** + +> Can you consider merging something like the attached patch? Unfortunately, +> there are applications out there that depend on a read from pipe() waiting +> until the buffer is full under some circumstances. Patch a28c8b9db8a1 +> removed the conditionality on there being an attached writer. +> + +**[GIT PULL: vfs: mount](http://lore.kernel.org/linux-fsdevel/20230623-leise-anlassen-5499500f0ce0@brauner/)** + +> /* Summary */ +> This contains the work to extend move_mount() to allow adding a mount +> beneath the topmost mount of a mount stack. +> +> There are two LWN articles about this. One covers the original patch +> series in [1]. The other in [2] summarizes the session and roughly the +> discussion between Al and me at LSFMM. The second article also goes into +> some good questions from attendees. +> + +**[GIT PULL: vfs: file](http://lore.kernel.org/linux-fsdevel/20230623-waldarbeiten-normung-c160bb98bf10@brauner/)** + +> /* Summary */ +> This contains Amir's work to fix a long-standing problem where an +> unprivileged overlayfs mount can be used to avoid fanotify permission +> events that were requested for an inode or superblock on the underlying +> filesystem. 
+> + +**[GIT PULL: vfs: rename](http://lore.kernel.org/linux-fsdevel/20230623-gebacken-abenteuer-00d6913052b6@brauner/)** + +> /* Summary */ +> This contains the work from Jan to fix problems with cross-directory +> renames originally reported in [1]. +> +> To quickly sum it up some filesystems (so far we know at least about +> ext4, udf, f2fs, ocfs2, likely also reiserfs, gfs2 and others) need to +> lock the directory when it is being renamed into another directory. +> + +**[GIT PULL: vfs: misc](http://lore.kernel.org/linux-fsdevel/20230623-motor-quirlig-c6afec03aeb4@brauner/)** + +> * Use mode 0600 for file created by cachefilesd so it can be run by +> unprivileged users. This aligns them with directories which are +> already created with mode 0700 by cachefilesd. +> * Reorder a few members in struct file to prevent some false sharing +> scenarios. +> * Indicate that an eventfd is used a semaphore in the eventfd's fdinfo +> procfs file. +> * Add a missing uapi header for eventfd exposing relevant uapi defines. +> * Let the VFS protect transitions of a superblock from read-only to +> read-write in addition to the protection it already provides for +> transitions from read-write to read-only. Protecting read-only to +> read-write transitions allows filesystems such as ext4 to perform +> internal writes, keeping writers away until the transition is +> completed. +> + +**[GIT PULL: fs: ntfs](http://lore.kernel.org/linux-fsdevel/20230623-pflug-reibt-3435a40349d3@brauner/)** + +> /* Summary */ +> This contains a pile of various smaller fixes for ntfs. There's really +> not a lot to say about them. I'm just the messenger, so this is an +> unusually short pull request. +> +> /* Testing */ +> clang: Ubuntu clang version 15.0.7 +> +> All patches are based on v6.4-rc2 and have been sitting in linux-next. +> No build failures or warnings were observed. 
+> + +**[v1: fcntl.2: document F_UNLCK F_OFD_GETLK extension](http://lore.kernel.org/linux-fsdevel/20230622165225.2772076-4-stsp2@yandex.ru/)** + +> F_UNLCK has the special meaning when used as a lock type on input. +> It returns the information about any lock found in the specified +> region on that particular file descriptor. Locks on other file +> descriptors are ignored by F_UNLCK. +> + +**[v3: F_OFD_GETLK extension to read lock info](http://lore.kernel.org/linux-fsdevel/20230622165225.2772076-1-stsp2@yandex.ru/)** + +> This extension allows to use F_UNLCK on query, which currently returns +> EINVAL. Instead it can be used to query the locks on a particular fd - +> something that is not currently possible. The basic idea is that on +> F_OFD_GETLK, F_UNLCK would "conflict" with (or query) any types of the +> lock on the same fd, and ignore any locks on other fds. +> + +**[v1: iomap regression for aio dio 4k writes](http://lore.kernel.org/linux-fsdevel/20230621174114.1320834-1-bongiojp@gmail.com/)** + +> There has been a standing performance regression involving AIO DIO +> 4k-aligned writes on ext4 backed by a fast local SSD since the switch +> to iomap. I think it was originally reported and investigated in this +> thread: https://lore.kernel.org/all/87lf7rkffv.fsf@collabora.com/ +> + +**[v1: minimum folio order support in filemap](http://lore.kernel.org/linux-fsdevel/20230621083823.1724337-1-p.raghav@samsung.com/)** + +> There has been a lot of discussion recently to support devices and fs for +> bs > ps. One of the main plumbing to support buffered IO is to have a minimum +> order while allocating folios in the page cache. +> +> Hannes sent recently a series[1] where he deduces the minimum folio +> order based on the i_blkbits in struct inode. This takes a different +> approach based on the discussion in that thread where the minimum and +> maximum folio order can be set individually per inode. 
+> + +**[v20: Implement IOCTL to get and optionally clear info about PTEs](http://lore.kernel.org/linux-fsdevel/20230621072404.2918101-1-usama.anjum@collabora.com/)** + +> This syscall is used in Windows applications and games etc. This syscall is +> being emulated in pretty slow manner in userspace. Our purpose is to +> enhance the kernel such that we translate it efficiently in a better way. +> Currently some out of tree hack patches are being used to efficiently +> emulate it in some kernels. We intend to replace those with these patches. +> So the whole gaming on Linux can effectively get benefit from this. It +> means there would be tons of users of this code. +> + +**[v4: Add support for Vendor Defined Error Types in Einj Module](http://lore.kernel.org/linux-fsdevel/20230621035102.13463-1-avadhut.naik@amd.com/)** + +> This patchset adds support for Vendor Defined Error types in the einj +> module by exporting a binary blob file in module's debugfs directory. +> Userspace tools can write OEM Defined Structures into the blob file as +> part of injecting Vendor defined errors. +> + +**[v1: next: readdir: Replace one-element arrays with flexible-array members](http://lore.kernel.org/linux-fsdevel/ZJHiPJkNKwxkKz1c@work/)** + +> One-element arrays are deprecated, and we are replacing them with flexible +> array members instead. So, replace one-element arrays with flexible-array +> members in multiple structures. +> + +**[v1: Support negative dentry cache for FUSE and virtiofs](http://lore.kernel.org/linux-fsdevel/20230620151328.1637569-1-keiichiw@chromium.org/)** + +> This patch series adds a new mount option called negative_dentry_timeout +> for FUSE and virtio-fs filesystems. This option allows the kernel to cache +> negative dentries, which are dentries that represent a non-existent file. +> When this option is enabled, the kernel will skip FUSE_LOOKUP requests for +> second and subsequent lookups to a non-existent file. 
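
The negative-lookup caching described in the FUSE/virtiofs entry above can be pictured with a small userspace sketch. It is not the kernel's dentry cache and not the patch's code: the `neg_cache_*` names, the fixed-size table, and the integer clock are all hypothetical, chosen only to show how a timeout lets repeated lookups of a missing name skip the round trip to the server.

```c
#include <assert.h>
#include <string.h>

#define NEG_SLOTS 8       /* hypothetical table size */
#define NEG_NAME_MAX 32   /* hypothetical name limit, including the NUL */

struct neg_entry {
    char name[NEG_NAME_MAX];
    long expires;   /* time (in seconds) at which the entry goes stale */
    int used;
};

struct neg_cache {
    struct neg_entry slot[NEG_SLOTS];
    long timeout;   /* plays the role of negative_dentry_timeout */
    int next;       /* round-robin replacement index */
};

static void neg_cache_init(struct neg_cache *c, long timeout)
{
    memset(c, 0, sizeof(*c));
    c->timeout = timeout;
}

/* Record that `name` was looked up at time `now` and did not exist. */
static void neg_cache_add(struct neg_cache *c, const char *name, long now)
{
    struct neg_entry *e = &c->slot[c->next];

    c->next = (c->next + 1) % NEG_SLOTS;
    strncpy(e->name, name, NEG_NAME_MAX - 1);
    e->name[NEG_NAME_MAX - 1] = '\0';
    e->expires = now + c->timeout;
    e->used = 1;
}

/* Return 1 if a still-valid negative entry exists for `name`,
 * meaning the lookup request to the server could be skipped. */
static int neg_cache_hit(const struct neg_cache *c, const char *name, long now)
{
    for (int i = 0; i < NEG_SLOTS; i++) {
        const struct neg_entry *e = &c->slot[i];

        if (e->used && now < e->expires && strcmp(e->name, name) == 0)
            return 1;
    }
    return 0;
}
```

With a timeout of 0 an entry expires the moment it is added, so the sketch never reports a hit and every lookup would reach the server.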
+> + +**[v1: ovl: reserve ability to reconfigure mount options with new mount api](http://lore.kernel.org/linux-fsdevel/20230620-fs-overlayfs-mount-api-remount-v1-1-6dfcb89088e3@kernel.org/)** + +> We don't need to carry this issue into the new mount api port. Similar +> to FUSE we can use the fs_context::oldapi member to figure out that this +> is a request coming through the legacy mount api. If we detect it we +> continue silently ignoring all mount options. +> + +**[v1: RFC: F_OFD_GETLK should provide more info](http://lore.kernel.org/linux-fsdevel/20230620095507.2677463-1-stsp2@yandex.ru/)** + +> This patch-set implements 2 small extensions to the current F_OFD_GETLK, +> allowing it to gather more information than it currently returns. +> +> First extension allows to use F_UNLCK on query, which currently returns +> EINVAL. Instead it can be used to query the locks on a particular fd - +> something that is not currently possible. The basic idea is that on +> F_OFD_GETLK, F_UNLCK would "conflict" with (or query) any types of the +> lock on the same fd, and ignore any locks on other fds. +> + +**[v2: fs: Provide helpers for manipulating sb->s_readonly_remount](http://lore.kernel.org/linux-fsdevel/20230619111832.3886-1-jack@suse.cz/)** + +> Provide helpers to set and clear sb->s_readonly_remount including +> appropriate memory barriers. Also use this opportunity to document what +> the barriers pair with and why they are needed. +> + +**[v1: blk: optimization for classic polling](http://lore.kernel.org/linux-fsdevel/3578876466-3733-1-git-send-email-nj.shetty@samsung.com/)** + +> This removes the dependency on interrupts to wake up task. Set task +> state as TASK_RUNNING, if need_resched() returns true, +> while polling for IO completion. +> Earlier, polling task used to sleep, relying on interrupt to wake it up. +> This made some IO take very long when interrupt-coalescing is enabled in +> NVMe. 
+> + +#### Network Devices + +**[v2: net-next: net: dsa: vsc73xx: Make vsc73xx usable](http://lore.kernel.org/netdev/20230625115343.1603330-8-paweldembicki@gmail.com/)** + +> This patch series is focused on getting vsc73xx usable. +> +> First patch was added in v2, it's switch from poll loop to +> read_poll_timeout. +> +> Second patch is simple convert to phylink, because adjust_link won't work +> anymore. +> + +**[tc.8: some remarks and a patch for the manual](http://lore.kernel.org/netdev/168764283038.2838.1146738227989939935.reportbug@kassi.invalid.is.lan/)** + +> Mark a full stop (.) with "\&", +> if it does not mean an end of a sentence. +> This is a preventive action, +> the paragraph could be reshaped, e.g., after changes. +> +> When typing, one does not always notice when the line wraps after the +> period. +> There are too many examples of input lines in manual pages, +> that end with an abbreviation point. +> + +**[v2: net-next: Support offload LED blinking to PHY.](http://lore.kernel.org/netdev/20230624205629.4158216-1-andrew@lunn.ch/)** + +> Allow offloading of the LED trigger netdev to PHY drivers and +> implement it for the Marvell PHY driver. Additionally, correct the +> handling of when the initial state of the LED cannot be represented by +> the trigger, and so an error is returned. +> + +**[v1: net: lan743x: Don't sleep in atomic context](http://lore.kernel.org/netdev/20230623232949.743733-1-moritzf@google.com/)** + +> dev_set_rx_mode() grabs a spin_lock, and the lan743x implementation +> proceeds subsequently to go to sleep using readx_poll_timeout(). +> +> Introduce a helper wrapping the readx_poll_timeout_atomic() function +> and use it to replace the calls to readx_poll_timeout(). +> + +**[v1: use array_size](http://lore.kernel.org/netdev/20230623211457.102544-1-Julia.Lawall@inria.fr/)** + +> Use array_size to protect against multiplication overflows. +> +> This follows up on the following patches by Kees Cook from 2018. 
+> +> 42bc47b35320 ("treewide: Use array_size() in vmalloc()") +> fad953ce0b22 ("treewide: Use array_size() in vzalloc()") +> + +**[v2: Add support for sam9x7 SoC family](http://lore.kernel.org/netdev/20230623203056.689705-1-varshini.rajendran@microchip.com/)** + +> This patch series adds support for the new SoC family - sam9x7. +> - The device tree, configs and drivers are added +> - Clock driver for sam9x7 is added +> - Support for basic peripherals is added +> - Target board SAM9X75 Curiosity is added +> + +**[v1: net-next: netlink: add display-hint to ynl](http://lore.kernel.org/netdev/20230623201928.14275-1-donald.hunter@gmail.com/)** + +> Add a display-hint property to the netlink schema, to be used by generic +> netlink clients as hints about how to display attribute values. +> +> A display-hint on an attribute definition is intended for letting a +> client such as ynl know that, for example, a u32 should be rendered as +> an ipv4 address. The display-hint enumeration includes a small number of +> networking domain-specific value types. +> + +**[v3: io_uring: Add io_uring command support for sockets](http://lore.kernel.org/netdev/20230623193532.88760-1-kuniyu@amazon.com/)** + +> Date: Thu, 22 Jun 2023 14:59:14 -0700 +> > Enable io_uring commands on network sockets. Create two new +> > SOCKET_URING_OP commands that will operate on sockets. +> > +> > In order to call ioctl on sockets, use the file_operations->io_uring_cmd +> > callbacks, and map it to a uring socket function, which handles the +> > SOCKET_URING_OP accordingly, and calls socket ioctls. +> > + +**[v1: net-next: dsa/88e6xxx/phylink changes after the next merge window](http://lore.kernel.org/netdev/ZJWpGCtIZ06jiBsO@shell.armlinux.org.uk/)** + +> This patch series contains the minimum set of patches that I would like +> to get in for the following merge window. +> +> The first four patches are laying the groundwork for converting the +> mv88e6xxx driver to use phylink PCS support. 
Patches 5 through 11 +> perform that conversion. +> + +**[v2: net-next: net/tcp: optimise locking for blocking splice](http://lore.kernel.org/netdev/80736a2cc6d478c383ea565ba825eaf4d1abd876.1687523671.git.asml.silence@gmail.com/)** + +> Even when tcp_splice_read() reads all it was asked for, for blocking +> sockets it'll release and immediately regrab the socket lock, loop +> around and break on the while check. +> +> Check tss.len right after we adjust it, and return if we're done. +> That saves us one release_sock(); lock_sock(); pair per successful +> blocking splice read. +> + +**[[net-next PATCH RFC] net: dsa: qca8k: make learning configurable and keep off if standalone](http://lore.kernel.org/netdev/20230623114005.9680-1-ansuelsmth@gmail.com/)** + +> Address learning should initially be turned off by the driver for port +> operation in standalone mode, then the DSA core handles changes to it +> via ds->ops->port_bridge_flags(). +> +> Currently this is not the case for qca8k where learning is enabled +> unconditionally in qca8k_setup for every user port. +> + +**[v2: net-next: net: phy: C45-over-C22 access](http://lore.kernel.org/netdev/20230620-feature-c45-over-c22-v2-0-def0ab9ccee2@kernel.org/)** + +> [Sorry for the very late follow-up on this series, I simply haven't had +> time to look into it. Should be better now.] +> +> The goal here is to get the GPY215 and LAN8814 running on the Microchip +> LAN9668 SoC. The LAN9668 supports one external bus and unfortunately, the +> LAN8814 has a bug which makes it impossible to use C45 on that bus. +> Fortunately, it was the intention of the GPY215 driver to be used on a C22 +> bus. But I think this could have never really worked, because the +> phy_get_c45_ids() will always do c45 accesses and thus gpy_probe() will +> fail. +> + +#### Security Hardening + +**[v1: next: openprom: Use struct_size() helper](http://lore.kernel.org/linux-hardening/ZJTYWQ5NA726baWK@work/)** + +> Prefer struct_size() over open-coded versions. 
+> + +**[v1: ACPI: APEI: Use ERST timeout for slow devices](http://lore.kernel.org/linux-hardening/20230622153554.16847-1-jeshuas@nvidia.com/)** + +> Slow devices such as flash may not meet the default 1ms timeout value, +> so use the ERST max execution time value that they provide as the +> timeout if it is larger. +> + +**[v1: pstore/ram: Add support for dynamically allocated ramoops memory regions](http://lore.kernel.org/linux-hardening/20230622005213.458236-1-isaacmanjarres@google.com/)** + +> The reserved memory region for ramoops is assumed to be at a fixed +> and known location when read from the devicetree. This is not desirable +> in environments where it is preferred for the region to be dynamically +> allocated early during boot (i.e. the memory region is defined with +> the "alloc-ranges" property instead of the "reg" property). +> + +**[v1: next: reiserfs: Replace one-element array with flexible-array member](http://lore.kernel.org/linux-hardening/ZJN9Kqhcs0ZGET%2F8@work/)** + +> One-element arrays are deprecated, and we are replacing them with flexible +> array members instead. So, replace one-element array with flexible-array +> member in direntry_uarea structure, and refactor the rest of the code, +> accordingly. +> + +**[v1: next: ksmbd: Use struct_size() helper in ksmbd_negotiate_smb_dialect()](http://lore.kernel.org/linux-hardening/ZJNrsjDEfe0iwQ92@work/)** + +> Prefer struct_size() over open-coded versions. +> + +**[v1: next: smb: Replace one-element array with flexible-array member](http://lore.kernel.org/linux-hardening/ZJNnynWOoTp6uTwF@work/)** + +> One-element arrays are deprecated, and we are replacing them with flexible +> array members instead. So, replace one-element array with flexible-array +> member in struct smb_negotiate_req. +> +> This results in no differences in binary output. 
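
Several entries in this section replace one-element arrays with flexible-array members or switch to struct_size(). The sketch below illustrates both ideas in plain userspace C. The `req_old`/`req_new` structs and the `req_new_size()` helper are hypothetical stand-ins, not the kernel structures or the real struct_size() macro; the helper only mimics the kernel macro's behavior of saturating to SIZE_MAX on overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Old style: a one-element array standing in for a variable-length tail.
 * sizeof() silently includes one element that may not exist. */
struct req_old {
    uint16_t count;
    uint16_t dialects[1];
};

/* New style: a C99 flexible-array member. sizeof() covers only the header,
 * while the tail starts at the same offset, so the in-memory layout of the
 * header and the elements is unchanged. */
struct req_new {
    uint16_t count;
    uint16_t dialects[];
};

/* Userspace stand-in for struct_size(): header plus n trailing elements,
 * saturating to SIZE_MAX instead of wrapping around on overflow. */
static size_t req_new_size(size_t n)
{
    const size_t base = sizeof(struct req_new);
    const size_t elem = sizeof(uint16_t);

    if (n > (SIZE_MAX - base) / elem)
        return SIZE_MAX;
    return base + n * elem;
}

/* Allocate a request with room for n dialects; NULL on overflow or OOM. */
static struct req_new *req_new_alloc(uint16_t n)
{
    size_t bytes = req_new_size(n);
    struct req_new *r;

    if (bytes == SIZE_MAX)
        return NULL;
    r = calloc(1, bytes);
    if (r)
        r->count = n;
    return r;
}
```

Because only the value of sizeof() changes, code that is adjusted to compute sizes from the header plus the element count can produce identical machine code, which is consistent with the "no differences in binary output" notes in the quoted patches.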
+> + +**[v1: next: scsi: smartpqi: Replace one-element arrays with flexible-array members](http://lore.kernel.org/linux-hardening/ZJNdKDkuRbFZpASS@work/)** + +> One-element arrays are deprecated, and we are replacing them with flexible +> array members instead. So, replace one-element arrays with flexible-array +> members in a couple of structures, and refactor the rest of the code, +> accordingly. +> +> This helps with the ongoing efforts to tighten the FORTIFY_SOURCE +> routines on memcpy(). +> +> This results in no differences in binary output. +> + +**[v1: scsi: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230621030033.3800351-1-azeemshaikh38@gmail.com/)** + +> This patch series replaces strlcpy in the scsi subsystem wherever trivial +> replacement is possible, i.e return value from strlcpy is unused. The patches +> themselves are independent of each other and are included as a series for +> ease of review. +> + +**[v1: net: wwan: iosm: Convert single instance struct member to flexible array](http://lore.kernel.org/linux-hardening/20230620194234.never.023-kees@kernel.org/)** + +> Adjust the struct mux_adth definition and associated sizeof() math; no binary +> output differences are observed in the resulting object file. +> +> Closes: https://lore.kernel.org/lkml/dbfa25f5-64c8-5574-4f5d-0151ba95d232@gmail.com/ +> + +**[v1: igc: Ignore AER reset when device is suspended](http://lore.kernel.org/linux-hardening/20230620123636.1854690-1-kai.heng.feng@canonical.com/)** + +> The issue is that the PTM requests are sent before the driver resumes the +> device. Since the issue can also be observed on Windows, it's quite +> likely a firmware/hardware limitation. +> +> So avoid resetting the device if it's not resumed. Once the device is +> fully resumed, the device can work normally. 
+> + +#### Asynchronous IO + +**[v1: liburing: Introduce '--use-libc' option](http://lore.kernel.org/io-uring/20230622172029.726710-1-ammarfaizi2@gnuweeb.org/)** + +> This is an RFC patch series to introduce the '--use-libc' option to the +> configure script. +> +> Currently, when compiling liburing on x86, x86-64, and aarch64 +> architectures, the resulting binary lacks the linkage with the standard +> C library (libc). +> + +**[v2: io_uring/net: disable partial retries for recvmsg with cmsg](http://lore.kernel.org/io-uring/7e16d521-7c8a-3ac7-497a-04e69fee1afe@kernel.dk/)** + +> We cannot sanely handle partial retries for recvmsg if we have cmsg +> attached. If we don't, then we'd just be overwriting the initial cmsg +> header on retries. Alternatively we could increment and handle this +> appropriately, but it doesn't seem worth the complication. +> + +**[v2: io_uring/net: clear msg_controllen on partial sendmsg retry](http://lore.kernel.org/io-uring/312cc2b7-8229-c167-e230-bc1d7d0ed61b@kernel.dk/)** + +> If we have cmsg attached AND we transferred partial data at least, clear +> msg_controllen on retry so we don't attempt to send that again. +> + +#### Rust For Linux + +**[v1: Rust device mapper abstractions](http://lore.kernel.org/rust-for-linux/20230625121657.3631109-1-changxian.cqs@antgroup.com/)** + +> Additionally, there are some dummy codes used to wrap the block +> layer structs, i.e., `bio` and `request`, which seems being +> in the review process, so I just place it in the same file. +> + +**[v1: rust: alloc: Add realloc and alloc_zeroed to the GlobalAlloc impl](http://lore.kernel.org/rust-for-linux/20230622-global_alloc_methods-v1-1-3d3561593e23@protonmail.com/)** + +> While there are default impls for these methods, using the respective C +> api's is faster. Currently neither the existing nor these new +> GlobalAlloc method implementations are actually called. Instead the +> __rust_* function defined below the GlobalAlloc impl are used. 
With +> rustc 1.71 these functions will be gone and all allocation calls will go +> through the GlobalAlloc implementation. +> + +#### BPF + +**[v2: libbpf: kprobe.multi: Filter with available_filter_functions_addrs](http://lore.kernel.org/bpf/20230625011326.1729020-1-liu.yun@linux.dev/)** + +> When using regular expression matching with "kprobe multi", it scans all +> the functions under "/proc/kallsyms" that can be matched. However, not all +> of them can be traced by kprobe.multi. If any one of the functions fails +> to be traced, it will result in the failure of all functions. The best +> approach is to filter out the functions that cannot be traced to ensure +> proper tracking of the functions. +> + +**[v1: perf: Replace deprecated -target with --target= for Clang](http://lore.kernel.org/bpf/20230624002708.1907962-1-maskray@google.com/)** + +> -target has been deprecated since Clang 3.4 in 2013. Use the preferred +> --target=bpf form instead. This matches how we use --target= in +> scripts/Makefile.clang. +> + +**[v2: bpf: Replace deprecated -target with --target= for Clang](http://lore.kernel.org/bpf/20230624001856.1903733-1-maskray@google.com/)** + +> -target has been deprecated since Clang 3.4 in 2013. Use the preferred +> --target=bpf form instead. This matches how we use --target= in +> scripts/Makefile.clang. +> + +**[v4: lib/test_bpf: Call page_address() on page acquired with GFP_KERNEL flag](http://lore.kernel.org/bpf/20230623151644.GA434468@sumitra.com/)** + +> generate_test_data() acquires a page with alloc_page(GFP_KERNEL). +> The GFP_KERNEL is typical for kernel-internal allocations. +> The caller requires ZONE_NORMAL or a lower zone for direct access. +> +> Therefore the page cannot come from ZONE_HIGHMEM. Thus there's +> no need to map it with kmap(). 
+> + +**[v5: bpf-next: bpf: Support ->fill_link_info for kprobe_multi and perf_event links](http://lore.kernel.org/bpf/20230623141546.3751-1-laoar.shao@gmail.com/)** + +> This patchset enhances the usability of kprobe_multi program by introducing +> support for ->fill_link_info. This allows users to easily determine the +> probed functions associated with a kprobe_multi program. While +> `bpftool perf show` already provides information about functions probed by +> perf_event programs, supporting ->fill_link_info ensures consistent access +> to this information across all bpf links. +> + +**[v4: Bring back vmlinux.h generation](http://lore.kernel.org/bpf/20230623041405.4039475-1-irogers@google.com/)** + +> Commit 760ebc45746b ("perf lock contention: Add empty 'struct rq' to +> satisfy libbpf 'runqueue' type verification") inadvertently created a +> declaration of 'struct rq' that conflicted with a generated +> vmlinux.h's: +> + +**[[RFC v2 PATCH bpf-next 0/4] bpf: add percpu stats for bpf_map](http://lore.kernel.org/bpf/20230622095330.1023453-1-aspsk@isovalent.com/)** + +> This series adds a mechanism for maps to populate per-cpu counters of elements +> on insertions/deletions. The sum of these counters can be accessed by a new +> kfunc from a map iterator program. +> + +**[v7: bpf-next: bpf, x86: allow function arguments up to 12 for TRACING](http://lore.kernel.org/bpf/20230622075715.1818144-1-imagedong@tencent.com/)** + +> Therefore, let's enhance it by increasing the function arguments count +> allowed in arch_prepare_bpf_trampoline(), for now, only x86_64. +> +> In the 1st patch, we save/restore regs with BPF_DW size to make the code +> in save_regs()/restore_regs() simpler. +> +> In the 2nd patch, we make arch_prepare_bpf_trampoline() support to copy +> function arguments in stack for x86 arch. Therefore, the maximum +> arguments can be up to MAX_BPF_FUNC_ARGS for FENTRY, FEXIT and +> MODIFY_RETURN. 
Meanwhile, we clean the potential garbage value when we +> copy the arguments on-stack. +> + +**[v1: net-next: TSN auto negotiation between 1G and 2.5G](http://lore.kernel.org/bpf/20230622041905.629430-1-yong.liang.choong@linux.intel.com/)** + +> Intel platforms’ integrated Gigabit Ethernet controllers support +> 2.5Gbps mode statically using BIOS programming. In the current +> implementation, the BIOS menu provides an option to select between +> programs the Phase Lock Loop (PLL) registers. The BIOS also read the +> TSN lane registers from Flexible I/O Adapter (FIA) block and provided +> auto-negotiation between 10/100/1000Mbps and 2.5Gbps is not allowed. +> + +**[v3: bpf-next: BPF token](http://lore.kernel.org/bpf/20230621233809.1941811-1-andrii@kernel.org/)** + +> This patch set introduces new BPF object, BPF token, which allows to delegate +> a subset of BPF functionality from privileged system-wide daemon (e.g., +> systemd or any other container manager) to a *trusted* unprivileged +> application. Trust is the key here. This functionality is not about allowing +> unconditional unprivileged BPF usage. Establishing trust, though, is +> completely up to the discretion of respective privileged application that +> would create a BPF token, as different production setups can and do achieve it +> through a combination of different means (signing, LSM, code reviews, etc), +> and it's undesirable and infeasible for kernel to enforce any particular way +> of validating trustworthiness of particular process. 
+> + +**[v2: bpf-next: bpf: Netdev TX metadata](http://lore.kernel.org/bpf/20230621170244.1283336-1-sdf@google.com/)** + +> - Support passing metadata via XSK +> - Showcase how to consume this metadata at TX in the selftests +> - Sample untested mlx5 implementation +> - Simplify attach/detach story with simple global fentry (Alexei) +> - Add 'return 0' in xdp_metadata selftest (Willem) +> - Add missing 'sizeof(*ip6h)' in xdp_hw_metadata selftest (Willem) +> - Document 'timestamp' argument of kfunc (Simon) +> - Not relevant due to attach/detach rework: +> - s/devtx_sb/devtx_submit/ in netdev (Willem) +> - s/devtx_cp/devtx_complete/ in netdev (Willem) +> - Document 'devtx_complete' and 'devtx_submit' in netdev (Simon) +> - Add devtx_sb/devtx_cp forward declaration (Simon) +> - Add missing __rcu/rcu_dereference annotations (Simon) +> + +**[v1: fs: new accessors for inode->i_ctime](http://lore.kernel.org/bpf/20230621144507.55591-1-jlayton@kernel.org/)** + +> I've been working on a patchset to change how the inode->i_ctime is +> accessed in order to give us conditional, high-res timestamps for the +> ctime and mtime. struct timespec64 has unused bits in it that we can use +> to implement this. In order to do that however, we need to wrap all +> accesses of inode->i_ctime to ensure that bits used as flags are +> appropriately handled. +> +> This patchset first adds some new inode_ctime_* accessor functions. It +> then converts all in-tree accesses of inode->i_ctime to use those new +> functions and then renames the i_ctime field to __i_ctime to help ensure +> that use of the accessors. +> + +**[v1: bpf-next: bpf: Add two new bpf helpers bpf_perf_type_[uk]probe()](http://lore.kernel.org/bpf/20230621120042.3903-3-laoar.shao@gmail.com/)** + +> We are utilizing BPF LSM to monitor BPF operations within our container +> environment. Our goal is to examine the program type and perform the +> respective audits in our LSM program. 
+> +> When it comes to the perf_event BPF program, there are no specific +> definitions for the perf types of kprobe or uprobe. In other words, there +> is no PERF_TYPE_[UK]PROBE. It appears that defining them as UAPI at this +> stage would be impractical. +> + +**[v5: bpf-next: Handle immediate reuse in bpf memory allocator](http://lore.kernel.org/bpf/20230619143231.222536-1-houtao@huaweicloud.com/)** + +> V5 incorporates suggestions from Alexei and Paul (Big thanks for that). +> The main changes include: +> *) Use per-cpu list for reusable list and freeing list to reduce lock +> contention and retain NUMA-aware attribute +> *) Use multiple RCU callback for reuse as v3 did +> *) Use rcu_momentary_dyntick_idle() to reduce the peak memory footprint +> + +**[v1: net-next: virtio-net: avoid XDP and _F_GUEST_CSUM](http://lore.kernel.org/bpf/20230619105738.117733-1-hengqi@linux.alibaba.com/)** + +> virtio-net needs to clear the VIRTIO_NET_F_GUEST_CSUM feature when +> loading XDP. The main reason for doing this is because +> VIRTIO_NET_F_GUEST_CSUM allows to receive packets marked as +> VIRTIO_NET_HDR_F_NEEDS_CSUM. Such packets are not compatible with +> XDP programs, because we cannot guarantee that the csum_{start, offset} +> fields are correct after XDP modifies the packets. +> + +**[v3: bpf-next: bpf, arm64: use BPF prog pack allocator in BPF JIT](http://lore.kernel.org/bpf/20230619100121.27534-1-puranjay12@gmail.com/)** + +> BPF programs currently consume a page each on ARM64. For systems with many BPF +> programs, this adds significant pressure to instruction TLB. High iTLB pressure +> usually causes slow down for the whole system. +> +> Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. +> It packs multiple BPF programs into a single huge page. It is currently only +> enabled for the x86_64 BPF JIT. 
+> + +### Related Technology Updates + +#### QEMU + +**[QEMU RISC-V](http://lore.kernel.org/qemu-devel/CAK-FQ7uOUhAhmgqBOv5fYukFmz-hSp=XEaeyrmiAi2_UBncU0A@mail.gmail.com/)** + +> hello, +> I built RISC-V toolchain and QEMU as follows: +> # Install prerequisites: +> https://github.com/riscv-collab/riscv-gnu-toolchain#prerequisites +> # Install additional prerequisites: +> https://github.com/riscv-collab/riscv-gnu-toolchain/issues/1251 +> git clone https://github.com/riscv-collab/riscv-gnu-toolchain +> cd riscv-gnu-toolchain +> ./configure --prefix=/home/RISCV-installed-Tools --with-arch=rv32i_zicsr +> --with-abi=ilp32 +> make +> make build-qemu +> + +**[v2: target/riscv: Restrict KVM-specific fields from ArchCPU](http://lore.kernel.org/qemu-devel/20230624192957.14067-1-philmd@linaro.org/)** + +> These fields shouldn't be accessed when KVM is not available. +> +> Restrict the KVM timer migration state. Rename the KVM timer +> post_load() handler accordingly, because cpu_post_load() is +> too generic. +> + +**[v4: Add RISC-V KVM AIA Support](http://lore.kernel.org/qemu-devel/20230621145500.25624-1-yongxuan.wang@sifive.com/)** + +> This series adds support for KVM AIA in RISC-V architecture. +> +> In order to test these patches, we require Linux with KVM AIA support which can +> be found in the riscv_kvm_aia_hwaccel_v1 branch at +> https://github.com/avpatel/linux.git +> + +**[v1: linux-user/riscv: Add syscall riscv_hwprobe](http://lore.kernel.org/qemu-devel/06a4543df2aa6101ca9a48f21a3198064b4f1f87.camel@rivosinc.com/)** + +> This patch adds the new syscall for the +> "RISC-V Hardware Probing Interface" +> (https://docs.kernel.org/riscv/hwprobe.html). 
+> + +#### U-Boot + +**[v2: riscv: Add ACLINT mtimer and mswi devices support](http://lore.kernel.org/u-boot/20230621151147.1523273-1-bmeng@tinylab.org/)** + +> This RISC-V ACLINT specification [1] defines a set of memory mapped +> devices which provide inter-processor interrupts (IPI) and timer +> functionalities for each HART on a multi-HART RISC-V platform. +> +> This series updates U-Boot's existing SiFive CLINT driver to handle +> the ACLINT changes, and is now able to support both CLINT and ACLINT. +> + +**[CFP open for RISC-V MC at Linux Plumbers Conference 2023](http://lore.kernel.org/u-boot/CAOnJCU+uMF-brnyjA1HLhupYPqL0ebOVu+ivrSn2AewuSrhtBw@mail.gmail.com/)** + +> The CFP for topic proposals for the RISC-V micro conference[1] 2023 is open now. +> Please submit your proposal before it's too late! +> +> The Linux plumbers event will be both in person and remote +> (hybrid)virtual this year. More details can be found here [2]. +> + +## 20230618: Issue 50 + +### Kernel Updates + +#### File Systems + +**[v1: d_path: include internal.h](http://lore.kernel.org/linux-fsdevel/20230616164627.66340-1-ben.dooks@codethink.co.uk/)** + +> Include internal.h to get the definition of simple_dname, to fix the +> following sparse warning: +> +> fs/d_path.c:317:6: warning: symbol 'simple_dname' was not declared. Should it be static? +> + +**[v1: fs: Provide helpers for manipulating sb->s_readonly_remount](http://lore.kernel.org/linux-fsdevel/20230616163827.19377-1-jack@suse.cz/)** + +> Provide helpers to set and clear sb->s_readonly_remount including +> appropriate memory barriers. Also use this opportunity to document what +> the barriers pair with and why they are needed. +> + +**[v5: dax: enable dax fault handler to report VM_FAULT_HWPOISON](http://lore.kernel.org/linux-fsdevel/20230615181325.1327259-1-jane.chu@oracle.com/)** + +> Change from v4: +> Add comments describing when and why dax_mem2blk_err() is used. +> Suggested by Dan. 
+> + +**[v19: Implement IOCTL to get and optionally clear info about PTEs](http://lore.kernel.org/linux-fsdevel/20230615141144.665148-1-usama.anjum@collabora.com/)** + +> At this point, we left soft-dirty considering it is too much delicate and +> userfaultfd [9] seemed like the only way forward. From there onward, we +> have been basing soft-dirty emulation on userfaultfd wp feature where +> kernel resolves the faults itself when WP_ASYNC feature is used. It was +> straight forward to add WP_ASYNC feature in userfaultfd. Now we get only +> those pages dirty or written-to which are really written in reality. (PS +> There is another WP_UNPOPULATED userfaultfd feature is required which is +> needed to avoid pre-faulting memory before write-protecting [9].) +> + +**[v1: fs: Protect reconfiguration of sb read-write from racing writes](http://lore.kernel.org/linux-fsdevel/20230615113848.8439-1-jack@suse.cz/)** + +> The reconfigure / remount code takes a lot of effort to protect +> filesystem's reconfiguration code from racing writes on remounting +> read-only. However during remounting read-only filesystem to read-write +> mode userspace writes can start immediately once we clear SB_RDONLY +> flag. This is inconvenient for example for ext4 because we need to do +> some writes to the filesystem (such as preparation of quota files) +> before we can take userspace writes so we are clearing SB_RDONLY flag +> before we are fully ready to accept userspace writes and syzbot has found +> a way to exploit this [1]. Also as far as I'm reading the code +> the filesystem remount code was protected from racing writes in the +> legacy mount path by the mount's MNT_READONLY flag so this is relatively +> new problem. It is actually fairly easy to protect remount read-write +> from racing writes using sb->s_readonly_remount flag so let's just do +> that instead of having to workaround these races in the filesystem code. 
+> + +**[v5: Handle notifications on overlayfs fake path files](http://lore.kernel.org/linux-fsdevel/20230615112229.2143178-1-amir73il@gmail.com/)** + +> A little while ago, Jan and I realized that an unprivileged overlayfs +> mount could be used to avert fanotify permission events that were +> requested for an inode or sb on the underlying fs. +> + +**[v1: exfat: get file size from DataLength](http://lore.kernel.org/linux-fsdevel/PUZPR04MB6316DB8A8CB6107D56716EBC815BA@PUZPR04MB6316.apcprd04.prod.outlook.com/)** + +> From the exFAT specification, the file size should get from 'DataLength' +> of Stream Extension Directory Entry, not 'ValidDataLength'. +> + +**[v3: eventfd: add a uapi header for eventfd userspace APIs](http://lore.kernel.org/linux-fsdevel/tencent_2B6A999A23E86E522D5D9859D54FFCF9AA05@qq.com/)** + +> Create a uapi header include/uapi/linux/eventfd.h, move the associated +> flags to the uapi header, and include it from linux/eventfd.h. +> + +**[v1: fs: use helpers for opening kernel internal files](http://lore.kernel.org/linux-fsdevel/20230614120917.2037482-1-amir73il@gmail.com/)** + +> Overlayfs and cachefiles use vfs_open_tmpfile() to open a tmpfile +> without accounting for nr_files. +> +> Rename this helper to kernel_tmpfile_open() to better reflect this +> helper is used for kernel internal users. +> + +**[v1: RFC: high-order folio support for I/O](http://lore.kernel.org/linux-fsdevel/20230614114637.89759-1-hare@suse.de/)** + +> now, that was easy. +> Thanks to willy and his recent patchset to support large folios in +> gfs2 turns out that most of the work to support high-order folios +> for I/O is actually done. +> It only needs two rather obvious patches to allocate folios with +> the order derived from the mapping blocksize, and to adjust readahead +> to avoid reading off the end of the device.
+ +**[v3: Add support for Vendor Defined Error Types in Einj Module](http://lore.kernel.org/linux-fsdevel/f9e8243d-78e2-4aa1-e6f2-5ac2a8c1745d@amd.com/)** + +> On 6/13/2023 03:01, Greg KH wrote: +> > On Mon, Jun 12, 2023 at 09:51:36PM +0000, Avadhut Naik wrote: +> >> This patchset adds support for Vendor Defined Error types in the einj +> >> module by exporting a binary blob file in module's debugfs directory. +> >> Userspace tools can write OEM Defined Structures into the blob file as +> >> part of injecting Vendor defined errors. +> >> + +**[v1: Report on physically contiguous memory in smaps](http://lore.kernel.org/linux-fsdevel/20230613160950.3554675-1-ryan.roberts@arm.com/)** + +> This series adds new entries to /proc/pid/smaps[_rollup] to report on physically +> contiguous runs of memory. The first patch reports on the sizes of the runs by +> binning into power-of-2 blocks and reporting how much memory is in which bin. +> The second patch reports on how much of the memory is contpte-mapped in the page +> table (this is a hint that arm64 supports to tell the HW that a range of ptes +> map physically contiguous memory). +> + +**[v1: errseq_t: split the ERRSEQ_SEEN flag into two](http://lore.kernel.org/linux-fsdevel/20230613121521.146865-1-jlayton@kernel.org/)** + +> NFS wants to use the errseq_t mechanism to detect errors that occur +> during a write, but for that use-case we want to ignore anything that +> happened before the sample point. +> + +**[v3: gfs2/buffer folio changes for 6.5](http://lore.kernel.org/linux-fsdevel/20230612210141.730128-1-willy@infradead.org/)** + +> This kind of started off as a gfs2 patch series, then became entwined +> with buffer heads once I realised that gfs2 was the only remaining +> caller of __block_write_full_page(). For those not in the gfs2 world, +> the big point of this series is that block_write_full_page() should now +> handle large folios correctly. 
+> + +**[v3: Create large folios in iomap buffered write path](http://lore.kernel.org/linux-fsdevel/20230612203910.724378-1-willy@infradead.org/)** + +> The problem ends up being lock contention on the i_pages spinlock as we +> clear the writeback bit on each folio (and propagate that up through +> the tree). By using larger folios, we decrease the number of folios +> to be processed by a factor of 256 for this benchmark, eliminating the +> lock contention. +> + +**[v2: Landlock support for UML](http://lore.kernel.org/linux-fsdevel/20230612191430.339153-1-mic@digikod.net/)** + +> Commit cb2c7d1a1776 ("landlock: Support filesystem access-control") +> introduced a new ARCH_EPHEMERAL_INODES configuration, only enabled for +> User-Mode Linux. The reason was that UML's hostfs managed inodes in an +> ephemeral way: from the kernel point of view, the same inode struct +> could be created several times while being used by user space because +> the kernel didn't hold references to inodes. Because Landlock (and +> probably other subsystems) ties properties (i.e. access rights) to inode +> objects, it wasn't possible to create rules that match inodes and then +> allow specific accesses. +> + +**[v1: block: Add config option to not allow writing to mounted devices](http://lore.kernel.org/linux-fsdevel/20230612161614.10302-1-jack@suse.cz/)** + +> Writing to mounted devices is dangerous and can lead to filesystem +> corruption as well as crashes. Furthermore syzbot comes with more and +> more involved examples how to corrupt block device under a mounted +> filesystem leading to kernel crashes and reports we can do nothing +> about. Add config option to disallow writing to mounted (exclusively +> open) block devices. Syzbot can use this option to avoid uninteresting +> crashes. Also users whose userspace setup does not need writing to +> mounted block devices can set this config option for hardening. 
+> + +**[v5: blksnap - block devices snapshots module](http://lore.kernel.org/linux-fsdevel/20230612135228.10702-1-sergei.shtepa@veeam.com/)** + +> I am happy to offer a improved version of the Block Devices Snapshots +> Module. It allows to create non-persistent snapshots of any block devices. +> The main purpose of such snapshots is to provide backups of block devices. +> See more in Documentation/block/blksnap.rst. +> + +**[v1: zonefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method](http://lore.kernel.org/linux-fsdevel/20230612053515.585428-1-hch@lst.de/)** + +> Since commit a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") file +> systems can just set the FMODE_CAN_ODIRECT flag at open time instead of +> wiring up a dummy direct_IO method to indicate support for direct I/O. +> Do that for zonefs so that noop_direct_IO can eventually be removed. +> + +**[v1: fs: kernel and userspace filesystem freeze](http://lore.kernel.org/linux-fsdevel/168653971691.755178.4003354804404850534.stgit@frogsfrogsfrogs/)** + +> Sometimes, kernel filesystem drivers need the ability to quiesce writes +> to the filesystem so that the driver can perform some kind of +> maintenance activity. This capability mostly already exists in the form +> of filesystem freezing but with the huge caveat that userspace can thaw +> any frozen fs at any time. If the correctness of the fs maintenance +> program requires stillness of the filesystem, then this caveat is BAD. +> + +**[v1: nilfs2: prevent general protection fault in nilfs_clear_dirty_page()](http://lore.kernel.org/linux-fsdevel/20230612021456.3682-1-konishi.ryusuke@gmail.com/)** + +> In a syzbot stress test that deliberately causes file system errors on +> nilfs2 with a corrupted disk image, it has been reported that +> nilfs_clear_dirty_page() called from nilfs_clear_dirty_pages() can cause +> a general protection fault. 
+> + +**[v1: eventfd: show flags in fdinfo](http://lore.kernel.org/linux-fsdevel/tencent_59C3AA88A8F1829226E5D3619837FC4A9E09@qq.com/)** + +> The flags should be displayed in fdinfo, as different flags +> could affect the behavior of eventfd. +> + +**[v1: fsnotify: move fsnotify_open() hook into do_dentry_open()](http://lore.kernel.org/linux-fsdevel/20230611122429.1499617-1-amir73il@gmail.com/)** + +> fsnotify_open() hook is called only from high level system calls +> context and not called for the very many helpers to open files. +> + +**[v1: sysctl: set variable sysctl_mount_point storage-class-specifier to static](http://lore.kernel.org/linux-fsdevel/20230611120725.183182-1-trix@redhat.com/)** + +> smatch reports +> fs/proc/proc_sysctl.c:32:18: warning: symbol +> 'sysctl_mount_point' was not declared. Should it be static? +> +> This variable is only used in its defining file, so it should be static. +> + +**[v1: blk: optimization for classic polling](http://lore.kernel.org/linux-fsdevel/3578876466-3733-1-git-send-email-nj.shetty@samsung.com/)** + +> This removes the dependency on interrupts to wake up task. Set task +> state as TASK_RUNNING, if need_resched() returns true, +> while polling for IO completion. +> Earlier, polling task used to sleep, relying on interrupt to wake it up. +> This made some IO take very long when interrupt-coalescing is enabled in +> NVMe. +> + +#### Network Devices + +**[v2: net: revert "net: align SO_RCVMARK required privileges with SO_MARK"](http://lore.kernel.org/netdev/20230618103130.51628-1-maze@google.com/)** + +> This reverts commit 1f86123b9749 ("net: align SO_RCVMARK required +> privileges with SO_MARK") because the reasoning in the commit message +> is not really correct: +> SO_RCVMARK is used for 'reading' incoming skb mark (via cmsg), as such +> it is more equivalent to 'getsockopt(SO_MARK)' which has no priv check +> and retrieves the socket mark, rather than 'setsockopt(SO_MARK)' which +> sets the socket mark and does require privs.
+> + +**[v1: net-next: netlabel: Reorder fields in 'struct netlbl_domaddr6_map'](http://lore.kernel.org/netdev/aa109847260e51e174c823b6d1441f75be370f01.1687083361.git.christophe.jaillet@wanadoo.fr/)** + +> Group some variables based on their sizes to reduce hole and avoid padding. +> On x86_64, this shrinks the size of 'struct netlbl_domaddr6_map' +> from 72 to 64 bytes. +> +> It saves a few bytes of memory and is more cache-line friendly. +> + +**[v1: net-next: mptcp: Reorder fields in 'struct mptcp_pm_add_entry'](http://lore.kernel.org/netdev/e47b71de54fd3e580544be56fc1bb2985c77b0f4.1687081558.git.christophe.jaillet@wanadoo.fr/)** + +> Group some variables based on their sizes to reduce hole and avoid padding. +> On x86_64, this shrinks the size of 'struct mptcp_pm_add_entry' +> from 136 to 128 bytes. +> + +**[v1: net-next: mctp: Reorder fields in 'struct mctp_route'](http://lore.kernel.org/netdev/393ad1a5aef0aa28d839eeb3d7477da0e0eeb0b0.1687080803.git.christophe.jaillet@wanadoo.fr/)** + +> Group some variables based on their sizes to reduce hole and avoid padding. +> On x86_64, this shrinks the size of 'struct mctp_route' +> from 72 to 64 bytes. +> + +**[v1: net-next: dt-bindings: net: bluetooth: qualcomm: document VDD_CH1](http://lore.kernel.org/netdev/20230617165716.279857-1-krzysztof.kozlowski@linaro.org/)** + +> WCN3990 comes with two chains - CH0 and CH1 - where each takes VDD +> regulator. It seems VDD_CH1 is optional (Linux driver does not care +> about it), so document it to fix dtbs_check warnings like: +> +> sdm850-lenovo-yoga-c630.dtb: bluetooth: 'vddch1-supply' does not match any of the regexes: 'pinctrl-[0-9]+' +> + +**[v1: net-next: net: phy: at803x: Use devm_regulator_get_enable_optional()](http://lore.kernel.org/netdev/f5fdf1a50bb164b4f59409d3a70a2689515d59c9.1687011839.git.christophe.jaillet@wanadoo.fr/)** + +> Use devm_regulator_get_enable_optional() instead of hand writing it. It +> saves some line of code. 
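
The three "Reorder fields" patches above shrink kernel structs by grouping members of similar size so that the ABI needs less alignment padding. The sketch below reproduces the effect with Python's `ctypes`, which follows the platform's native C layout rules. The structs `Padded` and `Reordered` and their field names are hypothetical and are not the kernel structures touched by those patches.

```python
import ctypes


class Padded(ctypes.Structure):
    # Hypothetical layout: 1-byte fields interleaved with 8-byte fields,
    # so padding is inserted before each 8-byte member.
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("addr", ctypes.c_uint64),
        ("valid", ctypes.c_uint8),
        ("next", ctypes.c_uint64),
    ]


class Reordered(ctypes.Structure):
    # Same members grouped by size: 8-byte fields first, 1-byte fields last.
    _fields_ = [
        ("addr", ctypes.c_uint64),
        ("next", ctypes.c_uint64),
        ("flag", ctypes.c_uint8),
        ("valid", ctypes.c_uint8),
    ]


def layout(struct_type):
    """Return ([(field name, offset, size), ...], total size) for a ctypes struct."""
    rows = [
        (name, getattr(struct_type, name).offset, getattr(struct_type, name).size)
        for name, _ in struct_type._fields_
    ]
    return rows, ctypes.sizeof(struct_type)


if __name__ == "__main__":
    for struct_type in (Padded, Reordered):
        rows, total = layout(struct_type)
        print(struct_type.__name__, "total size:", total)
        for name, offset, size in rows:
            print(f"  {name:<6} offset={offset:<3} size={size}")
```

Both structs carry the same 18 bytes of payload; only the padding differs. In the kernel tree, `pahole` reports the same kind of holes for real structures.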
+> + +**[v1: selftests: tc-testing: add one test for flushing explicitly created chain](http://lore.kernel.org/netdev/20230617032033.892064-1-renmingshuai@huawei.com/)** + +> Add the test for additional reference to chains that are explicitly created +> by RTM_NEWCHAIN message +> +> commit c9a82bec02c3 ("net/sched: cls_api: Fix lockup on flushing explicitly +> created chain") +> + +**[v3: net-next:pull request: Introduce Intel IDPF driver](http://lore.kernel.org/netdev/20230616231341.2885622-1-anthony.l.nguyen@intel.com/)** + +> This patch series introduces the Intel Infrastructure Data Path Function +> (IDPF) driver. It is used for both physical and virtual functions. Except +> for some of the device operations the rest of the functionality is the +> same for both PF and VF. IDPF uses virtchnl version2 opcodes and +> structures defined in the virtchnl2 header file which helps the driver +> to learn the capabilities and register offsets from the device +> Control Plane (CP) instead of assuming the default values. +> + +**[v1: net: selftests/ptp: Add support for new timestamp IOCTLs](http://lore.kernel.org/netdev/cover.1686955631.git.alex.maftei@amd.com/)** + +> PTP_SYS_OFFSET_EXTENDED was added in November 2018 in +> and PTP_SYS_OFFSET_PRECISE was added in February 2016 in +> 719f1aa4a671 ("ptp: Add PTP_SYS_OFFSET_PRECISE for driver crosstimestamping") +> + +**[v8: net-next: Brcm ASP 2.0 Ethernet Controller](http://lore.kernel.org/netdev/1686953664-17498-1-git-send-email-justin.chen@broadcom.com/)** + +> Add support for the Broadcom ASP 2.0 Ethernet controller which is first +> introduced with 72165. 
+> + +**[v1: net-next: net: dqs: add NIC stall detector based on BQL](http://lore.kernel.org/netdev/20230616213236.2379935-1-kuba@kernel.org/)** + +> softnet_data->time_squeeze is sometimes used as a proxy for +> host overload or indication of scheduling problems. In practice +> this statistic is very noisy and has hard to grasp units - +> e.g. is 10 squeezes a second to be expected, or high? +> + +**[v2: net-next: gro: move the tc_ext comparison to a helper](http://lore.kernel.org/netdev/20230616204939.2373785-1-kuba@kernel.org/)** + +> The double ifdefs (one for the variable declaration and +> one around the code) are quite aesthetically displeasing. +> Factor this code out into a helper for easier wrapping. +> + +**[v5: net: phy: Add sysfs attribute for PHY c45 identifiers.](http://lore.kernel.org/netdev/20230616144017.12483-1-zhaojh329@gmail.com/)** + +> If a phydevice use c45, its phy_id property is always 0, so +> this adds a c45_ids sysfs attribute group contains mmd id +> attributes from mmd0 to mmd31 to MDIO devices. Note that only +> mmd with valid value will exist. This attribute group can be +> useful when debugging problems related to phy drivers. +> + +**[v1: net-next: Add TJA1120 support](http://lore.kernel.org/netdev/20230616135323.98215-1-radu-nicolae.pirea@oss.nxp.com/)** + +> This patch series got bigger than I expected. It cleans up the +> next-c45-tja11xx driver and adds support for the TJA1120(1000BaseT1 +> automotive phy). +> +> Master/slave custom implementation was replaced with the generic +> implementation (genphy_c45_config_aneg/genphy_c45_read_status). +> + +**[v1: nfc: fdp: Add MODULE_FIRMWARE macros](http://lore.kernel.org/netdev/20230616122218.1036256-1-juerg.haefliger@canonical.com/)** + +> The module loads firmware so add MODULE_FIRMWARE macros to provide that +> information via modinfo.
+> + +**[v1: ieee802154/adf7242: Add MODULE_FIRMWARE macro](http://lore.kernel.org/netdev/20230616121807.1034050-1-juerg.haefliger@canonical.com/)** + +> The module loads firmware so add a MODULE_FIRMWARE macro to provide that +> information via modinfo. +> + +**[v1: net-next: Add and use helper for PCS negotiation modes](http://lore.kernel.org/netdev/ZIxQIBfO9dH5xFlg@shell.armlinux.org.uk/)** + +> Earlier this month, I proposed a helper for deciding whether a PCS +> should use inband negotiation modes or not. There was some discussion +> around this topic, and I believe there was no disagreement about +> providing the helper. +> + +**[v1: net: dpaa2-mac: add 25gbase-r support](http://lore.kernel.org/netdev/20230616111414.1578-1-josua@solid-run.com/)** + +> Layerscape MACs support 25Gbps network speed with dpmac "CAUI" mode. +> Add the mappings between DPMAC_ETH_IF_* and PHY_INTERFACE_MODE_*, as well +> as the 25000 mac capability. +> + +**[v1: net-next: dt-bindings: net: phy: gpy2xx: more precise description](http://lore.kernel.org/netdev/20230616-feature-maxlinear-dt-better-irq-desc-v1-1-57a8936543bf@kernel.org/)** + +> Mention that the interrupt line is just asserted for a random period of +> time, not the entire time. +> + +**[v2: drivers:net:ethernet:Add missing fwnode_handle_put()](http://lore.kernel.org/netdev/20230616092820.1756-1-machel@vivo.com/)** + +> In device_for_each_child_node(), we should have fwnode_handle_put() +> when break out of the iteration device_for_each_child_node() +> as it will automatically increase and decrease the refcounter. +> + +**[v2: net: macsec SCI assignment for ES = 0](http://lore.kernel.org/netdev/20230616092404.12644-1-carlos.fernandez@technica-engineering.de/)** + +> According to 802.1AE standard, when ES and SC flags in TCI are zero, used +> SCI should be the current active SC_RX. Current kernel does not implement +> it and uses the header MAC address.
+> + +**[v1: net-next: ipv6: also use netdev_hold() in ip6_route_check_nh()](http://lore.kernel.org/netdev/20230616085752.3348131-1-edumazet@google.com/)** + +> In blamed commit, we missed the fact that ip6_validate_gw() +> could change dev under us from ip6_route_check_nh() +> +> In this fix, I use GFP_ATOMIC in order to not pass too many additional +> arguments to ip6_validate_gw() and ip6_route_check_nh() only +> for a rarely used debug feature. +> + +#### Security Hardening + +**[v3: Randomized slab caches for kmalloc()](http://lore.kernel.org/linux-hardening/20230616111843.3677378-1-gongruiqi@huaweicloud.com/)** + +> I adapted the v2 patch to the latest linux-next tree and made the v3 +> patch without "RFC", since this idea seems to be acceptable in general +> based on previous discussion with mm and hardening folks. Please check +> the link specified below for more details of the discussion, and further +> suggestions are welcome. +> + +**[v3: usbip: usbip_host: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230615180504.401169-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +**[v2: tracing/boot: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230615180420.400769-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy().
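
The strlcpy() to strscpy() conversions quoted above rest on two behavioural differences: how much of the source is read, and what the return value means. The following is a minimal Python model of those semantics as they are commonly documented. It is not kernel code, and the names `strlcpy_model`, `strscpy_model`, and `c_strlen` are illustrative helpers invented for this sketch.

```python
import errno


def c_strlen(buf: bytes) -> int:
    """Length up to the first NUL byte.

    Raises ValueError when no NUL is present; in C, strlen() would keep
    reading past the end of the buffer instead.
    """
    end = buf.find(b"\0")
    if end < 0:
        raise ValueError("source is not NUL-terminated")
    return end


def strlcpy_model(src: bytes, size: int):
    """Model of strlcpy(): measures the whole source and returns strlen(src)."""
    src_len = c_strlen(src)  # scans the entire source, whatever `size` is
    if size == 0:
        return b"", src_len
    copied = src[: min(src_len, size - 1)]
    # Truncation is only detectable by the caller checking ret >= size.
    return copied + b"\0", src_len


def strscpy_model(src: bytes, size: int):
    """Model of strscpy(): reads at most `size` bytes, -E2BIG on truncation."""
    if size == 0:
        return b"", -errno.E2BIG
    window = src[:size]  # never looks beyond the destination size
    end = window.find(b"\0")
    if end >= 0:
        return window[: end + 1], end  # bytes copied, excluding the NUL
    return window[: size - 1] + b"\0", -errno.E2BIG
```

In this model, `strlcpy_model(b"hello\0", 4)` yields a truncated copy together with the full source length, while `strscpy_model(b"hello\0", 4)` yields the same copy with a negative error code. That difference is why each conversion has to check whether the old return value was used, which is what the "No return values were used" note in the uml patch refers to.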
+> + +**[v2: usb: gadget: function: printer: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230615180318.400639-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +**[v2: pstore/platform: Add check for kstrdup](http://lore.kernel.org/linux-hardening/20230615025312.48712-1-jiasheng@iscas.ac.cn/)** + +> Add check for the return value of kstrdup() and return the error +> if it fails in order to avoid NULL pointer dereference. +> + +**[v2: usb: ch9: Replace 1-element array with flexible array](http://lore.kernel.org/linux-hardening/20230614181307.gonna.256-kees@kernel.org/)** + +> Since commit df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3"), +> UBSAN_BOUNDS no longer pretends 1-element arrays are unbounded. Walking +> wData will trigger a warning, so make it a proper flexible array. Add a +> union to keep the struct size identical for userspace in case anything +> was depending on the old size. +> + +**[v3: wifi: cfg80211: replace strlcpy() with strscpy()](http://lore.kernel.org/linux-hardening/20230614134956.2109252-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +**[v2: wifi: cfg80211: replace strlcpy() with strlscpy()](http://lore.kernel.org/linux-hardening/20230614134552.2108471-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. 
+> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +**[v3: SUNRPC: Use sysfs_emit in place of strlcpy/sprintf](http://lore.kernel.org/linux-hardening/20230614133757.2106902-1-azeemshaikh38@gmail.com/)** + +> Part of an effort to remove strlcpy() tree-wide [1]. +> +> Direct replacement is safe here since the getter in kernel_params_ops +> handles -errno return [2]. +> + +**[v1: pstore/ram: Add check for kstrdup](http://lore.kernel.org/linux-hardening/20230614093733.36048-1-jiasheng@iscas.ac.cn/)** + +> Add check for the return value of kstrdup() and return the error +> if it fails in order to avoid NULL pointer dereference. +> + +**[v3: uml: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230614003604.1021205-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> No return values were used, so direct replacement is safe. +> + +**[v1: SUNRPC: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230613004054.3539554-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +**[v1: net/mediatek: strlcpy withreturn](http://lore.kernel.org/linux-hardening/20230613003458.3538812-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. 
+> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +**[v1: netfilter: ipset: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230613003437.3538694-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +**[v1: mac80211: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230613003404.3538524-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +**[v1: ieee802154: Replace strlcpy with strscpy](http://lore.kernel.org/linux-hardening/20230613003326.3538391-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. +> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +**[v1: cfg80211: cfg80211: strlcpy withreturn](http://lore.kernel.org/linux-hardening/20230612232301.2572316-1-azeemshaikh38@gmail.com/)** + +> strlcpy() reads the entire source buffer first. +> This read may exceed the destination size limit. 
+> This is both inefficient and can lead to linear read +> overflows if a source string is not NUL-terminated [1]. +> In an effort to remove strlcpy() completely [2], replace +> strlcpy() here with strscpy(). +> + +#### Asynchronous I/O + +**[v2: add initial io_uring_cmd support for sockets](http://lore.kernel.org/io-uring/20230614110757.3689731-1-leitao@debian.org/)** + +> This patchset creates the initial plumbing for an io_uring command for +> sockets. +> +> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ +> and SOCKET_URING_OP_SIOCINQ, which are available in TCP, UDP and RAW +> sockets. +> + +**[v1: io_uring/net: save msghdr->msg_control for retries](http://lore.kernel.org/io-uring/0b0d4411-c8fd-4272-770b-e030af6919a0@kernel.dk/)** + +> If the application sets ->msg_control and we have to later retry this +> command, or if it got queued with IOSQE_ASYNC to begin with, then we +> need to retain the original msg_control value. This is due to the net +> stack overwriting this field with an in-kernel pointer, to copy it +> in. Hitting that path for the second time will now fail the copy from +> user, as it's attempting to copy from a non-user address. +> + +#### Rust For Linux + +**[v2: `scripts/rust_is_available.sh` improvements](http://lore.kernel.org/rust-for-linux/20230616001631.463536-1-ojeda@kernel.org/)** + +> This is the patch series to improve `scripts/rust_is_available.sh`. +> +> The major addition in v2 is the test suite in the last commit. I added +> it because I wanted to have a proper way to test any further changes to +> it (such as the suggested `set --` idea to avoid forking by Masahiro), +> and so that adding new checks was easier to justify too (i.e. vs. the +> added complexity).
+> + +**[v2: Rust abstractions for Crypto API](http://lore.kernel.org/rust-for-linux/20230615142311.4055228-1-fujita.tomonori@gmail.com/)** + +> Before sending v2 of my crypto patch [1] to linux-crypto ml and +> checking the chance of Rust bindings for crypto being accepted, I'd +> like to iron out Rust issues. I'd appreciate any feedback. +> + +**[v1: KUnit integration for Rust doctests](http://lore.kernel.org/rust-for-linux/20230614180837.630180-1-ojeda@kernel.org/)** + +> This is the initial KUnit integration for running Rust documentation +> tests within the kernel. +> +> Thank you to the KUnit team for all the input and feedback on this +> over the months, as well as the Intel LKP 0-Day team! +> + +**[v1: rust: make `UnsafeCell` the outer type in `Opaque`](http://lore.kernel.org/rust-for-linux/20230614115328.2825961-1-aliceryhl@google.com/)** + +> When combining `UnsafeCell` with `MaybeUninit`, it is idiomatic to use +> `UnsafeCell` as the outer type. Intuitively, this is because a +> `MaybeUninit` might not contain a `T`, but we always want the effect +> of the `UnsafeCell`, even if the inner value is uninitialized. +> +> Now, strictly speaking, this doesn't really make a difference. The +> compiler will always apply the `UnsafeCell` effect even if the inner +> value is uninitialized. But I think we should follow the convention +> here. +> + +**[v1: rust: allocator: Prevents mis-aligned allocation](http://lore.kernel.org/rust-for-linux/20230613164258.3831917-1-boqun.feng@gmail.com/)** + +> Currently the KernelAllocator simply passes the size of the type Layout +> to krealloc(), and in theory the alignment requirement from the type +> Layout may be larger than the guarantee provided by SLAB, which means +> the allocated object is mis-aligned. 
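
The mis-alignment concern in the allocator patch above comes down to arithmetic on a layout's size and alignment. The sketch below shows one way to reason about it, and it is not necessarily what the patch itself does. It relies on the documented kmalloc property that power-of-two sized allocations are aligned to at least their size. The constant `ASSUMED_SLAB_MINALIGN` and the function `request_size` are hypothetical names chosen for illustration.

```python
def pad_to_align(size: int, align: int) -> int:
    """Round size up to a multiple of align (align must be a power of two)."""
    if align <= 0 or align & (align - 1):
        raise ValueError("align must be a positive power of two")
    return (size + align - 1) & ~(align - 1)


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (treats n < 1 as 1)."""
    return 1 << (max(n, 1) - 1).bit_length()


# Assumed minimum alignment guaranteed for any allocation size.
# The value 8 is illustrative only; the real value is architecture-specific.
ASSUMED_SLAB_MINALIGN = 8


def request_size(size: int, align: int, min_align: int = ASSUMED_SLAB_MINALIGN) -> int:
    """Size to request so that the returned object satisfies `align`."""
    padded = pad_to_align(size, align)
    if align > min_align:
        # A power-of-two request size is naturally aligned, and because
        # padded >= align the rounded size is at least `align` as well.
        padded = next_power_of_two(padded)
    return padded
```

For example, a 24-byte type with 16-byte alignment would be requested as 32 bytes under these assumptions, while a 24-byte type with 8-byte alignment stays at 24 bytes.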
+> + +**[v1: Rust abstractions for network device drivers](http://lore.kernel.org/rust-for-linux/20230613045326.3938283-1-fujita.tomonori@gmail.com/)** + +> This patchset adds minimum Rust abstractions for network device +> drivers and an example of a Rust network device driver, a simpler +> version of drivers/net/dummy.c. +> + +**[v1: rust: bindgen: upgrade to 0.65.1](http://lore.kernel.org/rust-for-linux/20230612194311.24826-1-aakashsensharma@gmail.com/)** + +> Upgrades bindgen to code-generation for anonymous unions, structs, and enums [7] +> for LLVM-16 based toolchains. +> +> The following upgrade also incorporates `noreturn` support from bindgen +> allowing us to remove useless `loop` calls which was placed as a +> workaround. +> + +#### BPF + +**[v1: dwarves: dwarves: encode BTF kind layout, crcs](http://lore.kernel.org/bpf/20230616171728.530116-11-alan.maguire@oracle.com/)** + +> Encode kind layout at time of BTF encoding via --btf_gen_kind_layout +> and set CRC if --btf_gen_crc is set. +> + +**[v2: bpf-next: bpf: support BTF kind layout info, CRCs](http://lore.kernel.org/bpf/20230616171728.530116-1-alan.maguire@oracle.com/)** + +> By separating parsing BTF from using all the information +> it provides, we allow BTF to encode new features even if +> they cannot be used. This is helpful in particular for +> cases where newer tools for BTF generation run on an +> older kernel; BTF kinds may be present that the kernel +> cannot yet use, but at least it can parse the BTF +> provided. Meanwhile userspace tools with newer libbpf +> may be able to use the newer information. +> + +**[v2: Reduce overhead of LSMs with static calls](http://lore.kernel.org/bpf/20230616000441.3677441-1-kpsingh@kernel.org/)** + +> LSM hooks (callbacks) are currently invoked as indirect function calls. These +> callbacks are registered into a linked list at boot time as the order of the +> LSMs can be configured on the kernel command line with the "lsm=" command line +> parameter. 
+> + +**[v4: bpf-next: xsk: multi-buffer support](http://lore.kernel.org/bpf/20230615172606.349557-1-maciej.fijalkowski@intel.com/)** + +> This series of patches add multi-buffer support for AF_XDP. XDP and +> various NIC drivers already have support for multi-buffer packets. With +> this patch set, programs using AF_XDP sockets can now also receive and +> transmit multi-buffer packets both in copy as well as zero-copy mode. +> ZC multi-buffer implementation is based on ice driver. +> + +**[v1: nf: netfilter: conntrack: Avoid nf_ct_helper_hash uses after free](http://lore.kernel.org/bpf/20230615152918.3484699-1-revest@chromium.org/)** + +> If register_nf_conntrack_bpf() fails (for example, if the .BTF section +> contains an invalid entry), nf_conntrack_init_start() calls +> nf_conntrack_helper_fini() as part of its cleanup path and +> nf_ct_helper_hash gets freed. +> + +**[v1: bpf: bpf/btf: Accept function names that contain dots](http://lore.kernel.org/bpf/20230615145607.3469985-1-revest@chromium.org/)** + +> When building a kernel with LLVM=1, LLVM_IAS=0 and CONFIG_KASAN=y, LLVM +> leaves DWARF tags for the "asan.module_ctor" & co symbols. In turn, +> pahole creates BTF_KIND_FUNC entries for these and this makes the BTF +> metadata validation fail because they contain a dot. +> + +**[v1: bpf-next: bpf: generate 'nomerge' for map helpers in bpf_helper_defs.h](http://lore.kernel.org/bpf/20230615142520.10280-1-eddyz87@gmail.com/)** + +> Update code generation for bpf_helper_defs.h by adding +> __attribute__((nomerge)) for a set of helper functions to prevent some +> verifier unfriendly compiler optimizations. 
+> + +**[v1: fprobe: Release rethook after the ftrace_ops is unregistered](http://lore.kernel.org/bpf/20230615115236.3476617-1-jolsa@kernel.org/)** + +> While running bpf selftests it's possible to get following fault: +> + +**[v1: net: igc: Avoid dereference of ptr_err in igc_clean_rx_irq()](http://lore.kernel.org/bpf/20230615-igc-err-ptr-v1-1-a17145eb8d62@kernel.org/)** + +> In igc_clean_rx_irq() the result of a call to igc_xdp_run_prog() is assigned +> to the skb local variable. This may be an ERR_PTR. +> +> A little later the following is executed, which seems to be a +> possible dereference of an ERR_PTR. +> +> total_bytes += skb->len; +> + +**[v2: perf/core: Bail out early if the request AUX area is out of bound](http://lore.kernel.org/bpf/20230613123211.58393-1-xueshuai@linux.alibaba.com/)** + +> 'rb->aux_pages' allocated by kcalloc() is a pointer array which is used to +> maintains AUX trace pages. The allocated page for this array is physically +> contiguous (and virtually contiguous) with an order of 0..MAX_ORDER. If the +> size of pointer array crosses the limitation set by MAX_ORDER, it reveals a +> WARNING. +> + +**[v1: bpf: Force kprobe multi expected_attach_type for kprobe_multi link](http://lore.kernel.org/bpf/20230613113119.2348619-1-jolsa@kernel.org/)** + +> We currently allow to create perf link for program with +> expected_attach_type == BPF_TRACE_KPROBE_MULTI. +> +> This will cause crash when we call helpers like get_attach_cookie or +> get_func_ip in such program, because it will call the kprobe_multi's +> version (current->bpf_ctx context setup) of those helpers while it +> expects perf_link's current->bpf_ctx context setup. +> + +**[v2: bpf-next: Add SO_REUSEPORT support for TC bpf_sk_assign](http://lore.kernel.org/bpf/20230613-so-reuseport-v2-0-b7c69a342613@isovalent.com/)** + +> We want to replace iptables TPROXY with a BPF program at TC ingress. 
+> To make this work in all cases we need to assign a SO_REUSEPORT socket +> to an skb, which is currently prohibited. This series adds support for +> such sockets to bpf_sk_assign. See patch 5 for details. +> + +**[v6: bpf-next: Add benchmark for bpf memory allocator](http://lore.kernel.org/bpf/20230613080921.1623219-1-houtao@huaweicloud.com/)** + +> This patchset includes some trivial fixes for benchmark framework and +> a new benchmark for bpf memory allocator originated from handle-reuse +> patchset. Because the htab-mem benchmark depends on the fixes, I post these +> patches together. +> + +**[v2: lib/test_bpf: Call page_address() on page acquired with GFP_KERNEL flag](http://lore.kernel.org/bpf/20230613071756.GA359746@sumitra.com/)** + +> generate_test_data() acquires a page with alloc_page(GFP_KERNEL). Pages +> allocated with GFP_KERNEL cannot come from Highmem. This is why +> there is no need to call kmap() on them. +> +> Therefore, use a plain page_address() on that page. +> + +**[v5: bpf-next: bpf, x86: allow function arguments up to 12 for TRACING](http://lore.kernel.org/bpf/20230613025226.3167956-1-imagedong@tencent.com/)** + +> Therefore, let's enhance it by increasing the function arguments count +> allowed in arch_prepare_bpf_trampoline(), for now, only x86_64. +> +> In the 1st patch, we clean garbage value in upper bytes of the trampoline +> when we store the arguments from regs into stack. +> +> In the 2nd patch, we make arch_prepare_bpf_trampoline() support to copy +> function arguments in stack for x86 arch. Therefore, the maximum +> arguments can be up to MAX_BPF_FUNC_ARGS for FENTRY and FEXIT. Meanwhile, +> we clean the potential garbage value when we copy the arguments on-stack. +> + +**[v1: bpf-next: bpf: netdev TX metadata](http://lore.kernel.org/bpf/20230612172307.3923165-1-sdf@google.com/)** + +> The goal of this series is to add two new standard-ish places +> in the transmit path: +> +> 1.
Right before the packet is transmitted (with access to TX +> descriptors) +> 2. Right after the packet is actually transmitted and we've received the +> completion (again, with access to TX completion descriptors) +> + +**[v5: bpf-next: verify scalar ids mapping in regsafe()](http://lore.kernel.org/bpf/20230612160801.2804666-1-eddyz87@gmail.com/)** + +> This example is unsafe because not all execution paths verify r7 range. +> Because of the jump at (4) the verifier would arrive at (6) in two states: +> I. r6{.id=b}, r7{.id=b} via path 1-6; +> II. r6{.id=a}, r7{.id=b} via path 1-4, 6. +> +> Currently regsafe() does not call check_ids() for scalar registers, +> thus from POV of regsafe() states (I) and (II) are identical. +> +> The change is split in two parts: +> - patches #1,2: update for mark_chain_precision() to propagate +> precision marks through scalar IDs. +> - patches #3,4: update for regsafe() to use a special version of +> check_ids() for precise scalar values. +> + +**[v3: bpf-next: bpf: Support ->fill_link_info for kprobe_multi and perf_event links](http://lore.kernel.org/bpf/20230612151608.99661-1-laoar.shao@gmail.com/)** + +> This patchset enhances the usability of kprobe_multi programs by introducing +> support for ->fill_link_info. This allows users to easily determine the +> probed functions associated with a kprobe_multi program. While +> `bpftool perf show` already provides information about functions probed by +> perf_event programs, supporting ->fill_link_info ensures consistent access to +> this information across all bpf links. +> + +**[v4: net-next: introduce page_pool_alloc() API](http://lore.kernel.org/bpf/20230612130256.4572-1-linyunsheng@huawei.com/)** + +> In [1] & [2], there are usecases for veth and virtio_net to +> use frag support in page pool to reduce memory usage, and it +> may request different frag size depending on the head/tail +> room space for xdp_frame/shinfo and mtu/packet size. 
When the +> requested frag size is large enough that a single page can not +> be split into more than one frag, using frag support only incurs a +> performance penalty because of the extra frag count handling +> for frag support. +> + +**[v1: lib/test_bpf: Replace kmap() with kmap_local_page()](http://lore.kernel.org/bpf/20230612103341.GA354790@sumitra.com/)** + +> kmap() has been deprecated in favor of the kmap_local_page() +> due to high cost, restricted mapping space, the overhead of +> a global lock for synchronization, and making the process +> sleep in the absence of free slots. +> + +**[v1: Add a sysctl option to disable bpf offensive helpers.](http://lore.kernel.org/bpf/20230610152618.105518-1-clangllvm@126.com/)** + +> Some eBPF helper functions have been long regarded as problematic[1]. +> More than just used for powerful rootkit, these features can also be +> exploited to harm the containers by performing various attacks on the +> processes outside the container in the entire VM, such as process +> DoS, information theft, and container escape. +> + +### Related Technology Updates + +#### Qemu + +**[v1: hw/riscv/virt.c: check for 'ssaia' with KVM AIA](http://lore.kernel.org/qemu-devel/20230616172141.756386-1-dbarboza@ventanamicro.com/)** + +> This patch was inspired by my review and testing of the QEMU KVM AIA +> work. It's not dependent on it though, and can be reviewed and merged +> separately.
+> + +**[v2: target/riscv: Add support for BF16 extensions](http://lore.kernel.org/qemu-devel/20230615063302.102409-1-liweiwei@iscas.ac.cn/)** + +> Specification for BF16 extensions can be found in: +> https://github.com/riscv/riscv-bfloat16 +> +> The port is available here: +> https://github.com/plctlab/plct-qemu/tree/plct-bf16-upstream-v2 +> + +**[v1: riscv-to-apply queue](http://lore.kernel.org/qemu-devel/20230614012017.3100663-1-alistair.francis@wdc.com/)** + +> The following changes since commit fdd0df5340a8ebc8de88078387ebc85c5af7b40f: +> +> Merge tag 'pull-ppc-20230610' of https://gitlab.com/danielhb/qemu into staging (2023-06-10 07:25:00 -0700) +> +> are available in the Git repository at: +> +> https://github.com/alistair23/qemu.git tags/pull-riscv-to-apply-20230614 +> +> for you to fetch changes up to 860029321d9ebdff47e89561de61e9441fead70a: +> + +**[v2: disas/riscv: Add vendor extension support](http://lore.kernel.org/qemu-devel/20230612111034.3955227-1-christoph.muellner@vrull.eu/)** + +> This series adds vendor extension support to the QEMU disassembler +> for RISC-V. 
The following vendor extensions are covered: +> * XThead{Ba,Bb,Bs,Cmo,CondMov,FMemIdx,Fmv,Mac,MemIdx,MemPair,Sync} +> * XVentanaCondOps +> + ## 20230611:第 49 期 ### 内核动态 diff --git a/ppt/qemu-boot-hang-debug-upstream-practice.pdf b/ppt/qemu-boot-hang-debug-upstream-practice.pdf new file mode 100644 index 0000000000000000000000000000000000000000..9fd4c8eabef2c628b72c5e3df715a53faf59e2f3 Binary files /dev/null and b/ppt/qemu-boot-hang-debug-upstream-practice.pdf differ diff --git a/ppt/riscv-linux-template-v5.pptx b/ppt/riscv-linux-template-v5.pptx deleted file mode 100755 index 808da91233a24ee9fafeca82909e50f5fc4e3ac1..0000000000000000000000000000000000000000 Binary files a/ppt/riscv-linux-template-v5.pptx and /dev/null differ diff --git a/ppt/riscv-linux-template-v6.pptx b/ppt/riscv-linux-template-v6.pptx new file mode 100755 index 0000000000000000000000000000000000000000..eb3d92a54fb510128b99cbb6044af61e29e89540 Binary files /dev/null and b/ppt/riscv-linux-template-v6.pptx differ diff --git a/ppt/riscv-semihosting.pdf b/ppt/riscv-semihosting.pdf new file mode 100644 index 0000000000000000000000000000000000000000..3bfece47dd1b8c593e16bae4d6be02294f559d6e Binary files /dev/null and b/ppt/riscv-semihosting.pdf differ diff --git a/ppt/tinyget-package_manager_history.pdf b/ppt/tinyget-package_manager_history.pdf new file mode 100755 index 0000000000000000000000000000000000000000..664ef2c07a2c0ef6465e2ba73dbe718c3af5c391 Binary files /dev/null and b/ppt/tinyget-package_manager_history.pdf differ diff --git a/ppt/tsoc2023-launch.pdf b/ppt/tsoc2023-launch.pdf new file mode 100644 index 0000000000000000000000000000000000000000..23c432a692b7fd9897b5a2a9803e1fe6ab387d76 Binary files /dev/null and b/ppt/tsoc2023-launch.pdf differ