6.828-Lab4
The outcome of four or five days of debugging:
This lab is about multitasking. The features implemented are fork, copy-on-write fork, cooperative (non-preemptive) scheduling, preemptive scheduling, and inter-process communication. It is one of the harder labs, so let's first summarize the key concepts.
Key concepts
Multiprocessors
Most computers today are multiprocessor machines. The previous labs only ever used one processor; this lab uses multiple CPUs, which also makes debugging harder.
What is per-CPU?
Hardware: the TSS and TSS descriptor, and all the registers.
The whole point of multiple processors is performance, which requires each CPU to execute independently; since each CPU executes on its own, the registers must be private to each CPU. Likewise, each CPU must be able to switch tasks independently, so the TSS is per-CPU as well.
Software: the kernel stack and the currently running task.
The kernel stack obviously has to be private; if several CPUs used the same stack at once, the consequences would be disastrous.
What is shared between CPUs?
Hardware: memory, disks, and other devices.
Software: memory-management state and hardware-related code.
The page tables and all the tasks are shared among the CPUs.
How do multiple CPUs handle interrupts?
Following the IA-32 manual's classification, events are divided into interrupts and exceptions. Interrupts are further split into software interrupts and hardware interrupts, and exceptions into faults, traps, and aborts.
The classification looks like this:
- Interrupts
  - software interrupts
  - hardware interrupts
    - maskable interrupts
    - non-maskable interrupts
- Exceptions
  - fault
  - trap
  - abort
Exceptions are synchronous: with multiple CPUs, whichever CPU triggers the exception handles it.
Interrupts are asynchronous: the I/O APIC distributes them to a local APIC, and the local APIC delivers them to its own CPU for handling.
Starting the other processors
When the computer boots, only one processor, the Bootstrap Processor (BSP), is running. Once the operating system is up, the BSP starts the other processors, the Application Processors (APs), one by one.
The startup flow is:
- The BSP learns its own CPU id and the ids of the other CPUs via the LAPIC.
- The BSP sends an IPI (inter-processor interrupt) via the LAPIC to wake each AP in turn.
- On receiving the IPI, the AP starts in real mode, initializes itself, and switches to protected mode; the details are analyzed below.
Here is how the BSP starts the APs in JOS:
i386_init in kern/init.c calls boot_aps().
boot_aps calls lapic_startap, passing the CPU id and the address of the AP bootstrap code.
lapic_startap, defined in kern/lapic.c, sends the IPI that tells the AP it's time to get to work.
Now the AP takes the stage. It starts in real mode and runs the bootstrap code the BSP pointed it at via the IPI, which here is mpentry_start, defined in kern/mpentry.S:
```asm
#define RELOC(x) ((x) - KERNBASE)
#define MPBOOTPHYS(s) ((s) - mpentry_start + MPENTRY_PADDR)

.set PROT_MODE_CSEG, 0x8	# kernel code segment selector
.set PROT_MODE_DSEG, 0x10	# kernel data segment selector

.code16
.globl mpentry_start
mpentry_start:
	cli

	xorw	%ax, %ax
	movw	%ax, %ds
	movw	%ax, %es
	movw	%ax, %ss

	# switch on protected mode
	lgdt	MPBOOTPHYS(gdtdesc)
	movl	%cr0, %eax
	orl	$CR0_PE, %eax
	movl	%eax, %cr0

	ljmpl	$(PROT_MODE_CSEG), $(MPBOOTPHYS(start32))

.code32
start32:
	movw	$(PROT_MODE_DSEG), %ax
	movw	%ax, %ds
	movw	%ax, %es
	movw	%ax, %ss
	movw	$0, %ax
	movw	%ax, %fs
	movw	%ax, %gs

	# Set up initial page table. We cannot use kern_pgdir yet because
	# we are still running at a low EIP.
	movl	$(RELOC(entry_pgdir)), %eax
	movl	%eax, %cr3

	# Turn on paging.
	movl	%cr0, %eax
	orl	$(CR0_PE|CR0_PG|CR0_WP), %eax
	movl	%eax, %cr0

	# Switch to the per-cpu stack allocated in boot_aps()
	movl	mpentry_kstack, %esp
	movl	$0x0, %ebp	# nuke frame pointer

	# Call mp_main().  (Exercise for the reader: why the indirect call?)
	# "call mp_main" would be a PC-relative call;
	# the indirect call below is an absolute jump. Find the answer in the opcode!
	movl	$mp_main, %eax
	call	*%eax

	# If mp_main returns (it shouldn't), loop.
spin:
	jmp	spin

# Bootstrap GDT
.p2align 2			# force 4 byte alignment
gdt:
	SEG_NULL				# null seg
	SEG(STA_X|STA_R, 0x0, 0xffffffff)	# code seg
	SEG(STA_W, 0x0, 0xffffffff)		# data seg

gdtdesc:
	.word	0x17			# sizeof(gdt) - 1
	.long	MPBOOTPHYS(gdt)		# address gdt

.globl mpentry_end
mpentry_end:
	nop
```
After finishing this assembly, the AP jumps to mp_main at a high address, where it sets up its page table, does some initialization work, and then starts scheduling environments.
Lab
exercise 1
Implement mmio_map_region in kern/pmap.c. To see how this is used, look at the beginning of lapic_init in kern/lapic.c. You'll have to do the next exercise, too, before the tests for mmio_map_region will run.
The MMIOBASE-MMIOLIM region is used to map I/O devices into memory. This one is fairly simple; just be careful with the return value and with advancing base.
exercise 2
Modify your implementation of page_init() in kern/pmap.c to avoid adding the page at MPENTRY_PADDR to the free list, so that we can safely copy and run AP bootstrap code at that physical address.
AP startup uses the physical page at 0x7000 (MPENTRY_PADDR), so don't add that page to page_free_list.

```c
size_t i;
for (i = 1; i < npages_basemem; i++) {
	if (page2pa(&pages[i]) == MPENTRY_PADDR)
		continue;
	pages[i].pp_ref = 0;
	pages[i].pp_link = page_free_list;
	page_free_list = &pages[i];
}
```
Question 1
What is the purpose of macro MPBOOTPHYS?
It computes a symbol's actual load address: the symbol's offset within the entry code plus MPENTRY_PADDR (0x7000).
Why is it necessary in kern/mpentry.S but not in boot/boot.S?
boot.S is linked at the same address it is loaded at, whereas mpentry.S is linked above KERNBASE but loaded at MPENTRY_PADDR.
exercise 3
Modify mem_init_mp() (in kern/pmap.c) to map per-CPU stacks starting at KSTACKTOP, as shown in inc/memlayout.h. The size of each stack is KSTKSIZE bytes plus KSTKGAP bytes of unmapped guard pages. Your code should pass the new check in check_kern_pgdir().
Map a kernel stack for each CPU, leaving an unmapped guard gap below each one.

```c
static void
mem_init_mp(void)
{
	uint32_t stack_top = KSTACKTOP;

	for (int i = 0; i < NCPU; i++) {
		// map KSTKSIZE bytes backed by percpu_kstacks[i];
		// the KSTKGAP below each stack stays unmapped as a guard
		boot_map_region(kern_pgdir, stack_top - KSTKSIZE, KSTKSIZE,
				PADDR(&percpu_kstacks[i]), PTE_W);
		stack_top -= KSTKSIZE + KSTKGAP;
	}
}
```
exercise 4
The code in trap_init_percpu() (kern/trap.c) initializes the TSS and TSS descriptor for the BSP. It worked in Lab 3, but is incorrect when running on other CPUs. Change the code so that it can work on all CPUs. (Note: your new code should not use the global ts variable any more.)
Fix the code that initializes the TSS and TSS descriptor. As noted in the key concepts above, registers are per-CPU, so each CPU must initialize its own TSS and load its own TSS descriptor from its own GDT slot. (Sharing one slot would not work: loading TR marks the descriptor busy, and loading an already-busy TSS descriptor faults.)

```c
void
trap_init_percpu(void)
{
	int id = thiscpu->cpu_id;
	struct Taskstate *thists = &thiscpu->cpu_ts;

	thists->ts_esp0 = KSTACKTOP - id * (KSTKSIZE + KSTKGAP);
	thists->ts_ss0 = GD_KD;
	thists->ts_iomb = sizeof(struct Taskstate);

	// Initialize this CPU's own TSS slot of the gdt.
	gdt[(GD_TSS0 >> 3) + id] = SEG16(STS_T32A, (uint32_t)thists,
					 sizeof(struct Taskstate) - 1, 0);
	gdt[(GD_TSS0 >> 3) + id].sd_s = 0;

	// Load the TSS selector (like other segment selectors, the
	// bottom three bits are special; we leave them 0)
	ltr(GD_TSS0 + (id << 3));

	// Load the IDT
	lidt(&idt_pd);
}
```
exercise 5
Apply the big kernel lock as described above, by calling lock_kernel() and unlock_kernel() at the proper locations.
Take one big lock and lock the whole kernel, so that at any moment only one env can be inside the kernel.
The kernel's entry point is trap and its exit point is env_run; there are also two initialization sites, i386_init on the BSP and mp_main on the APs.
The lock/unlock locations are all named above, so I won't paste the code.
Question 2
It seems that using the big kernel lock guarantees that only one CPU can run the kernel code at a time. Why do we still need separate kernel stacks for each CPU? Describe a scenario in which using a shared kernel stack will go wrong, even with the protection of the big kernel lock.
We lock the kernel with one big lock, so at any moment at most one CPU can be running in kernel mode. Why, then, does each CPU still need its own kernel stack?
See this predecessor's github for reference.
Trapping from user mode into the kernel goes vectorX -> alltraps -> trap.
Look at the code of trap:

```c
void
trap(struct Trapframe *tf)
{
	// ... omitted ...
	if ((tf->tf_cs & 3) == 3) {
		// acquire the big kernel lock
		lock_kernel();
	}
	// ... omitted ...
}
```

The kernel lock is taken inside trap. Once one CPU holds the lock, any other CPU that wants it blocks, but before blocking it has already written to the stack: the registers the hardware pushes automatically on the trap, the registers pushed in vectorX, and the registers pushed in alltraps.
So while the CPU that owns the kernel is handling its trap normally, another CPU that received an interrupt suddenly dumps a pile of registers onto the same stack, and everything falls apart.
exercise 6
Implement round-robin scheduling in sched_yield() as described above. Don't forget to modify syscall() to dispatch sys_yield().
Implement round-robin scheduling: loop over all the envs and pick out a runnable one to execute.

```c
void
sched_yield(void)
{
	// Scan the env table for a runnable env, starting just after
	// the env currently running on this CPU.
	struct Env *theOne = NULL;
	int eidx = 0;

	if (curenv)
		eidx = ENVX(curenv->env_id + 1);

	for (int i = 0; i < NENV; i++) {
		if (envs[eidx].env_status == ENV_RUNNABLE) {
			theOne = &envs[eidx];
			break;
		}
		eidx = (eidx + 1) & (NENV - 1);	// NENV is a power of two
	}

	if (theOne) {
		//cprintf("cpu[%d] found a runnable[%x]\n", cpunum(), theOne->env_id);
		env_run(theOne);
	}

	// No runnable env found; if the env previously running on this
	// CPU is still ENV_RUNNING, keep running it.
	if ((theOne = thiscpu->cpu_env) && theOne->env_status == ENV_RUNNING)
		env_run(theOne);

	// sched_halt never returns
	sched_halt();
}
```
Question 3
Why can the pointer e be dereferenced both before and after the addressing switch?
In other words: after lcr3 switches to a different page table, why can we keep using the pointer e as before?
Every env's env_pgdir is built from kern_pgdir as a template, and e points into envs, which lives in the kernel part of the address space; every env's page table contains that mapping.
Question 4
Whenever the kernel switches from one environment to another, it must ensure the old environment’s registers are saved so they can be restored properly later. Why? Where does this happen?
The running program may be using the registers, so they are saved when it is suspended and restored when it resumes.
Saving happens when trapping into the kernel: partly by the hardware, partly by the code in vectorX and alltraps.
Restoring happens in env_run, or more precisely in env_pop_tf.
exercise 7
Implement the system calls described above in kern/syscall.c and make sure syscall() calls them. You will need to use various functions in kern/pmap.c and kern/env.c, particularly envid2env(). For now, whenever you call envid2env(), pass 1 in the checkperm parameter. Be sure you check for any invalid system call arguments, returning -E_INVAL in that case. Test your JOS kernel with user/dumbfork and make sure it works before proceeding.
Implement the system calls described above. Since they touch processes and page tables they need many checks, so the code is fairly long; see the repository.
Back when I was learning Linux system calls I wanted to know how fork manages to return twice with different values. After writing sys_exofork it is clear: copy the process, then change eax in the copy, and the two return different results. As for returning twice: there are two processes, both sitting in envs, and both will get to run sooner or later.
exercise 8
Implement the sys_env_set_pgfault_upcall system call. Be sure to enable permission checking when looking up the environment ID of the target environment, since this is a "dangerous" system call.

```c
static int
sys_env_set_pgfault_upcall(envid_t envid, void *func)
{
	// LAB 4: Your code here.
	struct Env *theEnv = NULL;

	if (envid2env(envid, &theEnv, 1) < 0)	// checkperm = 1
		return -E_BAD_ENV;

	// extra check: a less privileged env must not set the
	// upcall of a more privileged one
	if (curenv &&
	    (curenv->env_tf.tf_cs & 0x3) > (theEnv->env_tf.tf_cs & 0x3))
		return -E_BAD_ENV;

	theEnv->env_pgfault_upcall = func;
	return 0;
}
```
exercise 9
Implement the code in page_fault_handler in kern/trap.c required to dispatch page faults to the user-mode handler. Be sure to take appropriate precautions when writing into the exception stack. (What happens if the user environment runs out of space on the exception stack?)
The comments plus the code are a bit long, so they live in the repository.
Be careful to check whether the fault happened while already on the exception stack.
Jumping to the handler only requires changing the eip in the trapframe; switching the stack to the exception stack means changing the esp in the tf as well.
"Jumping to the handler" really comes down to calling env_run.
What happens if the exception stack runs out of space?
Another page fault occurs and we trap back into the kernel to handle it; if that is not handled properly, it faults again, and so on in a loop.
exercise 10
Implement the _pgfault_upcall routine in lib/pfentry.S. The interesting part is returning to the original point in the user code that caused the page fault. You'll return directly there, without going back through the kernel. The hard part is simultaneously switching stacks and re-loading the EIP.

```asm
.text
.globl _pgfault_upcall
_pgfault_upcall:
	// Call the C page fault handler.
	pushl %esp		// function argument: pointer to UTF
	movl _pgfault_handler, %eax
	call *%eax
	addl $4, %esp		// pop function argument

	// LAB 4: Your code here.
	// Make room on the trap-time stack and store the faulting
	// eip there, so the final ret can "return" to it.
	subl $4, 48(%esp)	// trap-time esp -= 4
	movl 48(%esp), %edi
	movl 40(%esp), %eax	// eax holds the faulting eip
	movl %eax, (%edi)	// write the faulting eip onto the trap-time stack

	// Restore the trap-time registers.  After you do this, you
	// can no longer modify any general-purpose registers.
	// LAB 4: Your code here.
	addl $8, %esp		// pop fault_va and err
	popal

	// Restore eflags from the stack.  After you do this, you can
	// no longer use arithmetic operations or anything else that
	// modifies eflags.
	// LAB 4: Your code here.
	addl $4, %esp		// skip the saved eip (already copied above)
	popf

	// Switch back to the adjusted trap-time stack.
	// LAB 4: Your code here.
	popl %esp

	// Return to re-execute the instruction that faulted.
	// LAB 4: Your code here.
	ret
```
exercise 11
Finish set_pgfault_handler() in lib/pgfault.c.

```c
void
set_pgfault_handler(void (*handler)(struct UTrapframe *utf))
{
	int r;
	envid_t envid = thisenv->env_id;

	if (_pgfault_handler == 0) {
		// First time through!
		// LAB 4: Your code here.
		// allocate one page for the exception stack
		if ((r = sys_page_alloc(envid, (void *)(UXSTACKTOP - PGSIZE),
					PTE_W | PTE_P | PTE_U)) < 0)
			panic("set_pgfault_handler: sys_page_alloc: %e", r);
	}

	// Save handler pointer for assembly to call.
	_pgfault_handler = handler;

	// register the assembly entry point with the kernel
	if ((r = sys_env_set_pgfault_upcall(envid, _pgfault_upcall)) < 0)
		panic("set_pgfault_handler: sys_env_set_pgfault_upcall: %e", r);
}
```
exercise 12
Implement fork, duppage and pgfault in lib/fork.c.
Notes:
thisenv is set in libmain. That means that if a child process never updates thisenv, then thisenv in every child still points at its "ancestor" env, so remember to fix up thisenv in the child.
User mode obtains the PTE corresponding to a va through UVPT.
exercise 13
Modify kern/trapentry.S and kern/trap.c to initialize the appropriate entries in the IDT and provide handlers for IRQs 0 through 15. Then modify the code in env_alloc() in kern/env.c to ensure that user environments are always run with interrupts enabled.
Adding the IRQ entries is straightforward, but pay attention to whether each gate is set up as a trap gate or an interrupt gate.
Setting the interrupt-enable bit in the eflags of the env's trapframe is all it takes to run every user environment with interrupts enabled.
exercise 14
Modify the kernel's trap_dispatch() function so that it calls sched_yield() to find and run a different environment whenever a clock interrupt takes place.

```c
if (tf->tf_trapno == IRQ_OFFSET + IRQ_TIMER) {
	lapic_eoi();
	sched_yield();
}
```
exercise 15
Implement sys_ipc_recv and sys_ipc_try_send in kern/syscall.c.

```c
static int
sys_ipc_recv(void *dstva)
{
	// LAB 4: Your code here.
	if ((uintptr_t)dstva < UTOP && ((uintptr_t)dstva & (PGSIZE - 1)))
		return -E_INVAL;

	// mark ourselves not runnable so we block until a sender shows up
	curenv->env_status = ENV_NOT_RUNNABLE;
	curenv->env_ipc_recving = 1;
	curenv->env_ipc_dstva = dstva;
	sys_yield();
	return 0;
}

static int
sys_ipc_try_send(envid_t envid, uint32_t value, void *srcva, unsigned perm)
{
	// LAB 4: Your code here.
	debug("[%x] send to [%x]\n", curenv->env_id, envid);

	int r;
	struct Env *recv_env = NULL;

	if ((r = envid2env(envid, &recv_env, 0)) < 0)
		return -E_BAD_ENV;
	if (!recv_env->env_ipc_recving)
		return -E_IPC_NOT_RECV;

	uintptr_t src_va = (uintptr_t)srcva;
	if (src_va < UTOP && (src_va & (PGSIZE - 1)))
		return -E_INVAL;
	// the sender itself must have srcva mapped with the given perm
	if (src_va < UTOP && user_mem_check(curenv, srcva, PGSIZE, perm) < 0)
		return -E_INVAL;

	// checks done, perform the transfer
	recv_env->env_ipc_value = value;
	recv_env->env_ipc_perm = 0;
	if (src_va < UTOP && (uintptr_t)recv_env->env_ipc_dstva < UTOP) {
		// map the physical page behind srcva at dstva in the receiver
		if ((r = sys_page_map(curenv->env_id, srcva,
				      recv_env->env_id, recv_env->env_ipc_dstva,
				      perm)) < 0)
			return r;
		recv_env->env_ipc_perm = perm;
	}
	recv_env->env_ipc_from = curenv->env_id;

	// wake the receiver up and make its sys_ipc_recv return 0
	recv_env->env_tf.tf_regs.reg_eax = 0;	// set recv's return value
	recv_env->env_status = ENV_RUNNABLE;
	recv_env->env_ipc_recving = 0;
	return 0;
}
```