Soft-Lock and Hard-Lock Debug on Arm Server
Terminology
Soft-Lock: a thread occupies a CPU core and never releases it to other threads; a typical example is a software deadlock. The Linux kernel can no longer run normally and the OS appears unresponsive.
Hard-Lock: the CPU core does not respond to interrupts, not even the hardware timer interrupt, or the CPU core is hard-hung because of a hardware issue. The Linux kernel can no longer run normally and the OS appears unresponsive.
Nyquist Sampling Theorem: in signal processing, there is a minimum sampling rate required to perfectly reconstruct a continuous-time signal from its discrete samples. If the continuous-time signal has no frequency components higher than f_max, then 2*f_max is called the Nyquist rate, which is the minimum rate at which the continuous signal must be sampled. A rule of thumb we derive from this for our purposes: if a clock has cycle time ts and we take one sample on each tick of that clock, then finding two consecutive samples more than 2*ts apart lets us conclude that the clock was disabled during that interval and was not working normally.
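The 2*ts rule above can be sketched in a few lines of Python. This is only an illustration: the function name find_stalls, the sample timestamps, and the millisecond unit are all made up for the example and do not come from any tool.

```python
from typing import List, Tuple


def find_stalls(timestamps: List[float], ts: float) -> List[Tuple[float, float]]:
    """Return (start_time, gap) for every pair of consecutive samples
    that are more than 2*ts apart, i.e. where the clock appears stalled."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > 2 * ts:  # the 2*ts rule from the text
            stalls.append((prev, gap))
    return stalls


# Hypothetical samples in milliseconds from a clock with ts = 10 ms.
samples = [0.0, 10.0, 20.0, 65.0, 75.0]
print(find_stalls(samples, ts=10.0))  # [(20.0, 45.0)]
```

A gap of exactly 2*ts is not flagged here, since the rule in the text is "larger than 2ts".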
Reference Tool:
Kernel trace tools (1): Tracking down interrupts and softirqs that stay disabled for too long
Debug Methodology:
The trace-irqoff tool from ByteDance records a timestamp in the hardware timer's IRQ handler and accumulates the time difference between two consecutive hrtimer hardirqs; the same approach is used for the soft timer's softirq.
hrtimer is the high-resolution timer; its cycle time ts_hr is normally smaller than ts_soft, the cycle time of the soft timer.
We use trace-irqoff to show the distribution of the time difference between two consecutive hrtimer hardirqs. If the distribution lands mostly in the region > 2*ts_hr, we know the hrtimer is disabled and the issue is a hard-lock. (case A)
trace-irqoff also shows the distribution of the time difference between two consecutive softirqs. If the hrtimer is not disabled as in case A, and the softirq distribution lands mostly in the region > 2*ts_soft, the issue is a soft-lock. (case B)
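The decision between case A and case B can be written down as a small Python sketch. It does not read trace-irqoff output; the timestamp lists, the function names, and the reading of "mostly" as "more than half of the gaps" are assumptions made for illustration.

```python
from typing import List, Sequence


def gaps(timestamps: Sequence[float]) -> List[float]:
    """Time differences between consecutive timestamps."""
    return [cur - prev for prev, cur in zip(timestamps, timestamps[1:])]


def mostly_over(values: Sequence[float], limit: float, majority: float = 0.5) -> bool:
    """True if more than `majority` of the values exceed `limit`.
    Assumption: "lands mostly in the region" means more than half."""
    if not values:
        return False
    over = sum(1 for v in values if v > limit)
    return over / len(values) > majority


def classify(hardirq_ts: Sequence[float], softirq_ts: Sequence[float],
             ts_hr: float, ts_soft: float) -> str:
    # Case A: hrtimer hardirq gaps mostly exceed 2*ts_hr.
    if mostly_over(gaps(hardirq_ts), 2 * ts_hr):
        return "hard-lock"
    # Case B: hrtimer looks fine, but softirq gaps mostly exceed 2*ts_soft.
    if mostly_over(gaps(softirq_ts), 2 * ts_soft):
        return "soft-lock"
    return "no lock detected"


# Hypothetical timestamps in milliseconds.
hard = [0, 10, 20, 30, 40]   # hardirq gaps of 10 ms with ts_hr = 10 ms
soft = [0, 150, 320, 480]    # softirq gaps of 150-170 ms with ts_soft = 40 ms
print(classify(hard, soft, ts_hr=10, ts_soft=40))  # soft-lock
```

Case A is checked first because, as in the text, a soft-lock is only concluded when the hrtimer itself is not seen to be disabled.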
Debug Process:
From the trace-irqoff result for our issue, we can see that it is case A: our issue is a hard-lock, and the hrtimer is disabled while the lock occurs.
From the stack information, we can also see that the hrtimer is disabled in the cpu_idle function, and no software function disables the hrtimer intentionally. This is more likely a hardware issue: the Arm global timer stops working during the issue, which causes the hrtimer to be disabled.