Skip to content
On this page

记一次中断延迟分析


软件版本硬件版本更新内容
linux 5.4arm64

1. 问题描述

在我们的设备中存在一个camera, 帧率为25帧每秒,那么就是每40ms一帧,在 SOF(start of frame) 中断时会记录下来帧的时间戳,方便上层使用,但是在测试中发现时间戳存在抖动,如在每0ms来一个帧,下一帧在每40ms, 正常情况应该在第80ms再来一帧,但是实际的情况是在第88ms来帧,帧间隔是48ms, 再下一帧会在第120ms到来是正确的。

主要的问题是第80ms的这帧数据存在问题,导致与上帧的间隔为48ms,存在问题。

2. 问题分析

camera sof的时间戳生成是在中断里面,时间戳获取出现偏差,可能有如下两个原因

  • camera本身出图存在抖动问题
  • 系统侧存在中断延迟

3. 实验

3.1 camera本身是否存在问题

这个很简单,连接示波器查看camera的输出,发现不存在问题。那么只能中断延迟的问题了。

3.2 中断延迟问题

cameraSOF中断服务函数中上加上trace, 查看中断服务函数的调用是否是每40ms一次,不存在抖动,拿到的trace如下:

c
          <idle>-0       [001] d.h1    22.638848: ais_isp_vfe_irq_activated: vfe0: irq 0x08000200 0x00000000
        adsprpcd-402     [001] d.h.    22.678901: ais_isp_vfe_irq_activated: vfe0: irq 0x08000000 0x00000000
        adsprpcd-402     [001] d.h.    22.718894: ais_isp_vfe_irq_activated: vfe0: irq 0x08000000 0x00000000
        adsprpcd-402     [001] d.h1    22.758908: ais_isp_vfe_irq_activated: vfe0: irq 0x08000000 0x00000000
        adsprpcd-402     [001] d.h.    22.798925: ais_isp_vfe_irq_activated: vfe0: irq 0x08000000 0x00000000
           <...>-1       [001] d.H1    22.845671: ais_isp_vfe_irq_activated: vfe0: irq 0x08000200 0x00000000
          <idle>-0       [001] d.h1    22.878859: ais_isp_vfe_irq_activated: vfe0: irq 0x08000200 0x00000000
          <idle>-0       [001] dNh1    22.918906: ais_isp_vfe_irq_activated: vfe0: irq 0x08000000 0x00000000
          <idle>-0       [001] d.h1    22.958851: ais_isp_vfe_irq_activated: vfe0: irq 0x08000200 0x00000000
          <idle>-0       [001] dNh1    22.998891: ais_isp_vfe_irq_activated: vfe0: irq 0x08000000 0x00000000
          <idle>-0       [001] dNh1    23.038891: ais_isp_vfe_irq_activated: vfe0: irq 0x08000000 0x00000000

trace信息可以看到在22.845671处,与上帧的间隔为46ms左右存在中断延迟的问题。

4. 是谁造成的中断延迟

中断处理分为阶段

  1. 中断产生
  2. 中断控制分发
  3. CPU响应中断
  4. CPU执行中断处理函数

其中中断产生和中断分发完全是硬件逻辑,不太可能存在问题,即使存在问题也就是硬件问题,需要换硬件才能解决,第4个阶段是中断处理函数的问题,也不可能,因为上面中断tarce是加在中断服务函数的最开始,不可能所以是调用中断处理函数就晚了。那只是第3个阶段出了问题。

对于第3个阶段,基本就是在一些情况会关掉cpu的本地中断,导致无法及时响应中断,下面来看是谁关了中断。

include/linux/irqflags.h中可以看到如下代码:

c
#ifdef CONFIG_TRACE_IRQFLAGS
#define local_irq_enable() \
	do { trace_hardirqs_on(); raw_local_irq_enable(); } while (0)
#define local_irq_disable() \
	do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
#define local_irq_save(flags)				\
	do {						\
		raw_local_irq_save(flags);		\
		trace_hardirqs_off();			\
	} while (0)


#define local_irq_restore(flags)			\
	do {						\
		if (raw_irqs_disabled_flags(flags)) {	\
			raw_local_irq_restore(flags);	\
			trace_hardirqs_off();		\
		} else {				\
			trace_hardirqs_on();		\
			raw_local_irq_restore(flags);	\
		}					\
	} while (0)

#define safe_halt()				\
	do {					\
		trace_hardirqs_on();		\
		raw_safe_halt();		\
	} while (0)


#else /* !CONFIG_TRACE_IRQFLAGS */

也就是说只要配置 CONFIG_TRACE_IRQFLAGS 就可以trace到是谁关了中断,为防止tarce不到,还配置了CONFIG_TRACE_IRQFLAGS_SUPPORTCONFIG_TRACE_IRQFLAGS.

c
+CONFIG_TRACE_IRQFLAGS_SUPPORT=y
+CONFIG_TRACE_IRQFLAGS=y
+CONFIG_IRQSOFF_TRACER=y

抓到的trace信息如下:

c
           <...>-1       [001] d...    22.827257: irq_disable: caller=vprintk_emit+0x80/0x394 parent=devkmsg_emit+0x80/0xa8
           <...>-1       [001] d...    22.827266: irq_enable: caller=vprintk_emit+0xfc/0x394 parent=devkmsg_emit+0x80/0xa8
           <...>-1       [001] d..1    22.827268: irq_disable: caller=__down_trylock_console_sem+0x28/0x7c parent=vprintk_emit+0x1f8/0x394
           <...>-1       [001] d..1    22.827270: irq_enable: caller=__down_trylock_console_sem+0x5c/0x7c parent=vprintk_emit+0x1f8/0x394
           <...>-1       [001] d..1    22.827272: irq_disable: caller=console_unlock+0xac/0x59c parent=vprintk_emit+0x2e4/0x394
           <...>-1       [001] d..1    22.845480: irq_enable: caller=console_unlock+0x468/0x59c parent=vprintk_emit+0x2e4/0x394
           <...>-1       [001] d..1    22.845484: irq_disable: caller=el1_irq+0xb4/0x200 parent=console_unlock+0x46c/0x59c
           <...>-1       [001] d.h1    22.845492: irq_handler_entry: irq=3 name=arch_timer
           <...>-1       [001] d.h1    22.845650: irq_handler_exit: irq=3 ret=handled
           <...>-1       [001] d.s1    22.845653: irq_enable: caller=__do_softirq+0x10c/0x4d4 parent=irq_exit+0xc8/0xcc
           <...>-1       [001] d.s1    22.845655: irq_disable: caller=el1_irq+0xb4/0x200 parent=__do_softirq+0x114/0x4d4
           <...>-1       [001] d.H1    22.845661: irq_handler_entry: irq=168 name=ife
           <...>-1       [001] d.H1    22.845671: ais_isp_vfe_irq_activated: vfe0: irq 0x08000200 0x00000000
           <...>-1       [001] d.H1    22.845724: irq_handler_exit: irq=168 ret=handled
           <...>-1       [001] d.s1    22.845727: irq_enable: caller=el1_irq+0x128/0x200 parent=__do_softirq+0x114/0x4d4
           <...>-1       [001] d.s1    22.845729: irq_disable: caller=el1_irq+0xb4/0x200 parent=__do_softirq+0x114/0x4d4


可以看到

<...>-1 [001] d.H1 22.845671: ais_isp_vfe_irq_activated: vfe0: irq 0x08000200 0x00000000 在这个SOF中断之前是因为 <...>-1 [001] d..1 22.827272: irq_disable: caller=console_unlock+0xac/0x59c parent=vprintk_emit+0x2e4/0x394 关掉了中断,从22.82727222.845480 长达18ms之久,我们camera的sof中断本来是在22.840000左右的时刻产生中断,因为关了中断无法得到cpu响应,直到22.845671才得到处理,所以记录了一个错误的时间戳。

通过gdbvmlinux 来看一下那段代码关了中断:

c
=> aarch64-linux-gnu-gdb ./vmlinux
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-linux-gnu --target=aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./vmlinux...
(gdb) list *(console_unlock+0xac)
0xffffffc010348a18 is in console_unlock (android/kernel/msm-5.4/kernel/printk/printk.c:2502).
2497            for (;;) {
2498                    struct printk_log *msg;
2499                    size_t ext_len = 0;
2500                    size_t len;
2501
2502                    printk_safe_enter_irqsave(flags);
2503                    raw_spin_lock(&logbuf_lock);
2504                    if (console_seq < log_first_seq) {
2505                            len = sprintf(text,
2506                                          "** %llu printk messages dropped **\n",
(gdb) list *(console_unlock+0x468)
0xffffffc010348dd4 is in console_unlock (android/kernel/msm-5.4/arch/arm64/include/asm/irqflags.h:118).
113     /*
114      * restore saved IRQ state
115      */
116     static inline void arch_local_irq_restore(unsigned long flags)
117     {
118             asm volatile(ALTERNATIVE(
119                             "msr    daif, %0\n"
120                             "nop",
121                             __msr_s(SYS_ICC_PMR_EL1, "%0")
122                             "dsb    sy",
(gdb)

代码在kernel/printk/printk.c

c
		printk_safe_enter_irqsave(flags);
		raw_spin_lock(&logbuf_lock);
		if (console_seq < log_first_seq) {
			len = sprintf(text,
				      "** %llu printk messages dropped **\n",
				      log_first_seq - console_seq);

			/* messages are gone, move to first one */
			console_seq = log_first_seq;
			console_idx = log_first_idx;
		} else {
			len = 0;
		}
skip:
		if (console_seq == log_next_seq)
			break;

		msg = log_from_idx(console_idx);
		if (suppress_message_printing(msg->level)) {
			/*
			 * Skip record we have buffered and already printed
			 * directly to the console when we received it, and
			 * record that has level above the console loglevel.
			 */
			console_idx = log_next(console_idx);
			console_seq++;
			goto skip;
		}

		/* Output to all consoles once old messages replayed. */
		if (unlikely(exclusive_console &&
			     console_seq >= exclusive_console_stop_seq)) {
			exclusive_console = NULL;
		}

		len += msg_print_text(msg,
				console_msg_format & MSG_FORMAT_SYSLOG,
				printk_time, text + len, sizeof(text) - len);
		if (nr_ext_console_drivers) {
			ext_len = msg_print_ext_header(ext_text,
						sizeof(ext_text),
						msg, console_seq);
			ext_len += msg_print_ext_body(ext_text + ext_len,
						sizeof(ext_text) - ext_len,
						log_dict(msg), msg->dict_len,
						log_text(msg), msg->text_len);
		}
		console_idx = log_next(console_idx);
		console_seq++;
		raw_spin_unlock(&logbuf_lock);

		/*
		 * While actively printing out messages, if another printk()
		 * were to occur on another CPU, it may wait for this one to
		 * finish. This task can not be preempted if there is a
		 * waiter waiting to take over.
		 */
		console_lock_spinning_enable();

		stop_critical_timings();	/* don't trace print latency */
		call_console_drivers(ext_text, ext_len, text, len);
		start_critical_timings();

		if (console_lock_spinning_disable_and_check()) {
			printk_safe_exit_irqrestore(flags);
			return;
		}

		printk_safe_exit_irqrestore(flags);

是在 printk_safe_enter_irqsave 关了,在 printk_safe_exit_irqrestore, 那么为什么会关18ms的中断,原因这段代码是时行串口log输出,androdi层通过**/dev/kmesg** 来进行串口log输出,因为我们的串口使用的波特率是115200,上层输出一个log长度大概在200个字节左右,所以需要17ms左右。

5. 解决

需要禁用串口输出,但是这样无法输出log,好在android的编译user版本或者说performance 版本会禁用串口。分析的过程相对比较长,原因比较清楚,但最后实际也是解决,业务决定的。


提示

欢迎评论、探讨,如果发现错误请指正。转载请注明出处! 探索者


Released under the MIT License.