Clementine, a NASA satellite to test sensors and spacecraft components under extended exposure to the space environment, was launched on 25 January 1994. For want of a few lines of watchdog code, her mission was lost on 7 May 1994.
Clementine had performed lunar mapping for approximately two consecutive months when she left lunar orbit and headed for her next target, the near-Earth asteroid Geographos. Soon, however, one of Clementine’s on-board computers malfunctioned, effectively cutting NASA off from operating the spacecraft and causing one of her thrusters to fire uncontrolled.
NASA spent 20 minutes trying to bring the system back to life, but to no avail. A hardware reset command finally brought Clementine back online, but it was too late: she had already used up all of her fuel, and the mission’s continuation had to be canceled.
In hindsight, the development team responsible for Clementine’s software wished they had used the hardware’s watchdog timer: the software timeouts they had implemented had proved insufficient.
How could a watchdog have helped?
A watchdog is a piece of hardware that is either integrated directly into a microcontroller or attached to it externally. Its main purpose is to trigger error handling (usually a hardware reset) when it can safely assume that the system has hung or is otherwise executing improperly.
A watchdog’s main component is a counter that is initially configured with a certain value and subsequently counts down to zero. The software must regularly reset this counter to its initial value to ensure that it never reaches zero. Otherwise, a malfunction is assumed and, usually, the CPU is reset. This makes the watchdog a last resort, an option taken only when everything else has failed, as could have been the case with Clementine.
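The countdown behavior described above can be modeled in a few lines of C. This is only a host-side sketch for illustration, not a driver for any real watchdog peripheral; the type `WdModel` and the functions `wd_init()`, `wd_feed()`, and `wd_tick()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of a countdown watchdog. All names are hypothetical. */
typedef struct {
    uint32_t reload;   /* value the counter restarts from          */
    uint32_t counter;  /* current countdown value                  */
    bool     expired;  /* latched once the counter has reached zero */
} WdModel;

/* Configure the counter and start counting down from the reload value. */
void wd_init(WdModel *wd, uint32_t reload)
{
    assert(reload > 0u);
    wd->reload  = reload;
    wd->counter = reload;
    wd->expired = false;
}

/* "Feed" the watchdog: restart the countdown from the reload value.
 * Once the model has expired, feeding no longer has any effect. */
void wd_feed(WdModel *wd)
{
    if (!wd->expired) {
        wd->counter = wd->reload;
    }
}

/* Advance the model by one timer tick. Returns true once the counter has
 * reached zero, the point at which real hardware would typically reset
 * the CPU. */
bool wd_tick(WdModel *wd)
{
    if (!wd->expired && --wd->counter == 0u) {
        wd->expired = true;
    }
    return wd->expired;
}
```

In this model, software that calls `wd_feed()` at least once every `reload - 1` ticks never sees `wd_tick()` return true; software that stops feeding does, after at most `reload` ticks.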
How to feed the watchdog
Using a watchdog timer properly, however, takes more than restarting the counter (a process often referred to as “feeding” or “kicking” the watchdog). With a watchdog timer running in their system, developers must carefully choose the watchdog’s timeout period so that the watchdog can intervene before a malfunctioning system performs any irreversible, harmful actions.
In simple applications, specifically without the use of an RTOS, developers would usually feed the watchdog from the main loop. This approach merely requires configuration of an appropriate initial counter value, which can be as simple as choosing any value that exceeds the worst-case execution time of the entire main loop by at least one timer cycle. This often is a fairly robust approach: While some systems will require immediate recovery, others merely need to ensure they are not hung indefinitely – and this will definitely get the job done.
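As a rough illustration of that rule of thumb, the reload value can be derived from the worst-case loop time and the watchdog's tick period. The helper `wd_reload_ticks()` and its parameters are hypothetical, and the sketch assumes both durations are known in microseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest reload value, in watchdog timer ticks, that covers the
 * worst-case execution time of the main loop plus one tick of margin.
 * Hypothetical helper; both arguments are given in microseconds. */
uint32_t wd_reload_ticks(uint32_t wcet_us, uint32_t tick_us)
{
    assert(tick_us > 0u);
    /* Round the worst case up to whole ticks ... */
    uint32_t ticks = wcet_us / tick_us + ((wcet_us % tick_us) != 0u ? 1u : 0u);
    /* ... then add the one-tick safety margin. */
    return ticks + 1u;
}
```

For a main loop with a worst case of 12.5 ms and a 1 ms watchdog tick, this yields 14 ticks. The main loop would then feed the watchdog exactly once per iteration, so that a loop that stops iterating also stops feeding.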
In a multitask (RTOS) environment
In more complex systems, however, specifically in multitasking systems, various threads can hang on various occasions and for various reasons. For some threads it is perfectly acceptable not to run for a long time, such as a thread waiting for potential network communication. Finding a clean method to feed the watchdog periodically, while still ensuring that each distinct task is in good health, is a major challenge for developers of these systems, who need to consider, for example:
- Whether the OS is executing properly
- Whether high-priority tasks are exhausting the CPU, preventing low-priority tasks from running at all
- Whether a deadlock has occurred that inhibits the execution of one or several tasks
- Whether a task routine is executing properly and entirely
Developers also need to ensure that any modification made to their source code, whether it is a dedicated watchdog task or specific changes to the monitored tasks, is small and optimized for efficiency in order to keep intrusiveness to a minimum.
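One common lightweight pattern for this kind of supervision is a check-in bitmask: every monitored task sets its own bit, and a supervisor feeds the hardware watchdog only when all bits are set. The sketch below is a host-side illustration under stated assumptions, not the approach of any particular RTOS. All names are hypothetical, there are three monitored tasks, and a counter stands in for the hardware feed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WD_TASK_COUNT 3u
#define WD_ALL_TASKS  ((1u << WD_TASK_COUNT) - 1u)

static uint32_t wd_alive_mask;  /* one bit per monitored task           */
static uint32_t wd_feed_count;  /* stands in for the real hardware feed */

/* Called by each monitored task once it has completed a full cycle.
 * On a real RTOS this read-modify-write must be made atomic, for example
 * by briefly disabling interrupts. */
void wd_task_alive(unsigned task_id)
{
    assert(task_id < WD_TASK_COUNT);
    wd_alive_mask |= (1u << task_id);
}

/* Called periodically by a supervisor (a dedicated task or a timer).
 * Feeds only if every task has checked in since the last feed. */
bool wd_supervise(void)
{
    if (wd_alive_mask != WD_ALL_TASKS) {
        /* At least one task is missing: do not feed. If this persists,
         * the hardware watchdog eventually expires. */
        return false;
    }
    wd_alive_mask = 0u;  /* start a new observation window       */
    wd_feed_count++;     /* real code would feed the hardware here */
    return true;
}
```

The cost per task is a single call, which keeps the intrusion small. The trade-off is that all tasks share one observation window, so a task that legitimately sleeps for a long time needs a different scheme, such as a timeout per task.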
Utilize the watchdog support of your RTOS
For this reason, state-of-the-art RTOSes like SEGGER’s embOS offer comprehensive watchdog solutions to their customers in order to simplify watchdog handling and thereby reduce the time spent on development.
The general principles behind these solutions may vary between RTOSes. At SEGGER, versatility and ease of use are considered paramount, while the required footprint in both memory usage and execution time is kept to a minimum. It was therefore evident to the embedded experts that a comprehensive set of API functions was required, one that allows for both
- the individual registration of tasks, timers, and even ISRs with the underlying embOS watchdog module, as well as
- the possibility to test the intended watchdog conditions flexibly from any desired context.
The final implementation consists of a mere five API functions, yet is powerful enough to serve any intended purpose.
Using these API functions, a task simply registers itself with the embOS watchdog module and configures its individual timeout period at the same time. The task then periodically signals its proper execution by calling one simple embOS API function. A single further embOS API call checks whether all monitored tasks have signaled proper execution within their specified timeout periods. This check may be performed from within a dedicated watchdog task, from within OS_Idle(), or even from within the periodic OS timer interrupt service routine or any other ISR.
Users merely need to provide and register two functions: the first performs the hardware-dependent feeding of the watchdog, while the second specifies further actions to take when a watchdog timeout is detected. For example, it can store a log with further information on the system status in non-volatile memory before performing a hardware reset or taking any other action.
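To make the flow of registering, signaling, and checking more concrete, here is a minimal host-side sketch of such a module. It is not the embOS API: every name (`wd_config()`, `wd_add()`, `wd_trigger()`, `wd_remove()`, `wd_check()`, and the `demo_` callbacks) is hypothetical, and the current time is passed in as a tick count so the code can run anywhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef void (*WdFeedFn)(void);         /* feeds the hardware watchdog       */
typedef void (*WdExpiredFn)(int slot);  /* runs when a slot misses its limit */

typedef struct {
    uint32_t timeout;    /* allowed ticks between two check-ins */
    uint32_t last_seen;  /* tick of the most recent check-in    */
    bool     in_use;
} WdSlot;

#define WD_MAX_SLOTS 8

static WdSlot      wd_slots[WD_MAX_SLOTS];
static WdFeedFn    wd_feed_fn;
static WdExpiredFn wd_expired_fn;

/* Register the two user-provided functions. */
void wd_config(WdFeedFn feed, WdExpiredFn expired)
{
    wd_feed_fn    = feed;
    wd_expired_fn = expired;
}

/* Register a monitored task, timer, or ISR with its own timeout.
 * Returns the slot index, or -1 if no slot is free. */
int wd_add(uint32_t timeout, uint32_t now)
{
    for (int i = 0; i < WD_MAX_SLOTS; i++) {
        if (!wd_slots[i].in_use) {
            wd_slots[i] = (WdSlot){ timeout, now, true };
            return i;
        }
    }
    return -1;
}

/* Signal proper execution of the entity owning this slot. */
void wd_trigger(int slot, uint32_t now)
{
    assert(slot >= 0 && slot < WD_MAX_SLOTS && wd_slots[slot].in_use);
    wd_slots[slot].last_seen = now;
}

/* Stop monitoring this slot. */
void wd_remove(int slot)
{
    assert(slot >= 0 && slot < WD_MAX_SLOTS);
    wd_slots[slot].in_use = false;
}

/* Feed the hardware only if every registered slot is within its timeout;
 * otherwise call the expiry hook and leave the hardware unfed. */
bool wd_check(uint32_t now)
{
    for (int i = 0; i < WD_MAX_SLOTS; i++) {
        /* Unsigned subtraction stays correct across tick wraparound. */
        if (wd_slots[i].in_use &&
            (uint32_t)(now - wd_slots[i].last_seen) > wd_slots[i].timeout) {
            if (wd_expired_fn) { wd_expired_fn(i); }
            return false;
        }
    }
    if (wd_feed_fn) { wd_feed_fn(); }
    return true;
}

/* Example user functions. Real ones would write the feed register and,
 * for instance, store a log entry before resetting the system. */
int demo_feed_calls;
int demo_last_expired = -1;
void demo_feed(void)       { demo_feed_calls++; }
void demo_expired(int slot) { demo_last_expired = slot; }
```

A task would call `wd_add()` once at startup and `wd_trigger()` in each cycle, while `wd_check()` runs from a supervisor context such as a dedicated task or a periodic timer. Because each slot has its own timeout, a task that rarely runs can simply register a generous limit.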
When you start designing and developing an application with a watchdog, decide early on how you intend to use it, and consider the available tools that will help you get there more swiftly. After all, you wouldn’t want to get stranded in space, would you?