Here are a couple of reported issues where the inactive core reboots with no impact to the system.
There is a recent PSN released related to this.
The HWD is run by the MSP430, an off-processor chip on the CPPM.
It is a two-stage HWD. There is a task, the swd task, that kicks the HWD constantly; it runs at priority 0.
The purpose of the HWD is really to make sure VxWorks has not gone off the rails.
Since this is two-stage, the first stage sends an interrupt to the CPPM; the interrupt handler prints an rpt report and calls ini. I believe it also kicks the HWD and resets some timers so we can reboot without triggering the second stage.
If VxWorks is in too much trouble and is not able to process the interrupt, the MSP430 will power cycle the board, i.e. hard reboot the card. This is the second stage of the HWD.
The problem at the sites is the second-stage HWD. With the second stage there is no indication whatsoever of what the OS was doing or why it was in trouble.
So there are only two things that can cause a second-stage HWD:
1. Actually bad hardware, or something wrong with the MSP430.
2. The double exception.
Regarding 1: this is very unlikely, since it does not happen on the active sides. (We can double-check all the CPUs and make sure there are no inis; an active-side sysload is really a UGSWO, and an ini on the active and the inactive will show HWD reason 5.)
We can also check all the RAM versions, but again, since it never happens on the active side it is unlikely to be hardware.
For number 2, the double exception:
The double exception is an architectural limitation, or really a big design oversight by Wind River when they designed VxWorks. They have actually fixed it in 6.0 (CPL actually does not have this problem).
When there is an exception, the exception handler runs on the same stack as the task that caused the exception. 99% of the time this is fine; we have seen thousands of berr inis over the years.
The problem occurs when the cause of the berr is some kind of stack corruption. If the stack pointer is not valid, then when the exception handler tries to use the stack it takes another exception, hence the double exception. When this happens the CPU literally freezes, and we wait to be restarted by the second-stage HWD.
So to cause stack corruption bad enough, there are three main possibilities:
1. A local array written past its bounds, corrupting the task's stack.
2. Malloc 10, write 100, so we overwrite someone else's stack; similar to what we saw with the rstTask problem caused by the DFO feature for overlays. It would have to be a large block of memory, since it would be intermingled with the task stacks.
3. A stack overflow.