- CHAPTER 3
CPU Over Temperature Safeguard
- The CPU over temperature safeguard (COS) is a new Ultra Enterprise 3000, 4000, 5000, and 6000 platform safeguard feature for the Solaris 2.6 software environment. COS is an automatic feature available on the Ultra Enterprise family of x000 servers. It ensures that the temperature on any CPU/memory board does not go above the safe operating range.
COS Requirements
- COS operation requires proper firmware support. COS is not available if an Ultra Enterprise x000 server lacks enabling firmware. In this case, the system displays these messages during the boot sequence:
-
WARNING: Firmware does not support CPU power off
WARNING: Automatic CPU shutdown on over-temperature disabled
WARNING: Firmware does not support CPU restart from power off
WARNING: The ability to restart individual CPUs is disabled
|
-
· To check the firmware revision level, use the prtdiag -v command.
- The minimum firmware version for COS support is 3.2.8.
- The system, when equipped with the required firmware, displays the following message during the boot sequence:
-
Board 0: OBP 3.2.8 1997/02/27 14:00 POST 3.5.1 1997/03/05 09:34
(or equivalent for later firmware)
|
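The firmware check above can be automated. The following is a minimal sketch, assuming the banner line has the format shown in this chapter; the function names `obp_version` and `supports_cos` are hypothetical and are not part of any Solaris tool.

```python
import re

# Minimum OBP revision for COS support, as stated in this chapter.
MIN_OBP = (3, 2, 8)

def obp_version(banner_line):
    """Return the OBP revision found in a banner line as a tuple of ints, or None."""
    match = re.search(r"\bOBP\s+(\d+(?:\.\d+)*)", banner_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))

def supports_cos(banner_line, minimum=MIN_OBP):
    """Return True if the banner reports an OBP revision at or above the minimum."""
    version = obp_version(banner_line)
    return version is not None and version >= minimum

# Banner line taken from the boot sequence example in this chapter.
banner = "Board 0: OBP 3.2.8 1997/02/27 14:00 POST 3.5.1 1997/03/05 09:34"
print(obp_version(banner), supports_cos(banner))
```

Comparing the revision as a tuple of integers rather than as a string avoids ranking a revision such as 3.10 below 3.2.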
Overheating Factors
- Many external forces can affect the temperature and compound the CPU high temperature problem, including:
-
- Room air-conditioning is incorrectly set
- Lateral cooling is obstructed
- There are also some Solaris software environment conditions, such as bound threads or a system with only one CPU/memory board, that can cause a fallback to the existing shutdown behavior.
- The CPU over temperature safeguard does not affect the Solaris software environment in any way. The technology operates only during over temperature conditions.
COS Operation
- COS functions by monitoring the temperatures of all system CPUs. Warning messages are displayed on the system console when an over temperature condition occurs. For example:
-
WARNING: CPU/Memory board 0 is warm (temperature: 73C). Please check system cooling
NOTICE: Processor 0 powered off.
NOTICE: Processor 1 powered off.
|
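If console output is captured to a file, the messages above can be scanned for affected boards and processors. This is a minimal sketch that assumes the message wording shown in the example; the `summarize` function and the regular expressions are illustrative only and may need adjusting for other firmware or Solaris releases.

```python
import re

# Patterns based on the console messages shown in this chapter.
WARM_RE = re.compile(
    r"WARNING: CPU/Memory board (\d+) is warm \(temperature: (\d+)C\)"
)
OFF_RE = re.compile(r"NOTICE: Processor (\d+) powered off\.")

def summarize(console_text):
    """Return ({board: temperature_C}, [powered-off processor IDs])."""
    warm = {int(board): int(temp) for board, temp in WARM_RE.findall(console_text)}
    powered_off = [int(cpu) for cpu in OFF_RE.findall(console_text)]
    return warm, powered_off

# Console output taken from the example in this chapter.
sample = """\
WARNING: CPU/Memory board 0 is warm (temperature: 73C). Please check system
cooling
NOTICE: Processor 0 powered off.
NOTICE: Processor 1 powered off.
"""

warm_boards, off_cpus = summarize(sample)
print(warm_boards, off_cpus)
```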
- The following procedure describes the steps to follow when one or more CPUs reach an over temperature condition.
Resolving an Over Temperature Condition
- When the COS feature detects a CPU over temperature condition, it takes the CPU offline and powers it off.
- The system continues to operate with the affected CPUs powered off. The CPUs are the chief source of heat on a CPU/Memory board; removing that heat source lowers the temperature into the normal operating range. This prevents sudden downtime on a production server.
· To Resolve an Over Temperature Condition
-
-
Verify the new state with the psrinfo command.
The psrinfo output reflects the new CPU state:
-
-
0 powered-off since 03/11/97 09:48:31
1 powered-off since 03/11/97 09:48:31
-
-
Without shutting down the operating system, replace the defective power supply (which contains the cooling fans) with a working unit.
Note - If desired, you can cleanly halt the server using /etc/halt or init 0 at the root or superuser prompt before replacing the defective power supply.
-
Bring the CPUs back to normal operation using the psradm command:
-
# psradm -n processor_id#
|
- With the CPU over temperature safeguard feature, if the temperature sensor still reports an over temperature condition (the temperature is still out of range), the attempt to bring the CPUs back into operation using the psradm command fails, and psradm returns -1 and an error message.
- If the CPUs in question return to normal operating temperature, the console displays a message similar to the following:
-
NOTICE: CPU/Memory board 0 has cooled down (temperature: 72C), system OK.
|
Failure to Power Down CPUs
- In some instances, the CPU power control cannot disengage the affected CPU(s) from the Solaris software environment. For example, if the high temperature condition occurs in a system with only one CPU/memory board with two processors, processor 1 cannot be taken offline because it is the last processor in the system.
Failure to Power Up CPUs
- If the attempted de-coupling of the problem CPUs from the Solaris software environment fails, the temperature continues to increase. When the temperature reaches the hard upper operational temperature limit, the system shuts down. You will see a message similar to the following:
-
WARNING: CPU/Memory board 0 is very hot (temperature: 83C)
WARNING: System shutdown scheduled in 20 seconds due to over-temperature condition on CPU/Memory board 0
WARNING: CPU/Memory board 0 still too hot (temperature: 83C). Overtemp shutdown started
|
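A monitoring script could distinguish the escalation stages described in this chapter by matching the message text. The sketch below assumes the wording of the example messages; the labels and the `classify` function are hypothetical names chosen for illustration.

```python
import re

# Patterns are checked in order. Each captures one number: the delay in
# seconds for "shutdown", and the board number for the other labels.
PATTERNS = [
    ("shutdown", re.compile(r"System shutdown scheduled in (\d+) seconds")),
    ("very_hot", re.compile(r"CPU/Memory board (\d+) is very hot")),
    ("warm", re.compile(r"CPU/Memory board (\d+) is warm")),
    ("cooled", re.compile(r"CPU/Memory board (\d+) has cooled down")),
]

def classify(line):
    """Return (label, number) for a recognized console message, else None."""
    for label, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return label, int(match.group(1))
    return None

# Message taken from the shutdown example in this chapter.
print(classify("WARNING: CPU/Memory board 0 is very hot (temperature: 83C)"))
```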