
Showing posts from July, 2011

Preventing/Identifying hardware failures in a Linux environment

Hardware failures are always catastrophic. If the failed component is a single point of failure, the impact can be severe: data loss, delayed delivery of data, service unavailability, and so on. To avoid single points of failure we can set up high availability, load balancing, and the like, but that comes at a cost, and many cannot afford it for technical or financial reasons. So, can we come up with some preventive measures that at least alert us a few days in advance that a device is going to fail? For some devices, such as disk drives, yes, we can. For many others, though, it is not possible to predict a failure in advance. In such cases, for example when a system reboots by itself or hangs and becomes unresponsive because of a hardware failure, we need to identify which device actually failed. Often no trace of such incidents can be found in the system log and we have to…