Self-healing software was my third and final project during my internship at WSO2. Self-healing is a property of software that enables a system to perceive that it is not operating correctly and, with or without human intervention, make the necessary adjustments to restore itself to normalcy. Since this area is still new to the industry, I couldn’t find any actual implementations, so I had to spend a lot of time reading research papers to get an idea about self-healing software.
Fault-tolerant computer systems mirror all their operations, so if one component fails, a redundant component takes over and continues the service without any loss.
Self-healing is more about recovery-oriented computing. The abstract flow is fault diagnostics, recovery, and re-induction of the repaired element into the system. Most decision support systems take a passive form, where decision making is based on user initiation, but self-healing systems can support an active form of decision making, which involves detecting faults and recovering from them without human intervention. Intelligent models can select a proper repair plan to redeploy the broken component and, if more than one component needs to be healed, prioritize them.
Before a self-healing system can bring the system back to normalcy after a fault, it has to know what is a healthy state and what is not. Usually, a system does not break down at one recognizable moment; it deteriorates over time, i.e. there is a gradual transition between the healthy and unhealthy states. So we can define a new state, fuzzy, which lies between healthy and unhealthy.
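The three states can be sketched as a small classifier. This is only an illustration: the `HealthState` enum, the `classify` method, and the idea of deriving the state from a heap usage ratio with two cut-off values are assumptions made for the example, not something taken from the project.

```java
/** Three health states; how a system maps onto them is an assumption of this sketch. */
enum HealthState {
    HEALTHY, FUZZY, UNHEALTHY;

    /**
     * Classifies a usage ratio (used / max, between 0.0 and 1.0).
     * fuzzyFrom and unhealthyFrom are hypothetical cut-off ratios chosen by the caller.
     */
    static HealthState classify(double usedRatio, double fuzzyFrom, double unhealthyFrom) {
        if (usedRatio >= unhealthyFrom) {
            return UNHEALTHY;
        }
        if (usedRatio >= fuzzyFrom) {
            return FUZZY;
        }
        return HEALTHY;
    }
}
```

For example, `HealthState.classify(0.80, 0.75, 0.90)` falls between the two cut-offs and is reported as `FUZZY`.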
First, we should be able to identify a symptom; once a symptom is identified, we should find a diagnostic for it. If a solution plan for that diagnostic exists in the database, we can execute it.
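The lookup step might look roughly like the sketch below. The `PlanRegistry` class, its method names, and the use of an in-memory map in place of a real database are all hypothetical and only serve to illustrate the diagnostic-to-plan flow.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the database of solution plans. */
class PlanRegistry {
    private final Map<String, Runnable> plansByDiagnostic = new HashMap<>();

    /** Stores the solution plan to run for a given diagnostic. */
    void register(String diagnostic, Runnable plan) {
        plansByDiagnostic.put(diagnostic, plan);
    }

    /** Runs the plan for the diagnostic if one exists; returns whether a plan was executed. */
    boolean heal(String diagnostic) {
        Runnable plan = plansByDiagnostic.get(diagnostic);
        if (plan == null) {
            return false; // no known plan, so nothing can be executed automatically
        }
        plan.run();
        return true;
    }
}
```

A caller would register one plan per known diagnostic and call `heal` once the symptoms have been matched to a diagnostic.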
As a first step, we decided to implement this for the out-of-memory error in Java.
Self-healing solution for Java out-of-memory error
Java applications are only allowed to use a limited amount of memory. This limit is specified at application startup. If the application exceeds the limit, an out-of-memory error (java.lang.OutOfMemoryError) is thrown and, in most cases, the JVM shuts down.
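The limit and the current usage can be read from inside the application with the standard `Runtime` API. The `HeapInfo` class below is a hypothetical helper written for this post's context, not code from the project.

```java
/** Hypothetical helper that reads heap figures through the standard Runtime API. */
class HeapInfo {
    /** Maximum heap the JVM will attempt to use, in bytes (set at startup, e.g. with -Xmx). */
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Heap currently in use, in bytes: the heap obtained so far minus its free part. */
    static long usedBytes() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    /** Used heap as a fraction of the maximum heap. */
    static double usedRatio() {
        return (double) usedBytes() / maxBytes();
    }
}
```

Starting the application with, say, `java -Xmx256m ...` makes `maxBytes()` report a value close to 256 MB.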
Before starting any implementation, I tried to reproduce an out-of-memory error and looked for a pattern in it by analyzing the remaining memory after each garbage collection.
From that analysis, I was able to come up with the following symptoms, diagnostic, and solution plan for the out-of-memory error:
- The size of the non-collected garbage keeps increasing (the gradient of the remaining garbage objects is positive and keeps increasing, i.e. the second derivative is also positive).
- The heap size eventually equals the maximum size allocated by the JVM.
- The used heap size is always close to the heap size.
Diagnostic: Out-of-memory error
Solution plan: Gracefully restart the JVM
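The gradient and second derivative in the first symptom can be approximated with finite differences over the used-heap values sampled after each garbage collection. This is a minimal sketch under that assumption; the `HeapTrend` class is hypothetical, and it measures change per collection rather than per unit of time.

```java
/** Hypothetical helper: finite differences over used-heap samples taken after each GC. */
class HeapTrend {
    /** First derivative: change in used heap between consecutive samples. */
    static long[] firstDifferences(long[] samples) {
        if (samples.length < 2) {
            return new long[0]; // not enough samples to form a difference
        }
        long[] differences = new long[samples.length - 1];
        for (int i = 1; i < samples.length; i++) {
            differences[i - 1] = samples[i] - samples[i - 1];
        }
        return differences;
    }

    /** Second derivative: change in the first derivative between consecutive samples. */
    static long[] secondDifferences(long[] samples) {
        return firstDifferences(firstDifferences(samples));
    }
}
```

For the samples 100, 110, 130, 160 the first differences are 10, 20, 30 and the second differences are 10, 10, which matches the symptom of growth that keeps accelerating.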
Implementation of Self-healing component for Out-of-Memory Error
We decided to start the derivative analysis only once the used heap size is greater than 75% of the max heap size. After that point, the self-healing component records the first and second derivatives of the memory that remains in use after each garbage collection. To know when a collection happens, I register my component with the JVM’s notification emitter, so whenever garbage collection occurs it sends a notification to my component.
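Registering for garbage collection notifications could be sketched as follows. The `GcWatcher` class and its `watch` method are hypothetical; the sketch relies on the `com.sun.management` notification API, which is available on HotSpot/OpenJDK-based JVMs but is not guaranteed on every JVM, and it is not necessarily how the project itself does it.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.LongConsumer;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

/** Hypothetical helper that reports the used heap after every garbage collection. */
class GcWatcher {
    /** Registers the callback with each GC bean that emits notifications; returns how many were registered. */
    static int watch(LongConsumer onUsedHeapAfterGc) {
        // Collect the names of heap pools so non-heap pools (e.g. Metaspace) are ignored.
        Set<String> heapPools = new HashSet<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                heapPools.add(pool.getName());
            }
        }
        NotificationListener listener = (notification, handback) -> {
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(notification.getType())) {
                return; // not a GC notification
            }
            GarbageCollectionNotificationInfo info =
                    GarbageCollectionNotificationInfo.from((CompositeData) notification.getUserData());
            long usedHeap = 0;
            for (Map.Entry<String, MemoryUsage> entry
                    : info.getGcInfo().getMemoryUsageAfterGc().entrySet()) {
                if (heapPools.contains(entry.getKey())) {
                    usedHeap += entry.getValue().getUsed();
                }
            }
            onUsedHeapAfterGc.accept(usedHeap);
        };
        int registered = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (gc instanceof NotificationEmitter) {
                ((NotificationEmitter) gc).addNotificationListener(listener, null, null);
                registered++;
            }
        }
        return registered;
    }
}
```

A caller could pass a callback that appends each value to a history and recomputes the derivatives from it.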
Finally, if the total used memory is greater than 90% of the max memory, the component checks the following criteria and makes a decision:
- If the first derivative is > 0 for n consecutive collections AND the second derivative is >= 0 for n consecutive collections, gracefully restart the JVM. (Priority 1)
- If the first derivative is > 0 for n consecutive collections AND the second derivative is >= 0 for n collections that are not necessarily consecutive, gracefully restart the JVM. (Priority 2)
- If the first derivative is not positive, we might not want to restart the JVM.
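The criteria above can be sketched as a small decision function. This is one possible reading, not the project's code: the `RestartDecision` class and its names are hypothetical, "n times" is interpreted as the n most recent recorded values, and the 90% usage check is assumed to have been done by the caller.

```java
import java.util.function.LongPredicate;

/** Hypothetical decision helper over recorded first and second derivative values. */
class RestartDecision {
    enum Verdict { RESTART_PRIORITY_1, RESTART_PRIORITY_2, NO_RESTART }

    /** Counts how many of the most recent values in a row satisfy the check. */
    static int trailingRun(long[] values, LongPredicate check) {
        int run = 0;
        for (int i = values.length - 1; i >= 0 && check.test(values[i]); i--) {
            run++;
        }
        return run;
    }

    /** Counts all values that satisfy the check, consecutive or not. */
    static int total(long[] values, LongPredicate check) {
        int count = 0;
        for (long value : values) {
            if (check.test(value)) {
                count++;
            }
        }
        return count;
    }

    /** Applies the three criteria in priority order. */
    static Verdict decide(long[] first, long[] second, int n) {
        if (trailingRun(first, d -> d > 0) < n) {
            return Verdict.NO_RESTART; // used memory is not steadily growing
        }
        if (trailingRun(second, d -> d >= 0) >= n) {
            return Verdict.RESTART_PRIORITY_1; // growth has not slowed for n collections in a row
        }
        if (total(second, d -> d >= 0) >= n) {
            return Verdict.RESTART_PRIORITY_2; // growth has not slowed in n collections overall
        }
        return Verdict.NO_RESTART;
    }
}
```

The value of n is left to the caller; a larger n makes the component slower to react but less likely to restart on a temporary spike.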
The following sequence diagram will explain the sequence of self-healing components.
After implementing the above algorithm, my component succeeded in detecting an out-of-memory error. The self-healing project is available on GitHub.
As future work, machine learning algorithms could be implemented to detect symptoms and create solution plans instead of writing static solutions, not only for the out-of-memory error but for most commonly known exceptions.
We are actually working on anomaly detection software for Java applications using ML algorithms for our final year project, which could be extended with the self-healing software.
Fork self-healing software at GitHub — https://github.com/VIthulan/self-healing
See you in the next blog,
Happy coding!! 🙂
Originally published at http://vithulanmv.wordpress.com on May 29, 2016.