Debugging is a normal part of building software. To catch bugs early, developers write automated tests for the problems they can anticipate and run them before each release. Still, it’s often impossible to anticipate every anomaly across all parts of a system, and one day you may find that the code doesn’t work as expected. What comes next? And what can you do to minimize the risk of such anomalies in the first place?
You spotted an anomaly. What’s next?
A quick note: the action items in this article won’t necessarily be the same for every case. They outline an approach to follow when a nasty bug hides deeper than you can normally trace. Here are four steps for tracking down even the trickiest problem.
Collect
The first thing to do is collect as much information about the anomaly as possible. While searching for details, you may find other manifestations of the problem, or other anomalies, that help you spot the root cause. Pay special attention to the areas of the software where anomalies and unexpected behavior are most likely to surface. Logging all the processes may help. Dig into the following areas of your application: queries, databases, third-party systems, and the staging and production environments.
Techstack case
I once hit an anomaly when processing cURL queries in a PHP-based application. The problem recurred intermittently when the system was under high load (more than 100,000 rps), which made the cause hard to pin down. The anomaly caused delays, so we needed a solution. We wrote a logging system that collected all the logs so we could look for patterns that would explain the nature of the problem.
The biggest challenge was processing an enormous number of logs without affecting application performance. After some research and consideration, we used NATS to carry all the messages from the system (one log entry per request, with every parameter available about that request) with low latency while still handling every log, and Go to write them first to files and then into Elastic. With Kibana, we could see the problem queries, identify patterns in their behavior, and examine them more closely.
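The core idea, keeping log output off the request path, can be sketched with Python's standard library alone. This is not the NATS and Go pipeline described above; it swaps in an in-process queue with a background listener thread to show the same principle. The names `build_request_logger`, `JsonFormatter`, and `ListHandler`, and the `request` field, are hypothetical and exist only for illustration.

```python
import json
import logging
import logging.handlers
import queue


class JsonFormatter(logging.Formatter):
    """Serialize each record, plus its optional 'request' dict, as one JSON line."""

    def format(self, record):
        payload = {"message": record.getMessage(), "level": record.levelname}
        payload.update(getattr(record, "request", {}))
        return json.dumps(payload, sort_keys=True)


class ListHandler(logging.Handler):
    """Stand-in sink that keeps formatted lines in memory (a file or shipper in practice)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def build_request_logger(sink):
    """Return a logger that only enqueues records; a listener thread writes them to sink."""
    log_queue = queue.Queue(-1)  # unbounded queue, so the caller never blocks on I/O
    sink.setFormatter(JsonFormatter())
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()

    logger = logging.getLogger("request_log")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers = [logging.handlers.QueueHandler(log_queue)]
    return logger, listener


if __name__ == "__main__":
    sink = ListHandler()
    logger, listener = build_request_logger(sink)
    logger.info("slow request", extra={"request": {"url": "https://example.com", "duration_ms": 1200}})
    listener.stop()  # drains the queue before returning
    print(sink.lines)
```

In a setup like the one described, the sink would presumably publish to a message broker rather than append to a list, but the request handler's cost stays the same: one enqueue per log entry.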
Guess
When you spot anomalies, come up with hypotheses about which system failures could produce them. Visualizing how your system works may help you form more accurate hypotheses. These hypotheses will let you narrow down your effort.
Validate
Investigate each hypothesis by writing a short test that can prove or disprove it. While 100 percent test coverage isn’t always economically feasible, a targeted test for a single hypothesis may save resources.
In the cURL case, I narrowed down the search area by logging all the queries and finding those that were processed with a delay. Investigating these queries showed that the delay occurred before the connection to the server was established, so I decided to look for the cause in DNS. To prove or disprove that DNS was causing the problem, I configured a local DNS cache. If my hypothesis was right, this would solve the problem. The anomaly disappeared after this change, which supported the hypothesis.
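A check like this can also be made by timing the phases of a connection separately. The following is a minimal sketch, not the code used in the case above: `time_connection_phases` is a hypothetical helper that measures name resolution and TCP connect one after the other. The resolver, connector, and clock are passed in as parameters so the function can be exercised without a network.

```python
import socket
import time


def time_connection_phases(host, port,
                           resolve=socket.getaddrinfo,
                           connect=socket.create_connection,
                           clock=time.perf_counter):
    """Return how long name resolution and TCP connect each took, in seconds."""
    start = clock()
    infos = resolve(host, port, type=socket.SOCK_STREAM)
    resolved = clock()

    # Use the first resolved address; sockaddr is (ip, port, ...) for IPv4 and IPv6.
    address = infos[0][4][:2]
    conn = connect(address, timeout=5)
    connected = clock()
    conn.close()

    return {"dns_s": resolved - start, "connect_s": connected - resolved}


if __name__ == "__main__":
    # Requires network access; example.com is just a placeholder host.
    print(time_connection_phases("example.com", 80))
```

If `dns_s` dominates for the slow requests while `connect_s` stays flat, that points at name resolution rather than the remote server. In PHP, `curl_getinfo()` reports a comparable `namelookup_time` value for each transfer.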
Resolve
Once the problem is located, look for the errors typical of that system area and the ways to fix them. The problem may affect several areas of the system, so dig in diligently to make sure that it doesn’t reappear after your fixes and doesn’t affect other parts of the system.
Returning to the DNS-related problem, I decided to implement a dedicated library that caches DNS records and refreshes them automatically, since DNS records may change over time. This should prevent such problems in the future.
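The caching idea can be illustrated with a small time-to-live cache. This is a sketch under assumptions, not the library from the case above: `TtlDnsCache`, its 60-second default, and the injectable `resolve` and `clock` parameters are all hypothetical. Entries expire after the TTL, so a changed DNS record is picked up on the next lookup.

```python
import socket
import time


class TtlDnsCache:
    """Cache host -> IP lookups and re-resolve once an entry is older than ttl_seconds.

    Not thread-safe; a real implementation would need locking.
    """

    def __init__(self, ttl_seconds=60.0, resolve=socket.gethostbyname, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolve = resolve
        self._clock = clock
        self._entries = {}  # host -> (ip, expires_at)

    def lookup(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: skip the resolver entirely
        ip = self._resolve(host)
        self._entries[host] = (ip, now + self._ttl)
        return ip


if __name__ == "__main__":
    cache = TtlDnsCache(ttl_seconds=30)
    print(cache.lookup("localhost"))  # resolved once
    print(cache.lookup("localhost"))  # served from the cache
```

A short TTL keeps the cache close to the authoritative records at the cost of more lookups; the right value depends on how often the records change.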
How to prevent anomalies in code behavior
Problems usually stem from the early stages of development, when the future system is being designed. If the nature of the problem you’re trying to solve isn’t clear, it’s easy to make mistakes in choosing the architecture and technologies. Here’s how to prevent mistakes in the code that can cause unexpected behavior.
- Choose wisely. The rule of thumb is not to rush into decisions that don’t fit all cases. Postpone the decisions you won’t be able to change quickly until you have a solid view of the problem and the context in which it exists. Collecting information and questioning what you’ve found will give you a broader perspective for making these decisions at later stages of development.
- SOLID principles. Sticking to established software design practices can prevent many anomalies. Keep these principles in mind while designing, writing, and refactoring your code so that it is cleaner, more extensible, and easier to test.
- Code review. A fresh pair of eyes catches many bugs and inconsistencies before they reach a release. That’s why peer code review contributes to stable application behavior in the future. It doesn’t take much time, which makes the method cost-effective.
- Test automation. Quality automated test coverage eliminates a large number of bugs, but building it properly takes time. A good automated test suite should check every line of code against all the problems we can predict; that’s why developing the tests may take much longer than the code they check.
- Monitoring systems. As in the case I described, logging systems and metrics help detect misbehavior in both production and development. Once they’re in place, you can configure alerts that notify you in real time when a metric changes. Such a system lets you instantly check the current server response time, the number of successful routes, which servers are up, and so on.
- Micro releases and a limited audience. Good releases don’t consist of features the whole team worked on for a month. Push small releases that affect a limited system area to identify anomalies faster. To avoid serious pitfalls, try releasing to a small number of users first and rolling out to everyone afterward. The group of users you choose should mirror the composition of the whole audience so that the results are representative. This technique also allows for A/B testing of features. Once applied, it may become an investment in finding a proper product-market fit and growing the business metrics of the application.
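One common way to implement the limited-audience rollout from the last item is to hash each user into a stable bucket. This is a minimal sketch, assuming users have stable identifiers; `rollout_bucket`, `is_enabled`, and the feature name are hypothetical. Hash-based assignment is pseudo-random, so for a large audience the selected group tends to resemble the whole, although it does not guarantee proportional representation of every segment.

```python
import hashlib


def rollout_bucket(user_id, feature):
    """Map a user to a stable bucket in 0..99 for the given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id, feature, percent):
    """Return True if the user falls inside the first `percent` buckets."""
    return rollout_bucket(user_id, feature) < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    exposed = sum(is_enabled(u, "new-checkout", 10) for u in users)
    print(f"{exposed} of {len(users)} users see the feature")
```

Because the bucket depends only on the user and the feature name, raising `percent` from 10 to 50 keeps the original users in the group and only adds new ones.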
Summary
Knowing the algorithm, you can find the root cause of unexpected behavior in your application. Still, it’s cheaper to prevent bugs than to search for non-trivial ways to identify and fix them. Use the checklist above to make sure you apply every method that keeps your development cost-efficient.
At Techstack, we use advanced monitoring and deployment systems to set the technology bar high while saving costs for business. Contact the team to learn more about our approach.