Gremlin Introduces Automatic Detection Of Common Reliability Issues

Gremlin, known for its chaos engineering tools, has launched a new feature called Detected Risks. The feature automatically identifies and categorizes high-priority reliability issues commonly found in Kubernetes-based services. By analyzing misconfigurations and bad default values, it determines the severity of each issue and suggests potential fixes.

“Reliability is becoming increasingly important,” stated Kolton Andrus, CTO and founder of Gremlin. He went on to explain the significance of digital infrastructure, highlighting how sectors such as government, healthcare, transportation, communication, and finance heavily rely on it. However, this digital foundation comes with risks. Fortunately, many of these risks can be mitigated if they are known. This is why Gremlin is excited to introduce Detected Risks, enabling customers to quickly identify and address serious issues within their systems, ultimately enhancing overall system resilience and reliability.

Key Takeaway

Gremlin’s new Detected Risks feature automates the identification of common reliability issues in Kubernetes-based services, helping companies improve the resilience of their systems.

In contrast to Gremlin’s chaos engineering tools, which explore unusual scenarios that put a company’s infrastructure to the test, Detected Risks uses pre-configured tests. Currently, the system includes a set of tests, with 20 more to be added later this year. These tests look for common issues that can affect the reliability and resilience of a company’s infrastructure. The advantage of Detected Risks is that it works without anyone having to run chaos engineering experiments or reliability tests.

Most of these tests are straightforward and check for established best practices. For example, they check that deployments are configured to run in multiple availability zones for redundancy. While this may seem like common sense, Gremlin discovered that 26% of its customers’ deployments lacked redundancy. Moreover, an astonishing 80% of deployments did not have dual redundancies. Additionally, Gremlin’s system identifies common Kubernetes misconfigurations that may affect autoscaling capabilities.
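
To make the availability-zone check concrete, here is a minimal sketch of what such a test could look like. It is not Gremlin's implementation: the Deployment class, the replica_zones field, and the two-zone threshold are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical, simplified view of a Kubernetes deployment."""
    name: str
    # Zone of each running replica. In a real cluster this would typically
    # be read from the node label topology.kubernetes.io/zone.
    replica_zones: List[str] = field(default_factory=list)


def zone_redundancy_risk(deployment: Deployment, min_zones: int = 2) -> Optional[str]:
    """Return a risk message if replicas span fewer than min_zones zones."""
    zones = set(deployment.replica_zones)
    if len(zones) >= min_zones:
        return None
    return (
        f"{deployment.name}: replicas span {len(zones)} zone(s), "
        f"expected at least {min_zones}"
    )


def detect_risks(deployments: List[Deployment]) -> List[str]:
    """Run the zone check over every deployment and collect the findings."""
    findings = []
    for deployment in deployments:
        risk = zone_redundancy_risk(deployment)
        if risk is not None:
            findings.append(risk)
    return findings


if __name__ == "__main__":
    # Example data is invented for illustration only.
    services = [
        Deployment("checkout", ["us-east-1a", "us-east-1b"]),
        Deployment("search", ["us-east-1a", "us-east-1a"]),
    ]
    for finding in detect_risks(services):
        print(finding)
```

A check like this only reads configuration and placement data, which is why it can run without a chaos experiment: nothing is injected into the system, and a deployment is flagged purely because of how it is laid out.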

“Although our industry has talented Site Reliability Engineers (SREs) who work diligently to address these issues, this individual approach does not scale,” explained Andrus. “To solve this problem, we are developing an easy-to-use tool that offers valuable insights across thousands of real-world applications. By providing engineering leadership with visibility into existing risks, we enable them to prioritize and address these critical issues, safeguarding the customer experience and fostering the development of high-quality software.”

Gremlin’s ability to automatically detect common reliability issues and suggest fixes for them is a significant step in enhancing the resilience and reliability of digital infrastructure. As companies rely more heavily on their digital foundation, tools like Detected Risks can play a crucial role in preventing potential system failures and improving overall system performance.