TrinityRCL: Multi-granular and code-level root cause localization using multiple types of telemetry data in microservice systems
Journal article
Gu, Shenghui, Rong, Guoping, Ren, Tian, Zhang, He, Shen, Haifeng, Yu, Yongda, Li, Xian, Ouyang, Jian and Chen, Chunan. (2023). TrinityRCL: Multi-granular and code-level root cause localization using multiple types of telemetry data in microservice systems. IEEE Transactions on Software Engineering. 49(5), pp. 3071-3088. https://doi.org/10.1109/TSE.2023.3241299
Authors | Gu, Shenghui, Rong, Guoping, Ren, Tian, Zhang, He, Shen, Haifeng, Yu, Yongda, Li, Xian, Ouyang, Jian and Chen, Chunan |
---|---|
Abstract | The microservice architecture has been commonly adopted by large scale software systems exemplified by a wide range of online services. Service monitoring through anomaly detection and root cause analysis (RCA) is crucial for these microservice systems to provide stable and continued services. However, compared with monolithic systems, software systems based on the layered microservice architecture are inherently complex and commonly involve entities at different levels of granularity. Therefore, for effective service monitoring, these systems have a special requirement of multi-granular RCA. Furthermore, as a large proportion of anomalies in microservice systems pertain to problematic code, to timely troubleshoot these anomalies, these systems have another special requirement of RCA at the finest code-level. Microservice systems rely on telemetry data to perform service monitoring and RCA of service anomalies. The majority of existing RCA approaches are only based on a single type of telemetry data and as a result can only support uni-granular RCA at either application-level or service-level. Although there are attempts to combine metric and tracing data in RCA, their objective is to improve RCA's efficiency or accuracy rather than to support multi-granular RCA. In this article, we propose a new RCA solution TrinityRCL that is able to localize the root causes of anomalies at multiple levels of granularity including application-level, service-level, host-level, and metric-level, with the unique capability of code-level localization by harnessing all three types of telemetry data to construct a causal graph representing the intricate, dynamic, and nondeterministic relationships among the various entities related to the anomalies. By implementing and deploying TrinityRCL in a real production environment, we evaluate TrinityRCL against two baseline methods and the results show that TrinityRCL has a significant performance advantage in terms of accuracy at the same level of granularity with comparable efficiency and is particularly effective to support large-scale systems with massive telemetry data. |
Keywords | computer network security; graph theory; service oriented architecture; telemetry; application level; code level localization; code level root cause localization; continued services; effective service monitoring; finest code level; host level; large scale software systems; large scale systems; layered microservice architecture; massive telemetry data; metric level; microservice systems; monolithic systems; multigranular rca; online services; rca solution trinityrcl; service anomalies; service level; stable services; unigranular rca; microservice architectures; measurement; codes; monitoring; location awareness; computer architecture; root cause; telemetry data; microservices |
Year | 2023 |
Journal | IEEE Transactions on Software Engineering |
Journal citation | 49 (5), pp. 3071-3088 |
Publisher | IEEE Computer Society |
ISSN | 0098-5589 |
Digital Object Identifier (DOI) | https://doi.org/10.1109/TSE.2023.3241299 |
Scopus EID | 2-s2.0-85148448378 |
Page range | 3071-3088 |
Funder | National Key Research and Development Program of China; Research Council of Norway; Meituan; Key Research and Development Program of Jiangsu Province; National Natural Science Foundation of China (NSFC) |
Publisher's version | License: All rights reserved; File access level: Controlled |
Output status | Published |
Publication dates | |
Online | 01 Feb 2023 |
Publication process dates | |
Accepted | 18 Jan 2023 |
Deposited | 07 Nov 2023 |
Grant ID | 2019YFE0105500; 309494; BE2021002-2; 62072227; 62202219 |
https://acuresearchbank.acu.edu.au/item/8zy56/trinityrcl-multi-granular-and-code-level-root-cause-localization-using-multiple-types-of-telemetry-data-in-microservice-systems