Abstract: Microservices architecture has rapidly gained popularity for building large-scale, latency-sensitive online applications. The architecture decomposes an application into a collection of fine-grained, loosely coupled services called microservices. This modular design enables independent management of each microservice for agility, scalability, and fault isolation. However, it also leads to a large graph of interacting microservices whose influence on one another is non-trivial. As a result, performance management and debugging of microservices applications are challenging problems. This thesis develops techniques built on optimization theory and machine learning to address performance management problems in microservices architecture. Specifically, this thesis focuses on solving two critical problems in microservices architecture: configuration tuning and bottleneck detection.
Application configuration tuning is essential for improving performance and utilization, but the microservices architecture leads to a very large configuration search space with interdependent parameters. We jointly optimize the parameters to handle their interdependence and develop practical dimensionality reduction strategies, based on available system characteristics, to shrink the search space. Our pre-deployment (offline) evaluation of different optimization algorithms and dimensionality reduction techniques across three popular benchmark applications highlights the importance of configuration tuning in reducing tail latency (by as much as 46%). Post-deployment tuning of real-world applications requires dynamic reconfiguration because workloads are complex and time-varying. Moreover, the tuning process must minimize application interruptions to maintain quality of service and uptime. We design OPPerTune, a framework that uses machine learning algorithms to address the challenges of post-deployment tuning. We evaluate OPPerTune on a benchmark application deployed on an enterprise cluster, using synthetic and production traces, to analyze its effectiveness in (a) determining which configurations to tune and (b) automatically managing the scope at which to tune them. Our experimental results show that OPPerTune reduces the end-to-end P95 latency of microservices applications by more than 50% over expert configuration choices made ahead of deployment.
Beyond configuration tuning, it is critical to detect and mitigate sources of performance degradation (bottlenecks) to avoid revenue loss. As part of the proposed work, we will investigate techniques to detect and mitigate bottlenecks in microservices applications. We plan to explore different mitigation strategies, such as autoscaling and application configuration tuning. Our preliminary results using graph neural networks show that we can improve bottleneck detection accuracy and precision by up to 15% and 14%, respectively, compared to the techniques used in existing work.
It is our thesis that optimization and machine learning algorithms, coupled with system characteristics, can effectively address the complexities of configuration tuning as well as bottleneck detection and mitigation in large-scale microservices applications.