Suggested Term Project Topics – CSC 724 (001) Spring 2023 Advanced Distributed Systems

Note

You can pick the topic that sounds most interesting to you. All of the following topics describe open research problems, and you should aim to develop your ideas into a conference paper or even an MS thesis. You may also propose another suitable topic to the instructor.
For all projects, you are required to report your experience (e.g., any problems, failures, or bugs) with whatever infrastructure (VCL, Amazon EC2, Google AppEngine) you choose to use. You will receive extra credit for each specific bug you report.

Research Projects Supervised by Dr. Gu

Automatic System Management using Unsupervised Machine Learning: [slides]

A Hybrid Approach to Cloud System Performance Bug Detection and Diagnosis: [slides]

Topic 1: Virtual Machine Management in Distributed Computing Environments

Project description: Virtualization is one of the basic technologies behind modern data centers and cloud computing systems such as Amazon EC2. The goal of this project is to explore virtualization techniques (e.g., Xen) to achieve system management goals such as resource management in distributed computing environments such as VCL.

References:

“AGILE: elastic distributed resource scaling for Infrastructure-as-a-Service”,
Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, John Wilkes,
Proc. of USENIX International Conference on Autonomic Computing (ICAC), San Jose, CA, June, 2013.
“CloudScale: Elastic Resource Scaling for Multi-Tenant Cloud Systems”
Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, John Wilkes,
Proc. of ACM Symposium on Cloud Computing (SOCC) in conjunction with SOSP, Cascais, Portugal, October, 2011.
“PRESS: PRedictive Elastic ReSource Scaling for Cloud Systems”,
Zhenhuan Gong, Xiaohui Gu, John Wilkes
IEEE International Conference on Network and Services Management (CNSM), Niagara Falls, Canada, October, 2010.
“PAC: Pattern-driven Application Consolidation for Efficient Cloud Computing”,
Zhenhuan Gong, Xiaohui Gu,
IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Miami Beach, Florida, August, 2010.
Xen and the Art of Virtualization,
Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield,
Proc. of SOSP, 2003.

Experiment environment: VCL
Related software: KVM, Xen, Hadoop, RUBiS, IBM System S

Topic 2: System Monitoring & Behavior Learning & Anomaly Management

Project description: The goal of this project is to collect monitoring data for one system anomaly and develop an anomaly prediction or diagnosis algorithm.
References:

“FChain: Toward Black-box Online Fault Localization for Cloud Systems”
Hiep Nguyen, Zhiming Shen, Yongmin Tan, Xiaohui Gu
Proc. of IEEE International Conference on Distributed Computing Systems (ICDCS), Philadelphia, PA, July, 2013.
“UBL: Unsupervised Behavior Learning for Predicting Performance Anomalies in Virtualized Cloud Systems”
Daniel Dean, Hiep Nguyen, Xiaohui Gu,
Proc. of ACM International Conference on Autonomic Computing (ICAC), San Jose, CA, September, 2012.
“PREPARE: Predictive Performance Anomaly Prevention for Virtualized Cloud Systems”
Yongmin Tan, Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Chitra Venkatramani, Deepak Rajan,
Proc. of IEEE International Conference on Distributed Computing Systems (ICDCS), Macau, China, June, 2012
“Adaptive Runtime Anomaly Prediction for Dynamic Hosting Infrastructures”,
Yongmin Tan, Xiaohui Gu, Haixun Wang,
ACM Symposium on Principles of Distributed Computing (PODC), Zurich, Switzerland, July, 2010. (acceptance rate: 21%)

Experiment environment: VCL, Amazon EC2, Google AppEngine
Related software: RUBiS, Hadoop, IBM System S
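
As a starting point for the monitoring-and-detection workflow described in this topic, here is a minimal sketch of a sliding-window z-score detector over a single metric. It is only an illustrative baseline and is not the method used by any of the referenced papers (UBL, PREPARE, FChain). The function name `detect_anomalies`, the window size, the threshold, and the sample CPU readings are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations.

    Hypothetical baseline detector; parameters are illustrative only.
    """
    history = deque(maxlen=window)  # trailing window of recent samples
    anomalies = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip flat windows (sigma == 0) to avoid flagging on no variance.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Made-up CPU utilization samples (percent) with one injected spike.
cpu_percent = [20, 22, 21, 19, 20, 21, 95, 22, 20, 21]
print(detect_anomalies(cpu_percent))  # prints [6]
```

A project would typically replace the made-up samples with metrics collected from VCL or EC2 instances and compare a baseline like this one against a learned model.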

Topic 3: System diagnosis using console logs or traces

Project description: The goal of this project is to detect and diagnose runtime system problems using logs or system traces.
References:

“ELT: Efficient Log-based Troubleshooting System for Cloud Computing Infrastructures”,
Kamal Kc, Xiaohui Gu,
Proc. of IEEE International Symposium on Reliable Distributed Systems (SRDS), Madrid, Spain, October, 2011.
Detecting Large-Scale System Problems by Mining Console Logs
Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael Jordan,
Proc. of SOSP 2009.
DScope: Detecting Real-World Data Corruption Hang Bugs in Cloud Server Systems
Ting Dai, Jingzhu He, Xiaohui Gu, Shan Lu, Peipei Wang
Proc. of SOCC 2018.
TScope: Automatic Timeout Bug Identification for Server Systems
Jingzhu He, Ting Dai, Xiaohui Gu
Proc. of ICAC 2018.
Hytrace: A Hybrid Approach to Performance Bug Diagnosis in Production Cloud Infrastructures
Ting Dai, Daniel Dean, Peipei Wang, Xiaohui Gu, Shan Lu
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2018
TFix: Automatic Timeout Bug Fixing in Production Server Systems
Jingzhu He, Ting Dai, Xiaohui Gu
Proc. of ICDCS 2019.
HangFix: Automatically Fixing Software Hang Bugs for Production Cloud Systems
Jingzhu He, Ting Dai, Xiaohui Gu and Guoliang Jin
Proc. of SOCC 2020.
CDL: Classified Distributed Learning for Detecting Security Attacks in Containerized Applications
Yuhang Lin, Olufogorehan Tunde-Onadele, and Xiaohui Gu
Proc. of ACSAC 2020
Self-Patch: Beyond Patch Tuesday for Containerized Applications
Olufogorehan Tunde-Onadele, Yuhang Lin, Jingzhu He, and Xiaohui Gu
Proc. of ACSOS 2020
SHIL: Self-Supervised Hybrid Learning for Security Attack Detection in Containerized Applications
Yuhang Lin, Olufogorehan Tunde-Onadele, Xiaohui Gu, Jingzhu He, and Hugo Latapie
Proc. of ACSOS 2022

Experiment environment: VCL, Amazon EC2, Google AppEngine
Related software: Hadoop, VCL
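
To make the log-based diagnosis idea in this topic concrete, here is a minimal sketch that groups log lines into templates by masking variable fields and then reports templates that occur rarely. It is a simple illustration, not the technique of any referenced paper (for example, it does no source-code analysis or statistical modeling). The masking rule, the function names `to_template` and `rare_templates`, and the sample log lines are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical masking rule: treat hex literals and digit runs as variable fields.
_VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def to_template(line):
    """Replace variable fields with a placeholder to obtain a message template."""
    return _VARIABLE.sub("<*>", line.strip())


def rare_templates(lines, max_count=1):
    """Return the sorted templates that occur at most `max_count` times."""
    counts = Counter(to_template(line) for line in lines)
    return sorted(t for t, c in counts.items() if c <= max_count)


# Made-up log lines, loosely styled after storage-system console logs.
logs = [
    "Received block blk_101 of size 67108864 from 10.0.0.1",
    "Received block blk_102 of size 67108864 from 10.0.0.2",
    "Received block blk_103 of size 1024 from 10.0.0.1",
    "Exception writing block blk_104 to mirror 10.0.0.3",
]
print(rare_templates(logs))
```

Here the three "Received block" lines collapse into one frequent template, so only the exception message is reported as rare. Real logs usually need richer masking rules (timestamps, paths, identifiers) than this sketch assumes.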