Spring 2023
Credits: 3
Meeting Times: Tuesday/Thursday, 3:00pm – 4:15pm
Meeting Location: 1220 Engineering Building II
Assignment submission: Moodle
Message board: Piazza

 

Instructor Information

  • Xiaohui (Helen) Gu
  • Office Hours: Tues/Thurs 4:15pm – 5:00pm  at EBII 3274
  • Email : xgu AT ncsu.edu

Teaching Assistants/Graders

  • Fogo Tunde-Onadele
  • Office Hours: Mon/Wed 3:00pm – 4:00pm at EBII 1229B
  • Email : oatundeo AT ncsu.edu
  • Yuhang Lin
  • Office Hours: TBD at EBII 1229B
  • Email : ylin34 AT ncsu.edu

Course Objectives

This course explores design and implementation principles in modern distributed systems. In particular, the course emphasizes recent techniques used by real-world distributed systems such as cloud systems, enterprise data centers, and peer-to-peer file sharing (e.g., BitTorrent). Students will learn the state of the art in distributed system architectures, algorithms, and performance evaluation methodologies. Topics include canonical distributed systems concepts such as remote procedure calls, distributed objects, replication, distributed system security, and consensus protocols, as well as recent distributed system technologies such as peer-to-peer systems, grid computing, autonomic computing, massive distributed data processing (Google MapReduce), system machine learning, distributed system debugging, multi-core systems, and distributed virtualization. On completing this course, the student should be able to do the following:

  • Identify research problems and challenges in distributed systems (assessed by review and presentation);
  • List the state-of-the-art tools and techniques for addressing research problems and challenges in distributed systems (assessed by review and presentation);
  • Develop and implement new ideas to solve open problems in distributed systems (assessed by project);
  • Conduct technical reviews, technical writing, and technical presentations (assessed by review, project, paper, presentation).

Text Books

There are no assigned textbooks for this course. Topics will be covered during in-class lectures, and through course notes made available on this web page.
Links to supplementary material, in the form of research papers related to each topic, are included in the syllabus [Course Syllabus]. PDFs of most papers are available through the NCSU library web site, which has full-text access to most recent ACM and IEEE journals and conferences. A number of supplemental distributed systems textbooks are also available:
Distributed Systems: Concepts and Design, (4th Edition), G. Coulouris, J. Dollimore, and T. Kindberg
Distributed Systems (2nd Edition), Sape Mullender
Distributed Systems: Principles and Paradigms, Andrew S. Tanenbaum, Maarten van Steen

Course Description

Distributed systems have become the fundamental computing infrastructure for many important real-world applications such as Internet search engines, media streaming servers, online file sharing, information analytics, and scientific exploration. This course explores design and implementation principles in modern distributed systems. In particular, the course emphasizes recent techniques used by real-world distributed systems such as peer-to-peer file sharing (e.g., BitTorrent), enterprise data centers, and Internet search engines (e.g., Google). Students will learn the state of the art in distributed system architectures, algorithms, and performance evaluation methodologies. Topics include i) traditional distributed computing concepts (e.g., distributed objects, middleware, replication, distributed system security, and consensus protocols); and ii) recently emerged distributed system techniques such as peer-to-peer systems, massive data processing, grid computing, and autonomic computing. Students will not only learn the common design methodology of many important distributed systems, but also gain hands-on experience through project implementations. The majority of course materials will be drawn from classic papers and current state-of-the-art work. The instructor will lecture for the first half of the semester, and students will present papers and projects in the second half of the semester. Students will read and review papers ahead of time, participate in class discussions, present at least one research topic during the course, and do a term project individually or in a two-member team. Students will also write a paper describing their project, review other students’ papers, and present their work at the end of the course in a “conference” format designed to give students an experience similar to that of participating in a professional conference.

Prerequisites

CSC501 or equivalent. Programming experience in C++ or Java in a Unix environment. If you are not sure whether you are prepared to take this course, please consult the instructor.

Tentative Grading Policy

  • Written reviews: 20%
  • Class participation: 30% (presentation 20%, discussion 10%)
  • Project: 50% (proposal write-up 5%, proposal presentation 5%, project mid-review presentation 5%, demo 15%, final presentation 10%, final write-up 10%)
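As a rough illustration of how the tentative weights above combine, the following Python sketch computes an overall percentage from per-component scores. The component names and the `final_grade` helper are hypothetical, chosen only for this example; each score is assumed to be a fraction between 0 and 1, and the weights themselves may change since the policy is tentative.

```python
# Tentative weights from the grading policy, in percentage points.
# The dictionary keys are illustrative names, not official labels.
WEIGHTS = {
    "written_reviews": 20,
    "paper_presentation": 20,      # class participation: presentation
    "class_discussion": 10,        # class participation: discussion
    "proposal_writeup": 5,
    "proposal_presentation": 5,
    "midreview_presentation": 5,
    "demo": 15,
    "final_presentation": 10,
    "final_writeup": 10,
}


def final_grade(scores):
    """Return the weighted total (0-100) for scores given as fractions in [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Example: full marks everywhere except the demo, which earned 80%.
    example = {name: 1.0 for name in WEIGHTS}
    example["demo"] = 0.8
    print(final_grade(example))
```

The six project components add up to the 50% allotted to the project, and all nine weights add up to 100.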

Late policy

Lateness is calculated from the time recorded on the assignment email received by the instructor. Students will lose 25% for each 24-hour period they are late on reviews, the project, or the paper.
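The penalty arithmetic can be sketched in Python as follows. This is only one reading of the policy: it assumes that any started 24-hour period counts in full, that the 25% is taken from the score the work would otherwise have earned, and that the result never drops below zero. The function names are hypothetical; consult the instructor for the authoritative interpretation.

```python
import math
from datetime import datetime

PENALTY_PER_PERIOD = 0.25  # from the late policy: 25% per 24-hour period


def late_periods(deadline, received):
    """Count 24-hour periods after the deadline.

    Assumption: a partially elapsed period counts as a full period.
    """
    seconds_late = (received - deadline).total_seconds()
    if seconds_late <= 0:
        return 0
    return math.ceil(seconds_late / (24 * 3600))


def score_after_penalty(raw_score, deadline, received):
    """Apply the late penalty to raw_score.

    Assumption: the penalty is a fraction of the raw score, floored at zero.
    """
    periods = late_periods(deadline, received)
    factor = max(0.0, 1.0 - PENALTY_PER_PERIOD * periods)
    return raw_score * factor


if __name__ == "__main__":
    deadline = datetime(2023, 2, 6, 23, 59)
    received = datetime(2023, 2, 7, 10, 0)  # about ten hours late
    print(score_after_penalty(80.0, deadline, received))
```

Under these assumptions, a submission that is four or more periods late receives no credit.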

Paper Review

Review guidelines: Provide a one-paragraph summary of the paper, a paragraph on 2-3 strong points of the paper (i.e., why the paper should be accepted), a paragraph on 2-3 weak points of the paper (i.e., why the paper should be rejected), and, optionally, brainstormed ideas for new research related to the work described in the paper.

Project

Both the project proposal and the final report should follow typical paper requirements, using the ACM double-column paper format. The project proposal should include an abstract, introduction, proposed approaches, and related work. The final project report should be a full paper, including an abstract, introduction, design and algorithms, experimental evaluation, related work, and conclusion. We will organize a mini-conference for students to present their project work. The three best papers will be selected during the mini-conference.

Class Schedule (Tentative)

 W  Date Topic Assigned Readings Assignments
1 1/10

Introduction [slides]
  • Chapter 1, Distributed Systems: Concepts and Design
Investigate your term project idea and prepare for it. A list of candidate project topics will also be provided in class. Talk to the instructor about your project idea, and talk to other students about forming a group of two to three members. Email the instructor to set up an appointment.

 

1/16 midnight: review due for

1/12 Replication [slides]
2 1/17 Replication [slides]   Investigate your term project idea and do preparation for it. Talk to the instructor about your project idea and talk to other students in forming a group if you would like to work in a group.

 

1/23 midnight: review due for

1/23 midnight: Paper presentation signup due. Please email the TA to bid on four papers from the list below, listing your choices in decreasing order of preference. You will be allocated two papers to present based on the FCFS policy and paper availability.

1/19 Research [slides]  
3 1/24 Project Testbed    1/30 midnight: review due for

 

1/26 Fundamentals [slides]
4 1/31 Overlay Networks
[slides]
2/6 midnight: project proposal due.
2/2 Methodology
[methodology slides, autonomic computing slides]
5 2/7 Peer-to-Peer Systems
[slides]
2/13 midnight: reviews due

 

2/9 Data-Intensive Computing
[slides]
6 2/14 System Research Methodology [slides] 2/20 midnight: reviews due

 

2/16 Wellness Day (No classes)  
7 2/21 Project Proposal Presentation

 

(Each group will have 12 minutes, including Q&A.)

  1. Adaptive AI-based Container Management Framework [slides] Siddharth Sheth, Siddarth Royapally, Jyothi Sumer Goud Maduru
  2. Virtual Machine Management in Distributed Computing Environments implemented in Kubernetes with Linux KVM [slides] Gaolin Zhang, Haoqu Ma, Huangxing Chen
  3. Proactive Horizontal Auto-Scaling for Kubernetes [slides] Rajesh Manedi, Rohit Mohan, Vinay Vasudev
  4. Automatic Detection of Runtime Performance Problems in Cloud-like Environments [slides] Zachary Parks, WeiRui Wang, Lalit Bangad
  5. Performance Diagnosis with Logs and Metrics [slides] Jae Jimmy Wong
2/27 midnight: reviews due

 

2/23 Student presentation
8 2/28 Student presentation No paper reading assigned. You should spend time on your term projects.
3/2 Student presentation
9 3/7  Student presentation No paper reading assigned. You should spend time on your term projects.
3/9 Student presentation
10 3/21 Student presentation No paper reading assigned. You should spend time on your term projects.
3/23 Student presentation
11 3/28  Project MidReview   No paper reading assigned. You should spend time on your term projects.
3/30 Student presentation
12 4/4 Student presentation No paper reading assigned. You should spend time on your term projects.
4/6 Student presentation
13 4/11 Student presentation
  • Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels, Dynamo: Amazon’s Highly Available Key-Value Store, Proc. of SOSP 2007 [slides] Vinay Vasudevo
No paper reading assigned. You should spend time on your term projects.
4/13 Student presentation
14 4/18 Student presentation No paper reading assigned. You should spend time on your term projects.
4/20 27 Project Demo  
15 4/25 Reading Day
(No class)
  May 3rd midnight: final project report, project source code, and documentation due. Your source code and documentation submission should be a single zip file. The zip file should include your system source code with all dependent packages, the experimental subjects used in the project report, instructions on how to set up and use the system to reproduce the experimental results, and any other documents that help others understand your tool and source code.
4/27 Final Presentation
(Project demo 10am-12pm at EB2 3300; final presentation 1:30pm-4:30pm at EB2 3211.)
 
16      

Suggested Topics for Student Presentations

(You may also suggest to the instructor papers that are not on this list but that you would like to present):

Please check below for your assigned paper.

Automatic Distributed System Management

  1. Chang Lou, Cong Chen, Peng Huang, Yingnong Dang, Si Qin, Xinsheng Yang, Xukun Li, Qingwei Lin, and Murali Chintalapati, RESIN: A Holistic Service for Dealing with Memory Leaks in Production Cloud Infrastructure, Proc. of OSDI 2022 Siddarth Royapally
  2. Yongle Zhang, Junwen Yang, Zhuqi Jin, Utsav Sethi, Kirk Rodrigues, Shan Lu, and Ding Yuan, Understanding and Detecting Software Upgrade Failures in Distributed Systems, Proc. of SOSP 2021 Huang-Xing Chen
  3. Yigong Hu, Gongqi Huang, and Peng Huang, Automated Reasoning and Detection of Specious Configuration in Large Systems with Symbolic Execution, Proc. of OSDI 2020 Jyothi Sumer Goud Maduru
  4. Sebastien Levy, Randolph Yao, Youjiang Wu, Yingnong Dang, Peng Huang, Zheng Mu, Pu Zhao et al., Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions, Proc. of OSDI 2020 Gaolin Zhang
  5. Jingzhu He, Yuhang Lin, Xiaohui Gu, Chin-Chia Michael Yeh, and Zhongfang Zhuang, PerfSig: Extracting Performance Bug Signatures via Multi-modality Causal Analysis, Proc. of ICSE 2022 Lalit Bangad
  6. Xudong Sun, Runxiang Cheng, Jianyan Chen, Elaine Ang, Owolabi Legunsen, and Tianyin Xu, Testing Configuration Changes in Context to Prevent Production Failures, Proc. of OSDI 2020 Vinay Vasudevo
  7. Jingzhu He, Ting Dai, Xiaohui Gu, and Guoliang Jin, HangFix: Automatically Fixing Software Hang Bugs for Production Cloud Systems, Proc. of SOCC 2020 Huang-Xing Chen
  8. Ting Dai, Jingzhu He, Xiaohui Gu, Shan Lu, and Peipei Wang, DScope: Detecting Real-World Data Corruption Hang Bugs in Cloud Server Systems, Proc. of SOCC 2018 Haoqu Ma
  9. Jingzhu He, Ting Dai, and Xiaohui Gu, TFix: Automatic Timeout Bug Fixing in Production Server Systems, Proc. of ICDCS 2019 Siddarth Royapally
  10. Hiep Nguyen, Zhiming Shen, Yongmin Tan, and Xiaohui Gu, FChain: Toward Black-box Online Fault Localization for Cloud Systems, Proc. of ICDCS 2013 Haoqu Ma
  11. Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan, Gorilla: A Fast, Scalable, In-Memory Time Series Database, Proc. of VLDB 2015 Zach Parks
  12. Benjamin H. Sigelman, Luiz Andre Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag, Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Google Technical Report 2010 Siddharth Sheth

Cloud Computing & Machine Learning

  1. Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol et al., Ray: A Distributed Framework for Emerging AI Applications, Proc. of OSDI 2018 Rohit Mohan
  2. L. Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang et al., Ansor: Generating High-Performance Tensor Programs for Deep Learning, Proc. of OSDI 2020 Jae Jimmy Wong
  3. Shoumik Palkar and Matei Zaharia, Optimizing Data-Intensive Computations in Existing Libraries with Split Annotations, Proc. of SOSP 2019 
  4. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin et al., TensorFlow: A System for Large-Scale Machine Learning, Proc. of OSDI 2016 Siddharth Sheth
  5. Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, and John Wilkes, AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service, Proc. of ICAC 2013 Gaolin Zhang
  6. Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes, CloudScale: Elastic Resource Scaling for Multi-Tenant Cloud Systems, Proc. of SOCC 2011 WeiRui Wang
  7. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels, Dynamo: Amazon’s Highly Available Key-Value Store, Proc. of SOSP 2007 Vinay Vasudevo
  8. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat et al., Spanner: Google’s Globally-Distributed Database, Proc. of OSDI 2012 Jae Jimmy Wong

Distributed Systems Security

  1. Alexander Van’t Hof and Jason Nieh, BlackBox: A Container Security Monitor for Protecting Containers on Untrusted Operating Systems, Proc. of OSDI 2022 Jyothi Sumer Goud Maduru 
  2. Yuhang Lin, Olufogorehan Tunde-Onadele, Xiaohui Gu, Jingzhu He, and Hugo Latapie, SHIL: Self-Supervised Hybrid Learning for Security Attack Detection in Containerized Applications, Proc. of ACSOS 2022 WeiRui Wang
  3. Rui Shu, Xiaohui Gu, and William Enck, A Study of Security Vulnerabilities on Docker Hub, Proc. of CODASPY 2017 Lalit Bangad
  4. Yuhang Lin, Olufogorehan Tunde-Onadele, and Xiaohui Gu, CDL: Classified Distributed Learning for Detecting Security Attacks in Containerized Applications, Proc. of ACSAC 2020 
  5. Olufogorehan Tunde-Onadele, Yuhang Lin, Jingzhu He, and Xiaohui Gu, Self-Patch: Beyond Patch Tuesday for Containerized Applications, Proc. of ACSOS 2020

Student Suggested Papers

  1. Taft, Rebecca, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi et al. CockroachDB: The Resilient Geo-Distributed SQL Database, Proc. of ACM SIGMOD 2020 Rohit Mohan
  2. Choi, Inho, Ellis Michael, Yunfan Li, Dan RK Ports, and Jialin Li. Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications, NSDI 2023 Rajesh Manedi
  3. Ha, Kiryong, Yoshihisa Abe, Thomas Eiszler, Zhuo Chen, Wenlu Hu, Brandon Amos, Rohit Upadhyaya, Padmanabhan Pillai, and Mahadev Satyanarayanan. You can teach elephants to dance: Agile VM handoff for edge computing, SEC 2017 Rajesh Manedi
  4. Zaharia, Matei, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. Discretized streams: Fault-tolerant streaming computation at scale, SOSP 2013 Zach Parks

Academic Integrity

The university provides a detailed policy on academic integrity. This policy can be found in the Code of Student Conduct. It is understood that when you submit your homework, you are implicitly agreeing to the university honor pledge: “I have neither given nor received unauthorized aid on this test or assignment.”

Academic dishonesty (e.g., cheating or plagiarism) will not be tolerated under any circumstances. If you are having difficulty with any part of the course material, please see me as soon as possible. I will do everything I can to help you with any course-related problems you may be having. If you are found to be guilty of academic dishonesty, however, I will then do everything I can to see that you are punished as forcefully as possible. This may include asking to have you suspended or expelled from the course, the program, and/or the university. At a minimum, you will receive -50% for the assignment in question, and your name will be placed on record with the university as having committed an academic offence. Multiple offences during your academic career will result in suspension or expulsion from the university. I take absolutely no pleasure in pursuing cases of academic misconduct, and would ask that you please do not put me in this position.

Students With Disabilities

All effort will be made to ensure that no students with disabilities are denied any opportunity to successfully complete this course. If you have specific requirements that need to be addressed, please contact me immediately. Possible changes can include (but are not necessarily limited to) rescheduling classes from inaccessible to accessible buildings, or providing access to auxiliary aids such as tape recorders, special lab equipment, or other services such as readers, note takers, or interpreters. This may also include oral or taped tests, readers, scribes, separate testing rooms, or extension of time limits.

Lab Safety Issues

None.

Pass-Through Costs

None.