Dates
Friday, April 22, 2022 - 02:40pm to Friday, April 22, 2022 - 03:40pm
Location
NCS 120
Event Description


Abstract: 

Modern database and storage systems provide live fault tolerance through replication supported by consensus protocols, e.g., Raft. This talk will review the common designs of these systems and discuss the challenges and shortcomings in their design and implementation. The talk covers three main projects: (i) We designed and implemented strongly consistent replication in MongoDB through a novel consensus protocol derived from Raft. A major difference between our protocol and vanilla Raft is that MongoDB uses a pull-based data synchronization model: a replica pulls new data from another replica. (ii) The need for fail-slow fault tolerance in modern distributed systems is highlighted by the growing number of reports of fail-slow hardware and software components that degrade performance system-wide. We argue that fail-slow fault tolerance not only needs new distributed protocol designs, but also calls for programming support for implementing and verifying fail-slow fault-tolerant code. (iii) We present Rolis, a new fast and fault-tolerant replicated multi-core transactional database system. Rolis aims to mask the high cost of replication with a novel execute-replicate-replay approach, which ensures that cores are always doing useful work rather than waiting for each other or for other replicas.

Bio: 

Shuai Mu is an assistant professor of Computer Science at Stony Brook University. His main research interests are in distributed systems and multi-core systems. His work has been widely adopted by industry, e.g., he helped MongoDB design their replication schemes. Prior to joining Stony Brook, he was a post-doc lecturer in the Systems Group at NYU. He obtained his Ph.D. from Tsinghua University.

Event Title
PhD Seminar, Shuai Mu: 'Fault tolerance in practice'