Dates
Thursday, April 28, 2022, 11:00am to 12:00pm
Event Description


Abstract: Resource disaggregation (RD) is an emerging paradigm for data center computing in which resource-optimized servers are employed to minimize resource fragmentation and improve resource utilization. Apache Spark deployed under the RD paradigm uses a cluster of compute-optimized servers to run executors and a cluster of storage-optimized servers to host the data on HDFS. However, the network transfer from the storage cluster to the compute cluster becomes a severe bottleneck for big data processing. Near-data processing (NDP) aims to alleviate network load in such cases by offloading (or pushing down) some of the compute tasks to the storage cluster. Employing NDP for Spark under the RD paradigm is challenging because storage-optimized servers have limited computational resources and cannot host the entire Spark processing stack. Further, it is not obvious which Spark queries would benefit from pushdown, or which tasks of a given query should be pushed down to storage.

In our work, we developed an analytical model that helps determine which Spark tasks should be pushed down to storage based on the current network and system state. We tested this model on an open-source project, developed by our collaborators, that implements NDP for Spark and Hadoop workloads. Our experiments showed that the existing NDP strategy of pushing down all available operations can reduce execution time by up to 71%, and that the strategy based on our model can provide a further reduction of up to 38% compared to the existing strategy.

For Zoom information, contact events [at] cs.stonybrook.edu

Event Title
Ph.D. Research Proficiency Presentation: Sri Pramodh Rachuri, 'Optimizing Near-Data Processing for Spark'