The Stanford Systems Seminar, sponsored by Platform Lab, is an opportunity for Stanford faculty and students working in systems to present their research and invite guest speakers. Seminars are held in-person/hybrid on Tuesdays in the Gates Fujitsu Room (Gates 403) at 4 PM PST, with an online option on Zoom. Food is provided. The seminar is organized by Priya Mishra and Anjiang Wei. To receive regular updates on upcoming talks, as well as Zoom links to join them virtually, please subscribe via mailman. Please also reach out to us if you are interested in presenting.
Abstract: The landscape of computing has undergone a significant transformation with the death of Dennard scaling and the slowing of Moore’s law: applications now drive innovations in computer systems architecture. At the same time, the advent of high-throughput, low-cost sequencing technology has revolutionized genomics. The massive amount of data generated in genomics has uncovered acute compute challenges in biological inference due to limitations of software on traditional multi-core systems. To address this challenge, my research employs a hardware-software-algorithm co-design approach to significantly improve computational performance in key areas of comparative and clinical genomics. In this talk, I will present systems that accelerate pipelines in both these domains of genomics. First, I will present Darwin-WGA (FPGA/ASIC) and SegAlign (GPU) for cross-species whole genome alignment, where co-design has yielded orders-of-magnitude increases in speed. Modifying the algorithm to suit the underlying hardware implementation also yields gains in accuracy. Next, I will talk about the ultra-rapid nanopore whole genome sequencing pipeline that can deliver a genetic diagnosis in under 8 hours, making it the fastest pipeline to date. The scalable, cloud-based distributed infrastructure overcomes system bottlenecks to enable near real-time computation and improved variant identification. This pipeline has been deployed in critical care units in Stanford hospitals and applied to 13 patients. Finally, I will share my vision for continued advancements in computer systems applied to emerging fields such as pangenomics, single-cell genomics, and long-read clinical sequencing.
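The alignment kernels that accelerators like Darwin-WGA and SegAlign speed up are, at their core, dynamic-programming recurrences. As a hedged illustration (not the talk's actual algorithm), here is the classic Smith-Waterman local alignment recurrence in plain Python; the scoring values are arbitrary choices for the example:

```python
# Toy Smith-Waterman local alignment: the dynamic-programming style of kernel
# that genomics accelerators implement in hardware. Scoring parameters
# (match/mismatch/gap) are illustrative, not from the talk.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "TACGTA"))  # → 8 (four matches at 2 points each)
```

Hardware implementations exploit the anti-diagonal parallelism of this table: every cell on a given anti-diagonal depends only on earlier diagonals, so they can be computed simultaneously.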
Bio: Sneha Goenka is a Ph.D. candidate in the Electrical Engineering Department at Stanford University where she is advised by Prof. Mark Horowitz. Her research centers on designing efficient computer systems for advancing genomic pipelines for clinical and research applications, with a focus on improving speed and cost. She is a 2023 Forbes 30 Under 30 Honoree in the Science category, 2022 NVIDIA Graduate Fellow, and 2021 Cadence Women in Technology Scholar. She has a B.Tech. and M.Tech. (Microelectronics) in Electrical Engineering from the Indian Institute of Technology, Bombay where she received the Akshay Dhoke Memorial Award for the most outstanding student in the program.
Abstract: We study the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory, and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances, which are likely to be reused. Experimentally, we show that our highly-optimized implementations using 48 cores with two-way hyper-threading achieve 5.8--110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75--54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which previous algorithms are not able to handle.
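The nearest-neighbor chain algorithm that ParChain parallelizes rests on one observation: following nearest-neighbor links always reaches a reciprocal pair, which can be merged without affecting correctness. A minimal sequential sketch (complete linkage, restarting the chain after every merge; ParChain's optimizations and parallelism are omitted):

```python
# Sequential nearest-neighbor chain HAC with complete linkage. A chain of
# nearest neighbors terminates in a reciprocal pair, which is safe to merge.
# Illustrative sketch only; ParChain parallelizes and heavily optimizes this.

def dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def complete_linkage(c1, c2):
    return max(dist(p, q) for p in c1 for q in c2)

def nn_chain_hac(points, target_clusters=1):
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        chain = [0]
        while True:
            cur = chain[-1]
            # nearest neighbor of the current cluster under complete linkage
            nn = min((i for i in range(len(clusters)) if i != cur),
                     key=lambda i: complete_linkage(clusters[cur], clusters[i]))
            if len(chain) >= 2 and nn == chain[-2]:
                break  # reciprocal nearest neighbors found
            chain.append(nn)
        a, b = sorted(chain[-2:], reverse=True)
        merged = clusters[a] + clusters[b]
        del clusters[a]; del clusters[b]
        clusters.append(merged)
    return clusters
```

The range-query and distance-caching optimizations in the paper attack exactly the expensive step here: the linear scan inside `min` that recomputes linkage distances.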
Bio: Shangdi is a PhD student in the MIT Department of Electrical Engineering and Computer Science, advised by Professor Julian Shun. Her research focuses on parallel algorithms for graph and metric data clustering. Shangdi received her BSc in Computer Science and Operations Research from Cornell University and her MSc in Computer Science from MIT.
Abstract: The massive growth of machine learning-based applications and the end of Moore’s law have created a pressing need to redesign computing platforms. We propose Lightning, the first reconfigurable photonic-electronic smartNIC to serve real-time deep neural network inference requests. Lightning uses a fast datapath to feed traffic from the NIC into the photonic domain without creating digital packet processing and data movement bottlenecks. To do so, Lightning leverages a novel reconfigurable count-action abstraction that keeps track of the required computation operations of each inference packet. Our count-action abstraction decouples the compute control plane from the data plane by counting the number of operations in each task and triggering the execution of the next task(s) without interrupting the dataflow. We evaluate Lightning’s performance using four platforms: a prototype, chip synthesis, emulations, and simulations. Our prototype demonstrates the feasibility of performing 8-bit photonic multiply-accumulate operations with 99.25% accuracy. To the best of our knowledge, our prototype is the highest-frequency photonic computing system, capable of serving real-time inference queries at 4.055 GHz end-to-end. Our simulations with large DNN models show that compared to Nvidia A100 GPU, A100X DPU, and Brainwave smartNIC, Lightning accelerates the average inference serve time by 337×, 329×, and 42×, while consuming 352×, 419×, and 54× less energy, respectively.
Bio: Zhizhen Zhong is a postdoctoral researcher at MIT CSAIL. His research focuses on the intersection of networked systems and photonics/optoelectronics to build next-generation reconfigurable computer networks and high-performance networked computing infrastructures. Before joining MIT, he was a visiting researcher at Meta (Facebook) with Dr. Ying Zhang. He received his Ph.D. in Electronic Engineering from Tsinghua University in July 2019. His Ph.D. thesis, 'Traffic-Driven Self-Adaptive Optical Networking', designs reconfigurable optical networks that adapt to spatiotemporal traffic demand dynamics for high throughput and low latency. He was a visiting Ph.D. student at the University of California, Davis, advised by Prof. Biswanath Mukherjee and Prof. Massimo Tornatore. He received his Bachelor's degree in Engineering and a secondary Bachelor's degree in Economics from Tsinghua University in 2014 and 2016, respectively. His awards include the Best Paper Award at the OECC conference in 2022, the Zijing-Scholar Fellowship from Tsinghua University in 2019, the 1st Place Best Poster Award (1 out of 600+ posters) at the IEEE CLEO-PR/OECC/PGC conference from the IEEE Photonics Society in 2017, and the Tang Lixin Scholarship from Tsinghua University in 2016.
Abstract: The rapid expansion of wide-area networks (WANs) in the past decades makes it challenging for commercial optimization engines to solve the network traffic engineering (TE) problem quickly at scale. Recent approaches decompose the problem into parallel subproblems for acceleration, but they achieve limited parallelism and speedup in order to balance runtime and TE performance. We propose a learning-based TE algorithm, Teal, which harnesses the parallel processing power of GPUs to accelerate TE control. Teal is composed of (i) a flow-centric GNN, called FlowGNN, to capture WAN connectivity and model network flows; (ii) multi-agent RL that allocates each traffic demand independently for problem scalability and learning tractability; and (iii) an ADMM fine-tuning step to minimize constraint violations. Our evaluation on large WAN topologies with real traffic shows that Teal generates near-optimal flow allocations while being several orders of magnitude faster than the production optimization engine.
Bio: Zhiying is a 5th-year Ph.D. student in the Dept. of Computer Science at Harvard University, advised by Prof. Minlan Yu. Before that, she earned her Bachelor's degree at Shanghai Jiao Tong University. Her research focuses on solving network problems with learning-based approaches.
Abstract: The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define “big data”, i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on varying fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies, and augments it with concepts drawn from compilers (e.g., iterations) and data stream processing, as well as (2) forming a community of researchers and institutions to create the Stratosphere platform to realize our vision. One major result from these activities was Apache Flink, an open-source big data analytics platform, and its thriving global community of developers and production users. Although much progress has been made, when looking at the overall big data stack, a major challenge for the database research community still remains: how to maintain ease-of-use despite the increasing heterogeneity and complexity of data analytics, which involves specialized engines for various aspects of an end-to-end data analytics pipeline (including, among others, graph-based, linear algebra-based, and relational algorithms) as well as increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Institute for the Foundations of Learning and Data (BIFOLD), we currently aim to advance research in this field via the NebulaStream and Agora projects. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to just the privileged few, who are coveted experts. In this talk, we will outline how state-of-the-art SPEs have to change to exploit the new capabilities of the IoT and showcase how we tackle IoT challenges in our own system, NebulaStream.
We will also present our vision for Agora, an asset ecosystem that provides the technical infrastructure for offering and using data and algorithms, as well as physical infrastructure components.
Bio: Volker Markl is a German Professor of Computer Science. He leads the Chair of Database Systems and Information Management at TU Berlin and the Intelligent Analytics for Massive Data Research Department at DFKI. In addition, he is Director of the Berlin Institute for the Foundations of Learning and Data (BIFOLD). He is a database systems researcher, conducting research at the intersection of distributed systems, scalable data processing, and machine learning. Volker led the Stratosphere project, which resulted in the creation of Apache Flink. Volker has received numerous honors and prestigious awards, including two ACM SIGMOD Research Highlight Awards and best paper awards at ACM SIGMOD, VLDB, ICDE, and EDBT. He was recognized as an ACM Fellow for his contributions to query optimization, scalable data processing, and data programmability. In 2014, he was elected one of Germany's leading “Digital Minds” (Digitale Köpfe) by the German Informatics Society. He is a member of the Berlin-Brandenburg Academy of Sciences and serves as an advisor to academic institutions, governmental organizations, and technology companies. Volker holds eighteen patents and has been a co-founder of and mentor to several startups.
Abstract: In this talk, I will present Junction, a new operating system for the next generation of datacenters. Today’s datacenter applications are outgrowing the capabilities of existing operating systems. For example, they struggle to take advantage of recent improvements in datacenter networks, and they suffer from poor performance when other applications share a machine. At the same time, hardware is evolving to support more direct execution, or the ability to access once-privileged hardware features without going through the kernel. I will discuss our effort to combine the latest hardware support for direct execution with a new kernel design that focuses on fine-grained, microsecond-scale sharing of resources to deliver superior performance, security, and utilization. I will conclude with an example of how these improvements can enable a new programming model in the cloud.
Bio: Adam Belay is an Associate Professor of Computer Science at the Massachusetts Institute of Technology, where he works on operating systems, runtime systems, and distributed systems. During his Ph.D. at Stanford, he developed Dune, a system that safely exposes privileged CPU instructions to userspace; and IX, a dataplane operating system that significantly accelerates I/O performance. Dr. Belay’s current research interests lie in developing systems that cut across hardware and software layers to increase datacenter efficiency and performance. He is a member of the Parallel and Distributed Operating Systems Group, and a recipient of a Google Faculty Award, a Facebook Research Award, and the Stanford Graduate Fellowship.
Abstract: Building a high-performance VM for a dynamic language has traditionally required a huge amount of time, money, and expertise. To reduce the high engineering cost, we present Deegen, a meta-compiler that generates a high-performance VM automatically from a semantic description of the bytecodes. Currently, Deegen is capable of automatically generating an optimized interpreter and a baseline JIT compiler. This allows the user to get a high-performance VM for their own language at an engineering cost similar to writing an interpreter. To demonstrate Deegen's capability in the real world, we implemented LuaJIT Remake (LJR), a standard-compliant VM for Lua 5.1. Across a variety of benchmarks, we demonstrate that LJR's interpreter significantly outperforms LuaJIT's interpreter, and that LJR's baseline JIT generates high-quality code with a negligible compilation cost.
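To make the premise concrete: in a conventional VM, someone hand-writes a dispatch loop like the toy one below; Deegen's pitch is that the user supplies only the per-bytecode semantics (the handler bodies) and the optimized interpreter and JIT machinery are generated from them. The opcodes here are hypothetical, purely for illustration:

```python
# A minimal hand-written bytecode interpreter dispatch loop. With Deegen, the
# user would write only the semantics of each bytecode (the branches below),
# and the interpreter/baseline-JIT would be generated. Hypothetical opcodes.

def run(bytecode):
    stack, pc = [], 0
    while pc < len(bytecode):
        op, arg = bytecode[pc]
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "RET":
            return stack.pop()
        pc += 1

# (3 + 4) * 2
program = [("PUSH", 3), ("PUSH", 4), ("ADD", None),
           ("PUSH", 2), ("MUL", None), ("RET", None)]
print(run(program))  # → 14
```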
Bio: Haoran Xu is currently a Ph.D. student in Computer Science at Stanford University, advised by Fredrik Kjolstad.
Abstract: Distributed Systems and Packet-Switched Networks were developed in the 1970s under a "clockless design" paradigm. This was mainly due to the difficulty of accurately synchronizing clocks over jittery packet-switched networks and it caused a bifurcation whose effects are felt to this day: widely-used "commodity" networks (such as those in public clouds) offer "best effort" service, while networks using specialized hardware and protocols offer "high-performance" or "time-sensitive" services. Imagine clocks can be accurately synchronized at scale and at distance without the need for specialized hardware. What implications would this have for Distributed Systems and Networking? This talk builds on Huygens---a high-accuracy, software-based network clock synchronization system. We will provide examples of how Huygens can be used to rectify some of the defects of clockless designs and transform the unpredictable public cloud infrastructure into a high-performance, time-sensitive system. Specifically, we will describe how to (1) convert Ethernet into a "zero-drop network," (2) provide programmable bandwidth slices without hardware support, (3) enable deterministic propagation times, (4) discover "noisy relative" VMs, and (5) provide a TrueTime-like API to enable external consistency in databases. We conclude by mentioning a few other examples where time-sensitivity and "timeliness" play a critical role.
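As background for readers new to software clock synchronization, the basic building block is a two-way probe exchange with four timestamps, from which offset and delay are estimated. The sketch below shows only this classic per-probe arithmetic (NTP-style); Huygens itself layers coded probes and SVM-based filtering over many such exchanges:

```python
# Classic two-way probe exchange for clock offset estimation. Huygens builds
# on many such probes, adding coded probes and an SVM-based filter; this
# shows only the basic four-timestamp arithmetic.

def offset_and_delay(t1, t2, t3, t4):
    """t1: client send, t2: server receive, t3: server send, t4: client
    receive (t1, t4 on the client clock; t2, t3 on the server clock)."""
    offset = ((t2 - t1) + (t3 - t4)) / 2   # server clock minus client clock
    delay = (t4 - t1) - (t3 - t2)          # round-trip network delay
    return offset, delay

# Example: server clock 5 units ahead, 2-unit one-way delay each direction,
# 1 unit of server-side turnaround.
off, rtt = offset_and_delay(t1=100, t2=107, t3=108, t4=105)
print(off, rtt)  # → 5.0 4
```

The offset estimate is exact only when the two one-way delays are equal; filtering across many probes (as Huygens does) is what removes the jitter-induced error.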
Bio: Balaji Prabhakar is a faculty member in the Electrical Engineering and Computer Science departments and, by courtesy, in the Graduate School of Business at Stanford University. His research interests are in computer networks; notably, in Data Center Networks and Cloud Computing Platforms. His work spans network algorithms, congestion control protocols, and stochastic network theory. He has also worked on Societal Networks, where he has developed "nudge engines" to incentivize commuters to travel in off-peak times so that congestion, fuel and pollution costs are reduced.
Abstract: Data center networks are moving toward line rates of 200Gbps and beyond to satisfy the performance requirements of applications such as NVMe and distributed ML. With larger Bandwidth Delay Products (BDPs), an increasing number of transfers fit within a few BDPs. These transfers are not only more performance-sensitive to congestion, but also bring more challenges to congestion control (CC) as they leave little time for CC to make the right decisions. Therefore, CC is under more pressure than ever before to achieve minimal queuing and high link utilization, leaving no room for imperfect control decisions. We identify that for CC to make quick and accurate decisions, the use of precise congestion signals and minimization of the control loop delay are vital. We address these issues by designing Bolt, an attempt to push congestion control to its theoretical limits by harnessing the power of programmable data planes. Bolt is founded on three core ideas: (i) Sub-RTT Control (SRC) reacts to congestion faster than the RTT control loop delay, (ii) Proactive Ramp-up (PRU) foresees flow completions in the future to promptly occupy released bandwidth, and (iii) Supply Matching (SM) explicitly matches bandwidth demand with supply to maximize utilization. Our experiments in testbed and simulations demonstrate that Bolt reduces 99th-percentile latency by 80% and improves 99th-percentile flow completion time by up to 3× compared to Swift and HPCC while maintaining near line-rate utilization even at 400Gbps.
Bio: Serhat is a 5th-year PhD student at Stanford University. He works with Nick McKeown and Sachin Katti. He mainly studies how to extract the most effective telemetry from different types of networks to make them more controllable. Today, he is going to present his work on extremely precise congestion control for high-speed data centers. Nowadays, he is working on bringing controllability to cellular networks with a systems approach. He will be on the job market next year.
Abstract: In this talk, I present Wi-Peep -- a new location-revealing privacy attack on non-cooperative Wi-Fi devices. Wi-Peep exploits loopholes in the 802.11 protocol to elicit responses from Wi-Fi devices on a network that we do not have access to. It then uses a novel time-of-flight measurement scheme to locate these devices. Wi-Peep works without any hardware or software modifications on target devices and without requiring access to the physical space that they are deployed in. Therefore, a pedestrian or a drone that carries a Wi-Peep device can estimate the location of every Wi-Fi device in a building. Our Wi-Peep design costs $20 and weighs less than 10 g. We deploy it on a lightweight drone and show that a drone flying over a house can estimate the location of Wi-Fi devices across multiple floors to meter-level accuracy. Finally, I present mitigation techniques to secure future Wi-Fi devices against such attacks.
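For readers unfamiliar with time-of-flight localization, the core geometry is simple: a round-trip time yields a distance, and distances from several measurement positions pin down a location. The hedged sketch below shows this generic arithmetic only (a linearized 2D trilateration); Wi-Peep's actual scheme additionally compensates for the variable SIFS turnaround inside 802.11 hardware:

```python
# Generic time-of-flight localization arithmetic: RTT -> distance, then
# 2D trilateration from three measurement positions by linearizing the
# circle equations into a 2x2 linear system. Illustrative values only.
import math

C = 299_792_458.0  # speed of light, m/s

def tof_distance(rtt_s, turnaround_s):
    """One-way distance from round-trip time minus responder turnaround."""
    return C * (rtt_s - turnaround_s) / 2

def trilaterate(anchors, dists):
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = dists
    # Subtracting circle equations removes the quadratic terms:
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

# Device at (3, 4), measured from three known positions:
anchors = [(0, 0), (10, 0), (0, 10)]
dists = [math.dist((3, 4), a) for a in anchors]
print(trilaterate(anchors, dists))  # recovers (3.0, 4.0) up to rounding
```

Since light travels ~0.3 m per nanosecond, meter-level accuracy requires resolving round-trip times to a few nanoseconds, which is what makes the measurement scheme the hard part.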
Bio: Ali Abedi is currently a postdoctoral scholar at Stanford University. He is also an adjunct professor of computer science at the University of Waterloo. His research interests are in the areas of wireless networks and mobile systems with a special focus on the Internet of Things (IoT) and smart environments. He received his Ph.D. in computer science from the University of Waterloo. His work has been published in top systems and networking venues such as SIGCOMM, MobiCom, and HotNets. His research projects have been featured in ACM GetMobile, ACM Tech News, and many technology websites. He has received multiple grants from the Natural Sciences and Engineering Research Council of Canada (NSERC).
Abstract: The ledgers that record the title to your house, your stocks, or the money in your bank account are each maintained by a single central operator. This results in asymmetric access privileges between operators and users, which create single points of failure and abuse, and stifle interoperability and innovation. Blockchains provide a way to maintain a ledger without a central operator, thereby alleviating these problems. The foundation of every blockchain is an Internet-scale consensus protocol. Although consensus has been studied in distributed systems for decades, Internet-scale consensus requires new models, new security properties, and new protocols due to the unprecedented scale, open participation, fragile Internet communication, and self-interested participants. I will present two examples from my work: (1) Ethereum, the second largest cryptocurrency, aims to strengthen both the traditional consensus liveness property (to support unforeseeable coming-and-going of parties), and the traditional consensus safety property (to hold malicious parties accountable). We show that no single-ledger protocol can satisfy both strengthened properties. To resolve this dilemma, we develop the multi-ledger consensus paradigm that is now the security design-specification for Ethereum and similar systems. (2) The bounded-delay network model, pervasive in distributed systems, does not capture rate constraints on communication and processing. This leaves many 'provably secure' protocols vulnerable. We show via a new queuing-based model that good communication/processing scheduling is key to protocol security. We break a popular scheduling policy. Our new secure policies are simple enough to envision them forward-deployed at Internet providers via an application-agnostic system that can improve the reliability of security-critical traffic even beyond blockchain.
Bio: Joachim Neu is a PhD student at Stanford advised by David Tse working on Internet-scale consensus (in more hype terms: the technical foundations of blockchains). His current research focus is provable consensus security for next-generation Ethereum, and provable security and performance of proof-of-stake consensus under bandwidth constraints and network-level attacks. In an earlier life, he published in information and coding theory.
Abstract: A cryptographic computer is a cryptosystem that takes a computation as an input. That is, a user-programmable cryptosystem. Examples: zero-knowledge proofs (ZKP), succinct proofs, multi-party computation (MPC), homomorphic encryption, functional encryption, functional commitments, etc. Today, cryptographic computers (privately) secure billions of dollars. Cryptographic computers are (philosophically/politically) exciting because they give the programmability and power of a regular computer, but can guarantee privacy/integrity properties. They're also (computationally) exciting because their semantics are rather different from traditional computers, which re-raises variations of classic questions:
* How can we program (or compile to) cryptographic computers?
* How can we verify programs written for cryptographic computers?
* What are the right programming interfaces to cryptographic computers (in terms of computational semantics *and* security properties)?
In this talk, I'll survey how Zero-Knowledge Proofs (one type of cryptographic computer) are used, and I'll summarise my research on compilation, verification, and generalization for them.
Bio: Alex Ozdemir is a Stanford PhD student working with the applied cryptography group and the CENTer for AUtomated Reasoning (CENTAUR). His main focus is on 'cryptographic computers': cryptosystems that are configured by user-defined programs. His work includes new cryptographic computers, compilers for cryptographic computers, and tools for verifying programs that run on cryptographic computers.
Abstract: Video conferencing systems suffer from poor user experience when network conditions deteriorate because current video codecs simply cannot operate at extremely low bitrates. Recently, several neural alternatives have been proposed that reconstruct talking head videos at very low bitrates using sparse representations of each frame such as facial landmark information. However, these approaches produce poor reconstructions in scenarios with major movement or occlusions over the course of a call, and do not scale to higher resolutions. We design Gemino, a new neural compression system for video conferencing based on a novel high-frequency-conditional super-resolution pipeline. Gemino upsamples a very low-resolution version of each target frame while enhancing high-frequency details (e.g., skin texture, hair, etc.) based on information extracted from a single high-resolution reference image. We use a multi-scale architecture that runs different components of the model at different resolutions, allowing it to scale to resolutions comparable to 720p, and we personalize the model to learn specific details of each person, achieving much better fidelity at low bitrates. We implement Gemino atop aiortc, an open-source Python implementation of WebRTC, and show that it operates on 1024x1024 videos in real-time on an A100 GPU, and achieves 2.9x lower bitrate than traditional video codecs for the same perceptual quality.
Bio: Vibhaa is a sixth year Ph.D. student in the Networking and Mobile Systems Group at MIT CSAIL where she is advised by Prof. Mohammad Alizadeh. Her research interests lie broadly in computer networks, with a particular interest in algorithmic techniques. In the past, she has worked on using networking ideas to improve blockchain scalability, as well as network monitoring and heavy-hitter detection. Recently, she has been interested in improving video streaming and conferencing applications using advances in computer vision and video compression techniques. Prior to MIT, Vibhaa received a B.S.E. in Computer Science from Princeton University.
Abstract: We are in a new computing era of domain-specific accelerators, where Google's TPU is a visible example. Building such accelerators calls for broader codesign, not just traditional codesign at the hardware/software interface, but vertically integrated codesign that reaches up to applications and down to materials science and device physics. I'll talk about the balance between science and engineering, about how codesign works in TPUs, and I'll pose some materials challenges looking forward.
Bio: Cliff Young is a software engineer in Google Research, where he works on codesign for deep learning accelerators. He is one of the designers of Google’s Tensor Processing Unit (TPU) and one of the founders of the MLPerf benchmark. Previously, Cliff built special-purpose supercomputers for molecular dynamics at D. E. Shaw Research and was a Member of Technical Staff at Bell Labs. Cliff holds AB, MS, and PhD degrees in computer science from Harvard University. Cliff is a member of ACM and IEEE.
Abstract: In this talk, I will explore three elements of designing next-generation machine learning systems: congestion control, network topology, and computation frequency. I will show that fair sharing, the holy grail of congestion control algorithms, is not necessarily desirable for deep neural network training clusters. Then I will introduce a new optical fabric that optimally combines network topology and parallelization strategies for machine learning training clusters. Finally, I will demonstrate the benefits of leveraging photonic computing systems for real-time, energy-efficient inference via analog computing. Pushing the frontiers of optical networks for machine learning workloads will enable us to fully harness the potential of deep neural networks and achieve improved performance and scalability.
Bio: Manya Ghobadi is a faculty member in the EECS department at MIT. Her research spans different areas in computer networks, focusing on optical reconfigurable networks, networks for machine learning, and high-performance cloud infrastructure. Her work has been recognized by the Sloan Fellowship in Computer Science, ACM SIGCOMM Rising Star award, NSF CAREER award, Optica Simmons Memorial Speakership award, best paper award at the Machine Learning Systems (MLSys) conference, as well as the best dataset and best paper awards at the ACM Internet Measurement Conference (IMC). Manya received her Ph.D. from the University of Toronto and spent a few years at Microsoft Research and Google prior to joining MIT.
Abstract: Modern databases are moving to the cloud for lower cost, elastic resource allocation, and high availability. Cloud-native databases adopt a unique storage-disaggregation architecture, where the computation and storage are decoupled as two separate services connected through the data center network. The new storage-disaggregation architecture demands a revisit of database system design. In this talk, I first discuss our recent research on optimizing the two-phase commit protocol, a fundamental building block in distributed transactional databases. We leverage the disaggregated storage service to reduce the protocol latency and eliminate blocking. Then, I discuss our recent work on leveraging storage-layer computation (i.e., a pushdown layer) to accelerate data analytics processing. Finally, I will discuss our ongoing research and future plans on cloud-native database systems.
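For context, here is the textbook two-phase commit structure the talk optimizes: a coordinator commits only if every participant votes yes in the prepare phase, then broadcasts the decision. The participants below are toy in-process objects, purely illustrative; the talk's contribution is precisely in restructuring this logic around disaggregated storage:

```python
# Textbook two-phase commit (2PC), illustrative only. Phase 1 collects
# prepare votes; phase 2 broadcasts commit iff all votes were yes.

class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.state = "init"

    def prepare(self):
        # Vote yes only if this participant can durably commit.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, commit):
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1
    decision = all(votes)
    for p in participants:                        # phase 2
        p.finish(decision)
    return decision

print(two_phase_commit([Participant("A"), Participant("B")]))  # → True
```

The well-known pain points, extra round trips of latency and blocking while a coordinator is unreachable, are the two costs the disaggregated-storage redesign targets.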
Bio: Xiangyao Yu is an Assistant Professor at the University of Wisconsin-Madison. His research interests include (1) cloud-native databases, (2) new hardware for databases, and (3) transactions and HTAP. Before joining UW-Madison, he completed his postdoc and Ph.D. at MIT and his bachelor's degree at Tsinghua University.
Abstract: Modern computing platforms use a mix of different hardware technologies to scale application performance and resource capacities and to achieve cost effectiveness. However, this heterogeneity, along with the greater irregularity in the behavior of emerging workloads, renders existing resource management approaches ineffective. In the first part of this talk, I will describe how we can use machine learning methods at the operating system level in order to make smarter resource management decisions and speed up application performance. In the second part of the talk, I will present how we can accelerate certain components of such systems using visualization and computer vision methods. Finally, I will conclude with my vision of coupling machine learning and computer vision at the system level and present open questions that make this research area exciting to work on!
Bio: Thaleia Dimitra Doudali is an Assistant Research Professor at the IMDEA Software Institute in Madrid, Spain. She received her PhD from the Georgia Institute of Technology (Georgia Tech) in the United States. Prior to that she earned an undergraduate diploma in Electrical and Computer Engineering at the National Technical University of Athens in Greece. Thaleia’s research lies at the intersection of Systems and Machine Learning, where she explores novel methodologies, such as machine learning and computer vision, to improve system-level resource management of emerging hardware technologies. In 2020, Thaleia was selected to attend the prestigious Rising Stars in EECS academic workshop. Aside from research, Thaleia actively strives to improve mental health awareness in academia and to foster diversity and inclusion.
Abstract: Processing input data plays a vital role in ML training, impacting accuracy, throughput, and cost. The input data pipeline is responsible for extracting data from storage, transforming data on-the-fly, and loading data to a training node (typically a GPU or TPU). As ML hardware accelerators continue to provide more FLOPS, feeding data at a sufficient rate to saturate accelerators is increasingly challenging. The high cost of accelerators compared to their CPU hosts makes it particularly important to ensure that they operate at high utilization. In this talk, I will discuss the characteristics of ML input data pipelines and motivate the design of a new system architecture, in which we disaggregate input data processing from model training. I will present Cachew, a fully-managed service for ML data processing, built on top of TensorFlow's data loading framework, tf.data. Cachew disaggregates and dynamically scales distributed resources for data processing to avoid input data stalls. The service also maintains a global view of data processing across jobs, which enables selectively caching preprocessed datasets to maximize training throughput and improve energy efficiency across jobs. I will conclude by discussing open research questions in the area of data storage and data processing systems for ML.
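The two ideas, keeping preprocessing off the training thread and caching preprocessed data, can be illustrated with a toy pure-Python pipeline (not Cachew's or tf.data's API): a background thread feeds a bounded queue, analogous to prefetching, and a shared cache lets repeated epochs skip the transformation work:

```python
# Toy input pipeline sketch: a producer thread preprocesses examples into a
# bounded queue (cf. tf.data prefetch), and a cache of preprocessed examples
# lets later epochs skip the work. Names and structure are illustrative.
import queue
import threading

def make_pipeline(raw_examples, preprocess, cache, buffer_size=8):
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        for ex in raw_examples:
            if ex not in cache:          # selective caching of preprocessed data
                cache[ex] = preprocess(ex)
            q.put(cache[ex])
        q.put(None)                      # end-of-epoch sentinel

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is None:
            return
        yield item

calls = []
def preprocess(x):
    calls.append(x)                      # count how often we actually transform
    return x * 10

cache = {}
epoch1 = list(make_pipeline([1, 2, 3], preprocess, cache))
epoch2 = list(make_pipeline([1, 2, 3], preprocess, cache))  # served from cache
print(epoch1, len(calls))  # → [10, 20, 30] 3
```

Cachew's scheduling question is essentially when this caching pays off: materializing a preprocessed dataset trades storage and write bandwidth against repeated CPU work across epochs and jobs.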
Bio: Ana Klimovic is an Assistant Professor in the Systems Group of the Computer Science Department at ETH Zurich. Her research interests span operating systems, computer architecture, and their intersection with machine learning. Ana's work focuses on computer system design for large-scale applications such as cloud computing services, data analytics, and machine learning. Before joining ETH in August 2020, Ana was a Research Scientist at Google Brain and completed her Ph.D. in Electrical Engineering at Stanford University.
Abstract: We present a decade of evolution and production experience with Jupiter datacenter network fabrics. In this period Jupiter has delivered 5x higher speed and capacity, a 30% reduction in capex, a 41% reduction in power, and incremental deployment and technology refresh, all while serving live production traffic. A key enabler for these improvements is evolving Jupiter from a Clos to a direct-connect topology among the machine aggregation blocks. Critical architectural changes for this include: a datacenter interconnection layer employing Micro-Electro-Mechanical Systems (MEMS) based Optical Circuit Switches (OCSes) to enable dynamic topology reconfiguration, centralized Software-Defined Networking (SDN) control for traffic engineering, and automated network operations for incremental capacity delivery and topology engineering. We show that the combination of traffic and topology engineering on direct-connect fabrics achieves similar throughput as Clos fabrics for our production traffic patterns. We also optimize for path lengths: 60% of the traffic takes a direct path from source to destination aggregation blocks, while the remainder transits one additional block, achieving an average block-level path length of 1.4 in our fleet today. OCS also achieves 3x faster fabric reconfiguration compared to pre-evolution Clos fabrics that used a patch panel based interconnect.
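The average path length reported above can be sanity-checked with a one-line calculation (a back-of-the-envelope sketch of the quoted figures, not Google's actual measurement methodology):

```python
# Traffic split quoted in the abstract: 60% of traffic takes a direct
# (one-block-hop) path between aggregation blocks; the remaining 40%
# transits one additional block (two hops).
direct_fraction = 0.60
transit_fraction = 1.0 - direct_fraction

avg_path_length = direct_fraction * 1 + transit_fraction * 2
print(avg_path_length)  # 1.4, matching the fleet-wide average in the abstract
```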
Bio: Omid Mashayekhi is a senior software engineer at Google, developing distributed software systems for data-center and WAN networks as a member of the NetInfra team. Before that, he was a graduate student at Stanford University, where he received a Ph.D. in Electrical Engineering, with a Ph.D. minor in Computer Science, in 2017. At Stanford, Omid worked with Professor Philip Levis as a member of the Stanford Information Networks Group (SING). His interests include cloud computing, distributed systems, and networking systems.
Abstract: Although TCP is a tremendously successful transport protocol that has survived 40 years of dramatic technology changes, virtually every aspect of its design is wrong for the datacenter. This talk will discuss the problems with TCP, ranging from its use of connections and streams to its mechanisms for reliable delivery and congestion control. If we are to make significant headway against the datacenter tax we must move most datacenter traffic to a new and fundamentally different protocol. I will then discuss how the Homa transport protocol solves all of the problems of TCP and suggest a migration strategy for bringing a TCP alternative into widespread usage. Finally, I will discuss my experiences implementing Homa in the Linux kernel. I will argue that it no longer makes sense to implement network transport protocols in software. Instead, transport protocols must move to NIC hardware. This will likely require the development of a new architecture for NICs.
Bio: John Ousterhout is the Bosack Lerner Professor of Computer Science at Stanford University. His current research focuses on new software stack layers to allow datacenter applications to take advantage of communication and storage technologies with microsecond-scale latencies. Ousterhout's prior positions include 14 years in industry, where he founded two companies (Scriptics and Electric Cloud), preceded by 14 years as Professor of Computer Science at U.C. Berkeley. He is author of the book A Philosophy of Software Design, co-creator of the Raft consensus protocol, and creator of the Tcl scripting language and the Tk toolkit. Ousterhout received a BS degree in Physics from Yale University and a PhD in Computer Science from Carnegie Mellon University. He is a member of the National Academy of Engineering and has received numerous awards, including the ACM Software System Award, the ACM Grace Murray Hopper Award, the National Science Foundation Presidential Young Investigator Award, and the U.C. Berkeley Distinguished Teaching Award.
Abstract: Modern computing systems must meet multiple---often conflicting---goals; e.g., high-performance and low energy consumption. The current state-of-practice involves ad hoc, heuristic solutions to such system management problems that offer no formally verifiable behavior and must be rewritten or redesigned wholesale as new computing platforms and constraints evolve. In this talk, I will discuss my research on building self-aware computing systems that combine machine learning and control theory to handle system goals and constraints in a fundamental way, starting with rigorous mathematical models and ending with real software and hardware implementations that have formally analyzable behavior and can be re-purposed to address new problems as they emerge. These self-aware systems are distinguished by awareness of user goals and operating environment; they continuously monitor themselves and adapt their behavior and foundational models to ensure the goals are met despite the challenges of complexity (diverse hardware resources to be managed) and dynamics (unpredictable changes in input workload or resource availability). In this talk, I will describe how to build self-aware systems through a combination of control theoretic and machine learning techniques. I will then show how to apply these techniques to systems based on machine learning, including both scientific and low-power sensing systems.
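As a rough illustration of the control-theoretic half of this approach (a toy model with invented dynamics and names, not one of the actual controllers from this line of work), a simple integral feedback loop can drive a monitored metric to its goal despite an unknown workload:

```python
# Toy self-adaptive loop: the system repeatedly measures latency and
# nudges a "speed" knob (e.g., allocated resources) until the measured
# latency matches the user's goal. The plant model (latency = workload /
# speed) and the gain are illustrative assumptions.
def run_controller(goal, workload, steps=50, gain=0.5):
    speed = 1.0                       # initial resource setting
    for _ in range(steps):
        latency = workload / speed    # monitor: toy plant model
        error = latency - goal        # compare against the user goal
        speed += gain * error         # integral action on the knob
    return workload / speed           # latency actually achieved

print(run_controller(goal=2.0, workload=10.0))  # converges to the goal, 2.0
```

Unlike a hand-tuned heuristic, a controller like this comes with analyzable convergence behavior, which is the point the abstract makes about formally verifiable system management.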
Bio: Henry Hoffmann is an Associate Professor in the Department of Computer Science at the University of Chicago. He received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2019. He was granted early tenure in 2018. He is a member of the ASPLOS Hall of Fame. He has a Test of Time Honorable Mention from FSE 2021 for his work on Loop Perforation (an early project on approximate computing). He received a DOE Early Career Award in 2015. At Chicago he leads the Self-aware Computing group (or SEEC project) and conducts research on adaptive techniques for power, energy, accuracy, and performance management in computing systems. He also founded the UChicago CS department's EDI (equity, diversity, and inclusion) committee in 2020. He completed a PhD in Electrical Engineering and Computer Science at MIT where his research on self-aware computing was named one of ten 'World Changing Ideas' by Scientific American in December 2011. As a Masters student he worked on MIT's Raw processor, one of the first manycore processors. Along with other members of the Raw team, he spent several years at Tilera Corporation, a startup which commercialized the Raw architecture and created one of the first manycores (Tilera was sold for $130M in 2014). His implementation of the BDTI Communications Benchmark (OFDM) on Tilera's 64-core TILE64 processor still has the highest certified performance of any programmable processor. In 1999, he received his BS in Mathematical Sciences with highest honors and highest distinction from UNC Chapel Hill.
Abstract: Modern complex datacenter applications exhibit unique characteristics such as extensive data and instruction footprints, complex control flow, and hard-to-predict branches that are not adequately served by existing microprocessor architectures. In particular, these workloads exceed the capabilities of microprocessor structures such as the instruction cache, BTB, branch predictor, and data caches, causing significant degradation of performance and energy efficiency. In my talk, I will provide a detailed characterization of datacenter applications, highlighting the importance of addressing frontend and backend performance issues. I will then introduce three new techniques to address these challenges, improving the branch predictor, data cache, and instruction scheduler. I will make the case for profile-guided optimizations that amortize overheads across the fleet and which have been successfully deployed at Google and Intel, serving millions of users daily.
Bio: Heiner Litz is an Assistant Professor at the University of California, Santa Cruz working in the field of Computer Architecture and Systems. His research focuses on improving the performance, cost, and efficiency of data center systems. Heiner is the recipient of the NSF CAREER award, Intel's Outstanding Researcher award, Google's Faculty Award, and his work received the 2020 IEEE MICRO Top Pick award. Before joining UCSC, Heiner Litz was a researcher at Google and a postdoctoral research fellow at Stanford University with Prof. Christos Kozyrakis and David Cheriton. Dr. Litz received his Diplom and Ph.D. from the University of Mannheim, Germany, advised by Prof. Bruening.
Abstract: With the slowing of technology scaling, computer systems must look higher in the stack for performance and energy efficiency. Data movement is usually to blame for poor scaling, but current systems give software too few options to optimize data movement. As a result, many systems have turned to specialized solutions, often requiring custom hardware, to deal with the data movement problem. This talk will argue that the hardware-software interface is the problem, and data-centric design is the solution. We will consider three data-centric systems that optimize data movement while remaining general-purpose: (i) Polymorphic cache hierarchies allow software to modify or expand the memory interface on general-purpose hardware. (ii) Energy-minimal dataflow architectures massively reduce instruction and data movement, letting software compete with ASIC efficiency. And (iii) General-purpose flash-based caches reduce cost by an order of magnitude in the datacenter behind a simple interface.
Abstract: The success of deep neural networks (DNNs) from Machine Learning (ML) has inspired domain specific architectures (DSAs) for them. ML has two phases: training, which constructs accurate models, and inference, which serves those models. Google’s first generation DSA offered 50x improvement over conventional architectures for inference in 2015. Google next built the first production DSA supercomputer for the much harder problem of training. Subsequent generations greatly improved performance of both phases. We start with ten lessons learned, such as DNNs grow rapidly; workloads quickly evolve with DNN advances; the bottleneck is memory, not floating-point units; and semiconductor technology advances unequally. The rapid growth of DNNs rightfully raised concerns about their carbon footprint. The second part of the talk identifies the “4Ms” (Model, Machine, Mechanization, Map) that, if optimized, can reduce ML training energy by up to 100x and carbon emissions up to 1000x. By improving the 4Ms, ML held steady at <15% of Google’s total energy use despite it consuming ~75% of its floating point operations. Climate change is one of our most important problems, so ML papers should include emissions explicitly to foster competition on more than just model quality. External estimates have been off 100x–100,000x, so publishing emissions also ensures accurate accounting, which helps pinpoint the biggest challenges. With continuing focus on the 4Ms, we can realize the amazing potential of ML to positively impact many fields in a sustainable way.
Abstract: Memory hierarchies are changing, both moving persistence closer to computation and pushing the semantics of memory to a more disaggregated setting. We present Twizzler, an operating system redesign for this near-future. Twizzler removes the kernel from the I/O path, provides programs with memory-style access to a global address space of data using small (64-bit), object-relative cross-object pointers, and enables simple and efficient long-term sharing of data in both space and time. Twizzler provides a clean-slate programming model for persistent and shared data, realizing the vision of Unix in a world of evolving memory hierarchies.
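A toy model may help convey the object-relative pointer idea (the encoding, class names, and 48-bit split below are invented for illustration; Twizzler's real format differs): a cross-object pointer stores an index into the source object's foreign-object table plus an offset into the target object, rather than a virtual address, so it remains meaningful across processes and reboots.

```python
FOT_SHIFT = 48  # assumed split: high bits = FOT index, low 48 bits = offset

class PObject:
    """A persistent object with a foreign-object table (FOT) and data."""
    def __init__(self):
        self.fot = []    # foreign-object table: index -> target object id
        self.data = {}   # offset -> stored value

def make_ptr(src, target_id, offset):
    # A pointer names its target indirectly, via the source object's FOT.
    if target_id not in src.fot:
        src.fot.append(target_id)
    return (src.fot.index(target_id) << FOT_SHIFT) | offset

def deref(src, ptr, objects):
    # Resolution: FOT index -> object id -> object, then offset within it.
    idx, offset = ptr >> FOT_SHIFT, ptr & ((1 << FOT_SHIFT) - 1)
    return objects[src.fot[idx]].data[offset]

a, b = PObject(), PObject()
b.data[0x10] = "hello"
p = make_ptr(a, "b", 0x10)
print(deref(a, p, {"a": a, "b": b}))  # hello
```

Because the pointer encodes no virtual address, the mapping from object ids to objects can change between runs without invalidating stored pointers, which is what enables long-term sharing "in both space and time."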
Bio: Nathan Beckmann is an assistant professor at Carnegie Mellon University in the Computer Science and (by courtesy) Electrical and Computer Engineering Departments. His research focuses on general-purpose, data-centric computer architectures and systems. His awards include a Google Research Scholar Award in 2021 and the NSF CAREER Award in 2019. He graduated from MIT advised by Daniel Sanchez, receiving the George M. Sprowls Award for an outstanding PhD dissertation in computer science in 2015.
Abstract: Modern cluster managers routinely grapple with hard combinatorial optimization problems, such as policy-based load balancing, placement, scheduling, and configuration. Implementing ad-hoc heuristics to solve these problems is notoriously hard to do, making it challenging to evolve the system over time and add new features. In this talk, I will present Declarative Cluster Managers (DCM), a general approach for building cluster managers that makes them performant and easily extensible. With DCM, developers specify the cluster manager's behavior using a high-level declarative language like SQL and let a compiler take care of generating an efficient implementation. I will show how DCM significantly lowers the barrier to building scalable and extensible cluster manager components, in the context of some real-world systems like Kubernetes.
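A toy sketch may clarify the declarative idea (illustrative only: DCM actually uses SQL specifications compiled into an efficient constraint program, and all names and numbers below are invented). The policy is stated as a constraint over candidate assignments, and a generic search procedure, rather than a hand-written heuristic, finds a satisfying placement:

```python
from itertools import product

nodes = {"n1": 4, "n2": 4}          # node -> CPU capacity (hypothetical)
pods = {"p1": 3, "p2": 2, "p3": 2}  # pod -> CPU demand (hypothetical)

def satisfies(assignment):
    """Declarative policy: per-node demand must not exceed capacity."""
    load = {n: 0 for n in nodes}
    for pod, node in assignment.items():
        load[node] += pods[pod]
    return all(load[n] <= nodes[n] for n in nodes)

def place():
    # Generic search stands in for DCM's compiler-generated solver:
    # the policy above is the only placement-specific code.
    for choice in product(nodes, repeat=len(pods)):
        assignment = dict(zip(pods, choice))
        if satisfies(assignment):
            return assignment
    return None

print(place())
```

Extending the cluster manager then means adding a constraint, not rewriting a heuristic; the solver picks up the new policy automatically.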
Abstract: Incremental view maintenance has long been a central problem in database theory. Many solutions have been proposed for restricted classes of database languages, such as the relational algebra, or Datalog. These techniques do not naturally generalize to richer languages. In this work we give a general solution to this problem in 3 steps: (1) we describe a simple but expressive language called DBSP for describing computations over data streams; (2) we give a general algorithm for solving the incremental view maintenance problem for arbitrary DBSP programs, and (3) we show how to model many rich database query languages (including the full relational queries, grouping and aggregation, monotonic and non-monotonic recursion, and streaming aggregation) using DBSP. As a consequence, we obtain efficient incremental view maintenance techniques for all these rich languages. Joint work with Leonid Ryzhyk (VMware Research), Frank McSherry (Materialize Inc.), and Val Tannen (University of Pennsylvania).
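A minimal sketch in the spirit of incremental view maintenance (a drastic simplification; DBSP's stream operators and Z-sets are far more general, and this class is invented for illustration): the maintained view consumes weighted deltas instead of recomputing from the base table.

```python
from collections import Counter

class IncrementalCount:
    """Maintains SELECT key, COUNT(*) ... GROUP BY key under deltas.

    A delta is a set of (key, weight) changes, as in Z-sets:
    weight +1 represents an insertion, -1 a deletion.
    """
    def __init__(self):
        self.view = Counter()

    def apply_delta(self, delta):
        # Work is proportional to the size of the change, not the table.
        for key, w in delta:
            self.view[key] += w
            if self.view[key] == 0:
                del self.view[key]
        return dict(self.view)

v = IncrementalCount()
v.apply_delta([("a", +1), ("a", +1), ("b", +1)])
print(v.apply_delta([("a", -1)]))  # {'a': 1, 'b': 1}
```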
Bio: Mihai Budiu is a senior researcher in the VMware Research Group (VRG). He has a Ph.D. in CS from Carnegie Mellon University. He was previously employed at Microsoft Research and Barefoot Networks. Mihai has worked on reconfigurable hardware, computer architecture, compilers, security, distributed systems, big data platforms, large-scale machine learning, programmable networks and P4, data visualization, and databases; four of his papers have received “test of time” awards.
Abstract: Analytics on video recorded by cameras in public areas have the potential to fuel many exciting applications, but also pose the risk of intruding on individuals' privacy. Unfortunately, existing solutions fail to practically resolve this tension between utility and privacy, relying on perfect detection of all private information in each video frame--an elusive requirement. This paper presents: (1) a new notion of differential privacy (DP) for video analytics, (p,K,e)-event-duration privacy, which protects all private information visible for less than a particular duration, rather than relying on perfect detections of that information, and (2) a practical system called Privid that enforces duration-based privacy even with the (untrusted) analyst-provided deep neural networks that are commonplace for video analytics today. Across a variety of videos and queries, we show that Privid achieves accuracies within 79-99% of a non-private system.
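A hedged sketch of the duration-based intuition (a toy model with invented parameters, not Privid's actual mechanism): if any individual is visible for at most a bounded duration and the query aggregates over fixed-length video segments, one individual can change the answer by at most duration / segment_length units, so Laplace noise calibrated to that sensitivity hides whether they appeared at all.

```python
import math
import random

def privatize(true_answer, max_duration, segment_len, epsilon, rng):
    # One person visible <= max_duration seconds can affect at most this
    # many segment-level contributions (the sensitivity of the query).
    sensitivity = max_duration / segment_len
    # Sample Laplace(0, sensitivity / epsilon) via the inverse CDF.
    u = rng.random() - 0.5
    noise = -(sensitivity / epsilon) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_answer + noise

# E.g., a count of 100 over 10 s segments, visibility bound of 60 s:
noisy = privatize(100.0, 60.0, 10.0, 1.0, random.Random(0))
print(noisy)
```

The key departure from frame-level approaches is that no detector needs to find the individual; the duration bound alone determines how much noise suffices.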
Abstract: We present data-parallel actors (DPA), a programming model for building distributed query serving systems. Query serving systems are an important class of applications characterized by low-latency data-parallel queries and frequent bulk data updates; they include data analytics systems like Apache Druid, full-text search engines like ElasticSearch, and time series databases like InfluxDB. They are challenging to build because they run at scale and need complex distributed functionality like data replication, fault tolerance, and update consistency. DPA makes building these systems easier by allowing developers to construct them from purely single-node components while automatically providing these critical properties. In DPA, we view a query serving system as a collection of stateful actors, each encapsulating a partition of data. DPA provides parallel operators that enable consistent, atomic, and fault-tolerant parallel updates and queries over data stored in actors. We have used DPA to build a new query serving system, a simplified data warehouse based on the single-node database MonetDB, and enhance existing ones, such as Druid, Solr, and MongoDB, adding missing user-requested features such as load balancing and elasticity. We show that DPA can distribute a system in <1K lines of code (>10× less than typical implementations in current systems) while achieving state-of-the-art performance and adding rich functionality.
Abstract: My current research interest is in systems support and resource management for distributed machine learning frameworks and applications. Specifically, I am currently working on distributed systems and resource management algorithms for soft-real time Machine Learning inference. This builds on an award-winning (Best Student Paper, EuroSys'16) body of R&D at Carnegie Mellon on modeling, designing, and developing abstractions, primitives, algorithms and systems for a general resource management framework for static and dynamic heterogeneity, hard and soft placement constraints, time-varying resource capacity guarantees, and combinatorial constraints in heterogeneous resource contexts. Earlier motivating work published in ACM SoCC'12 recently led to a Test of Time award at SoCC'21. I focus on developing algorithms and building systems that support an array of practical aspects of Machine Learning.
Bio: I have been a tenure-track Assistant Professor in the School of Computer Science at Georgia Tech since August 2019. I completed my postdoc at the University of California, Berkeley, working with Ion Stoica and collaborating closely with Joseph Gonzalez. I completed my Ph.D. at Carnegie Mellon University, advised by Gregory Ganger. At Carnegie Mellon, I was honored with the prestigious NSERC Alexander Graham Bell Canada Graduate Scholarship (NSERC CGS-D3) and partially funded by the Intel Science and Technology Centre for Cloud Computing (ISTC-CC) and the Parallel Data Lab (PDL) industry consortium.
Abstract: Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often use multi-dimensional indexes or specialized data layouts (e.g., Z-order). However, these schemes are hard to tune and their performance is inconsistent. In this talk, I will present Flood and Tsunami, two learned multi-dimensional read-optimized indexes for in-memory analytic workloads. By automatically co-optimizing the index structure and data layout for a particular dataset and workload, Flood and Tsunami achieve up to 10X faster query performance and 100X smaller index size than optimally-tuned traditional indexes. I will conclude by giving a brief overview of ongoing work on incorporating instance-optimized data layouts into SageDB, a learned database system.
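For readers new to learned indexes, a one-dimensional toy version conveys the core idea (a sketch only; Flood and Tsunami co-optimize multi-dimensional layouts, which this does not capture, and the data here is synthetic): fit a model mapping key to position in the sorted array, then correct the model's prediction with a bounded local search.

```python
import bisect

keys = sorted(range(0, 1000, 7))                 # synthetic sorted keys
slope = (len(keys) - 1) / (keys[-1] - keys[0])   # "train" a linear model

def lookup(k, max_err=8):
    pos = round((k - keys[0]) * slope)           # model predicts a position
    lo = max(0, pos - max_err)                   # bounded correction window
    hi = min(len(keys), pos + max_err + 1)
    i = bisect.bisect_left(keys, k, lo, hi)
    return i if i < len(keys) and keys[i] == k else None

print(lookup(21))  # found: returns the key's position in the array
print(lookup(5))   # absent key: returns None
```

The model replaces most of the tree traversal of a B-tree, which is where the speed and size wins come from when the model fits the data well.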
Abstract: Real-world multi-agent systems have to safely operate in environments where unsafe decisions have real consequences. One such environment of interest is network congestion control, in which selfish agents with conflicting goals can cause a network to become unusable by causing some agents to receive zero throughput. Motivated by this problem, we consider safety conditions for a broad class of aggregative games that includes congestion control, the classic Cournot oligopoly model, and resource sharing in the commons. We use comparative statics to derive simple, practically useful conditions that guarantee that all users achieve a fixed minimum utility at equilibrium. Compared to prior work on studying properties of equilibria in network congestion control, we show that our approach allows for a more realistic class of utilities and addresses real concerns regarding safe deployment of modern network congestion control protocols. Joint work with Matei Zaharia and Tatsu Hashimoto.
Abstract: Serialization frameworks are a fundamental component of scale-out systems, but introduce significant compute overheads. However, they are amenable to acceleration with specialized hardware. To understand the trade-offs involved in architecting such an accelerator, we present the first in-depth study of serialization framework usage at scale by profiling Protocol Buffers (“protobuf”) usage across Google’s datacenter fleet. We use this data to build HyperProtoBench, an open-source benchmark representative of key serialization-framework user services at scale. In doing so, we identify key insights that challenge prevailing assumptions about serialization framework usage. We use these insights to develop a novel hardware accelerator for protobufs, implemented in RTL and integrated into a RISC-V SoC. Applications can easily harness the accelerator, as it integrates with a modified version of the open-source protobuf library and is wire-compatible with standard protobufs. We have fully open-sourced our RTL, which, to the best of our knowledge, is the only such implementation currently available to the community. We also present a first-of-its-kind, end-to-end evaluation of our entire RTL-based system running hyperscale-derived benchmarks and microbenchmarks. We boot Linux on the system using FireSim to run these benchmarks and implement the design in a commercial 22nm FinFET process to obtain area and frequency metrics. We demonstrate an average 6.2x to 11.2x performance improvement vs. our baseline RISC-V SoC with BOOM OoO cores and, despite the RISC-V SoC’s weaker uncore/supporting components, an average 3.8x improvement vs. a Xeon-based server.
Bio: Sagar Karandikar is a Ph.D. student at UC Berkeley and a Student Researcher at Google. His research focuses on hardware/software co-design in warehouse-scale machines. His work on FireSim, which enables high-performance scale-out FPGA-accelerated full-system simulation, was selected as an IEEE Micro Top Pick, nominated for CACM Research Highlights, and has been used in published work from over 20 companies and academic institutions and in the development of commercially-available silicon. His work on hardware acceleration for protocol buffers received an IEEE Micro Top Picks Honorable Mention and the Distinguished Artifact Award at MICRO-54.
Abstract: Transport protocols can be implemented in NIC (Network Interface Card) hardware to increase throughput, reduce latency and free up CPU cycles. If the ideal transport protocol were known, the optimal implementation would be simple: bake it into fixed-function hardware. But transport layer protocols are still evolving, with innovative new algorithms proposed every year. A recent study proposed Tonic, a Verilog-programmable transport layer in hardware. We build on this work to propose a new programmable hardware transport layer architecture, called nanoTransport, optimized for the extremely low-latency message-based RPCs (Remote Procedure Calls) that dominate large, modern distributed data center applications. NanoTransport is programmed using the P4 language, making it easy to modify existing (or create entirely new) transport protocols in hardware. We identify common events and primitive operations, allowing for a streamlined, modular, programmable pipeline, including packetization, reassembly, timeouts and packet generation, all to be expressed by the programmer. We evaluate our nanoTransport prototype by programming it to run the reliable message-based transport protocols NDP and Homa, as well as a hybrid variant. Our FPGA prototype, implemented in Chisel and running on the FireSim simulator, exposes P4-programmable pipelines and is designed to run in an ASIC at 200Gb/s with each packet processed end-to-end in less than 10ns (including message reassembly).
Abstract: The goal of my research is to build efficient large-scale parallel systems that can accelerate big data applications by 100--1000x without incurring a hefty cost. My work is motivated by two growing trends: big data and cloud computing. Today, many businesses and web services store staggering quantities of data in the cloud and lease relatively small clusters of instances to run analytics queries, train machine learning models, and more. However, the exponential data growth, combined with the slowdown of Moore's law, makes it challenging (if not impossible) to run such big data processing tasks in real-time. Most applications run big data workloads on timescales of several minutes or hours and resort to complex, application-specific optimizations to reduce the amount of data processing required for interactive queries. This design pattern hurts developer productivity and restricts the scope of applications that can use big data. My research aims to enable interactive, cost-effective big data processing through flash bursts. Flash bursts enable an application to use a large portion of a shared cluster for short periods of time. This could allow big data applications to complete significantly faster, with a cost comparable to leasing a few instances for a longer period of time. The main challenge to flash bursts is Amdahl's law. Flash bursts launch small tasks per node that run for milliseconds at a time. With such small task granularity, previously negligible overheads (e.g., synchronization, I/O processing, cache misses, etc.) can turn into significant bottlenecks, and increasing communication overhead for coordination across many nodes can limit scaling. My research takes a two-pronged approach to tackle this challenge: building new distributed system infrastructure for flash bursts, and restructuring important applications (e.g., data analytics, DNN training) to use flash bursts efficiently.
In this talk, I will focus on how I restructured distributed sorting (MilliSort) and how I removed the overheads of replication (CURP).
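A back-of-the-envelope calculation (all numbers hypothetical) illustrates why Amdahl's law bites at flash-burst granularity: a fixed per-task overhead that is negligible for coarse-grained tasks comes to dominate as tasks shrink to milliseconds.

```python
def burst_speedup(total_work_s, nodes, per_task_overhead_s):
    """Speedup of a flash burst over serial execution, with a fixed
    non-parallelizable overhead paid by each node's task."""
    serial_time = total_work_s
    burst_time = total_work_s / nodes + per_task_overhead_s
    return serial_time / burst_time

# 100 s of total work, 1 ms of fixed overhead per task (sync, I/O setup, ...):
print(burst_speedup(100.0, 100, 0.001))     # ~99.9x: close to the ideal 100x
print(burst_speedup(100.0, 10_000, 0.001))  # ~9,091x: well below the ideal 10,000x
```

At 10,000 nodes each task runs for only 10 ms, so the 1 ms overhead is already 10% of the task; this is the regime MilliSort and CURP are designed for.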
Bio: Seo Jin Park is a PhD alumnus, previously advised by John Ousterhout. He is currently a postdoc at MIT CSAIL with Mohammad Alizadeh. His research focuses on making cluster-scale parallel systems efficient so that big data applications (e.g., DNN training, analytics) can run 100--1000x faster on the public cloud. Other areas of research are improving blockchain throughput under network bandwidth variability, removing overheads of consistent replication, and building tools for debugging tail latencies.
Abstract: The DMA part of RDMA stands for Direct Memory Access. It refers to the ability of a network card (among other devices) to read and write data from a host’s memory without CPU assistance. RDMA’s performance depends on efficient DMAs in the initiating and target hosts. In turn, a DMA’s cost is almost always proportional to the length of the data transfer. The exception is small DMAs, which suffer from high overheads. In this talk, we show that database systems often generate small DMA operations when using RDMA canonically. The reason is that the data they transmit is seldom contiguous by the time transmissions occur. Modern databases avoid this problem by copying data into large transmission buffers and issuing RDMAs over these buffers instead. However, doing this requires a substantial amount of CPU cycles and memory bandwidth, forfeiting one of RDMA’s key benefits: its zero-copy feature. To solve this issue, we introduce D-RDMA, a declarative extension to RDMA. The approach leverages a smart NIC to group data fragments into larger DMAs and produce the same packet stream as regular RDMA. Our experiments show that the network throughput can increase from 18 Gbps per CPU core to up to 98 Gbps (on a 100 Gbps card) with virtually zero CPU usage in a typical data shuffle scenario. We believe that D-RDMA can enable a new generation of high-performance systems to take full advantage of fast networking without incurring the usual CPU penalties.
Bio: André Ryser is a Research Assistant at the eXascale Infolab at the University of Fribourg in Switzerland, where he investigates heterogeneous platforms for data-intensive applications. He has participated in creating software-hardware co-designed infrastructure to push application logic to programmable switches, NVMe SSDs, and, more recently, to NICs. André is a long-time Swiss Olympiad in Informatics participant, having won a gold medal and later built and prepared teams.
Abstract: This work aims to develop a lightweight data center execution environment exploiting application semantics to optimize tail performance for cloud services. This system, dubbed Service Boosters, is a library operating system exposing the application structure and semantics to the underlying resource management stack. Using Service Boosters, programmers can declare and annotate the structure of their request processing pipeline, while performance engineers can program advanced management strategies for the application. I present three components of Service Boosters, FineLame, Perséphone, and DeDoS, that exploit application awareness to provide real time anomaly detection; tail-tolerant RPC scheduling; and resource harvesting. FineLame leverages awareness of the request processing pipeline to deploy monitoring and anomaly detection probes on critical resource consumption functions. Using these probes, FineLame can detect abnormal requests in-flight as soon as they depart from the expected workload behavior, allowing operators to prevent a surge in tail latency. Perséphone exploits an understanding of request types to dynamically allocate resources to each type of request and forbid pathological head-of-line blocking from heavy-tailed workloads, without the need for interrupts. Perséphone is well suited for kernel-bypass, microsecond scale applications requiring low overhead scheduling. Finally, DeDoS can identify overloaded components and dynamically scale them on the cluster, harvesting only the resources needed to quench the overload.
Abstract: Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers is often intractable for large problem sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. We observe, however, that many allocation problems are granular: they consist of a large number of clients and resources, each client requests a small fraction of the total number of resources, and clients can interchangeably use different resources. For these problems, we propose an alternative approach that reuses the original optimization problem formulation and leads to better allocations than domain-specific heuristics. Our technique, Partitioned Optimization Problems (POP), randomly splits the problem into smaller problems (with a subset of the clients and resources in the system) and coalesces the resulting sub-allocations into a global allocation for all clients. We provide theoretical and empirical evidence as to why random partitioning works well. In our experiments, POP achieves allocations within 1.5% of the optimal with orders-of-magnitude improvements in runtime compared to existing systems for cluster scheduling, traffic engineering, and load balancing.
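The partition-and-coalesce step can be sketched in a few lines (a toy load-balancing instance with invented data; POP reuses the original optimization formulation per sub-problem, whereas the greedy routine below merely stands in for a real solver):

```python
import random

def solve(clients, resources):
    """Stand-in solver: assign each client to the least-loaded resource."""
    load = {r: 0.0 for r in resources}
    alloc = {}
    for c, demand in sorted(clients.items(), key=lambda kv: -kv[1]):
        r = min(load, key=load.get)
        alloc[c] = r
        load[r] += demand
    return alloc

def pop(clients, resources, k, seed=0):
    """POP-style partitioning: randomly split clients and resources into
    k disjoint sub-problems, solve each, and union the sub-allocations."""
    rng = random.Random(seed)
    cs, rs = list(clients), list(resources)
    rng.shuffle(cs)
    rng.shuffle(rs)
    alloc = {}
    for i in range(k):
        sub_clients = {c: clients[c] for c in cs[i::k]}
        alloc.update(solve(sub_clients, rs[i::k]))
    return alloc

clients = {f"c{i}": 1.0 for i in range(8)}   # hypothetical unit demands
resources = [f"r{j}" for j in range(4)]
print(pop(clients, resources, k=2))
```

Because each sub-problem has 1/k of the clients and resources, solver runtime drops dramatically, and granularity (small, interchangeable demands) is what keeps the coalesced allocation close to optimal.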
Abstract: Datacenter systems and I/O devices now run at single-digit microsecond latencies, requiring nanosecond-scale operating systems. Traditional kernel-based operating systems impose an unaffordable overhead, so recent kernel-bypass OSes (e.g., Arrakis, IX) and libraries (e.g., Caladan, eRPC) eliminate the OS kernel from the I/O datapath. However, none of these systems offers a general-purpose datapath OS replacement that meets the needs of microsecond-scale systems. This talk proposes Demikernel, a flexible datapath OS and architecture designed for heterogeneous kernel-bypass devices and microsecond-scale datacenter systems. We have built two prototype Demikernel datapath OSes and show that minimal effort is needed to port existing microsecond-scale systems. Once ported, Demikernel lets applications run across heterogeneous kernel-bypass devices with nanosecond-scale overheads and no code changes.
Bio: Irene Zhang is a Principal Researcher at Microsoft Research. Her work focuses on datacenter operating systems and distributed systems, especially making new datacenter hardware technologies more widely usable by highly-demanding datacenter applications. Irene completed her PhD in 2017 at the University of Washington, where her PhD thesis focused on distributed systems that span mobile devices and cloud servers. Her thesis work received the ACM SIGOPS Dennis Ritchie doctoral dissertation award and the UW Allen School William Chan Memorial dissertation award. Before her PhD, Irene was a member of the virtual machine monitor group at VMware, where she worked on memory resource management and virtual machine checkpointing.
Abstract: Organizations are shifting to a new paradigm called the Lakehouse, which promises the benefits of structured data warehouses on top of unstructured data lakes. This presents new challenges for query execution engines. The execution engine needs to provide good performance over raw uncurated datasets that are ubiquitous in data lakes, and state-of-the-art performance over structured tables stored in open file formats to provide the main benefits of SQL warehouses. Toward these goals, we present Photon, a new native vectorized query engine for the Lakehouse that underlies Databricks Runtime, the execution framework underlying all Databricks workloads. Photon can outperform other cloud data warehouses' specialized engines, but is implemented on top of a more general execution framework that operates over open file formats such as Apache Parquet using modern database techniques such as interpreted vectorization. We discuss the design choices that underlie Photon (e.g., vectorization vs. code generation) and describe its integration with our existing runtime, its task model, and its memory manager. Photon exhibits a 4x improvement end-to-end on the TPC-DS benchmark compared to the previous code-generated engine and has improved customer queries in the field by up to 11x.
Abstract: When developing a new networking algorithm, it is established practice to run a randomized experiment, or A/B test, to evaluate its performance. In an A/B test, traffic is randomly allocated between a treatment group, which uses the new algorithm, and a control group, which uses the existing algorithm. However, because networks are congested, both treatment and control traffic compete against each other for resources in a way that biases the outcome of these tests. This bias can have a surprisingly large effect; for example, in lab A/B tests with two widely used congestion control algorithms, the treatment appeared to deliver 150% higher throughput when used by a few flows, and 75% lower throughput when used by most flows, despite the fact that the two algorithms have identical throughput when used by all traffic. Beyond the lab, we show that A/B tests can also be biased at scale. In an experiment run in cooperation with Netflix, estimates from A/B tests mistake the direction of change of some metrics, miss changes in other metrics, and overestimate the size of effects. We propose alternative experiment designs, previously used in online platforms, to more accurately evaluate new algorithms and allow experimenters to better understand the impact of congestion on their tests.
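The interference effect above can be reproduced with a toy model. In this sketch, each flow's share of a shared bottleneck is proportional to an invented "aggressiveness" weight; the numbers are illustrative, not the talk's measurements. The point is only that an A/B estimate can disagree with full deployment.

```python
# Toy model of A/B test interference on a shared bottleneck link.
def throughputs(weights, capacity=100.0):
    """Each flow's throughput is proportional to its aggressiveness weight."""
    total = sum(weights)
    return [capacity * w / total for w in weights]

def ab_estimate(w_treat, w_ctrl, n=10, treated=1):
    """Run an A/B test: `treated` flows use the new algorithm, the rest the
    old one, all competing on the same link. Return treatment/control ratio."""
    tputs = throughputs([w_treat] * treated + [w_ctrl] * (n - treated))
    t_avg = sum(tputs[:treated]) / treated
    c_avg = sum(tputs[treated:]) / (n - treated)
    return t_avg / c_avg

ratio = ab_estimate(w_treat=2.0, w_ctrl=1.0)  # A/B test: treatment looks 2x better
# Deployed everywhere, both algorithms deliver identical per-flow throughput:
full = throughputs([2.0] * 10)[0] / throughputs([1.0] * 10)[0]
```

Here the A/B test reports a 2x throughput win for the aggressive algorithm even though full deployment of either algorithm yields identical per-flow throughput, mirroring the bias the abstract describes.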
Abstract: We propose Clamor, a functional cluster computing framework that adds support for fine-grained, transparent access to global variables for distributed, data-parallel tasks. Clamor targets workloads that perform sparse accesses and updates within the bulk synchronous parallel execution model, a setting where the standard technique of broadcasting global variables is highly inefficient. Clamor implements a novel dynamic replication mechanism in order to enable efficient access to popular data regions on the fly, and tracks fine-grained dependencies in order to retain the lineage-based fault tolerance model of systems like Spark. Clamor can integrate with existing Rust and C++ libraries to transparently distribute programs on the cluster. We show that Clamor is competitive with Spark in simple functional workloads and can improve performance significantly compared to custom systems on workloads that sparsely access large global variables: from 5× for sparse logistic regression to over 100× on distributed geospatial queries.
Abstract: We introduce Transaction Triaging, a set of techniques that manipulate streams of transaction requests and responses while they travel to and from a database server. Compared to normal transaction streams, the triaged ones execute faster once they reach the database. The triaging algorithms do not interfere with the transaction execution nor require adherence to any particular concurrency control method, making them easy to port across database systems. Transaction Triaging leverages recent programmable networking hardware that can perform computations on in-flight data. We evaluate our techniques on an in-memory database system using an actual programmable hardware network switch. Our experimental results show that triaging brings enough performance gains to compensate for almost all networking overheads. In high-overhead network stacks such as UDP/IP, we see throughput improvements from 2.05× to 7.95×. In an RDMA stack, the gains range from 1.08× to 1.90× without introducing significant latency.
Abstract: Suboptimal scheduling decisions in operating systems, networking stacks, and application runtimes are often responsible for poor application performance, including higher latency and lower throughput. These poor decisions stem from a lack of insight into the applications and requests the scheduler is handling and a lack of coherence and coordination between the various layers of the stack, including NICs, kernels, and applications. We propose Syrup, a framework for user-defined scheduling. Syrup enables untrusted application developers to express application-specific scheduling policies across these system layers without being burdened with the low-level system mechanisms that implement them. Application developers write a scheduling policy with Syrup as a set of matching functions between inputs (threads, network packets, network connections) and executors (cores, network sockets, NIC queues) and then deploy it across system layers without modifying their code. Syrup supports multi-tenancy as multiple co-located applications can each safely and securely specify a custom policy. We present several examples of uses of Syrup to define application and workload-specific scheduling policies in a few lines of code, deploy them across the stack, and improve performance by up to 8x compared with default policies.
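A policy expressed as a matching function from inputs to executors might look like the following sketch. This is a hypothetical illustration of the idea, not Syrup's actual API; the request fields, core sets, and routing rule are all invented.

```python
# Hypothetical Syrup-style policy: a matching function from inputs
# (requests, here plain dicts) to executors (core ids).
SHORT_CORES = [0, 1]  # cores reserved for short requests
LONG_CORES = [2, 3]   # cores for long-running requests

def match(request):
    """Route short requests to dedicated cores so they never queue behind
    long requests (avoiding head-of-line blocking), and spread load within
    each core set by hashing the connection id."""
    cores = SHORT_CORES if request["kind"] == "short" else LONG_CORES
    return cores[hash(request["conn"]) % len(cores)]

core = match({"kind": "short", "conn": 7})
```

The framework, rather than the developer, would then be responsible for installing such a function at the right layer (NIC queue steering, kernel thread placement, or application dispatch).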
Abstract: Reconfigurable analog devices are a powerful new computing substrate especially appropriate for executing dynamical systems in an energy efficient manner. These devices leverage the physical behavior of transistors to directly implement computation. Under this paradigm, voltages and currents within the device implement continuously evolving variables in the computation. In this talk, I discuss compilation techniques for automatically configuring such devices to execute dynamical systems. I present Legno, the first compilation system that automatically targets a real reconfigurable analog device of this class. Legno synthesizes analog circuits from parametric and specialized analog blocks and accounts for analog noise, quantization error, operating range limitations, and manufacturing variations within the device. I evaluate Legno on applications from the biology, physics, and controls domains. The results demonstrate that these applications execute with acceptable error while consuming microjoules of energy.
Bio: Sara Achour is an Assistant Professor jointly appointed to both the Computer Science Department and the Electrical Engineering Department at Stanford University. Her research focuses on new techniques and tools, specifically new programming languages, compilers, and runtime systems, that enable end-users to easily develop computations that exploit the potential of emergent nontraditional computing platforms.
Abstract: Internet-connected IoT devices pose a significant threat to user privacy. Compromised or malicious vendors have access to microphones, cameras, door locks, and other highly sensitive data sources in people’s homes. Though a few IoT apps inherently depend on the cloud for consuming content and sharing analytics, much of the rationale for cloud-based control is to utilize cloud hardware resources. However, many users already own the computation, storage, and connectivity necessary to support today’s IoT applications. The missing piece is a framework to make these resources available to IoT devices. This talk presents Karl, an architecture for a home cloud that executes as much functionality as possible on user-owned hardware. Karl allows IoT devices to offload computation and storage to the user’s own computers using a module programming model inspired by serverless computing. Karl also mediates access to the Internet through pipeline policies, easily visualized and mapped to English-language privacy guarantees. Pipeline policies are compiled down to a simplified form of mandatory access control based on tags, which justify the transfer of data between and out of sandboxed modules. We prototype Karl and implement 3 sensors, 9 modules, and 9 pipeline policies using it. We show that Karl can easily express sophisticated privacy policies and is practical, with latencies below 2ms for interactive applications.
Abstract: In this talk I will discuss the back-story behind the Raft consensus algorithm: why we decided to undertake this project, how the algorithm developed, and the challenges of publishing an idea that 'gores a sacred cow'. I will also make several observations about how to perform research, how program committees work, and the relationship between Paxos and Raft.
Abstract: As an increasingly important workload, machine learning (ML) applications require different performance optimization techniques from traditional runtimes and compilers. In particular, to accelerate ML applications, it is generally necessary to perform ML computations on heterogeneous hardware and parallelize computations using multiple data dimensions, neither of which is even expressible in traditional compilers and runtimes. In this talk, I will describe our work on automated approaches to building performant and scalable ML systems. Instead of relying on human effort to manually design and implement systems optimizations for ML workloads, our work automatically discovers ML optimizations by leveraging the statistical and mathematical properties of ML computations, such as the multi-linearity of tensor algebra. Compared to today's manually designed systems optimizations, our work significantly improves the efficiency and scalability of ML computations and provides stronger correctness guarantees, while requiring much less human effort. I will also outline future research directions for further automating ML systems, such as codesigning ML models, software systems, and hardware backends for end-to-end ML deployment.
Bio: Zhihao Jia is a research scientist at Facebook and will join CMU as an assistant professor of computer science in Fall 2021. He obtained his Ph.D. from Stanford working with Alex Aiken and Matei Zaharia. His research interests lie in the intersection of computer systems and machine learning, with a focus on building efficient, scalable, and high-performance systems for ML computations.
Abstract: While deep neural networks (DNNs) are an increasingly popular way to query large corpora of data, their significant runtime remains an active area of research. As a result, researchers have proposed systems and optimizations to reduce these costs by allowing users to trade off accuracy and speed. In this work, we examine end-to-end DNN execution in visual analytics systems on modern accelerators. Through a novel measurement study, we show that the preprocessing of data (e.g., decoding, resizing) can be the bottleneck in many visual analytics systems on modern hardware. To address the bottleneck of preprocessing, we introduce two optimizations for end-to-end visual analytics systems. First, we introduce novel methods of achieving accuracy and throughput trade-offs by using natively present, low-resolution visual data. Second, we develop a runtime engine for efficient visual DNN inference. This runtime engine a) efficiently pipelines preprocessing and DNN execution for inference, b) places preprocessing operations on the CPU or GPU in a hardware- and input-aware manner, and c) efficiently manages memory and threading for high throughput execution. We implement these optimizations in a novel system, Smol, and evaluate Smol on eight visual datasets. We show that its optimizations can achieve up to 5.9x end-to-end throughput improvements at a fixed accuracy over recent work in visual analytics.
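The pipelining idea in point (a) can be sketched with a producer-consumer pair: one thread preprocesses the next inputs while the main thread runs inference on the current batch, so the two stages overlap instead of alternating. This is a generic illustration, not Smol's implementation; the stage functions here are trivial stand-ins.

```python
import queue
import threading

def run_pipeline(items, preprocess, infer, depth=8):
    """Overlap preprocessing (producer thread) with inference (consumer).
    `depth` bounds the queue so preprocessing cannot run arbitrarily ahead."""
    q = queue.Queue(maxsize=depth)
    results = []

    def producer():
        for item in items:
            q.put(preprocess(item))  # e.g., decode + resize in a real system
        q.put(None)                  # sentinel: no more work

    t = threading.Thread(target=producer)
    t.start()
    while (x := q.get()) is not None:
        results.append(infer(x))     # e.g., a DNN forward pass
    t.join()
    return results

# Stand-in stages: "preprocess" doubles, "inference" adds one.
out = run_pipeline(range(5), preprocess=lambda i: i * 2, infer=lambda x: x + 1)
```

When the two stages take comparable time, overlapping them this way can nearly double throughput relative to running them back-to-back per item.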
Abstract: Modern computer systems can introduce correctness and/or security issues into seemingly secure programs. This is because hardware does not execute program instructions atomically. Rather, individual instructions get “cracked” into a collection of hardware events that take place on behalf of the user-facing instruction. Events corresponding to one instruction can interleave and interact with the events corresponding to another instruction in a variety of different ways during a program’s execution. Some of these interactions can translate to program-level security vulnerabilities. Evaluating the correctness/security of a program requires searching the space of all possible ways in which the program could run on a particular hardware implementation for execution scenarios that feature security violations. Fortunately, the field of automated reasoning has developed tools for conducting such an analysis, subject to the user’s ability to provide specifications of the target 1) hardware system and 2) correctness/security property. Both specifications impact the soundness, completeness, and efficiency of the final verification effort. In this talk, I will give an overview of some of our work on applying formal methods techniques to the problem of correctness/security verification of modern processor designs. In particular, I will focus on the CheckMate approach for modeling hardware systems in a way that makes them amenable to efficient formal security analysis, and I will discuss how we are addressing the specification challenge above to, for example, automatically lift formal hardware specifications for security analysis directly from RTL.
Abstract: We present the first compiler for the general class of sparse array programming languages (e.g., a sparse analogue of NumPy). A sparse array programming language supports element-wise operations, reduction, and broadcasting of arbitrary functions over both dense and sparse arrays. Such languages have great expressive power and can express sparse/dense tensor algebra, functions over images, exclusion and inclusion filters, and even graph algorithms. Our compiler generalizes prior work on sparse tensor algebra compilation, which assumes additions and multiplications only, to any function over sparse arrays. We thus generalize the notion of sparse iteration spaces beyond intersections and unions and automatically derive them from how the algebraic properties of the functions interact with the compressed-out values of the arrays. We then show for the first time how to compile these iteration spaces to efficient code. The resulting bespoke code performs 1.5–70x (geometric mean of 13.7x) better than the Pydata/Sparse Python library, which implements the alternative approach that reorganizes sparse data and calls pre-written dense functions.
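The key idea of deriving the iteration space from a function's algebraic properties can be illustrated with a toy interpreter. In this sketch (names invented, and the property check is a crude probe rather than the compiler's real algebraic analysis), a function that annihilates when either input is zero (like multiplication) needs only the intersection of nonzeros, while other functions (like addition) need the union.

```python
def apply_sparse(f, a, b):
    """Apply f elementwise to sparse vectors (dicts of index -> value),
    iterating only the derived sparse iteration space. Assumes f(0, 0) == 0;
    otherwise the result would be dense, which this sketch does not handle."""
    assert f(0, 0) == 0
    # Crude probe: if f is zero whenever either input is zero, the
    # intersection of nonzeros suffices; otherwise fall back to the union.
    annihilates = (f(0, 1) == 0 and f(1, 0) == 0)
    keys = set(a) & set(b) if annihilates else set(a) | set(b)
    out = {i: f(a.get(i, 0), b.get(i, 0)) for i in keys}
    return {i: v for i, v in out.items() if v != 0}

a, b = {0: 2, 2: 3}, {2: 4, 5: 1}
mul = apply_sparse(lambda x, y: x * y, a, b)  # intersection: only index 2
add = apply_sparse(lambda x, y: x + y, a, b)  # union: indices 0, 2, 5
```

The compiler described in the abstract makes this determination statically from the function's properties and emits specialized coiteration code, rather than interpreting over dictionaries as this sketch does.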
Abstract: State machine replication (SMR) increases the availability of an application; by replicating each request to multiple servers, the system can hide server failures from a client. However, replicating requests introduces overhead. Indeed, traditional SMR may add hundreds of microseconds in normal execution, and need hundreds of milliseconds to recover from a server failure. With the rise of microsecond applications, such replication overheads become unacceptable. Newer SMR systems reduce this overhead to several microseconds in normal execution, and recover in just tens of milliseconds. However, for the fastest applications, this may still be unsatisfactory. In this talk, I’ll present our work on a new state machine replication system called Mu, which carefully leverages remote direct memory access (RDMA) to drastically improve replication latency. Mu requires only 1.3 microseconds to replicate a request, and takes less than a millisecond to recover from failures. Thus, Mu demonstrates that replication algorithms can be fast, both in normal execution and in recovery.
Abstract: The proliferation of camera-enabled devices and large video repositories has given rise to a diverse set of video analytics applications. The video pipelines for these applications are DAGs of operations that transform videos, process extracted metadata, and answer questions such as, "Is this intersection congested?" The latency and resource efficiency of pipelines can be optimized using configurable knobs for each operation such as the sampling rate, batch size, or type of hardware used. However, determining efficient configurations is challenging because (a) the configuration search space is exponentially large, and (b) the optimal configuration depends on the desired latency target and the input video contents that may exercise different paths in the DAG and produce different volumes of intermediate results. Existing video analytics and processing systems leave it to the users to manually configure operations and select hardware resources. Hence, we observe that they often execute inefficiently and fail to meet latency and cost targets. In this talk, we present Llama, a heterogeneous and serverless framework for auto-tuning video pipelines. Llama optimizes the overall video pipeline latency by (a) dynamically calculating latency targets for each operation invocation, and (b) dynamically running a cost-based optimizer to determine efficient configurations that meet the target latency for each invocation. We show that Llama achieves reduced latency and cost compared to state-of-the-art cluster and serverless video analytics and processing systems.
Abstract: We present a reinforcement learning system for visually complex 3D environments built around a custom simulator design that processes large batches of simulated environments simultaneously. This batch simulation strategy allows GPU resources to be efficiently leveraged by amortizing memory and compute costs across multiple simulated agents, dramatically improving the number of simulated environments per GPU and overall simulation throughput. Our implementation of navigation trains agents on the Gibson dataset at 19,000 frames of experience per second on a single GPU (and up to 72,000 frames per second on a single eight-GPU machine) – more than 100x faster than prior work in the same environments. In terms of end-to-end training, policies can be trained to convergence in 1.5 days on a single GPU to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over three days. This talk will describe the architecture of our batch simulator and our strategy of end-to-end optimization throughout the entire reinforcement learning system.
Abstract: I present a machine-learning-based approach for creating hardware component knowledge bases directly from the PDF datasheets that manufacturers publish for those components. This approach reduces the amount of costly human input required to create new hardware component knowledge bases. First, I show Fonduer, a novel knowledge base construction system for richly formatted data like PDF datasheets. Fonduer provides a data model that serves as a necessary building block that enables automated information extraction from datasheets. Second, I explain how Fonduer can be used to build hardware component knowledge bases in practice. The multimodal information captured by Fonduer provides signals for training data generation as well as for augmenting deep learning models for multi-task learning. Finally, I demonstrate the utility of this approach with end-to-end applications and empirical results from real-world use cases. I implement this approach with a dataset of over 15,000 datasheets of three types of components. When extracting multiple electrical characteristics, this implementation achieves an average quality of 77 F1 points—quality that improves on existing human-curated knowledge bases by 12 F1 points. In one case where existing knowledge bases are scarce (product thumbnails of circular connectors) this implementation improves on the F1 score by 12x.
Abstract: Recent work has established that better scheduling can drastically improve the throughput, tail latency, scalability, and security of important workloads. However, kernel schedulers are difficult to implement and cannot be updated without a full reboot. Researchers are bypassing the kernel complexity, including the challenge of communicating constraints around optimization targets that are often in conflict (e.g. latency versus power), by evaluating new scheduling policies within bespoke data plane operating systems. However, it is difficult to maintain and deploy a custom OS image for every important application, particularly in a shared environment. Hence, the practical benefits of new scheduling research have been limited. We present ghOSt, a general-purpose delegation of scheduling policy implemented on top of the Linux kernel. ghOSt provides a rich API that receives scheduling decisions for kernel threads from user code and actuates them as transactions. Programmers can use any language or tools to develop policies, which can be upgraded without a reboot. We develop policies for μs-scale workloads and a production database to demonstrate that ghOSt supports a wide range of scheduling models (per-CPU to centralized, run-to-completion to preemptive) and incurs low overheads for scheduling actions. Many policies are just a few hundred lines of code. Overall, ghOSt provides a performant framework for delegation of thread scheduling policy to userspace processes that enables policy optimization, non-disruptive upgrades, and fault isolation.
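The message-and-transaction model described above can be sketched as a tiny userspace policy loop. This is a toy illustration only: the message names, tuple formats, and FIFO policy are invented, and ghOSt's real API is richer and kernel-mediated.

```python
from collections import deque

class FifoPolicy:
    """Toy ghOSt-style userspace agent: consume thread-state messages and
    answer with (thread, cpu) scheduling transactions to commit."""

    def __init__(self, cpus):
        self.runq = deque()    # runnable threads, FIFO order
        self.idle = set(cpus)  # cpus with nothing to run

    def on_message(self, kind, arg):
        if kind == "wakeup":      # thread `arg` became runnable
            self.runq.append(arg)
        elif kind == "cpu_free":  # cpu `arg` went idle
            self.idle.add(arg)
        # Emit one transaction per (thread, cpu) pairing we can make now.
        txns = []
        while self.runq and self.idle:
            txns.append((self.runq.popleft(), self.idle.pop()))
        return txns

agent = FifoPolicy(cpus={0, 1})
agent.on_message("wakeup", "t1")  # pairs t1 with an idle cpu
```

Because the policy lives in an ordinary userspace process, it can be written in any language, upgraded without rebooting, and isolated from the kernel when it misbehaves, which is the point of the delegation design.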
Bio: Jack Humphries is a software engineer in the Google Cloud Systems Infrastructure group where he works on kernel scheduling for Google's data centers. He completed his Bachelor's degree in Computer Science at Stanford in 2019. He is currently a Master's student in Computer Science at Stanford, advised by Professors Christos Kozyrakis and David Mazières, and will start his Ph.D. in autumn 2021.
Abstract: The modern storage landscape is changing at an exciting rate. New technologies, such as Intel DC Persistent Memory, are being introduced. At the same time, new applications such as blockchain are emerging with new requirements from the storage subsystem. New regulations, such as the General Data Protection Regulation (GDPR), place new constraints on how data may be read and written. As a result, designing storage systems that satisfy these constraints is interesting and challenging. In this talk, I will describe the lessons we learnt from tackling this challenge in various forms: my group has built file systems and concurrent data structures for persistent memory, storage solutions for blockchains and machine learning, and analyzed how the GDPR regulation affects storage systems.
Bio: Vijay Chidambaram is an Assistant Professor in the Computer Science department at the University of Texas at Austin. He did his post-doc at the VMware Research Group, and got his PhD with Profs. Remzi and Andrea Arpaci-Dusseau at the University of Wisconsin-Madison. His papers have won Best Paper Awards in ATC 2018, FAST 2018, and FAST 2017. He was awarded the NSF CAREER Award in 2018, SIGOPS Dennis M. Ritchie Dissertation Award in 2016, and the Microsoft Research Fellowship in 2014. Techniques from his work have been incorporated into commercial products, and his work has helped make the Linux kernel more reliable.
Abstract: Today's datacenter applications couple scale and time: applications that harness large numbers of servers also execute for long periods of time (seconds or more). This paper explores the possibility of flash bursts: applications that use a large number of servers but for very short time intervals (as little as one millisecond). In order to learn more about the feasibility of flash bursts, we developed MilliSort and MilliQuery. MilliSort is a sorting application and MilliQuery implements three SQL queries. The goal for both applications was to process as many records as possible in one millisecond, given unlimited resources in a datacenter. The short time scale required a new distributed sorting algorithm for MilliSort that uses a hierarchical form of partitioning. Both applications depended on fast group communication primitives such as shuffle and all-gather. Our implementation of MilliSort can sort 0.84 million items in one millisecond using 120 servers on an HPC cluster; MilliQuery can process 0.03–48 million items in one millisecond using 60–280 servers, depending on the query. The number of items that each application can process grows quadratically with the time budget. The primary obstacle to scalability is per-message costs, which appear in the form of inefficient shuffles and coordination overhead.
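The splitter-based partition-and-shuffle structure underlying distributed sorting can be sketched in a single process. This is a toy illustration of the general family of algorithms, with invented names: real systems pick splitters by sampling rather than from fully sorted data, and MilliSort's hierarchical partitioning and communication layer are far more involved.

```python
def distributed_sort(shards):
    """Toy splitter-based distributed sort; each shard plays one server."""
    n = len(shards)
    flat = sorted(x for s in shards for x in s)
    # n-1 splitters so each server receives an equal-size key range.
    # (A real system samples to estimate these instead of sorting globally.)
    splitters = [flat[len(flat) * i // n] for i in range(1, n)]
    buckets = [[] for _ in range(n)]
    for shard in shards:                 # each server partitions its local data
        for x in shard:
            dest = sum(x >= sp for sp in splitters)
            buckets[dest].append(x)      # "shuffle" x to server dest
    return [sorted(b) for b in buckets]  # each server sorts its own range

parts = distributed_sort([[5, 1, 9], [3, 7, 2], [8, 4, 6]])
```

At millisecond scale, the shuffle step in the middle dominates, which is why the abstract identifies per-message costs and shuffle efficiency as the primary obstacles.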
Abstract: Distributed protocols make it possible to build scalable and reliable systems, but come at a performance cost. Recent advances in accelerators have yielded major improvements in single-node performance, increasingly leaving distributed communication as a bottleneck. In this talk, I’ll argue that in-network computation can serve as the missing accelerator for distributed systems. Enabled by new programmable switches and NICs that can place small amounts of computation directly in the network fabric, we can speed up common communication patterns for distributed systems, and reach new levels of performance. I’ll describe three systems that use in-network acceleration to speed up classic communication and coordination challenges. First, I’ll show how to speed up state machine replication using a network sequencing primitive. The ordering guarantees it provides allow us to design a new consensus protocol, Network-Ordered Paxos, with extremely low performance overhead. Second, I’ll show that even a traditionally compute-bound workload -- ML training -- can now be network-bound. Our new system, SwitchML, alleviates this bottleneck by accelerating a common communication pattern using a programmable switch. Finally, I’ll show that using in-network computation to manage the migration and replication of data, in a system called Pegasus, allows us to load-balance a key-value store to achieve high utilization and predictable performance in the face of skewed workloads.
Bio: Dan Ports is a Principal Researcher at Microsoft Research and Affiliate Assistant Professor in Computer Science and Engineering at the University of Washington. Dan’s background is in distributed systems research, and more recently he has been focused on how to use new datacenter technologies like programmable networks to build better distributed systems. He leads the Prometheus project at MSR, which uses this co-design approach to build practical high-performance distributed systems. Dan received a Ph.D. from MIT (2012). His research has been recognized with best paper awards at NSDI and OSDI.
Abstract: Using modern cloud computing platforms, a consumer can rapidly acquire thousands of CPUs, featuring terabytes of aggregate memory and hundreds of gigabits of bandwidth to shared storage, on demand. This abundance of resources offers the promise of low-latency execution of expensive graphics jobs, such as path tracing. In this talk, I present R2E2, the first system architected to perform low latency path tracing of terabyte-scale scenes using serverless computing nodes in the cloud. R2E2 is a parallel renderer that leverages elastic cloud platforms (availability of many CPUs/memory in aggregate and massively parallel access to shared storage) and mitigates the cloud's limitations (low per-node memory capacity and high latency inter-node communication). R2E2 rapidly acquires thousands of cloud CPU cores, loads scene geometry from a pre-built scene BVH into the aggregate memory of these nodes in parallel, and performs full path traced global illumination using an inter-node messaging service designed for communicating ray data. Scenes with up to a terabyte of geometry can be path traced at high resolution, in a few minutes, using thousands of tiny serverless nodes on the AWS Lambda platform.
Abstract: In this work, we explore how programmable data planes can naturally provide a higher level of service to user applications via a new abstraction called packet subscriptions. Packet subscriptions generalize forwarding rules, and can be used to express both traditional routing and more esoteric, content-based approaches. We present strategies for routing with packet subscriptions in which a centralized controller has a global view of the network, and the network topology is organized as a hierarchical structure. We also describe a compiler for packet subscriptions that uses a novel BDD-based algorithm to efficiently translate predicates into P4 tables that can support O(100K) expressions. Using our system, we have built three diverse applications. We show that these applications can be deployed in brownfield networks while performing line-rate message processing, using the full switch bandwidth of 6.5Tbps.
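Content-based forwarding with packet subscriptions can be illustrated with a small software model. The field names and predicates below are invented for illustration; the system described above compiles such predicates into P4 tables via its BDD-based algorithm rather than evaluating them in software as this sketch does.

```python
# Each subscriber (port) registers a predicate over message fields; the
# "switch" forwards a packet to every port whose predicate matches.
subscriptions = {
    1: lambda m: m["symbol"] == "AAPL",
    2: lambda m: m["symbol"] == "AAPL" and m["price"] > 150,
    3: lambda m: m["volume"] > 1_000_000,
}

def forward(msg):
    """Return the sorted list of ports whose predicate matches msg."""
    return sorted(port for port, pred in subscriptions.items() if pred(msg))

ports = forward({"symbol": "AAPL", "price": 180, "volume": 10})
```

Note how an ordinary forwarding rule (match on a destination field, send to one port) is just the degenerate case of this model with a single equality predicate per port.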
Abstract: Data volumes are growing exponentially, fueled by an increased number of automated processes such as sensors and devices. Meanwhile, the computational power available for processing this data – as well as analysts’ ability to interpret it – remain limited. As a result, database systems must evolve to address these new bottlenecks in analytics. In my work, I ask: how can we adapt classic ideas from database query processing to modern compute- and attention-limited data analytics? In this talk, I will discuss the potential for this kind of systems development through the lens of several practical systems I have developed. By drawing insights from database query optimization, such as pushing workload- and domain-specific filtering, aggregation, and sampling into core analytics workflows, we can dramatically improve the efficiency of analytics at scale. I will illustrate these ideas by focusing on two systems — one designed for high-volume seismic waveform analysis and one designed to optimize visualizations for streaming infrastructure and application telemetry — both of which have been field-tested at scale. I will also discuss lessons from production deployments at companies including Datadog, Microsoft, Google, and Facebook.
Abstract: Deep learning models have enabled state-of-the-art results across a broad range of applications; however, training these models is extremely time- and resource-intensive, taking weeks on clusters with thousands of expensive accelerators in the extreme case. In this talk, I will describe two systems that improve the resource efficiency of model training. The first system, PipeDream, proposes a new primitive called pipeline parallelism to accelerate distributed training. Pipeline parallelism facilitates model training with lower communication overhead than previous methods while still ensuring high compute resource utilization, and it also enables the efficient training of large models that do not fit on a single worker. Pipeline parallelism is being used at Facebook, Microsoft, OpenAI, and Nvidia for efficient large-scale model training. The second system, Gavel, determines how resources in a shared cluster with heterogeneous compute resources (e.g., different types of hardware accelerators) should be partitioned among different users to optimize objectives specified over multiple training jobs. Gavel can improve various scheduling objectives, such as average completion time, makespan, or cloud computing resource cost, by up to 3.5x. I will conclude the talk with a discussion of future directions for optimizing machine learning systems.
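The intuition behind pipeline parallelism can be shown with a toy schedule: with S stages and M microbatches, stages overlap on different microbatches, so the pipeline finishes in S + M - 1 steps rather than S x M sequential steps. This is a minimal sketch of the scheduling idea only, not PipeDream's implementation (which also handles weight versioning and asynchrony).

```python
# Hypothetical sketch: which (stage, microbatch) pairs run at each time step
# of a simple fill-and-drain pipeline.

def pipeline_schedule(num_stages, num_microbatches):
    """Return, per time step, the (stage, microbatch) pairs active in parallel."""
    steps = []
    for t in range(num_stages + num_microbatches - 1):
        # Microbatch m reaches stage s at time t = s + m.
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        steps.append(active)
    return steps

sched = pipeline_schedule(num_stages=3, num_microbatches=4)
print(len(sched))  # → 6 steps, versus 12 if each microbatch ran stage by stage
```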
Abstract: Machine learning (ML) can now be used to answer a range of queries over unstructured data (e.g., videos, text) by extracting structured information from this data (e.g., object types and positions in videos). Unfortunately, these ML methods can be prohibitively expensive to deploy for many organizations. In this talk, I'll first describe algorithms to accelerate ML-based queries using approximations to expensive ML methods. I'll describe algorithms to accelerate selection, aggregation, and limit queries so that query results have statistical guarantees on accuracy, despite using approximations. These algorithms can accelerate queries by orders of magnitude compared to recent work. I'll then describe how to use new programming abstractions to find errors in ML models and in human-generated labels. These abstractions, model assertions and learned observation assertions, can find errors in mission-critical datasets and can be used to improve ML model performance by up to 2x. Time permitting, I'll discuss ongoing collaborations with the Toyota Research Institute and Stanford ecologists on deploying my research.
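A common pattern behind approximation-based query acceleration is to score records with a cheap proxy model and spend the expensive model's budget only on the most promising ones. The sketch below illustrates that pattern under assumed names (`approximate_select`, `proxy`, `oracle`); the actual algorithms in the talk additionally provide statistical guarantees on result accuracy, which this toy version omits.

```python
# Hypothetical sketch of proxy-based selection: rank by a cheap proxy score,
# then invoke the expensive "oracle" model on at most `budget` records.

def approximate_select(records, proxy, oracle, budget):
    """Return records the oracle accepts, calling it at most `budget` times."""
    ranked = sorted(records, key=proxy, reverse=True)
    return [r for r in ranked[:budget] if oracle(r)]

records = list(range(20))
proxy = lambda r: r             # cheap stand-in score
oracle = lambda r: r % 2 == 0   # expensive ground-truth predicate
result = approximate_select(records, proxy, oracle, budget=5)
print(result)  # → [18, 16]
```

With a good proxy, most true positives land in the top of the ranking, so the oracle runs on a small fraction of the data, which is the source of the orders-of-magnitude speedups.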
Abstract: Machine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpredictable execution times. Yet the underlying execution times are not fundamentally unpredictable; on the contrary, we observe that inference using Deep Neural Network (DNN) models has deterministic performance. In this talk, starting with the predictable execution times of individual DNN inferences, I will show how we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance. I will discuss the evaluation of our implementation, Clockwork, using production trace workloads, and show that Clockwork can support thousands of models while simultaneously meeting 100 ms latency targets for 99.997% of requests. Finally, I will demonstrate that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation. Clockwork is a collaboration between researchers at Emory University and MPI-SWS, and is available online at https://gitlab.mpi-sws.org/cld/ml/clockwork.
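One way to see why deterministic execution times matter is that the scheduler can decide up front whether a request can meet its deadline, instead of discovering a miss after the fact. The sketch below is a generic earliest-deadline admission loop under that assumption; it is illustrative only and is not Clockwork's actual scheduler.

```python
# Hypothetical sketch: with known per-request execution times, a scheduler
# can admit only requests that will finish before their latency target and
# proactively drop the rest (rather than letting them miss silently).

def schedule(requests, now=0.0):
    """requests: list of (exec_time, deadline) pairs, times in ms.
    Run in earliest-deadline order; drop any request that cannot finish on time."""
    completed, dropped = [], []
    t = now
    for exec_time, deadline in sorted(requests, key=lambda r: r[1]):
        if t + exec_time <= deadline:
            t += exec_time
            completed.append((exec_time, deadline))
        else:
            dropped.append((exec_time, deadline))
    return completed, dropped

done, dropped = schedule([(10, 100), (50, 60), (30, 40)])
print(done, dropped)  # → [(30, 40), (10, 100)] [(50, 60)]
```

Every admitted request meets its deadline by construction, which is the kind of request-level SLO guarantee that unpredictable execution times make impossible.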
Bio: Ymir Vigfusson is Assistant Professor of Computer Science at Emory University, co-PI of the Emory SimBioSys lab, a CDC Guest Researcher, and co-founder and Chief Science Officer of the cybersecurity companies Syndis and Adversary (acquired by Secure Code Warrior in 2020). His research focuses on the scalability of distributed systems, where caching systems hold a special place in his heart, as well as network science, computational epidemiology (influenza and malaria), and cybersecurity education. He holds a PhD in Computer Science from Cornell University and was on faculty at Reykjavik University in Iceland before joining Emory in 2014. Ymir is a former hacker, an NSF CAREER awardee, a father of four (please send help), a pianist and a private pilot (I beg you), and that blasted Reviewer #3.
Abstract: Security researchers depend on visibility into large volumes of network traffic in order to answer questions about security recommendations, new protocols, malware, and more. However, as traffic speeds increase to 100GbE and beyond, network data from large ISPs or enterprise networks is becoming more difficult to obtain and analyze. Historically, traffic analysis has been restricted to small networks due to the performance limitations of current tools. Fortunately, recent advances in x86 servers, NICs, and fast packet I/O libraries like DPDK and PF_RING show potential for performing passive analysis on high-speed traffic. I will present our work in progress on building a framework for high-speed passive analysis on a single server. Our goal is to allow users to subscribe to subsets of reassembled application-layer data (e.g., all TLS handshakes to Netflix) and to automate many of the mechanical aspects of collecting network data, including implementing filters, reconstructing TCP streams at high speeds, and load balancing packet processing across cores.
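The TCP reassembly step the abstract mentions can be sketched simply: buffer out-of-order segments by sequence number and emit the contiguous application-layer byte stream. This toy version (names assumed) treats the smallest sequence number as the stream start and ignores retransmissions, wraparound, and multi-flow state, all of which a real high-speed framework must handle.

```python
# Hypothetical sketch of TCP stream reassembly: stitch possibly out-of-order
# segments into contiguous application-layer bytes.

def reassemble(segments):
    """segments: list of (seq, payload) pairs, possibly out of order."""
    buffered = dict(segments)                 # seq -> payload
    out = b""
    next_seq = min(buffered) if buffered else 0
    while next_seq in buffered:               # emit while the stream is contiguous
        payload = buffered.pop(next_seq)
        out += payload
        next_seq += len(payload)
    return out

stream = reassemble([(100, b"GET /"), (110, b"1.1\r\n"), (105, b"HTTP/")])
print(stream)  # → b'GET /HTTP/1.1\r\n'
```

Doing this at line rate across millions of concurrent flows is the hard systems problem; the logic itself is only this simple in the single-flow, lossless case.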
Abstract: We need affordable and innovative sensing to assist in tackling global-scale problems. Radars have become a common tool in sensing applications; however, target detection and identification remain challenges. This talk presents our system, which combines RF backscatter tags with commodity UWB radar to cooperatively detect and identify objects in real time. Our approach makes existing sensing applications, like localization, significantly easier and enables novel applications. I will describe how, even in cluttered static environments without line of sight, the return amplitude of our backscatter tags can be 100-10,000x larger than the strongest noise signal. I will also present two example applications we implemented to demonstrate the potential of our system: soil moisture sensing and respiration monitoring.
Abstract: Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing schedulers for clusters of accelerators, which are used to arbitrate these expensive training resources across many users, have shown how to optimize for various multi-job, multi-user objectives, like fairness and makespan. Unfortunately, existing schedulers largely do not consider performance heterogeneity. In this talk, I will present Gavel, a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies. Gavel expresses these policies as optimization problems and then systematically transforms these problems into heterogeneity-aware versions using an abstraction we call effective throughput. Gavel then uses a round-based scheduling mechanism to ensure jobs receive their ideal allocation given the target scheduling policy. Gavel's heterogeneity-aware policies allow a heterogeneous cluster to sustain higher input load and improve end objectives such as makespan and average job completion time by 1.4x and 3.5x compared to heterogeneity-agnostic policies.
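The effective throughput abstraction can be made concrete with a toy two-job, two-accelerator example: a job's effective throughput is its per-accelerator throughput weighted by the fraction of time it holds each accelerator type. The numbers, names, and tiny brute-force search below are all illustrative; Gavel itself solves such policies as proper optimization problems rather than by grid search.

```python
# Hypothetical sketch of "effective throughput": throughput on each accelerator
# type, weighted by the time fraction a job is allocated that type.

throughputs = {                      # samples/sec per job per accelerator type
    "jobA": {"V100": 100.0, "K80": 20.0},
    "jobB": {"V100": 40.0,  "K80": 30.0},
}

def effective_throughput(job, alloc):
    """alloc maps accelerator type -> fraction of time `job` holds it."""
    return sum(frac * throughputs[job][acc] for acc, frac in alloc.items())

# One V100 and one K80 shared by two jobs: jobA holds the V100 a fraction x
# of the time (and the K80 the rest); jobB holds the complement. Grid-search
# x to maximize the minimum effective throughput (a max-min fairness policy).
best = max(
    (min(effective_throughput("jobA", {"V100": x, "K80": 1 - x}),
         effective_throughput("jobB", {"V100": 1 - x, "K80": x})), x)
    for x in [i / 100 for i in range(101)]
)
print(best)  # best (min effective throughput, x) on this grid
```

Note that the fair point gives jobA only a minority share of the fast V100, because jobA also runs tolerably on the K80 while jobB benefits less from the V100: exactly the heterogeneity that throughput-agnostic policies ignore.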
Credit to the Stanford MLSys Seminar Series for the site theme!