The Stanford Systems Seminar, sponsored by Platform Lab, is an opportunity for Stanford faculty and students working in systems to present their research and invite guest speakers. Seminars are held in-person/hybrid on Tuesdays in the Gates Fujitsu Room (Gates 403) at 4 PM PST, with an online option on Zoom. Food is provided. The seminar is organized by Deepti Raghavan and Peter Kraft. To receive regular updates on upcoming talks, as well as Zoom links to join them virtually, please reach out to one of us so we can add you to our mailing list (Stanford or Platform Lab affiliates only). Please also reach out to us if you are interested in presenting.
AbstractModern complex datacenter applications exhibit unique characteristics such as extensive data and instruction footprints, complex control flow, and hard-to-predict branches that are not adequately served by existing microprocessor architectures. In particular, these workloads exceed the capabilities of microprocessor structures such as the instruction cache, BTB, branch predictor, and data caches, causing significant degradation of performance and energy efficiency. In my talk, I will provide a detailed characterization of datacenter applications, highlighting the importance of addressing frontend and backend performance issues. I will then introduce three new techniques to address these challenges, improving the branch predictor, data cache, and instruction scheduler. I will make the case for profile-guided optimizations that amortize overheads across the fleet and which have been successfully deployed at Google and Intel, serving millions of users daily.
BioHeiner Litz is an Assistant Professor at the University of California, Santa Cruz working in the field of Computer Architecture and Systems. His research focuses on improving the performance, cost, and efficiency of data center systems. Heiner is the recipient of the NSF CAREER award, Intel's Outstanding Researcher award, Google's Faculty Award, and his work received the 2020 IEEE MICRO Top Pick award. Before joining UCSC, Heiner Litz was a researcher at Google and a postdoctoral research fellow at Stanford University with Prof. Christos Kozyrakis and David Cheriton. Dr. Litz received his Diplom and Ph.D. from the University of Mannheim, Germany, advised by Prof. Bruening.
AbstractWith the slowing of technology scaling, computer systems must look higher in the stack for performance and energy efficiency. Data movement is usually to blame for poor scaling, but current systems give software too few options to optimize data movement. As a result, many systems have turned to specialized solutions, often requiring custom hardware, to deal with the data movement problem. This talk will argue that the hardware-software interface is the problem, and data-centric design is the solution. We will consider three data-centric systems that optimize data movement while remaining general-purpose: (i) Polymorphic cache hierarchies allow software to modify or expand the memory interface on general-purpose hardware. (ii) Energy-minimal dataflow architectures massively reduce instruction and data movement, letting software compete with ASIC efficiency. And (iii) General-purpose flash-based caches reduce cost by an order of magnitude in the datacenter behind a simple interface.
AbstractThe success of deep neural networks (DNNs) from Machine Learning (ML) has inspired domain specific architectures (DSAs) for them. ML has two phases: training, which constructs accurate models, and inference, which serves those models. Google’s first generation DSA offered 50x improvement over conventional architectures for inference in 2015. Google next built the first production DSA supercomputer for the much harder problem of training. Subsequent generations greatly improved performance of both phases. We start with ten lessons learned, such as DNNs grow rapidly; workloads quickly evolve with DNN advances; the bottleneck is memory, not floating-point units; and semiconductor technology advances unequally. The rapid growth of DNNs rightfully raised concerns about their carbon footprint. The second part of the talk identifies the “4Ms” (Model, Machine, Mechanization, Map) that, if optimized, can reduce ML training energy by up to 100x and carbon emissions up to 1000x. By improving the 4Ms, ML held steady at <15% of Google’s total energy use despite it consuming ~75% of its floating point operations. Climate change is one of our most important problems, so ML papers should include emissions explicitly to foster competition on more than just model quality. External estimates have been off 100x–100,000x, so publishing emissions also ensures accurate accounting, which helps pinpoint the biggest challenges. With continuing focus on the 4Ms, we can realize the amazing potential of ML to positively impact many fields in a sustainable way.
AbstractMemory hierarchies are changing, both moving persistence closer to computation and pushing the semantics of memory to a more disaggregated setting. We present Twizzler, an operating system redesign for this near-future. Twizzler removes the kernel from the I/O path, provides programs with memory-style access to global address space of data using small (64 bit), object-relative cross-object pointers, and enables simple and efficient long-term sharing of data in both space and time. Twizzler provides a clean-slate programming model for persistent and shared data, realizing the vision of Unix in a world of evolving memory hierarchies.
BioNathan Beckmann is an assistant professor at Carnegie Mellon University in the Computer Science and (by courtesy) Electrical and Computer Engineering Departments. His research focuses on general-purpose, data-centric computer architectures and systems. His awards include a Google Research Scholar Award in 2021 and the NSF CAREER Award in 2019. He graduated from MIT advised by Daniel Sanchez, receiving the George M. Sprowls Award for an outstanding PhD dissertation in computer science in 2015.
AbstractModern cluster managers routinely grapple with hard combinatorial optimization problems, such as policy-based load balancing, placement, scheduling, and configuration. Implementing ad-hoc heuristics to solve these problems is notoriously hard to do, making it challenging to evolve the system over time and add new features. In this talk, I will present Declarative Cluster Managers (DCM), a general approach for building cluster managers that makes them performant and easily extensible. With DCM, developers specify the cluster manager's behavior using a high-level declarative language like SQL and let a compiler take care of generating an efficient implementation. I will show how DCM significantly lowers the barrier to building scalable and extensible cluster manager components, in the context of some real-world systems like Kubernetes.
AbstractIncremental view maintenance has been for a long time a central problem in database theory. Many solutions have been proposed for restricted classes of database languages, such as the relational algebra, or Datalog. These techniques do not naturally generalize to richer languages. In this work we give a general solution to this problem in 3 steps: (1) we describe a simple but expressive language called DBSP for describing computations over data streams; (2) we give a general algorithm for solving the incremental view maintenance problem for arbitrary DBSP programs, and (3) we show how to model many rich database query languages (including the full relational queries, grouping and aggregation, monotonic and non-monotonic recursion, and streaming aggregation) using DBSP. As a consequence, we obtain efficient incremental view maintenance techniques for all these rich languages. Joint work with Leonid Ryzhyk (Vmware Research), Frank McSherry (Materialize Inc.) and Val Tannen (University of Pennsylvania)
BioMihai Budiu is a senior researcher in the VMware Research Group (VRG). He has a Ph.D. in CS from Carnegie Mellon University. He was previously employed at Microsoft Research and Barefoot Networks. Mihai's has worked on reconfigurable hardware, computer architecture, compilers, security, distributed systems, big data platforms, large-scale machine learning, programmable networks and P4, data visualization, and databases; four of his papers have received “test of time” awards.
AbstractAnalytics on video recorded by cameras in public areas have the potential to fuel many exciting applications, but also pose the risk of intruding on individuals' privacy. Unfortunately, existing solutions fail to practically resolve this tension between utility and privacy, relying on perfect detection of all private information in each video frame--an elusive requirement. This paper presents: (1) a new notion of differential privacy (DP) for video analytics, (p,K,e)-event-duration privacy, which protects all private information visible for less than a particular duration, rather than relying on perfect detections of that information, and (2) a practical system called Privid that enforces duration-based privacy even with the (untrusted) analyst-provided deep neural networks that are commonplace for video analytics today. Across a variety of videos and queries, we show that Privid achieves accuracies within 79-99% of a non-private system.
AbstractWe present data-parallel actors (DPA), a programming model for building distributed query serving systems. Query serving systems are an important class of applications characterized by low-latency data-parallel queries and frequent bulk data updates; they include data analytics systems like Apache Druid, full-text search engines like ElasticSearch, and time series databases like InfluxDB. They are challenging to build because they run at scale and need complex distributed functionality like data replication, fault tolerance, and update consistency. DPA makes building these systems easier by allowing developers to construct them from purely single-node components while automatically providing these critical properties. In DPA, we view a query serving system as a collection of stateful actors, each encapsulating a partition of data. DPA provides parallel operators that enable consistent, atomic, and fault-tolerant parallel updates and queries over data stored in actors. We have used DPA to build a new query serving system, a simplified data warehouse based on the single-node database MonetDB, and enhance existing ones, such as Druid, Solr, and MongoDB, adding missing user-requested features such as load balancing and elasticity. We show that DPA can distribute a system in <1K lines of code (>10× less than typical implementations in current systems) while achieving state-of-the-art performance and adding rich functionality.
AbstractMy current research interest is in systems support and resource management for distributed machine learning frameworks and applications. Specifically, I am currently working on distributed systems and resource management algorithms for soft-real time Machine Learning inference. This builds on award-winning (Best Student Paper, EuroSys'16) body of R&D at Carnegie Mellon modeling, designing, and developing abstractions, primitives, algorithms and systems for a general resource management framework for static and dynamic heterogeneity, hard and soft placement constraints, time-varying resource capacity guarantees, and combinatorial constraints in heterogeneous resource contexts. Earlier motivating work published in ACM SoCC'12 recently led to a Test of Time award at SoCC'21. I focus on developing algorithms and building systems that support an array of practical aspects of Machine Learning.
BioI am a tenure-track Assistant Professor in the School of Computer Science at Georgia Tech since August 2019. I completed my postdoc at the University of California, Berkeley, working with Ion Stoica and collaborating closely with Joseph Gonzalez. I completed my Ph.D. at Carnegie Mellon University, advised by Gregory Ganger. At Carnegie Mellon, I was honored by the prestigious NSERC Alexander Graham Bell Canada Graduate Scholarship (NSERC CGS-D3) and partially funded by the Intel Science and Technology Centre for Cloud Computing (ISTC-CC) and the Parallel Data Lab (PDL) industry consortium.
AbstractScanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often use multi-dimensional indexes or specialized data layouts (e.g., Z-order). However, these schemes are hard to tune and their performance is inconsistent. In this talk, I will present Flood and Tsunami, two learned multi-dimensional read-optimized indexes for in-memory analytic workloads. By automatically co-optimizing the index structure and data layout for a particular dataset and workload, Flood and Tsunami achieve up to 10X faster query performance and 100X smaller index size than optimally-tuned traditional indexes. I will conclude by giving a brief overview of ongoing work on incorporating instance-optimized data layouts into SageDB, a learned database system.
AbstractAbstract: Real-world multi-agent systems have to safely operate in environments where unsafe decisions have real consequences. One such environment of interest is network congestion control, in which selfish agents with conflicting goals can cause a network to become unusable by causing some agents to receive zero throughput. Motivated by this problem, we consider safety conditions for a broad class of aggregative games that includes congestion control, the classic Cournot oligopoly model, and resource sharing in the commons. We use comparative statics to derive simple, practically useful conditions that guarantee that all users achieve a fixed minimum utility at equilibrium. Compared to prior work on studying properties of equilibria in network congestion control, we show that our approach allows for a more realistic class of utilities and addresses real concerns regarding safe deployment of modern network congestion control protocols. Joint work with Matei Zaharia and Tatsu Hashimoto.
AbstractSerialization frameworks are a fundamental component of scale-out systems, but introduce significant compute overheads. However, they are amenable to acceleration with specialized hardware. To understand the trade-offs involved in architecting such an accelerator, we present the first in-depth study of serialization framework usage at scale by profiling Protocol Buffers (“protobuf”) usage across Google’s datacenter fleet. We use this data to build HyperProtoBench, an open-source benchmark representative of key serialization-framework user services at scale. In doing so, we identify key insights that challenge prevailing assumptions about serialization framework usage. We use these insights to develop a novel hardware accelerator for protobufs, implemented in RTL and integrated into a RISC-V SoC. Applications can easily harness the accelerator, as it integrates with a modified version of the open-source protobuf library and is wire-compatible with standard protobufs. We have fully open-sourced our RTL, which, to the best of our knowledge, is the only such implementation currently available to the community. We also present a first-of-its-kind, end-to-end evaluation of our entire RTL-based system running hyperscale-derived benchmarks and microbenchmarks. We boot Linux on the system using FireSim to run these benchmarks and implement the design in a commercial 22nm FinFET process to obtain area and frequency metrics. We demonstrate an average 6.2x to 11.2x performance improvement vs. our baseline RISC-V SoC with BOOM OoO cores and despite the RISC-V SoC’s weaker uncore/supporting components, an average 3.8x improvement vs. a Xeon-based server
BioSagar Karandikar is a Ph.D. student at UC Berkeley and a Student Researcher at Google. His research focuses on hardware/software co-design in warehouse-scale machines. His work on FireSim, which enables high-performance scale-out FPGA-accelerated full-system simulation, was selected as an IEEE Micro Top Pick, nominated for CACM Research Highlights, and has been used in published work from over 20 companies and academic institutions and in the development of commercially-available silicon. His work on hardware acceleration for protocol buffers received an IEEE Micro Top Picks Honorable Mention and the Distinguished Artifact Award at MICRO-54.
AbstractTransport protocols can be implemented in NIC (Network Interface Card) hardware to increase throughput, reduce latency and free up CPU cycles. If the ideal transport protocol were known, the optimal implementation would be simple: bake it into fixed-function hardware. But transport layer protocols are still evolving, with innovative new algorithms proposed every year. A recent study proposed Tonic, a Verilog-programmable transport layer in hardware. We build on this work to propose a new programmable hardware transport layer architecture, called nanoTransport, optimized for the extremely low-latency message-based RPCs (Remote Procedure Calls) that dominate large, modern distributed data center applications. NanoTransport is programmed using the P4 language, making it easy to modify existing (or create entirely new) transport protocols in hardware. We identify common events and primitive operations, allowing for a streamlined, modular, programmable pipeline, including packetization, reassembly, timeouts and packet generation, all to be expressed by the programmer. We evaluate our nanoTransport prototype by programming it to run the reliable message-based transport protocols NDP and Homa, as well as a hybrid variant. Our FPGA prototype - implemented in Chisel and running on the Firesim simulator - exposes P4-programmable pipelines and is designed to run in an ASIC at 200Gb/s with each packet processed end-to-end in less than 10ns (including message reassembly).
AbstractThe goal of my research is to build efficient large-scale parallel systems that can accelerate big data applications by 100--1000x without incurring a hefty cost. My work is motivated by two growing trends: big data and cloud computing. Today, many businesses and web services store staggering quantities of data in the cloud and lease relatively small clusters of instances to run analytics queries, train machine learning models, and more. However, the exponential data growth, combined with the slowdown of Moore's law, makes it challenging (if not impossible) to run such big data processing tasks in real-time. Most applications run big data workloads on timescales of several minutes or hours and resort to complex, application-specific optimizations to reduce the amount of data processing required for interactive queries. This design pattern hurts developer productivity and restricts the scope of applications that can use big data. My research aims to enable interactive, cost-effective big data processing through flash bursts. Flash bursts enable an application to use a large portion of a shared cluster for short periods of time. This could allow big data applications to complete significantly faster, with a cost comparable to leasing a few instances for a longer period of time. The main challenge to flash bursts is Amdahl's law. Flash bursts launch small tasks per node that run for milliseconds at a time. With such small task granularity, previously negligible overheads (e.g., synchronization, I/O processing, cache misses, etc.) can turn into significant bottlenecks, and increasing communication overhead for coordination across many nodes can limit scaling. My research takes a two-pronged approach to tackle this challenge: building new distributed system infrastructure for flash bursts, and restructuring important applications (e.g., data analytics, DNN training) to use flash bursts efficiently. In this talk, I will focus on how I restructured distributed sorting (MilliSort) and how I removed the overheads of replication (CURP).
BioSeo Jin Park is a PhD alumnus, previously advised by John Ousterhout. He is currently a postdoc at MIT CSAIL with Mohammad Alizadeh. His research focuses on making cluster-scale parallel systems efficient so that big data applications (e.g., DNN training, analytics) can run 100--1000x faster on the public cloud. Other areas of research are improving blockchain throughput under network bandwidth variability, removing overheads of consistent replication, and building tools for debugging tail latencies.
AbstractThe DMA part of RDMA stands for Direct Memory Access. It refers to the ability of a network card (among other devices) to read and write data from a host’s memory without CPU assistance. RDMA’s performance depends on efficient DMAs in the initiating and target hosts. In turn, a DMA’s cost is almost always proportional to the length of the data transfer. The exception is small DMAs, which suffer from high overheads. In this talk, we show that database systems often generate small DMA operations when using RDMA canonically. The reason is that the data they transmit is seldom contiguous by the time transmissions occur. Modern databases avoid this problem by copying data into large transmission buffers and issuing RDMAs over these buffers instead. However, doing this requires a substantial amount of CPU cycles and memory bandwidth, forfeiting RDMA’s benefits: its zero-copy feature. To solve this issue, we introduce D- RDMA, a declarative extension to RDMA. The approach leverages a smart NIC to group data fragments into larger DMAs and produce the same packet stream as regular RDMA. Our experiments show that the network throughput can increase from 18 Gbps per CPU core to up to 98 Gbps (on a 100 Gbps card) with virtually zero CPU usage in a typical data shuffle scenario. We believe that D-RDMA can enable a new generation of high-performance systems to take full advantage of fast networking without incurring the usual CPU penalties.
BioAndré Ryser is a Research Assistant at the eXascale Infolab at the University of Fribourg in Switzerland, where he investigates heterogeneous platforms for data-intensive applications. He has participated in creating software-hardware co-designed infrastructure to push application logic to programmable switches, NVMe SSDs, and, more recently, to NICs. André is a long-time Swiss Olympiad in Informatics participant, having won a gold medal and, later, building and preparing teams.
AbstractThis work aims to develop a lightweight data center execution environment exploiting application semantics to optimize tail performance for cloud services. This system, dubbed Service Boosters, is a library operating system exposing the application structure and semantics to the underlying resource management stack. Using Service Boosters, programmers can declare and annotate the structure of their request processing pipeline, while performance engineers can program advanced management strategies for the application. I present three components of Service Boosters, FineLame, Perséphone, and DeDoS, that exploit application awareness to provide real time anomaly detection; tail-tolerant RPC scheduling; and resource harvesting. FineLame leverages awareness of the request processing pipeline to deploy monitoring and anomaly detection probes on critical resource consumption functions. Using these probes, FineLame can detect abnormal requests in-flight as soon as they depart from the expected workload behavior, allowing operators to prevent a surge in tail latency. Perséphone exploits an understanding of request types to dynamically allocate resources to each type of request and forbid pathological head-of-line blocking from heavy-tailed workloads, without the need for interrupts. Perséphone is well suited for kernel-bypass, microsecond scale applications requiring low overhead scheduling. Finally, DeDoS can identify overloaded components and dynamically scale them on the cluster, harvesting only the resources needed to quench the overload.
AbstractResource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers is often intractable for large problem sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. We observe, however, that many allocation problems are granular: they consist of a large number of clients and resources, each client requests a small fraction of the total number of resources, and clients can interchangeably use different resources. For these problems, we propose an alternative approach that reuses the original optimization problem formulation and leads to better allocations than domain-specific heuristics. Our technique, Partitioned Optimization Problems (POP), randomly splits the problem into smaller problems (with a subset of the clients and resources in the system) and coalesces the resulting sub-allocations into a global allocation for all clients. We provide theoretical and empirical evidence as to why random partitioning works well. In our experiments, POP achieves allocations within 1.5% of the optimal with orders-of-magnitude improvements in runtime compared to existing systems for cluster scheduling, traffic engineering, and load balancing.
AbstractDatacenter systems and I/O devices now run at single-digit microsecond latencies, requiring nanosecond-scale operating systems. Traditional kernel-based operating systems impose an unaffordable overhead, so recent kernel-bypass OSes (e.g., Arrakis, Ix) and libraries (e.g., Caladan, eRPC) eliminate the OS kernel from the I/O datapath. However, none of these systems offer a general-purpose datapath OS replacement that meet the needs of microsecond-scale systems. This talk proposes Demikernel, a flexible datapath OS and architecture designed for heterogenous kernel-bypass devices and microsecond-scale datacenter systems. We have built two prototype Demikernel datapath OSes and show that minimal effort is needed to port existing microsecond-scale systems. Once ported, Demikernel lets applications run across heterogenous kernel-bypass devices with nanosecond-scale overheads and no code changes.
BioIrene Zhang is a Principal Researcher at Microsoft Research. Her work focuses on datacenter operating systems and distributed systems, especially making new datacenter hardware technologies more widely usable by highly-demanding datacenter applications. Irene completed her PhD in 2017 at the University of Washington, where her PhD thesis focused on distributed systems that span mobile devices and cloud servers. Her thesis work received the ACM SIGOPS Dennis Ritchie doctoral dissertation award and the UW Allen School William Chan Memorial dissertation award. Before her PhD, Irene was a member of the virtual machine monitor group at VMware, where she worked on memory resource management and virtual machine checkpointing.
AbstractOrganizations are shifting to a new paradigm called the Lakehouse, which promises the benefits of structured data warehouses on top of unstructured data lakes. This presents new challenges for query execution engines. The execution engine needs to provide good performance over raw uncurated datasets that are ubiquitous in data lakes, and state-of-the-art performance over structured tables stored in open file formats to provide the main benefits of SQL warehouses. Toward these goals, we present Photon, a new native vectorized query engine for the Lakehouse that underlies Databricks Runtime, the execution framework underlying all Databricks workloads. Photon can outperform other cloud data warehouses' specialized engines, but is implemented on top of a more general execution framework that operates over open file formats such as Apache Parquet using modern database techniques such as interpreted vectorization. We discuss the design choices that underlie Photon (e.g., vectorization vs. code generation) and describe its integration with our existing runtime, its task model, and its memory manager. Photon exhibits a 4x improvement end-to-end on the TPC-DS benchmark compared to the previous code-generated engine and has improved customer queries in the field by up to 11x.
AbstractWhen developing a new networking algorithm, it is established practice to run a randomized experiment, or A/B test, to evaluate its performance. In an A/B test, traffic is randomly allocated between a treatment group, which uses the new algorithm, and a control group, which uses the existing algorithm. However, because networks are congested, both treatment and control traffic compete against each other for resources in a way that biases the outcome of these tests. This bias can have a surprisingly large effect; for example, in lab A/B tests with two widely used congestion control algorithms, the treatment appeared to deliver 150% higher throughput when used by a few flows, and 75% lower throughput when used by most flows–despite the fact that the two algorithms have identical throughput when used by all traffic. Beyond the lab, we show that A/B tests can also be biased at scale. In an experiment run in cooperation with Netflix, estimates from A/B tests mistake the direction of change of some metrics, miss changes in other metrics, and overestimate the size of effects. We propose alternative experiment designs, previously used in online platforms, to more accurately evaluate new algorithms and allow experimenters to better understand the impact of congestion on their tests.
AbstractWe propose Clamor, a functional cluster computing framework that adds support for fine-grained, transparent access to global variables for distributed, data-parallel tasks. Clamor targets workloads that perform sparse accesses and updates within the bulk synchronous parallel execution model, a setting where the standard technique of broadcasting global variables is highly inefficient. Clamor implements a novel dynamic replication mechanism in order to enable efficient access to popular data regions on the fly, and tracks fine-grained dependencies in order to retain the lineage-based fault tolerance model of systems like Spark. Clamor can integrate with existing Rust and C ++ libraries to transparently distribute programs on the cluster. We show that Clamor is competitive with Spark in simple functional workloads and can improve performance significantly compared to custom systems on workloads that sparsely access large global variables: from 5× for sparse logistic regression to over 100× on distributed geospatial queries.
AbstractWe introduce Transaction Triaging, a set of techniques that manipulate streams of transaction requests and responses while they travel to and from a database server. Compared to normal transaction streams, the triaged ones execute faster once they reach the database. The triaging algorithms do not interfere with the transaction execution nor require adherence to any particular concurrency control method, making them easy to port across database systems. Transaction Triaging leverages recent programmable networking hardware that can perform computations on in-flight data. We evaluate our techniques on an in-memory database system using an actual programmable hardware network switch. Our experimental results show that triaging brings enough performance gains to compensate for almost all networking overheads. In high-overhead network stacks such as UDP/IP, we see throughput improvements from 2.05× to 7.95×. In an RDMA stack, the gains range from 1.08× to 1.90× without introducing significant latency.
AbstractSuboptimal scheduling decisions in operating systems, networking stacks, and application runtimes are often responsible for poor application performance, including higher latency and lower throughput. These poor decisions stem from a lack of insight into the applications and requests the scheduler is handling and a lack of coherence and coordination between the various layers of the stack, including NICs, kernels, and applications. We propose Syrup, a framework for user-defined scheduling. Syrup enables untrusted application developers to express application-specific scheduling policies across these system layers without being burdened with the low-level system mechanisms that implement them. Application developers write a scheduling policy with Syrup as a set of matching functions between inputs (threads, network packets, network connections) and executors (cores, network sockets, NIC queues) and then deploy it across system layers without modifying their code. Syrup supports multi-tenancy as multiple co-located applications can each safely and securely specify a custom policy. We present several examples of uses of Syrup to define application and workload-specific scheduling policies in a few lines of code, deploy them across the stack, and improve performance up to 8x compared with default policies.
AbstractReconfigurable analog devices are a powerful new computing substrate especially appropriate for executing dynamical systems in an energy efficient manner. These devices leverage the physical behavior of transistors to directly implement computation. Under this paradigm, voltages and currents within the device implement continuously evolving variables in the computation. In this talk, I discuss compilation techniques for automatically configuring such devices to execute dynamical systems. I present Legno, the first compilation system that automatically targets a real reconfigurable analog device of this class. Legno synthesizes analog circuits from parametric and specialized analog blocks and accounts for analog noise, quantization error, operating range limitations, and manufacturing variations within the device. I evaluate Legno on applications from the biology, physics, and controls domains. The results demonstrate that these applications execute with acceptable error while consuming microjoules of energy.
BioSara Achour is an Assistant Professor jointly appointed to both the Computer Science Department and the Electrical Engineering Department at Stanford University. Her research focuses on new techniques and tools, specifically new programming languages, compilers, and runtime systems, that enable end-users to easily develop computations that exploit the potential of emergent nontraditional computing platforms.
AbstractInternet-connected IoT devices pose a significant threat to user privacy. Compromised or malicious vendors have access to microphones, cameras, door locks, and other highly sensitive data sources in people’s homes. Though a few IoT apps inherently depend on the cloud for consuming content and sharing analytics, much of the rationale for cloud-based control is to utilize cloud hardware resources. However, many users already own the computation, storage, and connectivity necessary to support today’s IoT applications. The missing piece is a framework to make these resources available to IoT devices. This talk presents Karl, an architecture for a home cloud that executes as much functionality as possible on user-owned hardware. Karl allows IoT devices to offload computation and storage to the user’s own computers using a module programming model inspired by serverless computing. Karl also mediates access to the Internet through pipeline policies, easily visualized and mapped to English-language privacy guarantees. Pipeline policies are compiled down to a simplified form of mandatory access control based on tags, which justify the transfer of data between and out of sandboxed modules. We prototype Karl and implement 3 sensors, 9 modules, and 9 pipeline policies using it. We show that Karl can easily express sophisticated privacy policies and is practical, with latencies below 2ms for interactive applications.
AbstractIn this talk I will discuss the back-story behind the Raft consensus algorithm: why we decided to undertake this project, how the algorithm developed, and the challenges of publishing an idea that 'gores a sacred cow'. I will also make several observations about how to perform research, how program committees work, and the relationship between Paxos and Raft.
AbstractAs an increasingly important workload, machine learning (ML) applications require different performance optimization techniques from traditional runtimes and compilers. In particular, to accelerate ML applications, it is generally necessary to perform ML computations on heterogeneous hardware and parallelize computations using multiple data dimensions, neither of which is even expressible in traditional compilers and runtimes. In this talk, I will describe our work on automated approaches to building performant and scalable ML systems. Instead of relying on human effort to manually design and implement systems optimizations for ML workload, our work automatically discovers ML optimizations by leveraging the statistical and mathematical properties of ML computations, such as the multi-linearity of tensor algebra. Compared to today's manually-design systems optimizations, our work significantly improves the efficiency and scalability of ML computations and provides stronger correctness guarantees, while requiring much less human effort. I will also outline future research directions for further automating ML systems, such as codesigning ML models, software systems, and hardware backends for end-to-end ML deployment.
BioZhihao Jia is a research scientist at Facebook and will join CMU as an assistant professor of computer science in Fall 2021. He obtained his Ph.D. from Stanford working with Alex Aiken and Matei Zaharia. His research interests lie in the intersection of computer systems and machine learning, with a focus on building efficient, scalable, and high-performance systems for ML computations.
AbstractWhile deep neural networks (DNNs) are an increasingly popular way to query large corpora of data, their significant runtime remains an active area of research. As a result, researchers have proposed systems and optimizations to reduce these costs by allowing users to trade off accuracy and speed. In this work, we examine end-to-end DNN execution in visual analytics systems on modern accelerators. Through a novel measurement study, we show that the preprocessing of data (eg, decoding, resizing) can be the bottleneck in many visual analytics systems on modern hardware. To address the bottleneck of preprocessing, we introduce two optimizations for end-to-end visual analytics systems. First, we introduce novel methods of achieving accuracy and throughput trade-offs by using natively present, low-resolution visual data. Second, we develop a runtime engine for efficient visual DNN inference. This runtime engine a) efficiently pipelines preprocessing and DNN execution for inference, b) places preprocessing operations on the CPU or GPU in a hardware-and input-aware manner, and c) efficiently manages memory and threading for high throughput execution. We implement these optimizations in a novel system, Smol, and evaluate Smol on eight visual datasets. We show that its optimizations can achieve up to 5.9x end-to-end throughput improvements at a fixed accuracy over recent work in visual analytics.
AbstractModern computer systems can introduce correctness and/or security issues into seemingly secure programs. This is because hardware does not execute program instructions atomically. Rather, individual instructions get “cracked” into a collection of hardware events that take place on behalf of the user-facing instruction. Events corresponding to one instruction can interleave and interact with the events corresponding to another instruction in a variety of different ways during a program’s execution. Some of these interactions can translate to program-level security vulnerabilities. Evaluating the correctness/security of a program requires searching the space of all possible ways in which the program could run a particular hardware implementation for execution scenarios that feature security violations. Fortunately, the field of automated reasoning has developed tools for conducting such an analysis, subject to the user’s ability to provide specifications of the target 1) hardware system and 2) correctness/security property. Both specifications impact the soundness, completeness, and efficiency of the final verification effort. In this talk, I will give an overview of some of our work on applying formal methods techniques to the problem of correctness/security verification of modern processor designs. In particular, I will focus on the CheckMate approach for modeling hardware systems in a way that makes them amenable to efficient formal security analysis, and I will discuss how we are addressing the specification challenge above to, for example, automatically lift formal hardware specifications for security analysis directly from RTL.
AbstractWe present the first compiler for the general class of sparse array programming languages (i.e., sparse NumPy). A sparse array programming language supports element-wise operations, reduction, and broadcasting of arbitrary functions over both dense and sparse arrays. Such languages have great expressive power and can express sparse/dense tensor algebra, functions over images, exclusion and inclusion filters, and even graph algorithms. Our compiler generalizes prior work on sparse tensor algebra compilation, which assumes additions and multiplications only, to any function over sparse arrays. We thus generalize the notion of sparse iteration spaces beyond intersections and unions and automatically derive them from how the algebraic properties of the functions interact with the compressed out values of the arrays. We then show for the first time how to compile these iteration spaces to efficient code. The resulting bespoke code performs 1.5–70x (geometric mean of 13.7x) better than the Pydata/Sparse Python library, which implements the alternative approach that reorganizes sparse data and calls pre-written dense functions.
AbstractState machine replication (SMR) increases the availability of an application; by replicating each request to multiple servers, the system can hide server failures from a client. However, replicating requests introduces overhead. Indeed, traditional SMR may add hundreds of microseconds in normal execution, and need hundreds of milliseconds to recover from a server failure. With the rise of microsecond applications, such replication overheads become unacceptable. Newer SMR systems reduce this overhead to several microseconds in normal execution, and recover in just tens of milliseconds. However, for the fastest applications, this may still be unsatisfactory. In this talk, I’ll present our work on a new state machine replication system called Mu, which carefully leverages remote direct memory access (RDMA) to drastically improve replication latency. Mu requires only 1.3 microseconds to replicate a request, and takes less than a millisecond to recover from failures. Thus, Mu demonstrates that replication algorithms can be fast -- both in normal execution and in recovery.
AbstractThe proliferation of camera-enabled devices and large video repositories has given rise to a diverse set of video analytics applications. The video pipelines for these applications are DAGs of operations that transform videos, process extracted metadata, and answer questions such as, "Is this intersection congested?" The latency and resource efficiency of pipelines can be optimized using configurable knobs for each operation such as the sampling rate, batch size, or type of hardware used. However, determining efficient configurations is challenging because (a) the configuration search space is exponentially large, and (b) the optimal configuration depends on the desired latency target and the input video contents that may exercise different paths in the DAG and produce different volumes of intermediate results. Existing video analytics and processing systems leave it to the users to manually configure operations and select hardware resources. Hence, we observe that they often execute inefficiently and fail to meet latency and cost targets. In this talk, we present Llama, a heterogeneous and serverless framework for auto-tuning video pipelines. Llama optimizes the overall video pipeline latency by (a) dynamically calculating latency targets per-operation invocation, and (b) dynamically running a cost-based optimizer to determine efficient configurations that meet the target latency for each invocation. We show that Llama achieves reduced latency and cost compared to state-of-the-art cluster and serverless video analytics and processing systems.
AbstractWe present a reinforcement learning system for visually complex 3D environments built around a custom simulator design that processes large batches of simulated environments simultaneously. This batch simulation strategy allows GPU resources to be efficiently leveraged by amortizing memory and compute costs across multiple simulated agents, dramatically improving the number of simulated environments per GPU and overall simulation throughput. Our implementation of navigation trains agents on the Gibson dataset at 19,000 frames of experience per second on a single GPU (and up to 72,000 frames per second on a single eight-GPU machine) – more than 100x faster than prior work in the same environments. In terms of end-to-end training, policies can be trained to convergence in 1.5 days on a single GPU to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over three days. This talk will describe the architecture of our batch simulator and our strategy of end-to-end optimization throughout the entire reinforcement learning system.
AbstractI present a machine-learning-based approach for creating hardware component knowledge bases directly from the PDF datasheets that manufacturers publish for those components. This approach reduces the amount of costly human input required to create new hardware component knowledge bases. First, I show Fonduer, a novel knowledge base construction system for richly formatted data like PDF datasheets. Fonduer provides a data model that serves as a necessary building block that enables automated information extraction from datasheets. Second, I explain how Fonduer can be used to build hardware component knowledge bases in practice. The multimodal information captured by Fonduer provides signals for training data generation as well as for augmenting deep learning models for multi-task learning. Finally, I demonstrate the utility of this approach with end-to-end applications and empirical results from real-world use cases. I implement this approach with a dataset of over 15,000 datasheets of three types of components. When extracting multiple electrical characteristics, this implementation achieves an average quality of 77 F1 points—quality that improves on existing human-curated knowledge bases by 12 F1 points. In one case where existing knowledge bases are scarce (product thumbnails of circular connectors) this implementation improves on the F1 score by 12x.
AbstractRecent work has established that better scheduling can drastically improve the throughput, tail latency, scalability, and security of important workloads. However, kernel schedulers are difficult to implement and cannot be updated without a full reboot. Researchers are bypassing the kernel complexity, including the challenge of communicating constraints around optimization targets that are often in conflict (e.g. latency versus power), by evaluating new scheduling policies within bespoke data plane operating systems. However, it is difficult to maintain and deploy a custom OS image for every impor- tant application, particularly in a shared environment. Hence, the practical benefits of new scheduling research have been limited. We present ghOSt, a general-purpose delegation of schedul- ing policy implemented on top of the Linux kernel. ghOSt provides a rich API that receives scheduling decisions for kernel threads from user code and actuates them as transac- tions. Programmers can use any language or tools to develop policies, which can be upgraded without a reboot. We develop policies for μs-scale workloads and a production database to demonstrate that ghOSt supports a wide range of scheduling models (per-CPU to centralized, run-to-completion to preemp- tive) and incurs low overheads for scheduling actions. Many policies are just a few hundred lines of code. Overall, ghOSt provides a performant framework for delegation of thread scheduling policy to userspace processes that enables policy optimization, non-disruptive upgrades, and fault isolation.
BioJack Humphries is a software engineer in the Google Cloud Systems Infrastructure group where he works on kernel scheduling for Google's data centers. He completed his Bachelor's degree in Computer Science at Stanford in 2019. He is currently a Master's student in Computer Science at Stanford, advised by Professors Christos Kozyrakis and David Mazières, and will start his Ph.D. in autumn 2021.
AbstractThe modern storage landscape is changing at an exciting rate. New technologies, such as Intel DC Persistent Memory, are being introduced. At the same time, new applications such as blockchain are emerging with new requirements from the storage subsystem. New regulations, such as the General Data Protection Regulation (GDPR), place new constraints on how data may be read and written. As a result, designing storage systems that satisfy these constraints is interesting and challenging. In this talk, I will describe the lessons we learnt from tackling this challenge in various forms: my group has built file systems and concurrent data structures for persistent memory, storage solutions for blockchains and machine learning, and analyzed how the GDPR regulation affects storage systems.
BioVijay Chidambaram is an Assistant Professor in the Computer Science department at the University of Texas at Austin. He did his post-doc at the VMware Research Group, and got his PhD with Prof. Remzi and Andrea Arpaci-Dusseau at the University of Wisconsin-Madison. His papers have won Best Paper Awards in ATC 2018, FAST 2018, and FAST 2017. He was awarded the NSF CAREER Award in 2018, SIGOPS Dennis M. Ritchie Dissertation Award in 2016, and the Microsoft Research Fellowship in 2014. Techniques from his work have been incorporated into commercial products, and his work has helped make the Linux kernel more reliable.
AbstractToday's datacenter applications couple scale and time: applications that harness large numbers of servers also execute for long periods of time (seconds or more). This paper explores the possibility of flash bursts: applications that use a large number of servers but for very short time intervals (as little as one millisecond). In order to learn more about the feasibility of flash bursts, we developed MilliSort and MilliQuery. MilliSort is a sorting application and MilliQuery implements three SQL queries. The goal for both applications was to process as many records as possible in one millisecond, given unlimited resources in a datacenter. The short time scale required a new distributed sorting algorithm for MilliSort that uses a hierarchical form of partitioning. Both applications depended on fast group communication primitives such as shuffle and all-gather. Our implementation of MilliSort can sort 0.84 million items in one millisecond using 120 servers on an HPC cluster; MilliQuery can process .03--48 million items in one millisecond using 60-280 servers, depending on the query. The number of items that each application can process grows quadratically with the time budget. The primary obstacle to scalability is per-message costs, which appear in the form of inefficient shuffles and coordination overhead.
AbstractDistributed protocols make it possible to build scalable and reliable systems, but come at a performance cost. Recent advances in accelerators have yielded major improvements in single-node performance, increasingly leaving distributed communication as a bottleneck. In this talk, I’ll argue that in-network computation can serve as the missing accelerator for distributed systems. Enabled by new programmable switches and NICs that can place small amounts of computation directly in the network fabric, we can speed up common communication patterns for distributed systems, and reach new levels of performance. I’ll describe three systems that use in-network acceleration to speed up classic communication and coordination challenges. First, I’ll show how to speed up state machine replication using a network sequencing primitive. The ordering guarantees it provides allow us to design a new consensus protocol, Network-Ordered Paxos, with extremely low performance overhead. Second, I’ll show that even a traditionally compute-bound workload -- ML training -- can now be network-bound. Our new system, SwitchML, alleviates this bottleneck by accelerating a common communication pattern using a programmable switch. Finally, I’ll show that using in-network computation to manage the migration and replication of data, in a system called Pegasus, allows us to load-balance a key-value store to achieve high utilization and predictable performance in the face of skewed workloads.
BioDan Ports is a Principal Researcher at Microsoft Research and Affiliate Assistant Professor in Computer Science and Engineering at the University of Washington. Dan’s background is in distributed systems research, and more recently he has been focused on how to use new datacenter technologies like programmable networks to build better distributed systems. He leads the Prometheus project at MSR, which uses this co-design approach to build practical high-performance distributed systems. Dan received a Ph.D. from MIT (2012). His research has been recognized with best paper awards at NSDI and OSDI.
AbstractUsing modern cloud computing platforms, a consumer can rapidly acquire thousands of CPUs, featuring terabytes of aggregate memory and hundreds of gigabits of bandwidth to shared storage, on demand. This abundance of resources offers the promise of low-latency execution of expensive graphics jobs, such as path tracing. In this talk, I present R2E2, the first system architected to perform low latency path tracing of terabyte-scale scenes using serverless computing nodes in the cloud. R2E2 is a parallel renderer that leverages elastic cloud platforms (availability of many CPUs/memory in aggregate and massively parallel access to shared storage) and mitigates the cloud's limitations (low per-node memory capacity and high latency inter-node communication). R2E2 rapidly acquires thousands of cloud CPU cores, loads scene geometry from a pre-built scene BVH into the aggregate memory of these nodes in parallel, and performs full path traced global illumination using an inter-node messaging service designed for communicating ray data. Scenes with up to a terabyte of geometry can be path traced at high resolution, in a few minutes, using thousands of tiny serverless nodes on the AWS Lambda platform.
AbstractIn this work, we explore how programmable data planes can naturally provide a higher-level of service to user applications via a new abstraction called packet subscriptions. Packet subscriptions generalize forwarding rules, and can be used to express both traditional routing and more esoteric, content-based approaches. We present strategies for routing with packet subscriptions in which a centralized controller has a global view of the network, and the network topology is organized as a hierarchical structure. We also describe a compiler for packet subscriptions that uses a novel BDD-based algorithm to efficiently translate predicates into P4 tables that can support O(100K) expressions. Using our system, we have built three diverse applications. We show that these applications can be deployed in brownfield networks while performing line-rate message processing, using the full switch bandwidth of 6.5Tbps.
AbstractData volumes are growing exponentially, fueled by an increased number of automated processes such as sensors and devices. Meanwhile, the computational power available for processing this data – as well as analysts’ ability to interpret it – remain limited. As a result, database systems must evolve to address these new bottlenecks in analytics. In my work, I ask: how can we adapt classic ideas from database query processing to modern compute- and attention-limited data analytics? In this talk, I will discuss the potential for this kind of systems development through the lens of several practical systems I have developed. By drawing insights from database query optimization, such as pushing workload- and domain-specific filtering, aggregation, and sampling into core analytics workflows, we can dramatically improve the efficiency of analytics at scale. I will illustrate these ideas by focusing on two systems — one designed for high-volume seismic waveform analysis and one designed to optimize visualizations for streaming infrastructure and application telemetry — both of which have been field-tested at scale. I will also discuss lessons from production deployments at companies including Datadog, Microsoft, Google and Facebook.
AbstractDeep Learning models have enabled state-of-the-art results across a broad range of applications; however, training these models is extremely time- and resource-intensive, taking weeks on clusters with thousands of expensive accelerators in the extreme case. In this talk, I will describe two systems that improve the resource efficiency of model training. The first system, PipeDream, proposes a new primitive called pipeline parallelism to accelerate distributed training. Pipeline parallelism facilitates model training with lower communication overhead than previous methods while still ensuring high compute resource utilization. Pipeline parallelism also enables the efficient training of large models that do not fit on a single worker. Pipeline parallelism is being used at Facebook, Microsoft, OpenAI, and Nvidia for efficient large-scale model training. The second system, Gavel, determines how resources in a shared cluster with heterogeneous compute resources (e.g., different types of hardware accelerators) should be partitioned among different users to optimize objectives specified over multiple training jobs. Gavel can improve various scheduling objectives, such as average completion time, makespan, or cloud computing resource cost, by up to 3.5x. I will conclude the talk with discussion on future directions for optimizing Machine Learning systems.
AbstractMachine learning (ML) can now be used to answer a range of queries over unstructured data (e.g., videos, text) by extracting structured information over this data (e.g., object types and positions in videos). Unfortunately, these ML methods can be prohibitively expensive to deploy for many organizations. In this talk, I'll first describe algorithms to accelerate ML-based queries using approximations to expensive ML methods. I'll describe algorithms to accelerate selection, aggregation, and limit queries so that query results have statistical guarantees on accuracy, despite using approximations. These algorithms can accelerate queries by orders of magnitude compared to recent work. I'll then describe how to use new programming abstractions to find errors in ML models and in human-generated labels. These abstractions, model assertions and learned observation assertions, can find errors in mission-critical datasets and can be used to improved ML model performance by up to 2x. Time permitting, I'll discuss ongoing collaborations with the Toyota Research Institute and Stanford ecologists on deploying my research.
AbstractMachine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpredictable execution times. Yet the underlying execution times are not fundamentally unpredictable --- on the contrary we observe that inference using Deep Neural Network (DNN) models has deterministic performance. In this talk, starting with the predictable execution times of individual DNN inferences, I will show how we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance. I will discuss the evaluation of our implementation, Clockwork, using production trace workloads, and show that Clockwork can support thousands of models while simultaneously meeting 100 ms latency targets for 99.997% of requests. Finally, I will demonstrate that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation. Clockwork is a collaboration between researchers at Emory University and MPI-SWS, and is available online at https://gitlab.mpi-sws.org/cld/ml/clockwork.
BioYmir Vigfusson is Assistant Professor of Computer Science at Emory University, co-PI of the Emory SimBioSys lab, a CDC Guest Researcher, and co-founder and Chief Science Officer of the cybersecurity companies Syndis and Adversary (acquired by Secure Code Warrior in 2020). His research focuses on the scalability of distributed systems, where caching systems hold a special place in his heart, as well as network science, computational epidemiology (influenza and malaria), and cybersecurity education. He holds a PhD in Computer Science from Cornell University and was on faculty at Reykjavik University in Iceland before joining Emory in 2014. Ymir is a former hacker, an NSF CAREER awardee, a father of four (please send help), a pianist and a private pilot, (I beg you), and that blasted Reviewer #3.
AbstractSecurity researchers depend on visibility into large volumes of network traffic in order to answer questions about security recommendations, new protocols, malware, and more. However, as traffic speeds increase to 100GbE and beyond, network data from large ISPs or enterprise networks is becoming more difficult to obtain and analyze. Historically, traffic analysis has been restricted to small networks due to the performance limitations of current tools. However, recent advances in x86 servers, NICs, and fast packet I/O libraries like DPDK and PF_RING show potential for performing passive analysis on high-speed traffic. I will present our work in progress on building a framework for high-speed passive analysis on a single server. Our goal is to allow users to subscribe to subsets of reassembled application-layer data (e.g., all TLS handshakes to Netflix), and automate many of the mechanical aspects of collecting network data, including implementing filters, reconstructing TCP streams at high speeds, and load balancing packet processing across cores.
AbstractWe need affordable and innovative sensing to assist in tackling global-scale problems. Radars have become a common tool in sensing applications. However, target detection and identification remain challenges. This talk presents our system that combines RF backscatter tags with commodity UWB radar to cooperatively detect and identify objects in realtime. Our approach makes existing sensing applications, like localization, significantly easier, as well as enabling novel applications. I describe how even in cluttered static environments without line-of-sight, the return amplitude of our backscatter tags can be 100-10,000x larger than the strongest noise signal. I will also present two example applications we implemented to demonstrate the potential of our system: soil moisture sensing and respiration monitoring.
AbstractSpecialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing schedulers for clusters of accelerators, which are used to arbitrate these expensive training resources across many users, have shown how to optimize for various multi-job, multi-user objectives, like fairness and makespan. Unfortunately, existing schedulers largely do not consider performance heterogeneity. In this paper, we propose Gavel, a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies. Gavel expresses these policies as optimization problems and then systematically transforms these problems into heterogeneity-aware versions using an abstraction we call effective throughput. Gavel then uses a round-based scheduling mechanism to ensure jobs receive their ideal allocation given the target scheduling policy. Gavel's heterogeneity-aware policies allow a heterogeneous cluster to sustain higher input load, and improve end objectives such as makespan and average job completion time by 1.4x and 3.5x compared to heterogeneity-agnostic policies.
Credit to the Stanford MLSys Seminar Series for the site theme!