Streaming systems continuously ingest and process data from a variety of data sources. They build on append-only data structures to enable efficient write and read access, targeting low-latency end-to-end. As more of the data sources in applications are machines, the expected volume of continuously generated data has been growing and is expected to grow further . Such growth puts pressure on streaming systems to handle machine-generated workloads not only with low latency, but also with high throughput to accommodate high volumes of data.
Pravega (“good speed” in Sanskrit) is an open-source storage system for streams that we have built from the ground up to ingest data from continuous data sources and meet the stringent requirements of such streaming workloads. It provides the ability to store an unbounded amount of data per stream using tiered storage while being elastic, durable and consistent. Both the write and read paths of Pravega have been designed to provide low latency along with high throughput for event streams in addition to features such as long-term retention and stream scaling. This post is a performance evaluation of Pravega focusing on the ability of reading and writing.
To contrast with different design choices, we additionally show results from other systems: Apache Kafka and Apache Pulsar. Initially qualified as messaging systems, both Pulsar and Kafka make a conscious effort to become more like a storage system; they have recently added features like tiered storage. These systems have made fundamentally different design choices, however, leading to different behavior and performance characteristics that we explore in this post.