Yet Another Cache but for the Streaming World

Traditional cache solutions treat each entry as an immutable blob of data, which poses problems for the append-heavy ingestion workloads that are common in Pravega. Each Event appended to a Stream would either require its own cache entry or need an expensive read-modify-write operation to be included in the Cache. To enable high-performance ingestion of events, big or small, while also providing near-real-time tail reads and high-throughput historical reads, Pravega needs a specialized cache that can natively support the types of workloads that are prevalent in Streaming Storage Systems.

The Streaming Cache, introduced in Pravega with release 0.7, was designed from the ground up with streaming data in mind: it optimizes for appends and organizes the data in a layout that makes eviction and disk spilling easy.
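To make the contrast with a classic get/put cache concrete, here is a rough sketch of what an append-friendly cache contract might look like. The interface and its names are hypothetical, for illustration only, and not Pravega's actual cache API.

```java
// Hypothetical sketch of an append-friendly cache contract. Entries are
// addressable and can grow in place, so a small Event can be added to an
// existing entry without the read-modify-write cycle a blob cache would need.
public interface StreamingCacheSketch {

    /** Creates a new entry with the given contents and returns its address. */
    int insert(byte[] initialData);

    /** Appends data to an existing entry without copying its prior contents. */
    void append(int address, byte[] data);

    /** Reads the full contents of an entry. */
    byte[] get(int address);

    /** Deletes an entry so its space can be reused or spilled. */
    void delete(int address);
}
```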

Not all caches are created equal. It is essential to choose a cache that fits the requirements of the system where it will be used, and streaming solutions are no exception to that rule. In this blog post, we describe an innovative way to look at caching that works well with streaming use cases.

Segment Attributes

The ability to pipeline Events to the Segment Store is a key technique that the Pravega Client uses to achieve high throughput, even when dealing with small writes. A Writer appends an Event to its corresponding Segment as soon as the Event is received, without waiting for previous appends to be acknowledged. To guarantee ordering and exactly-once semantics, the Segment Store requires all such appends to be conditional on some known state, which is unique per Writer. This state is stored in each Segment’s Attributes and can be atomically queried and updated with every Segment operation.
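To make the mechanism concrete, here is a minimal, self-contained sketch of a conditional append gated on a per-Writer attribute. This is hypothetical illustration code, not the Segment Store's actual interface; the class and method names are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: an append is accepted only if the per-Writer attribute
// still holds the event number the Writer expects, and the attribute is
// updated in the same atomic step as the append itself.
public class ConditionalAppendSketch {
    private final Map<UUID, Long> attributes = new HashMap<>(); // per-Writer state
    private final StringBuilder segmentData = new StringBuilder();

    public synchronized boolean append(UUID writerId, long expectedEventNumber,
                                       long newEventNumber, String event) {
        long current = attributes.getOrDefault(writerId, 0L);
        if (current != expectedEventNumber) {
            return false; // duplicate or out-of-order append: rejected
        }
        segmentData.append(event);                 // apply the append...
        attributes.put(writerId, newEventNumber);  // ...and update the state atomically
        return true;
    }
}
```

A Writer that retries event N after a connection failure is rejected if the attribute has already advanced past N, which is what prevents duplication without a round trip per Event.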

Over time, Attributes have evolved to support a variety of use cases, from keeping track of the number of Events in a Segment (enabling auto-scaling) to storing a hash table index. The introduction of Table Segments (key-value stores that contain all of Pravega’s Stream, Transaction, and Segment metadata) required the ability to seamlessly manage tens of millions of such Attributes per Segment.

This blog post explains how Segment Attributes work under the hood to provide an efficient key-value store that represents the foundation for several higher-level features. It begins with an overview of how Pravega Writers use them to prevent data duplication or loss and follows up by describing how Segment Attributes are organized as B+Trees in Tier 2 using innovative compaction techniques that reduce write amplification.

Events Big or Small – Bring Them On

Streaming applications typically need to process events as soon as they arrive. For example, the ability to react quickly to events in applications such as fraud detection and manufacturing error detection can result in massive savings. However, because storage systems have traditionally been unable to handle large numbers of small writes, producers are forced to buffer events before writing. This not only increases the latency between event generation and event processing, but also increases the chance of losing events when writers cannot store them reliably in failure scenarios. Pravega has been built from the ground up to be an ideal store for stream processing: not only does it handle frequent small writes well, it also does so in a consistent and durable way.

This blog explains how Pravega provides excellent throughput with minimal latency for all writes, big or small. It explores the append path of the Segment Store, detailing how we pipeline external requests and use the append-only Tier 1 log to achieve excellent ingestion performance without sacrificing latency.
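The effect of this pipelining is visible from the client side: in the Pravega Java client, writeEvent returns a CompletableFuture immediately, so a Writer can keep many small appends in flight and wait for acknowledgments in bulk. A minimal sketch, assuming a Pravega controller at tcp://localhost:9090 and an existing scope myScope and stream myStream (both names illustrative):

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class PipelinedWrites {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")).build();
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("myScope", config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "myStream", new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {
            // Issue many small appends without waiting for individual acks;
            // the client pipelines them to the Segment Store.
            List<CompletableFuture<Void>> acks = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                acks.add(writer.writeEvent("routingKey", "event-" + i));
            }
            // Wait once for the whole batch; per routing key, order is preserved.
            CompletableFuture.allOf(acks.toArray(new CompletableFuture[0])).join();
        }
    }
}
```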

Segment Store Internals

The Pravega Segment Store Service is a subsystem that lies at the heart of the entire Pravega deployment. It is the main access point for managing Stream Segments, providing the ability to modify and read their contents. The Pravega Client communicates with the Pravega Stream Controller to identify which Segments to use for a Stream, and both the Stream Controller and the Client deal with the Segment Store Service to operate on them.

We explore the internal workings of the Segment Store, covering its components and how they interact; in future posts, we will take deeper dives into each of them, explaining how they work.

Pravega Internals

Several of the difficulties with tailing a data stream boil down to the dynamics of the source and of the stream processor. For example, if the source increases its production rate in an unplanned manner, then the ingestion system must be able to accommodate such a change. The same happens when a downstream processor experiences issues and struggles to keep up with the rate. To be able to accommodate all such variations, it is critical that a system for storing stream data, like Pravega, be sufficiently flexible.

The flexibility of Pravega comes from breaking stream data into segments: append-only sequences of bytes that are organized both sequentially and in parallel to form streams. Segments enable important features, like parallel reads and writes, auto-scaling, and transactions; they are designed to be inexpensive to create and maintain. We can create new segments for a given stream when it needs more parallelism, when it needs to scale, or when it needs to begin a transaction.
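Because segments are cheap to create, auto-scaling is exposed directly in the client API: a stream can be configured to grow or shrink its segment count based on load. A minimal sketch using the Pravega Java client, assuming a controller at tcp://localhost:9090 (scope and stream names are illustrative):

```java
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

import java.net.URI;

public class CreateScalingStream {
    public static void main(String[] args) {
        try (StreamManager streamManager =
                     StreamManager.create(URI.create("tcp://localhost:9090"))) {
            streamManager.createScope("myScope");
            // Target roughly 100 events/sec per segment, split/merge by a
            // factor of 2, and never drop below 2 segments.
            StreamConfiguration config = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.byEventRate(100, 2, 2))
                    .build();
            streamManager.createStream("myScope", "myStream", config);
        }
    }
}
```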

The control plane in Pravega is responsible for all the operations that affect the lifecycle of a stream, e.g., create, delete, and scale. The data plane stores and serves the data of segments. The full post includes a figure depicting the high-level Pravega architecture with its core components.

Streams in and out of Pravega

Introduction

Reading and writing are the most basic operations that Pravega offers. Applications ingest data by writing to one or more Pravega streams and consume it by reading from one or more streams. To implement applications correctly with Pravega, however, it is crucial that the developer be aware of some additional functionality that complements the core write and read calls. For example, writes can be transactional, and readers are organized into groups.
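As a minimal illustration of both sides, the sketch below writes one event and then reads it back through a reader group using the Pravega Java client. It assumes a controller at tcp://localhost:9090 and an existing scope myScope and stream myStream (for example, created as in the scaling sketch above); all names are illustrative.

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.net.URI;

public class WriteThenRead {
    public static void main(String[] args) throws Exception {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")).build();

        // Write one event and wait for its acknowledgment.
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("myScope", config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "myStream", new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {
            writer.writeEvent("routingKey", "hello, Pravega").join();
        }

        // Create a reader group over the stream; readers in the same group
        // share the stream's segments among themselves.
        try (ReaderGroupManager rgm = ReaderGroupManager.withScope("myScope", config)) {
            rgm.createReaderGroup("myGroup", ReaderGroupConfig.builder()
                    .stream(Stream.of("myScope", "myStream")).build());
        }

        // Read the event back as a member of that group.
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("myScope", config);
             EventStreamReader<String> reader = factory.createReader(
                     "reader-1", "myGroup", new UTF8StringSerializer(),
                     ReaderConfig.builder().build())) {
            EventRead<String> event = reader.readNextEvent(2000);
            // getEvent() may be null if the timeout expired without data.
            System.out.println("Read: " + event.getEvent());
        }
    }
}
```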

In this post, we cover some basic concepts and functionality that a developer must be aware of when developing an application with Pravega, focusing on reads and writes. We encourage the reader to additionally check the Pravega documentation site under the section “Developing Pravega Applications” for code snippets and more detail.

Storage Reimagined for a Streaming World

Streaming is driven by the desire to shrink to zero the time it takes to turn massive volumes of raw data into useful information and action, and it is deceptively simple: just process and act on data as it arrives, quickly, and in a continuous and infinite fashion.

For use cases from Industrial IoT to Connected Cars to Real-Time Fraud Detection and more, we’re increasingly looking to build new applications and customer experiences that react quickly to customer interests and actions, learn and adapt to changing behavior patterns, and the like. But the reality is that most of us don’t yet have the tools to do this with production-level data volumes, ingestion rates, and fault resiliency. So we do the best we can with bespoke systems, piling complexity on top of complexity.

Complexity is symptomatic of fundamental systems design mismatches: we’re using a component for something it wasn’t designed to do, and the mechanisms at our disposal won’t scale from small to large.