
Pravega Client API 101

Introduction

The fundamentals of stream semantics in Pravega are best learned through familiarity with its client APIs. In this article, we give an overview of Pravega's client APIs through a handful of simple examples. By the end, you should have seen Pravega in action, understood the guarantees afforded by Pravega streams, and gained some familiarity with several of the facilities provided by the client API.

Pravega client APIs provide read and write access to data streams. Streams store sequences of bytes. Writers commit new sequences of bytes at the tail position(s) of a stream. Writes to a single stream can be split across shards, called segments; and when writes are accompanied by routing keys, they are ordered within the segments those keys map to.
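For a flavor of what this looks like in code, here is a minimal sketch of writing events with the Pravega Java client. The controller URI (tcp://localhost:9090), scope (examples), and stream name (numbers) are assumptions chosen for illustration, and the stream is assumed to already exist.

    import io.pravega.client.ClientConfig;
    import io.pravega.client.EventStreamClientFactory;
    import io.pravega.client.stream.EventStreamWriter;
    import io.pravega.client.stream.EventWriterConfig;
    import io.pravega.client.stream.impl.UTF8StringSerializer;

    import java.net.URI;

    public class WriterExample {
        public static void main(String[] args) throws Exception {
            ClientConfig clientConfig = ClientConfig.builder()
                    .controllerURI(URI.create("tcp://localhost:9090")) // assumed local controller
                    .build();

            try (EventStreamClientFactory clientFactory =
                         EventStreamClientFactory.withScope("examples", clientConfig);
                 EventStreamWriter<String> writer = clientFactory.createEventWriter(
                         "numbers", new UTF8StringSerializer(), EventWriterConfig.builder().build())) {

                // Events written with the same routing key land in the same segment
                // and are therefore delivered to readers in write order.
                writer.writeEvent("sensor-1", "reading-1");
                writer.writeEvent("sensor-1", "reading-2");
                writer.flush();
            }
        }
    }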

Streams have scaling policies that allow them to split into several parallel segments. A Pravega stream is akin to a UNIX file opened in append mode, where several writers are guaranteed to append after each other and not overwrite each other's contents. An open append-only file typically has a single data stream, whereas a stream in Pravega can have many parallel data streams, called segments, allowing an influx of writes to scale horizontally across a cluster according to scaling policies and routing keys. Unlike most other distributed message-passing or data storage systems, the parallelism of a stream can change over time according to write throughput. Writes are distributed across these parallel stream segments, and the number of segments can grow or shrink over time.
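To sketch how a scaling policy is attached to a stream when it is created, the snippet below uses the StreamManager admin API; the scope, stream name, and target rate are again assumptions chosen for illustration.

    import io.pravega.client.ClientConfig;
    import io.pravega.client.admin.StreamManager;
    import io.pravega.client.stream.ScalingPolicy;
    import io.pravega.client.stream.StreamConfiguration;

    import java.net.URI;

    public class CreateStreamExample {
        public static void main(String[] args) throws Exception {
            ClientConfig clientConfig = ClientConfig.builder()
                    .controllerURI(URI.create("tcp://localhost:9090"))
                    .build();

            try (StreamManager streamManager = StreamManager.create(clientConfig)) {
                streamManager.createScope("examples");
                // Start with 2 segments; segments are split or merged automatically as the
                // write rate moves past roughly 100 events/sec per segment.
                StreamConfiguration config = StreamConfiguration.builder()
                        .scalingPolicy(ScalingPolicy.byEventRate(100, 2, 2))
                        .build();
                streamManager.createStream("examples", "numbers", config);
            }
        }
    }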


Deploying Pravega on Kubernetes

Pravega is a storage system for data streams with an innovative design and an attractive set of features for today's stream processing requirements (e.g., event ordering, scalability, performance, etc.). The project has plenty of documentation and great blog posts that explain every technical aspect of Pravega in detail. But if you are more interested in getting your first Pravega cluster up and running in the cloud, and would rather leave the technical reading for later, then you are in the right place.

We show you how to deploy your first Pravega cluster on Kubernetes. We provide a step-by-step guide to deploying Pravega on both Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS). Our goal is to keep things as simple as possible while providing valuable insights into the services that form a Pravega cluster and the operators we developed to deploy them. For this reason, although we have a "one-click" Pravega installation ready to use, we encourage you to work through this blog post. If you do, you will end up with a Pravega cluster ready to serve sample applications and, why not, even your first application working against Pravega. Let's get started!


Exactly-Once Processing Using Apache Flink and Pravega Connector

This blog post provides an overview of how Apache Flink and the Pravega connector work under the hood to provide end-to-end exactly-once semantics for streaming data pipelines.

Overview

Pravega [4] is a storage system that exposes the stream as a storage primitive for continuous and unbounded data. A Pravega stream is a durable, elastic, append-only, unbounded sequence of bytes with a strong consistency model, guaranteeing data durability (writes are acknowledged to the client only once they are durable), message ordering (events with the same routing key are delivered to readers in the order they were written), and exactly-once semantics (duplicate event writes are not allowed).
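Transactions are one of the client-side building blocks behind these guarantees: events written within a transaction become visible to readers atomically, only when the transaction commits. Below is a minimal sketch, assuming a scope (examples) and stream (mystream) that already exist; all names are illustrative.

    import io.pravega.client.ClientConfig;
    import io.pravega.client.EventStreamClientFactory;
    import io.pravega.client.stream.EventWriterConfig;
    import io.pravega.client.stream.Transaction;
    import io.pravega.client.stream.TransactionalEventStreamWriter;
    import io.pravega.client.stream.impl.UTF8StringSerializer;

    import java.net.URI;

    public class TxnWriterExample {
        public static void main(String[] args) throws Exception {
            ClientConfig clientConfig = ClientConfig.builder()
                    .controllerURI(URI.create("tcp://localhost:9090"))
                    .build();

            try (EventStreamClientFactory clientFactory =
                         EventStreamClientFactory.withScope("examples", clientConfig);
                 TransactionalEventStreamWriter<String> writer = clientFactory.createTransactionalEventWriter(
                         "mystream", new UTF8StringSerializer(), EventWriterConfig.builder().build())) {

                Transaction<String> txn = writer.beginTxn();
                txn.writeEvent("order-42", "item-added");
                txn.writeEvent("order-42", "order-confirmed");
                // Either both events above become readable, or neither does.
                txn.commit();
            }
        }
    }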

Pravega was designed to support a new generation of streaming applications that process large amounts of continuously arriving data to derive deep insights. Pravega relies on stream processing frameworks to process and transform the data, and it provides the storage primitives those frameworks need to operate on the data and reason about it.

Apache Flink is a distributed stream processor with intuitive and expressive APIs for implementing stateful stream processing applications. By combining the features of Apache Flink and Pravega, it is possible to build a pipeline comprising multiple Flink applications that can be chained together to give end-to-end exactly-once guarantees across the chain of applications.
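To make this concrete, here is a rough sketch of a Flink job wired to Pravega through the connector's exactly-once writer mode. The scope, stream names, and controller URI are assumptions; checkpointing must be enabled because the exactly-once sink commits its Pravega transactions on Flink checkpoints.

    import io.pravega.connectors.flink.FlinkPravegaReader;
    import io.pravega.connectors.flink.FlinkPravegaWriter;
    import io.pravega.connectors.flink.PravegaConfig;
    import io.pravega.connectors.flink.PravegaWriterMode;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    import java.net.URI;

    public class ExactlyOnceJob {
        public static void main(String[] args) throws Exception {
            PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                    .withControllerURI(URI.create("tcp://localhost:9090"))
                    .withDefaultScope("examples");

            FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
                    .withPravegaConfig(pravegaConfig)
                    .forStream("input-stream")
                    .withDeserializationSchema(new SimpleStringSchema())
                    .build();

            FlinkPravegaWriter<String> sink = FlinkPravegaWriter.<String>builder()
                    .withPravegaConfig(pravegaConfig)
                    .forStream("output-stream")
                    .withSerializationSchema(new SimpleStringSchema())
                    .withEventRouter(event -> event)                  // routing key derived from the event
                    .withWriterMode(PravegaWriterMode.EXACTLY_ONCE)   // write through Pravega transactions
                    .build();

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(10_000);   // transactions are committed on successful checkpoints

            env.addSource(source)
               .map(String::toUpperCase)
               .addSink(sink);

            env.execute("pravega-exactly-once-example");
        }
    }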

Exploring State Synchronizer

Pravega allows state to be shared in a consistent fashion across multiple cooperating processes distributed in a cluster using a State Synchronizer. This blog details how to use State Synchronizer [1] to build and maintain consistency in a distributed application.

State Synchronizer

In distributed systems, state frequently needs to be shared across multiple instances of an application. If this information is on the data path, it typically goes through whatever datastore is appropriate for the application. Usually, we choose our datastore carefully based on the requirements of our application.

When we have state that needs to be used by multiple processes but is not related to the application's data, like a schema registry or cluster membership, it's worth considering alternative storage options because the requirements might be totally different. Metadata often doesn't fit neatly into the data path's schema or consistency model, so having a different storage solution for it frequently makes sense. Sometimes the importance of this is underappreciated and such state handling is implemented as an afterthought.
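As a sketch of what this looks like with the Java client, the outline below keeps a tiny piece of shared cluster state (a membership-style set of instance names) in sync across processes. The state type implements Revisioned, changes are modeled as Update objects, and the StateSynchronizer applies them with optimistic, revision-checked writes. All names here are illustrative, and the backing stream is assumed to already exist.

    import io.pravega.client.ClientConfig;
    import io.pravega.client.SynchronizerClientFactory;
    import io.pravega.client.state.InitialUpdate;
    import io.pravega.client.state.Revision;
    import io.pravega.client.state.Revisioned;
    import io.pravega.client.state.StateSynchronizer;
    import io.pravega.client.state.SynchronizerConfig;
    import io.pravega.client.state.Update;
    import io.pravega.client.stream.impl.JavaSerializer;

    import java.io.Serializable;
    import java.net.URI;
    import java.util.HashSet;
    import java.util.Set;

    public class MembershipExample {

        // The shared state: immutable, and tagged with the revision it corresponds to.
        static class Members implements Revisioned {
            private final String scopedStreamName;
            private final Revision revision;
            final Set<String> names;

            Members(String scopedStreamName, Revision revision, Set<String> names) {
                this.scopedStreamName = scopedStreamName;
                this.revision = revision;
                this.names = names;
            }

            @Override public String getScopedStreamName() { return scopedStreamName; }
            @Override public Revision getRevision() { return revision; }
        }

        // An update that adds one member; applied identically by every process.
        static class AddMember implements Update<Members>, Serializable {
            final String name;
            AddMember(String name) { this.name = name; }

            @Override
            public Members applyTo(Members old, Revision newRevision) {
                Set<String> names = new HashSet<>(old.names);
                names.add(name);
                return new Members(old.getScopedStreamName(), newRevision, names);
            }
        }

        // The initial (empty) state, written by whichever process initializes first.
        static class InitMembers implements InitialUpdate<Members>, Serializable {
            @Override
            public Members create(String scopedStreamName, Revision revision) {
                return new Members(scopedStreamName, revision, new HashSet<>());
            }

            // Also usable as a plain update that resets the state.
            public Members applyTo(Members oldState, Revision newRevision) {
                return create(oldState.getScopedStreamName(), newRevision);
            }
        }

        public static void main(String[] args) throws Exception {
            ClientConfig clientConfig = ClientConfig.builder()
                    .controllerURI(URI.create("tcp://localhost:9090"))
                    .build();

            try (SynchronizerClientFactory factory =
                         SynchronizerClientFactory.withScope("examples", clientConfig);
                 StateSynchronizer<Members> sync = factory.createStateSynchronizer(
                         "membership",                      // backing stream, assumed to exist
                         new JavaSerializer<AddMember>(),
                         new JavaSerializer<InitMembers>(),
                         SynchronizerConfig.builder().build())) {

                sync.initialize(new InitMembers());          // no-op if already initialized

                // Propose an update; it is retried against the latest state if another
                // process updated the stream concurrently.
                sync.updateState((state, updates) -> {
                    updates.add(new AddMember("instance-1"));
                });

                sync.fetchUpdates();                         // catch up with everyone else's updates
                System.out.println("Members: " + sync.getState().names);
            }
        }
    }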