Browse Author: Tom Kaitchuck

Pravega Watermarking Support 

Tom Kaitchuck and Flavio Junqueira

Motivation 

Stream processing broadly refers to the ability to ingest data from unbounded sources and process that data as it is ingested. The data can be user-generated, as in social networks or other online applications, or machine-generated, as in server telemetry or sensor samples from IoT and Edge applications [1].

Stream processing applications typically process data following the order in which the data is produced. Following a total order strictly is often not practically possible for a couple of important reasons: 

  1. The source is not a single element as it might comprise multiple users, servers, or gateways; 
  2. Inherent choices of the application design might cause items to be ingested and processed out of order. 

Consequently, order in Pravega and similar systems refers to the order in which the data is ingested, and it is determined by some concept, such as keys, that connects elements of the data stream.

The ability to process data following the order of generation, even if only loosely, is one of the most interesting aspects of stream processing, as it enables an application to establish temporal correlations among different events. For example, an application can ask how many distinct users signed in during the last hour, or how many distinct sensors reported an anomaly in the past 10 minutes. To answer such queries, the application must be able to produce a result for every reporting period: every hour in the first example and every 10 minutes in the second. These reporting periods are often referred to as time windows [2].
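To make the idea of a reporting period concrete, here is a minimal sketch of counting distinct users per fixed one-hour window. It is plain Python and does not use the Pravega API; the function name `distinct_users_per_window`, the `(event_time_seconds, user_id)` event shape, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # one-hour reporting period (assumed for this sketch)


def distinct_users_per_window(events, window_seconds=WINDOW_SECONDS):
    """Count distinct users in each fixed (tumbling) time window.

    `events` is an iterable of (event_time_seconds, user_id) pairs.
    Events may arrive out of order; each one is assigned to a window
    by its own timestamp, not by its arrival position.
    """
    users_by_window = defaultdict(set)
    for event_time, user_id in events:
        # Round the timestamp down to the start of its window.
        window_start = (event_time // window_seconds) * window_seconds
        users_by_window[window_start].add(user_id)
    return {start: len(users) for start, users in sorted(users_by_window.items())}


# Hypothetical sign-in events, deliberately out of order.
events = [(10, "alice"), (3700, "bob"), (20, "bob"), (3599, "alice"), (7100, "carol")]
counts = distinct_users_per_window(events)
print(counts)  # {0: 2, 3600: 2}
```

This sketch sees all events before it reports anything. A streaming application does not have that luxury: it has to decide when a window is complete even though late events may still arrive, which is the question that watermarks address.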

Exploring State Synchronizer

Pravega allows state to be shared consistently across multiple cooperating processes distributed in a cluster using a State Synchronizer. This blog post details how to use State Synchronizer [1] to build and maintain consistency in a distributed application.

State Synchronizer

In distributed systems, state frequently needs to be shared across multiple instances of an application. If this information is on the data path, it typically goes through whatever datastore is appropriate for the application. Usually, we choose our datastore carefully based on the requirements of our application.

When we have state that needs to be used by multiple processes but is not related to the application’s data, like a schema registry or cluster membership, it’s worth considering alternative storage options because the requirements might be totally different. Often metadata doesn’t fit neatly in the data path’s schema or consistency model, so having different storage solutions often makes sense. Sometimes the importance of this is underappreciated, and metadata storage is implemented as an afterthought.