Apache Beam

Apache Beam is a library that makes it easy to write code that:

  1. Can easily run in parallel across many servers
  2. Is tolerant to sudden unexpected hardware failures
  3. Can output partial results based on partial input, and refine those partial results into a complete result as the rest of the input arrives, even if the data arrives out of order or some of it arrives late
  4. Is designed to be agnostic toward the stream processing engine it runs on
  5. Is designed to be agnostic toward programming language

Beam calls itself a "programming model" instead of a library.  This is because a Beam program (called a Pipeline) is a directed acyclic graph, created by constructing and calling methods on Java classes.  A significant fraction of top-level Pipeline code sets attributes on nodes in the graph and defines the connectivity between them.  There are plans to let this directed graph structure be specified in other ways in the future, for example with SQL.
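
To make "constructing a graph" concrete, here is a minimal sketch in plain Python — not the Beam API, and all names (`Transform`, `apply`, `run`) are invented for illustration.  Top-level code only wires transform nodes together; nothing executes until a runner walks the finished graph.

```python
# Conceptual sketch only -- not the Beam API.  A pipeline is built by
# wiring transform nodes together; execution happens later, when a
# runner walks the completed graph.

class Transform:
    """A node in the pipeline graph: a named function over a collection."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.inputs = []          # edges into this node

    def apply(self, upstream):
        self.inputs.append(upstream)
        return self               # the return value stands in for the output edge

def run(sink, data):
    """Toy 'runner': recursively evaluate the graph that ends at `sink`."""
    if not sink.inputs:
        return sink.fn(data)
    upstream = run(sink.inputs[0], data)
    return sink.fn(upstream)

# Top-level "pipeline" code is mostly graph wiring, as described above:
read = Transform("Read", lambda data: data)
lengths = Transform("Lengths", lambda xs: [len(x) for x in xs]).apply(read)
total = Transform("Sum", lambda xs: sum(xs)).apply(lengths)

print(run(total, ["beam", "is", "a", "dag"]))  # 4 + 2 + 1 + 3 = 10
```

The point of the separation is that the same graph description can later be handed to different runners.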

The main data types are:

  1. An unordered, potentially unbounded collection type called PCollection, which need not have an upper bound on its size and need not fit on a single computer
  2. A composable subroutine construct called PTransform, which defines a function from one PCollection to another

In a Beam Pipeline graph, the PTransform objects are often visualized as nodes, and the PCollection objects as the edges. A single PCollection can be the input to multiple PTransform objects; in that case it is drawn as multiple arrows representing the same PCollection.
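
The fan-out case can be sketched in plain Python — again not the Beam API, just lists standing in for PCollections and functions standing in for PTransforms:

```python
# Conceptual sketch, not the Beam API: one "PCollection" (here just a
# list) can feed several independent "PTransforms" (here just functions).
words = ["beam", "runs", "on", "flink"]      # a single collection...

count = len(words)                           # ...consumed by one transform
longest = max(words, key=len)                # ...and, independently, by another

# In a pipeline diagram this is drawn as two arrows leaving the same
# edge, one into each transform node.
print(count, longest)  # 4 flink
```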

PCollections "contain" any number of plain Java objects. Each individual object is called an element. Even without subclassing a library class, every element is associated with an implicit "event time" representing the time that the element originally "happened". Every element is also associated with one or more ranges of event time called Windows, which serve as the groupings for aggregation calculations.
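
As a sketch of what window assignment means — not the Beam API; the function name and window size are invented — here is how fixed, non-overlapping 60-second windows would bucket elements by event time, even when they arrive out of order:

```python
# Conceptual sketch, not the Beam API: every element carries an event
# time, and windowing assigns each element to one or more ranges of
# event time.  Here: fixed (non-overlapping) 60-second windows.
from collections import defaultdict

WINDOW_SIZE = 60  # seconds

def fixed_window(event_time):
    """Map an event time to the [start, end) window that contains it."""
    start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
    return (start, start + WINDOW_SIZE)

# (value, event_time) pairs, arriving out of order:
elements = [("a", 5), ("b", 130), ("c", 59), ("d", 61)]

windows = defaultdict(list)
for value, event_time in elements:
    windows[fixed_window(event_time)].append(value)

print(dict(windows))
# {(0, 60): ['a', 'c'], (120, 180): ['b'], (60, 120): ['d']}
```

An aggregation such as a per-window sum or count would then operate on each of these groupings separately.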

There is one special type of element, a key-value pair type called KV, that, through a loose agreement with several of the PTransform classes, lets you quickly spread data across cluster machines by assigning elements unequal keys, or group elements together onto the same machine by assigning them equal keys.
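
A sketch of that agreement in plain Python — not the Beam API; `worker_for` is an invented stand-in for whatever routing a real runner uses:

```python
# Conceptual sketch, not the Beam API: key-value pairs let a runner
# route elements.  Equal keys land on the same worker (so their values
# can be grouped); unequal keys spread the data across workers.

def worker_for(key, num_workers):
    """Invented stand-in for a runner's key-based routing."""
    return hash(key) % num_workers

kvs = [("cat", 1), ("dog", 2), ("cat", 3), ("dog", 4), ("emu", 5)]

# Group-by-key: all values with the same key end up together.
grouped = {}
for key, value in kvs:
    grouped.setdefault(key, []).append(value)

print(grouped)  # {'cat': [1, 3], 'dog': [2, 4], 'emu': [5]}

# Pairs with equal keys are always routed to the same worker:
assert worker_for("cat", 4) == worker_for("cat", 4)
```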

Beam lets you implement your own PTransform functions and attach them to your graph.  The code in one PTransform is isolated from the code in any other PTransform; the only way for them to communicate is through a connecting PCollection.  There are plans to eventually let different nodes in a single Pipeline run code in different languages.
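
That isolation can be sketched in plain Python — not the Beam API; both function names are invented:

```python
# Conceptual sketch, not the Beam API: a user-defined transform is just
# a function from one collection to another, and two transforms can
# only communicate through the collection that connects them.

def to_upper(words):                # a user-defined "PTransform"
    return [w.upper() for w in words]

def drop_short(words, min_len=3):   # another, fully isolated from the first
    return [w for w in words if len(w) >= min_len]

# The only channel between them is the intermediate collection:
intermediate = to_upper(["beam", "is", "fun"])
result = drop_short(intermediate)
print(result)  # ['BEAM', 'FUN']
```

Because neither function can see the other's internals, a runner is free to execute them on different machines — or, per the plans mentioned above, eventually in different languages.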

Beam also contains:

  1. A library of Runner objects that translate and execute a Pipeline graph on various stream processing systems such as Flink, Apex, Google Cloud Dataflow, and Spark.  These Runner objects are implemented on top of a class library in the Beam codebase that is separate from the one you use to define Pipelines.
  2. A library of connectors that let you wrap a variety of data sources (Kafka, JDBC databases, etc.) as PTransform objects that either take a PCollection as input or emit one as output.