Apache Beam is a library of code that makes it easy to write "Beam code" that:
- Can easily run in parallel across many servers
- Is tolerant to sudden unexpected hardware failures
- Can output partial results based on partial inputs, and refine those partial results into a complete result as the input becomes complete, even if the data arrives out of order, or some data arrives late
- Is designed to be agnostic toward the stream processing engine it runs on
- Is designed to be agnostic toward programming language
Beam calls itself a "programming model" instead of a library. This is because a Beam program (called a Pipeline) is a directed acyclic graph, created by constructing and calling methods on Java classes. A significant fraction of top-level Pipeline code is setting attributes on nodes in the graph and defining the connectivity between them. There are plans to let this directed graph structure be specified in other ways in the future, for example with SQL.
The main data types are:
- A loosely-ordered, potentially infinite collection object called PCollection, which neither needs an upper bound on its size nor needs to fit on a single computer.
- A composable subroutine construct called PTransform, which defines a function from one PCollection to another.
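To make the two core types concrete, here is a minimal plain-Python sketch. This is not the Beam API: the class names mirror Beam's concepts, but the methods, eager evaluation, and in-memory lists are simplifications for illustration (a real Pipeline builds a deferred graph rather than running immediately).

```python
class PCollection:
    """A collection of elements (here: finite and in-memory for illustration)."""
    def __init__(self, elements):
        self.elements = list(elements)

    def apply(self, transform):
        # Applying a PTransform yields a new PCollection; in real Beam this
        # adds a node and edge to the deferred graph instead of running eagerly.
        return transform.expand(self)

class PTransform:
    """A composable subroutine: a function from one PCollection to another."""
    def __init__(self, fn):
        self.fn = fn

    def expand(self, pcoll):
        return PCollection(self.fn(x) for x in pcoll.elements)

# Build a tiny two-node "graph": parse strings to ints, then square them.
parse = PTransform(int)
square = PTransform(lambda x: x * x)
out = PCollection(["1", "2", "3"]).apply(parse).apply(square)
print(out.elements)  # [1, 4, 9]
```

Because each step is just a function from one collection to another, steps compose by chaining, which is exactly what makes the graph structure fall out of ordinary method calls.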
In a Beam Pipeline graph, the PTransform objects are often visualized as nodes, and the PCollection objects as the edges. A single PCollection can be the input to multiple PTransform objects; this can be visualized as multiple arrows representing the same PCollection.
PCollections "contain" any number of plain Java objects. Each individual object is called an element. Even without subclassing a library class, every element is associated with an implicit "event time" representing the time that the element originally "happened". Every element is also associated with one or more ranges of event times called Windows, which serve as the groupings for aggregation calculations.
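The event-time/window idea can be illustrated with stdlib Python alone. This is not Beam code: the fixed one-minute window size and the list of (timestamp, value) events are made up for the example, but the mechanics (grouping by event time rather than arrival order) are the point.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # fixed one-minute windows (an arbitrary illustrative choice)

# (event_time_seconds, value) pairs, deliberately out of arrival order.
events = [(5, 10), (130, 7), (62, 3), (59, 2), (125, 1)]

windows = defaultdict(list)
for event_time, value in events:
    # Each element lands in the window covering its *event* time,
    # so late or out-of-order arrival does not change the grouping.
    window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start].append(value)

# Aggregate (sum) per window.
sums = {start: sum(vals) for start, vals in windows.items()}
print(sorted(sums.items()))  # [(0, 12), (60, 3), (120, 8)]
```

Beam's windowing is far richer (sliding and session windows, triggers for refining partial results), but every variant reduces to this same move: aggregate by event-time window, not by arrival order.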
There is one special type of element, a key-value pair type called KV that, through a loose agreement with several of the PTransform classes, lets you quickly partition data across cluster machines by assigning non-equal K values, or group elements together onto the same machine by giving them equal K values.
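A plain-Python sketch of why equal keys end up on the same machine: route each KV pair by a deterministic hash of its key. The worker count and hash function here are invented for the illustration (real runners choose their own partitioning), but the invariant is the same: equal keys always map to the same shard.

```python
NUM_WORKERS = 3  # illustrative cluster size

def worker_for(key):
    # A deterministic stand-in hash; Python's built-in hash() is salted
    # per process, so we sum the key's bytes instead for repeatability.
    return sum(key.encode()) % NUM_WORKERS

kvs = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4)]

shards = {i: [] for i in range(NUM_WORKERS)}
for key, value in kvs:
    shards[worker_for(key)].append((key, value))

# Both ("apple", ...) pairs are guaranteed to land in the same shard.
apple_shard = worker_for("apple")
print(shards[apple_shard])
```

This is the mechanism behind transforms like GroupByKey: because routing depends only on the key, a grouping step can be computed per machine without any cross-machine coordination about individual elements.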
Beam lets you implement your own PTransform functions and attach them to your graph. The code in one PTransform is isolated from the code in any other PTransform; the only way for them to communicate is if they are connected through a PCollection. There are plans to eventually let different nodes in a single Pipeline run code in different languages.
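The isolation property can be sketched in plain Python (again, not Beam's API): each user-defined step is a pure function of its input collection, so the only channel between two steps is the intermediate collection passed from one to the next. The function names below are invented for the example.

```python
def tokenize(lines):
    """User-defined step #1: split lines into words."""
    return [word for line in lines for word in line.split()]

def count_words(words):
    """User-defined step #2: count occurrences of each word."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# The only link between the two steps is the intermediate collection
# `words`; neither function can see the other's internal state.
lines = ["to be or not", "to be"]
words = tokenize(lines)
print(count_words(words))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

This isolation is what makes the planned multi-language pipelines plausible: if steps only exchange serialized collections, the language each step is written in becomes an implementation detail of that node.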
Beam also contains:
- A library of Runner objects that translate and execute a Pipeline graph on various stream processing systems such as Flink, Apex, Google Cloud Dataflow, and Spark. These Runner objects are implemented on top of a separate class library in the Beam codebase from the one that you use to define a Pipeline.
- A library of connectors that let you wrap a variety of data sources (Kafka, JDBC databases, etc.) as PTransform objects that either take a PCollection as input or emit one as output.
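The connector shape can be sketched without any Beam dependency: a "source" transform takes no input collection and emits one, while a "sink" consumes a collection and emits nothing. The CSV-over-string source below is a made-up stand-in for real connectors like Kafka or JDBC.

```python
import csv
import io

def read_csv_source(text):
    """Source: takes no input PCollection, emits a collection of rows."""
    return [row for row in csv.reader(io.StringIO(text))]

def write_lines_sink(rows):
    """Sink: consumes a collection; here it returns the serialized output
    instead of writing to an external system."""
    return "\n".join(",".join(r) for r in rows)

rows = read_csv_source("a,1\nb,2")
print(rows)                    # [['a', '1'], ['b', '2']]
print(write_lines_sink(rows))  # a,1  /  b,2 on two lines
```

Modeling sources and sinks as ordinary transforms is what lets a Pipeline graph treat "read from Kafka" and "square every element" uniformly: both are just nodes with PCollection edges.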