(Update 2017 May 25) A more detailed and refined description is available at https://wtanaka.com/beam.
Apache Beam is a formal definition of a few core data structures:
PCollection -- an immutable but possibly infinitely long list of elements (originally defined by the FlumeJava paper)
PTransform -- a function that converts one PCollection into another PCollection (which might be longer, the same length, shorter like a histogram, or even length 1 like a sum or a count)
Pipeline -- a directed acyclic graph of PTransforms that defines a calculation you want to perform on one or more PCollections. A Pipeline can have any number of input PCollections and any number of output PCollections.
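The relationships among these three structures can be sketched with a minimal toy model. This is plain Python for illustration only, not the real Beam API: the class names mirror Beam's, but everything else here is an invented stand-in.

```python
from typing import Callable, Iterable, List

class PCollection:
    """Toy stand-in: an immutable sequence of elements."""
    def __init__(self, elements: Iterable):
        self._elements = tuple(elements)  # frozen, echoing Beam's immutability
    def apply(self, transform: "PTransform") -> "PCollection":
        return transform(self)
    def to_list(self) -> List:
        return list(self._elements)

class PTransform:
    """Toy stand-in: a function from one PCollection to another."""
    def __init__(self, fn: Callable[[Iterable], Iterable]):
        self._fn = fn
    def __call__(self, pcoll: PCollection) -> PCollection:
        return PCollection(self._fn(pcoll.to_list()))

# A transform's output can be the same length as its input (element-wise doubling)...
double = PTransform(lambda xs: [2 * x for x in xs])
# ...or length 1, like a sum.
total = PTransform(lambda xs: [sum(xs)])

# Chaining transforms forms a tiny two-node "pipeline".
numbers = PCollection([1, 2, 3])
result = numbers.apply(double).apply(total).to_list()
print(result)  # → [12]
```

The toy model collapses the key point Beam makes explicit: a real Pipeline is a graph you *define* first and *run* later, rather than functions executed eagerly as here.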
Using those core data structures, you can specify a program by creating a Pipeline definition. Beam also provides utility code that executes the program defined by that Pipeline on multiple computers in parallel, using your choice of several stream processing frameworks such as Flink and Spark.