What is Apache Beam?


(Update 2017 May 25) A more detailed and refined description is available at https://wtanaka.com/beam.

Apache Beam is a formal definition of a few core data structures:

  • PCollection -- an immutable but possibly infinitely long list (originally described in the FlumeJava paper)
  • PTransform -- a function that converts a PCollection into another PCollection (which might be longer, the same length, shorter like a histogram, or even length 1 like a sum or a count)
  • Pipeline -- a directed acyclic graph of PTransforms that defines a calculation you want to perform on one or more PCollections. A Pipeline can be defined with any number of input PCollections and any number of output PCollections, as sketched in the example just below this list.
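
To make those definitions concrete, here is a minimal sketch using the Beam Java SDK. The class name WordLengthExample and the hard-coded input strings are purely illustrative, not part of Beam itself:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class WordLengthExample {
      public static void main(String[] args) {
        // A Pipeline is the directed acyclic graph of PTransforms.
        Pipeline p = Pipeline.create();

        // A bounded PCollection built from an in-memory list.
        PCollection<String> words = p.apply(Create.of("apache", "beam", "pipeline"));

        // A PTransform whose output has the same length as its input.
        PCollection<Integer> lengths =
            words.apply(MapElements.into(TypeDescriptors.integers())
                                   .via((String word) -> word.length()));

        // A PTransform that collapses the whole PCollection to a single element.
        PCollection<Long> total = words.apply(Count.<String>globally());

        // Hand the Pipeline to whichever runner is configured (by default,
        // the local DirectRunner) and wait for it to finish.
        p.run().waitUntilFinish();
      }
    }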

Using those core data structures, you specify a program by building a Pipeline definition.  Beam also provides utility code (runners) that executes the program defined by that Pipeline on multiple computers in parallel, using your choice of a few different stream processing frameworks such as Flink and Spark.
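
For example, with the Java SDK the same Pipeline definition can be handed to a different runner just by changing a command-line flag. This sketch assumes the chosen runner's library is already on the classpath; the class name RunAnywhere is illustrative:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunAnywhere {
      public static void main(String[] args) {
        // Passing --runner=DirectRunner, --runner=FlinkRunner, or
        // --runner=SparkRunner on the command line selects which
        // framework executes the Pipeline.
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        // ... apply the same PTransforms as above to build the DAG ...

        pipeline.run().waitUntilFinish();
      }
    }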
