Introduction to Apache Beam Using Java – InfoQ.com

In this article, we are going to introduce Apache Beam, a powerful open source project for batch and streaming processing, used by large companies like eBay to integrate its streaming pipelines and by Mozilla to move data safely between its systems.

Apache Beam is a programming model for processing data, supporting batch and streaming.

Using the provided SDKs for Java, Python and Go, you can develop pipelines and then choose a backend that will run the pipeline.

Beam Model (Frances Perry & Tyler Akidau)

The key concepts in the Beam programming model are:

- PCollection: an immutable, potentially distributed collection of elements that a pipeline operates on.
- PTransform: a processing operation that takes one or more PCollections as input and produces zero or more PCollections as output.
- Pipeline: the graph of PCollections and PTransforms representing the entire data processing job.
- PipelineRunner: the backend that executes a pipeline on a chosen distributed processing engine.

A basic pipeline operation consists of three steps: reading, processing, and writing the transformation result. Each of those steps is defined programmatically using one of the Apache Beam SDKs.

In this section, we will create pipelines using the Java SDK. You can choose between creating a local application (using Gradle or Maven) and using the Online Playground. The examples will use the local runner, as it is easier to verify the results using JUnit assertions.

In this first example, the pipeline will receive an array of numbers and will multiply each element by 2.

The first step is creating the pipeline instance that will receive the input array and run the transform function. As we're using JUnit to run Apache Beam, we can easily create a TestPipeline as a test class attribute. If you prefer running it in your main application instead, you'll need to set the pipeline configuration options.
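A minimal sketch of that setup, assuming a JUnit 4 test class (the class and field names are our own choices):

```java
import org.apache.beam.sdk.testing.TestPipeline;
import org.junit.Rule;

public class BeamExamplesTest {

    // TestPipeline picks up test options and manages the pipeline lifecycle
    @Rule
    public final transient TestPipeline pipeline = TestPipeline.create();
}
```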

Now we can create the PCollection that will be used as input to the pipeline. It'll be an array instantiated directly from memory, but it could be read from anywhere supported by Apache Beam:
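A sketch of that in-memory input, assuming five sample numbers:

```java
// Create.of builds a PCollection directly from in-memory values
PCollection<Integer> numbers = pipeline.apply(Create.of(1, 2, 3, 4, 5));
```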

Then we apply our transform function that will multiply each dataset element by two:
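One way to express the map, using Beam's MapElements with a lambda (the variable names are ours):

```java
// MapElements applies the lambda to every element of the input PCollection
PCollection<Integer> output = numbers.apply(
        MapElements.into(TypeDescriptors.integers())
                .via((Integer number) -> number * 2));
```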

To verify the results we can write an assertion:
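For example, with PAssert from the Beam testing package:

```java
// containsInAnyOrder is used because Beam does not guarantee element order
PAssert.that(output).containsInAnyOrder(2, 4, 6, 8, 10);
```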

Note that the results are not expected to arrive in the same order as the input, because Apache Beam processes each element independently and in parallel.

The test at this point is done, and we run the pipeline by calling:
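```java
// Executes the pipeline; with TestPipeline this also triggers the PAssert checks
pipeline.run();
```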

A reduce operation combines multiple input elements into a smaller collection, usually one containing a single element.

MapReduce (Frances Perry & Tyler Akidau)

Now let's extend the example above to sum up all the items multiplied by two, resulting in a MapReduce transform.

Each PCollection transform results in a new PCollection instance, which means we can chain transformations using the apply method. In this case, the Sum operation will be used after multiplying each input by 2:
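A sketch of the chained pipeline, reusing the sample numbers and Beam's built-in Sum combiner:

```java
PCollection<Integer> sum = pipeline
        .apply(Create.of(1, 2, 3, 4, 5))
        // map: multiply each element by 2
        .apply(MapElements.into(TypeDescriptors.integers())
                .via((Integer number) -> number * 2))
        // reduce: combine all elements into a single total (2+4+6+8+10)
        .apply(Sum.integersGlobally());

PAssert.that(sum).containsInAnyOrder(30);
```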

FlatMap is an operation that first applies a map to each input element, which usually returns a new collection, resulting in a collection of collections. A flatten operation is then applied to merge all the nested collections into a single one.

The next example will transform arrays of strings into a single flattened array containing every word.

First, we declare our list of words that will be used as the pipeline input:
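For instance, assuming a couple of hypothetical sentences (any list of strings would do):

```java
// Hypothetical sample input; each element is a sentence to be split later
final List<String> WORDS = Arrays.asList(
        "An advanced unified programming model",
        "Implement batch and streaming data processing jobs");
```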

Then we create the input PCollection using the list above:
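A sketch, using Create.of as before:

```java
PCollection<String> sentences = pipeline.apply(Create.of(WORDS));
```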

Now we apply the flatmap transformation, which will split the words in each nested array and merge the results into a single list:
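A sketch using Beam's FlatMapElements:

```java
// map each sentence to a list of words, then flatten all lists into one
PCollection<String> words = sentences.apply(
        FlatMapElements.into(TypeDescriptors.strings())
                .via((String sentence) -> Arrays.asList(sentence.split(" "))));
```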

A common job in data processing is aggregating or counting by a specific key. We'll demonstrate it by counting the number of occurrences of each word from the previous example.

After having the flat array of strings, we can chain another PTransform:
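For example, with Beam's built-in Count transform:

```java
// Count.perElement emits a KV pair of each distinct word and its occurrences
PCollection<KV<String, Long>> wordCount = words.apply(Count.perElement());
```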

One of the principles of Apache Beam is reading data from anywhere, so let's see in practice how to use a text file as a data source.

The following example will read the content of a file named "words.txt" containing the sentence "An advanced unified programming model". The transform function will then return a PCollection containing each word from the text.
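A sketch using TextIO (the file location under src/main/resources is an assumption about the project layout):

```java
PCollection<String> words = pipeline
        // read the text file line by line
        .apply(TextIO.read().from("src/main/resources/words.txt"))
        // split each line into words and flatten the results
        .apply(FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split(" "))));
```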

As with the input in the previous example, Apache Beam has multiple built-in output connectors. In the following example, we will count the number of occurrences of each word present in the text file "words.txt" (which contains only the single sentence "An advanced unified programming model") and persist the output in text file format.
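A sketch of the whole pipeline, formatting each count as a line of text before writing (the output prefix matches the shard names shown below):

```java
pipeline
        .apply(TextIO.read().from("src/main/resources/words.txt"))
        .apply(FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split(" "))))
        .apply(Count.perElement())
        // render each KV pair as a "word: count" line
        .apply(MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> count) ->
                        count.getKey() + ": " + count.getValue()))
        // write the lines, letting Beam choose the number of shards
        .apply(TextIO.write().to("src/main/resources/wordscount"));

pipeline.run();
```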

Even the file writing is optimized for parallelism by default, which means Beam will determine the best number of shards (files) to persist the result. The files will be located in the folder src/main/resources and will have the prefix "wordscount", the shard number, and the total number of shards, as defined in the last output transformation.

When running it on my laptop, four shards were generated:

First shard (file name: wordscount-00000-of-00004)

Second shard (file name: wordscount-00001-of-00004)

Third shard (file name: wordscount-00002-of-00004)

The fourth shard (file name: wordscount-00003-of-00004) was created but ended up empty, because all the words had already been processed.

We can take advantage of Beam's extensibility by writing a custom transform function. A custom transformer improves code maintainability, as it removes duplication.

Basically, we need to create a subclass of PTransform, stating the types of the input and output as Java generics. Then we override the expand method and place the duplicated logic inside it: receiving a PCollection of lines and returning a PCollection containing each word.
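A sketch of such a subclass, assuming the WordsFileParser name used in the refactored test below:

```java
import java.util.Arrays;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Reusable transform: turns a PCollection of lines into a PCollection of words
public class WordsFileParser
        extends PTransform<PCollection<String>, PCollection<String>> {

    @Override
    public PCollection<String> expand(PCollection<String> input) {
        return input.apply(
                FlatMapElements.into(TypeDescriptors.strings())
                        .via((String line) -> Arrays.asList(line.split(" "))));
    }
}
```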

The test scenario, refactored to use WordsFileParser, now becomes:
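A sketch of the refactored pipeline (file path as assumed above):

```java
PCollection<KV<String, Long>> wordCount = pipeline
        .apply(TextIO.read().from("src/main/resources/words.txt"))
        // the parsing logic now lives in a single reusable transform
        .apply(new WordsFileParser())
        .apply(Count.perElement());
```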

The result is a clearer and more modular pipeline.

Windowing in Apache Beam (Frances Perry & Tyler Akidau)

A common problem in streaming processing is grouping the incoming data by a certain time interval, especially when handling large amounts of data. In this case, analyzing the aggregated data per hour or per day is more relevant than analyzing each individual element of the dataset.

In the following example, let's suppose we work at a fintech company: we receive transaction events containing an amount and the instant the transaction happened, and we want to retrieve the total amount transacted per day.

Beam provides a way to decorate each PCollection element with a timestamp. We can use this to create a PCollection representing 5 money transactions:
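A sketch using Create.timestamped, with amounts and dates chosen to match the expectations below (Instant here is org.joda.time.Instant, which Beam uses for event timestamps):

```java
PCollection<Integer> transactions = pipeline.apply(
        Create.timestamped(
                // two transactions on 2022-02-01
                TimestampedValue.of(10, Instant.parse("2022-02-01T00:00:00Z")),
                TimestampedValue.of(20, Instant.parse("2022-02-01T00:00:00Z")),
                // three transactions on 2022-02-05
                TimestampedValue.of(30, Instant.parse("2022-02-05T00:00:00Z")),
                TimestampedValue.of(40, Instant.parse("2022-02-05T00:00:00Z")),
                TimestampedValue.of(50, Instant.parse("2022-02-05T00:00:00Z"))));
```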

Next, we'll apply two transform functions (sketched below): first, a fixed-time window of one day; then, a Sum operation to total the amounts inside each window.
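A sketch of those two steps, using FixedWindows and Sum with withoutDefaults (so empty windows don't emit a default zero):

```java
PCollection<Integer> totalPerDay = transactions
        // 1. group elements into fixed one-day windows by their timestamps
        .apply(Window.<Integer>into(FixedWindows.of(Duration.standardDays(1))))
        // 2. sum the amounts within each window
        .apply(Sum.integersGlobally().withoutDefaults());
```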

In the first window (2022-02-01) we expect a total amount of 30 (10+20), while in the second window (2022-02-05) we should see 120 (30+40+50).

Each IntervalWindow instance needs to match the exact beginning and end timestamps of the chosen duration, so the chosen time has to be "00:00:00".
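A sketch of window-scoped assertions with PAssert.inWindow, using those day-boundary timestamps:

```java
IntervalWindow firstWindow = new IntervalWindow(
        Instant.parse("2022-02-01T00:00:00Z"),
        Instant.parse("2022-02-02T00:00:00Z"));
IntervalWindow secondWindow = new IntervalWindow(
        Instant.parse("2022-02-05T00:00:00Z"),
        Instant.parse("2022-02-06T00:00:00Z"));

// each assertion checks only the elements that fall inside its window
PAssert.that(totalPerDay).inWindow(firstWindow).containsInAnyOrder(30);
PAssert.that(totalPerDay).inWindow(secondWindow).containsInAnyOrder(120);

pipeline.run();
```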

Apache Beam is a powerful, battle-tested data framework that supports both batch and streaming processing. We have used the Java SDK to build map, reduce, group, windowing, and other operations.

Apache Beam can be well suited for developers who work with embarrassingly parallel tasks, as it simplifies the mechanics of large-scale data processing.

Its connectors, SDKs, and support for various runners bring flexibility, and by choosing a cloud-native runner like Google Cloud Dataflow, you get automated management of computational resources.
