Parallelism

Flink scales a job through parallelism, and the AMPS Flink connectors use parallelism as a simple way to improve performance.

AMPS Source Parallelism

The AMPS Source can use splits to create multiple clients to read from AMPS in parallel. The following example shards a topic along the id field for incoming messages and uses two clients to read from AMPS by setting parallelism to 2:

AMPSSource<MyClass> source = AMPSSource.<MyClass>builder()
    .setUri(uri)
    .setTopic(topic)
    .setDeserializationSchema(new JsonDeserializationSchema<>(MyClass.class))
    // Add the splits to tell the source to split along /id
    .setSplits(List.of("/id MOD 2 = 0", "/id MOD 2 = 1"))
    .build();

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Use .setParallelism(2) to tell Flink the parallelism to use
DataStream<MyClass> ds = env
    .fromSource(source, WatermarkStrategy.noWatermarks(), "AMPS Source Example")
    .setParallelism(2);
Info: If the number of splits is greater than the configured parallelism, some clients may receive multiple splits. This may reduce performance, because a single client then reads messages from several AMPS subscriptions instead of each client reading from its own subscription.
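One way to keep the number of splits equal to the parallelism is to generate the split filters from a single number. The following is a minimal sketch, not part of the connector: the class ModSplits and the method modSplits are hypothetical names, and the sketch assumes the split field holds non-negative integers so that every message matches exactly one filter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the AMPS Flink connector.
class ModSplits {

    // Builds one filter per remainder: "<field> MOD n = 0" through "<field> MOD n = n-1".
    // Assumes the field holds non-negative integers, so each message matches exactly one filter.
    static List<String> modSplits(String field, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> splits = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            splits.add(field + " MOD " + n + " = " + i);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Prints [/id MOD 2 = 0, /id MOD 2 = 1]
        System.out.println(modSplits("/id", 2));
    }
}
```

The resulting list could be passed to setSplits, with the same number passed to setParallelism, so that each client ends up with a single split.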

Order

Messages from AMPS always arrive in order within each split, but not necessarily in order once the splits are combined. For example, if the ids 1, 2, 3, and 4 are stored in AMPS in that order, the above AMPS Source may emit the messages in orders such as the following:

  • 1, 2, 3, 4
  • 1, 3, 2, 4
  • 2, 1, 3, 4

In every case, 1 occurs before 3 and 2 occurs before 4, because that is how the topic is sharded along the id field. The split "/id MOD 2 = 1" processes the messages 1 and 3 (in order), and the split "/id MOD 2 = 0" processes the messages 2 and 4 (in order). Messages are in order within a split, but there are no ordering guarantees across splits.
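The ordering guarantee can be expressed as a small check, which may be useful in a test. The following is a minimal sketch in plain Java that does not use the connector: the class SplitOrderCheck and its methods are hypothetical, and it models the splits as id MOD numSplits for non-negative ids. An emitted order is consistent with the guarantee when each split, taken on its own, sees its ids in the stored order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; it models splits as (id MOD numSplits).
class SplitOrderCheck {

    // Groups ids by remainder, keeping the order in which they appear in the input.
    static Map<Integer, List<Integer>> groupBySplit(List<Integer> ids, int numSplits) {
        Map<Integer, List<Integer>> groups = new HashMap<>();
        for (int id : ids) {
            groups.computeIfAbsent(Math.floorMod(id, numSplits), k -> new ArrayList<>()).add(id);
        }
        return groups;
    }

    // True when every split sees its ids in the same order in both sequences.
    static boolean preservesSplitOrder(List<Integer> stored, List<Integer> emitted, int numSplits) {
        return groupBySplit(stored, numSplits).equals(groupBySplit(emitted, numSplits));
    }

    public static void main(String[] args) {
        List<Integer> stored = List.of(1, 2, 3, 4);
        // 1 before 3 and 2 before 4: allowed, prints true
        System.out.println(preservesSplitOrder(stored, List.of(1, 3, 2, 4), 2));
        // 3 before 1 breaks the order within the odd split: prints false
        System.out.println(preservesSplitOrder(stored, List.of(3, 1, 2, 4), 2));
    }
}
```

If a job needs a total order across splits, it has to establish that order itself downstream, since the source only preserves order within each split.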