An Introduction to Spark Streaming

1. Objective

Learn what Apache Spark Streaming is, why it is needed, where Streaming fits in the Spark architecture, how it works, and the advantages Apache Spark Streaming has over Big Data Hadoop and Storm.

Apache Spark Streaming Tutorial

2. What is Apache Spark Streaming?

A data stream is a continuous, unbounded sequence of arriving data, which is subsequently divided into discrete units for further processing. Stream processing is the low-latency processing and on-the-fly analysis of streaming data.

Spark Streaming was added to Apache Spark in 2013. It is an extension of the core Spark API that enables high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources, including Kafka, Flume, Twitter, ZeroMQ, Amazon Kinesis, and TCP sockets, and once ingested it can be processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed results can be pushed out to file systems, databases, and live dashboards.

The internal processing mechanism is as follows: Spark Streaming receives live input data streams, divides the data into batches, has the Spark engine process these batches, and finally produces the results as a stream of batches.

DStream (Apache Spark Discretized Stream) is the basic abstraction of Spark Streaming. It represents a data stream divided into a series of small batches. Internally, a DStream is represented as a sequence of RDDs continuous in time, where each RDD contains the data of one particular time interval. As a result, Streaming in Spark integrates seamlessly with any other Spark component, such as Spark MLlib and Spark SQL.
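The flow described above can be illustrated with a minimal word-count sketch in Scala. It assumes a text source on localhost port 9999 (for example one started with nc -lk 9999) and a local master; the 5-second batch interval is an arbitrary choice.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    // The batch interval (5 seconds here) defines how the stream is cut into micro-batches.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines  = ssc.socketTextStream("localhost", 9999)          // input DStream
    val words  = lines.flatMap(_.split(" "))                      // transformation
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)  // more transformations

    counts.print()          // output operation: prints the first elements of each batch
    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // keep the application running until it is stopped
  }
}

Each 5-second batch of lines becomes one RDD, and the same map and reduceByKey that work on static RDDs are applied batch after batch.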

3. Need for Streaming in Apache Spark

To process data, most traditional stream processing systems were designed around continuous operators, which work as follows:

  • Streaming data from sources such as live logs, telemetry, and IoT devices is received by ingestion systems like Apache Kafka, Amazon Kinesis, etc.
  • The data is processed in parallel on the cluster.
  • The results are delivered to downstream systems such as HBase, Cassandra, and Kafka.

Each worker node runs one or more continuous operators. Each continuous operator processes the streaming data one record at a time and forwards the records to other operators in the pipeline. Source operators receive data from the ingestion systems, and sink operators deliver output to the downstream systems.

With today's large-scale hardware deployments and increasingly complex real-time analytics, this traditional architecture faces several challenges:

a) Dynamic load balancing

Dividing the data into small micro-batches allows fine-grained allocation of computing resources.

b) Fast failure and straggler recovery

c) Unification of streaming, batch, and interactive processing

d) Advanced analytics such as machine learning and interactive SQL

e) Performance

4. Why Streaming in Spark?

Batch processing systems like Apache Hadoop have high latency, which makes them unsuitable for near real-time processing requirements. Storm guarantees that every record is processed at least once, reprocessing a record if it has not been processed, but this can lead to inconsistency because a record may end up being processed more than once, and state is lost if a node running Storm goes down. In most environments Hadoop is used for batch processing while Storm is used for stream processing; maintaining two separate stacks increases code size, the number of bugs to fix, and the development effort, and introduces an additional learning curve, among other issues. This is where the difference between Big Data Hadoop and Apache Spark becomes apparent.

Spark Streaming helps in fixing these issues and provides a scalable, efficient, resilient, and integrated (with batch processing) system. Spark has provided a unified engine that natively supports both batch and streaming workloads. Spark’s single execution engine and unified Spark programming model for batch and streaming lead to some unique benefits over other traditional streaming systems.

A key reason behind Spark Streaming's rapid adoption is the unification of disparate data processing capabilities. This makes it very easy for developers to use a single framework to satisfy all of their processing needs. Furthermore, data from streaming sources can be combined with the very large range of static data sources available through Apache Spark SQL.

To address the problems of traditional stream processing engines, Spark Streaming uses a new architecture called Discretized Streams that directly leverages the rich libraries and fault tolerance of the Spark engine.

5. Spark Streaming Architecture and Advantages

Instead of processing the streaming data one record at a time, Spark Streaming discretizes the data into tiny micro-batches. In other words, Spark Streaming's receivers accept data in parallel and buffer it in the memory of Spark's worker nodes; the latency-optimized Spark engine then runs short tasks to process the buffered batches and outputs the results to other systems.

Unlike the continuous operator model, where the computation is statically allocated to a node, Spark tasks are assigned to workers dynamically, based on data locality and available resources. This enables better load balancing and faster fault recovery.

Each batch of data is a Resilient Distributed Dataset (RDD), Spark's basic abstraction of a fault-tolerant dataset.

6. How Does Spark Streaming Work?

In Spark Streaming, the data stream is divided into small batches called DStreams, which are internally a sequence of RDDs. Spark APIs then process these RDDs, and the results are returned in batches.

Spark Streaming provides APIs in Scala, Java, and Python.

Spark Streaming can maintain state based on the data arriving in the stream, which is referred to as stateful computation.
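A brief sketch of such a stateful computation, keeping a running word count across batches with updateStateByKey (one of the DStream transformations listed in section 8.1). The code is written spark-shell style and reuses the socket source from the earlier sketch; the checkpoint directory is an illustrative placeholder and is required because the state must be saved between batches.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// assuming an existing StreamingContext `ssc`, as in the earlier word-count sketch
ssc.checkpoint("/tmp/spark-streaming-checkpoint")   // state is checkpointed between batches

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Merge the counts of the current batch into the state carried over from previous batches.
val runningCounts = pairs.updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
}

runningCounts.print()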

7. Spark Streaming Sources

Every input DStream (except the file stream) is associated with a receiver that receives the data from a source and stores it in Spark's memory for processing.

There are two categories of built-in streaming sources:

  • Basic sources – sources directly available in the StreamingContext API, such as file systems and socket connections.
  • Advanced sources – sources such as Kafka, Flume, and Kinesis that are available through external utility classes and require linking against the corresponding extra artifacts.
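As a brief illustration, the basic sources can be created directly from a StreamingContext; the port and directory below are placeholders.

// assuming an existing StreamingContext `ssc`
val socketStream = ssc.socketTextStream("localhost", 9999)          // socket connection source
val fileStream   = ssc.textFileStream("hdfs:///streaming/input")    // file system source: monitors a directory for new files

// Advanced sources are not part of the core API; Kafka, for example, requires
// adding the separate spark-streaming-kafka artifact and its utility classes.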

There are two types of receivers, based on their reliability:

  • Reliable Receiver – A reliable receiver is one that correctly sends an acknowledgment to the source when the data has been received and stored in Spark with replication.
  • Unreliable Receiver – An unreliable receiver does not send an acknowledgment to the source. This can be used for sources where one does not want or need the complexity of acknowledgment.
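For sources with no built-in support, a custom receiver can be written against Spark's Receiver API. A minimal sketch of an unreliable receiver (it stores records without acknowledging the source) might look like the following; the class name, host, and port are illustrative.

import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SimpleSocketReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a background thread that connects to the source and pushes lines into Spark.
    new Thread("Simple Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* the receive loop exits once isStopped() returns true */ }

  private def receive(): Unit = {
    val socket = new Socket(host, port)
    val lines  = Source.fromInputStream(socket.getInputStream, "UTF-8").getLines()
    while (!isStopped() && lines.hasNext) {
      store(lines.next())   // hand the record to Spark; no acknowledgment is sent back
    }
    socket.close()
    restart("Connection closed, restarting receiver")
  }
}

// Usage: val customStream = ssc.receiverStream(new SimpleSocketReceiver("localhost", 9999))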

8. Spark Streaming Operations

Spark Streaming supports two types of operations:

8.1. Transformation Operations in Spark

Similar to transformations on Spark RDDs, DStream transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs. Some of the common ones are as follows.

map(), flatMap(), filter(), repartition(numPartitions), union(otherStream), count(), reduce(), countByValue(), reduceByKey(func, [numTasks]), join(otherStream, [numTasks]), cogroup(otherStream, [numTasks]), transform(), updateStateByKey(), window()
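A short sketch of a few of these, assuming the DStream of (word, 1) pairs named pairs built in the earlier examples and an existing static pair RDD of stop-word keys named stopWordsRDD (a placeholder):

import org.apache.spark.streaming.Seconds

// reduceByKey: per-batch word counts
val batchCounts = pairs.reduceByKey(_ + _)

// window: look at the last 30 seconds of data, re-evaluated every 10 seconds
val windowedCounts = pairs.window(Seconds(30), Seconds(10)).reduceByKey(_ + _)

// transform: apply an arbitrary RDD-to-RDD function to each batch,
// e.g. dropping keys that appear in the static stop-word RDD
val cleaned = pairs.transform(rdd => rdd.subtractByKey(stopWordsRDD))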

8.2. Output Operations in Apache Spark

A DStream's data is pushed out to external systems like databases or file systems using output operations. Since output operations allow external systems to consume the transformed data, they trigger the actual execution of all the DStream transformations. Currently, the following output operations are defined:

print(), saveAsTextFiles(prefix, [suffix]) (files named "prefix-TIME_IN_MS[.suffix]"), saveAsObjectFiles(prefix, [suffix]), saveAsHadoopFiles(prefix, [suffix]), foreachRDD(func)
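A brief sketch of two of these, continuing with the counts DStream from the earlier sketches; the output path and the saveToDatabase helper are hypothetical placeholders:

// saveAsTextFiles: one directory of files per batch, named "counts-TIME_IN_MS"
counts.saveAsTextFiles("hdfs:///streaming/output/counts")

// foreachRDD: the most general output operation; the function runs on the driver
// for every batch, while operations on the batch RDD run on the cluster
counts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach(record => saveToDatabase(record))   // hypothetical sink function
  }
}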

Hence, DStreams, like RDDs, are executed lazily by the output operations. Specifically, the RDD actions inside the DStream output operations force the processing of the received data. By default, output operations are executed one at a time, in the order they are defined in the application.

Resource:

http://spark.apache.org/