In the community I have noticed that many colleagues hold some misconceptions about, and frustrations with, the current state of artificial intelligence. First, the biggest misconception is that AI algorithms are all open source, that there is therefore no difference in algorithmic capability across the industry, and that Huawei has nothing to offer in this area. Second, many have only seen the breakthroughs and practice in speech recognition and image recognition, and know little about AI in other fields. To address this, here are two accessible and entertaining books that give a panoramic view of AI's history, progress, and future: 《人工智能简史》 (A Brief History of Artificial Intelligence) by 尼克 (Nick), and 《深度学习：智能时代的核心驱动力量》 (The Deep Learning Revolution) by 特伦斯 (Terrence Sejnowski). Taking key people and key technical breakthroughs as their main thread, the two books recount how biologists, physicists, mathematicians, and information scientists advanced artificial intelligence step by step; both the storytelling and the concepts are fascinating.

Let us start with some of AI's key techniques, of which the most interesting and appealing is reinforcement learning. What is reinforcement learning? A vivid analogy is repeatedly training a puppy with meat buns: through a mechanism of rewards and punishments, the puppy learns to respond to specific situations with the best decision. It is therefore also called reward learning. The figure below shows the simplest reinforcement learning scenario: an agent actively explores its environment by taking actions and making observations. If an action succeeds, the actuator receives a reward. The goal of the process is to learn how to take actions so as to maximize the expected reward. AlphaGo, the AI that defeated the Korean Go master Lee Sedol (李世乭) in March 2016 and the world champion Ke Jie (柯洁) in May 2017, became a world-class player precisely through this kind of self-play reinforcement learning.
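The agent–environment loop described above can be sketched with tabular Q-learning, the simplest reinforcement-learning algorithm. AlphaGo's actual training is far more sophisticated; this toy "corridor" environment and all parameter values below are purely illustrative assumptions:

```python
import random

# Toy environment: a corridor of 5 cells. The agent starts at cell 0 and
# receives a reward only upon reaching the rightmost cell (the "meat bun").
N_STATES = 5
ACTIONS = [-1, +1]  # move left, move right

def step(state, action):
    """Apply an action, returning (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True   # reached the goal: reward
    return next_state, 0.0, False      # otherwise: no reward

# Q-table: estimated future reward for each (state, action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

random.seed(0)
for episode in range(200):
    state, done = 0, False
    while not done:
        # Explore occasionally; otherwise exploit the best known action.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q toward reward + discounted future value.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy policy should move right in every cell.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

The reward is only observed at the very end, yet the update rule propagates it backwards through the Q-table, which is exactly the "reward learning" idea the puppy analogy describes.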
- 1. Objective
- 2. Apache Storm vs Spark Streaming Comparison
- 2.1. Processing Model
- 2.2. Primitives
- 2.3. State Management
- 2.4. Message Delivery Guarantees (Handling message level failures)
- 2.5. Fault Tolerance (Handling process/node level failures)
- 2.6. Debuggability and Monitoring
- 2.7. Auto Scaling
- 2.8. Yarn Integration
- 2.9. Isolation
- 2.11. Open Source Apache Community
- 2.12. Ease of development
- 2.13. Ease of Operability
- 2.14. Language Options
- 3. Conclusion
Apache Storm is a stream-processing engine for real-time data; Apache Spark is a general-purpose compute engine that provides Spark Streaming for handling streaming data. Below is a comparison of their strengths and weaknesses.
2. Apache Storm vs Spark Streaming Comparison
2.1. Processing Model
- Storm: supports true stream processing through core Storm, handling events as they arrive
- Spark Streaming: a variant of Spark batch processing, which treats a stream as a sequence of micro-batches
2.2. Primitives
- Storm: It provides a very rich set of primitives to perform tuple-level processing on a stream (filters, functions). Aggregations over messages in a stream are possible through group-by semantics. It supports left join, right join, and inner join (default) across streams.
- Spark Streaming: It provides two kinds of operators. The first are stream transformation operators, which transform one DStream into another DStream. The second are output operators, which write information to external systems.
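As a rough mental model of these two operator kinds (plain Python, deliberately not the real Spark API), a DStream can be pictured as a sequence of micro-batches; a transformation operator produces a new batched stream, while an output operator pushes each batch into an external system. The function and variable names below are illustrative:

```python
# A DStream modeled as a list of micro-batches (each batch a list of records).
dstream = [["a", "b"], ["b", "c", "c"], ["a"]]

def map_batches(stream, fn):
    """Transformation operator: DStream -> DStream, applied per batch."""
    return [[fn(record) for record in batch] for batch in stream]

external_store = []

def foreach_batch(stream, sink):
    """Output operator: writes every batch to an external system."""
    for batch in stream:
        sink.extend(batch)

upper = map_batches(dstream, str.upper)  # new DStream: [["A", "B"], ...]
foreach_batch(upper, external_store)     # side effect: records leave the stream
print(external_store)
```

The key contract this sketch preserves is that transformations stay within the stream abstraction and can be chained, while output operators are terminal and have side effects.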
2.3. State Management
- Storm: Core Storm by default doesn’t offer any framework-level support to store intermediate bolt output (the result of a user operation) as state. Hence, any application has to create and update its own state as and when required.
- Spark Streaming: The underlying Spark by default treats the output of every RDD operation (transformations and actions) as an intermediate state, and stores it as an RDD. Spark Streaming permits maintaining and changing state via the updateStateByKey API. However, no pluggable method for implementing state within an external system is available.
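The contract of updateStateByKey — an update function from (new values for a key, previous state for that key) to new state, applied once per micro-batch — can be sketched in plain Python without Spark itself. The helper below is an illustrative simulation of those semantics, not the real API:

```python
def update_state_by_key(batches, update_fn):
    """Fold each micro-batch of (key, value) pairs into per-key state,
    yielding the state snapshot after every batch (updateStateByKey-style)."""
    state = {}
    for batch in batches:
        # Group this batch's values by key, as Spark would.
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # update_fn(new_values, previous_state_or_None) -> new_state
        for key, new_values in grouped.items():
            state[key] = update_fn(new_values, state.get(key))
        yield dict(state)

# Running word count across three micro-batches.
batches = [[("a", 1), ("b", 1)], [("a", 1)], [("b", 1), ("b", 1)]]
running_count = lambda new, prev: sum(new) + (prev or 0)
snapshots = list(update_state_by_key(batches, running_count))
print(snapshots[-1])
```

Each snapshot corresponds to the state DStream Spark would emit after a batch; the state itself persists across batches, which is exactly what core Storm leaves to the application.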
2.4. Message Delivery Guarantees (Handling message level failures)
- Storm: It supports three message processing guarantees: at-least-once, at-most-once, and exactly-once. Storm’s reliability mechanisms are distributed, scalable, and fault-tolerant.
- Spark Streaming: Apache Spark Streaming defines its fault-tolerance semantics in terms of the guarantees provided by receivers and output operators. As per the Apache Spark architecture, incoming data is read and replicated across different Spark executor nodes. This creates failure scenarios in which data has been received but may not yet be reflected in the results. It handles fault tolerance differently for worker failure and driver failure.
2.5. Fault Tolerance (Handling process/node level failures)
- Storm: Storm is designed with fault tolerance at its core. Storm daemons (Nimbus and Supervisor) are made to be fail-fast (meaning the process self-destructs whenever an unexpected scenario is encountered) and stateless (all state is kept in ZooKeeper or on disk).
- Spark Streaming: The driver node (an equivalent of the JobTracker) is a single point of failure. If the driver node fails, then all executors are lost along with their received and replicated in-memory data. Hence, Spark Streaming uses data checkpointing to recover from driver failure.
2.6. Debuggability and Monitoring
- Storm: The Apache Storm UI supports a visualization of every topology, with the entire break-up of internal spouts and bolts. The UI additionally surfaces any errors occurring in tasks, and fine-grained stats on the throughput and latency of every part of the running topology. It helps in debugging problems at a high level. Metric-based monitoring: Storm’s built-in metrics feature provides framework-level support for applications to emit any metrics, which can then be easily integrated with external metrics/monitoring systems.
- Spark Streaming: The Spark web UI displays an extra Streaming tab that shows statistics of running receivers (whether receivers are active, the number of records received, receiver errors, and so on) and completed batches (batch processing times, queuing delays, and so on). It is useful for observing the execution of the application. The following two pieces of information in the Spark web UI are particularly important for tuning the batch size:
- Processing Time – The time to process every batch of data.
- Scheduling Delay – The time a batch waits in a queue for the processing of previous batches to complete.
2.7. Auto Scaling
- Storm: It allows configuring initial parallelism at various levels per topology – the number of worker processes, executors, and tasks. Additionally, it supports dynamic rebalancing, which permits increasing or reducing the number of worker processes and executors without needing to restart the cluster or the topology. However, the number of tasks configured initially stays constant throughout the life of the topology.
Once all supervisor nodes are fully saturated with worker processes and there is a need to scale out, one merely has to start a new supervisor node and point it to the cluster-wide ZooKeeper.
It is possible to write logic that monitors the current resource consumption on every node in a Storm cluster and dynamically adds more resources. STORM-594 describes such an auto-scaling mechanism using a feedback system.
- Spark Streaming: The community is currently working on dynamic scaling for streaming applications. At the moment, elastic scaling of running Spark Streaming applications isn’t supported.
Essentially, dynamic allocation isn’t meant to be used with Spark Streaming at the moment (1.4 or earlier). The reason is that the receiving topology is currently static: the number of receivers is fixed. One receiver is allocated for every instantiated DStream, and it will use one core within the cluster. Once the StreamingContext is started, this topology cannot be modified. Killing receivers results in stopping the topology.
2.8. Yarn Integration
- Storm: Integrates with YARN through Apache Slider; Slider is a YARN application that deploys distributed applications on a YARN cluster.
- Spark Streaming: The Spark framework provides native integration with YARN, and Spark Streaming, as a layer above Spark, simply leverages that integration. Every Spark Streaming application runs as an individual YARN application. The ApplicationMaster container runs the Spark driver and initializes the SparkContext. Every executor and receiver runs in containers managed by the ApplicationMaster, which then periodically submits one job per micro-batch to the YARN containers.
2.9. Isolation
- Storm: Each worker process runs executors for a particular topology; mixing tasks of different topologies is not allowed at the worker-process level, which provides topology-level runtime isolation. Further, every executor thread runs one or more tasks of the same component (spout or bolt); that is, there is no mixing of tasks across components.
- Spark Streaming: Each Spark application runs as a separate application on the YARN cluster, where every executor runs in a separate YARN container. Thus, JVM-level isolation is provided by YARN, since two different topologies can’t execute in the same JVM. Besides, YARN provides resource-level isolation so that container-level resource constraints (CPU, memory limits) can be configured.
2.11. Open Source Apache Community
- Storm: The Apache Storm powered-by page lists a healthy number of companies that run Storm in production for many use cases. Many of them are large-scale web deployments that are pushing the boundaries of performance and scale. For instance, Yahoo’s deployment consists of 2,300 nodes running Storm for near-real-time event processing, with the largest topology spanning across 400 nodes.
- Spark Streaming: Apache Spark Streaming is still emerging and has limited experience in production clusters. However, the overall umbrella Apache Spark community is easily one of the largest and most active open-source communities today, and its charter is rapidly evolving given the large developer base. This should lead to the maturing of Spark Streaming in the near future.
2.12. Ease of development
- Storm: It provides extremely easy, rich, and intuitive APIs that readily describe the DAG nature of the processing flow (topology). Storm tuples, which provide the abstraction of data flowing between nodes in the DAG, are dynamically typed. The motivation there is to simplify the APIs for ease of use. Any new custom tuple can be plugged in after registering its Kryo serializer. Developers can begin by writing topologies and running them in local cluster mode. In local mode, threads are used to simulate worker nodes, permitting the developer to set breakpoints, halt execution, inspect variables, and profile the topology before deploying it to a distributed cluster, where all of this is far more difficult.
- Spark Streaming: It offers Scala and Java APIs that have more of a functional-programming flavor (transformation of data). As a result, the topology code is far more concise. There is a rich set of API documentation and illustrative samples available to the developer.
2.13. Ease of Operability
- Storm: It is a little tricky to deploy/install Storm through the many available tools (Puppet, and so on) and then deploy the cluster. Apache Storm has a dependency on a ZooKeeper cluster, which it uses for coordination across the cluster and for storing state and statistics. It implements CLI support for actions like submit, activate, deactivate, list, and kill topology. Strong fault tolerance means that daemon downtime doesn’t impact executing topologies.
In standalone mode, Storm daemons must run in supervised mode. In YARN cluster mode, Storm daemons come up as containers and are managed by the Application Master (Slider).
- Spark Streaming: It uses Spark as the fundamental execution framework, so it should be easy to bring up a Spark cluster on YARN. There are several deployment requirements. Usually checkpointing is enabled for fault tolerance of the application driver, which brings in a dependency on fault-tolerant storage (HDFS).
2.14. Language Options
- Storm: We can create Storm applications in Java, Clojure, and Scala.
- Spark Streaming: We can create Spark applications in Java, Scala, Python, and R.
Common use cases
Kudu is designed to excel at use cases that require the combination of random reads/writes and the ability to do fast analytic scans—which previously required the creation of complex Lambda architectures. When combined with the broader Hadoop ecosystem, Kudu enables a variety of use cases, including:
- IoT and time series data
- Machine data analytics (network security, health, etc.)
Blog source from: https://github.com/litaotao/litaotao.github.io/blob/master/_posts/new-spark/2016-03-10-spark-resouces-blogs-paper.md
3. Databricks Blog
- Apache Spark 1.5 DataFrame API Highlights: Date/Time/String Handling, Time Intervals, and UDAFs
- Achieving End-to-end Security for Apache Spark with Databricks
The two posts above are Databricks' descriptions of the Databricks professional edition. Although they don't fundamentally solve the problem, they are quite persuasive to read, because they adopt many very detailed solutions. For those of you building cloud products, they are well worth referencing when presenting your own security story.
- Deep Dive into Spark SQL’s Catalyst Optimizer
- A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets
- Understanding your Apache Spark application through visualization
- New Visualizations for Understanding Apache Spark Streaming Applications
- An Introduction to Writing Apache Spark Applications on Databricks
- A Gentle Introduction to Apache Spark on Databricks
- Apache Spark on Databricks for Data Scientists
- Import Notebook Apache Spark on Databricks for Data Engineers
- Structured Streaming In Apache Spark: A new high-level API for streaming
- Spark SQL Supported Syntax
- Combining Machine Learning Frameworks with Apache Spark
- Deep Dive: Memory Management in Apache Spark
- An Architecture for Fast and General Data Processing on Large Clusters
- How-to: Tune Your Apache Spark Jobs (Part 1)
- How-to: Tune Your Apache Spark Jobs (Part 2)
- By 0x0fff: Spark Misconceptions
- By 0x0fff: Spark Architecture
- By 0x0fff: Spark DataFrames are faster, aren’t they?
- By 0x0fff: Spark Architecture: Shuffle
- By 0x0fff: Modern Data Architecture
- By 0x0fff: Spark Architecture Talk
- By 0x0fff: Apache Spark Future
- By 0x0fff: Data Industry Trends
- By 0x0fff: Spark Memory Management
- By 0x0fff: Spark Architecture Video
- Speed up Spark by 45x with Redis (借助 Redis，让 Spark 提速 45 倍！)
- Hadoop vs Spark
- [2016 上海第二次 spark meetup: 1. spark_meetup.pdf](http://litaotao.github.io/files/1. spark_meetup.pdf)
- [2016 上海第二次 spark meetup: 2. Flink_ An unified stream engine.pdf](http://litaotao.github.io/files/2. Flink_ An unified stream engine.pdf)
- [2016 上海第二次 spark meetup: 3. Spark在计算广告领域的应用实践.pdf](http://litaotao.github.io/files/3. Spark在计算广告领域的应用实践.pdf)
- [2016 上海第二次 spark meetup: 4. splunk_spark.pdf](http://litaotao.github.io/files/4. splunk_spark.pdf)
- Monitoring Spark with Graphite and Grafana
- Databricks Empowers Enterprises to Secure Their Apache Spark Workloads
- Running Spark Python Applications
- How-to: Prepare Your Apache Hadoop Cluster for PySpark Jobs
- Apache Spark’s Hidden REST API
- [SQL, sqlContext, hiveContext]
- [Spark Memory Issues]
- A Beginner’s Guide on Troubleshooting Spark Applications
- YouTube: what is apache spark
- Introduction to Spark Architecture
- Top 5 Mistakes When Writing Spark Applications
slide: Top 5 mistakes when writing Spark applications
- Tuning and Debugging Apache Spark
slide: Tuning and Debugging Apache Spark
- A Deeper Understanding of Spark Internals – Aaron Davidson (Databricks)
slide: A Deeper Understanding of Spark Internals – Aaron Davidson (Databricks)
- Building, Debugging, and Tuning Spark Machine Learning Pipelines – Joseph Bradley (Databricks)
slide: Building, Debugging, and Tuning Spark Machine Learning Pipelines
- Spark DataFrames Simple and Fast Analysis of Structured Data – Michael Armbrust (Databricks)
slide: Spark DataFrames Simple and Fast Analysis of Structured Data – Michael Armbrust (Databricks)
- Spark Tuning for Enterprise System Administrators
slide: Spark Tuning for Enterprise System Administrators
- Structuring Spark: DataFrames, Datasets, and Streaming
slide: Structuring Spark: DataFrames, Datasets, and Streaming
- Spark in Production: Lessons from 100+ Production Users
slide: Spark in Production: Lessons from 100+ Production Users
- Production Spark and Tachyon use Cases
slide: Production Spark and Tachyon use Cases
- SparkUI Visualization
- Everyday I’m Shuffling – Tips for Writing Better Spark Programs, Strata San Jose 2015
slide: Everyday I’m Shuffling – Tips for Writing Better Spark Programs, Strata San Jose 2015
- Large Scale Distributed Machine Learning on Apache Spark
- Securing your Spark Applications
slide: Securing your Spark Applications
- Building a REST Job Server for Interactive Spark as a Service
slide: Building a REST Job Server for Interactive Spark as a Service
- Exploiting GPUs for Columnar DataFrame Operations
slide: Exploiting GPUs for Columnar DataFrame Operations
- Easy JSON Data Manipulation in Spark – Yin Huai (Databricks)
slide: Easy JSON Data Manipulation in Spark – Yin Huai (Databricks)
- Sparkling: Speculative Partition of Data for Spark Applications – Peilong Li
slide: Sparkling: Speculative Partition of Data for Spark Applications – Peilong Li
- Advanced Spark Internals and Tuning – Reynold Xin
slide: Advanced Spark Internals and Tuning – Reynold Xin
- The Future of Real Time in Spark
slide: The Future of Real Time in Spark
- Spark 2.0
slide: Spark 2.0
- Democratizing Access to Data
slide: Democratizing Access to Data
- Not Your Father’s Database: How to Use Apache Spark Properly in Your Big Data Architecture
slide: Not Your Father’s Database: How to Use Apache Spark Properly in Your Big Data Architecture
- Disrupting Big Data with Apache Spark in the Cloud
slide: Disrupting Big Data with Apache Spark in the Cloud
I have always admired the videos and slides Databricks produces; their structure is very clear. This is one of the best talks, with much worth borrowing, especially for when you introduce your work and products to others. [A personal observation: very few people can introduce the product they are working on clearly and methodically, and for niche products even professional salespeople struggle to give a clear, concise account. This video and slide deck have great reference value. I find that carefully studying these videos and slides is sometimes more useful than reading a book or two on professional sales.]
- Getting The Best Performance With PySpark
slide: Getting The Best Performance With PySpark
- Disrupting Big Data with Apache Spark in the Cloud
slide: Disrupting Big Data with Apache Spark in the Cloud
- Large Scale Deep Learning with TensorFlow
slide: Large Scale Deep Learning with TensorFlow
- 700 Queries Per Second with Updates: Spark As A Real-Time Web Service
slide: 700 Queries Per Second with Updates: Spark As A Real-Time Web Service
blog: 700 SQL Queries per Second in Apache Spark with FiloDB
- Operational Tips for Deploying Spark
slide: Operational Tips for Deploying Spark
- Connecting Python To The Spark Ecosystem
slide: Connecting Python To The Spark Ecosystem
- Livy: A REST Web Service For Apache Spark
slide: Livy: A REST Web Service For Apache Spark
- Understanding Memory Management In Spark For Fun And Profit
slide: Understanding Memory Management In Spark For Fun And Profit
- Deep Dive: Memory Management in Apache Spark
slide: Deep Dive: Memory Management in Apache Spark
- High Performance Python on Apache Spark
slide: High Performance Python on Apache Spark
- A Deep Dive into Structured Streaming
slide: A Deep Dive into Structured Streaming
slide: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
- Apache Spark Job Server example with installation
- Productionizing Spark and the Spark REST Job Server
slide: Productionizing Spark and the Spark REST Job Server
- Data Profiling and Pipeline Processing with Spark
slide: Data Profiling and Pipeline Processing with Spark
- Spark at Bloomberg: Dynamically Composable Analytics
slide: Spark at Bloomberg: Dynamically Composable Analytics
- Data Storage Tips for Optimal Spark Performance – Vida Ha (Databricks)
slide: Data Storage Tips for Optimal Spark Performance – Vida Ha (Databricks)
- Sparkling Pandas – using Apache Spark to scale Pandas – Holden Karau and Juliet Hougland
The full replication of information operates as follows:
- Data is extracted from the source database using the standard extractor, for example by reading the row change data from the binlog in MySQL.
- The colnamesfilter is used to extract column name information from the database. This enables the row-change information to be tagged with the corresponding column information. The data changes, and corresponding row names, are stored in the THL.
- The pkeyfilter is used to extract primary key data from the source tables.
- On the slave replicator, the THL data is read and written into batch-files in the character-separated value format.
The information in these files is change data, and contains not only the original data but also metadata about the operation performed (e.g. INSERT or DELETE) and the primary key for each table. All UPDATE statements are recorded as a DELETE of the existing data and an INSERT of the new data.
- A second process uses the CSV stage data and any existing data, to build a materialized view that mirrors the source table data structure.
The staging files created by the replicator are in a specific format that incorporates change and operation information in addition to the original row data.
- The format of the files is a character-separated values file, with each row separated by a newline and individual fields separated by the character 0x01. This is supported by Hive as a native value separator.
- The content of the file consists of the full row data extracted from the master, plus metadata describing the operation for each row, the sequence number, and then the full row information.
| Operation | Sequence No | Unique Row | Commit TimeStamp | Table-specific primary key | Table-column |
|---|---|---|---|---|---|
| I (Insert) or D (Delete) | The replicator sequence number | Unique row ID within the batch | The commit timestamp of the original transaction, which can be used for partitioning | | |
For example, the MySQL row:
| 3 | #1 Single | 2006 | Cats and Dogs (#1.4) |
Is represented within the staging files generated as:
I^A1318^A1^A2017-06-07 09:22:28.000^A3^A3^A#1 Single^A2006^ACats and Dogs (#1.4)
The character separator, and whether to use quoting, are configurable within the replicator when it is deployed. The default is to use a newline character for records and the 0x01 character for fields. For more information on these fields and how they can be configured, see Section 7.2.7, “Supported CSV Formats”.
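Given the format above, a staging-file row can be decoded in a few lines of Python. The field layout (operation, sequence number, unique row ID, commit timestamp, table-specific primary key, then the row columns) follows the example row shown earlier; the function and key names here are illustrative, not part of Tungsten Replicator:

```python
FIELD_SEP = "\x01"  # Hive's default ^A value separator

def parse_staging_row(line):
    """Split one staging-file row into its metadata and the original row data."""
    fields = line.rstrip("\n").split(FIELD_SEP)
    return {
        "operation": fields[0],   # "I" (insert) or "D" (delete)
        "seqno": int(fields[1]),  # replicator sequence number
        "row_id": int(fields[2]), # unique row ID within the batch
        "commit_ts": fields[3],   # commit timestamp of the original transaction
        "pkey": fields[4],        # table-specific primary key
        "row": fields[5:],        # the full row as extracted from the master
    }

# The staging-file representation of the MySQL row shown above.
row = parse_staging_row(
    "I\x011318\x011\x012017-06-07 09:22:28.000"
    "\x013\x013\x01#1 Single\x012006\x01Cats and Dogs (#1.4)"
)
print(row["operation"], row["seqno"], row["row"])
```

Note that an UPDATE never appears as such in the file: per the description above, it arrives as a D row for the old data followed by an I row for the new data.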
On the Hadoop host, information is stored into a number of locations within the HDFS during the data transfer:
Table 5.2. Hadoop Replication Directory Locations
| Directory | Description |
|---|---|
| | Top-level directory for Tungsten Replicator information |
| | Location for metadata related to the replication operation |
| | The directory (named after the servicename of the replicator service) that holds service-specific metadata |
| | Directory of the data transferred |
| | Directory of the data transferred from a specific servicename |
| | Directory of the data transferred specific to a database |
| | Directory of the data transferred specific to a table |
| | Filename of a single file of the data transferred for a specific table and database |
Files are automatically created, named according to the parent table name, and the starting Tungsten Replicator sequence number for each file that is transferred. The size of the files is determined by the batch and commit parameters. For example, in the truncated list of files below displayed using the hadoop fs command,
```
hadoop fs -ls /user/tungsten/staging/hadoop/chicago
Found 66 items
-rw-r--r--   3 cloudera cloudera   1270236 2014-01-13 06:58 /user/tungsten/staging/alpha/hadoop/chicago/chicago-10.csv
-rw-r--r--   3 cloudera cloudera  10274189 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-103.csv
-rw-r--r--   3 cloudera cloudera   1275832 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-104.csv
-rw-r--r--   3 cloudera cloudera   1275411 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-105.csv
-rw-r--r--   3 cloudera cloudera  10370471 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-113.csv
-rw-r--r--   3 cloudera cloudera   1279435 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-114.csv
-rw-r--r--   3 cloudera cloudera   2544062 2014-01-13 06:58 /user/tungsten/staging/alpha/hadoop/chicago/chicago-12.csv
-rw-r--r--   3 cloudera cloudera  11694202 2014-01-13 08:33 /user/tungsten/staging/alpha/hadoop/chicago/chicago-123.csv
-rw-r--r--   3 cloudera cloudera   1279072 2014-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-124.csv
-rw-r--r--   3 cloudera cloudera   2570481 2014-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-126.csv
-rw-r--r--   3 cloudera cloudera   9073627 2014-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-133.csv
-rw-r--r--   3 cloudera cloudera   1279708 2014-01-13 08:34 /user/tungsten/staging/alpha/hadoop/chicago/chicago-134.csv
...
```
The individual file numbers will not be sequential, as they will depend on the sequence number, batch size and range of tables transferred.