Kafka & Distributed Streaming

Apache Kafka is a multi-purpose distributed streaming platform.  It can be used to build streaming data pipelines that reliably move data between systems or applications, and to build streaming applications that transform, analyze or react to streams of data.

Kafka supports publish/subscribe, point-to-point and custom messaging models on streams of events.  This capability can be used for simple, fast data transport from one data system to another, as an extremely robust and sophisticated enterprise messaging system, and for streaming data analysis without the micro-batching limitation found in Spark Streaming or Storm's Trident API.  In addition, Kafka provides a SQL interface to its streams through KSQL.
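One way to read the publish/subscribe and point-to-point claim is through Kafka's consumer groups: every group receives every record, while inside a group each partition is read by exactly one member. The sketch below is a plain-Python illustration of that idea, not Kafka client code. The function names `assign` and `deliveries`, the group names and the round-robin assignment are assumptions made for the example.

```python
def assign(partitions, members):
    """Spread the partitions of a topic round-robin across one group's members.

    members must be a non-empty list. This is a simplified stand-in for the
    partition assignment a real Kafka consumer group performs.
    """
    plan = {m: [] for m in members}
    for i, p in enumerate(partitions):
        plan[members[i % len(members)]].append(p)
    return plan


def deliveries(partitions, groups):
    """Return, for each group, which member reads which partitions.

    groups maps a (hypothetical) group name to its list of member names.
    Every group gets its own full assignment, so every group sees all data.
    """
    return {g: assign(partitions, members) for g, members in groups.items()}
```

With two groups subscribed to the same four-partition topic, the group with a single member reads everything (publish/subscribe across groups), while the group with two members splits the partitions between them (point-to-point style work sharing inside a group).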

 

With Kafka's stream-processing library, Kafka Streams, the following use cases can be implemented with minimal effort (compared to imperative programming):  Web site activity (track page views, searches, etc. in near-real-time); Events & log aggregation (particularly in distributed systems where messages come from multiple sources); Monitoring and metrics (aggregate statistics from distributed applications for dashboard applications); Stream processing (process raw data or treat it as SQL data types, clean it, verify it, analyze it, and forward it to another topic, messaging system or storage system); Event-time data ingestion (fast processing of a very large volume of messages).
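To make the web site activity use case concrete, here is a minimal sketch of counting page views per fixed time window. It is plain Python and does not use the Kafka Streams API (which is a Java library); the function name `count_page_views`, the `(timestamp, page)` event shape and the 60-second window are assumptions chosen for illustration.

```python
from collections import Counter


def count_page_views(events, window_seconds=60):
    """Count views per (window_start, page).

    events is an iterable of (timestamp_in_seconds, page) pairs with integer
    timestamps. Each event is assigned to the fixed window that contains it.
    """
    counts = Counter()
    for ts, page in events:
        window_start = ts - (ts % window_seconds)  # start of the event's window
        counts[(window_start, page)] += 1
    return counts
```

A stream-processing library performs the same grouping and counting continuously over an unbounded stream and keeps the running counts in fault-tolerant state; the sketch only shows the per-event logic.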

In Kafka, communication between clients and servers uses a simple, high-performance, language-agnostic TCP protocol.  A Java client is provided in the Kafka library, and clients and integrations are available for many other languages and environments, such as C/C++, Python, Go (AKA golang), Erlang, .NET, Clojure, Ruby, Node.js, Perl, PHP, Rust, Swift, a Scala DSL, an alternative Java client, an HTTP REST proxy, stdin/stdout, Storm and SQL (KSQL).

Kafka provides the functionality of a messaging system, but with a unique design: it is a distributed, partitioned, replicated commit log service.  Kafka is horizontally scalable, partitioned (the data is split up and distributed across the brokers) and replicated (allowing for automatic failover).  Unlike most messaging systems, Kafka does not track the consumption of messages; the consumers do.  In short, Kafka has been designed from the ground up with a focus on performance and throughput.
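The partitioned commit log and consumer-tracked consumption described above can be sketched in a few lines. This is a toy in-memory model, not Kafka itself: the class names `PartitionedLog` and `Consumer` are hypothetical, and CRC32 is used as the key hash only because it is deterministic in Python (it is not Kafka's own partitioner).

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Toy stand-in for one topic: a fixed set of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Records with the same key always land in the same partition.
        # CRC32 is an assumption for this sketch, not Kafka's hash function.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset, max_records=10):
        # Reading never removes data; the log does not know who has read what.
        return self.partitions[partition][offset:offset + max_records]


class Consumer:
    """Keeps its own read position per partition."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)

    def poll(self, partition):
        records = self.log.read(partition, self.offsets[partition])
        self.offsets[partition] += len(records)
        return records

    def seek(self, partition, offset):
        # Rewinding or skipping ahead is only a change to a local number.
        self.offsets[partition] = offset
```

Because the position lives in the consumer, two consumers can read the same partition independently, and either one can call `seek` to re-read old records without affecting the other or the log.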

This 4-day class instructs the student in all aspects of the Kafka project.  While Kafka provides the complete functionality of a messaging system, it was designed to do more and perform better than an enterprise messaging system.  The student will gain a comprehensive understanding of Kafka and all of its capabilities through lecture and hands-on labs.

 

PREREQUISITES

Development experience with Java, plus either academic or practical knowledge of distributed systems and of a distributed message service or data flow topologies.  Knowledge of or experience with Hadoop is strongly recommended, as is user-level proficiency with a Unix-based operating system and command-line scripts.

 

TARGET AUDIENCE

Data flow architects and developers who wish to understand the features of Kafka, and individuals who wish to migrate from MQ Series, RabbitMQ or ActiveMQ to a highly extensible, multi-faceted, superior alternative.

 

FORMAT

50% Lecture, 50% Hands-on Labs

 

DURATION

This is a 4-day class when taught on-site as instructor-led training (ILT) or via WebEx as virtual instructor-led training (VILT).  It is also offered on a per-module basis for online self-enablement via our LMS, Brane.

 

AGENDA SUMMARY

Day 1: Introduction to Kafka, installation, core APIs and Architecture

Day 2: From data motion to guaranteed message delivery with Kafka

Day 3: Kafka Streams

Day 4: Kafka SQL (KSQL)

 
