Machine learning using Zeppelin and Scikit-Learn

Machine learning applications have been written since the late 1950s (see the Perceptron, 1958). The term “machine learning” is credited to Arthur Samuel, who coined it in 1959. With the recent interest in deep learning (see Geoffrey Hinton et al., 2006), machine learning techniques have become valuable tools for the data scientist.

Machine learning is a field of computer science that gives computer systems the ability to progressively improve their performance on a specific task by learning from data, without being explicitly programmed. Many tasks performed by data scientists, such as computational statistics (which focuses on prediction-making through the use of computers), rely on machine learning algorithms.


Machine learning is often associated with data mining

Data mining is often accomplished by data scientists through the application of exploratory data analysis (EDA) and unsupervised learning. Machine learning is used in the credit card industry, for example, to establish baseline behavioral profiles for various entities and then find meaningful anomalies, such as the fraudulent use of credit cards or identities.
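As a rough illustration of the baseline-and-anomaly idea described above, the following sketch builds a spending profile for one cardholder from past transactions and flags a new charge that falls far outside it. It uses only the Python standard library rather than Scikit-Learn. The transaction amounts, the function names `build_profile` and `is_anomalous`, and the three-standard-deviation threshold are all assumptions made up for illustration, not material taken from the course.

```python
from statistics import mean, stdev


def build_profile(amounts):
    """Summarize past transaction amounts as a (mean, standard deviation) baseline."""
    return mean(amounts), stdev(amounts)


def is_anomalous(amount, profile, threshold=3.0):
    """Flag an amount whose z-score against the baseline exceeds the threshold."""
    baseline_mean, baseline_std = profile
    if baseline_std == 0:
        # Every past amount was identical, so any different amount stands out.
        return amount != baseline_mean
    z_score = abs(amount - baseline_mean) / baseline_std
    return z_score > threshold


# Hypothetical purchase history for one cardholder (illustrative values only).
history = [42.0, 38.5, 51.0, 45.25, 40.0, 47.75, 39.5, 44.0]
profile = build_profile(history)

typical_flag = is_anomalous(46.0, profile)   # close to the usual spending level
unusual_flag = is_anomalous(950.0, profile)  # far above the usual spending level
print(typical_flag, unusual_flag)
```

A production fraud system would profile many features at once and would likely use a learned model instead of a fixed threshold; this sketch only shows the shape of the idea.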

Machine learning algorithms are used to build complex models that lend themselves to prediction. In commercial use, this is known as “predictive analytics.” Analytical models allow data scientists to “produce reliable, repeatable decisions and results” and uncover “hidden insights” by learning from historical relationships and trends in the data.
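To make "learning from historical relationships and trends" concrete, here is a minimal sketch that fits a straight line to past observations by ordinary least squares and then uses it to predict the next value. It is written in plain Python so it runs anywhere. The monthly figures and the helper names `fit_line` and `predict` are hypothetical, and a real predictive model would normally be far richer than a single trend line.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-style and variance-style sums (unnormalized).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted line to a new input."""
    return slope * x + intercept


# Hypothetical historical data: month number versus units sold (in thousands).
months = [1, 2, 3, 4, 5, 6]
units = [12.1, 13.8, 16.2, 17.9, 20.1, 21.9]

slope, intercept = fit_line(months, units)
forecast = predict(7, slope, intercept)  # estimate for the next, unseen month
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```

The same fit-then-predict pattern carries over to larger models: parameters are estimated from historical data and then applied to inputs the model has not seen.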

This course instructs the student in key concepts and fundamental practices of machine learning (through lecture and labs using the Scikit-Learn libraries) that are relevant to the activities of a data scientist.  


Experience with the Python programming language and the Zeppelin IDE, along with exposure to EDA statistics, is a prerequisite. Students who are new to programming and to Zeppelin should first take the DFHz course “Introduction to Python using Zeppelin.” Students who have not programmed EDA statistics in Zeppelin should first complete the DFHz course “Statistics for Data Science using Zeppelin.”


This course is intended for individuals who are new to the application of machine learning. The goal of this courseware is to provide the concepts and tools a data scientist needs to implement programs that are capable of “learning” from data. Applications are written in the Python programming language using the Apache Zeppelin environment.


50% lecture, 50% hands-on labs


Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron


This is a four-day class when taught on-site as instructor-led training (ILT) or via WebEx as virtual instructor-led training (VILT). It is also offered on a per-module basis for online self-enablement via our LMS, Brane.


Day 1: Fundamental Machine Learning and Classification

Day 2: Training Models and Support Vector Machines

Day 3: Decision Trees, Ensemble Learning, and Random Forests

Day 4: Dimensionality Reduction
