Data Science for the Hortonworks Data Platform
This 3-day hands-on training course teaches the fundamentals of data science and how to apply those concepts in Hadoop using Mahout, Pig, Python, and machine learning libraries such as SciPy and Scikit-Learn.
Data Science for the Hortonworks Data Platform covers data science principles and techniques through lecture and hands-on experience. During this three-day class, students will learn the processes and practice of data science, including machine learning and natural language processing. Students will also learn the tools and programming languages used by data scientists, including Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, Scikit-Learn and Spark MLlib.
Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the Hadoop Essentials course.
Architects, software developers, analysts and data scientists who need to understand how to apply data science and machine learning on Hadoop.
Instructor-led lecture/discussion, with 50% of the course devoted to hands-on labs.
At the completion of the course students will be able to:
- Recognize use cases for data science
- Describe the architecture of Hadoop and YARN
- Explain the differences between supervised and unsupervised learning
- List the six machine learning tasks
- Recognize use cases for clustering, outlier detection, affinity analysis, classification, regression, and recommendation
- Use Mahout to run a machine learning algorithm on Hadoop
- Write Pig scripts to transform data on Hadoop
- Use Pig to prepare data for a machine learning algorithm
- Write a Python script
- Use NumPy to analyze big data
- Use the data structure classes in the pandas library
- Write a Python script that invokes a SciPy machine learning algorithm
- Explain the options for running Python code on a Hadoop cluster
- Write a Pig User Defined Function in Python
- Use Pig streaming on Hadoop with a Python script
- Write a Python script that invokes a scikit-learn machine learning algorithm
- Use the k-nearest neighbor algorithm to predict values based on a training data set
- Run the k-means clustering algorithm on a distributed data set on Hadoop
- Describe use cases for Natural Language Processing (NLP)
- Run an NLP algorithm on a Hadoop cluster
- Run machine learning algorithms on Hadoop using Spark MLlib
- Unit 1: Using Hadoop for Data Science
- Unit 2: Hadoop Architecture
- Unit 3: Machine Learning
- Unit 4: Introduction to Pig
- Unit 5: Python Programming
- Unit 6: Analyzing Data with Python
- Unit 7: Running Python on Hadoop
- Unit 8: Implementing Machine Learning
- Unit 9: Natural Language Processing
- Unit 10: Using Spark MLlib
Students will complete the following hands-on labs using their own 7-node Hadoop cluster (HDP 2.1) and IPython Notebook:
- Setting Up a Development Environment
- Using HDFS Commands
- Using Mahout for Machine Learning
- Getting Started with Pig
- Exploring Data with Pig
- Using the IPython Notebook
- Data Analysis with Python
- Interpolating Data Points
- Define a Pig UDF in Python
- Streaming Python with Pig
- K-Nearest Neighbor
- K-Means Clustering
- Natural Language Processing
- Running Data Science Algorithms using Spark MLlib
All equipment and infrastructure required to perform the lab exercises are provided.
Unlimited teas, coffees & soft drinks provided.
Cancellation & Reschedule Policy
If you cannot attend this class, you must provide written notice to Big Data Partnership at least 2 weeks prior to the start of the class. Big Data Partnership will transfer your registration to a future class of equal or lesser value.
Students who do not cancel at least 2 weeks prior to the start of the class and/or do not attend the class will not receive a refund and will be charged the full amount.
Big Data Partnership can cancel or reschedule at any time at our discretion. In the event that the class is cancelled or rescheduled, we will work with you to apply your registration to another date or refund your fee in full. Big Data Partnership is not responsible for non-refundable travel or other expenses incurred by the student.
If you have any questions concerning this class, please do not hesitate to contact firstname.lastname@example.org.