Syllabus Version: 2015/10/07
This overview course is a guided, hands-on tour of Spark, a popular tool for big data analytics with a unified API for batch analytics, SQL queries, stream processing, machine learning, and graph analysis. The course walks students through the lifecycle of a Spark application, from Extract-Transform-Load (ETL) operations, through ad-hoc data analysis and SQL queries, to machine learning and beyond. Students will use a variety of tools to understand how the entire Spark stack functions, including the underlying Spark execution engine, the fundamental programming abstractions (e.g., Resilient Distributed Datasets and DataFrames), and more.
Target Audience: Engineers, Data Scientists, and Analysts
Class Duration: 1 Day
Lecture vs. Labs: Class will consist of 50% lecture and 50% hands-on labs
Course Learning Objectives
After taking this class you will be able to:
● Experiment with use cases for Spark and Databricks, including extract-transform-load operations, data analytics, data visualization, batch analysis, machine learning, graph processing, and stream processing.
● Identify Spark and Databricks capabilities appropriate to your business needs.
● Communicate with team members and engineers using appropriate terminology.
● Build data pipelines and query large data sets using Spark SQL and DataFrames.
● Execute and modify extract-transform-load (ETL) jobs to process big data using the Spark API, DataFrames, and Resilient Distributed Datasets (RDDs).
● Analyze Spark jobs using the administration UIs and logs inside Databricks.
● Find answers to common Spark and Databricks questions using the documentation and other resources.
Students should arrive at class with:
● A basic understanding of software development
● Some experience coding in Python, Java, SQL, Scala, or R
● A laptop running a modern operating system (Windows, OS X, Linux)
● An up-to-date version of Chrome or Firefox (Internet Explorer is not supported) and Internet access
Outline of Topics Covered in Class
● Overview of Apache Spark and Databricks
○ A brief history of Spark and Databricks
○ Where Spark fits in the big data landscape
○ Apache Spark vs. Hadoop MapReduce: An architecture comparison
● Hands-On: Spark Guided Tour
○ Connecting to the Databricks notebook lab environment
○ Spark SQL
○ Machine Learning
○ Exercise: What can Spark do for your team?
● Intro to DataFrames and Spark SQL
○ What are DataFrames?
○ DataFrames and Spark SQL
○ Using SQLContext
○ Creating your first DataFrame
○ Inspecting your DataFrame (e.g., printSchema(), describe(), show(), take())
○ Running DataFrame operations
○ Reading from multiple data source formats
○ Using the table catalog
● Hands-On: Using DataFrames
○ Examples of the DataFrames API to query and transform data.
● Resilient Distributed Datasets: Fundamentals
○ RDDs vs. DataFrames
○ Two ways to create an RDD: Parallelize & Read from external data source
○ How an RDD is distributed via partitions in a cluster
○ Introduction to Transformations and Actions
○ Different types of RDDs
○ How transformations lazily build up a Directed Acyclic Graph (DAG)
○ Introduction to Caching an RDD
● Hands-On: A Developer's Introduction to Spark
○ Learn what a SparkContext is and how to use it
○ Using the Spark shell to parallelize data from a local collection and perform transformations and actions
○ Visually seeing the execution of Spark jobs in the Spark UI
○ Caching an RDD
○ How to repartition an RDD and count the number of items in each partition
○ Understanding the Spark lineage graph with .toDebugString()
● Spark Documentation and Resources
○ Spark Guide
○ Spark API Documentation
○ Spark Source Code on GitHub
○ Discussion Forums
○ Videos, Courses, and Other Resources
● Hands-On: Searching the Docs
○ Given a task, locate the appropriate API documentation.
● Spark Runtime Architecture
○ How the JVMs in a Spark application interact: Driver, Executor, Worker, and Spark Master
○ RDDs, DAGs, and Narrow vs. Wide Operations
○ How jobs are broken into stages and tasks and scheduled for execution.
● Hands-On: Spark UI
○ Visualizing Jobs and DAGs
○ Monitoring Tasks and Stages
○ Reading Logs
● More on Spark SQL and DataFrames
○ Creating a temporary table from a data source (using a DataFrame)
○ Overview of supported SQL dialect
○ Querying the temporary table with SQL
○ Using the table catalog
○ Table and DataFrame caching
○ Understanding query plans (.explain(true))
○ Working with nested data
○ Statistical functions in DataFrames
○ Working with null data
● Hands-On: Spark SQL
○ Examples of SQL queries, plus some assignments.
● Spark Streaming
○ Understanding the Streaming Architecture: How DStreams break down into RDD batches
○ How receivers run inside Executor task slots to capture data arriving from a network socket, Kafka, or Flume
○ Common transformations and actions on DStreams
● Machine Learning
○ Supervised Learning
○ Unsupervised Learning
● Q&A and Conclusion
Unlimited tea/coffee/soft drinks are provided.
Each participant will require a laptop with:
● A modern operating system (Windows, OS X, Linux), 2 GB of memory, and a network card
● An up-to-date version of Chrome or Firefox (Internet Explorer is not supported) and Internet access
Cancellation & Reschedule Policy
You must provide written notice to Big Data Partnership at least two weeks prior to the start of the class if you cannot attend. Big Data Partnership will transfer your registration to a future class of equal or lesser value.
Students who fail to cancel at least two weeks in advance and/or do not attend the class will not receive a refund and will be charged the full amount.
Big Data Partnership may cancel or reschedule any class at its discretion. In the event that the class is cancelled or rescheduled, we will work with you to apply your registration to another date or refund your fee in full. Big Data Partnership is not responsible for non-refundable travel or other expenses incurred by the student.
If you have any questions concerning this class, please do not hesitate to contact email@example.com.
Big Data Partnership
Big Data Partnership is the leading European-based big data service provider.
Our team has deep expertise across a wide range of big data technologies and data science techniques.
Our recent projects have involved:
● The Apache Hadoop ecosystem
● Apache Spark
● Apache Cassandra
● A range of other NoSQL databases and search technologies
Big Data Partnership helps organisations across all industries become more data-driven, reducing costs and seizing new big data opportunities rapidly and at low risk.
We help you Discover why and how to become data-driven; we work with you to Develop and prove the value of this approach; and we Deliver cost-effective solutions that exploit faster, more scalable technology. We reduce risk by Training your staff in the necessary new skills and by providing Support.
For more information, visit http://www.bigdatapartnership.com.