Summary
This workshop, delivered in two half-day sessions, offers a hands-on introduction to common big data tools such as Hadoop, Spark, and Kafka, applied to both stored (batch) and real-time (streaming) processing of healthcare data. The first session will introduce Google Cloud Platform (GCP) and distributed (cloud) storage systems, along with Spark libraries for machine learning prediction and classification problems. The second session will concentrate on real-time processing with Kafka, using synthetic data generation and interactive analytical dashboards. A drop-in session will be held between the two workshop sessions to address technical questions. The focus of the workshop is on orchestrating big data pipelines by combining different tools and libraries, rather than on interpreting results.
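To give a flavour of the first session, the sketch below shows a minimal PySpark machine learning pipeline for a classification problem. It is illustrative only: the file path, column names, and model choice are hypothetical placeholders rather than the actual workshop materials.

# Minimal PySpark classification pipeline (illustrative sketch).
# Path and column names ("age", "bmi", "outcome") are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("health-demo").getOrCreate()

# Read a (hypothetical) CSV of patient records from distributed storage.
df = spark.read.csv("data/patients.csv", header=True, inferSchema=True)

# Assemble numeric columns into a single feature vector, then fit a classifier.
assembler = VectorAssembler(inputCols=["age", "bmi"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="outcome")
model = Pipeline(stages=[assembler, lr]).fit(df)

model.transform(df).select("outcome", "prediction").show(5)

In the same spirit, the streaming component of the second session could resemble the following minimal Kafka producer emitting synthetic vital-sign readings. The topic name, record fields, broker address, and use of the kafka-python client are assumptions made for illustration.

# Minimal Kafka producer sending synthetic readings (illustrative sketch).
# Assumes a local broker on localhost:9092 and the kafka-python package.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for _ in range(10):
    reading = {"patient_id": random.randint(1, 100),
               "heart_rate": random.randint(50, 120),
               "ts": time.time()}
    producer.send("vitals", reading)  # publish one synthetic reading per second
    time.sleep(1)

producer.flush()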
Prerequisite knowledge
This workshop makes use of Linux commands and Python programming. Prior experience with either is helpful but not required, as all example code and commands will be provided.
Intended audience
This workshop is aimed at anyone interested in the intersection of data science and healthcare who wants to gain hands-on experience with distributed computing tools applied to health data. The emphasis is on orchestrating big data pipelines rather than on interpreting results.
Capacity
30 people