Apache Cassandra 2.x – Core Internals
This three-day Apache Cassandra course focuses on the key aspects of the technology for developers and system operations staff, covering core internals and distributed architecture fundamentals. The class is 60% lectures and 40% labs.
Summary: This is a fast-paced, vendor-agnostic, technical overview of the Cassandra database. No prior knowledge of databases or programming is assumed, although some basic experience with relational/SQL databases and Java will help. This course is targeted at both technical and non-technical people who want to understand the emerging world of Big Data, with a specific focus on Cassandra. In each sub-topic, the instructor will provide links and resource recommendations, such as YouTube videos, books and blog posts, for students who want to explore that area further. Students will be given a PDF slide deck, which can be used as reference material after the course. PDFs will also be given out for the labs in the course.
Labs: Labs are a significant portion of this class. Every student will get 3 VMs running in Rackspace for their individual lab environment.
Audience: Developers, System Operations staff, Software Engineers, Data Scientists, Network Engineers, or Technology Managers.
- Identify the correct use cases for Cassandra
- Introduce students to the core concepts of the distributed architecture of the Cassandra database
- Deep dive into the internal architecture of the read/write paths of Cassandra: bloom filters, block indexes, commit-log, memtables, sstables, compaction, etc.
- Give each student access to a 3-node Cassandra cluster in Rackspace to run through the hands-on labs
- Teach the fundamentals of how to write Java code to interact with Cassandra
- Cover data modeling with CQL, using the newest features of Cassandra 2.x, and how to apply these concepts to build real applications on top of Cassandra
- Provide insight into Spark/Cassandra integration
- Provide links to the best books, blog posts and videos for students to learn more about Cassandra on their own
Day 1: Intro to Cassandra and its Architecture
- NoSQL ecosystem overview and review of distributed systems fundamental concepts
- Database families and data models review: structured vs. unstructured data, key/value, key/document, column family and graph databases
- Consistency models in distributed systems
- Cassandra origins: Amazon Dynamo, Google BigTable and Cassandra at Facebook
- Analysis of Cassandra use cases
- Cassandra ecosystem and distributions
- Cassandra distributed architecture fundamentals: peer to peer design, gossiping, seed nodes, coordinator nodes, hinted handoff, read/write consistency levels, snitches, multi-data center deployments, client request routing, manual ring management vs. vnodes, key partitioners, node discovery
- Rapid read protection, cold data storage optimization and other new features of Cassandra 2.x
- Cassandra configuration: node, cluster, logging and firewall setup
- Lab #1: Install Cassandra 2.0 on a single node in the cloud
- Lab #2: Run Cassandra commands and explore operations management concepts - create a new keyspace and table, write data to the table, flush the table to an SSTable on disk, learn how to run compaction, run nodetool commands, and benchmark the single node by inserting and reading 100,000 rows.
Day 2: Cassandra Storage internals and Data Model
- Hardware recommendations (spinning disks vs. SSD, CPU/RAM/network requirements)
- Review of classical data structures used for indexing on disk
- Introduction to the LSM-tree and its performance characteristics
- Study of the components of an LSM-tree: commit-log, memtable, sstable
- Implementation details: bloom filters, in-memory caching, compression and off-heap data structures
- Physical data model: column families, partitions and cells
- Data files layout on the filesystem
- Detailed study of the read/write path: how data is flushed from memtables to disk as sstables
- Compaction concepts and strategies to reduce sstable data files
- Deletes and tombstone techniques
- Repair and snapshotting operations
- System keyspaces
- JVM tuning and troubleshooting
- Operations monitoring with nodetool and performance tuning
- Lab #3: Grow the cluster size to 3 nodes - Install Cassandra on 2 additional nodes in Rackspace and edit the YAML files to configure the 3-node cluster.
- Lab #4: Advanced Cassandra commands - query the system table, take a snapshot, decommission a node, rejoin the same node back into the cluster.
Day 3: Data Modelling using CQL in Cassandra
- CQL language fundamentals
- CQL use cases
- Data model exposed through CQL
- Mapping between logical CQL data model and internal low level storage engine
- Useful patterns for data model design
- Secondary indexes
- CQL collections
- Examples of efficient schema designs with CQL
- Advanced CQL concepts
- Lightweight transactions
- Atomic batches
- Distributed counters
- Spark Integration
- Lab #5: Java API lab - learn how to programmatically insert and read data from a Cassandra cluster using the Java API.
- Lab #6: Advanced Java API lab - explore more advanced features of the Java driver such as setting up the consistency levels, automatic failover, prepared statements, asynchronous reads and tracing queries.
- Lab #7: Spark integration lab - explore the functionality of Spark for fast distributed data processing using Cassandra as its data store. Learn how to load tables from Cassandra as RDDs and how to write a simple Spark application.
Tea, coffee, water and biscuits are provided. Lunch is not included; however, there is a wide range of cafes and restaurants in the local area.
Please bring your laptop to the training in order to do the hands-on labs.
Cancellation & Reschedule Policy
If you cannot attend this class, you must provide written notice to Big Data Partnership at least 2 weeks prior to the start of the class. Big Data Partnership will transfer your registration to a future class of equal or lesser value.
Students who fail to cancel at least 2 weeks before the start of the class and/or do not attend the class will not receive a refund and will be charged the full amount.
Big Data Partnership can cancel or reschedule at any time at our discretion. In the event that the class is cancelled or rescheduled, we will work with you to apply your registration to another date or refund your fee in full. Big Data Partnership is not responsible for non-refundable travel or other expenses incurred by the student.
If you have any questions concerning this class, please do not hesitate to contact firstname.lastname@example.org.
Big Data Partnership
Big Data Partnership is the leading European-based big data service provider.
Our team has deep expertise across a wide range of big data technologies and data science techniques.
Our recent projects have included:
- the Apache Hadoop ecosystem
- Apache Spark
- Apache Cassandra
And a range of other NoSQL databases & search technologies.
Big Data Partnership helps organisations across all industries become more data-driven by reducing costs and grasping new big-data opportunities, rapidly and at low risk.
We help you Discover why and how to become data-driven; we work with you to Develop and prove the value of this approach; we Deliver cost-effective solutions which exploit faster and more scalable technology. We reduce risk by Training your staff in the necessary new skills and by providing Support.
For more information, visit http://www.bigdatapartnership.com.