May 9, 2020

Database Learning Resources

Over the last few years working in analytics and data science, I’ve developed a strong interest in the internal workings of database engines, particularly distributed database for analytical workloads. Here I’m going to build a list of computer science concepts and learning resources that I am finding helpful in building understanding of how these systems work and how to implement them.

File I/O Concepts

On Disk I/O a series of blog posts by Alex Petrov, explaining the different methods of file I/O in Linux (buffered, direct, memory-mapped, asynchronous, vectored) and their use cases in the context of database applications, as well as on-disk data structures (SSTables, B+ Trees, Log-Structured Merge Trees)

Probabilistic Data Structures

Bloom Filter A hash bitmap for probabilistic answers to set membership queries (true/false with no false negatives)
HyperLogLog A hash based algorithm for fast approximate cardinality (distinct count) estimation
Count Min Sketch an approximate frequency table similar to a counting Bloom filter

On-Disk Data Structures

LSM Tree Paper Log Structured Merge Trees–trees of immutable sorted runs of on-disk data that are periodically merged using external merge sort
Google BigTable Paper Application of LSM Trees

Distributed Consensus Algorithms

Comprehensive Database Concept Books

Open Courses

CMU 15-445 Database Systems covers basics of on-disk database system implementation
CMU 15-721 Advanced Database Systems covers advanced topics in in-memory database implementation, with emphasis on concurrency control and optimizations.