Database Learning Resources
Over the last few years working in analytics and data science, I’ve developed a strong interest in the internal workings of database engines, particularly distributed database for analytical workloads. Here I’m going to build a list of computer science concepts and learning resources that I am finding helpful in building understanding of how these systems work and how to implement them.
File I/O Concepts
- On Disk I/O a series of blog posts by Alex Petrov, explaining the different methods of file I/O in Linux (buffered, direct, memory-mapped, asynchronous, vectored) and their use cases in the context of database applications, as well as on-disk data structures (SSTables, B+ Trees, Log-Structured Merge Trees)
Probabilistic Data Structures
- Bloom Filter A hash bitmap for probabilistic answers to set membership queries (true/false with no false negatives)
- HyperLogLog A hash based algorithm for fast approximate cardinality (distinct count) estimation
- Count Min Sketch an approximate frequency table similar to a counting Bloom filter
On-Disk Data Structures
- LSM Tree Paper Log Structured Merge Trees–trees of immutable sorted runs of on-disk data that are periodically merged using external merge sort
- Google BigTable Paper Application of LSM Trees
Distributed Consensus Algorithms
Comprehensive Database Concept Books
Open Courses
- CMU 15-445 Database Systems covers basics of on-disk database system implementation
- CMU 15-721 Advanced Database Systems covers advanced topics in in-memory database implementation, with emphasis on concurrency control and optimizations.