Database Learning Resources

Over the last few years working in analytics and data science, I’ve developed a strong interest in the internal workings of database engines, particularly distributed database for analytical workloads. Here I’m going to build a list of computer science concepts and learning resources that I am finding helpful in building understanding of how these systems work and how to implement them.

File I/O Concepts

  • On Disk I/O a series of blog posts by Alex Petrov, explaining the different methods of file I/O in Linux (buffered, direct, memory-mapped, asynchronous, vectored) and their use cases in the context of database applications, as well as on-disk data structures (SSTables, B+ Trees, Log-Structured Merge Trees)

Probabilistic Data Structures

  • Bloom Filter A hash bitmap for probabilistic answers to set membership queries (true/false with no false negatives)
  • HyperLogLog A hash based algorithm for fast approximate cardinality (distinct count) estimation
  • Count Min Sketch an approximate frequency table similar to a counting Bloom filter

On-Disk Data Structures

  • LSM Tree Paper Log Structured Merge Trees–trees of immutable sorted runs of on-disk data that are periodically merged using external merge sort
  • Google BigTable Paper Application of LSM Trees

Distributed Consensus Algorithms

Comprehensive Database Concept Books

Open Courses

245 Words

2020-05-09 11:04 -0700