Distributed databases and alternative database models

The gains from generic structures in a relational database must be weighed against the losses those same generic structures incur when the database is unreliable.

Distributing data

The user should not be aware that the database is distributed

Why distribute?

  • Parallelize operations
  • No single point of failure
  • Divide large dataset
  • Data already partitioned (different departments in the company)

Approaches to distributing RDBMS

Local autonomy

  • Sites operate independently
  • One site cannot interfere with another
  • No site should rely on another for its operation

No centralization

  • No single site controls transactions
  • No single site failure breaks the system

Continuous operation

  • The system is available most of the time
  • The system is reliable - it fails rarely and recovers quickly (and gracefully)

Location independence

  • User need not know where any data might be located
  • Data may be moved without changing functionality

Partition independence

  • User need not know how the data is partitioned
  • Data may be repartitioned without changing functionality

Note: query optimization can still benefit from the partition structure

Distributed queries

  • Execute queries close to the data

X independence

  • Distribution making no difference to the user means:
    • Hardware independence
    • OS independence
    • Network independence

DBMS independence

  • Distribution over different DBMS implementations

Concepts

Partitioning

  • How will you divide your data?
  • We divide the data as well as the processing
  • Also called fragmentation
  • Vertical partitioning
    • Divide table by columns
    • Normalization is a type of partitioning
  • Horizontal partitioning
    • Divide the table by rows (both forms are sketched below)
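
A minimal sketch of both forms in Python, using a hypothetical employee table (rows and column names are illustrative, not from any real schema):

# Hypothetical data: one logical "employee" table as a list of dicts.
rows = [
    {"id": 1, "name": "Ada",   "dept": "R&D",   "salary": 52000},
    {"id": 2, "name": "Grace", "dept": "Sales", "salary": 48000},
    {"id": 3, "name": "Alan",  "dept": "R&D",   "salary": 51000},
]

# Vertical partitioning: divide by columns; each fragment keeps the key
# so the original rows can be reassembled with a join.
identity = [{"id": r["id"], "name": r["name"]} for r in rows]
payroll  = [{"id": r["id"], "salary": r["salary"]} for r in rows]

# Horizontal partitioning: divide by rows; here one fragment per
# department, matching "data already partitioned" above.
fragments = {}
for r in rows:
    fragments.setdefault(r["dept"], []).append(r)

print(identity)
print(fragments["R&D"])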

Catalog Management

A relation is not just its rows and columns but also the definition of the table (its schema).

  • How is information about the data distributed?
  • It should be distributed in a manner that maintains integrity and consistency of the DB

Recovery control

  • Transactions usually use a 'two-phase commit' protocol
  • This requires one site to act as a co-ordinator in any given transaction (see the sketch below)
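
A minimal sketch of the two-phase commit idea (class and method names are illustrative, not a real DBMS interface): the co-ordinator collects votes in a prepare phase and commits only on a unanimous 'yes'.

class Site:
    # One participating site; 'healthy' stands in for "able to commit".
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def prepare(self):
        # Phase 1 (vote): a site votes yes only if it can guarantee a
        # later commit (e.g. the write is durably logged).
        return self.healthy

    def commit(self):
        print(self.name, "committed")

    def abort(self):
        print(self.name, "rolled back")

def two_phase_commit(sites):
    # This function plays the co-ordinator role.
    votes = [site.prepare() for site in sites]   # phase 1: collect votes
    if all(votes):                               # phase 2: commit or abort
        for site in sites:
            site.commit()
    else:
        for site in sites:
            site.abort()

two_phase_commit([Site("A"), Site("B"), Site("C", healthy=False)])  # all roll back

If the co-ordinator itself fails mid-protocol the participants block, which is why the co-ordinating node is singled out as the hard failure case at the end of these notes.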

Brewer's Conjecture

Three goals (CAP) in tension - the conjecture is that a distributed system can fully provide at most two of the three:

  • Consistency: All parts of the database should converge on a consistent state
  • Availability: Every request should result in a response (eventually)
  • Partition tolerance: If a network flaw breaks the network into separate subnets, the database should run and recover

Key/Value databases and MapReduce

If distributing databases is complex, simplify your data structures.

Key-value

  • Associative array/dictionary
  • No explicit connection between entries
  • Processing either
    • Retrieval by key
    • Exhaustive exploration
  • Easy parallel processing
  • Easy to partition
  • Partitioning is always horizontal
  • Processing should happen near the data where possible (see the sketch below)
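
A minimal sketch of a partitioned key-value store (in-process dictionaries stand in for remote partitions; all names are illustrative): retrieval by key touches one partition, while exhaustive exploration visits every partition and so parallelizes naturally.

NUM_PARTITIONS = 4
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    # Horizontal partitioning: every entry lives in exactly one partition.
    return partitions[hash(key) % NUM_PARTITIONS]

def put(key, value):
    partition_for(key)[key] = value

def get(key):
    # Retrieval by key: only one partition is touched.
    return partition_for(key).get(key)

def scan(predicate):
    # Exhaustive exploration: every partition is visited; in a real
    # system each partition would be scanned by a worker near the data.
    for p in partitions:
        for key, value in p.items():
            if predicate(key, value):
                yield key, value

put("film:1", {"title": "Alien", "year": 1979})
put("film:2", {"title": "Brazil", "year": 1985})
print(get("film:1"))
print(list(scan(lambda k, v: v["year"] > 1980)))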

MapReduce

'MapReduce' is an example of key-value processing.

This is based on the observation that a query has two stages:

  • Gathering and processing information from base tables
  • Processing and summarizing the result

SELECT   Actor1, COUNT(Title)        -- Reduce phase (aggregate)
FROM     MovieLocations              -- Map phase (read the base tables)
         INNER JOIN Actors
         ON Actor1 = Name
WHERE    DateOfBirth > '1950-01-01'  -- Map phase (filter)
GROUP BY Actor1;                     -- Reduce phase (group by key)
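
A hedged sketch of the same split in Python, assuming the MovieLocations/Actors join has already been materialized into flat records (a real MapReduce join would itself be a map/reduce job; field names mirror the query above):

from collections import defaultdict

# Illustrative pre-joined records (assumed, not from a real dataset).
records = [
    {"Actor1": "A", "Title": "Film 1", "DateOfBirth": "1962-07-03"},
    {"Actor1": "B", "Title": "Film 2", "DateOfBirth": "1948-01-20"},
    {"Actor1": "A", "Title": "Film 3", "DateOfBirth": "1962-07-03"},
]

def map_phase(record):
    # Map: runs next to the data; applies the WHERE filter and emits
    # key-value pairs of (actor, 1).
    if record["DateOfBirth"] > "1950-01-01":
        yield record["Actor1"], 1

def reduce_phase(key, values):
    # Reduce: summarizes all values sharing a key (GROUP BY + COUNT).
    return key, sum(values)

# Shuffle: group mapper output by key so each reducer sees one key.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

print([reduce_phase(k, vs) for k, vs in groups.items()])  # [('A', 2)]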

Map phase

  • Carried out with direct access to the data
  • The operation usually loops over all the data
  • Output is a new key-value set

Reduce phase

It doesn't need to happen close to the data

  • Carried out by reducer workers
  • Summarize data based on key
  • Because the data is grouped by key, we can assign the data to workers by key (sketched below)
  • Output is collated and sent to the users
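
A minimal sketch of that assignment, assuming a fixed pool of reducer workers and hashing keys to choose a worker (worker count and data are illustrative):

NUM_REDUCERS = 3

def reducer_for(key):
    # Every occurrence of a key hashes to the same reducer, so each
    # worker can summarize its keys independently of the others.
    return hash(key) % NUM_REDUCERS

mapped = [("A", 1), ("B", 1), ("A", 1), ("C", 1)]   # mapper output
buckets = {i: [] for i in range(NUM_REDUCERS)}
for key, value in mapped:
    buckets[reducer_for(key)].append((key, value))

print(buckets)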

Why use MapReduce?

  • Simple
  • Easy for distributed data
  • Easy for distributed processing
  • Move the data you need, not the data you have
  • Can recover easily from the failure of any node except the co-ordinating node