Week 11 notes completed

This commit is contained in:
levdoescode
2023-03-06 03:00:50 -05:00
parent ec71a797cc
commit 7f758c391d

View File

@ -0,0 +1,124 @@
# Distributed databases and alternative database models
The gains from generic structures in a relational database work against the losses from having generic structures in an unreliable database.
## Distributing data
The user should not be aware the database is distributed
## Why distribute?
* Parallelize operations
* No single point of failure
* Divide large dataset
* Data already partitioned (different departments in the company)
## Approaches to distributing RDBMS
### Local autonomy
* Sites operate independently
* One site cannot interfere with another
* No site should relay on another for its operation
### No centralization
* No single site conrols transactions
* No single site failure breaks the system
### Continuous operation
* The system is available most of the time
* The system is reliable - it fails rearely and recovers quickly (and gracefully)
### Location independence
* User need not know where any data might be located
* Data may be moved without changing functionality
### Particion independence
* User need not know how the data is partitioned
* Data may be repartitioned without changing functionality
Note: Optimization can still benefit from Partition structure
### Distributed queries
* Execute queries close to the data
### X independence
* Distribution making no difference to the user means:
* Hardware independence
* OS independence
* Network independence
### DBMS independence
* Distribution over different DBMS implementations
## Concepts
### Partitioning
* How will you divide your data?
* We divide the data as well as the processing
* Also called fragmentation
* Vertical partitioning
* Divide table by columns
* Normalization is a type of partitioning
* Horizontal partitioning
* Divide the table by rows
### Catalog Management
Relations are not just the column but the definition of the table.
* How is information about the data distributed?
* It should be distributed in a manner that maintains integrity and consistency of the DB
### Recovery control
* Transaction usually use 'two-phase protocol'
* This required one site to act as a co-ordinator in any give transaction
### Brewer's Conjecture
Three goals (CAP) in tension:
* Consistency: All parts of the database should converge on a consistent state
* Availability: Every request should result in a response (eventually)
* Partition tolerance: If a network flaw breaks the network into separate subnets, the database should run and recover
## Key/Value databases and MapReduce
> If distributing databases is complex, simplify your data structures.
### Key-value
* Associative array/dictionary
* No explicit connection between entries
* Processing either
* Retrieval by key
* Exhaustive exploration
* Easy parallel processing
* Easy partition
* Partition is always horizontal
* Processing should happen near the data where possible
### MapReduce
'MapReduce' is an example, of key-value processing.
This is based on the observation that the query has two stages:
* Gathering and processing information form base tables
* Processing and summarizing the result
```SQL
SELECT Actor1, COUNT(Title) --- Map phase
FROM MovieLocations -- Reduce phase
INNER JOIN Actors
ON Actor1=Name
WHERE DataOfBirth > "1950-01-01"
GROUP BY Actor1; --- Map phase
```
### Map phase
* Carried out with direct access to the data
* Peration usually loops over all data
* Output is a new key-value set
### Reduce phase
It doesn't need to happen close to the data
* Carried out by reducer workers
* Summarize data based on key
* Because we know that the data is based on the key, we can assign the data to workers based on key
* Output is collated and setn to the users
### Why use MapReduce?
* Simple
* Easy for distributed data
* Easy for distributed processing
* Move the data you need, not the data you have
* Can recover form failure of all but co-ordinating node easily
*