Week 11 notes completed
This commit is contained in:
@ -0,0 +1,124 @@
|
||||
# Distributed databases and alternative database models
|
||||
|
||||
The gains from generic structures in a relational database work against the losses from having generic structures in an unreliable database.
|
||||
|
||||
## Distributing data
|
||||
The user should not be aware the database is distributed
|
||||
|
||||
## Why distribute?
|
||||
* Parallelize operations
|
||||
* No single point of failure
|
||||
* Divide large dataset
|
||||
* Data already partitioned (different departments in the company)
|
||||
|
||||
## Approaches to distributing RDBMS
|
||||
### Local autonomy
|
||||
* Sites operate independently
|
||||
* One site cannot interfere with another
|
||||
* No site should relay on another for its operation
|
||||
|
||||
### No centralization
|
||||
* No single site conrols transactions
|
||||
* No single site failure breaks the system
|
||||
|
||||
### Continuous operation
|
||||
* The system is available most of the time
|
||||
* The system is reliable - it fails rearely and recovers quickly (and gracefully)
|
||||
|
||||
### Location independence
|
||||
* User need not know where any data might be located
|
||||
* Data may be moved without changing functionality
|
||||
|
||||
### Particion independence
|
||||
* User need not know how the data is partitioned
|
||||
* Data may be repartitioned without changing functionality
|
||||
|
||||
Note: Optimization can still benefit from Partition structure
|
||||
|
||||
### Distributed queries
|
||||
* Execute queries close to the data
|
||||
|
||||
### X independence
|
||||
* Distribution making no difference to the user means:
|
||||
* Hardware independence
|
||||
* OS independence
|
||||
* Network independence
|
||||
|
||||
### DBMS independence
|
||||
* Distribution over different DBMS implementations
|
||||
|
||||
## Concepts
|
||||
### Partitioning
|
||||
* How will you divide your data?
|
||||
* We divide the data as well as the processing
|
||||
* Also called fragmentation
|
||||
* Vertical partitioning
|
||||
* Divide table by columns
|
||||
* Normalization is a type of partitioning
|
||||
* Horizontal partitioning
|
||||
* Divide the table by rows
|
||||
|
||||
### Catalog Management
|
||||
Relations are not just the column but the definition of the table.
|
||||
* How is information about the data distributed?
|
||||
* It should be distributed in a manner that maintains integrity and consistency of the DB
|
||||
|
||||
### Recovery control
|
||||
* Transaction usually use 'two-phase protocol'
|
||||
* This required one site to act as a co-ordinator in any give transaction
|
||||
|
||||
### Brewer's Conjecture
|
||||
Three goals (CAP) in tension:
|
||||
* Consistency: All parts of the database should converge on a consistent state
|
||||
* Availability: Every request should result in a response (eventually)
|
||||
* Partition tolerance: If a network flaw breaks the network into separate subnets, the database should run and recover
|
||||
|
||||
## Key/Value databases and MapReduce
|
||||
> If distributing databases is complex, simplify your data structures.
|
||||
|
||||
### Key-value
|
||||
* Associative array/dictionary
|
||||
* No explicit connection between entries
|
||||
* Processing either
|
||||
* Retrieval by key
|
||||
* Exhaustive exploration
|
||||
* Easy parallel processing
|
||||
* Easy partition
|
||||
* Partition is always horizontal
|
||||
* Processing should happen near the data where possible
|
||||
|
||||
### MapReduce
|
||||
'MapReduce' is an example, of key-value processing.
|
||||
|
||||
This is based on the observation that the query has two stages:
|
||||
* Gathering and processing information form base tables
|
||||
* Processing and summarizing the result
|
||||
|
||||
```SQL
|
||||
SELECT Actor1, COUNT(Title) --- Map phase
|
||||
FROM MovieLocations -- Reduce phase
|
||||
INNER JOIN Actors
|
||||
ON Actor1=Name
|
||||
WHERE DataOfBirth > "1950-01-01"
|
||||
GROUP BY Actor1; --- Map phase
|
||||
```
|
||||
|
||||
### Map phase
|
||||
* Carried out with direct access to the data
|
||||
* Peration usually loops over all data
|
||||
* Output is a new key-value set
|
||||
|
||||
### Reduce phase
|
||||
It doesn't need to happen close to the data
|
||||
* Carried out by reducer workers
|
||||
* Summarize data based on key
|
||||
* Because we know that the data is based on the key, we can assign the data to workers based on key
|
||||
* Output is collated and setn to the users
|
||||
|
||||
### Why use MapReduce?
|
||||
* Simple
|
||||
* Easy for distributed data
|
||||
* Easy for distributed processing
|
||||
* Move the data you need, not the data you have
|
||||
* Can recover form failure of all but co-ordinating node easily
|
||||
*
|
||||
Reference in New Issue
Block a user