Week 9 notes completed

This commit is contained in:
levdoescode
2023-03-06 01:40:58 -05:00
parent 7e2abe660f
commit 53ac6f4af0

View File

@ -0,0 +1,208 @@
# Advanced RDBMS topics
## Speeding things up
Using binary search on a sorted table:
* Average case - O(log n)
* Worst case - O(log n)
* As fast as tree indexes
* Use no extra space
* Can only sort in one order (column)
## Using indexes
Usually one of:
* B-tree
* Hash table
May also be spatial/geometric
May be held in memory, because it's much smaller than the table.
May optimize for disk structure
## B-trees (balanced trees)
* O(log n) in typical and worst case
* Supports ranges (because we know the limits)
* Support (some) approximate searching
## Hash tables
* O(1) in typical case
* Don't scale with the size of the data
* O(n) in (unusual) worst case
* Hash algorithm may be expensive
* Depending on the size of the dataset, B-trees may be fast enough even with worst O's
* No approximation
* No range-based retrieval -
Things that are similar are placed in different places throughout the hash table (locality, sensitive hashing)
# Query efficiency
The order of the statements in the query is important. This is normally done automatically by the DBMS.
## Optimizing actions
Indexes and sorted tables can save on copying, searching and sorting.
Query strategy has a huge impact
* Order of operations
* Use of indexes
* Making fresh indexes
* Copying or reading data
How do we optimize? Let's see an example.
```SQL
SELECT *
FROM MOvieLocation,
Actors
WHERE Actor1=Name;
```
We can use the keyword `EXPLAIN`:
```SQL
SELECT *
FROM MovieLocation,
Actors
WHERE Actor1=Name;
```
Tells us which operations on which tables and on which order.
1. MovieLocations
select_type: SIMPLE
possible_keys: NULL (Actor Name is not the complete Key)
rows: 1819
filtered: 100 (100% of the data needs to be looked at)
2. Actors
select_type: SIMPLE
possible_keys: NULL
rows: 39175
filtered: 10
Extra: Using where; Using hash join
About 50ms
Making a change, creating an index on the MovieLocations table. How does it change things?
```SQL
CREATE INDEX LocAct1
ON MovieLocations(Actor1);
```
Running EXPLAIN again:
1. Actors
select_type: SIMPLE
possible_keys: NULL
rows: 39175
filtered: 100
Extra: Using where
2. MovieLocations
select_type: SIMPLE
possible_keys: LocAct1
rows: 7
filtered: 100
Extra: NULL
About 60ms
It goes to the Actors table first and only 7 rows.
Let's add an index to the Actors table:
```SQL
CREATE INDEX ActorNames
ON Actors(Name);
```
Now runnign EXPLAIN
```SQL
EXPLAIN SELECT *
FROM MovieLocations,
Actors
WHERE Actor1=Name;
```
It reverts to the original order:
1. MovieLocations
select_type: SIMPLE
possible_keys: LocAct1
rows: 1819
filtered: 100
Extra: Using where
2. Actors
select_type: simple
possible_keys: ActorNames
rows: 1819
filtered: 100
Extra: NULL
About 5ms
There is also a `EXPLAIN ANALYZE` version of the command.
# Removing the safety net - Denormalization
What is normalizaiton bias?
Normalization:
* Can reduce disk reads
* Can reduce integrity checks
* Reduces storage use
* Increases use of joins
We can denormalize
Denormalization:
* Merge tables to reduce joins
* Effectively caches a joined SELECT
* Reduces use of joins
* Only sometimes faster
Example of denormalized table
```SQL
CREATE TABLE MovieActors
(PRIMARY KEY
(Title, Year, Location))
AS
SELECT Title, Year, Location, Name,
DoB, Gender
FROM MovieLocations
LEFT JOIN
Actors
ON Actor1=Name
```
* Risky for very dynamic data. Duplicated data takes time and is risky.
For these issues, we can create a view:
Alternative cache:
```SQL
CREATE VIEW SFMoviesAndActors
AS
SELECT Title, Year, Location, Name, Dob, Gender
FROM MovieLocations
LEFT JOIN
Actors
ON Actor1=Name;
```
The views are dynamic, making queries on the view also expensive
## Caching views
* SQL standard: SNAPSHOW
* aka Embovied or Materialized Views, are saved views.
* Can update when data changes
* We can treat the view as a table, add rows, etc. Set permissions to the view. When the parent table is update, it regenerates the view.
* Not implemented in MySQL
Denormalizing
* Databases are fast
* Keyed based joins are optimized
* Evaluate speed and efficiency before trying to optmize.
## Conclusions
Relations Databases are:
* powerful
* fast
* grounded in theory
* adaptable
* mature
* based on standards(often)
* best with tabular data
* initially confusing