diff --git a/CM3010 Databases and Advanced Data Techniques/Week 1/Week 1 Notes.md b/CM3010 Databases and Advanced Data Techniques/Week 1/Week 1 Notes.md
new file mode 100644
index 0000000..45fee3a
--- /dev/null
+++ b/CM3010 Databases and Advanced Data Techniques/Week 1/Week 1 Notes.md
@@ -0,0 +1,46 @@
+# Where does data come from
+
+* New data
+* Pre-existing
+  * Internal 'legacy' data
+  * External data
+
+## New data
+* Adding as you go
+* Bulk data entry
+
+## Pre-existing data
+We may need to perform
+* Extraction
+* Conversion
+* Cleaning
+
+## External sources
+Positives
+* No costs for data entry
+* No costs for quality checks
+* Delegate expertise
+
+Negatives
+* No control over data quality
+* No control over data structure
+* May be incomplete
+* May be ambiguous
+* Questions of trustworthiness
+
+# What does your data look like?
+Sometimes an external source of information may be ambiguous or incomplete relative to our expectations.
+
+Different interests will shape the content of the data we want to represent.
+
+# Licenses, sharing and ethics
+
+## Why would someone let me use their data?
+* Drive sales (commercial reasons)
+* For the common good (ethical reasons)
+* Contract requirements (contractual reasons, such as government contracts)
+
+## Why not publish open data?
+* Restrictions on source data (e.g. medical records)
+* Control of use
+* Value of the data (e.g. you're in the business of selling data)