From f1ff55cd803739a82d6f567b22e97c15c591e653 Mon Sep 17 00:00:00 2001
From: levdoescode
Date: Thu, 12 Jan 2023 17:01:51 -0500
Subject: [PATCH] Week 2 notes completed

---
 .../Week 2/Week 2 Notes.md | 49 +++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 CM3060 Natural Language Processing/Week 2/Week 2 Notes.md

diff --git a/CM3060 Natural Language Processing/Week 2/Week 2 Notes.md b/CM3060 Natural Language Processing/Week 2/Week 2 Notes.md
new file mode 100644
index 0000000..041fcb8
--- /dev/null
+++ b/CM3060 Natural Language Processing/Week 2/Week 2 Notes.md
@@ -0,0 +1,49 @@
+# NLP Platforms and Toolkits
+We'll be using NLTK, which has its origins in computational linguistics.
+
+## Principles of NLTK
+* Simplicity
+* Consistency
+* Extensibility
+* Modularity
+
+NLTK is not necessarily SOTA (state-of-the-art) or optimized for runtime performance. (A minimal NLTK session is sketched in the code sketches at the end of these notes.)
+
+Other toolkits:
+* Scikit-learn - general-purpose ML library; a lot of NLP tasks can be modeled as ML tasks
+* Gensim - word embeddings, "topic modelling for humans"
+* spaCy - industrial-strength Natural Language Processing
+* Textacy - built on top of spaCy
+* TextBlob - simplified text processing, built on NLTK
+* Deep learning: TensorFlow, PyTorch
+* Chatbot development: RASA
+
+# Introduction to evaluation
+
+Before we build our system, we should think about how we're going to **measure its performance**: **how do we know when we have done a good job?**
+
+### In a classification problem
+A spam filter is a binary classification problem: is the message spam or not (yes/no)?
+
+Accuracy = Correct Predictions / Total Predictions
+
+Accuracy can give skewed results on unbalanced data: if the data has a large majority class, a classifier can score a high accuracy simply by always predicting that class (illustrated in the code sketches at the end of these notes).
+
+#### Another example classification problem
+
+Predict whether a CT scan shows a tumor or not.
+
+Tumors are rare events, so our classes are unbalanced.
+The cost of missing a tumor is much higher than that of a false positive. **Accuracy is not a good metric here.**
+
+This is where confusion matrices come into play: a confusion matrix compares predicted values with actual values (the ground truth).
+
+![Confusion Matrix](Files/image1.png)
+
+Accuracy = (TP + TN) / (TP + TN + FP + FN)
+Recall = TP / (TP + FN)
+Precision = TP / (TP + FP)
+
+**Recall** and **precision** illustrate performance on unbalanced datasets better than accuracy does, and they give us a baseline against which we can compare future iterations of our system built with different algorithms or approaches. (Both metrics are computed in the code sketches at the end of these notes.)
+
+![Confusion Matrix](Files/image2.png "Confusion Matrix with some data")
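+
+## Code sketches
+
+A minimal NLTK session, illustrating the "simplicity" principle above. This is my own sketch, not course-provided code; it assumes the `punkt` and `averaged_perceptron_tagger` data packages can be downloaded on first run.
+
+```python
+# Minimal NLTK session: tokenize a sentence, then tag parts of speech.
+import nltk
+
+nltk.download("punkt")                       # tokenizer models
+nltk.download("averaged_perceptron_tagger")  # POS tagger model
+
+text = "NLTK has its origins in computational linguistics."
+tokens = nltk.word_tokenize(text)  # ['NLTK', 'has', 'its', 'origins', ...]
+tagged = nltk.pos_tag(tokens)      # [('NLTK', 'NNP'), ('has', 'VBZ'), ...]
+print(tagged)
+```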
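+
+A sketch of the accuracy pitfall described above, using made-up spam labels: a "classifier" that always predicts the majority class scores 95% accuracy while missing every spam message.
+
+```python
+# Accuracy on unbalanced data: always predicting "not spam" looks great.
+actual = ["spam"] * 5 + ["not spam"] * 95  # 5% spam, 95% not spam
+predicted = ["not spam"] * 100             # always predict the majority class
+
+correct = sum(a == p for a, p in zip(actual, predicted))
+accuracy = correct / len(actual)
+print(f"Accuracy: {accuracy:.2f}")  # 0.95, yet every spam message is missed
+```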
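+
+A sketch of the confusion-matrix metrics above, using scikit-learn (mentioned in the toolkit list) on invented, unbalanced "tumor" labels.
+
+```python
+# Confusion matrix, accuracy, recall and precision on made-up labels.
+from sklearn.metrics import confusion_matrix, precision_score, recall_score
+
+# 1 = tumor (rare positive class), 0 = no tumor
+y_true = [1, 1, 1, 1] + [0] * 20        # 4 tumors out of 24 scans
+y_pred = [1, 1, 0, 0] + [1] + [0] * 19  # 2 found, 2 missed, 1 false alarm
+
+tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
+print(f"TP={tp} FP={fp} FN={fn} TN={tn}")             # TP=2 FP=1 FN=2 TN=19
+
+print("Accuracy :", (tp + tn) / (tp + tn + fp + fn))  # 0.875, looks fine...
+print("Recall   :", recall_score(y_true, y_pred))     # 0.5: half the tumors missed
+print("Precision:", precision_score(y_true, y_pred))  # ~0.67
+```
+
+Note how accuracy still looks respectable here, while recall exposes that half the tumors were missed.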