From f1ff55cd803739a82d6f567b22e97c15c591e653 Mon Sep 17 00:00:00 2001
From: levdoescode
Date: Thu, 12 Jan 2023 17:01:51 -0500
Subject: [PATCH] Week 2 notes completed

---
 .../Week 2/Week 2 Notes.md | 49 +++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 CM3060 Natural Language Processing/Week 2/Week 2 Notes.md

diff --git a/CM3060 Natural Language Processing/Week 2/Week 2 Notes.md b/CM3060 Natural Language Processing/Week 2/Week 2 Notes.md
new file mode 100644
index 0000000..041fcb8
--- /dev/null
+++ b/CM3060 Natural Language Processing/Week 2/Week 2 Notes.md
@@ -0,0 +1,49 @@
+# NLP Platforms and Toolkits
+We'll be using NLTK, which has its origins in computational linguistics.
+
+## Principles of NLTK
+* Simplicity
+* Consistency
+* Extensibility
+* Modularity
+
+NLTK is not necessarily SOTA (state-of-the-art) or optimized for runtime performance. (A minimal NLTK session is sketched in the code sketches at the end of these notes.)
+
+Other toolkits:
+* Scikit-learn - general-purpose ML library; a lot of NLP tasks can be modeled as ML tasks
+* Gensim - word embeddings, "topic modelling for humans"
+* spaCy - industrial-strength Natural Language Processing
+* Textacy - built on top of spaCy
+* TextBlob - simplified text processing, built on NLTK
+* Deep learning: TensorFlow, PyTorch
+* Chatbot development: RASA
+
+# Introduction to evaluation
+
+Before we build our system, we should think about how we're going to **measure its performance**: **how do we know when we have done a good job?**
+
+### In a classification problem
+A spam filter is a binary classification problem: is the message spam or not (yes/no)?
+
+Accuracy = Correct Predictions / Total Predictions
+
+Accuracy can give skewed results on unbalanced data: if the data has a large majority class, a classifier can score a high accuracy simply by always predicting that class (illustrated in the code sketches at the end of these notes).
+
+#### Another example classification problem
+
+Predict whether a CT scan shows a tumor or not.
+
+Tumors are rare events, so our classes are unbalanced.
+The cost of missing a tumor is much higher than that of a false positive. **Accuracy is not a good metric here.**
+
+This is where confusion matrices come into play: a confusion matrix compares predicted values with actual values (the ground truth).
+
+![Confusion Matrix](Files/image1.png)
+
+Accuracy = (TP + TN) / (TP + TN + FP + FN)
+Recall = TP / (TP + FN)
+Precision = TP / (TP + FP)
+
+**Recall** and **precision** illustrate performance on unbalanced datasets better than accuracy does, and they give us a baseline against which we can compare future iterations of our system built with different algorithms or approaches. (Both metrics are computed in the code sketches at the end of these notes.)
+
+![Confusion Matrix](Files/image2.png "Confusion Matrix with some data")
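+
+## Code sketches
+
+A minimal NLTK session, illustrating the "simplicity" principle above. This is my own sketch, not course-provided code; it assumes the `punkt` and `averaged_perceptron_tagger` data packages can be downloaded on first run.
+
+```python
+# Minimal NLTK session: tokenize a sentence, then tag parts of speech.
+import nltk
+
+nltk.download("punkt")                       # tokenizer models
+nltk.download("averaged_perceptron_tagger")  # POS tagger model
+
+text = "NLTK has its origins in computational linguistics."
+tokens = nltk.word_tokenize(text)  # ['NLTK', 'has', 'its', 'origins', ...]
+tagged = nltk.pos_tag(tokens)      # [('NLTK', 'NNP'), ('has', 'VBZ'), ...]
+print(tagged)
+```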
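+
+A sketch of the accuracy pitfall described above, using made-up spam labels: a "classifier" that always predicts the majority class scores 95% accuracy while missing every spam message.
+
+```python
+# Accuracy on unbalanced data: always predicting "not spam" looks great.
+actual = ["spam"] * 5 + ["not spam"] * 95  # 5% spam, 95% not spam
+predicted = ["not spam"] * 100             # always predict the majority class
+
+correct = sum(a == p for a, p in zip(actual, predicted))
+accuracy = correct / len(actual)
+print(f"Accuracy: {accuracy:.2f}")  # 0.95, yet every spam message is missed
+```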
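+
+A sketch of the confusion-matrix metrics above, using scikit-learn (mentioned in the toolkit list) on invented, unbalanced "tumor" labels.
+
+```python
+# Confusion matrix, accuracy, recall and precision on made-up labels.
+from sklearn.metrics import confusion_matrix, precision_score, recall_score
+
+# 1 = tumor (rare positive class), 0 = no tumor
+y_true = [1, 1, 1, 1] + [0] * 20        # 4 tumors out of 24 scans
+y_pred = [1, 1, 0, 0] + [1] + [0] * 19  # 2 found, 2 missed, 1 false alarm
+
+tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
+print(f"TP={tp} FP={fp} FN={fn} TN={tn}")             # TP=2 FP=1 FN=2 TN=19
+
+print("Accuracy :", (tp + tn) / (tp + tn + fp + fn))  # 0.875, looks fine...
+print("Recall   :", recall_score(y_true, y_pred))     # 0.5: half the tumors missed
+print("Precision:", precision_score(y_true, y_pred))  # ~0.67
+```
+
+Note how accuracy still looks respectable here, while recall exposes that half the tumors were missed.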