# NLP platforms and toolkits

We'll be using NLTK, which has its origins in computational linguistics.
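As a quick, hedged sketch of what working with NLTK looks like, here is a minimal tokenization example (the sample sentence is invented, and the `punkt` resource name can vary across NLTK versions):

```python
import nltk
from nltk.tokenize import word_tokenize

# The Punkt tokenizer models ship separately from the library; download once.
nltk.download("punkt")

# Invented sample sentence for illustration.
text = "NLTK has its origins in computational linguistics."
print(word_tokenize(text))
# ['NLTK', 'has', 'its', 'origins', 'in', 'computational', 'linguistics', '.']
```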
## Principles of NLTK

* Simplicity
* Consistency
* Extensibility
* Modularity

NLTK is not necessarily SOTA (state-of-the-art) or optimized for runtime performance.
## Other toolkits

* Scikit-learn - a general ML library; a lot of NLP tasks can be modeled as ML tasks (see the sketch after this list)
* Gensim - word embeddings and "topic modeling for humans"
* spaCy - "Industrial-strength Natural Language Processing"
* Textacy - built on top of spaCy
* TextBlob - "Simplified Text Processing", built on top of NLTK
* Deep learning: TensorFlow, PyTorch
* Chatbot development: RASA
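A minimal, hedged sketch of the idea that NLP tasks can be modeled as ML tasks, using scikit-learn as a bag-of-words spam classifier (the toy texts and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration.
texts = ["free money now", "win a prize today", "meeting at noon", "lunch tomorrow?"]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize", "see you at the meeting"]))
# e.g. ['spam' 'ham']
```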
# Introduction to evaluation

Before we build our system we should think about how we're going to **measure its performance**: **how do we know when we have done a good job?**
### In a classification problem

In a spam filter we have a binary (yes/no) problem: is the message spam or not?
Accuracy = Correct Predictions / Total Number of Predictions

Accuracy can be misleading on unbalanced data: if one class forms a large majority, a classifier that always predicts that class scores high accuracy without learning anything.
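A hedged numerical sketch of that trap (the 95/5 split is an invented example):

```python
# Invented example: 95 ham messages and 5 spam messages.
y_true = ["ham"] * 95 + ["spam"] * 5

# A degenerate "classifier" that always predicts the majority class.
y_pred = ["ham"] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
print(correct / len(y_true))  # 0.95 -- high accuracy, yet every spam message is missed
```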
#### Another example classification problem

Predict whether a CT scan shows a tumor or not.
Tumors are rare events, so our classes are unbalanced. The cost of missing a tumor is much higher than that of a false positive, so **accuracy is not a good metric**.

This is where confusion matrices come into play: a confusion matrix compares predicted values with actual values (the ground truth).

![[Pasted image 20260125181434.png]]
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Recall = TP / (TP + FN)

Precision = TP / (TP + FP)
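A minimal sketch computing these three metrics by hand from a pair of label lists (the labels are invented; 1 = tumor, 0 = no tumor):

```python
# Invented ground-truth and predicted labels.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)      # fraction of actual tumors we caught
precision = tp / (tp + fp)   # fraction of flagged scans that really had tumors
print(accuracy, recall, precision)  # 0.8 0.666... 0.666...
```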
**Recall** and **Precision** better illustrate performance on unbalanced datasets, and they give us a baseline against which we can compare future iterations of our system built with different algorithms or approaches.

![[Pasted image 20260125184209.png]]