Notebook on regular expressions completed

2023-01-13 18:18:56 -05:00
parent abf3be30f1
commit 0054585a05
1 changed files with 828 additions and 0 deletions
--- a/expressions.ipynb
+++ b/expressions.ipynb
@ -0,0 +1,828 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "1b2088b6-a2c6-4ad7-9fdb-070684661332",
+   "metadata": {},
+   "source": [
+    "# Regular expressions\n",
+    "* A formal language for defining text strings (character sequences)\n",
+    "* Used for pattern matching (e.g. searching & replacing in text)<br>\n",
+    "1. Disjunctions\n",
+    "2. Negation\n",
+    "3. Optionality\n",
+    "4. Aliases\n",
+    "5. Anchors"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2c3429a6-43dc-4323-aa95-7fecec39464f",
+   "metadata": {},
+   "source": [
+    "# 1. Disjunctions"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "7e075952-22e9-4b38-8726-5f697c103a4f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "7283f69c-63cc-4632-9df1-1bbd5596880f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text = '''Most of the time we can use white space.\n",
+    "But what about fred@gmail.com or 13/01/2021?\n",
+    "The students' attempts aren't working.\n",
+    "Maybe it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+    "'''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "a3a9c07c-5b4d-4319-bda2-0c071f0b4805",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pattern = r\"the\" # We're using the string literal syntax, what follows r is a string literal"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "01a16737-575f-4bd8-8f6c-69570e91a2d8",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Most of X time we can use white space.\n",
+      "But what about fred@gmail.com or 13/01/2021?\n",
+      "The students' attempts aren't working.\n",
+      "Maybe it's X use of apostrophes? Or it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"X\", text)) # Substitute X for pattern in text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "7ab3b376-8735-45b8-903a-d5963db60099",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pattern = r\"[Tt]he\" # match with upper or lowercase t"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "20eae2e9-9206-4beb-9fd4-029afd41df54",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Most of X time we can use white space.\n",
+      "But what about fred@gmail.com or 13/01/2021?\n",
+      "X students' attempts aren't working.\n",
+      "Maybe it's X use of apostrophes? Or it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"X\", text)) # Substitute X for pattern in text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "67f70f86-3b95-4142-a452-ff60fdec4f68",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pattern = r\"[0-9]\" # match on digits"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "f5e5847a-6faf-409d-bd96-f3482e656f60",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Most of the time we can use white space.\n",
+      "But what about fred@gmail.com or XX/XX/XXXX?\n",
+      "The students' attempts aren't working.\n",
+      "Maybe it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"X\", text)) # Substitute X for pattern in text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "cf1bba29-cbeb-4bd6-83be-390f3423954b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pattern = r\"[A-Z]\" # match on uppercase letters in the range A to Z"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "6263578b-e6ac-42ea-bc6e-53838594e778",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Xost of the time we can use white space.\n",
+      "Xut what about fred@gmail.com or 13/01/2021?\n",
+      "Xhe students' attempts aren't working.\n",
+      "Xaybe it's the use of apostrophes? Xr it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"X\", text)) # Substitute X for pattern in text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "cb52d0db-fb29-48e1-9f96-3caa4e11eb94",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pattern = r\"of|the|we\" # Match on a list of words"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "59325f92-3e2f-4652-bb7a-124f905695c7",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Most X X time X can use white space.\n",
+      "But what about fred@gmail.com or 13/01/2021?\n",
+      "The students' attempts aren't working.\n",
+      "Maybe it's X use X apostrophes? Or it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"X\", text)) # Substitute X for pattern in text"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "68389f1d-e605-4ff9-acbf-89d2d0469a64",
+   "metadata": {},
+   "source": [
+    "# 2. Negation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "7c9599ae-3162-4b09-9d69-183f806edb09",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text = '''Most of the time we can use white space.\n",
+    "But what about fred@gmail.com or 13/01/2021?\n",
+    "The students' attempts aren't working.\n",
+    "Maybe it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+    "'''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "6e18db87-d15a-4395-8610-a08dfd00daf1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pattern = r\"[^0-9]\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "72d7d0b8-663d-4209-b358-d8d67d5467c0",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "13012021\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"\", text)) # Substitute with empty for pattern in text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "19ec2edd-f671-4d60-a5d6-df094c5a17fd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pattern = r\"[^a-z]\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "f55e3f7c-e2ff-46e0-806b-8afd2bd6b8dd",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " ost of the time we can use white space   ut what about fred gmail com or              he students  attempts aren t working   aybe it s the use of apostrophes   r it might need a more up to date model  \n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \" \", text)) # Substitute with empty for pattern in text"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a5ce7b61-be11-4ff0-8a64-c4e191ef2f0e",
+   "metadata": {},
+   "source": [
+    "# 3. Optionality"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "bd2bdf73-1cd7-40d1-8e21-a31cb54e4a6e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text = '''begin began begun beginning'''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "629a5a13-d28c-4dda-8cc0-11ef642522f1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# . means match anything, like a wildcard\n",
+    "pattern = r\"beg.n\" # 'beg' followed by anything and followed by n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "ce857270-9737-48a1-b240-c577857ac1c0",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "X X X Xning\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"X\", text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "id": "97850c49-7503-4809-8de7-8d6674e850fd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text = '''colour can be spelled color'''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "id": "75c82145-1d37-48d5-ac09-9d41edbc8032",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ? means previous character is optional\n",
+    "pattern = r\"colou?r\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "e581547a-9702-42f4-8900-5ba378e81c06",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " can be spelled \n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"\", text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "id": "042afb72-6384-4b8c-a3f3-4ce5be65d4d0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# * is the Kleene star, meaning match 0 or more of previous character\n",
+    "pattern = r\"w.*\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "id": "c2e8acc9-f99b-44ce-a192-fb0ea20c022b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text = '''Most of the time we can use white space.\n",
+    "But what about fred@gmail.com or 13/01/2021?\n",
+    "The students' attempts aren't working.\n",
+    "Maybe it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+    "'''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "4fafab29-1b2e-445c-ba27-298dbf6dd93c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Most of the time \n",
+      "But \n",
+      "The students' attempts aren't \n",
+      "Maybe it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"\", text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "id": "908f46ea-21f7-4aa2-a082-b4363aef8bdd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# make sure the match is non-greedy using the ? character\n",
+    "pattern = r\"w.*?\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "id": "8c1fa3cf-65fb-4b2b-9e42-43c9e67d6d89",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Most of the time Xe can use Xhite space.\n",
+      "But Xhat about fred@gmail.com or 13/01/2021?\n",
+      "The students' attempts aren't Xorking.\n",
+      "Maybe it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"X\", text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "id": "b1afabfa-2e93-4ae1-9456-b1caa4b285f3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# make sure the match is non-greedy using the ? character, the whole word\n",
+    "pattern = r\"w.*? \""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "id": "305520e2-21ef-455e-850a-d560db508279",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Most of the time Xcan use Xspace.\n",
+      "But Xabout fred@gmail.com or 13/01/2021?\n",
+      "The students' attempts aren't working.\n",
+      "Maybe it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"X\", text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "id": "dc5cfc4f-890f-48d9-a9d2-a5417dc47d30",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text = '''foo fooo foooo fooooo!'''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "id": "fc96a086-0e77-451a-bc70-05be854fdc70",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# + is the Kleene plus, meaning match 1 or more of the previous characters\n",
+    "pattern = r\"fooo+\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "id": "d3594bb1-4668-4b28-a765-83cd633cc369",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "foo X X X!\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"X\", text))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29064c36-95c8-4c62-9fde-21a8b7e5ddb9",
+   "metadata": {},
+   "source": [
+    "# 4. Aliases\n",
+    "Shortcuts\n",
+    "```\n",
+    "\\w - match word\n",
+    "\\d - match digit\n",
+    "\\s - match whitespace\n",
+    "\\W - match not word\n",
+    "\\D - match not digit\n",
+    "\\S - match not whitespace\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "id": "73bb5e8a-2b90-477f-b7c6-8562f69abf8e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text = '''Most of the time we can use white space.\n",
+    "But what about fred@gmail.com or 13/01/2021?\n",
+    "The students' attempts aren't working.\n",
+    "Maybe it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+    "'''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "id": "e92a698d-509e-499b-8dee-052d007c309b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pattern = r\"\\w\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "id": "b2df707c-c75f-4ec6-808e-969dca4e5b54",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "        .\n",
+      "   @.  //?\n",
+      " '  ' .\n",
+      " '    ?       -- .\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# match all word characters\n",
+    "print(re.sub(pattern, \"\", text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "id": "d25c8164-2eb9-4fc0-acfc-c3a6d779a314",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pattern = r\"\\d\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "id": "47a39565-955e-48de-bb4b-6d333a6c0597",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Most of the time we can use white space.\n",
+      "But what about fred@gmail.com or XX/XX/XXXX?\n",
+      "The students' attempts aren't working.\n",
+      "Maybe it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"X\", text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "id": "c4919c02-5f03-499f-9b7d-cf6e3f3b5961",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pattern = r\"\\D\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "id": "49d9a61a-3b64-4aef-af32-fb31b8238418",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "13012021\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"\", text))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "94ceba75-eaf1-4eff-94df-83136b3a3479",
+   "metadata": {},
+   "source": [
+    "# 5. Anchors"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "id": "4949428a-4c53-41f3-8158-4104c8f77da0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# delete all words\n",
+    "pattern = \"\\w+\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "id": "5402a2e6-315d-4c20-a47b-30ba86819193",
+   "metadata": {
+    "jupyter": {
+     "source_hidden": true
+    },
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "        .\n",
+      "   @.  //?\n",
+      " '  ' .\n",
+      " '    ?       -- .\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"\", text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "id": "8eb6e9e7-72a9-46aa-a9d1-8467d98e5e4e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# delete only words at the start of a string\n",
+    "pattern = \"^\\w+\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "id": "0566be5e-c17b-4603-b877-f3036ea9bb80",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " of the time we can use white space.\n",
+      "But what about fred@gmail.com or 13/01/2021?\n",
+      "The students' attempts aren't working.\n",
+      "Maybe it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"\", text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "id": "764907c7-45ad-4934-886f-275696252a00",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " of the time we can use white space.\n",
+      " what about fred@gmail.com or 13/01/2021?\n",
+      " students' attempts aren't working.\n",
+      " it's the use of apostrophes? Or it might need a more up-to-date model.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# switch on multiline mode to delete words at the start of each line\n",
+    "print(re.sub(pattern, \"\", text, flags=re.MULTILINE))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "id": "3bcacef3-7c01-4133-9028-100100b3a196",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# use $ to anchr the match at the end of a string\n",
+    "pattern = \"\\W$\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "id": "b4662f11-aeb2-4734-b131-240987dd77a5",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Most of the time we can use white space.\n",
+      "But what about fred@gmail.com or 13/01/2021?\n",
+      "The students' attempts aren't working.\n",
+      "Maybe it's the use of apostrophes? Or it might need a more up-to-date model\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(re.sub(pattern, \"\", text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "id": "8afed181-1705-4933-a0d0-4ea5e10ca904",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Most of the time we can use white space\n",
+      "But what about fred@gmail.com or 13/01/2021\n",
+      "The students' attempts aren't working\n",
+      "Maybe it's the use of apostrophes? Or it might need a more up-to-date model\n"
+     ]
+    }
+   ],
+   "source": [
+    "# switch on multiline mode to delete non-words at the end of each line\n",
+    "print(re.sub(pattern, \"\", text, flags=re.MULTILINE))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}