- ## Regular Expressions #regex #[[regular expression]] #Coding #programming
- ### 1. What is a Regular Expression?
  A regular expression (regex) is a pattern that specifies a set of strings. It's like a wildcard search, but more powerful. At its core, it's a way to search, match, and manipulate text.
- ### 2. Basic Building Blocks:
  - **Literals**: These are the most basic elements. The regex `apple` matches the string "apple".
  - **Dot (`.`)**: Matches any single character, except for a newline. Example: `h.t` will match "hat", "hit", "hot", etc.
  - **Character Sets (`[]`)**: Matches any one of the characters inside the square brackets. Example: `h[aei]t` will match "hat", "het", and "hit", but not "hot".
  - **Negated Character Sets (`[^]`)**: Matches any one character that is not listed inside the square brackets. Example: `h[^aei]t` will match "hot", but not "hat" or "hit".
  - **Quantifiers**:
    - `*`: Matches 0 or more of the preceding token.
    - `+`: Matches 1 or more of the preceding token.
    - `?`: Matches 0 or 1 of the preceding token.
    - `{n}`: Matches exactly n of the preceding token.
    - `{n,}`: Matches n or more of the preceding token.
    - `{n,m}`: Matches between n and m (inclusive) of the preceding token.
- ### 3. Some Special Characters:
  - **Anchors**:
    - `^`: Start of a string. (e.g., `^apple` matches any string that starts with "apple")
    - `$`: End of a string. (e.g., `apple$` matches any string that ends with "apple")
  - **Escape Sequences**:
    - `\d`: Matches any digit (equivalent to `[0-9]` for ASCII text).
    - `\D`: Matches any non-digit.
    - `\w`: Matches any word character (alphanumeric or underscore).
    - `\W`: Matches any non-word character.
    - `\s`: Matches any whitespace (spaces, tabs, newlines, etc.).
    - `\S`: Matches any non-whitespace.
  - **Grouping and Capturing**:
    - `()`: Groups several tokens together. You can also use this to capture specific parts of a matched string for future reference.
  - **Alternation (`|`)**: Acts like a logical OR. Matches either the expression before or the expression after it.
    Example: `apple|banana` will match either "apple" or "banana".
- ### 4. Tips:
  - **Start Small**: Begin with small patterns and test them. Gradually build up your regex pattern.
  - **Use Tools**: Online tools like [regex101](https://regex101.com/) can help you test and debug your regular expressions. They often provide real-time feedback, which is invaluable.
  - **Be Specific**: The more specific your pattern, the less likely you are to get unwanted matches.
  - **Practice**: Like any other skill, the more you use and practice regex, the more proficient you'll become.
- ### 5. Practice:
  Now, let's have some simple exercises for you to try:
  1. Write a regex that matches email addresses.
  2. Write a regex that matches URLs.
  3. Write a regex that matches phone numbers in the format `(123) 456-7890`.

  Remember, regex patterns can vary based on specific needs. There might be multiple correct solutions.
- 1. **Matching Email Addresses**:
  This is a basic regex to match most common email formats. Fully validating an email address with a regex alone is considerably more complex.
  ```regex
  ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
  ```
  - `^[a-zA-Z0-9._%+-]+`: Matches the username part before the "@" symbol. Allows alphanumeric characters as well as the special characters `.`, `_`, `%`, `+`, and `-`.
  - `@[a-zA-Z0-9.-]+`: Matches the domain name after the "@" symbol.
  - `\.[a-zA-Z]{2,}$`: Matches the top-level domain, like `.com`, `.net`, etc.
- 2. **Matching URLs**:
  A very basic example that matches http and https URLs, with the scheme and the "www." prefix both optional, might look like this:
  ```regex
  ^(https?://)?(www\.)?[^ ]+\.[a-zA-Z]{2,}(/[^ ]*)?$
  ```
  - `^(https?://)?`: Matches the optional start of the URL, which might be "http://" or "https://".
  - `(www\.)?`: Matches the optional "www." part.
  - `[^ ]+\.[a-zA-Z]{2,}`: Matches the domain and top-level domain, ensuring no spaces are present in the URL.
  - `/[^ ]*`: Matches a forward slash followed by zero or more characters that aren't spaces. The `*` quantifier means it can match just the slash, or the slash plus a path.
  - `?`: The trailing question mark makes the entire preceding group `(/[^ ]*)` optional, so the regex can match URLs with or without the path part.
  - Please note that URLs can have various formats and can contain parameters, paths, and anchors. The above regex is quite basic and may not catch all possible URLs.
- 3. **Matching Phone Numbers in the Format `(123) 456-7890`**:
  ```regex
  ^\(\d{3}\) \d{3}-\d{4}$
  ```
  - `^\(`: Matches the opening parenthesis.
  - `\d{3}`: Matches three digits.
  - `\)`: Matches the closing parenthesis.
  - ` \d{3}`: Matches three digits after a space.
  - `-\d{4}$`: Matches the last four digits after a dash.
- ### Commonly Used Regex Patterns:
- 1. **Date (YYYY-MM-DD)**
  ```regex
  ^\d{4}-\d{2}-\d{2}$
  ```
  This checks the format only, so it will also match invalid dates like `2023-19-39`. For complete validation, a more complex regex or another form of date validation would be necessary.
- 2. **Time (HH:MM with 24-hour clock)**
  ```regex
  ^([01]\d|2[0-3]):[0-5]\d$
  ```
  Matches times from `00:00` to `23:59`.
- 3. **IP Address (IPv4)**
  ```regex
  ^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$
  ```
  Matches dotted-decimal IPv4 addresses with each octet in the range 0-255. Octets with leading zeros, such as `192.168.001.001`, are also accepted.
- 4. **MAC Address**
  ```regex
  ^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$
  ```
  Matches common MAC address formats with `:` or `-` separators. Each separator is checked independently, so an address that mixes `:` and `-` also matches.
- 5. **Hexadecimal Color Code**
  ```regex
  ^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$
  ```
  Matches 3- or 6-character hex color codes with or without the leading `#`.
- 6. **Username (8-20 alphanumeric characters)**
  ```regex
  ^[a-zA-Z0-9]{8,20}$
  ```
  Matches usernames of 8 to 20 ASCII letters and digits, with no other characters allowed.
- 7. **Password**
  ```regex
  ^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*()_+{}:"<>?|\[\]\/\\-]).{8,20}$
  ```
  Matches passwords of 8-20 characters that contain at least one digit, lowercase letter, uppercase letter, and special character. Each `(?=...)` is a lookahead that checks one requirement without consuming any characters.
- 8. **Postal/ZIP Code (for U.S.)**
  ```regex
  ^\d{5}(-\d{4})?$
  ```
  Matches both 5-digit ZIP codes and ZIP+4 formats for the U.S.
- 9. **Credit Card Number**
  ```regex
  ^\d{4}-?\d{4}-?\d{4}-?\d{4}$
  ```
  Matches 16-digit credit card numbers with optional `-` separators. It checks the layout only, not whether the number is a valid card number.
- 10. **Social Security Number (U.S. format)**
  ```regex
  ^\d{3}-\d{2}-\d{4}$
  ```
  Matches the U.S. SSN layout `AAA-GG-SSSS`. It does not check whether the number was actually issued.
- 11. **UUID**
  ```regex
  ^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$
  ```
  Matches UUIDs of versions 1-5 in the canonical 8-4-4-4-12 hexadecimal layout. The nil UUID and other versions will not match.
- 12. **File Path (Windows format)**
  ```regex
  ^([a-zA-Z]:\\)?(?:[a-zA-Z0-9]+\\?)*$
  ```
  Matches basic Windows file paths made of alphanumeric names separated by backslashes, with an optional drive letter. It's a simplification: names containing dots, spaces, or other punctuation (including most file names with extensions) will not match.
- 13. **File Path (Mac/Unix format)**
  ```regex
  ^(/[^/ ]+)+/?$
  ```
  Matches basic absolute Mac and Unix-based file paths whose components contain no spaces.
- 14. **HTML Tags**
  ```regex
  <(/?[^>]+)>
  ```
  This matches simple opening and closing HTML tags and captures everything between the angle brackets, including any attributes. It won't handle all edge cases, such as a `>` inside an attribute value.
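- A quick way to follow the "Start Small" tip is to run a few of the patterns above against sample strings. The sketch below is one possible way to do that in Python with the standard `re` module. The helper names `matches` and `is_real_date` and all sample strings are made up for illustration and are not part of the notes above. It also shows the date caveat: the regex accepts `2023-19-39`, while a calendar-aware check with `datetime` rejects it.

```python
import re
from datetime import datetime

# A few patterns copied verbatim from the notes above.
PATTERNS = {
    "email": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    "phone": r"^\(\d{3}\) \d{3}-\d{4}$",
    "date": r"^\d{4}-\d{2}-\d{2}$",
    "time": r"^([01]\d|2[0-3]):[0-5]\d$",
    "zip": r"^\d{5}(-\d{4})?$",
}


def matches(name: str, text: str) -> bool:
    """Return True if `text` matches the named pattern (hypothetical helper)."""
    return re.search(PATTERNS[name], text) is not None


def is_real_date(text: str) -> bool:
    """Check the format with the regex, then the calendar with datetime."""
    if not matches("date", text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        # Right shape, but not a date that exists (e.g. month 19).
        return False
    return True


if __name__ == "__main__":
    print(matches("email", "user@example.com"))  # True
    print(matches("phone", "(123) 456-7890"))    # True
    print(matches("time", "24:00"))              # False
    print(matches("date", "2023-19-39"))         # True: format only
    print(is_real_date("2023-19-39"))            # False: no such date
```

  Keeping the patterns in one dictionary makes it easy to add more of them and test each one in isolation before combining them into something larger.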