-
Regular Expressions #regex #[[regular expression]] #Coding #programming
-
1. What is a Regular Expression?
A regular expression is a pattern that specifies a set of strings. It's like a wildcard search, but more powerful. At its core, it's a way to search, match, and manipulate text.
-
2. Basic Building Blocks:
- Literals: These are the most basic elements. The regex `apple` matches the string "apple".
- Dot (`.`): Matches any single character except a newline. Example: `h.t` matches "hat", "hit", "hot", etc.
- Character Sets (`[]`): Matches any one of the characters inside the square brackets. Example: `h[aei]t` matches "hat" and "hit", but not "hot".
- Negated Character Sets (`[^]`): Matches any one character that is not listed inside the square brackets. Example: `h[^aei]t` matches "hot", but not "hat" or "hit".
- Quantifiers:
  - `*`: Matches 0 or more of the preceding token.
  - `+`: Matches 1 or more of the preceding token.
  - `?`: Matches 0 or 1 of the preceding token.
  - `{n}`: Matches exactly n of the preceding token.
  - `{n,}`: Matches n or more of the preceding token.
  - `{n,m}`: Matches between n and m (inclusive) of the preceding token.
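The building blocks above can be tried out with a short script. This is a minimal sketch using Python's `re` module; the word lists and variable names are made up for illustration, and `re.fullmatch` is used so that each pattern has to cover the whole string.

```python
import re

words = ["hat", "hit", "hot", "hut", "ht", "heat"]

# Dot: exactly one character (any) between "h" and "t".
dot_matches = [w for w in words if re.fullmatch(r"h.t", w)]

# Character set: the middle character must be "a", "e", or "i".
set_matches = [w for w in words if re.fullmatch(r"h[aei]t", w)]

# Negated set: the middle character must NOT be "a", "e", or "i".
negated_matches = [w for w in words if re.fullmatch(r"h[^aei]t", w)]

# Quantifiers, each applied to the single token "o".
samples = ["gd", "god", "good", "goood"]
quantifier_patterns = ["go*d", "go+d", "go?d", "go{2}d", "go{2,}d", "go{1,2}d"]
quantifier_matches = {
    pattern: [s for s in samples if re.fullmatch(pattern, s)]
    for pattern in quantifier_patterns
}

print(dot_matches)      # ['hat', 'hit', 'hot', 'hut']
print(set_matches)      # ['hat', 'hit']
print(negated_matches)  # ['hot', 'hut']
for pattern, matched in quantifier_matches.items():
    print(pattern, matched)
```

Note that "ht" and "heat" never match here, because the dot and both kinds of character set stand for exactly one character.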
-
3. Some Special Characters:
- Anchors:
  - `^`: Start of a string (e.g., `^apple` matches any string that starts with "apple").
  - `$`: End of a string (e.g., `apple$` matches any string that ends with "apple").
- Escape Sequences:
  - `\d`: Matches any digit (equivalent to `[0-9]`).
  - `\D`: Matches any non-digit.
  - `\w`: Matches any word character (alphanumeric or underscore).
  - `\W`: Matches any non-word character.
  - `\s`: Matches any whitespace (spaces, tabs, etc.).
  - `\S`: Matches any non-whitespace.
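As a rough illustration of anchors and escape sequences, here is a small Python sketch. The sample sentence and variable names are assumptions chosen for the example.

```python
import re

text = "apple pie, 42 apples"

# Anchors: ^ pins the match to the start, $ to the end of the string.
starts_with_apple = re.search(r"^apple", text) is not None
ends_with_apple = re.search(r"apple$", text) is not None  # text ends in "apples"

# Shorthand classes combined with the + quantifier.
digits = re.findall(r"\d+", text)    # runs of digits
words = re.findall(r"\w+", text)     # runs of word characters
gaps = re.findall(r"\W+", text)      # runs of non-word characters
pieces = re.split(r"\s+", text)      # split on runs of whitespace

print(starts_with_apple, ends_with_apple)  # True False
print(digits)   # ['42']
print(words)    # ['apple', 'pie', '42', 'apples']
print(gaps)     # [' ', ', ', ' ']
print(pieces)   # ['apple', 'pie,', '42', 'apples']
```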
- Grouping and Capturing:
  - `()`: Groups several tokens together so that a quantifier applies to the whole group. It also captures that part of the matched string for later reference.
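To make grouping and capturing concrete, the sketch below shows three common uses in Python: reading captured parts, repeating a group, and referring back to a capture with `\1`. The input strings are invented for the example.

```python
import re

# Two capture groups: the key before "=" and the number after it.
match = re.fullmatch(r"(\w+)=(\d+)", "width=640")
key, value = match.group(1), match.group(2)

# A group lets a quantifier apply to several tokens at once.
repeated = re.fullmatch(r"(ab)+", "ababab") is not None

# A backreference (\1) must repeat whatever group 1 captured,
# which makes it easy to spot a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

print(key, value)        # width 640
print(repeated)          # True
print(doubled.group(1))  # is
```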
- Alternation (`|`): Acts like a logical OR, matching either the expression before it or the expression after it. Example: `apple|banana` matches either "apple" or "banana".
-
4. Tips:
-
Start Small: Begin with small patterns and test them. Gradually build up your regex pattern.
-
Use Tools: Online tools such as regex101 let you test and debug your regular expressions. They often provide real-time feedback, which is invaluable.
-
Be Specific: The more specific your pattern, the less likely you are to get unwanted matches.
-
Practice: Like any other skill, the more you use and practice regex, the more proficient you'll become.
-
5. Practice:
Here are some simple exercises to try:
- Write a regex that matches email addresses.
- Write a regex that matches URLs.
- Write a regex that matches phone numbers in the format `(123) 456-7890`.
-
Remember, regex patterns can vary based on specific needs. There might be multiple correct solutions.
- 1. **Matching Email Addresses**:
This basic regex matches most common email formats. Fully validating email addresses with a regex alone is considerably more complex, so treat it as a plausibility check rather than strict validation.
```regex
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
```
- `^[a-zA-Z0-9._%+-]+`: Matches the username part before the "@" symbol. Allows alphanumeric characters as well as some special characters like `.`, `_`, `%`, `+`, and `-`.
- `@[a-zA-Z0-9.-]+`: Matches the domain name after the "@" symbol.
- `\.[a-zA-Z]{2,}$`: Matches the top-level domain, like `.com`, `.net`, etc.
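A quick way to see what this pattern accepts and rejects is to wrap it in a small helper. The sketch below uses Python's `re` module; the function name `looks_like_email` and the sample addresses are made up for illustration.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def looks_like_email(text: str) -> bool:
    """Return True if text has the basic shape user@domain.tld."""
    return EMAIL_RE.fullmatch(text) is not None

print(looks_like_email("user.name+tag@example.com"))  # True
print(looks_like_email("user@example"))               # False: no top-level domain
print(looks_like_email("user@@example.com"))          # False: two "@" symbols
print(looks_like_email("user@example..com"))          # True: the pattern is permissive
```

The last case shows why this is only a basic check: the domain part allows consecutive dots, so some malformed addresses still pass.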
- 2. **Matching URLs**:
A very basic example that matches URLs with an optional `http://` or `https://` prefix might look like this:
```regex
^(https?://)?(www\.)?[^ ]+\.[a-zA-Z]{2,}(/[^ ]*)?$
```
- `^(https?://)?`: Matches the start of the URL which might be "http://" or "https://".
- `(www\.)?`: Matches the optional "www." part.
- `[^ ]+\.[a-zA-Z]{2,}`: Matches the domain and top-level domain, ensuring no spaces are present in the URL.
- `(/[^ ]*)?`: Matches a forward slash followed by zero or more characters that aren't spaces. The `*` quantifier means the group can match just the slash, or the slash plus a path. The trailing `?` makes the entire group optional, so the regex matches URLs with or without the path part.
- Please note that URLs can have various formats and can contain parameters, paths, and anchors. The above regex is quite basic and may not catch all possible URLs.
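The following Python sketch tries the URL pattern on a few inputs to show both what it catches and where it is loose. The helper name `looks_like_url` and the sample strings are assumptions for the example.

```python
import re

URL_RE = re.compile(r"^(https?://)?(www\.)?[^ ]+\.[a-zA-Z]{2,}(/[^ ]*)?$")

def looks_like_url(text: str) -> bool:
    """Return True if text fits the basic URL shape described above."""
    return URL_RE.fullmatch(text) is not None

print(looks_like_url("https://www.example.com/path?q=1"))  # True
print(looks_like_url("example.com"))                       # True: scheme is optional
print(looks_like_url("http://example"))                    # False: no dot plus TLD
print(looks_like_url("not a url"))                         # False: contains spaces
print(looks_like_url("ftp://example.com"))                 # True: see note below
```

Because the scheme group is optional and `[^ ]+` accepts any non-space characters, `ftp://example.com` also passes. If only http and https should be accepted, the pattern would need to be tightened.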
- 3. **Matching Phone Numbers in the Format `(123) 456-7890`**:
```regex
^\(\d{3}\) \d{3}-\d{4}$
```
- `^\(`: Matches the opening parenthesis.
- `\d{3}`: Matches three digits.
- `\)` : Matches the closing parenthesis.
- ` \d{3}`: Matches three digits after a space.
- `-\d{4}$`: Matches the last four digits after a dash.
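If the individual parts of the number are needed, capture groups can be added around each run of digits. This is a sketch of that variation in Python, not part of the pattern above; the function name `parse_phone` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Same pattern as above, with a capture group around each run of digits.
PHONE_RE = re.compile(r"^\((\d{3})\) (\d{3})-(\d{4})$")

def parse_phone(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (area code, prefix, line number), or None if the format differs."""
    match = PHONE_RE.fullmatch(text)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)

print(parse_phone("(123) 456-7890"))  # ('123', '456', '7890')
print(parse_phone("123-456-7890"))    # None: parentheses are required
print(parse_phone("(123)456-7890"))   # None: the space is required
```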
- ### Commonly used regex searches:
- 1. **Date (YYYY-MM-DD)**
```regex
^\d{4}-\d{2}-\d{2}$
```
This regex is correct for the given format, but it will also match invalid dates like `2023-19-39`. For complete validation, a more complex regex or another form of date validation would be necessary.
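One possible form of the extra validation mentioned above is to let the regex check the shape and then hand the string to a date parser. The sketch below does this with Python's `datetime.strptime`; the function name `is_valid_date` is made up for the example.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_valid_date(text: str) -> bool:
    """Check the YYYY-MM-DD shape first, then check it is a real calendar date."""
    if DATE_RE.fullmatch(text) is None:
        return False  # wrong shape, e.g. missing zero padding
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, but month or day is out of range
    return True

print(is_valid_date("2024-02-29"))  # True: 2024 is a leap year
print(is_valid_date("2023-02-29"))  # False: 2023 is not a leap year
print(is_valid_date("2023-19-39"))  # False: passes the regex, fails the parser
print(is_valid_date("2023-1-5"))    # False: rejected by the regex
```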
- 2. **Time (HH:MM with 24-hour clock)**
```regex
^([01]\d|2[0-3]):[0-5]\d$
```
This is correct. It matches times from `00:00` to `23:59`.
- 3. **IP Address (IPv4)**
```regex
^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$
```
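This regex matches dotted-quad IPv4 addresses in which each octet is between 0 and 255. Note that it also accepts octets with leading zeros, such as `01.02.03.04`.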
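One detail worth checking is how the pattern treats out-of-range octets and leading zeros. The sketch below tests the regex in Python and, as an alternative approach, shows a helper built on the standard `ipaddress` module. The function names are hypothetical, and the standard-library behaviour for leading zeros depends on the Python version (recent versions reject them).

```python
import ipaddress
import re

IPV4_RE = re.compile(
    r"^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$"
)

def regex_accepts(text: str) -> bool:
    """Return True if the regex above matches the whole string."""
    return IPV4_RE.fullmatch(text) is not None

def stdlib_accepts(text: str) -> bool:
    """Alternative check that delegates parsing to the ipaddress module."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True

print(regex_accepts("192.168.1.1"))   # True
print(regex_accepts("256.1.1.1"))     # False: 256 is out of range
print(regex_accepts("1.2.3"))         # False: only three octets
print(regex_accepts("01.02.03.04"))   # True: leading zeros are allowed
print(stdlib_accepts("192.168.1.1"))  # True
print(stdlib_accepts("256.1.1.1"))    # False
```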
- 4. **MAC Address**
```regex
^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$
```
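Accurate for common MAC address formats with `:` or `-` separators. Because each separator is matched independently, it also accepts addresses that mix the two, such as `00:1A-2B:3C-4D:5E`.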
- 5. **Hexadecimal Color Code**
```regex
^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$
```
This matches 3 or 6 character hex color codes with or without the leading `#`.
- 6. **Username (8-20 alphanumeric characters)**
```regex
^[a-zA-Z0-9]{8,20}$
```
Correct for the specified criteria.
- 7. **Password**
```regex
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*()_+{}:"<>?|\[\]\/\\-]).{8,20}$
```
Matches passwords of 8-20 characters that contain at least one digit, lowercase letter, uppercase letter, and special character.
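Each `(?=...)` in this pattern is a lookahead: it checks that a required character appears somewhere ahead without consuming any input, and `.{8,20}` then enforces the length. A small Python sketch, with made-up sample passwords and a hypothetical helper name, shows how each requirement can fail on its own.

```python
import re

PASSWORD_RE = re.compile(
    r'^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])'
    r'(?=.*[!@#$%^&*()_+{}:"<>?|\[\]\/\\-]).{8,20}$'
)

def is_acceptable_password(text: str) -> bool:
    """Return True if text satisfies every lookahead and the length limit."""
    return PASSWORD_RE.fullmatch(text) is not None

print(is_acceptable_password("Passw0rd!"))     # True
print(is_acceptable_password("password1!"))    # False: no uppercase letter
print(is_acceptable_password("NoSpecial123"))  # False: no special character
print(is_acceptable_password("Sh0rt!"))        # False: fewer than 8 characters
```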
- 8. **Postal/ZIP Code (for U.S.)**
```regex
^\d{5}(-\d{4})?$
```
Matches both 5-digit ZIP codes and ZIP+4 formats for the U.S.
- 9. **Credit Card Number**
```regex
^\d{4}-?\d{4}-?\d{4}-?\d{4}$
```
Matches 16-digit credit card numbers with optional `-` separators.
- 10. **Social Security Number (U.S. format)**
```regex
^\d{3}-\d{2}-\d{4}$
```
Accurate for the U.S. SSN format.
- 11. **UUID**
```regex
^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$
```
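Matches UUIDs in the canonical 8-4-4-4-12 hexadecimal format. The `[1-5]` restricts the version digit to versions 1 through 5, and `[89abAB]` restricts the variant to the standard RFC 4122 variant, so other values (including the all-zero nil UUID) do not match.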
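The pattern can be checked against UUIDs produced by Python's standard `uuid` module. This is a small sketch; the helper name `is_canonical_uuid` is an assumption for the example.

```python
import re
import uuid

UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}"
    r"-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
)

def is_canonical_uuid(text: str) -> bool:
    """Return True if text matches the UUID pattern above."""
    return UUID_RE.fullmatch(text) is not None

generated = str(uuid.uuid4())  # random version 4 UUID, lowercase hex

print(is_canonical_uuid(generated))          # True
print(is_canonical_uuid(generated.upper()))  # True: both cases are allowed
print(is_canonical_uuid("00000000-0000-0000-0000-000000000000"))  # False: version digit is 0
print(is_canonical_uuid(generated.replace("-", "")))  # False: hyphens are required
```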
- 12. **File Path (Windows format)**
```regex
^([a-zA-Z]:\\)?(?:[a-zA-Z0-9]+\\?)*$
```
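Matches basic Windows file paths made of an optional drive letter and alphanumeric components separated by backslashes. It is a simplification: names containing dots, spaces, or underscores are rejected (so `C:\Users\file.txt` does not match), the empty string is accepted, and the nested quantifiers can backtrack heavily on long non-matching input.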
- 13. **File Path (Mac/Unix format)**
```regex
^(/[^/ ]+)+/?$
```
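Matches basic absolute Mac and Unix file paths whose components contain no spaces, such as `/usr/local/bin`. It does not match the root path `/` on its own, or relative paths.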
- 14. **HTML Tags**
```regex
<(/?[^>]+)>
```
This matches simple opening and closing HTML tags and captures everything between `<` and `>`, attributes included. It won't handle all edge cases, such as a `>` inside an attribute value, or comments.
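To see what the capture group returns in practice, here is a short Python sketch using `re.findall`. The HTML snippets are invented for the example.

```python
import re

TAG_RE = re.compile(r"<(/?[^>]+)>")

html = '<p class="intro">Hello <b>world</b></p>'
# With one capture group, findall returns the captured text of each match.
tags = TAG_RE.findall(html)

# An attribute value containing ">" ends the match too early.
tricky = '<a title="a > b">link</a>'
tricky_tags = TAG_RE.findall(tricky)

print(tags)         # ['p class="intro"', 'b', '/b', '/p']
print(tricky_tags)  # ['a title="a ', '/a']
```

The second result shows the kind of edge case the pattern cannot handle. For anything beyond simple tag spotting, a real HTML parser is usually the safer choice.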