API Reference
The regexfactory module documentation!
Base Pattern Module
Module for the RegexPattern class.
- regexfactory.pattern.ValidPatternType
alias of Union[Pattern, str, RegexPattern]
- regexfactory.pattern.ESCAPED_CHARACTERS = '()[]{}?*+-|^$\\.&~#'
Special characters that need to be escaped to be used without their special meanings.
- regexfactory.pattern.join(*patterns: Pattern | str | RegexPattern) -> RegexPattern
Umbrella function for combining ValidPatternType values into a RegexPattern.
- regexfactory.pattern.escape(string: str) -> RegexPattern
Escapes special characters in a string so that they can be used without their special meanings.
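As a rough illustration of what escaping means, the sketch below prefixes every character listed in ESCAPED_CHARACTERS with a backslash and checks the result with the standard re module. The helper name escape_sketch is hypothetical, and this is not necessarily how regexfactory.pattern.escape is implemented.

```python
import re

# Same characters as regexfactory.pattern.ESCAPED_CHARACTERS above.
ESCAPED_CHARACTERS = "()[]{}?*+-|^$\\.&~#"


def escape_sketch(string: str) -> str:
    """Hypothetical helper: backslash-escape each special character."""
    return "".join(
        "\\" + char if char in ESCAPED_CHARACTERS else char for char in string
    )


escaped = escape_sketch("1+1=2?")
print(escaped)                                        # 1\+1=2\?
print(re.fullmatch(escaped, "1+1=2?") is not None)    # True
# Without escaping, + and ? act as quantifiers and the literal text fails.
print(re.fullmatch("1+1=2?", "1+1=2?") is not None)   # False
```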
- class regexfactory.pattern.RegexPattern(pattern: Pattern | str | RegexPattern, /, _precedence: int = 1)
The main object that represents Regular Expression Pattern strings for this library.
- __add__(other: Pattern | str | RegexPattern) -> RegexPattern
Adds two ValidPatternType values together into a RegexPattern.
- __mul__(coefficient: int) -> RegexPattern
Treats the RegexPattern as a string and multiplies it by an integer.
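The multiplication reads like ordinary string repetition. A minimal sketch with plain strings and the standard re module (no regexfactory objects involved), assuming the pattern text is simply repeated:

```python
import re

digit = r"\d"            # raw regex text for a single digit
two_digits = digit * 2   # string repetition gives \d\d

timestamp = two_digits + ":" + two_digits
print(timestamp)                                     # \d\d:\d\d
print(re.fullmatch(timestamp, "09:59") is not None)  # True
print(re.fullmatch(timestamp, "9:59") is not None)   # False
```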
- static get_regex(obj: Pattern | str | RegexPattern, /) -> str
Extracts the regex content from RegexPattern or re.Pattern objects; otherwise returns the input str.
- compile(*, flags: int = 0) -> Pattern
See re.compile().
- sub(replacement: str, content: str, /, count: int = 0, *, flags: int = 0) -> str
See re.Pattern.sub().
Regex Pattern Subclasses
Module for regex pattern classes like [^abc] or (abc) or a|b.
- class regexfactory.patterns.Or(*patterns: Pattern | str | RegexPattern)
For matching multiple patterns. This pattern or that pattern or that other pattern.
    from regexfactory import Or

    patt = Or("Bob", "Alice", "Sally")
    print(patt.match("Alice"))
    print(patt.match("Bob"))
    print(patt.match("Sally"))

    <re.Match object; span=(0, 5), match='Alice'>
    <re.Match object; span=(0, 3), match='Bob'>
    <re.Match object; span=(0, 5), match='Sally'>
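For comparison, here is a sketch of the same alternation written directly with the standard re module. The non-capturing group layout mirrors the compiled Or output printed in the Set entry of this reference; treating it as exactly what Or("Bob", "Alice", "Sally") produces is an assumption.

```python
import re

# Each alternative sits in its own non-capturing group, joined by |.
patt = re.compile(r"(?:Bob)|(?:Alice)|(?:Sally)")

results = {name: patt.match(name) is not None
           for name in ("Alice", "Bob", "Sally", "Carol")}
print(results)
```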
- class regexfactory.patterns.Range(start: str, stop: str)
For matching characters between two character indices (using the Unicode numbers of the input characters). You can use chr() and ord() to translate characters to their Unicode numbers and back again. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. Thus, matching characters between 'a' and 'z' is really checking whether a character's Unicode number is between ord('a') and ord('z').
    from regexfactory import Range, Or

    patt = Or("Bob", Range("a", "z"))
    print(patt.findall("my job is working for Bob"))

    ['m', 'y', 'j', 'o', 'b', 'i', 's', 'w', 'o', 'r', 'k', 'i', 'n', 'g', 'f', 'o', 'r', 'Bob']
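The code-point comparison described above can be checked with the standard library alone. This sketch assumes that Range("a", "z") corresponds to the character class [a-z]; the helper in_range is hypothetical and only restates the ord() comparison.

```python
import re

print(ord("a"), ord("z"))   # 97 122
print(chr(97), chr(8364))   # a €


def in_range(char: str, start: str = "a", stop: str = "z") -> bool:
    """Hypothetical helper: compare Unicode numbers directly."""
    return ord(start) <= ord(char) <= ord(stop)


lowercase = re.compile("[a-z]")
# The character class and the ord() comparison agree for every sample.
agreement = [
    (lowercase.fullmatch(char) is not None) == in_range(char)
    for char in "amzAZ0€"
]
print(all(agreement))       # True
```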
- class regexfactory.patterns.Set(*patterns: Pattern | str | RegexPattern)
For matching a single character from a list of characters. Keep in mind that special characters like + and . lose their meanings inside a set/list, so there is no need to escape them here to use them.

In practice, Set("a", ".", "z") functions much like Or("a", ".", "z"). The difference is that Or accepts RegexPatterns and Set accepts characters only. Special characters do NOT lose their special meanings inside an Or, though, as the example below shows. The other big difference is performance: Or is a lot slower than Set.

    import time
    from regexfactory import Or, Set

    start_set = time.time()
    print(patt := Set(*"a.z").compile())
    print("Set took", time.time() - start_set, "seconds to compile")
    print("And the resulting match is", patt.match("b"))
    print()
    start_or = time.time()
    print(patt := Or(*"a.z").compile())
    print("Or took", time.time() - start_or, "seconds to compile")
    print("And the resulting match is", patt.match("b"))

    re.compile('[a.z]')
    Set took 0.00012803077697753906 seconds to compile
    And the resulting match is None

    re.compile('(?:a)|(?:.)|(?:z)')
    Or took 0.00012755393981933594 seconds to compile
    And the resulting match is <re.Match object; span=(0, 1), match='b'>
- class regexfactory.patterns.NotSet(*patterns: Pattern | str | RegexPattern)
For matching a character that is NOT in a list of characters. Keep in mind that special characters lose their special meanings inside a NotSet as well.

    from regexfactory import NotSet, Set

    not_abc = NotSet(*"abc")
    is_abc = Set(*"abc")
    print(not_abc.match("x"))
    print(is_abc.match("x"))

    <re.Match object; span=(0, 1), match='x'>
    None
- class regexfactory.patterns.Amount(pattern: Pattern | str | RegexPattern, i: int, j: int | None = None, or_more: bool = False, greedy: bool = True)
For matching multiple occurrences of a ValidPatternType. You can match a specific number of occurrences only, match with a lower bound on the number of occurrences of a pattern, or match with both a lower and an upper bound. You can also pass greedy=False as a keyword argument to Amount (the default is True), which tells the regex compiler to match as few characters as possible rather than the default behavior of matching as many characters as possible. Best explained with an example.

    from regexfactory import Amount, Set

    # We are using the same Pattern with different amounts.
    content = "acbccbaabbccaaca"
    specific_amount = Amount(Set(*"abc"), 2)
    lower_and_upper_bound = Amount(Set(*"abc"), 3, 5, greedy=False)
    lower_and_upper_bound_greedy = Amount(Set(*"abc"), 3, 5)
    lower_bound_only = Amount(Set(*"abc"), 5, or_more=True, greedy=False)

    print(specific_amount.findall(content))
    print(lower_and_upper_bound_greedy.findall(content))
    print(lower_and_upper_bound.findall(content))
    print(lower_bound_only.findall(content))

    ['ac', 'bc', 'cb', 'aa', 'bb', 'cc', 'aa', 'ca']
    ['acbcc', 'baabb', 'ccaac']
    ['acb', 'ccb', 'aab', 'bcc', 'aac']
    ['acbcc', 'baabb', 'ccaac']
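The four cases above line up with the standard brace quantifiers. The sketch below writes them out by hand with the re module; that Amount emits exactly these quantifier strings is an assumption, although the results agree with the output shown above.

```python
import re

content = "acbccbaabbccaaca"

exact = re.findall(r"[abc]{2}", content)         # exactly two
greedy = re.findall(r"[abc]{3,5}", content)      # three to five, as many as possible
lazy = re.findall(r"[abc]{3,5}?", content)       # three to five, as few as possible
lazy_min = re.findall(r"[abc]{5,}?", content)    # five or more, as few as possible

print(exact)     # ['ac', 'bc', 'cb', 'aa', 'bb', 'cc', 'aa', 'ca']
print(greedy)    # ['acbcc', 'baabb', 'ccaac']
print(lazy)      # ['acb', 'ccb', 'aab', 'bcc', 'aac']
print(lazy_min)  # ['acbcc', 'baabb', 'ccaac']
```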
- class regexfactory.patterns.Multi(pattern: Pattern | str | RegexPattern, match_zero: bool = False, greedy: bool = True)
Matches one or more occurrences of the given ValidPatternType. If match_zero=True is passed to the init method, it matches zero or more occurrences.
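In raw regex terms, one-or-more and zero-or-more are the + and * quantifiers, and a trailing ? makes either one lazy. A small sketch with the standard re module, assuming Multi maps onto these quantifiers:

```python
import re

one_or_more = re.compile(r"\d+")
zero_or_more = re.compile(r"\d*")

print(one_or_more.fullmatch("2024") is not None)   # True
print(one_or_more.fullmatch("") is not None)       # False
print(zero_or_more.fullmatch("") is not None)      # True

# Greedy takes as many digits as possible, lazy takes as few as possible.
greedy_match = re.match(r"\d+", "2024").group()
lazy_match = re.match(r"\d+?", "2024").group()
print(greedy_match, lazy_match)                    # 2024 2
```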
- class regexfactory.patterns.Optional(pattern: Pattern | str | RegexPattern, greedy: bool = True)
Matches the passed ValidPatternType between zero and one times. Functions the same as Amount(pattern, 0, 1).
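The equivalence between "optional" and "between zero and one times" can be seen with plain regex syntax. This sketch uses only the standard re module and assumes Optional corresponds to the ? quantifier:

```python
import re

with_question_mark = re.compile(r"colou?r")
with_braces = re.compile(r"colou{0,1}r")   # same meaning, spelled as a range

words = ["color", "colour", "colouur"]
question_mark_results = [with_question_mark.fullmatch(w) is not None for w in words]
brace_results = [with_braces.fullmatch(w) is not None for w in words]

print(question_mark_results)   # [True, True, False]
print(brace_results)           # [True, True, False]
```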
- class regexfactory.patterns.NamedGroup(name: str, pattern: Pattern | str | RegexPattern)
Lets you separate your regex into named groups that you can extract from re.Match.groupdict().

    from regexfactory import NamedGroup, WORD, Multi

    stuff = "George Washington"
    patt = NamedGroup("first_name", Multi(WORD)) + " " + NamedGroup("last_name", Multi(WORD))
    print(match := patt.match(stuff))
    print(match.groupdict())

    <re.Match object; span=(0, 17), match='George Washington'>
    {'first_name': 'George', 'last_name': 'Washington'}
- class regexfactory.patterns.NamedReference(group_name: str | NamedGroup)
Lets you reference NamedGroups that you have already created, by name or by passing the NamedGroup itself.

    from regexfactory import NamedReference, NamedGroup, DIGIT, RegexPattern

    timestamp = NamedGroup("time_at", f"{DIGIT * 2}:{DIGIT * 2}am")
    patt = RegexPattern(f"Created at {timestamp}, and then updated at {NamedReference(timestamp)}")
    patt2 = RegexPattern(f"Created at {timestamp}, and then updated at {NamedReference('time_at')}")
    print(repr(patt))
    print(repr(patt2))

    <RegexPattern 'Created at (?P<time_at>\d\d:\d\dam), and then updated at (?P=time_at)'>
    <RegexPattern 'Created at (?P<time_at>\d\d:\d\dam), and then updated at (?P=time_at)'>
- class regexfactory.patterns.NumberedReference(group_number: int)
Lets you reference the literal match of a Group that you have already created, by its group index.

    from regexfactory import NumberedReference, Group, DIGIT, RegexPattern

    timestamp = Group(f"{DIGIT * 2}:{DIGIT * 2}am")
    patt = RegexPattern(f"{timestamp},{NumberedReference(1)},{NumberedReference(1)}")
    print(patt.match("09:59am,09:59am,09:59am"))
    print(patt.match("09:59am,13:00am,09:50am"))

    <re.Match object; span=(0, 23), match='09:59am,09:59am,09:59am'>
    None
- class regexfactory.patterns.Comment(content: str)
Lets you include comment strings that are ignored by regex compilers to document your regexes.

    from regexfactory import Comment, DIGIT, WORD, Or

    patt = Or(DIGIT, WORD)
    patt_with_comment = patt + Comment("I love comments in regex!")
    print("Pattern without comment:", patt)
    print("Pattern with comment", patt_with_comment)
    print(patt.match("1"))
    print(patt.match("a"))

    Pattern without comment: (?:\d)|(?:\w)
    Pattern with comment (?:\d)|(?:\w)(?#I love comments in regex!)
    <re.Match object; span=(0, 1), match='1'>
    <re.Match object; span=(0, 1), match='a'>
- class regexfactory.patterns.IfAhead(pattern: Pattern | str | RegexPattern)
A mini if-statement in regex. It does not consume any string content. Makes the whole pattern match only if followed by the given pattern at this position in the whole pattern.
    from regexfactory import IfAhead, escape, WORD, Multi, Or

    name = Multi(WORD) + IfAhead(
        Or(
            escape(" Jr."),
            escape(" Sr."),
        )
    )
    print(name.findall("Bob Jr. and John Sr. love hanging out with each other."))

    ['Bob', 'John']
- class regexfactory.patterns.IfNotAhead(pattern: Pattern | str | RegexPattern)
A mini if-statement in regex. It does not consume any string content. Makes the whole pattern match only if NOT followed by the given pattern at this position in the whole pattern.
    from regexfactory import IfNotAhead, RegexPattern

    patt = RegexPattern("Foo") + IfNotAhead("bar")
    print(patt.match("Foo"))
    print(patt.match("Foobar"))
    print(patt.match("Fooba"))

    <re.Match object; span=(0, 3), match='Foo'>
    None
    <re.Match object; span=(0, 3), match='Foo'>
- class regexfactory.patterns.IfBehind(pattern: Pattern | str | RegexPattern)
A mini if-statement in regex. It does not consume any string content. Makes the whole pattern match only if preceded by the given pattern at this position in the whole pattern.
    from regexfactory import IfBehind, DIGIT, Multi, Optional

    rank = IfBehind("Rank: ") + Multi(DIGIT)
    print(rank.findall("Rank: 27, Score: 30, Power: 123"))

    ['27']
- class regexfactory.patterns.IfNotBehind(pattern: Pattern | str | RegexPattern)
A mini if-statement in regex. It does not consume any string content. Makes the whole pattern match only if NOT preceded by the given pattern at this position in the whole pattern.
    from regexfactory import IfNotBehind, WORD, Multi, DIGIT

    patt = IfNotBehind(WORD) + Multi(DIGIT)
    print(patt.match("b64"))
    print(patt.match("64"))

    None
    <re.Match object; span=(0, 2), match='64'>
- class regexfactory.patterns.Group(pattern: Pattern | str | RegexPattern, capturing: bool = True)
For separating your Patterns into fields for extraction. Basically, you use Group to reference the regex inside of it later with NumberedReference. Passing capturing=False unifies the regex inside the group into a single token but does not capture the group, as seen below.

    from regexfactory import Group, WORD, Multi

    name = Group(Multi(WORD)) + " " + Group(Multi(WORD), capturing=False)
    print(name.match("Nate Larsen").groups())

    ('Nate',)
- class regexfactory.patterns.IfGroup(name_or_id: str | int, yes_pattern: Pattern | str | RegexPattern, no_pattern: Pattern | str | RegexPattern)
Matches with yes_pattern if the given group name or group index succeeds in matching and exists; otherwise matches with no_pattern.

    from regexfactory import IfGroup, NamedGroup, Optional, escape

    patt = (
        Optional(NamedGroup("title", escape("Mr. ")))
        + IfGroup("title", "Dillon", NamedGroup("first_name", "Bob"))
        + Optional(IfGroup("first_name", " Dillon", ""))
    )
    # If NamedGroup "title" matches then use the last name pattern
    # else use the first name pattern
    print(patt.match("Mr. Dillon"))
    print(patt.match("Mr. Bob"))
    print(patt.match("Mr Dillon"))
    print(patt.match("Bob"))
    print(patt.match("Bob Dillon"))

    <re.Match object; span=(0, 10), match='Mr. Dillon'>
    None
    None
    <re.Match object; span=(0, 3), match='Bob'>
    <re.Match object; span=(0, 10), match='Bob Dillon'>
Regex Characters
Common regex special characters, such as \d, ., …
More information about special characters in Python regex is available in the documentation of the re module.
- regexfactory.chars.ANY = <RegexPattern '.'>
(Dot.) In the default mode, this matches any character except a newline. If the re.DOTALL flag has been specified, this matches any character including a newline.
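A short check of the newline behaviour, using the raw pattern text '.' with the standard re module rather than the ANY object itself:

```python
import re

default_match = re.match(r"a.b", "a\nb")
dotall_match = re.match(r"a.b", "a\nb", flags=re.DOTALL)

print(default_match)             # None: the dot does not match the newline
print(dotall_match is not None)  # True: with re.DOTALL it does
```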
- regexfactory.chars.ANCHOR_START = <RegexPattern '^'>
(Caret.) Matches the start of the string, and in re.MULTILINE mode also matches immediately after each newline.
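The effect of re.MULTILINE on the caret can be seen with the raw pattern text and the standard re module:

```python
import re

text = "one\ntwo\nthree"

first_only = re.findall(r"^\w+", text)
every_line = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(first_only)   # ['one']
print(every_line)   # ['one', 'two', 'three']
```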
- regexfactory.chars.ANCHOR_END = <RegexPattern '$'>
Matches the end of the string or just before the newline at the end of the string, and in re.MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in re.MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
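The cases described above can be reproduced with the standard re module, again using the raw pattern text rather than the ANCHOR_END object:

```python
import re

text = "foo1\nfoo2\n"

normal = re.search(r"foo.$", text).group()
multiline = re.search(r"foo.$", text, flags=re.MULTILINE).group()
empty_matches = re.findall(r"$", "foo\n")

print(normal)         # foo2
print(multiline)      # foo1
print(empty_matches)  # ['', '']
```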
- regexfactory.chars.WHITESPACE = <RegexPattern '\s'>
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the re.ASCII flag is used, only [ \t\n\r\f\v] is matched.
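The non-breaking space is a convenient way to see the difference that re.ASCII makes. A minimal check with the standard re module:

```python
import re

nbsp = "\u00a0"  # non-breaking space

unicode_nbsp = re.fullmatch(r"\s", nbsp) is not None
ascii_nbsp = re.fullmatch(r"\s", nbsp, flags=re.ASCII) is not None
ascii_tab = re.fullmatch(r"\s", "\t", flags=re.ASCII) is not None

print(unicode_nbsp)  # True
print(ascii_nbsp)    # False
print(ascii_tab)     # True
```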
- regexfactory.chars.NOTWHITESPACE = <RegexPattern '\S'>
Matches any character which is not a whitespace character. This is the opposite of \s. If the re.ASCII flag is used, this becomes the equivalent of [^ \t\n\r\f\v].
- regexfactory.chars.WORD = <RegexPattern '\w'>
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the re.ASCII flag is used, only [a-zA-Z0-9_] is matched.
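Accented letters show where Unicode and ASCII word matching diverge. A small sketch with the standard re module and the raw pattern text:

```python
import re

text = "naïve café_42"

unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['naïve', 'café_42']
print(ascii_words)    # ['na', 've', 'caf', '_42']
```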
- regexfactory.chars.NOTWORD = <RegexPattern '\W'>
Matches any character which is not a word character. This is the opposite of \w. If the re.ASCII flag is used, this becomes the equivalent of [^a-zA-Z0-9_]. If the re.LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore.