API Reference

The regexfactory module documentation!

Base Pattern Module

Module for the RegexPattern class.

regexfactory.pattern.ValidPatternType: alias of Union[Pattern, str, RegexPattern]

regexfactory.pattern.ESCAPED_CHARACTERS = '()[]{}?*+-|^$\\.&~#': Special characters that need to be escaped to be used without their special meanings.

regexfactory.pattern.join(*patterns: Pattern | str | RegexPattern) → RegexPattern: Umbrella function for combining ValidPatternType’s into a RegexPattern.

regexfactory.pattern.escape(string: str) → RegexPattern: Escapes special characters in a string to use them without their special meanings.
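As a rough illustration of what escaping accomplishes, the sketch below uses the standard library's re.escape rather than regexfactory.pattern.escape. The two may not escape exactly the same set of characters, so treat it as an analogy only.

```python
import re

literal = "1+1=2 (really?)"

# Unescaped, "+", "(", ")" and "?" act as regex operators,
# so the pattern does not match its own literal text.
unescaped_match = re.fullmatch(literal, literal)

# re.escape backslash-escapes the special characters so the
# pattern matches the text exactly as written.
escaped_pattern = re.escape(literal)
escaped_match = re.fullmatch(escaped_pattern, literal)

print(unescaped_match)
print(escaped_match)
```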

class regexfactory.pattern.RegexPattern(pattern: Pattern | str | RegexPattern, /, _precedence: int = 1)

The main object that represents Regular Expression Pattern strings for this library.

precedence: int

__add__(other: Pattern | str | RegexPattern) → RegexPattern: Adds two ValidPatternType’s together, into a RegexPattern

__mul__(coefficient: int) → RegexPattern: Treats RegexPattern as a string and multiplies it by an integer.
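Since __mul__ is described as treating the pattern as a string, a plain-string analogue may help. The sketch below uses ordinary str objects and the standard re module, not RegexPattern itself, so any grouping or precedence handling the library adds is not shown.

```python
import re

digit = r"\d"

# String multiplication simply repeats the pattern text.
two_digits = digit * 2

# String concatenation joins the pieces in order.
timestamp = two_digits + ":" + two_digits

print(timestamp)
print(re.fullmatch(timestamp, "09:59"))
```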

static get_regex(obj: Pattern | str | RegexPattern, /) → str: Extracts the regex content from a RegexPattern or re.Pattern object; otherwise returns the input str.

compile(*, flags: int = 0) → Pattern: See re.compile().

match(content: str, /, *, flags: int = 0) → Match | None: See re.Pattern.match().

fullmatch(content: str, /, *, flags: int = 0) → Match | None: See re.Pattern.fullmatch().

findall(content: str, /, *, flags: int = 0) → List[Tuple[str, ...]]: See re.Pattern.findall().

finditer(content: str, /, *, flags: int = 0) → Iterator[Match]: See re.Pattern.finditer().

split(content: str, /, maxsplit: int = 0, *, flags: int = 0) → List[Any]: See re.Pattern.split().

sub(replacement: str, content: str, /, count: int = 0, *, flags: int = 0) → str: See re.Pattern.sub().

subn(replacement: str, content: str, /, count: int = 0, *, flags: int = 0) → Tuple[str, int]: See re.Pattern.subn().

search(content: str, /, pos: int = 0, endpos: int = 0, *, flags: int = 0) → Match | None: See re.Pattern.search().
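The methods above all point to their re.Pattern counterparts. As a reminder of how those counterparts differ, here is a short sketch using a compiled re.Pattern directly; it assumes the RegexPattern wrappers behave the same way, which the source only implies through its "See re.Pattern..." notes.

```python
import re

pattern = re.compile(r"\d+")
text = "abc 123 def"

# match() only succeeds if the pattern matches at the start of the string.
at_start = pattern.match(text)

# search() scans forward and returns the first match anywhere.
anywhere = pattern.search(text)

# fullmatch() requires the entire string to match.
whole = pattern.fullmatch("123")

# findall(), sub() and split() operate on every non-overlapping match.
all_numbers = pattern.findall("1 22 333")
masked = pattern.sub("#", "1 22 333")
pieces = pattern.split("a1b22c")

print(at_start, anywhere, whole)
print(all_numbers, masked, pieces)
```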

Regex Pattern Subclasses

Module for Regex pattern classes like [^abc] or (abc) or a|b

class regexfactory.patterns.Or(*patterns: Pattern | str | RegexPattern)

For matching multiple patterns. This pattern or that pattern or that other pattern.

from regexfactory import Or

patt = Or("Bob", "Alice", "Sally")

print(patt.match("Alice"))
print(patt.match("Bob"))
print(patt.match("Sally"))

<re.Match object; span=(0, 5), match='Alice'>
<re.Match object; span=(0, 3), match='Bob'>
<re.Match object; span=(0, 5), match='Sally'>

precedence: int

class regexfactory.patterns.Range(start: str, stop: str)

For matching characters between two character indices (using the Unicode numbers of the input characters). You can use chr() and ord() to translate characters to their Unicode numbers and back again. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. Thus, matching characters between 'a' and 'z' is really checking whether a character's Unicode number is between ord('a') and ord('z').

from regexfactory import Range, Or

patt = Or("Bob", Range("a", "z"))

print(patt.findall("my job is working for Bob"))

['m', 'y', 'j', 'o', 'b', 'i', 's', 'w', 'o', 'r', 'k', 'i', 'n', 'g', 'f', 'o', 'r', 'Bob']
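The code point comparison described for Range can be checked with a small sketch. It uses the standard re module and assumes that Range("a", "z") corresponds to the character class [a-z], which the source does not show explicitly; in_range is a hypothetical helper written for this illustration.

```python
import re

# ord() and chr() translate between characters and Unicode numbers.
print(ord("a"), ord("z"))
print(chr(97), chr(8364))


def in_range(char: str, start: str, stop: str) -> bool:
    """Hypothetical helper: compare Unicode numbers directly."""
    return ord(start) <= ord(char) <= ord(stop)


lowercase = re.compile("[a-z]")

# The manual comparison and the regex character class agree.
manual = [in_range(char, "a", "z") for char in "mZ€"]
by_regex = [bool(lowercase.fullmatch(char)) for char in "mZ€"]

print(manual, by_regex)
```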

precedence: int

class regexfactory.patterns.Set(*patterns: Pattern | str | RegexPattern)

For matching a single character from a list of characters. Keep in mind that special characters like + and . lose their meanings inside a set/list, so there is no need to escape them here to use them.

In practice, Set("a", ".", "z") functions the same as Or("a", ".", "z"). The difference is that Or accepts RegexPatterns and Set accepts characters only. Special characters do NOT lose their special meanings inside an Or, though. The other big difference is performance: Or is a lot slower than Set.

import time
from regexfactory import Or, Set

start_set = time.time()
print(patt := Set(*"a.z").compile())
print("Set took", time.time() - start_set, "seconds to compile")
print("And the resulting match is", patt.match("b"))

print()

start_or = time.time()
print(patt := Or(*"a.z").compile())
print("Or took", time.time() - start_or, "seconds to compile")
print("And the resulting match is", patt.match("b"))

re.compile('[a.z]')
Set took 0.00012803077697753906 seconds to compile
And the resulting match is None

re.compile('(?:a)|(?:.)|(?:z)')
Or took 0.00012755393981933594 seconds to compile
And the resulting match is <re.Match object; span=(0, 1), match='b'>
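The timings above measure compilation only. To compare matching speed, one option is a timeit run over the two compiled forms. The sketch below uses the standard re module with the pattern strings printed above, except that the dot in the alternation is escaped so both patterns match the same characters. Actual timings depend on the machine and Python version, so none are quoted here.

```python
import re
import timeit

set_pattern = re.compile(r"[a.z]")
# The dot is escaped here so it matches a literal "." like the set does.
or_pattern = re.compile(r"(?:a)|(?:\.)|(?:z)")

text = "lorem ipsum. dolor sit amet, zzz. " * 1000

# Both patterns should find exactly the same characters.
set_hits = set_pattern.findall(text)
or_hits = or_pattern.findall(text)

# Time repeated scans of the same text with each pattern.
set_time = timeit.timeit(lambda: set_pattern.findall(text), number=20)
or_time = timeit.timeit(lambda: or_pattern.findall(text), number=20)

print("Set-style pattern:", set_time, "seconds")
print("Or-style pattern: ", or_time, "seconds")
```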

precedence: int

class regexfactory.patterns.NotSet(*patterns: Pattern | str | RegexPattern)

For matching a character that is NOT in a list of characters. Keep in mind special characters lose their special meanings inside NotSet’s as well.

from regexfactory import NotSet, Set

not_abc = NotSet(*"abc")

is_abc = Set(*"abc")

print(not_abc.match("x"))
print(is_abc.match("x"))

<re.Match object; span=(0, 1), match='x'>
None

precedence: int

class regexfactory.patterns.Amount(pattern: Pattern | str | RegexPattern, i: int, j: int | None = None, or_more: bool = False, greedy: bool = True)

For matching multiple occurrences of a ValidPatternType. You can match a specific number of occurrences only, match with a lower bound on the number of occurrences of a pattern, or match with both a lower and an upper bound on the number of occurrences. You can also pass a greedy=False keyword argument to Amount (the default is True), which tells the regex compiler to match as few characters as possible rather than the default behavior, which is to match as many characters as possible.

Best explained with an example.

from regexfactory import Amount, Set

# We are using the same Pattern with different amounts.

content = "acbccbaabbccaaca"

specific_amount = Amount(Set(*"abc"), 2)

lower_and_upper_bound = Amount(Set(*"abc"), 3, 5, greedy=False)

lower_and_upper_bound_greedy = Amount(Set(*"abc"), 3, 5)

lower_bound_only = Amount(Set(*"abc"), 5, or_more=True, greedy=False)

print(specific_amount.findall(content))
print(lower_and_upper_bound_greedy.findall(content))
print(lower_and_upper_bound.findall(content))
print(lower_bound_only.findall(content))

['ac', 'bc', 'cb', 'aa', 'bb', 'cc', 'aa', 'ca']
['acbcc', 'baabb', 'ccaac']
['acb', 'ccb', 'aab', 'bcc', 'aac']
['acbcc', 'baabb', 'ccaac']
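For readers more familiar with raw regex syntax, the four cases above can be reproduced with standard quantifiers. This sketch uses the re module directly; that Amount renders to exactly these strings is an assumption, though the results agree with the output shown above.

```python
import re

content = "acbccbaabbccaaca"

# {2}     exactly two occurrences
exactly_two = re.findall(r"[abc]{2}", content)

# {3,5}   between three and five, greedy (as many as possible)
three_to_five_greedy = re.findall(r"[abc]{3,5}", content)

# {3,5}?  between three and five, lazy (as few as possible)
three_to_five_lazy = re.findall(r"[abc]{3,5}?", content)

# {5,}?   five or more, lazy
five_or_more_lazy = re.findall(r"[abc]{5,}?", content)

print(exactly_two)
print(three_to_five_greedy)
print(three_to_five_lazy)
print(five_or_more_lazy)
```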

precedence: int

class regexfactory.patterns.Multi(pattern: Pattern | str | RegexPattern, match_zero: bool = False, greedy: bool = True)

Matches one or more occurrences of the given ValidPatternType. If match_zero=True is passed to the init method, it matches zero or more occurrences.
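A minimal sketch of the difference, written with the standard re module. It assumes Multi corresponds to the + quantifier, to * when match_zero=True, and to the lazy forms when greedy=False; the source does not show the rendered patterns.

```python
import re

one_or_more = re.compile(r"\d+")
zero_or_more = re.compile(r"\d*")

# "+" needs at least one digit, so the empty string fails.
plus_on_empty = one_or_more.fullmatch("")

# "*" accepts zero digits, so the empty string is a (zero-length) match.
star_on_empty = zero_or_more.fullmatch("")

# Greedy takes as many digits as it can; lazy takes as few as it can.
greedy = re.match(r"\d+", "2024").group()
lazy = re.match(r"\d+?", "2024").group()

print(plus_on_empty, star_on_empty)
print(greedy, lazy)
```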

precedence: int

class regexfactory.patterns.Optional(pattern: Pattern | str | RegexPattern, greedy: bool = True)

Matches the passed ValidPatternType between zero and one times. Functions the same as Amount(pattern, 0, 1).
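The equivalence with a zero-to-one bound can be seen in raw regex syntax. The sketch below uses the standard re module and assumes Optional corresponds to the ? quantifier.

```python
import re

with_question_mark = re.compile(r"colou?r")
with_bounds = re.compile(r"colou{0,1}r")

words = ("color", "colour", "colouur")

# Both spellings of the quantifier accept zero or one "u" and nothing more.
question_mark_results = [bool(with_question_mark.fullmatch(w)) for w in words]
bounds_results = [bool(with_bounds.fullmatch(w)) for w in words]

print(question_mark_results)
print(bounds_results)
```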

precedence: int

class regexfactory.patterns.NamedGroup(name: str, pattern: Pattern | str | RegexPattern)

Lets you separate your regex into named groups that you can extract from re.Match.groupdict().

from regexfactory import NamedGroup, WORD, Multi

stuff = "George Washington"

patt = NamedGroup("first_name", Multi(WORD)) + " " + NamedGroup("last_name", Multi(WORD))

print(match := patt.match(stuff))
print(match.groupdict())

<re.Match object; span=(0, 17), match='George Washington'>
{'first_name': 'George', 'last_name': 'Washington'}

precedence: int

class regexfactory.patterns.NamedReference(group_name: str | NamedGroup)

Lets you reference NamedGroup’s that you’ve already created, by name, or by passing the NamedGroup itself.

from regexfactory import NamedReference, NamedGroup, DIGIT, RegexPattern

timestamp = NamedGroup("time_at", f"{DIGIT * 2}:{DIGIT * 2}am")

patt = RegexPattern(f"Created at {timestamp}, and then updated at {NamedReference(timestamp)}")
patt2 = RegexPattern(f"Created at {timestamp}, and then updated at {NamedReference('time_at')}")
print(repr(patt))
print(repr(patt2))

<RegexPattern 'Created at (?P<time_at>\d\d:\d\dam), and then updated at (?P=time_at)'>
<RegexPattern 'Created at (?P<time_at>\d\d:\d\dam), and then updated at (?P=time_at)'>
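To see what the named backreference does at match time, the pattern string printed above can be fed to the standard re module. This sketch uses re directly rather than regexfactory; the sample sentences are made up for illustration.

```python
import re

# The pattern string is copied from the repr output above.
pattern = re.compile(
    r"Created at (?P<time_at>\d\d:\d\dam), and then updated at (?P=time_at)"
)

# (?P=time_at) must repeat the exact text captured by the named group.
same_time = pattern.match("Created at 09:59am, and then updated at 09:59am")
different_time = pattern.match("Created at 09:59am, and then updated at 10:15am")

print(same_time)
print(different_time)
```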

precedence: int

class regexfactory.patterns.NumberedReference(group_number: int)

Lets you reference the literal match to Group’s that you’ve already created, by its group index.

from regexfactory import NumberedReference, Group, DIGIT, RegexPattern

timestamp = Group(f"{DIGIT * 2}:{DIGIT * 2}am")

patt = RegexPattern(f"{timestamp},{NumberedReference(1)},{NumberedReference(1)}")
print(patt.match("09:59am,09:59am,09:59am"))
print(patt.match("09:59am,13:00am,09:50am"))

<re.Match object; span=(0, 23), match='09:59am,09:59am,09:59am'>
None

precedence: int

class regexfactory.patterns.Comment(content: str)

Lets you include comment strings that are ignored by regex compilers to document your regexes.

from regexfactory import Comment, DIGIT, WORD, Or

patt = Or(DIGIT, WORD)
patt_with_comment = patt + Comment("I love comments in regex!")

print("Pattern without comment:", patt)
print("Pattern with comment", patt_with_comment)
print(patt.match("1"))
print(patt.match("a"))

Pattern without comment: (?:\d)|(?:\w)
Pattern with comment (?:\d)|(?:\w)(?#I love comments in regex!)
<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(0, 1), match='a'>

precedence: int

class regexfactory.patterns.IfAhead(pattern: Pattern | str | RegexPattern)

A mini if-statement in regex. It does not consume any string content. Makes the whole pattern match only if followed by the given pattern at this position in the whole pattern.

from regexfactory import IfAhead, escape, WORD, Multi, Or

name = Multi(WORD) + IfAhead(
    Or(
        escape(" Jr."),
        escape(" Sr."),
    )
)

print(name.findall("Bob Jr. and John Sr. love hanging out with each other."))

['Bob', 'John']

precedence: int

class regexfactory.patterns.IfNotAhead(pattern: Pattern | str | RegexPattern)

A mini if-statement in regex. It does not consume any string content. Makes the whole pattern match only if NOT followed by the given pattern at this position in the whole pattern.

from regexfactory import IfNotAhead, RegexPattern

patt = RegexPattern("Foo") + IfNotAhead("bar")

print(patt.match("Foo"))
print(patt.match("Foobar"))
print(patt.match("Fooba"))

<re.Match object; span=(0, 3), match='Foo'>
None
<re.Match object; span=(0, 3), match='Foo'>

precedence: int

class regexfactory.patterns.IfBehind(pattern: Pattern | str | RegexPattern)

A mini if-statement in regex. It does not consume any string content. Makes the whole pattern match only if preceded by the given pattern at this position in the whole pattern.

from regexfactory import IfBehind, DIGIT, Multi

rank = IfBehind("Rank: ") + Multi(DIGIT)

print(rank.findall("Rank: 27, Score: 30, Power: 123"))

['27']

precedence: int

class regexfactory.patterns.IfNotBehind(pattern: Pattern | str | RegexPattern)

A mini if-statement in regex. It does not consume any string content. Makes the whole pattern match only if NOT preceded by the given pattern at this position in the whole pattern.

from regexfactory import IfNotBehind, WORD, Multi, DIGIT

patt = IfNotBehind(WORD) + Multi(DIGIT)

print(patt.match("b64"))
print(patt.match("64"))

None
<re.Match object; span=(0, 2), match='64'>
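Because match() only tries position 0, the example above fails on "b64" simply because "b" is not a digit. Scanning the whole string makes the effect of the lookbehind itself more visible. The sketch below uses the standard re module and assumes IfNotBehind(WORD) corresponds to the negative lookbehind (?<!\w).

```python
import re

pattern = re.compile(r"(?<!\w)\d+")

# Digits directly preceded by a word character ("b64", "x7") are skipped;
# digits preceded by a space or the string start are kept.
found = pattern.findall("b64 64 x7 7")

print(found)
```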

precedence: int

class regexfactory.patterns.Group(pattern: Pattern | str | RegexPattern, capturing: bool = True)

For separating your Patterns into fields for extraction. Basically, you use Group so that you can reference the regex inside of it later with NumberedReference. Passing capturing=False unifies the regex inside the group into a single token but does not capture the group, as seen below.

from regexfactory import Group, WORD, Multi

name = Group(Multi(WORD)) + " " + Group(Multi(WORD), capturing=False)

print(name.match("Nate Larsen").groups())

('Nate',)

precedence: int

IfGroup matches with yes_pattern if the group with the given name or index exists and succeeded in matching; otherwise it matches with no_pattern.

from regexfactory import IfGroup, NamedGroup, Optional, escape

patt = (
    Optional(NamedGroup("title", escape("Mr. "))) +
    IfGroup("title", "Dillon", NamedGroup("first_name", "Bob")) +
    Optional(IfGroup("first_name", " Dillon", ""))
)
# If NamedGroup "title" matches then use the last name pattern
# else use the first name pattern

print(patt.match("Mr. Dillon"))
print(patt.match("Mr. Bob"))
print(patt.match("Mr Dillon"))
print(patt.match("Bob"))
print(patt.match("Bob Dillon"))

<re.Match object; span=(0, 10), match='Mr. Dillon'>
None
None
<re.Match object; span=(0, 3), match='Bob'>
<re.Match object; span=(0, 10), match='Bob Dillon'>

precedence: int

Regex Characters

Common regex special characters, such as \d, ., … More information about special characters in Python regex is available in the documentation for the re module.

regexfactory.chars.ANY = <RegexPattern '.'>: (Dot.) In the default mode, this matches any character except a newline. If the re.DOTALL flag has been specified, this matches any character including a newline.
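A short sketch of the re.DOTALL behavior described for ANY, using the standard re module with the raw pattern '.' shown above:

```python
import re

text = "a\nb"

# Default mode: the dot skips the newline.
default_mode = re.findall(r".", text)

# With re.DOTALL the dot matches the newline as well.
dotall_mode = re.findall(r".", text, flags=re.DOTALL)

print(default_mode)
print(dotall_mode)
```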

regexfactory.chars.ANCHOR_START = <RegexPattern '^'>: (Caret.) Matches the start of the string, and in re.MULTILINE mode also matches immediately after each newline.

regexfactory.chars.ANCHOR_END = <RegexPattern '$'>: Matches the end of the string or just before the newline at the end of the string, and in re.MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in re.MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
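The end-anchor cases described for ANCHOR_END can be verified with the standard re module:

```python
import re

text = "foo1\nfoo2\n"

# Without re.MULTILINE, $ only matches at the very end of the string
# (or just before a trailing newline), so the last line wins.
normal = re.search(r"foo.$", text).group()

# With re.MULTILINE, $ also matches before every newline,
# so the first line is found first.
multiline = re.search(r"foo.$", text, flags=re.MULTILINE).group()

# A bare $ finds two empty matches in a newline-terminated string.
empty_spans = [m.span() for m in re.finditer(r"$", "foo\n")]

print(normal, multiline)
print(empty_spans)
```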

regexfactory.chars.WHITESPACE = <RegexPattern '\s'>: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the re.ASCII flag is used, only [ \t\n\r\f\v] is matched.

regexfactory.chars.NOTWHITESPACE = <RegexPattern '\S'>: Matches any character which is not a whitespace character. This is the opposite of \s. If the re.ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v].

regexfactory.chars.WORD = <RegexPattern '\w'>: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the re.ASCII flag is used, only [a-zA-Z0-9_] is matched.

regexfactory.chars.NOTWORD = <RegexPattern '\W'>: Matches any character which is not a word character. This is the opposite of \w. If the re.ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_]. If the re.LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore.

regexfactory.chars.DIGIT = <RegexPattern '\d'>: Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the re.ASCII flag is used only [0-9] is matched.
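The effect of re.ASCII on \d can be seen with a non-ASCII decimal digit. The sketch below uses the standard re module; the sample string mixes ASCII digits with ARABIC-INDIC DIGIT TWO (U+0662), which belongs to Unicode category Nd.

```python
import re

# "1", ARABIC-INDIC DIGIT TWO, "3"
text = "1\u06623"

# Default (Unicode) mode: every character in category Nd counts as a digit.
unicode_digits = re.findall(r"\d", text)

# With re.ASCII only 0-9 are matched.
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)

print(unicode_digits)
print(ascii_digits)
```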

regexfactory.chars.NOTDIGIT = <RegexPattern '\D'>: Matches any character which is not a decimal digit. This is the opposite of \d. If the re.ASCII flag is used this becomes the equivalent of [^0-9].