CS374: Principles of Programming Languages - Regular Expressions (100 Points)
Assignment Goals
The goals of this assignment are:- To match text patterns using Regular Expressions
Background Reading and References
Please refer to the following readings and examples offering templates to help get you started:The Assignment
The purpose of this assignment is to practice matching patterns in texts using regular expressions.
Part 1: Warmup
The Python language provides a regular expression processing library called re
that you can use to match and replace text in a String
. Here’s how it works:
import re
file = open("somefile.txt")
data = file.read()
offset = 0
result = []
# This will return a Match object for the first semicolon in the string
match = re.search(r";", data)
# This will print a list of the starting index (from 0) and the ending index (non-inclusive) of the match
print(match.span())
result.append(match.span()) # keep track of all the matching positions, relative to the prior match position
nextIndex = match.span()
nextIndex = nextIndex[1]
offset = offset + nextIndex # keep track of the absolute position of the most recent match
# This will search the rest of the string for the next instance
# ... and return None if no match is found
match = re.search(r";", data[offset:])
Create a text file called somefile.txt
(or something else, and you can modify your code to use this file instead!). Modify the code example above to search for semicolons in your file, in a loop, printing out the span of all found matches until the nextIndex
exceeds the length of the file data, or until no match is found.
You will want to play around with the functions: print out their results, figure out how they work. Check out the re
library documentation. Don’t expect to understand everything at first glance - I’m asking you to explore the library as part of this exercise! In particular, note that re.search
returns a match
whose span
is a tuple containing two items: the index of the beginning of the match, and the index of the end of the match. You can generate a substring using data[match.span()[0], match.span()[1]
to view the matching text (although I suggest making variables to represent the span, the starting index, and the ending index, to improve the readability of your code!). However, to ensure you don’t find the same match again, you will likely substring your data prior to looping using data[offset:]
, where offset
is the index of the end of the match. The next time you loop, though, you’ll be given offsets that are relative to this new position! You’ll want to add offset
to the beginning and end of your span
tuple values in order to obtain absolute offsets. Alternatively, you can leave these matches as relative offsets, and when you print the substrings themselves, add the ending index of the prior match span to the current ones to obtain a running total.
Modularizing this Function into a Regular Expression Processor
Put this code into a function that returns an array of match spans for a given regular expression string and text data:
import re
def regexmatch(pattern, file):
# You can append your match spans to this list
result = []
# TODO: your code here
return result
Don’t forget to replace r";"
with the variable parameter pattern
when you merge into the function. You will have effectively created grep
by doing this!
Part 2: Regular Expressions
Write a new program based on your warmup that computes the following regular expressions, and prints a list of all the spans matching each expression. For each span that you find, print the substring (keep in mind that the ending index is non-inclusive, so your substring can omit the very last character index in each returned match).
- Each substring ending with a semicolon: match the whole substring. Hint: here is the regular expression:
r"\w*;"
. Ther
character before theString
indicates to Python that this is a “raw”String
, so no escape characters are needed. Alternatively, you could use this:"\\w*;"
. - All characters inside a set of parentheses. You can use
\(
and\)
to represent an opening and closing parenthesis in your regular expressionString
(don’t forget to mark theString
as a rawString
so you don’t have to also escape the backslash character! You may use\w
again to represent a character to match inside the parenthesis. - Valid variable names. A valid variable name can be formed in one of two ways: first, a lowercase letter (not a number), followed by zero or more upper case letters, lower case letters, or numbers; alternatively, a variable name may be an upper case letter (not a number) followed by zero or more upper case letters or numbers (no lower case letters). So these are valid:
aVariableName1
,piTimesD
,FALSE
, but these are not:2piR
,AnotherVariable
. - All phone numbers. Phone numbers contain the following format:
1-215-555-1212
; however, the leading “1-“ is optional, and the dashes may be either dashes or spaces. - All lines beginning with the word
DEPOSIT:
.
Part 3: Limitations of Regular Expressions
In your README, discuss why it would be impossible to write a single regular expression to ensure that all opening parentheses are properly nested inside other parentheses, and balanced. In other words, why is it difficult to write a regular expression that matches these String
s: (())
, (()()(()))
, but not these: ((()()
, ())
? Note: do not attempt to write this regular expression, beyond identifying that it cannot be done!
Part 4: Replacements
The re.sub(pattern, replacement, data)
will look for the regular expression defined by pattern
in the text specified by data
, and replace each instance with the string replacement
. Write a program or function to replace all instances of CS374
or CS 374
with Principles of Programming Languages
.
Submission
In your submission, please include answers to any questions asked on the assignment page, as well as the questions listed below, in your README file. If you wrote code as part of this assignment, please describe your design, approach, and implementation in a separate document prepared using a word processor or typesetting program such as LaTeX. This document should include specific instructions on how to build and run your code, and a description of each code module or function that you created suitable for re-use by a colleague. In your README, please include answers to the following questions:- Describe what you did, how you did it, what challenges you encountered, and how you solved them.
- Please answer any questions found throughout the narrative of this assignment.
- If collaboration with a buddy was permitted, did you work with a buddy on this assignment? If so, who? If not, do you certify that this submission represents your own original work?
- Please identify any and all portions of your submission that were not originally written by you (for example, code originally written by your buddy, or anything taken or adapted from a non-classroom resource). It is always OK to use your textbook and instructor notes; however, you are certifying that any portions not designated as coming from an outside person or source are your own original work.
- Approximately how many hours it took you to finish this assignment (I will not judge you for this at all...I am simply using it to gauge if the assignments are too easy or hard)?
- Your overall impression of the assignment. Did you love it, hate it, or were you neutral? One word answers are fine, but if you have any suggestions for the future let me know.
- Using the grading specifications on this page, discuss briefly the grade you would give yourself and why. Discuss each item in the grading specification.
- Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write that here. The more you articulate the problem the more partial credit you will receive (it is fine to leave this blank).
Assignment Rubric
Description | Pre-Emerging (< 50%) | Beginning (50%) | Progressing (85%) | Proficient (100%) |
---|---|---|---|---|
Algorithm Implementation (60%) | The algorithm fails on the test inputs due to major issues, or the program fails to compile and/or run | The algorithm fails on the test inputs due to one or more minor issues | The algorithm is implemented to solve the problem correctly according to given test inputs, but would fail if executed in a general case due to a minor issue or omission in the algorithm design or implementation | A reasonable algorithm is implemented to solve the problem which correctly solves the problem according to the given test inputs, and would be reasonably expected to solve the problem in the general case |
Code Quality and Documentation (30%) | Code commenting and structure are absent, or code structure departs significantly from best practice, and/or the code departs significantly from the style guide | Code commenting and structure is limited in ways that reduce the readability of the program, and/or there are minor departures from the style guide | Code documentation is present that re-states the explicit code definitions, and/or code is written that mostly adheres to the style guide | Code is documented at non-trivial points in a manner that enhances the readability of the program, and code is written according to the style guide |
Writeup and Submission (10%) | An incomplete submission is provided | The program is submitted, but not according to the directions in one or more ways (for example, because it is lacking a readme writeup) | The program is submitted according to the directions with a minor omission or correction needed, and with at least superficial responses to the bolded questions throughout | The program is submitted according to the directions, including a readme writeup describing the solution, and thoughtful answers to the bolded questions throughout |
Please refer to the Style Guide for code quality examples and guidelines.