CSCI 2041 Lab10: Lexing and Parsing
- Due: 11:59pm Mon 11/19/2018
- Approximately 0.83% of total grade
- Submit to Canvas
- Lab exercises are open resource/open collaboration. You must submit your own work but you may freely discuss lab topics with other members of the class.
CODE DISTRIBUTION: lab10-code.zip
- Download the code distribution every lab
- See further setup instructions below
CHANGELOG:
- Mon Nov 19 10:41:10 CST 2018
- The original version of the lab contained some faulty
instructions for the parser, that
Minus
should be added toprec1
andDiv
toprec2
. This leads to incorrect parsing as subtraction is higher precedence than addition and division higher than multiplication. The instructions have been corrected and a solution will be posted shortly.
1 Rationale
Lexing and Parsing are early phases in converting a raw text source code into a form suitable for execution as a computer program. Lexing converts strings of characters into some symbolic "tokens" that indicates some higher meaning an discards characters such as whitespace which are no longer needed. Parsing analyzes sequences of tokens to determine if the appear in an acceptable order. Typically this results in the construction of a tree or sequence data structure which will later be used for evaluation of further transformation.
Completing this lab will acquaint one with the basics of lexing and parsing as simple arithmetic expression language.
Grading Policy
- Check-off 30%: Demonstrate to a TA that a student understands answers to questions. This must be done in person in groups of one or two. Check-offs can happen during the lab period of during a TA office hour.
- Submit 70%: Submit required files according to the lab instruction. This can be done at any time and from anywhere with a network connection. Submitting does not require attending lab. All students must submit files even if they were checked off in a group during lab.
See the full policy in the syllabus.
2 Codepack
The codepack for the lab contains the following files:
File | State | Description |
---|---|---|
QUESTIONS.txt |
Edit | Answer questions in the file to complete the lab |
lexer.ml |
Edit | Problem 1 file to analyze and edit |
parser.ml |
Edit | Problem 2 file to analyze and edit |
3 Questions
Analyze the files in the provided codepack and answer the questions
given in QUESTIONS.txt
.
__________________ LAB 10 QUESTIONS __________________ - Name: (FILL THIS in) - NetID: (THE kauf0095 IN kauf0095@umn.edu) Answer the questions below according to the lab specification. Write your answers directly in this text file and submit it to complete the lab. PROBLEM 1: lexer.ml =================== (A) ~~~ Examine `lexer.ml'. The central function of interest is `lex_string' which is intended to process a string with an arithmetic expression in it and produce a list of tokens of the `token' type defined at the top of the file. Study the given code carefully and answer the following questions on how it works. 1. The code makes use of recursion. Describe how recursion is used to process the entire string and to produce a list of the tokens that are found. 2. There is a loop associated with digits. Why is this needed while no loop is needed for the other cases of tokens? 3. Demonstrate some uses of `lex_string' in a REPL to show that it works. (B) ~~~ Extend the code in `lexer.ml' to account for the following new kinds of tokens. 1. Subtraction with the Minus (-) sign - like Addition 2. Division with the Slash (/) - like multiplication 3. Identifiers of strings (abbreviated Ident) - like Int of integers. To complete the extension, make the following three modifications. 1. Add in a variant to the `token' type for each of Minus, Slash, Ident. 2. In `lex_string', extend the `match/with' single character cases to include one for - and / associated with Minus and Slash. These will behave similarly to their Plus and Times counterparts. 3. Add in a case for character data which will become Ident cases and be similar to Int. Make use of the provided `is_letter' function to help identify letters. Once you have completed the modifications, uncomment the provided examples at the end of the file to ensure the modifications compile and work. Try out the `lex_string' in a REPL. PROBLEM 2: `parser.ml' ====================== (A) ~~~ Examine `parser.ml'. This file contains a solution for Problem 1's lexer as it is required for the parser (this is also useful if you get stuck). The central function of interest is `parse_tokens' which converts a list of `tokens' into a tree structure which will be referred to as a Parse Tree or Abstract Syntax Tree (AST). This tree is comprised of variants of the `expr' type defined midway through the file. Describe the kinds of elements in the `expr' type and what kinds of data they carry. Note how this creates a tree structure. (B) ~~~ The code for `parse_tokens' uses recursion to perform the parsing. Notice its structure of dividing the task into 4 mutually recursive functions: 1. prec0 which handles Int and OParen/CParen tokens. 2. prec1 which handles Times tokens 3. prec2 which handles Plus tokens 4. parse which serves as an entry point for the beginning of parsing. Answer the following questions about how each of these work. 1. Functions `prec2' starts by calling `prec1' and prec1 starts by calling prec0. In both situations, a pair of `(lexpr, rest)' are returned. What types are the `lexpr' and `rest' and how are they used? 2. `prec0' matches in one case the `OParen' token which corresponds to `(' in the input. This is handled by calling the `parse' function. Describe what effect this has in parsing. What happens if there is no `CParen' (closing paraenthesis `)' ) in the input? 3. Try running `parse_tokens' on the following inputs as follows and report your results. Each of these will raise an error. How the code reaches this error. ,---- | # parse_tokens (lex_string "5+10+");; | # parse_tokens (lex_string "5+10+11 )");; | # parse_tokens (lex_string "5*10+11 9");; | # parse_tokens (lex_string "5+10+11 + +");; | # parse_tokens (lex_string "5*(10+11");; `---- (C) ~~~ Modify the parser to accommodate the new tokens - and / and string identifiers. Add in the following variants type to the `expr' type. - `Sub' which corresponds to the `Minus' token and is similar to `Add' - `Div' which corresponds to the `Slash' token and is similar to `Mul' - `Varname' which corresponds to the `Ident' token and is similar to `Const' Note that each of these has analogues already in the parser. CORRECTED INSTRUCTIONS - Handle `Ident' tokens in the `prec0' function. It should create a `Varname' expression similar to `Int'. It is at the same precedence level as `Int' so can be in the same function. - Handle `Slash' tokens in a new `prec1_div' function. It should create a `Div' expression similar to `Mul' but be at a higher precedence than `prec1'. - Handle `Minus' tokens in a new `prec2_sub' function. It should create a `Sub' expression similar to `Add' but be at a higher precedence than `prec2'. ORIGINAL INCORRECT INSTRUCTIONS - Handle `Minus' tokens in the `prec2' function. It should create a `Sub' expression similar to `Add' - Handle `Slash' tokens in the `prec1' function. It should create a `Div' expression similar to `Mul'. - Handle `Ident' tokens in the `prec0' function. It should create a `Varname' expression similar to `Int'. When you complete the parser extensions, uncomment the code at the end of the file which allows you to test it in a REPL.
4 What to Understand
Ensure that you understand
- The idea of lexing as processing character input to a symbolic token sequence.
- The idea of parsing as transforming a token sequence into a tree of data for further processing.
- The use of recursive functions in both lexing and parsing.
5 Getting Credit: Submit and Check-off
- Check-off your answers with a TA in person during lab or office hours.
- Submit your completed
QUESTIONS.txt
file to Canvas under the appropriate lab heading. Make sure to paste in any new code and answers you were to write at appropriate locations in the file.