CSCI 2041 Lab10: Lexing and Parsing

Due: 11:59pm Mon 11/19/2018
Approximately 0.83% of total grade
Submit to Canvas
Lab exercises are open resource/open collaboration. You must submit your own work but you may freely discuss lab topics with other members of the class.

CODE DISTRIBUTION: lab10-code.zip

Download the code distribution every lab
See further setup instructions below

CHANGELOG:

Mon Nov 19 10:41:10 CST 2018: The original version of the lab contained some faulty instructions for the parser, that Minus should be added to prec1 and Div to prec2. This leads to incorrect parsing as subtraction is higher precedence than addition and division higher than multiplication. The instructions have been corrected and a solution will be posted shortly.

1. Rationale
- Grading Policy
2. Codepack
3. Questions
4. What to Understand
5. Getting Credit: Submit and Check-off

1 Rationale

Lexing and Parsing are early phases in converting a raw text source code into a form suitable for execution as a computer program. Lexing converts strings of characters into some symbolic "tokens" that indicates some higher meaning an discards characters such as whitespace which are no longer needed. Parsing analyzes sequences of tokens to determine if the appear in an acceptable order. Typically this results in the construction of a tree or sequence data structure which will later be used for evaluation of further transformation.

Completing this lab will acquaint one with the basics of lexing and parsing as simple arithmetic expression language.

Grading Policy

Check-off 30%: Demonstrate to a TA that a student understands answers to questions. This must be done in person in groups of one or two. Check-offs can happen during the lab period of during a TA office hour.
Submit 70%: Submit required files according to the lab instruction. This can be done at any time and from anywhere with a network connection. Submitting does not require attending lab. All students must submit files even if they were checked off in a group during lab.

See the full policy in the syllabus.

2 Codepack

The codepack for the lab contains the following files:

File	State	Description
`QUESTIONS.txt`	Edit	Answer questions in the file to complete the lab
`lexer.ml`	Edit	Problem 1 file to analyze and edit
`parser.ml`	Edit	Problem 2 file to analyze and edit

3 Questions

Analyze the files in the provided codepack and answer the questions given in QUESTIONS.txt.

                           __________________

                            LAB 10 QUESTIONS
                           __________________


- Name: (FILL THIS in)
- NetID: (THE kauf0095 IN kauf0095@umn.edu)

Answer the questions below according to the lab specification. Write
your answers directly in this text file and submit it to complete the
lab.


PROBLEM 1: lexer.ml
===================

(A)
~~~

  Examine `lexer.ml'. The central function of interest is `lex_string'
  which is intended to process a string with an arithmetic expression in
  it and produce a list of tokens of the `token' type defined at the top
  of the file.

  Study the given code carefully and answer the following questions on
  how it works.

  1. The code makes use of recursion. Describe how recursion is used to
     process the entire string and to produce a list of the tokens that
     are found.

  2. There is a loop associated with digits. Why is this needed while no
     loop is needed for the other cases of tokens?

  3. Demonstrate some uses of `lex_string' in a REPL to show that it
     works.


(B)
~~~

  Extend the code in `lexer.ml' to account for the following new kinds
  of tokens.
  1. Subtraction with the Minus (-) sign - like Addition
  2. Division with the Slash (/) - like multiplication
  3. Identifiers of strings (abbreviated Ident) - like Int of integers.

  To complete the extension, make the following three modifications.
  1. Add in a variant to the `token' type for each of Minus, Slash,
     Ident.
  2. In `lex_string', extend the `match/with' single character cases to
     include one for - and / associated with Minus and Slash. These will
     behave similarly to their Plus and Times counterparts.
  3. Add in a case for character data which will become Ident cases and
     be similar to Int. Make use of the provided `is_letter' function to
     help identify letters.

  Once you have completed the modifications, uncomment the provided
  examples at the end of the file to ensure the modifications compile
  and work.  Try out the `lex_string' in a REPL.


PROBLEM 2: `parser.ml'
======================

(A)
~~~

  Examine `parser.ml'. This file contains a solution for Problem 1's
  lexer as it is required for the parser (this is also useful if you get
  stuck).

  The central function of interest is `parse_tokens' which converts a
  list of `tokens' into a tree structure which will be referred to as a
  Parse Tree or Abstract Syntax Tree (AST). This tree is comprised of
  variants of the `expr' type defined midway through the file.

  Describe the kinds of elements in the `expr' type and what kinds of
  data they carry. Note how this creates a tree structure.


(B)
~~~

  The code for `parse_tokens' uses recursion to perform the
  parsing. Notice its structure of dividing the task into 4 mutually
  recursive functions:
  1. prec0 which handles Int and OParen/CParen tokens.
  2. prec1 which handles Times tokens
  3. prec2 which handles Plus tokens
  4. parse which serves as an entry point for the beginning of parsing.

  Answer the following questions about how each of these work.

  1. Functions `prec2' starts by calling `prec1' and prec1 starts by
     calling prec0.  In both situations, a pair of `(lexpr, rest)' are
     returned.  What types are the `lexpr' and `rest' and how are they
     used?
  2. `prec0' matches in one case the `OParen' token which corresponds to
     `(' in the input.  This is handled by calling the `parse'
     function. Describe what effect this has in parsing. What happens if
     there is no `CParen' (closing paraenthesis `)' ) in the input?
  3. Try running `parse_tokens' on the following inputs as follows and
     report your results. Each of these will raise an error. How the
     code reaches this error.

  ,----
  | # parse_tokens (lex_string "5+10+");;
  | # parse_tokens (lex_string "5+10+11 )");;
  | # parse_tokens (lex_string "5*10+11 9");;
  | # parse_tokens (lex_string "5+10+11 + +");;
  | # parse_tokens (lex_string "5*(10+11");;
  `----


(C)
~~~

  Modify the parser to accommodate the new tokens - and / and string
  identifiers.  Add in the following variants type to the `expr' type.
  - `Sub' which corresponds to the `Minus' token and is similar to `Add'
  - `Div' which corresponds to the `Slash' token and is similar to `Mul'
  - `Varname' which corresponds to the `Ident' token and is similar to
    `Const'

  Note that each of these has analogues already in the parser.

  CORRECTED INSTRUCTIONS
  - Handle `Ident' tokens in the `prec0' function. It should create a
    `Varname' expression similar to `Int'. It is at the same precedence
    level as `Int' so can be in the same function.
  - Handle `Slash' tokens in a new `prec1_div' function. It should
    create a `Div' expression similar to `Mul' but be at a higher
    precedence than `prec1'.
  - Handle `Minus' tokens in a new `prec2_sub' function. It should
    create a `Sub' expression similar to `Add' but be at a higher
    precedence than `prec2'.

  ORIGINAL INCORRECT INSTRUCTIONS
  - Handle `Minus' tokens in the `prec2' function. It should create a
    `Sub' expression similar to `Add'
  - Handle `Slash' tokens in the `prec1' function. It should create a
    `Div' expression similar to `Mul'.
  - Handle `Ident' tokens in the `prec0' function. It should create a
    `Varname' expression similar to `Int'.

  When you complete the parser extensions, uncomment the code at the end
  of the file which allows you to test it in a REPL.

4 What to Understand

Ensure that you understand

The idea of lexing as processing character input to a symbolic token sequence.
The idea of parsing as transforming a token sequence into a tree of data for further processing.
The use of recursive functions in both lexing and parsing.

5 Getting Credit: Submit and Check-off

Check-off your answers with a TA in person during lab or office hours.
Submit your completed QUESTIONS.txt file to Canvas under the appropriate lab heading. Make sure to paste in any new code and answers you were to write at appropriate locations in the file.