Last Updated: 2019-03-04 Mon 10:28

CSCI 4061 Lab06: Binary Files and mmap()

CODE DISTRIBUTION: lab06-code.zip

  • Download the code distribution every lab
  • See further setup instructions below

CHANGELOG:

Mon Mar 4 10:27:58 CST 2019
make_dept_directory.c was missing the O_CREAT flag in its call to open() which may have caused some folks some problems. This has been corrected in the lab code pack.
Sun Mar 3 21:29:44 CST 2019
The lab06-code.zip file has been updated to include all files. It initially was based on an older version of the lab.

1 Rationale

For efficiency of storage and access, data is often stored in "binary files" which contain direct writes of arrays and structs from memory to permanent storage devices. These then require use of binary file I/O to manipulate them, frequently low level Unix read() / write() calls. They also often require jumping to different positions in the file which can be done via the lseek() system call. These are explored in this lab.

A viable alternative to file I/O is to make use of memory mapped files through mmap(). This utilizes a system call to expose files as a pointer into operating system managed space which holds parts of the file in main memory. While equivalent in power to standard I/O, mmap() avoids the need for intermediate buffers and allows pointer arithmetic to be used to locate and alter the file.

This lab introduces and contrasts handling binary files using standard I/O and memory mapping.

Associated Reading / Preparation

Stevens/Rago Ch 3 covers basic I/O functions like read() / write() as well as lseek() in Ch 3.6. These functions work equally well for text and binary data.

Stevens/Rago Ch 14 discusses advanced I/O techniques with Ch 14.8 covering mmap() for creating a memory mapped file.

Optionally, Bryant and O'Hallaron's "Computer Systems: A Programmer's Perspective" also has some coverage of mmap() in section 9.8.4. This textbook is mentioned as it is the required text for CSCI 2021, a prerequisite to CSCI 4061.

Grading Policy

  • Check-off 30%: Demonstrate to a TA that a student understands answers to questions. This must be done in person in groups of one or two. Check-offs can happen during the lab period or during a TA office hour.
  • Submit 70%: Submit required files according to the lab instructions. This can be done at any time and from anywhere with a network connection. Submitting does not require attending lab. All students must submit files even if they were checked off in a group during lab.

See the full policy in the syllabus.

2 Codepack

The codepack for the lab contains the following files:

File                     State     Description
QUESTIONS.txt            Provided  Questions to answer
Makefile                 Provided  Makefile to build programs for the lab
department.h             Provided  Header file for programs
make_dept_directory.c    Provided  Problem 1 program to create data file
cse_depts.dat.bk         Provided  Backup of data file created in Problem 1
print_department_read.c  Provided  Problem 1 program to analyze
print_department_mmap.c  Provided  Problem 2 program to analyze

3 Questions

Analyze the files in the provided codepack and answer the questions given in QUESTIONS.txt.

                           __________________

                            LAB 06 QUESTIONS
                           __________________


- Name: (FILL THIS in)
- NetID: (THE kauf0095 IN kauf0095@umn.edu)

Answer the questions below according to the lab specification. Write
your answers directly in this text file and submit it to complete the
lab.


PROBLEM 1: Binary File Format w/ Read
=====================================

A
~

  Compile all programs in the lab code directory with the provided
  `Makefile'.  Run the command
  ,----
  | ./make_dept_directory cse_depts.dat
  `----
  to create the `cse_depts.dat' binary file. Examine the source code for
  this program and the header `department.h'. Explain the format of the
  binary file `cse_depts.dat'.
  - What system calls are used in `make_dept_directory.c' to create this
    file?
  - How is the `sizeof()' operator used to simplify some of the
    computations in `make_dept_directory.c'?
  - What data is in `cse_depts.dat' and how is it ordered?


B
~

  Run the `print_department_read' program which takes a binary data file
  and a department code to print.  Show a few examples of running this
  program with valid command line arguments. Include demo runs that
  - Use the `cse_depts.dat' with known and unknown department codes
  - Use a file other than `cse_depts.dat'


C
~

  Study the source code for `print_department_read' and describe how it
  initially prints the table of offsets shown below.
  ,----
  | Dept Name: CS Offset: 104
  | Dept Name: EE Offset: 2152
  | Dept Name: IT Offset: 3688
  `----
  What specific sequence of calls leads to this information?


D
~

  What system call is used to skip immediately to the location in the
  file where desired contacts are located? What arguments does this
  system call take? Consult the manual entry for this function to find
  out how else it can be used.


PROBLEM 2: mmap() and binary files
==================================

  An alternative to using standard I/O functions is "memory mapped"
  files through the system call `mmap()'. The program
  `print_department_mmap.c' provides the same functionality as the previous
  `print_department_read.c' but uses a different mechanism.


(A)
~~~

  Early in `print_department_mmap.c' an `open()' call is used as in the
  previous program but it is followed shortly by a call to `mmap()' in
  the lines
  ,----
  |   char *file_bytes =
  |     mmap(NULL, size, PROT_READ, MAP_SHARED,
  |          fd, 0);
  `----
  Look up reference documentation on `mmap()' and describe some of the
  arguments to it including the `NULL' and `size' arguments. Also
  describe its return value.


(B)
~~~

  The initial setup of the program uses `mmap()' to assign a pointer to
  variable `char *file_bytes'.  This pointer will refer directly to the
  bytes of the binary file.

  Examine the lines
  ,----
  |   ////////////////////////////////////////////////////////////////////////////////
  |   // CHECK the file_header_t struct for integrity, size of department array
  |   file_header_t *header = (file_header_t *) file_bytes; // binary header struct is first thing in the file
  `----

  Explain what is happening here: what value will the variable `header'
  get and how is it used in subsequent lines?


(C)
~~~

  After finishing with the file header, the next section of the program
  begins with the following.
  ,----
  |   ////////////////////////////////////////////////////////////////////////////////
  |   // SEARCH the array of department offsets for the department named
  |   // on the command line
  | 
  |   dept_offset_t *offsets =           // after file header, array of dept_offset_t structures
  |     (dept_offset_t *) (file_bytes + sizeof(file_header_t));
  | 
  `----

  Explain what value the `offsets' variable is assigned and how it is
  used in the remainder of the SEARCH section.


(D)
~~~

  The final phase of the program begins below
  ,----
  |   ////////////////////////////////////////////////////////////////////////////////
  |   // PRINT out all personnel in the specified department
  |   ...
  |   contact_t *dept_contacts = (contact_t *) (file_bytes + offset);
  `----
  Describe what value `dept_contacts' is assigned and how the final
  phase uses it.

4 What to Understand

Ensure that you understand

  • How data in main memory can be directly written to files with write()
  • How data in files can be directly read() into arrays and structs.
  • Use of the lseek() system call to move to a desired byte position in a file
  • Use of mmap() to create a memory mapped file for reading

5 Getting Credit: Check-off and Turn-in

  • Check off your answers with a TA in person during lab or office hours.
  • Submit your completed QUESTIONS.txt file to Canvas under the appropriate lab heading. Make sure to paste in any new code and answers you were to write at appropriate locations in the file.

Author: Chris Kauffman (kauffman@umn.edu)
Date: 2019-03-04 Mon 10:28