cs644 > spring2025 > week1

Week 1: Course introduction

In-class exercises

Create a file called hello.c on the remote server. We'll write a small "Hello, world!" program, and compile and run it.
Do ls /usr/share/cs644/code on the remote server to see the course's source code. Notice that each of you has a directory under /home/shared. (Try running ls /home/shared with --color=never if you can't read the output.) Pick one of your classmates and create a file with a greeting in it.
Linux commands and C library functions are documented in man pages. We'll take a look at man puts and man gcc.
man pages can be quite detailed and hard to read. Fortunately, we have a tool installed on the server called tldr
A brief run-down of some other useful tools: bat, rg, fd, fzf

Homework exercises

(★) What is the size_t type for in C?
Solution
size_t is for representing array indices and lengths. It is an unsigned integer type that is guaranteed to be large enough to hold the largest possible array index on your machine. Otherwise, it has no special properties. Older C code often uses int for array indices, but size_t is more portable and reliable.
(★) True or false: Semicolons are optional in C.
Solution
False. Statements and type definitions must end with semicolons in C.
(★★) Choose a language and think about what final project you'd like to do (see syllabus for details). Neither choice is binding for now.
(★★) Does this program have a bug? Why or why not?
Solution
The program does not have a bug. The LANG_RU clause of the switch-case lacks a break statement, but this is intentional: the programmer meant for LANG_RU to fall through to LANG_FR, since LANG_RU has the same logic as LANG_FR. That being said, I would recommend (a) avoiding fall-through if possible, and (b) explicitly annotating it with a comment if you do decide to do it.
(★★) Take a look at this C function from the Python source code. Try to understand what it does and how it works. (Bonus: Can you explain how the second function in the file works?)
Solution
- The function does a case-insensitive comparison of the two strings. First, it defines two local variables (p1 and p2) to iterate over the string. Next, it loops over both strings until it hits the end of either string, or finds a character that is not equal. The loop condition is complicated, so let's break it down:
  - --size > 0 – Decrement size and check that it is greater than zero. --size is equivalent to "decrement size by one and return the decremented value". It's not necessarily a good idea to put a mutating statement in a loop condition like this, but old-style C code often does it for brevity.
  - *p1 && *p2 – This checks that we haven't reached the end of either string. *p1 evaluates to the current character the pointer points to; C strings always have a null character (equal to the integer 0) at the end, so if we're at the end, *p1 will evaluate to false.
  - Py_TOLOWER(*p1) == Py_TOLOWER(*p2) – Check that the two characters are equal to each other, ignoring case.
- Comparison functions in C return a negative number if a < b, 0 if a == b, and a positive number if a > b. Py_TOLOWER(*p1) - Py_TOLOWER(*p2) accomplishes this concisely: if the strings are equal, we should have terminated at the end of the string, in which case *p1 and *p2 are both the null character, so we return 0. If not, then *p1 and *p2 are the first pair of characters that differ, and it's not hard to see that the subtraction maintains the desired invariant.
- The reason that the function takes in a separate size parameter, even though we can determine both strings' lengths without it, is so you can compare a fixed-length prefix of the two strings.
(★★) Write a C program, redact.c, that takes a string argument and prints out the string with all digits replaced by the character 'X'. (Hint: Some standard library functions might come in handy.)
Solution
https://github.com/iafisher/cs644/blob/master/week1/solutions/redact.c
(★★★) Research the concept of the stack and the heap for memory allocation. (Not to be confused with the data structures of the same name.) When do you use one versus the other? How do I know if a value in a C program is allocated on the stack or the heap?
Solution
- You can imagine that your program's memory is a gigantic array of boxes, each with a numeric memory address and each holding a fixed-size integer (64 bits, on modern computers). The stack and heap are two different regions of this memory array.
- Your program has the illusion of practically infinite memory. Of course, your machine's memory is finite and shared by all the other processes that are running. So, you have to claim memory from the OS when you need it, and release it when you are done.
- The stack is where local variables are allocated (e.g., int x = 42). It's called a stack because it's last-in, first-out: the last memory to be allocated is the first memory to be freed. The stack mirrors the call stack of your program: when you call a function, its local variables are pushed onto the stack, and when you return, the local variables are freed. The stack is fast and de-allocation is automatic, but it has some drawbacks.
  - The size of the stack is limited, and if you exceed its maximum size your program will crash.
  - Because memory is freed automatically, you can't use the stack when you need more control over when a value is freed.
    - A function should never return a pointer to a local variable, because that local variable will be freed on the stack as soon as the function returns, and the pointer will become invalid. If you need to return a pointer, then you need to use the heap.
- The heap solves these two drawbacks, but is trickier to use because allocation and de-allocation are not automatic. You claim memory from the heap with a function called malloc, and release it with free. If you fail to free it, then your program has a memory leak. If you try to access memory after you've freed it, then anything could happen. If you call free on the same pointer twice, then anything could happen. So, you have to be careful when using the heap because it's easy to make a mistake and access memory unsafely. The heap is an appropriate choice for holding the data payload of large data structures like dynamic arrays and hash tables, and it is a necessity if you need to write a function that allocates memory that lives beyond the scope of the function.
- In a language with garbage collection, you don't have to explicitly free memory. The program's runtime will periodically scan the heap and find memory that is no longer reachable and can be freed.
(★★★) What is the memory representation of strings (char*) in C? What is an alternate representation? What are the advantages and disadvantages of each?
Solution
- A string is represented in C as an array of characters terminated by the null character (written as '\0', equal to the integer zero). Such strings are referred to as null-terminated strings. One consequence of this representation is that finding the length of the string (strlen(s)) is a linear-time operation: you have to iterate over all the characters to find the null terminator. Another consequence is that if you forget to put a null terminator at the end, then any code that tries to read the string will run past the end and access unrelated memory. So be very careful when using functions like strncpy.
- The most common alternative are length-prefixed strings, in which the length is stored as a separate field: struct string { size_t len; const char* data }.
- Length-prefixed strings are generally superior: you can read the length in constant time, it's harder to accidentally access invalid memory, and they can be made backwards-compatible with null-terminated strings. The one advantage that traditional C strings have is that they have a slightly lower per-string memory overhead: 1 byte for the null terminator, versus 8 bytes for size_t in length-prefixed strings. This doesn't seem like a lot, but if your program allocates a lot of short strings, it can make a difference.