Sunday, April 14, 2013

Code Golf

Summary
This post walks through the process of shortening the code for a simple deciphering program.  The language used is C.

The problem
Code golf is the questionable practice of writing a program in the fewest (key)strokes (characters).  It is similar in spirit, but not exactly identical to code obfuscation. 

Let's take the Vigenère ciphertext decyption problem from the StackExhange code golfing site (http://codegolf.stackexchange.com/) and see what is the shortest C program that we can write.  Our approach will be to start with an "ungolfed" (readable) program and shorten it until we believe that it cannot be shortened any further.

First Attempt
The following program works but from a golfing perspective is very bad

#include <stdio.h>
#include <string.h> 
char rotated_alphabet[62][63];
char *unrotated_alphabet =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";

void gen_rotated_alphabet(void) {
    int i, j, k;
    for (i = 0; i < 62; ++i) {
        j = i;
        k = 0;
        do {
            rotated_alphabet[i][k++] = unrotated_alphabet[j];
            j = (j + 1)%62;
        } while (j != i);
        rotated_alphabet[i][62] = 0;
    }
}

char decipher(char c, char key) {
    int i, j;
    for (i = 0; i < 62 && key != unrotated_alphabet[i]; ++i)
    ;
    for (j = 0; j < 62 && c != rotated_alphabet[i][j]; ++j)
    ;
    return unrotated_alphabet[j];
}

// argv[1] : key
// argv[2] : encrypted string
void main(int argc, char **argv) {
    int i;
    gen_rotated_alphabet();
    for (i = 0; i < strlen(argv[2]); ++i) {
        putchar(decipher(argv[2][i], argv[1][i%strlen(argv[1])]));
    }    
}    

  1. It is bad because it uses variables longer than one character.
  2. It is bad because it has "#include's".  These usually aren't necessary when golfing code.  Many times the assumptions that the compiler makes about any routines linked in later will suffice.
  3. It is bad because it uses functions which are called only once.  When golfing, it is usually far better to inline functions by hand.  (unless, of course calling a function would result in a shorter program)
  4. It is bad because it doesn't make use of obscure C library functions.  The more obscure and poorly-named, the better as this gives your golfed programs a kind of mystique.
  5. The algorithm for deciphering results in longer than necessary code.
  6. A comment is used. 
  Let's start with (5) first and use a better algorithm.  Storing all permutations of  unrotated_alphabet in the rotated_alphabet table and searching for the ciphertext in the row corresponding to the key isn't necessary.

A Better Algorithm
Let's use the smaller table 'A' .. 'Z' and work by examples.  

Example 1.
Key = 'F', ciphertext = 'H'



The deciphered text is 'C'.  Notice that the difference of the positions of key 'H' and ciphertext 'F' in the unrotated alphabet is 2, which corresponds exactly to the position of the deciphered text 'C' in the unrotated alphabet as well.

Example 2.
Key = 'F', ciphertext = 'Q'.

The deciphered text is 'L'.  The distance between key 'F' and ciphertext 'Q' in the unrotated alphabet is 11 characters which corresponds to the position of 'L' in the unrotated alphabet.

Example 3.
Key = 'R', ciphertext = 'F'


The deciphered text is 'O'.  Since the position in the unrotated alphabet of the ciphertext 'F' is prior to the position of 'O' the difference is -12.  If we add the length of the alphabet to that we have -12+26 = 14 which is the position of the letter 'O' in the unrotated alphabet.

Example 4.
Key = 'Z', ciphertext = 'A'.


The deciphered text is 'B'. Here position('A')-position('Z') = -25.  If we add -25+26 we have 1, which is the position of 'B' in the unrotated alphabet.

So we have a much shorter deciphering algorithm:


k = position(key)
c = position(cipher)
if k > c then decipher = alphabet[26 + c - k]
else decipher = alphabet[c - k]
 
The resulting C program is (after placing the decipher() routine inline)


char *alphabet =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";

// argv[1] : key
// argv[2] : encrypted string
void main(int argc, char **argv) {
    int i;
    char *pc, *pk;
    for (i = 0; i < strlen(argv[2]); ++i) {
        pc = strchr(alphabet, argv[2][i]);
        pk = strchr(alphabet, argv[1][i%strlen(argv[1])]);
        putchar(pk > pc ? alphabet[62 + pc - pk] 
                        : alphabet[pc - pk]);
    }
}    

the call strchr(s, c) returns a pointer to the location of the first occurrence of character c within string s.

Shortening The Code Further
Shortening variables and moving declarations outside reduces the program size even further.  Note that at global scope, explicitly unintialized variables will be given an  value of 0.


char *a =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789",
    *c, *k;
int i;
void main(int n, char **v) {
    for (;i < strlen(v[2]); ++i) {
        c = strchr(a, v[2][i]);
        k = strchr(a, v[1][i%strlen(v[1])]);
        putchar(k > c ? a[62 + c - k] 
                        : a[c - k]);
    }
}
 
Since we only iterate through the characters of the ciphertext v[2] once, we use pointer arithmetic and modify it directly.

char *a =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789",
    *c, *k;
int i;    
void main(int n, char **v) {
    while (*v[2]) {
        c = strchr(a, *v[2]++);
        k = strchr(a, v[1][i++%strlen(v[1])]);
        putchar(k > c ? a[62 + c - k] 
                      : a[c - k]);
    }
}

The initialization of the alphabet a can be shortened with code.

char a[63], *c, *k;
int i,j;    
void main(int n, char **v) {
    for(;j<26;++j)
        a[j]=32|(a[j+26]=65+j),
        a[52+j%10]=48+j%10;    
    while (*v[2]) {
        c = strchr(a, *v[2]++);
        k = strchr(a, v[1][i++%strlen(v[1])]);
        putchar(k > c ? a[62 + c - k] 
                      : a[c - k]);
    }
}

The code a[j+26]=65+j initializes a[26], a[27], a[28], .., a[51] of a with 'a', 'b', ..,'z' ORing the result of that assignment these entries to uppercase and they get placed into a[0], a[1], .., a[25].  The code a[52+j%10]=48+j%10 initializes a[52], a[53], .., a[61] with '0', '1', .., '9'.  The j%10 ensures that garbage entries never get written past the end of the array.  We can eliminate the %10 by extending the size of a and allowing garbage to get written there.  This is still within the boundaries set by the author of the problem.  Using the comma operator saves us from having to use opening and closing braces.

Explicitly assigning to a cost us 71 characters when the lines were joined, and unnecessary double-quotes were removed.  Doing this in a loop cost us 51 characters when everything was scrunched together.  A savings of 10 characters!

The final (?) steps are:
  1. In the while loop, embed the conditional inside of the array subscript.  A savings of five characters since we no longer need braces.
  2. The two calls to strchr()  can also be placed inside of the array subscript if we use the comma operator.  A savings of one character.
  3. main() doesn't need to be declared as void. A savings of five characters.
  4. the int type can be removed from the declaration of i and j since this is the default type in C.  A savings of four characters.
char a[99],*c,*k;
i,j;    
main(int n, char **v) {
    for(;j<26;++j)
        a[j]=32|(a[j+26]=65+j),
        a[52+j]=48+j;    
    while (*v[2])                 
        putchar(a[c=strchr(a,*v[2]++),
                  k=strchr(a,v[1][i++%strlen(v[1])]),
                  k>c?62+c-k 
                        :c-k]);
}

Scrunching the code together gives it the much sought-after mystique

char a[99],*c,*k;i,j;main(int n,char**v)
{for(;j<26;++j)a[j]=32|(a[j+26]=65+j),a[52+j]=48+j;
while(*v[2])putchar(a[c=strchr(a,*v[2]++),
k=strchr(a,v[1][i++%strlen(v[1])]),k>c?62+c-k:c-k]);}

The final charcter count (when lines are joined) is 186.  This is the complete code which can be copied, compiled and run as it appears here (it was tested with gcc).  Ignore any warnings of implicit declarations of strchr() or the unspecified storage class of i and j.

5 comments: