Sunday, April 21, 2013

Using Emacs for Fun & Profit. Part 1: Making the Mode Line Useful

Summary
The Emacs mode line is somewhat cryptic.  A brief introduction to mode line customization in Emacs is given.
 
The Default Mode Line
The Emacs mode line is an area below the contents of a buffer displaying information such as current line number of the cursor, buffer name, and major/minor modes.  Here is the default modeline (in Emacs 24.2.1)


The line and column numbers are shown as are the buffer name and current modes.  Not so clear though are the mysterious "U:---" and "All".  What do they mean?

If I open a file and page down several times the word "All" changes to "39%" which is the amount of text below the bottom of the window.



What about  that mysterious  "-:--" which  was "U:**-"  in the  scratch buffer's
mode line?

The emacs documentation for the mode line states that the left part of the mode line has the form

CS:CH-FR

Where
  • CS is the coding system (possibly prefixed with a  character designating the current input method).
  • The character(s) immediately after CS indicate the end-of-line convention.  This can be "\" or "(DOS)" if the DOS/WINDOWS carriage return + linefeed convention is used to indicate end-of-line.
  • CH is
    •  "--" if the buffer is unmodified,
    • "**" if the buffer has been edited,
    • "%%" if the buffer is read-only and unmodified and
    • "%*" if the buffer is read-only and modified         
  • The character following CH is a dash ("-") unless the default directory for the buffer is on a remote machine, in which case it is a "@".
  • FR is the frame name and is seen only on text terminals.
Customizing the Mode Line  
If you're like me, then you probably won't remember the bizarre conventions of the default mode line in your day-to-day use of emacs. Fortunately the mode line can be customized like anything else in emacs.  I'll walk through the interesting parts of customizing a mode line to look like this



  (the vertical text was added to explain each part)

Here is the elisp code for this modeline

(0) (setq-default mode-line-format
(1)  (list
(2)    "  "
(3)    mode-line-buffer-identification
(4)    " │ "
(5)    "dir: "
(6)    '(:eval (last-dir default-directory))
(7)    " │ "
(8)    ;; '%02' to set to 2 chars at least; prevents flickering
(9)      "%04l"
(10)     ","
(11)     "%02c" 
(12)   " │ "
(13)   '(:eval (if (buffer-modified-p) "M" "-"))
(14)   '(:eval (if buffer-read-only    "R" "-"))
(15)   '(:eval (if (window-dedicated-p (selected-window)) "D" "-"))
(16)   " │ "
(17)   mode-line-modes    
(18)   )) 
 
The Emacs lisp variable mode-line-format on line 0 determines what gets placed onto the mode line.  This variable must be set with setq-default so that the new value can be seen in all buffers.

The predefined variable on line 3, mode-line-buffer-identification, will show the current buffer name in bold text. Clicking on the buffer identification with mouse-1 will cause Emacs to switch to the next buffer in its buffer list and clicking on it with mouse-2 will switch to the previous buffer.

Lines 4 and 5 are simply text that will be displayed verbatim. The current line and column are displayed by lines 9 and 11.

Percent characters followed by a single-letter code in literal strings are a signal for Emacs to substitute various data. Some examples:
  • %I - Abbreviated size of buffer (e.g. 32k, 2M, etc.)
  • %p - percentage of buffer text above the top of the window.
  • %P - percentage of buffer text below the bottom of the window.  This is displayed in the default mode line.
Others can be found here. In our case, we are using %l and %c to display the cursor line and column. Additionally, you can designate a field width in C's printf() style by placing a number after the '%'.

Emacs also allows arbitrary pieces of lisp code to be evaluated when constructing the mode line. Here the real power of mode line customization can be seen. A simple example is on line 6 which uses the :eval construct. On this line, the default directory for the current buffer is passed to the function last-dir which peels off all but the last level of the directory path

(defun last-dir (p)
  (let ((dirs (reverse (split-string p "/"))))
    (cond ((and (equal (car dirs) "")
                (equal (cadr dirs) ""))
           "/")   ; root dir
          ((and (equal (car dirs) "")
                (equal (cadr dirs) "~"))
           "~")   ; home dir
          (t (cadr dirs))))) 

The variable mode-line-modes is used on line 17 to display the major and minor modes which are active for the current buffer. Clicking on the major mode name with mouse-1 will display a menu of available commands for that mode. Clicking on the minor mode name with mouse-1 will display a pop-up menu allowing you to turn off the minor mode or display help. Clicking on the major or minor mode name with mouse-2 will display a menu of minor modes to choose from.

Consult the Emacs documentation for more details on mode line customization.
http://www.gnu.org/software/emacs/manual/html_node/elisp/Mode-Line-Format.html#Mode-Line-Format

Sunday, April 14, 2013

Code Golf

Summary
This post walks through the process of shortening the code for a simple deciphering program.  The language used is C.

The problem
Code golf is the questionable practice of writing a program in the fewest (key)strokes (characters).  It is similar in spirit, but not exactly identical to code obfuscation. 

Let's take the Vigenère ciphertext decyption problem from the StackExhange code golfing site (http://codegolf.stackexchange.com/) and see what is the shortest C program that we can write.  Our approach will be to start with an "ungolfed" (readable) program and shorten it until we believe that it cannot be shortened any further.

First Attempt
The following program works but from a golfing perspective is very bad

#include <stdio.h>
#include <string.h> 
char rotated_alphabet[62][63];
char *unrotated_alphabet =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";

void gen_rotated_alphabet(void) {
    int i, j, k;
    for (i = 0; i < 62; ++i) {
        j = i;
        k = 0;
        do {
            rotated_alphabet[i][k++] = unrotated_alphabet[j];
            j = (j + 1)%62;
        } while (j != i);
        rotated_alphabet[i][62] = 0;
    }
}

char decipher(char c, char key) {
    int i, j;
    for (i = 0; i < 62 && key != unrotated_alphabet[i]; ++i)
    ;
    for (j = 0; j < 62 && c != rotated_alphabet[i][j]; ++j)
    ;
    return unrotated_alphabet[j];
}

// argv[1] : key
// argv[2] : encrypted string
void main(int argc, char **argv) {
    int i;
    gen_rotated_alphabet();
    for (i = 0; i < strlen(argv[2]); ++i) {
        putchar(decipher(argv[2][i], argv[1][i%strlen(argv[1])]));
    }    
}    

  1. It is bad because it uses variables longer than one character.
  2. It is bad because it has "#include's".  These usually aren't necessary when golfing code.  Many times the assumptions that the compiler makes about any routines linked in later will suffice.
  3. It is bad because it uses functions which are called only once.  When golfing, it is usually far better to inline functions by hand.  (unless, of course calling a function would result in a shorter program)
  4. It is bad because it doesn't make use of obscure C library functions.  The more obscure and poorly-named, the better as this gives your golfed programs a kind of mystique.
  5. The algorithm for deciphering results in longer than necessary code.
  6. A comment is used. 
  Let's start with (5) first and use a better algorithm.  Storing all permutations of  unrotated_alphabet in the rotated_alphabet table and searching for the ciphertext in the row corresponding to the key isn't necessary.

A Better Algorithm
Let's use the smaller table 'A' .. 'Z' and work by examples.  

Example 1.
Key = 'F', ciphertext = 'H'



The deciphered text is 'C'.  Notice that the difference of the positions of key 'H' and ciphertext 'F' in the unrotated alphabet is 2, which corresponds exactly to the position of the deciphered text 'C' in the unrotated alphabet as well.

Example 2.
Key = 'F', ciphertext = 'Q'.

The deciphered text is 'L'.  The distance between key 'F' and ciphertext 'Q' in the unrotated alphabet is 11 characters which corresponds to the position of 'L' in the unrotated alphabet.

Example 3.
Key = 'R', ciphertext = 'F'


The deciphered text is 'O'.  Since the position in the unrotated alphabet of the ciphertext 'F' is prior to the position of 'O' the difference is -12.  If we add the length of the alphabet to that we have -12+26 = 14 which is the position of the letter 'O' in the unrotated alphabet.

Example 4.
Key = 'Z', ciphertext = 'A'.


The deciphered text is 'B'. Here position('A')-position('Z') = -25.  If we add -25+26 we have 1, which is the position of 'B' in the unrotated alphabet.

So we have a much shorter deciphering algorithm:


k = position(key)
c = position(cipher)
if k > c then decipher = alphabet[26 + c - k]
else decipher = alphabet[c - k]
 
The resulting C program is (after placing the decipher() routine inline)


char *alphabet =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";

// argv[1] : key
// argv[2] : encrypted string
void main(int argc, char **argv) {
    int i;
    char *pc, *pk;
    for (i = 0; i < strlen(argv[2]); ++i) {
        pc = strchr(alphabet, argv[2][i]);
        pk = strchr(alphabet, argv[1][i%strlen(argv[1])]);
        putchar(pk > pc ? alphabet[62 + pc - pk] 
                        : alphabet[pc - pk]);
    }
}    

the call strchr(s, c) returns a pointer to the location of the first occurrence of character c within string s.

Shortening The Code Further
Shortening variables and moving declarations outside reduces the program size even further.  Note that at global scope, explicitly unintialized variables will be given an  value of 0.


char *a =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789",
    *c, *k;
int i;
void main(int n, char **v) {
    for (;i < strlen(v[2]); ++i) {
        c = strchr(a, v[2][i]);
        k = strchr(a, v[1][i%strlen(v[1])]);
        putchar(k > c ? a[62 + c - k] 
                        : a[c - k]);
    }
}
 
Since we only iterate through the characters of the ciphertext v[2] once, we use pointer arithmetic and modify it directly.

char *a =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789",
    *c, *k;
int i;    
void main(int n, char **v) {
    while (*v[2]) {
        c = strchr(a, *v[2]++);
        k = strchr(a, v[1][i++%strlen(v[1])]);
        putchar(k > c ? a[62 + c - k] 
                      : a[c - k]);
    }
}

The initialization of the alphabet a can be shortened with code.

char a[63], *c, *k;
int i,j;    
void main(int n, char **v) {
    for(;j<26;++j)
        a[j]=32|(a[j+26]=65+j),
        a[52+j%10]=48+j%10;    
    while (*v[2]) {
        c = strchr(a, *v[2]++);
        k = strchr(a, v[1][i++%strlen(v[1])]);
        putchar(k > c ? a[62 + c - k] 
                      : a[c - k]);
    }
}

The code a[j+26]=65+j initializes a[26], a[27], a[28], .., a[51] of a with 'a', 'b', ..,'z' ORing the result of that assignment these entries to uppercase and they get placed into a[0], a[1], .., a[25].  The code a[52+j%10]=48+j%10 initializes a[52], a[53], .., a[61] with '0', '1', .., '9'.  The j%10 ensures that garbage entries never get written past the end of the array.  We can eliminate the %10 by extending the size of a and allowing garbage to get written there.  This is still within the boundaries set by the author of the problem.  Using the comma operator saves us from having to use opening and closing braces.

Explicitly assigning to a cost us 71 characters when the lines were joined, and unnecessary double-quotes were removed.  Doing this in a loop cost us 51 characters when everything was scrunched together.  A savings of 10 characters!

The final (?) steps are:
  1. In the while loop, embed the conditional inside of the array subscript.  A savings of five characters since we no longer need braces.
  2. The two calls to strchr()  can also be placed inside of the array subscript if we use the comma operator.  A savings of one character.
  3. main() doesn't need to be declared as void. A savings of five characters.
  4. the int type can be removed from the declaration of i and j since this is the default type in C.  A savings of four characters.
char a[99],*c,*k;
i,j;    
main(int n, char **v) {
    for(;j<26;++j)
        a[j]=32|(a[j+26]=65+j),
        a[52+j]=48+j;    
    while (*v[2])                 
        putchar(a[c=strchr(a,*v[2]++),
                  k=strchr(a,v[1][i++%strlen(v[1])]),
                  k>c?62+c-k 
                        :c-k]);
}

Scrunching the code together gives it the much sought-after mystique

char a[99],*c,*k;i,j;main(int n,char**v)
{for(;j<26;++j)a[j]=32|(a[j+26]=65+j),a[52+j]=48+j;
while(*v[2])putchar(a[c=strchr(a,*v[2]++),
k=strchr(a,v[1][i++%strlen(v[1])]),k>c?62+c-k:c-k]);}

The final charcter count (when lines are joined) is 186.  This is the complete code which can be copied, compiled and run as it appears here (it was tested with gcc).  Ignore any warnings of implicit declarations of strchr() or the unspecified storage class of i and j.