Thursday, June 6, 2013

Visualizing Data structures With dot. Part 0: the dot Language.


The Problem
Visualizing data structures can be a great aid in debugging. Unfortunately most development environments don't have the capability to display linked data in graphical form. One notable exception is the GNU Data Display Debugger (http://www.gnu.org/software/ddd/).

A Solution
In this four-part series, we will explore a few simple techniques for drawing data structures using the dot language.  
  • Part 0: The dot Language.
  • Part 1: Some Advanced Features of the dot Language.
  • Part 2: Visualizing Data Structures in C.
  • Part 3: Visualizing Data Structures in Java.
dot is part of the popular graphviz suite of tools (http://graphviz.org/).

All of the source code used in this blog post can be found at https://bitbucket.org/todd_k_fisher/problem-solved-blog/overview. I have placed the Emacs org-mode file used to create this post there as well.

Basics of the dot Language
dot is used to draw directed graphs which have a (mostly) hierarchical structure. The language is easily understood by example. A binary tree might be specified as 

file: example0.dot
digraph bintree {
  10 -> 5;
  10 -> 15;
  5  -> 2;
  5  -> 8;
  2  -> 1;
  2  -> 3;
  8  -> 6;
  8  -> 7;
  15 -> 13;
  15 -> 20;
}

Edges are given with: startNode -> endNode. Use the dot command to render this to an SVG file
dot -Tsvg example0.dot -oexample0.svg
The result can be viewed in your browser

Simple binary tree (from example0.dot)

dot can also handle general graph structures. Here is a non-hierarchical graph

file: cycle.dot
digraph cycle {
  1 -> 2 -> 3 -> 4 -> 1;
}

Cyclic graph
(from cycle.dot)
Care must be used if the ordering of children is important. Had the two children of node 8 been mistakenly reversed in the source file as indicated by the red lines

file: example1.dot
digraph bintree {
  10 -> 5;
  10 -> 15;
  5  -> 2;
  5  -> 8;
  2  -> 1;
  2  -> 3;
  8  -> 7;
  8  -> 6;
  15 -> 13;
  15 -> 20;
}

the result would not be a correctly rendered binary tree

Incorrectly drawn binary tree
(from example1.dot)

To remedy this, set the ordering to match the order listed in the source file with the code: ordering=out and be sure to list the children of each node from lowest to highest.

Node Shapes
The default shapes of nodes in dot are ellipses. Differently-shaped nodes can be drawn by writing the node name followed by the shape as in

file: example2.dot
digraph bintree {
  ordering=out;
  1  [shape=rectangle];
  3  [shape=rectangle];
  6  [shape=rectangle];
  7  [shape=rectangle];
  13 [shape=rectangle];
  20 [shape=rectangle];
  10 -> 5;
  10 -> 15;
  5  -> 2;
  5  -> 8;
  2  -> 1;
  2  -> 3;
  8  -> 6;
  8  -> 7;
  15 -> 13;
  15 -> 20;
}
The tree now looks like

Binary tree with
rectangular leaves.
(from example2.dot)

The location of the [shape = rectangle] lines in the source file is unimportant.

Colors
Nodes can be filled or outlined with a particular color. Let's outline the interior nodes in our tree brown and fill in the leaves light green. To do this, set the color attributes of each node.
For outline colors write: color = brown.
For color fills write: color = palegreen,style = filled.
Let's color the edges in the tree brown as well.
Colored edges should be followed with [color = brown]

file: example3.dot
digraph bintree {
  ordering=out;
  1        [shape=rectangle, color=palegreen, style=filled];
  3        [shape=rectangle, color=palegreen, style=filled];
  6        [shape=rectangle, color=palegreen, style=filled];
  7        [shape=rectangle, color=palegreen, style=filled];
  13       [shape=rectangle, color=palegreen, style=filled];
  20       [shape=rectangle, color=palegreen, style=filled];
  10       [color=brown];
  5        [color=brown];
  2        [color=brown];
  8        [color=brown];
  15       [color=brown];
  10 -> 5  [color=brown];
  10 -> 15 [color=brown];
  5  -> 2  [color=brown];
  5  -> 8  [color=brown];
  2  -> 1  [color=brown];
  2  -> 3  [color=brown];
  8  -> 6  [color=brown];
  8  -> 7  [color=brown];
  15 -> 13 [color=brown];
  15 -> 20 [color=brown];
}

Colored binary tree.
(from example3.dot)

Next

In the next part of this series, we will explore some more advanced features of dot:
  • Record structures with labeled fields.
  • Subgraphs.

Thursday, May 16, 2013

Emacs Word Commands in C and Java Modes


The Problem
You use the Emacs commands:

CommandAction
M-fMove forward by one word
M-bMove backward by one word
M-dDelete word ahead of the cursor
M-DELDelete word behind cursor
M-tTranspose words
M-@Mark word
M-uUpcase word
M-lDowncase word
M-cCapitalize word

They work beautifully with ordinary text but not so great in C or Java modes. For example, the arrows in the image below show where the cursor will skip to if you use M-f to skip forward starting at the first arrow.




The problem here is that C-mode doesn't recognize "_" as part of a "word".

The Solution
There is an easy fix for this: simply add the following lines to your .emacs file:
01 (add-hook 'c-mode-common-hook
02        (lambda ()
03          (modify-syntax-entry ?_ "w" c-mode-syntax-table)))
04 (add-hook 'java-mode-hook
05        (lambda ()
06          (modify-syntax-entry ?_ "w" java-mode-syntax-table)))
The image below shows where the cursor will skip to after the hook is installed.




All word-related commands will recognize "_" as part of a "word" in C and Java modes.

Problem solved.

Wednesday, May 15, 2013

Using Emacs for Fun and Profit. Part 2: Copying random text from random places.





The Problem

In this post we will address the problem of collecting text from scattered locations in one or more buffers and placing it in a specific location. For example: suppose that you want to copy some C function prototypes from a .c file into a header file as illustrated below.


What is an efficient way to accomplish this? There are two solutions that come to mind right away.

Solution 1: Use Text Registers
  1. Mark text in the source buffer.
  2. Append text to some register (we'll use "A" here).
  3. Repeat steps 1 and 2 until you have collected all of the text that you need to move.
  4. Switch to the destination buffer.
  5. Move to the target location.
  6. Copy register "A" to destination buffer with C-x r i A
Let's walk through this. In the screenshots below, the source buffer is on the left and the destination buffer is on the right. 

Steps 1 and 2: Mark text in the source buffer, append with M-x append-to-register 


 
Steps 4, 5, and 6: Switch to the destination buffer, move to the target location and insert register "A".




Unfortunately append-to-register places all of the copied text onto a single line so some editing must be done.



Solution 2: Use the Global Mark
Emacs' CUA mode provides a global mark which offers a more efficient solution than text registers. When the global mark is enabled, any keystrokes that you type will be sent to the target location and any text that you place in the kill ring will also be sent to the target location. So the sequence of steps is now:
  1. Move to target location and enable the global mark S-C-SPC.
  2. Copy the text you wish to send to the target location to the kill ring.
  3. Press ";" then RET to place a semicolon at the the end of the target text. Pressing RET causes any other text copied to the kill ring to be placed on a separate line. This saves us the editing step required when we use append-to-register.
To use the global mark, place the following in your .emacs file:
(setq-default cua-enable-cua-keys nil)
(cua-mode t)
This will enable CUA features such as global mark/copy and enhanced rectangle editing but not enable CUA keys such as C-c for copy, C-v for paste etc.

Let's illustrate the above steps with some screenshots.

Step 1: Move to target location and enable the global mark S-C-SPC.



Step 2: Copy the text you wish to send to the target location to the kill ring.   The text will instantly appear at the target location.



Step 3: Press ";" then RET to place a semicolon at the the end of the target text. Pressing RET causes any other text copied to the kill ring to be placed on a separate line. This saves us the editing step required when we use append-to-register.


  Repeat steps 2 and 3 to copy more text

Sunday, April 21, 2013

Using Emacs for Fun & Profit. Part 1: Making the Mode Line Useful

Summary
The Emacs mode line is somewhat cryptic.  A brief introduction to mode line customization in Emacs is given.
 
The Default Mode Line
The Emacs mode line is an area below the contents of a buffer displaying information such as current line number of the cursor, buffer name, and major/minor modes.  Here is the default modeline (in Emacs 24.2.1)


The line and column numbers are shown as are the buffer name and current modes.  Not so clear though are the mysterious "U:---" and "All".  What do they mean?

If I open a file and page down several times the word "All" changes to "39%" which is the amount of text below the bottom of the window.



What about  that mysterious  "-:--" which  was "U:**-"  in the  scratch buffer's
mode line?

The emacs documentation for the mode line states that the left part of the mode line has the form

CS:CH-FR

Where
  • CS is the coding system (possibly prefixed with a  character designating the current input method).
  • The character(s) immediately after CS indicate the end-of-line convention.  This can be "\" or "(DOS)" if the DOS/WINDOWS carriage return + linefeed convention is used to indicate end-of-line.
  • CH is
    •  "--" if the buffer is unmodified,
    • "**" if the buffer has been edited,
    • "%%" if the buffer is read-only and unmodified and
    • "%*" if the buffer is read-only and modified         
  • The character following CH is a dash ("-") unless the default directory for the buffer is on a remote machine, in which case it is a "@".
  • FR is the frame name and is seen only on text terminals.
Customizing the Mode Line  
If you're like me, then you probably won't remember the bizarre conventions of the default mode line in your day-to-day use of emacs. Fortunately the mode line can be customized like anything else in emacs.  I'll walk through the interesting parts of customizing a mode line to look like this



  (the vertical text was added to explain each part)

Here is the elisp code for this modeline

(0) (setq-default mode-line-format
(1)  (list
(2)    "  "
(3)    mode-line-buffer-identification
(4)    " │ "
(5)    "dir: "
(6)    '(:eval (last-dir default-directory))
(7)    " │ "
(8)    ;; '%02' to set to 2 chars at least; prevents flickering
(9)      "%04l"
(10)     ","
(11)     "%02c" 
(12)   " │ "
(13)   '(:eval (if (buffer-modified-p) "M" "-"))
(14)   '(:eval (if buffer-read-only    "R" "-"))
(15)   '(:eval (if (window-dedicated-p (selected-window)) "D" "-"))
(16)   " │ "
(17)   mode-line-modes    
(18)   )) 
 
The Emacs lisp variable mode-line-format on line 0 determines what gets placed onto the mode line.  This variable must be set with setq-default so that the new value can be seen in all buffers.

The predefined variable on line 3, mode-line-buffer-identification, will show the current buffer name in bold text. Clicking on the buffer identification with mouse-1 will cause Emacs to switch to the next buffer in its buffer list and clicking on it with mouse-2 will switch to the previous buffer.

Lines 4 and 5 are simply text that will be displayed verbatim. The current line and column are displayed by lines 9 and 11.

Percent characters followed by a single-letter code in literal strings are a signal for Emacs to substitute various data. Some examples:
  • %I - Abbreviated size of buffer (e.g. 32k, 2M, etc.)
  • %p - percentage of buffer text above the top of the window.
  • %P - percentage of buffer text below the bottom of the window.  This is displayed in the default mode line.
Others can be found here. In our case, we are using %l and %c to display the cursor line and column. Additionally, you can designate a field width in C's printf() style by placing a number after the '%'.

Emacs also allows arbitrary pieces of lisp code to be evaluated when constructing the mode line. Here the real power of mode line customization can be seen. A simple example is on line 6 which uses the :eval construct. On this line, the default directory for the current buffer is passed to the function last-dir which peels off all but the last level of the directory path

(defun last-dir (p)
  (let ((dirs (reverse (split-string p "/"))))
    (cond ((and (equal (car dirs) "")
                (equal (cadr dirs) ""))
           "/")   ; root dir
          ((and (equal (car dirs) "")
                (equal (cadr dirs) "~"))
           "~")   ; home dir
          (t (cadr dirs))))) 

The variable mode-line-modes is used on line 17 to display the major and minor modes which are active for the current buffer. Clicking on the major mode name with mouse-1 will display a menu of available commands for that mode. Clicking on the minor mode name with mouse-1 will display a pop-up menu allowing you to turn off the minor mode or display help. Clicking on the major or minor mode name with mouse-2 will display a menu of minor modes to choose from.

Consult the Emacs documentation for more details on mode line customization.
http://www.gnu.org/software/emacs/manual/html_node/elisp/Mode-Line-Format.html#Mode-Line-Format

Sunday, April 14, 2013

Code Golf

Summary
This post walks through the process of shortening the code for a simple deciphering program.  The language used is C.

The problem
Code golf is the questionable practice of writing a program in the fewest (key)strokes (characters).  It is similar in spirit, but not exactly identical to code obfuscation. 

Let's take the Vigenère ciphertext decyption problem from the StackExhange code golfing site (http://codegolf.stackexchange.com/) and see what is the shortest C program that we can write.  Our approach will be to start with an "ungolfed" (readable) program and shorten it until we believe that it cannot be shortened any further.

First Attempt
The following program works but from a golfing perspective is very bad

#include <stdio.h>
#include <string.h> 
char rotated_alphabet[62][63];
char *unrotated_alphabet =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";

void gen_rotated_alphabet(void) {
    int i, j, k;
    for (i = 0; i < 62; ++i) {
        j = i;
        k = 0;
        do {
            rotated_alphabet[i][k++] = unrotated_alphabet[j];
            j = (j + 1)%62;
        } while (j != i);
        rotated_alphabet[i][62] = 0;
    }
}

char decipher(char c, char key) {
    int i, j;
    for (i = 0; i < 62 && key != unrotated_alphabet[i]; ++i)
    ;
    for (j = 0; j < 62 && c != rotated_alphabet[i][j]; ++j)
    ;
    return unrotated_alphabet[j];
}

// argv[1] : key
// argv[2] : encrypted string
void main(int argc, char **argv) {
    int i;
    gen_rotated_alphabet();
    for (i = 0; i < strlen(argv[2]); ++i) {
        putchar(decipher(argv[2][i], argv[1][i%strlen(argv[1])]));
    }    
}    

  1. It is bad because it uses variables longer than one character.
  2. It is bad because it has "#include's".  These usually aren't necessary when golfing code.  Many times the assumptions that the compiler makes about any routines linked in later will suffice.
  3. It is bad because it uses functions which are called only once.  When golfing, it is usually far better to inline functions by hand.  (unless, of course calling a function would result in a shorter program)
  4. It is bad because it doesn't make use of obscure C library functions.  The more obscure and poorly-named, the better as this gives your golfed programs a kind of mystique.
  5. The algorithm for deciphering results in longer than necessary code.
  6. A comment is used. 
  Let's start with (5) first and use a better algorithm.  Storing all permutations of  unrotated_alphabet in the rotated_alphabet table and searching for the ciphertext in the row corresponding to the key isn't necessary.

A Better Algorithm
Let's use the smaller table 'A' .. 'Z' and work by examples.  

Example 1.
Key = 'F', ciphertext = 'H'



The deciphered text is 'C'.  Notice that the difference of the positions of key 'H' and ciphertext 'F' in the unrotated alphabet is 2, which corresponds exactly to the position of the deciphered text 'C' in the unrotated alphabet as well.

Example 2.
Key = 'F', ciphertext = 'Q'.

The deciphered text is 'L'.  The distance between key 'F' and ciphertext 'Q' in the unrotated alphabet is 11 characters which corresponds to the position of 'L' in the unrotated alphabet.

Example 3.
Key = 'R', ciphertext = 'F'


The deciphered text is 'O'.  Since the position in the unrotated alphabet of the ciphertext 'F' is prior to the position of 'O' the difference is -12.  If we add the length of the alphabet to that we have -12+26 = 14 which is the position of the letter 'O' in the unrotated alphabet.

Example 4.
Key = 'Z', ciphertext = 'A'.


The deciphered text is 'B'. Here position('A')-position('Z') = -25.  If we add -25+26 we have 1, which is the position of 'B' in the unrotated alphabet.

So we have a much shorter deciphering algorithm:


k = position(key)
c = position(cipher)
if k > c then decipher = alphabet[26 + c - k]
else decipher = alphabet[c - k]
 
The resulting C program is (after placing the decipher() routine inline)


char *alphabet =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";

// argv[1] : key
// argv[2] : encrypted string
void main(int argc, char **argv) {
    int i;
    char *pc, *pk;
    for (i = 0; i < strlen(argv[2]); ++i) {
        pc = strchr(alphabet, argv[2][i]);
        pk = strchr(alphabet, argv[1][i%strlen(argv[1])]);
        putchar(pk > pc ? alphabet[62 + pc - pk] 
                        : alphabet[pc - pk]);
    }
}    

the call strchr(s, c) returns a pointer to the location of the first occurrence of character c within string s.

Shortening The Code Further
Shortening variables and moving declarations outside reduces the program size even further.  Note that at global scope, explicitly unintialized variables will be given an  value of 0.


char *a =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789",
    *c, *k;
int i;
void main(int n, char **v) {
    for (;i < strlen(v[2]); ++i) {
        c = strchr(a, v[2][i]);
        k = strchr(a, v[1][i%strlen(v[1])]);
        putchar(k > c ? a[62 + c - k] 
                        : a[c - k]);
    }
}
 
Since we only iterate through the characters of the ciphertext v[2] once, we use pointer arithmetic and modify it directly.

char *a =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789",
    *c, *k;
int i;    
void main(int n, char **v) {
    while (*v[2]) {
        c = strchr(a, *v[2]++);
        k = strchr(a, v[1][i++%strlen(v[1])]);
        putchar(k > c ? a[62 + c - k] 
                      : a[c - k]);
    }
}

The initialization of the alphabet a can be shortened with code.

char a[63], *c, *k;
int i,j;    
void main(int n, char **v) {
    for(;j<26;++j)
        a[j]=32|(a[j+26]=65+j),
        a[52+j%10]=48+j%10;    
    while (*v[2]) {
        c = strchr(a, *v[2]++);
        k = strchr(a, v[1][i++%strlen(v[1])]);
        putchar(k > c ? a[62 + c - k] 
                      : a[c - k]);
    }
}

The code a[j+26]=65+j initializes a[26], a[27], a[28], .., a[51] of a with 'a', 'b', ..,'z' ORing the result of that assignment these entries to uppercase and they get placed into a[0], a[1], .., a[25].  The code a[52+j%10]=48+j%10 initializes a[52], a[53], .., a[61] with '0', '1', .., '9'.  The j%10 ensures that garbage entries never get written past the end of the array.  We can eliminate the %10 by extending the size of a and allowing garbage to get written there.  This is still within the boundaries set by the author of the problem.  Using the comma operator saves us from having to use opening and closing braces.

Explicitly assigning to a cost us 71 characters when the lines were joined, and unnecessary double-quotes were removed.  Doing this in a loop cost us 51 characters when everything was scrunched together.  A savings of 10 characters!

The final (?) steps are:
  1. In the while loop, embed the conditional inside of the array subscript.  A savings of five characters since we no longer need braces.
  2. The two calls to strchr()  can also be placed inside of the array subscript if we use the comma operator.  A savings of one character.
  3. main() doesn't need to be declared as void. A savings of five characters.
  4. the int type can be removed from the declaration of i and j since this is the default type in C.  A savings of four characters.
char a[99],*c,*k;
i,j;    
main(int n, char **v) {
    for(;j<26;++j)
        a[j]=32|(a[j+26]=65+j),
        a[52+j]=48+j;    
    while (*v[2])                 
        putchar(a[c=strchr(a,*v[2]++),
                  k=strchr(a,v[1][i++%strlen(v[1])]),
                  k>c?62+c-k 
                        :c-k]);
}

Scrunching the code together gives it the much sought-after mystique

char a[99],*c,*k;i,j;main(int n,char**v)
{for(;j<26;++j)a[j]=32|(a[j+26]=65+j),a[52+j]=48+j;
while(*v[2])putchar(a[c=strchr(a,*v[2]++),
k=strchr(a,v[1][i++%strlen(v[1])]),k>c?62+c-k:c-k]);}

The final charcter count (when lines are joined) is 186.  This is the complete code which can be copied, compiled and run as it appears here (it was tested with gcc).  Ignore any warnings of implicit declarations of strchr() or the unspecified storage class of i and j.