Competing with Giants

Yesterday, I was reading through a HackerNews post, asking which pieces of software were mindblowing upon first sight.

Among some of the usual suspects, I came across a comment by user nl, describing a simple spell-checker that Peter Norvig wrote on a plane ride, as an explanation to some friends on how spell-checkers work.

For those who might not have heard of Peter Norvig, he’s a giant in the field of AI, and is currently Director of Research at Google. Computer science students may recognize his amazing textbook- Artifical Intelligence: A Modern Approach (the one with chess pieces on the front).

Norvig’s spell-checker is a gorgeous thing. It’s refined, readable, and useful, and so fits my criteria for beauty in software. I’d recommend reading the source and equally wonderful explanations below it, if you’re interested.

But this isn’t meant to be a fanboy post; instead, this is about me, flexing in the shadow of giants before being appropriately humbled.

The Test

Another post on Norvig’s site describes some papers which compare programming languages for development time and performance on a sample task.

Besides the results of the papers, the cool part was that Norvig included his own solution (in Lisp), as well as how long it took him to complete the task!

You can read the original test instructions here.
The test inputs, expected results, and results of the study can be found here.

And so I set out to test my mettle, in attempting to write a python solution in less time than the 2 hours it took Peter Norvig (1.5 hours writing, 0.5 hours documenting and bugfixing).

My Timeline

+0:00 - Begin Reading problem
+0:05 - Half read problem, start writing solution
+1:20 - Finished solution, count chickens, begin writing documentation.
+1:50 - Finish documentation. Test on partial datasets, celebrate! Take break.
+2:00 - Decide to try solution on full dataset. The horror. Begin rewriting.
+3:15 - Finish and compare final solution to Norvig

At about +1:00 I was over the moon, thinking I’d finished much before any of the other programmers in the study. Then I read the problem statement more closely and realised I had made some wrong assumptions about how the output was meant to be formatted.

Then at +1:20, finishing my first solution, I had pretty much the same feeling. I ran the program on the smaller, test dataset which was provided, and when the outputs matched, fist-pumped and began writing some documentation.

At +2:00 when it finally dawned upon me to test my solution on the full dataset, I was stunned. Despite being correct, my program was horribly inefficient:

command time -l python3 solution1.py numbers.txt words.txt
236.60 real       235.65 user         0.94 sys
            49217536  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               81537  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                  10  signals received
                  25  voluntary context switches
                 274  involuntary context switches
       4367111360040  instructions retired
        747507413225  cycles elapsed
            24413312  peak memory footprint

For reference as to how bad that is, this time of 237 seconds is 3x worse than the worst time in the study, and 12x worse than the best time. By itself that might not be too bad- lisp can be much faster than python. Sure. But then again, I’m running this on an M1 mac in 2022 while these results are from 2000 running on god knows what.

When I ran Peter Norvig’s solution on the dataset…

❯ command time -l sbcl --script norvig.lsp > out
        0.09 real         0.06 user         0.02 sys
            61292544  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                4776  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                  59  signals received
                   0  voluntary context switches
                  17  involuntary context switches
           587319952  instructions retired
           181442127  cycles elapsed
            60670976  peak memory footprint

0.09 seconds, or 2629x faster than my python solution! This is probably a better benchmark for the best times from back in 2000.

So let’s see why my solution performs so poorly!

Original Solution 🤢

You can find my original solution here. See how many design flaws you can find, then we’ll go over potential fixes.

Design Flaws

1. Unnecessary repeated file IO

The most obvious sin is re-opening and reading the word dictionary for every single phone number! I immediately wrote a solution2.py afterwards, which just loads the dictionary once at the very beginning of the program. The results of this were fairly expected, shaving off 80 seconds:

❯ command time -l ./solution2.py input.txt dictionary.txt
 156.88 real       156.19 user         0.30 sys
 ...

2. Reading every word in the dictionary for each potential encoded section.

Probably the most egregious flaw though, is in find_encodings:

def find_encodings(
    phone_number: str, words: List[str], last_was_digit: bool = False
) -> List[str]:
  ...
  # Get a list of all the words that could be a valid next word.
  next_words = list(
      filter(lambda word: can_encode(phone_number[: len(word)], word), words)
  )
  ...

To generate next_words, we are actually checking every single word in the dictionary for every phone number. Without any proof, I’m going to claim that this makes find_encodings O(really really bad).

Of course, had I read this line from the original problem, I could have saved myself some time:

Your program must be run time efficient in so far that it analyzes only a very small fraction of all dictionary entries in each word appending step.

Improving this shortcoming is heavily linked to the next step.

3. Awkward choice of storage for the original words dictionary

Originally, to store the corpus, I built a dictionary which mapped standardized words to their original counterparts. This is an awkward idea, because now, determining if a word would be a valid encoding for a section of a phone number requires a conversion process from one to the other. We have the potential to perform this same conversion multiple times, unecessarily, when it could be performed only once at the very start of the program.

4. `find_encodings` only taking in the standardized list of words.

This forces an awkward step after all the encodings are found, where we must find which sections can be mapped back to their original words.

Instead, we should pass in the entire dictionary which maps standardized words to their original forms, and then encode the original word for a section instead of the standardized form.

5. `find_encodings` Returning a Result

We don’t need to do anything additional to the encodings once we find them, so find_encodings can just print the encodings as soon as they’re found, simplifying the internal logic.

And probably many more!

Improved Solution

For my second attempt, I decided it would be much nicer to store the original dictionary as a mapping from encoded phone numbers to a set of their potential original words within the dictionary.

This gave the following solution (also available here):

#!/usr/bin/env python
"""Usage: solution3.py numbers_file words_file"""
import sys
from typing import Dict, List, Set

NUMBER_TO_LETTERS = {
    0: "e",
    1: "jnq",
    2: "rwx",
    3: "dsy",
    4: "ft",
    5: "am",
    6: "civ",
    7: "bku",
    8: "lop",
    9: "ghz",
}

LETTER_TO_NUMBER = {
    letter: str(number)
    for (number, letters) in NUMBER_TO_LETTERS.items()
    for letter in letters
}


def main():
    numbers_file, words_file = sys.argv[1:]
    word_map = generate_word_map(words_file)

    with open(numbers_file) as numbers:
        for number in numbers:
            print_encodings(number.strip(), word_map)


def print_encodings(phone_number: str, word_map: Dict[str, Set[str]]) -> None:
    """Find all possible encodings for the given phone number and word list.

    Args:
        phone_number: A standardized phone number to find encodings for.
        word_map: A mapping from phone numbers to original words that could
            make them.

    Returns:
        A list of encodings, which takes the form of individual words from the
        dictionary, or a single digit, separated by a single space.
    """

    digits = standardize_phone_number(phone_number)
    if not digits:
        return

    def subsections(start: int) -> List[str]:
        """For k in the interval (start, len(phone_number)), gives all slices:
        phonenumber[start: k]
        """
        word = digits[start:]
        return [word[: end + 1] for end in range(len(word))]

    def find_encodings(
        sections: List[str], start: int = 0, last_was_digit: bool = False
    ) -> None:
        """Print all possible encodings of the slice: phone_number[start:].

        Args:
            sections: The original words (or numbers) that would be a valid encoding
                so far for the slice phone_number[: start]
            start: The index from which to search for potential encodings in
                phone_number.
            last_was_digit: An optional (redundant) parameter which is true iff
                the last section was a digit.
        """
        if start >= len(digits):
            print(f'{phone_number}: {" ".join(sections)}')
            return

        next_sections = list(
            filter(lambda section: section in word_map, subsections(start))
        )
        for section in next_sections:
            for original_word in word_map[section]:
                find_encodings(sections + [original_word], start + len(section))

        if not next_sections and not last_was_digit:
            find_encodings(sections + [digits[start]], start + 1, True)

    find_encodings([])


def generate_word_map(words_file: str) -> Dict[str, Set[str]]:
    """Generates a mapping from phone numbers to all original words that could
    have made that phone number.

    Args:
        words_file: A valid path to a words_file.
    """
    with open(words_file) as word_list:
        original_words = [word.strip() for word in word_list.readlines()]

    result: Dict[str, Set[str]] = {}
    for word in original_words:
        phone_number = word_to_phone_number(word)
        result.setdefault(phone_number, set()).add(word)

    return result


def word_to_phone_number(word: str) -> str:
    """Converts a word to its corresponding phone number."""
    letters = filter(lambda letter: letter.isalpha(), word.lower())
    return "".join(map(lambda letter: LETTER_TO_NUMBER[letter], letters))


def standardize_phone_number(phone_number: str) -> str:
    """Removes all non-numeric characters from a phone number."""
    return "".join(list(filter(lambda digit: digit.isnumeric(), phone_number)))


if __name__ == "__main__":
    main()

Key improvements

Now we’re only checking at max all possible subsections of the phone number, rather than multiplying that by every word in the original dictionary.
Adding the sections parameter to the find_encodings closure made the body much nicer, and was an idea I got from the section on tail-recursive calls in the beginning of SICP. I also wouldn’t have thought to use the closure to hide the start and last_was_digit parameters without that book.
By passing the entire mapping from digits->set<original_word>, we can add the original word to sections as opposed to the standardized form, avoiding an awkward reconversion process later.
I know that last_was_digit is technically redundant, but its prettier to include it this way then check if sections is empty.

Also without further ado, the new runtime was 0.265 seconds 🎉 and used around 38MB of memory.

Comparing to Norvig’s Solution

I was incredibly happy to find that I converged to the same solution as Norvig, even though it took me an extra hour and a bit (and I didn’t document anywhere near as much as he did). The main differences are:

He encoded the original words as digits representing the converted phone numbers rather than strings of the same thing.
I’m calculating all subsections of the phonenumber a little more eagerly, while he’s unrolling them in a loop.
His solution contains so few lines of code that it’s kind of bizarre.
His solution uses 60MB compared to my 38MB, but as far as I can tell, his should actually be using less than mine and I’m thinking this is something to do with Python vs lisp?

Also if anybody is skeptical that I just read his solution ahead of time for my second attempt and converted it to python, I will say that as somebody quite unpracticed with lisp, it took me more time to read/lookup/understand his solution than to write mine.

Conclusion

This turned out to be alot of fun in the end. I hadn’t had the chance to write this sort of program (non-trivial and well defined) in a long time, so it was nice to exercise that part of my brain.

As for how my experience relates to the point of the original studies- I was able to produce a python solution in comparable time to the lisp participants, and which ran within (probably) the same efficiency range. It’s hard to know this for sure without running the unavailable code from the other participants on my machine.