Solving ITA’s Word Numbers Puzzle

With all the stores closed and cold and damp weather outside, yesterday was the perfect day to work through a programming puzzle. I skimmed down ITA’s puzzle archive and thought Word Numbers seemed interesting. It was retired in September of 2008, so I don’t feel bad publishing my solution publicly. I’ve seen this puzzle before briefly, but had not actually made an attempt at solving it until yesterday.

The gist of the problem is to determine what number is completed at the 51 billionth byte when you concatenate the numbers 1 through 999,999,999 written out in words, such as “onemilliontwohundredthousandthreehundredfortyfive”, sorted alphabetically. What seems like an easy problem at first glance becomes dramatically more difficult with those last two words; sorting the word numbers alphabetically is what makes things interesting. The problem also asks for the sum of the numbers up-to-and-including that number ending at the 51 billionth byte in the sorted order.

With infinite resources, this problem is trivial: write out all billion numbers, run it through sort, and walk through the file until we get to the 51 billion byte mark, tallying up the values along the way. Unfortunately, my infinite supercomputer is stuck in an infinite loop from a few years back and I haven’t been able to make use of it since, so we’re going to have to approach this the old-fashioned way and use our brains. Full solution after the fold.

Getting Started

Being an older puzzle, I thought that perhaps the advent of cheap RAM and better hardware (especially SSDs) could allow the brute force method to now be viable, so I thought I’d approach this as a pure speed challenge in C++.

The natural starting point is to write a function that converts a normal number to the appropriate word number. My first attempt was highly expressive, but slow as molasses:

// convert the number N to a string,
// e.g. "911610034" => "ninehundredelevenmillionsixhundredtenthousandthirtyfour"
std::string numWord(int n)
{
    std::string ret;

    // word bank
    static const std::string
        ones[] = {"", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
            "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"},
        tens[] = {"", "", "twenty", "thirty", "fourty", "fifty", "sixty",
            "seventy", "eighty", "ninety"},
        magnitudes[] = {"thousand", "million", "billion"};
    static const int ords[] = { 1000, 1000000, 1000000000 };

    // handle the <1k case specially
    if (n < 1000)
    {
        int hundreds = n / 100,
            leftover = n % 100;

        if (hundreds)
            ret += ones[hundreds] + "hundred";

        // special handle "teens" for their oddball-ness
        ret += leftover >= 20 ? (tens[leftover / 10] + ones[leftover % 10])
            : ones[leftover];
    }
    else
    {
        // now find which order of magnitude we are and recurse down
        for(int i = 0; i < 3; i++)
        {
            // we really mean n<ords[i]*1000, i.e. is n between ords[i] and
            // ords[i]*1000, but need to avoid overflowing the int
            if ((n / 1000) < ords[i])
            {
                // e.g. 123,456 => numWords(123) + "thousand" + numWords(456)
                ret += numWord(n / ords[i])
                    + magnitudes[i]
                    + numWord(n % ords[i]);
                break;
            }
        }
    }

    return ret;
}

This conventional approach, while easy to use, read, and maintain, was much too slow for my brute forcing effort. Running through 1 million, 10 million, and 100 million numbers projected the full billion to take hours. Changing the function to int numWord(int n, std::string *ret) to avoid making more than one string sped things up a bit, but not nearly enough.

Since it’s a short bit of code that’s right at the center of it all, I decided to write a simple C-string wrapper that provided fast string concatenation and use that instead. It exposes a similar interface to std::string and uses a single manual allocation to provide predictable performance. It also provides an unsafe string concatenation function that favors speed over reliability:

// C-string wrapper that mimicks basic std::string functionality
// using fixed-size allocations and offering fast string concatenation
class FastCatStr
{
    uint64_t len, allocated;
    char *data;

public:
    FastCatStr() : data(nullptr), len(0), allocated(0) {}
    ~FastCatStr() {
        reserve(0);
    }

    void reserve(unsigned int n) {
        if (data != nullptr)
        {
            data = nullptr;
            delete [] data;
        }
        allocated = n;
        if (allocated > 0)
            data = new char[allocated];
    }
    inline void clear() {
        len = 0; // virtually no work required
    }

    inline void concat(const char *str) {

        /* this should really be something like:
                len += strlen(str);
                if (len > allocated)
                    return false; // or throw an exception
                // .. as is

            or more "traditionally" (which takes twice as long, presumably
            due to in-register optimizations):

                // fail silently, but don't trash memory
                while(*str && len < allocated)
                    data[len++] = *str++;
                data[len] = '\0';

            interestingly enough, making "end" a class variable and carrying
            it through instead of len (i.e. clear => (end = data), size =>
            (end - data)) is significantly slower than managing it here
        */

        char *end = data + len;
        // EXTREMELY dangerous without bounds check, but we're a fast cat.
        while(*str)
            *end++ = *str++;
        *end = '\0';
        len = end - data;
    }

    // external interfaces to act like a std::string
    inline void push_back(char c) { data[len++] = c; data[len] = '\0'; }
    inline void pop_back() { data[--len] = '\0'; }
    inline uint64_t size() const { return len; }
    inline const char *c_str() const { return data; }
};

Instead of *ret += (...) we just use ret->concat(...), and as long as we pass in a FastCatStr that has reserved plenty of space (such as 1024 bytes), all is well. Coupled with a few extra optimizations in the numWord function, this new implementation is blazing fast. The original std::string implementation was still running when the FastCat version finished running, clocking in at less 90 seconds to run through all billion numbers. This affords us with some valuable knowledge: the longest string is much less than 1024 bytes (so we’re safe), and the whole dataset is roughly 70 GB of data. This infers that we could handle the entire dataset in memory on a good server these days; EC2 unfortunately caps out at 68 GB of RAM.

Tokenizing

Since we are working with a finite dictionary of words, we should be able to tokenize them to save space. Representing each word with a single char reduces our memory footprint substantially. In order to achieve our end-target of sorting these numbers, we need to maintain the same sorting order in the tokens as the original words. For example, “three” has to come before “two”, so they may tokenize to 15 and 20.

This makes it a bit harder to tokenize certain words; the number “sixteen”, for example, needs to come after “sixhundred” but before “sixthousand”, so we need to break up “sixteen” into “six” and “teen”. This only applies when a shorter word is totally encompassed by a larger one, which includes four, six, seven, eight, and nine, with eight being extra picky because of its trailing “t”. This means that we need to tokenize “y”, “ty”, “een”, and “teen” separately, but only for the six aforementioned numbers (i.e. fifteen can remain a single token).

This tokening system conveniently results in 26 tokens, which means we can use the numbers 65 (ASCII “A”) through 90 (ASCII “Z”) to represent them, and can easily print them out for both our result and for testing. It is interesting to note that after an hour or so entrenched in this mapping, the tokenized numbers read as fluently as normal numbers, much in the same way that when working with bitmasks or binary file formats, hex becomes as legible as reading decimal numbers. Generating tokenized numbers is easy, as the “words” arrays in the wordNum function can be replaced with the tokenized words:

/*
    tokens to use:
        A. billion, B. een, C. eight, D. eleven, E. fifteen, F. fifty,
        G. five, H. forty, I. four, J. hundred, K. million, L. nine,
        M. one, N. seven, O. six, P. teen, Q. ten, R. thirteen, S. thirty,
        T. thousand, U. three, V. twelve, W. twenty, X. two, Y. ty, Z. y
    e.g. #123456
    = onehundredtwentythreethousandfourhundredfiftysix
    = one hundred twenty three thousand four hundred fifty six
    = M   J       W      U     T        I    J       F     O
    = MJWUTIJFO
    
    e.g. the numbers 1006000 and 1000016 translate to:
            "onemillionsixthousand" > "onemillionsixteen"
        which  translates to MKOT > MKOP
 */
static const char
    *ones[] = {"", "M", "X", "U", "I", "G", "O", "N", "C", "L", "Q",
        "D", "V", "R", "IP", "E", "OP", "NP", "CB", "LP"},
    *tens[] = {"", "", "W", "S", "H", "F", "OY", "NY", "CZ", "LY"};

Tokenizing the numbers means that the full set is roughly 14 GB, which is now well within the realm of just malloc‘ing the whole damn thing and letting C’s built-in qsort do the work if you had a server with more than 16 GB of RAM, which isn’t too uncommon these days. My laptop only has 8 GB, though, so this isn’t a satisfactory solution for me.

It’s clear that in order to solve this problem on normal hardware, we’re going to need to shard the data in some way. The natural way to do so would be by the first letter in the tokenized word number. We know that “C” (eight) is the earliest first letter that can start the word, so if we start there we can tally up how many bytes have passed after “C”, then move on to “D” (eleven) and so forth. The only problem is that there are tens of millions of word numbers that start with the letter “C”; it’s still prohibitively memory intensive.

Instead of sharding the set manually, let’s make the computer do the work. We can shard the dataset all the way down, recursively, until there’s nothing left to shard. While this effectively provides us with constant memory, the question remains as to whether we can do it in a reasonable amount of CPU time.

The naive implementation of an auto-sharding traversal algorithm would be to try every combination of all 26 letters, from A to Z each time to maintain the correct order, and test if that string is a valid word number. If it is, count the number of bytes it represents towards the total, until we get to 51 billion. Unfortunately, the longest tokenized word is 17 characters long. 26^17 is roughly 10^24, or 1 million billion billion. Constant memory would be nice, but finishing in my lifetime isn’t something I’m willing to give up for it.

I’s before E’s, except after C’s (or Y’s, T’s, Z’s, or A’s)

Instead of using a guess-and-check method, we can instead create a grammar for our tokenized word numbers language that defines whether a word is legal or not, and then reverse that grammar to determine the complete list of legal combinations. This may sound daunting, but the word numbers language is pretty constrained. Words like “oneoneonehundredhundredmillionmillion” are not valid.

This is where things get interesting: effectively what we’re setting out to accomplish is the ability to walk the entire sorted set without actually generating it, in constant memory, using no extra cycles.

The grammar I came up with is as follows, in a quasi-BNF format:


    wordnum:
        billionnum

    billionnum:
        -- hundrednum A millionnum          -- n/a, just showing the trend
        -- hundrednum A                     -- ...
        millionnum

    millionnum:                             -- falls through
        hundrednum K thousandnum
        hundrednum K
        thousandnum

    thousandnum:                            -- e.g. 1512 (MTGJR) or 1000 (MT)
        hundrednum T hundrednum
        hundrednum T
        hundrednum

    hundrednum:                             -- e.g. 512 (GJR) or 500 (GJ)
        ones J basicnum
        ones J
        basicnum

    basicnum:                               -- e.g. 5 (G), 13 (R), or 51 (FM)
        ones
        teens
        tens ones
        tens

    ones:
        M, X, U, I, G, O, N, C, L

    teens:
        Q, D, V, R, IP, E, OP, NP, CB, LP

    tens:
        W, S, H, F, OY, NY, CZ, LY

We start (from the bottom to the top) by defining the primitives, the tens, teens, and ones, as a list of the tokens that represent. From here, we can define a basic number to represent numbers under 100. The basic number is either a ones, a teen, a tens followed by a ones, or a tens. Note that a word number can end with any token, so every rule can be a terminus.

With basic numbers in place, we can create a hundreds number to represent numbers less than 1,000. The hundreds number is either a ones followed by a “J” token (the word “hundred”), followed by an optional basic number, or it’s just a basic number. It falls through because there’s no requirement that a number less than 1,000 has the word “hundred” in it. In essence, this rule could be written as [ones J] [basicnum], where square brackets are used to signify optional tokens, with any combination being valid.

The thousands number (numbers less than 1 million), millions number (numbers less than 1 billion), and billions number (numbers less than 1 trillion) all behave similarly to the hundreds num, with an optional “thousands” or “millions” chunk and falling through accordingly. The billions number also ends up being irrelevant because our problem is scoped within numbers less than 1 billion. If we had to handle bigger numbers, however, this grammar could expand accordingly.

The next step is to reduce this grammar down manually into something a little more digestible by us mortals by substituting each rule until we only have wordnum, hundrednum, and our primitives.

    wordnum:    -- wordnum == billionnum == millionnum
        -- originally: hundrednum K thousandnum
        -- recall that "K" means "million" and "T" means "thousand"
        hundrednum K (hundrednum T hundrednum)
        hundrednum K (hundrednum T)
        hundrednum K (hundrednum)
        -- originally: hundrednum K
        hundrednum K
        -- originally: thousandnum
        (hundrednum T hundrednum)
        (hundrednum T)
        (hundrednum)
    
    hundrednum:
        -- originally: ones J basicnum
        ones J (ones)
        ones J (teens)
        ones J (tens ones)
        ones J (tens)
        -- originally: ones J
        ones J
        -- originally: basicnum
        ones
        teens
        tens
        tens

As the previous description of our grammar suggested, the parallelism of each rule falling down to the smaller rule all the way through to the basicnum makes it much easier to grasp what the grammar is doing. Reducing a more complex grammar would obviously be much more difficult. At this point, we can determine the number of permutations available through each rule as a sanity check:

    hundrednum:
        ones J ones                             -- 9*9=81 permutations
        ones J teens                            -- 10*9=90 permutations
        ones J tens ones                        -- 9*8*9=648 permutations
        ones J tens                             -- 9*8=72 permutations
        ones J                                  -- 9 permutations
        ones                                    -- 9 permutations
        teens                                   -- 10 permutations
        tens ones                               -- 8*9=72 permutations
        tens                                    -- 8 permutations
        ------------------------------------------ 999 permutations in total

    wordnum:
        hundrednum K (hundrednum T hundrednum)    -- 999^3=997,002,999 permutations
        hundrednum K (hundrednum T)             -- 999^2=998,001 permutations
        hundrednum K (hundrednum)               -- 999^2=998,001 permutations
        hundrednum K                            -- 999 permutations
        (hundrednum T hundrednum)               -- 999^2=998,001 permutations
        (hundrednum T)                          -- 999 permutations
        (hundrednum)                            -- 999 permutations
        ------------------------------------------ 999,999,999 permutations in total

The hundrednum rule thus represents 999 permutations, as we would hope that a grammar that represents the numbers 1 through 999 would, and the wordnum rule represents 999,999,999 permutations, again as we would expect. While this doesn’t prove that our grammar is correct, it goes a long way in reassuring that we’re on the right track. Writing the actual grammar parser for this language is actually surprisingly trivial now that it’s been reduced so far, and effectively writes itself:

// convert a tokenized word number string into a real number as a uint64_t
class WordnumParser
{
    const char *str;
    int res;

    // eat up the chars given in toeat from str. if unmatched, return false.
    inline bool eat(const char *toeat)
    {
        // only want to increment str if we eat the whole thing
        const char *temp = str;
        while(*toeat)
        {
            if (*temp++ != *toeat++)
                return false;
        }

        // go to end of toeat, we're good.
        str = temp;
        return true;
    }

    // parse for a wordnum. returns 0 if not matched.
    uint64_t num()
    {
        uint64_t n = 0,
            temp = 0;

        if ((n = hundrednum()) == 0) // needs a legit number to start
            return 0;

        if (eat("K"))
        {
            n *= 1000000; // optional K multiplies by 1M and carries

            // optional inner chunk:
            if ((temp = hundrednum()) && eat("T"))
            {
                temp *= 1000; // optional T multiples by 1k and carries
                temp += hundrednum(); // optional inner chunk
            }

            // we add temp (from hundrednum, possibly *1000+hundrednum)
            // regardless of eat("T") since it's optional
            n += temp;
        }
        else if (eat("T"))
        {
            n *= 1000; // optional T multiplies by 1k and carries
            n += hundrednum(); // optional inner chunk
        }

        return n;
    }

    // parse for the actual number. returns 0 if not matched.
    uint64_t hundrednum(bool allowhundreds = true)
    {
        uint64_t n = 0,
            temp = 0;

        if ((n = teens()))
            ; // nothing to do, we just return it
        else if ((n = tens()))
        {
            // handle possibility of ones following the tens (i.e. 21)
            if ((temp = ones()))
                n += temp;
        }
        else if ((n = ones()))
        {
            // we recurse upon ourselves removing the ability to match J
            // when only looking for basicnum, i.e. hundrednum(false)
            // is equivalent to basicnum; allowhundreds == !basicnum
            if (allowhundreds && eat("J"))
            {
                n *= 100; // optional J multiplies by 100 and carries
                n += hundrednum(false); // optional inner (basicnum) chunk
            }
        }

        return n;
    }
    // ones, teens, and tens return the appropriate number based on index
    inline uint64_t ones()
    {
        static const char *ones[] =
            { "M", "X", "U", "I", "G", "O", "N", "C", "L" };
        for(int i = 0; i < 9; i++)
            if (eat(ones[i]))
                return 1 + i;
        return 0;
    }
    inline uint64_t teens()
    {
        static const char *teens[] =
            { "Q", "D", "V", "R", "IP", "E", "OP", "NP", "CB", "LP" };
        for(int i = 0; i < 10; i++)
            if (eat(teens[i]))
                return 10 + i;
        return 0;
    }
    inline uint64_t tens()
    {
        static const char *tens[] =
            { "W", "S", "H", "F", "OY", "NY", "CZ", "LY" };
        for(int i = 0; i < 8; i++)
            if (eat(tens[i]))
                return 10 * (i + 2);
        return 0;
    }
    
public:
    WordnumParser(const char *_str) : str(_str) {}
    ~WordnumParser() {}

    // public interface to the parser
    uint64_t wordNum() { return num(); }
};

// example usage:
int main()
{
    WordnumParser test("MJWUTIJFO");
    printf("MJWUTIJFO == %llu\n", test.wordNum()); // MJWUTIJFO == 123456

    return 0;
}

Going Backwards

Thinking back to our original goals, the real reason for simplifying the grammar is so that we can reverse it by hand. In order to this, we need to determine which tokens can follow each of the 26 tokens. Intutively, this is fairly easy to determine, although it becomes obvious in the process that there are some additional constraints by which our grammar will need to abide. The hand-picked reversed grammar behaves as follows, annotated and divided by context:

    -- the funky 8, with eight-een and eight-y
    C:
        millionnum: K
        thousandnum: T
        hundrednum: Z, J
        basicnum: B
    
    -- standard non-suffixable ones
    M, X, U, G:
        millionnum: K
        thousandnum: T
        hundrednum: J
    
    -- the semi-funky 4, with four-teen but not four-ty (forty)
    I:
        millionnum: K
        thousandnum: T
        hundrednum: J
        basicnum: P

    -- standard suffixable ones (-teen and -ty)
    O, N, L:
        millionnum: K
        thousandnum: T
        hundrednum: Y, J
        basicnum: P
    
    -- teens
    Q, D, V, R, E, B, P:
        millionnum: K
        thousandnum: T

    -- tens
    W, S, H, F, Y, Z:
        millionnum: K
        thousandnum: T
        tenones: M, X, U, I, G, O, N, C, L
    
    -- hundreds
    J:                                      
        millionnum: K
        thousandnum: T
        basicnum:
            ones: M, X, U, I, G, O, N, C, L
            teens: Q, D, V, R, E
            tens: W, S, H, F

    -- a new "hundrednum" chunk
    nil, K, T:
        basicnum:
            ones: M, X, U, I, G, O, N, C, L
            teens: Q, D, V, R, E
            tens: W, S, H, F

As with before, since a word number can end with any token, end-of-word is also valid after anything. To illustrate how this works, here’s an example. The “C” (eight) can be followed by end-of-word, B, J, K, T, or Z, so that means that valid (sorted) words are “eight”, “eighteen”, “eighthundred”, “eightmillion”, “eightthousand”, or “eighty”. Recursing down the list, we get something like:

    C
    CB
    CBK
    CBKC...
    CBT
    CBTC...
    CJ
    CJC...
    ...

This mapping system provides the foundation for walking the set, however there are four additional constraints that are needed in order to determine true validity; without these rules, the generator would go on forever with nonsense words.

  1. Once K (million) is reached, there can be no more K’s.
  2. Once T (thousand) is reached, there can be no more K’s or T’s.
  3. There can only be one J (hundred) per block, which gets reset at nil (i.e. the empty string to start), T (thousand), and K (million).
  4. Ten-ones (as opposed to normal ones) have to treat suffixable ones the same as normal ones (i.e. ones that follow with T, Y, B, or P cannot be followed by those when in ten-one mode).

These constraints prevent “impossible” words, such as “fourthousandonehundredmilliontwohundredfiftysixteenhundredone”, which would otherwise count as words even though they are very clearly invalid. More importantly, though, these constraints allow the traversal to end naturally; token combinations now have a finite limit, and the recursion finishes gracefully without the need for a hard length cut-off.

Implementation

The implementation of the final solution, now that we’ve determined that we can walk the grammar in reverse and can already parse a tokenized number into a real number, is actually rather simple. We are now armed with the power to walk through the entire set without actually generating all of the data, adding up the numbers as we go along. This is just a matter of sticking to the “possible followers” list and implementing the four constraints listed above.

// walk through the entire set of valid wordnums between 1 and 999,999,999
class TraverseSet
{
    // which byte we're looking for
    static const uint64_t ByteN = 51000000000ULL;

    // define our constraint flags
    typedef char SyntaxFlags;
    static const SyntaxFlags
        NothingPassed = 0,
        TensPassed = 8,
        HundredPassed = 16,
        ThousandPassed = 32,
        MillionPassed = 64;

    // the data we're really interested in: total # of bytes,
    // total # of words, and sum of words when totallen < ByteN
    uint64_t totallen, totalwords, totalsum;

    // actually do something with the given (assumed to be) valid word
    inline void gotWord(const FastCatStr *str, int len)
    {
        totallen += len;
        if (totallen <= ByteN)
        {
            WordnumParser test(str->c_str());
            uint64_t wordn = test.wordNum();
            totalsum += wordn;

            // check if this is indeed the word we are looking for
            if (totallen == ByteN)
            {
                printf("Got totallen = %llu @ i = %d,\n"
                        "\tcurrent str = %s => %llu\n",
                    totallen, totalwords, str->c_str(), wordn);
            }
        }

        totalwords++;
    }

public:
    TraverseSet() : totallen(0), totalwords(0), totalsum(0) {}
    ~TraverseSet() {}
    
    // run through the given str (requires an allocated FastCatStr)
    void exec(FastCatStr *str, char last = 0, int len = 0,
        SyntaxFlags flags = NothingPassed)
    {
        // list of "what follows what" determined by hand
        static const char *possiblefollowers[26] = {
            // "billion" doesn't exist, so we'll use 0 for the nil case
            /* nil: */ "CDEFGHILMNOQRSUVWX",
            /* B: */ "KT",
            /* C: */ "BJKTZ",
            /* D: */ "KT",
            /* E: */ "KT",
            /* F: */ "CGIKLMNOTUX",
            /* G: */ "JKT",
            /* H: */ "CGIKLMNOTUX",
            /* I: */ "JKPT",
            /* J: */ "CDEFGHIKLMNOQRSTUVWX",
            /* K: */ "CDEFGHILMNOQRSUVWX",
            /* L: */ "JKPTY",
            /* M: */ "JKT",
            /* N: */ "JKPTY",
            /* O: */ "JKPTY",
            /* P: */ "KT",
            /* Q: */ "KT",
            /* R: */ "KT",
            /* S: */ "CGIKLMNOTUX",
            /* T: */ "CDEFGHILMNOQRSUVWX",
            /* U: */ "JKT",
            /* V: */ "KT",
            /* W: */ "CGIKLMNOTUX",
            /* X: */ "JKT",
            /* Y: */ "CGIKLMNOTUX",
            /* Z: */ "CGIKLMNOTUX"
        };
        static const int lengths[26] = {
            0, 3, 5, 6, 7, 5, 4, 5, 4, 7, 7, 4, 3,
            5, 3, 4, 3, 8, 6, 8, 5, 6, 6, 3, 2, 1
        };

        // go through the possible followers for the last character
        for(const char *followers = possiblefollowers[last];
            *followers != '\0'; followers++)
        {
            // flags needs to reset on each iteration since we pop it back off
            SyntaxFlags newflags = flags;

            // enforce language constraints:
            switch(*followers)
            {
                /* additional constraints:
                    1. skip K if K or T have passed
                        (i.e. only one K in a word, and has to be before T)
                    1. skip T if T has passed (i.e. only one T in a word)
                    3. skip J if J or tens/teens has passed
                        (i.e. only one J per block)
                    4. skip tens/teens if tens/teens have passed
                        (i.e. treat suffixable ones as normal ones in context)

                    "skip" in this sense means "skip as a possible follower".

                    also, we need to clear the J-flag when K and T pass
                    (i.e. set hundreds to be allowed again), and the
                    tens/teens-flag needs to be cleared at K, T, and J.
                 */
            
            case 'K':
                if ((newflags & MillionPassed) == MillionPassed
                    || (newflags & ThousandPassed) == ThousandPassed)
                    continue;

                newflags |= MillionPassed;
                newflags &= ~(HundredPassed | TensPassed);
                break;
            case 'T':
                if ((newflags & ThousandPassed) == ThousandPassed)
                    continue;

                newflags |= ThousandPassed;
                newflags &= ~(HundredPassed | TensPassed);
                break;
            case 'J':
                if ((newflags & HundredPassed) == HundredPassed
                    || (newflags & TensPassed) == TensPassed)
                    continue;

                newflags |= HundredPassed;
                newflags &= ~TensPassed;
                break;
            case 'W': case 'S': case 'H': case 'F': case 'Y': case 'Z':
            case 'Q': case 'D': case 'V': case 'R': case 'E':
            case 'B': case 'P':

                if ((newflags & TensPassed) == TensPassed)
                    continue;

                newflags |= TensPassed;
                break;
            }

            // at this point, *followers is truly a possible follower.
            // so let's tack it on and add the length
            str->push_back(*followers);
            len += lengths[*followers - 'A'];

            // ...and now pass str off to our worker function
            // (not using a callback for performance reasons)
            gotWord(str, len);

            // ...and recursively tack on followers
            exec(str, *followers - 'A', len, newflags);

            // now that we're done with this, take *followers off
            // (and take the length of it off too)
            len -= lengths[*followers - 'A'];
            str->pop_back();
        }
    }

    // public interfaces
    uint64_t getTotalSum() const { return totalsum; }
    uint64_t getTotalWords() const { return totalwords; }
    uint64_t getTotalLen() const { return totallen; }
};

// example usage:
int main()
{
    FastCatStr s;
    s.reserve(1024);
    TraverseSet solver;
    solver.exec(&s);

    printf("Total length = %llu, total words = %llu,\n\ttotalsum = %llu\n",
        solver.getTotalLen(), solver.getTotalWords(), solver.getTotalSum());

    return 0;
}

The exec function is very straight forward. It builds up a valid word number string by pushing and popping each possible follower in order, which is the hand-picked list of what-follows-what that gets filtered down by the four additional constraints discussed above. For each of these, we recursively apply this same function on the new string, continuing down the rabbit hole until there’s no more repetition (which turns out to be 17 levels deep).

The gotWord function simply accepts a word number, which it can assume is completely valid, converts it to a real number, and keeps track of the total length, total sum, and total number of words. When the total length hits our magic number (51 billion), it prints out the result. This may be a little crude, but it gets the job done.

Without further ado, here are the final results:

$ time ./wordnum
Got totallen = 51000000000 @ i = 723302491,
    current str = OJNYOKNJHOTGJNYG => 676746575
Total length = 70305000000, total words = 999999999,
    totalsum = 413540008163475743

real    1m43.757s
user    0m0.000s
sys     0m0.031s

Thanks to the additional information that we tally up, we’re able to confirm at the end that we did indeed traverse 999,999,999 words totaling 70 GB of data. The word number whose final letter resides at the 51 billionth byte is 676,746,575, and the total sum of numbers up to and including that number is 413,540,008,163,475,743. The program runs in a little over 100 seconds, but best of all the memory usage stays at 1.1 MB the entire time, exactly what we sought out to achieve.

Wrapping Up

I thoroughly enjoyed working on this puzzle, and I think it’s a great way to spend a quiet Sunday afternoon and evening. The things I liked most about it are that it’s:

  • Digestible. The description of the puzzle itself is short and sweet, and well described. There are no extra resources needed, it’s platform agnostic, and there are no lingering concerns about ambiguity in the question as presented. This made it easy and inviting to get started.
  • A great length. Often puzzles like these are either “gotchas” with quick solutions or require too much of a time commitment for those of us who are doing these for fun and are otherwise fairly busy. The word numbers problem fits in perfectly at about ¾ of a day – and a solution could likely be reached in significantly less time given additional resources and for those with exposure to similar problems in the past.
  • Rewarding. For me, the greatest part about the puzzle (or at least, my solution to it) is that each chunk only takes a short period of time to implement. The wordNum function, the tokenizer, the grammar, the parser, the reverse grammar, the traversal function, and a few other dead-end approaches I took; none of these individual chunks take more than half an hour to actually implement, leaving more time to think about the problem itself. All of the code in the puzzle is highly intuitive.
  • Discussable. Armed with the quasi-BNF grammar, I would imagine most programmers could convert it to the actual parser on a whiteboard or in a code-sharing application in a reasonable amount of time (say, 20 minutes). The same goes for the traversal routine. This makes the problem very practical to work through with others and discuss as you’re never debugging the code itself, but rather the logic and approach taken.
  • Translatable. The fundamentals used here translate surprisingly well to every day programming for a puzzle of this nature. Parsing data and thinking about data as code, and thus within the constraint of a grammar, is a great way to exercise your mind. Thinking in tokens is just like reading numbers in different bases. Proper use of data structures and traversal algorithms are basically the foundation of well written code. And of course, breaking down a problem and actually solving it within the limitations of the system and in a reasonable amount of time is exactly what programmers do.

I feel as though I could shave some time off the clock with further optimization, but I think that significantly speeding up the program would require a different tack. One thing that’s nagging at me is the fact that there’s so much repetition – the same thousand numbers are tacked onto the right of each of a million numbers on the left, and the same million numbers are tacked onto the right of each of the thousand numbers on the left. I think an alternate approach could focus on the repetition and similarity of the numbers to optimize out some of the cycles.

It’s interesting nonetheless how problems that seemed difficult to manage a few years ago are becoming exceedingly possible with hardware advancements. Running in under two minutes on my laptop with fairly straight forward C++ code is very cool. It’s worth noting as well that if we only look for the number itself, not the sum of numbers before it, the program runs in under 30 seconds.

All in all, this is a great puzzle to pass the time on a Sunday afternoon. It’s intriguing, exciting, and practical; a loud shout-out goes to the engineering team at ITA for consistently coming up with awesome puzzles like this one.

If anyone has any other ideas of ways to optimize or solve the word numbers problem, please share in the comments!

On a related note, I’m looking to hire software developers for BuySellAds.com. If you’re interested, let’s talk.

15 Responses to Solving ITA’s Word Numbers Puzzle
  1. Bart J

    This is beyond awesome. Thank you very much for taking the time to write this.

    • Nathan

      Thank you for the kind words! I’m glad you enjoyed it. :)

      • Cheyanne

        Wow I must confess you make some very trenchant pontis.

  2. Bart J

    An alternate approach using a B-Tree:

    http://deepaksurti.com/wnums.html

  3. Ben

    Hey,

    Thanks for posting this. You’ve taken an intriguing approach, which I never would have thought of.

    I just subdivided the problem by computing the strings from 1-999999 once, and then using this value to rapidly skip over everything of the form (eg.) “threehundredsevenmillion…..” until I passed the limit. Then I stepped down into that particular range, and checked each number. Takes about 20s on a fairly old machine.

    Ugly code here

  4. Sean Estabrooks

    Nice post with lots of interesting ideas. I took a more basic approach but it does solve the problem in about 1 second. https://github.com/loops/num2words

    • Nathan

      Cool, thanks for sharing!

    • Jase

      I thank you humbly for sharing your wsiodm JJWY

  5. Carrieann

    Deadly acucrtae answer. You’ve hit the bullseye!

  6. Sol Stueber

    Great site! I am loving it!! Will come back again. I am taking your feeds also

  7. Madeleine Voetsch

    Awesome post! I will keep an on eye on your blog.

  8. Linda

    Cool blog!

  9. Chuck Jessel

    Simply wanna say that this is handy , Thanks for taking your time to write this.

  10. Jeffrey Smith

    Hi and thanks for the advice, and the website really looks excellent. What wp theme are you utilizing?

  11. Perla Lagomarsino

    WONDERFUL Post.thanks for share..more wait ..