Generation of synthetic text using word transition analysis

This is a simple investigation to see if the work of great poets and novelists can be recreated by statistical analysis of word-pairs used. Of course, the answer is clearly 'no', but it'll be fun to find out what happens.

Word transition analysis involves taking each word in the training text and building a normalised histogram of all the words that follow it. For instance, in the sentence:

One fish two fish red fish blue fish

The words 'one', 'two', 'red' and 'blue' are followed by the word 'fish' 100% of the time, while the word 'fish' is followed by 'two', 'red', 'blue' and 'end of sentence' 25% of the time each. This can be represented by a state transition diagram:



Now all we have to do to regenerate some text with these statistics is to take a random path through the states according to the state transition probabilities. Some examples of possible sentences include:

One fish blue fish: 100% x 100% x 25% x 100% x 25% = 6.25% chance
One fish: 100% x 100% x 25% = 25% chance
One fish red fish two fish blue fish red fish: 100% x 100% x 25% x 100% x 25% x 100% x 25% x 100% x 25% x 100% x 25% = about 0.1% chance
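
The random walk and the probability sums above can be sketched in a few lines of C++. This is only an illustration, not the code from the downloadable program: the names Chain, fishChain, generate and pathProbability are made up for this example, as are the "[start]" and "[end]" marker strings and the maxWords cap.

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types: each word maps to its possible next words and their probabilities.
using Row = std::vector<std::pair<std::string, double>>;
using Chain = std::map<std::string, Row>;

// The 'One fish two fish red fish blue fish' transitions, typed in by hand.
Chain fishChain() {
    return {
        {"[start]", {{"One", 1.0}}},
        {"One", {{"fish", 1.0}}},
        {"fish", {{"two", 0.25}, {"red", 0.25}, {"blue", 0.25}, {"[end]", 0.25}}},
        {"two", {{"fish", 1.0}}},
        {"red", {{"fish", 1.0}}},
        {"blue", {{"fish", 1.0}}},
    };
}

// Take a random path from "[start]" until "[end]" is drawn or maxWords is reached.
std::vector<std::string> generate(const Chain& chain, std::mt19937& rng, std::size_t maxWords) {
    std::vector<std::string> sentence;
    std::string current = "[start]";
    while (sentence.size() < maxWords) {
        const Row& row = chain.at(current);
        std::vector<double> weights;
        for (const auto& candidate : row) weights.push_back(candidate.second);
        std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
        current = row[pick(rng)].first;
        if (current == "[end]") break;
        sentence.push_back(current);
    }
    return sentence;
}

// Multiply the transition probabilities along a given sentence, including the
// final step into "[end]". A transition that never occurred contributes 0.
double pathProbability(const Chain& chain, const std::vector<std::string>& sentence) {
    double p = 1.0;
    std::string current = "[start]";
    auto step = [&](const std::string& next) {
        double q = 0.0;
        auto it = chain.find(current);
        if (it != chain.end())
            for (const auto& candidate : it->second)
                if (candidate.first == next) q = candidate.second;
        p *= q;
        current = next;
    };
    for (const auto& word : sentence) step(word);
    step("[end]");
    return p;
}
```

With this chain, pathProbability gives 0.25 for 'One fish' and 0.0625 for 'One fish blue fish', matching the sums above. Calling generate with a seeded std::mt19937 produces one random sentence per call.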

These state transitions can be represented as a matrix:

Current word \ Next word    One    fish   two    red    blue   end
start                       1.0    0.0    0.0    0.0    0.0    0.0
One                         0.0    1.0    0.0    0.0    0.0    0.0
fish                        0.0    0.0    0.25   0.25   0.25   0.25
two                         0.0    1.0    0.0    0.0    0.0    0.0
red                         0.0    1.0    0.0    0.0    0.0    0.0
blue                        0.0    1.0    0.0    0.0    0.0    0.0
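
As a rough sketch of how a table like this could be built from text, the C++ below counts each (current word, next word) pair and then normalises every row so that it sums to 1. It is not taken from the downloadable program: TransitionTable, buildTable and prob are hypothetical names, the "[start]" and "[end]" markers are an assumption, and words are split on whitespace and compared case-sensitively.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for a dense transition table.
struct TransitionTable {
    std::vector<std::string> words;            // index -> word; 0 is "[start]", 1 is "[end]"
    std::map<std::string, std::size_t> index;  // word -> row/column index
    std::vector<std::vector<double>> p;        // p[current][next]
};

TransitionTable buildTable(const std::string& text) {
    TransitionTable t;
    // Look up a word's index, assigning the next free one if the word is new.
    auto idOf = [&t](const std::string& w) -> std::size_t {
        auto it = t.index.find(w);
        if (it != t.index.end()) return it->second;
        std::size_t id = t.words.size();
        t.index.emplace(w, id);
        t.words.push_back(w);
        return id;
    };
    const std::size_t start = idOf("[start]");
    const std::size_t end = idOf("[end]");

    // First pass: turn the text into a sequence of state indices.
    std::vector<std::size_t> seq{start};
    std::istringstream in(text);
    std::string w;
    while (in >> w) seq.push_back(idOf(w));
    seq.push_back(end);

    // Second pass: count every (current, next) pair in a square 2D array.
    const std::size_t n = t.words.size();
    t.p.assign(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i + 1 < seq.size(); ++i) t.p[seq[i]][seq[i + 1]] += 1.0;

    // Normalise each row; the "[end]" row has no successors and stays all zero.
    for (auto& row : t.p) {
        double total = 0.0;
        for (double c : row) total += c;
        if (total > 0.0)
            for (double& c : row) c /= total;
    }
    return t;
}

// Probability that 'next' follows 'current'. Throws std::out_of_range for unseen words.
double prob(const TransitionTable& t, const std::string& current, const std::string& next) {
    return t.p[t.index.at(current)][t.index.at(next)];
}
```

Calling buildTable("One fish two fish red fish blue fish") reproduces the figures in the table: prob(t, "fish", "two") is 0.25 and prob(t, "One", "fish") is 1.0. The sketch also keeps an extra all-zero row and column for the markers, which the table above leaves out.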


Larger samples of text can be analysed in a similar way, resulting in a larger state transition diagram. If multiple text files are used, then there may be more than one possible word after the 'start' state and before the 'end' state.

A program to analyse text using this method was written. The state transitions were stored as a 2D array, as in the example table above. This isn't a very efficient way to represent a sparse matrix, but it's quick to implement and simple to debug. As might be expected, most of the resulting text is grammatically incorrect. This is because words such as 'a', 'and' and 'the' are common to most sentences, so they cause the direction of a generated sentence to change dramatically mid-sentence. This can be controlled by limiting the training text to a few paragraphs. However, with too little training data the system simply reproduces the original text almost verbatim.
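
For comparison, here is a minimal sketch of a sparser alternative to the 2D array: a map of maps that only stores the word pairs that actually occur. This is not how the program described here works, just one possible approach. SparseCounts, addText, probability and storedCells are hypothetical names, and the "[start]" and "[end]" markers are an assumption.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical sparse table: counts[current][next] = number of times the pair was seen.
using SparseCounts = std::map<std::string, std::map<std::string, unsigned>>;

// Add one training text. Calling this several times merges several texts,
// so "[start]" and "[end]" can end up with more than one neighbour.
void addText(SparseCounts& counts, const std::string& text) {
    std::istringstream in(text);
    std::string current = "[start]";
    std::string next;
    while (in >> next) {
        ++counts[current][next];
        current = next;
    }
    ++counts[current]["[end]"];
}

// Normalise on demand: count of (current, next) divided by the row total.
double probability(const SparseCounts& counts, const std::string& current, const std::string& next) {
    auto row = counts.find(current);
    if (row == counts.end()) return 0.0;
    unsigned total = 0;
    for (const auto& entry : row->second) total += entry.second;
    auto cell = row->second.find(next);
    if (cell == row->second.end()) return 0.0;
    return static_cast<double>(cell->second) / total;
}

// Number of (current, next) pairs actually stored.
std::size_t storedCells(const SparseCounts& counts) {
    std::size_t cells = 0;
    for (const auto& row : counts) cells += row.second.size();
    return cells;
}
```

For 'One fish two fish red fish blue fish' this stores 9 pairs, where the 6 x 6 table above has 36 cells. Keeping raw counts rather than probabilities means more text can be added later without renormalising.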

When a good balance is found then convincing sentences can be created, and very occasionally they provide more humour and insight than the original. This is particularly true when analysing political speeches. I guess some people lend themselves more readily to being modelled by a simple state machine.

You can download an example C++ program with a KDE interface here. You should be able to get the core analyser class (in BuildStatistics.cpp) to compile on other platforms without any trouble, maybe.

EmailGenerator-0.1.tar.gz

Why is it called 'EmailGenerator'? Because it was originally created to generate comedy emails in the style of one of my favourite helpdesk customers.

Here are some examples of sentences generated using statistics from various sources:

William Wordsworth 'I Wandered Lonely As A Cloud' remix
I at once I at once I saw I lie
In vacant or in the bliss of golden daffodils;
Beside the milky way,
They stretched in glee;
A poet could not be but they
Out-did the daffodils.
H.G. Wells 'War of the Worlds' (1st paragraph)
No one gave a thought of human danger, or thought to dismiss the idea of water. With infinite complacency men fancied there might scrutinise the older worlds of human danger, or thought of space as his own; that this globe about their little affairs, serene in their little affairs, serene in the idea of space as men busied themselves about their empire over matter. It is curious to the idea of the same. No one would have believed in the great disillusionment.
UK Chancellor Alistair Darling's 2008 Mansion House speech
Indeed, the close co-operation between us have made London the City of foreign exchange trading and to keep it that way. That is one of our country. Our approach to be based on our country faces and welcoming investment from overseas including the UK and sectors, such as law and gentlemen. It must also bringing together senior industry players to the talents, drive and the world to set up in partnership, we must do everything to look at home and sectors, such as law and sectors, such as this evening.
The analyser source code. Generated source code rarely compiles :-)
// read so we are words too
int CBuildStatistics::BuildWordListFromFile(char *filename)
{
    fstream file;
    char word_buffer[80];
    map::const_iterator row,col;

    file.open (filename, fstream::in);

    if(file.bad())
        output += buff;
        return(output);
        characters++;
    {
    // add the CBuildStatistics class.
// //////////////////////////////////////////////////////////////////////