Wednesday 23 September 2009

Chars, words and sorted lists


Pictures today are autumn things in the garden. First is some pink Sedum, looking much like pink cauliflower (you can get pink varieties). The second is bright orange Pyracanthus berries before the blackbirds scoff them all.

I've recently been working on a little task which is to run on the console. Not a website at all, much simpler. I found myself starting up my Visual Studio and setting up a project for the first time. Even, rather later, gathering my nice new classes into one place and making them into a library. Hot stuff! I should certainly create libraries more often.

The task was to create an instance of a character reader and assign it a chunk of text as a string. This can then be accessed, one char at a time. In a friendly way it lets you know you have got to the end by throwing an EndOfStreamException. The output wanted was a  list of words in the string sorted by number of occurrences and then alphabetically. After that it got more complicated but that will do for the next time.

I got my text from the first 3 paras of a classic short story "The Star" by Arthur C Clarke. Thank you Wikipedia.

I decided to create a class wrapper for a list of words called WordList. Also a WordListBuilder to process each character as it came and wrap things up at the end. This character reader class is called RealTimeCharReader, there will be another type later.
So:

    ICharReader myCharReader = new RealTimeCharReader();
    assignReaderContent(myCharReader);
    WordListBuilder myWordListBuilder = new WordListBuilder(myCharReader, 
         new WordList());

This means I want an instance of ICharReader, which implements IDispose, so has to have a Dispose() method, to make sure it cleans up. It also has to have a method fetchNextChar() which delivers a character, moves the pointer along and throws an exception when it is finished.

assignReaderContent is a static function allowing me to pass in whatever string I feel like in the Main() function.

WordListBuilder is going to have to take each char and either add it to a new word, or finish the word, eat any nonletter characters and start a new word. Each word is then added to the WordList. A Word is another new class containing a string and a counter. If a word turns up more than once, you can just increment the counter. Both the char reader and the wordlist are passed to the WordListBuilder to do the real work. Here is the call:
myWordListBuilder.buildWordList();
myWordListBuilder.BuilderWordList.makeReport();
buildWordList will process the list, makeReport will sort it and print the results to the console.

buildWordList consists of a try/catch structure containing a forever while loop which takes each returned char and decides what to do with it. The catch intercepts the EndOfStreamException and exits the while loop. Any other runtime exception in the reader gets caught by a more general catch. Either way the finally clause means Dispose is always invoked.

    public void buildWordList()
    {
        char nextChar;
        string nextWord = string.Empty;
                
        try{
            while (true){            
                nextChar = BuilderReader.fetchNextChar();
                if(char.IsLetter(nextChar)){
                    nextWord += nextChar;
                }
                else{
                   // treat all puntuation, digits and whitespace the same
                    if(nextWord.Length > 0){
                        completeWord(nextWord);
                        nextWord = string.Empty;                            
                    }
                }
            }
         }
             
         catch(EndOfStreamException ex){
             // add last word - even if it is partial - discuss[EOS]ion point?
            completeWord(nextWord);
            Console.WriteLine(ex.Message);
         }
         catch(Exception ex){
             Console.WriteLine(ex.Message);
         }
         finally{
             BuilderReader.Dispose();
         }
    }

    private void completeWord(string s){
        if(s.Length > 0){
            Word w = new Word(s);
            BuilderWordList.addNewWord(w);
         }
     }

The local variable nextWord is just a string, set to empty string. Since you have to add all words, including the last one - even if it has only partially arrived before transmission ceased, I added a short method completeWord, to be called whenever a non-letter is read or an EndOfStreamException is hit. This calls the WordList method addNewWord if there is anything in the string.


    public void addNewWord(Word w)
    {  
        // check word exists
        bool foundMatch = false;
        foreach (Word listWord in MyList){
             // if yes, then just increment count
             if (w.Content == listWord.Content){
                  foundMatch = true;
                  listWord.incrementCount();
             }
        }
        if (!foundMatch){
             MyList.Add(w);
        }
    }

Again, a simple method (as I prefer it!). It simply iterated through the list of Word objects in the member variable myList. If it finds a match it sets the boolean foundMatch to true and adds one to the count (or I could have used Count++). If it doesn't, then a new word is added to the list.

So now we have a list which contains a max of one entry for each word but it is not in order. I needed to create a class derived from IComparer and pass it to the sort function for the list of words. An IComparer clas needs to implement the function Compare. In this case we have to compare counts and then alphabetical order. It can be done in one pass, which I think makes life a lot easier!

    public class WordComparer : IComparer<word>
    {
        public int Compare(Word x, Word y)
        {
            if (x.Count > y.Count){
                return -1;
            }
            else if (x.Count < y.Count){
                return 1;
            }
            else{
            // they're the same word count, so sort alphabetically
                return string.Compare(x.Content, y.Content);
            }
        }
    }
The only unexpected part of this was that I assigned 1 as the result if x.Count was the greater. This gave me a reverse sorted list! So I swapped the values round. Notice for the alphabetical order I just use string.Compare, no further specification needed.

Nearly there! The sort function in WordList is very short:
private void sortList()
        {
            WordComparer wc = new WordComparer();
            MyList.Sort(wc);
        }
Passing an instance of IComparer enables the list to sort itself.

The last function simply iterates through the list and writes each word to the console, together with its count.

What's this all good for? Well, you might be monitoring someone's Tweets for rude words or buzzwords to display a Tag_cloud. The next time we'll consider monitoring several streams (Tweeters?) at once with possible delays in transmission.

By the way, I even managed to put all my classes in a library and make the small project refer to it. I'm quite proud.

I don't have any idea how to place a zip file here. I doubt if it would be that interesting at this stage but I ought to find out.

Next Episode

No comments:

Post a Comment