featureenvy

The blog of a thinkerer.
By @featureenvy

All rights reserved.

What Others Know About Compression That You Don't

Ah, compression formats. Always a mystery. Always something we just use, but never understand. In a way this makes sense, considering Wikipedia alone lists 15 different compression formats. There are the well-known bz2, gz and LZMA algorithms. And the less well-known rz, ?Q? and ??_. Now that last one makes me happy this isn't a podcast... For this post, though, we will concentrate on Deflate. Deflate is not only used in gzip and zip, but is also a fan favorite for compressing web assets before sending them to the browser.

The Basic Building Blocks

Underneath, Deflate uses LZ77 combined with Huffman encoding. LZ77 detects duplicate strings and inserts so-called back references (more on this later). Huffman encoding is similar to ASCII, but instead of always using 8 bits per character, it stores the most used characters in very few bits (sometimes just one or two) and the least used ones with, well, more bits. More on this also later.

Come On, Deflate Already!

First, Deflate chops up all parts into blocks to make processing easier. Then it decides in which mode it should encode that block:
  1. Store raw stream
  2. Store with a static Huffman encoding
  3. Store with a dynamic Huffman Encoding
Now, it still doesn't deflate. But let's see.

First mode is raw mode. In raw mode, Deflate decides that the data is already compressed and it can't do a better job. So it just inserts its header to say "I can't do anything with that, this is already compressed, dummy!". That's why, the next time a coworker shouts at you for not zipping those files, you can shout back: "D'oh, mode 1, man!"

Second mode is static Huffman mode. This just means that Deflate decides the data is standard enough that the predefined Huffman encoding table should be used. This is useful for English text, for example. This way Deflate doesn't have to add a table as overhead to the compressed content.

Third is dynamic Huffman mode. With dynamic Huffman mode, the content gets compressed with a Huffman encoding, but the encoding table is generated based on the input and shipped along with it. So for example, in Finnish, the shortest code would probably be "ö", or something like that. In @zedshaw's recent tweets it could be "C", "F", "K" and "U", who knows! But that leaves us with two questions: Huffman encoding, wut? And LZ77?
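You can actually watch mode 1 kick in from Ruby's standard library, since Zlib wraps Deflate (this is my little experiment, not from any spec):

```ruby
require 'zlib'

text = "the quick brown fox jumps over the lazy dog " * 100

once  = Zlib::Deflate.deflate(text)
twice = Zlib::Deflate.deflate(once)

puts "original:    #{text.bytesize} bytes"
puts "deflated:    #{once.bytesize} bytes"   # repetitive text shrinks a lot
puts "re-deflated: #{twice.bytesize} bytes"  # already compressed: no gain, just overhead
```

Compressing the already-compressed output doesn't shrink it any further, because Deflate can't find anything to do and falls back to (roughly) storing it.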

Huffman Encoding

Huffman encoding is something that a guy called Huffman developed while getting a Ph.D. from MIT. Nothing surprising until now! The paper on it is actually from 1952. Damn it! That's double my age! That paper should feel old! What it basically says: tally up all individual characters. The most used one gets the shortest code. Then continue down until you arrive at the end. Actually, Wikipedia does a better job at that than I ever could, so here's a picture: Huffman Coding Tree (Source: http://en.wikipedia.org/wiki/File:Huffman_tree_2.svg) So that one sounded scary, turned out to be really easy!
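The tally-and-merge idea fits in a few lines of Ruby. This is a toy version of my own, not Deflate's canonical code-length algorithm: count characters, repeatedly merge the two lightest nodes into one, then read the codes off the tree.

```ruby
def huffman_codes(text)
  freq  = text.chars.tally
  nodes = freq.map { |ch, weight| [weight, ch, nil, nil] }

  # merge the two lightest nodes until only the root remains
  while nodes.size > 1
    nodes.sort_by!(&:first)
    a = nodes.shift
    b = nodes.shift
    nodes << [a[0] + b[0], nil, a, b]
  end

  # walk the tree: left edge appends "0", right edge appends "1"
  codes = {}
  walk = lambda do |(_, symbol, left, right), prefix|
    if symbol
      codes[symbol] = prefix.empty? ? "0" : prefix
    else
      walk.call(left,  prefix + "0")
      walk.call(right, prefix + "1")
    end
  end
  walk.call(nodes.first, "")
  codes
end

codes = huffman_codes("this is an example of a huffman tree")
# the space is the most frequent character here, so it gets a shortest code
```

By construction no code is a prefix of another, which is what lets the decoder read the bit stream without any separators.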

So What About That LZ77

The next part in this tour around compression algorithms leads us right back into history. Can't we get something new going in the compression world? LZ77 was introduced in a paper in, well, 1977, by Lempel and Ziv. LZ77 is a so-called sliding window compressor. At least they came up with impressive names back then. Why sliding window? It checks the last 2, 4 or 32 kB of input for repeating strings. Simple as that. Should it find such a repeating string, instead of adding that string to the output stream again, it puts in a back reference that basically says: if you want to finish reading this sentence, go back to page 32, line 17, and read from character 17 to character 34. That lazy bastard!
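A toy sliding-window matcher makes the idea concrete. This sketch is mine (real Deflate uses hash chains, a 32 kB window and clever match formats, none of which appear here): scan the window for the longest earlier match and emit a back reference, otherwise emit the literal character.

```ruby
WINDOW = 32  # toy window; real Deflate goes up to 32 kB

def lz77_compress(data)
  out = []
  i = 0
  while i < data.size
    start = [0, i - WINDOW].max
    best_len  = 0
    best_dist = 0
    # look for the longest match starting anywhere in the window
    (start...i).each do |j|
      len = 0
      len += 1 while i + len < data.size && data[j + len] == data[i + len] && len < 255
      if len > best_len
        best_len  = len
        best_dist = i - j
      end
    end
    if best_len >= 3  # short matches aren't worth a back reference
      out << [best_dist, best_len]
      i += best_len
    else
      out << data[i]
      i += 1
    end
  end
  out
end

def lz77_decompress(tokens)
  out = +""
  tokens.each do |t|
    if t.is_a?(Array)
      dist, len = t
      len.times { out << out[-dist] }  # copy char by char so overlaps work
    else
      out << t
    end
  end
  out
end
```

Note the character-by-character copy in the decompressor: it lets a back reference overlap its own output, which is how "blah " repeated four times can be one single reference.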

Now I See How They Tie Together

Well, I hope you do! Else I failed and you should shame me in the comments. :) Or of course I can try again: first, Deflate splits up the document into blocks to make working with it easier. Then (let's assume the input isn't already compressed) it runs each block through LZ77 to get all the obvious duplicate strings out of the way. Then it sends the result through a Huffman encoding to make it even shorter. So yes, that's Deflate.

What About The Others (BZIP2 Anyone?)

Not all of them use this simple approach, because simple apparently isn't good enough. BZIP2 goes through 9 different stages.
  1. Replacing all runs of 4 or more repeating characters with just 4 characters plus a count. (There you see, a geek had to make this algorithm for comic books! "AAAAAAAAAA!" is so much longer than "AAAA\6!") So-called Comic Book Encoding... Unfortunately not, it's simply called Run-Length Encoding. How boring!
  2. Next they wheel the characters around, which is basically a fancy way of sorting equal letters next to each other so compression is easier. The so-called Burrows-Wheeler Transform.
  3. Replace repeating characters with 0, so we have as many 0s as possible. Could also be any other character. The Move-to-Front Transform (yes, I hate those people at concerts too).
  4. And again, after sorting everything possible, we do a Run-length Encoding.
  5. Then a Huffman Encoding. Don't tell me now I have to explain what that is!
  6. Then again we see Huffmaaa\7n! This time with multiple tables per block to make the output even shorter.
  7. 8. and 9. Add some binary trickery.
So yes, you can use a lot of other things, but it all comes back to making the output even shorter. Surprising, huh?
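Stage 1 is simple enough to sketch. One simplification of mine: the count below is a decimal digit rather than the length byte real bzip2 stores, so this toy breaks for runs of 14 or more.

```ruby
def rle_encode(str)
  str.chars.chunk_while { |a, b| a == b }.map { |run|
    run.size >= 4 ? run[0] * 4 + (run.size - 4).to_s : run.join
  }.join
end

rle_encode("AAAAAAAAAA!")  # => "AAAA6!"
```

Ten As become four As plus "6 more", and everything shorter than four characters passes through untouched.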

So That's Compression

Yes, that's compression. Use a bit of special encoding, remove duplicates, and you are halfway there. Of course, this is just Deflate. We saw that BZIP2 is already a lot more complicated and uses a lot of different tricks to make the output even smaller, but then again, it is also slower! If you liked this post, you should follow me on Twitter and sign up for the RSS feed!

How Does An ORM Work?

One of the things all of us use regularly, and (most?) of us have no idea how it actually works, is an Object Relational Mapper, or ORM. I think it is crazy that a piece we rely on and couldn't work without anymore is so foreign to most of us. So I decided to change that. By the end of this wordy piece I want you to know how an ORM works. Of course you won't know all the details, like how some ORMs seem to be able to do more with the database than you even knew was possible. But the basics should be cleared up by then.

On DataMapper

For this, I decided to deep dive into DataMapper. DataMapper is an alternative to ActiveRecord, but with less magic, which is great if you are trying to read something. A basic DM model looks something like this:
class Duck
  include DataMapper::Resource

  property :id, Serial
  property :name, String
  property :hungry, Boolean
end
DataMapper::Resource is the base include that enables all the magic DataMapper functionality, like in ActiveRecord, where you inherit from ActiveRecord::Base. The property method is defined in DataMapper::Model::Property. This method does a lot of stuff! In fact, it spans from line 27 to line 99! First up, it determines the property class based on the type we provided in the class definition:
klass = DataMapper::Property.determine_class(type)
So if we define property :name, String, it will determine that we want a String! Mind bending, isn't it? Afterwards, it will instantiate that class (in our case DataMapper::Property::String). Then we get the properties already defined in the given repository (basically a "different" DB connection) and add our property to them. Easy. Now we have some filler code until the end, with the exception of these two babies:
create_reader_for(property)
create_writer_for(property)
This (shockingly!) creates the readers and writers for the property we defined. So what we know by now is that DataMapper keeps a list of all properties defined for a given class, so it can query, search, create, etc. based on these values.
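A rough sketch of what property boils down to (hypothetical code of mine, nowhere near DataMapper's 70-line reality): remember the property in a class-level list, then define the accessors.

```ruby
class Model
  def self.property(name, type)
    @properties ||= {}
    @properties[name] = type

    define_method(name) { instance_variable_get(:"@#{name}") }
    define_method(:"#{name}=") { |value| instance_variable_set(:"@#{name}", value) }
  end

  def self.properties
    @properties || {}
  end
end

class Pond < Model
  property :name, String
end

duck_pond = Pond.new
duck_pond.name = "Walden"
duck_pond.name        # => "Walden"
Pond.properties.keys  # => [:name]
```

The stored list is the important part: it is exactly what lets the ORM build queries and map rows back to attributes later.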

Querying DataMapper

So we can create a model, but that is not all an ORM does. With an ORM we can of course query for data, and that is what makes it really interesting! So let's get to it! Querying is done in DataMapper::Model. Let's take an example query:
Duck.all
We find #all on lines 341-348:
def all(query = Undefined)
  if query.equal?(Undefined) || (query.kind_of?(Hash) && query.empty?)
    # TODO: after adding Enumerable methods to Model, try to return self here
    new_collection(self.query.dup)
  else
    new_collection(scoped_query(query))
  end
end
So we are generating a new collection with the given query. Cool. Model#new_collection makes a new collection (line 749 for the lazy ones). D'uh! So what happens when we generate a new collection? This:
    def initialize(query, resources = nil)
      raise "#{self.class}#new with a block is deprecated" if block_given?

      @query        = query
      @identity_map = IdentityMap.new
      @removed      = Set.new

      super()

      # TODO: change LazyArray to not use a load proc at all
      remove_instance_variable(:@load_with_proc)

      set(resources) if resources
    end
(On GitHub) You might now realize: nothing happens! No database connection is made, nothing! In fact, DataMapper doesn't perform the database call until really, really necessary! It just returns a "placeholder" collection, if you will, that has the information needed to actually perform the call later. One way to trigger that call would be #first:
    def first(*args)
      first_arg = args.first
      last_arg  = args.last

      limit_specified = first_arg.kind_of?(Integer)
      with_query      = (last_arg.kind_of?(Hash) && !last_arg.empty?) || last_arg.kind_of?(Query)

      limit = limit_specified ? first_arg : 1
      query = with_query      ? last_arg  : {}

      query = self.query.slice(0, limit).update(query)

      if limit_specified
        all(query)
      else
        query.repository.read(query).first
      end
    end
Quite a bit of code here, but luckily we can ignore most of it. For example, the first few lines only apply if we actually pass in some options (which we won't do for now). The really interesting line is query.repository.read(query).first, where we call repository#read. So let's see this in action.
      def read(query)
        fields = query.fields
        types  = fields.map { |property| property.dump_class }

        statement, bind_values = select_statement(query)

        records = []

        with_connection do |connection|
          command = connection.create_command(statement)
          command.set_types(types)

          # Handle different splat semantics for nil on 1.8 and 1.9
          reader = if bind_values
            command.execute_reader(*bind_values)
          else
            command.execute_reader
          end

          begin
            while reader.next!
              records << Hash[ fields.zip(reader.values) ]
            end
          ensure
            reader.close
          end
        end

        records
      end
This is a lot simpler than it looks. It takes the query, constructs a SELECT statement and then executes it. Simple as that! In the end, it maps the values it got back from the database into "real" objects and returns the records to us.
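The whole lazy dance can be boiled down to a sketch. These classes are hypothetical stand-ins of mine, not DataMapper's real ones: building the collection just stores the query, and only a "kicker" like #first makes the adapter hit the database.

```ruby
class LazyCollection
  def initialize(query, adapter)
    @query   = query
    @adapter = adapter
    @records = nil
  end

  def first
    load_records.first
  end

  private

  # the only place the adapter is ever asked to do real work
  def load_records
    @records ||= @adapter.read(@query)
  end
end

# a fake adapter that counts how often it is asked to read
class CountingAdapter
  attr_reader :reads

  def initialize(rows)
    @rows  = rows
    @reads = 0
  end

  def read(_query)
    @reads += 1
    @rows
  end
end

adapter = CountingAdapter.new([{ name: "Donald" }])
ducks   = LazyCollection.new({}, adapter)
adapter.reads  # => 0, no database work yet
ducks.first    # => { name: "Donald" }
adapter.reads  # => 1
```

Memoizing in load_records also mirrors why calling #first twice on a loaded collection doesn't hit the database twice.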

So That's It

Let's review all that. An ORM needs a structure that tells it how the database tables should be mapped to objects. DataMapper does this with the DataMapper::Model#property method, whereas ActiveRecord reads this data from the schema. In the end, both automatically generate setters and getters for the properties. Then we have a layer that stores the queries, the results, etc. and constructs the right, database-agnostic data so the query can be fulfilled. And the third layer, which I only glossed over, is the database adapter. (Do you want to know more about the database adapters themselves? Then contact me on Twitter or in the comments and tell me about it!) The database adapter knows how the query should be run in the database and knows how to transform the results back into objects. And that's the way we query a database! If you liked this post, you should follow me on Twitter and sign up for the RSS feed!

Why Having A Vision Is More Important Than Anything Else

Do you know why we advance more slowly than we did some decades ago? Do you know that we went from the planning stage to the moon in 11 years? That is, if you count from the creation of NASA to the time we actually landed on the moon. But I think the most important part was the speech that President Kennedy gave*. He said:
First, I believe that this nation should commit itself to achieving the goal, before this decade is out, of landing a man on the Moon and returning him safely to the Earth
Which is, as someone from that time would have put it: "Thou be insane?". Or something like that. Now we know better, of course, because the US had this vision. So how will having a vision help you? * Actually, the most important thing was probably the Germans. After World War II, both the Soviet Union and the United States took rocket engineers from Germany to kick-start their endeavors. Yes, the famous rockets we all saw flying around were based on Nazi technology. So yeah…

So, A Vision, Any Vision?

Of course, not any vision will do. For example, "I will be one of the people in my circle that some people will know" isn't a good vision. Neither is "I will be the biggest site on the internet in 3 months". The first one isn't really measurable and, honestly, is lame. And the second one isn't really achievable. So what makes a good vision? A good vision is measurable. For example, "I will have 2000 subscribers by the end of the year" is simple to understand, measurable, and we always know where we stand. But it isn't a vision yet, it is just a goal. A vision should be something that can never quite be reached, but motivates us to keep working. A vision is something we can strive for, something that is measurable, and something that is barely reachable. So, a good vision could be:
We will become the number one payment processor in the world with customers that just love us!
Ok, the love part is a bit hard to measure, but it is a good guideline. And I'm sure we could find some grad students who would love to study, well, love. Should we give this customer a refund? Without a vision, we don't know, and we may give one to some customers and not to others, which will confuse the customers even more. But with that vision? The action is clear!

A Good Vision Is A Good Guide

A vision isn't just for suits, or for people who want to sell something to suits (read: a company). No matter what your project is, I will bet that having a vision for it will help you immensely. For one, we won't have to debate whether we should show an error message, or whether it would annoy users more if we actually told them that they are stupid. If the vision is to build the most robust, secure system, then yes, we should show a general error message. If the software should be the most usable piece of software ever written, then we have to redesign it so that this error can't possibly happen. If we write a command line utility for Linux, we don't even have to care about that. :) A good vision answers probably 90% or more of all the small questions we have while developing a new project.

Sit Around, It's Story Time!

Just recently, for example, I started on a new project. The first thing I did was ask for a vision. And they just repeated: "Oh yeah, build this basic thing here, we can try to analyze the rest later, it will be fine." So I started implementing it. And lo and behold, it wasn't what they wanted. It showed error messages that were clear to us, but not to the expected user. It required things which seemed obvious to me, but which weren't actually needed. At all. Even though it technically did everything they asked, it just didn't match their vision. And yes, by now I do have a vision for the project. Most people actually have a vision in mind, but they somehow don't want to spell it out. It's crazy!

Vision == Motivation

A vision keeps us motivated. That's all. Let's go to the next paragraph. Oh, you want some more details? Ok. Well, it's simple. We can code, and code, and code, but we will never "get there", we will never "get closer to the finished product". We will always be stuck in a rut. Why? We reach the next goal, so our boss just figures out a new one for us. It sucks! We don't see the end of the tunnel. But a vision changes that! With a vision, we suddenly have something to strive for. The goals suddenly aren't arbitrary points that the boss sets for us, but actual milestones that bring us closer to the ultimate goal. Which is really what we want. Because then, at the end of the day, we can say: "Yeah, I got closer to the vision, and people are even happier!"

Steve Jobs, A Visionary

One of the things that I think is just amazing about Steve Jobs is that he always had a vision. Maybe his private vision was to be a jerk and still be considered a successful, beloved businessman, and if it was, he succeeded. But even more amazing are his visions for the company. For example, before the iPhone, no one thought such a thing would be possible. And Steve Jobs just said: "It is, do it!". Or as Steve (being a lot more eloquent than I am) would have put it:
Do we have what it takes to establish a third category of products? The bar is pretty high. It has to be far better at doing some key things. We think we have the goods. Our most advanced technology in a magical and revolutionary device at an unbelievable price.
Now that's a vision! If you liked this post, you should follow me on Twitter and sign up for the RSS feed!

What you should know about SOLID

Did you always wonder what that whole SOLID thing is all about? Why everyone is shouting at you for violating the Law of Demeter (you won't go to jail for that one, by the way)? And did you start talking about soccer when people mentioned the Liskov substitution principle and looked at you expectantly? Then read on! I always like going to the SOLID Wikipedia page because SOLID is a mnemonic acronym. SOLID stands for Single responsibility, Open-closed, Liskov substitution, Interface segregation and Dependency inversion. Makes sense yet? Let's briefly go over each one.

Single Responsibility Principle

The Single Responsibility Principle says that every class should have one job, and one job alone, but do it really, really well. If you want to know whether your class violates SRP, describe what the class does. If you use "and" or "or" in the sentence, it is a strong sign that you are violating the principle. Let's look at an example of a duck:
class Duck
  attr_accessor :name

  def eat
    @food -= 1
    remove_food
  end

  def swim
    # paddle around the pond
  end

  def remove_food
    # update the pond's food supply
  end
end
A very simple class that does one thing: keep track of a duck (in a pond, for example). The problem with this example is that if we want to describe it, we would probably say the following: it keeps track of a duck and keeps track of the food. Yes, the dreaded "and". So how would we rewrite this example? One way to encapsulate this better is to move the food handling into a Food class and not manage the food supplies in the Duck itself. SRP also says that the model should hide its implementation details. That's why the example uses eat instead of grab_food: how it eats is an implementation detail of the Duck model itself. If we want to change how food is handled, the only thing we should have to change is the Duck class itself, and nothing else.
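One possible SRP-friendly split could look like this (a sketch, the names are mine): the Duck knows how to eat, the Food knows how to track supplies.

```ruby
class Food
  attr_reader :amount

  def initialize(amount)
    @amount = amount
  end

  def take
    @amount -= 1
  end
end

class Duck
  def initialize(food)
    @food = food
  end

  # eating is still the duck's job; bookkeeping belongs to Food
  def eat
    @food.take
  end
end

pond_food = Food.new(3)
Duck.new(pond_food).eat
pond_food.amount  # => 2
```

Now each class can be described without an "and": one keeps track of a duck, the other keeps track of food.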

Open/Closed Principle

The next principle is about open/closed classes. You might now say: "But Codepoet, in Ruby all classes are open!". And you would of course be right. But that's not what the open/closed principle is about. If we want to follow this principle, a class cannot be changed once its implementation is completed, except for bug fixes. If the behavior should change for one part of the system, however, the class should be open to being extended in another class. This is one of those principles which is sometimes hard to follow, since requirements change and programs evolve. So sometimes a class has to be changed, or some functionality added. Always making a new class just isn't feasible. However, keeping this principle in mind is a good thing, especially in Ruby, where we have easy ways to extend classes and even individual objects. Maybe some methods shouldn't be added to the base class, since most of the system doesn't need them, but only to one part of the system.
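A tiny sketch of "closed for modification, open for extension": instead of editing Duck for the one subsystem that needs fancier quacking, we extend it in a subclass (RoboDuck is a made-up example of mine).

```ruby
class Duck
  def sound
    "quack"
  end
end

# new behavior lives here; Duck itself stays untouched
class RoboDuck < Duck
  def sound
    "#{super} (synthesized)"
  end
end

Duck.new.sound      # => "quack"
RoboDuck.new.sound  # => "quack (synthesized)"
```

Every existing user of Duck keeps working unchanged, and only the part of the system that wants the new behavior instantiates RoboDuck.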

Liskov Substitution Principle

LSP (not to be confused with LSD) states that any object inheriting from another has to keep the constraints of the superclass intact. This includes preconditions (this value can't be null), postconditions (the return value is rounded to an integer) and invariants. Sounds too abstract? I agree, so let's make an example: say we develop a system to measure ducks. Of course we have a RealDuck class that can swim, eat, quack, etc. Swim means we give it a distance, it swims that far, and then returns. If we then ask it for its position, it will have advanced the given distance. Easy as pie! Now we want to compare the real ducks with a mechanical replacement duck (the reason for doing this is left to the reader). So we subclass RealDuck, add a new method replace_batteries, and we are done! But not quite! Even though this looks perfect on paper, we just violated the Liskov Substitution Principle. Why? Because now, if the MechanicalDuck swims, it might not get a single foot ahead, because its battery is empty. But that's an invariant that RealDuck guarantees! (Except if you injure the duck first, but come on!) Most violations of LSP are small, but they can have a huge rippling effect if the rest of the system relies on the original behavior.
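The duck example as code (a sketch with made-up classes): RealDuck guarantees that swim advances the position, and MechanicalDuck silently breaks that guarantee when its battery is empty.

```ruby
class RealDuck
  attr_reader :position

  def initialize
    @position = 0
  end

  def swim(distance)
    @position += distance
  end
end

class MechanicalDuck < RealDuck
  def initialize
    super
    @battery = 0  # ships with an empty battery
  end

  def replace_batteries
    @battery = 100
  end

  def swim(distance)
    # LSP violation: the superclass promises movement, we may not move at all
    @position += distance if @battery > 0
  end
end

RealDuck.new.swim(5)        # position is now 5
MechanicalDuck.new.swim(5)  # position is still 0, callers are in for a surprise
```

Any code written against RealDuck that checks the position after a swim will quietly misbehave when handed a MechanicalDuck.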

Interface Segregation Principle

Another principle with an impressive name. But this one is simple to understand: design your interfaces so that they deal with one thing, and one thing only. Sounds familiar? It should, because it is similar to the Single Responsibility Principle discussed above. The difference between the two is that SRP applies mostly at the class level, while ISP is concerned mostly with the overall architecture of a system. If you follow this principle, your interfaces will be small and to the point, without anything in them that could be separated out. Let's look at RealDuck and MechanicalDuck again. The real duck has to eat, of course, and that's why eat is in the class. But a mechanical device doesn't have to eat! So it doesn't make any sense for the mechanical duck to have an eat method. The interface is too large for this use case, and it should be refactored.
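In Ruby, one natural way to segregate interfaces is small modules that get mixed in only where they make sense (a sketch of mine, with made-up methods):

```ruby
module Swimming
  def swim
    "paddling"
  end
end

module Eating
  def eat
    "nom nom"
  end
end

class RealDuck
  include Swimming
  include Eating
end

class MechanicalDuck
  include Swimming  # no Eating: nothing to refactor out later
end

MechanicalDuck.new.respond_to?(:eat)  # => false
```

The mechanical duck simply never gains an eat method, so there is no dead interface to carry around.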

Dependency Inversion Principle

The Dependency Inversion Principle states that high-level and low-level modules shouldn't be tightly coupled together; it should be possible to replace one without touching the other. This is especially useful for testing the higher-level components, because the lower-level modules they depend on can simply be replaced by mocks. In Ruby, one way to achieve this is to add an optional argument to initialize through which the dependency can be replaced. To give an example:
class Duck
  def initialize(food_source = DefaultFoodSource.new)
    @food_source = food_source
  end
end
This way we can pass in another food source if we want (for example, an InfiniteFoodSource), so that we can test the duck in isolation.
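Fleshing that snippet out a bit (the eat delegation and both food source classes are my assumptions, not part of the original example):

```ruby
class DefaultFoodSource
  def serve
    "bread"
  end
end

# a stand-in dependency for tests
class InfiniteFoodSource
  def serve
    "infinite crumbs"
  end
end

class Duck
  def initialize(food_source = DefaultFoodSource.new)
    @food_source = food_source
  end

  # the duck only talks to whatever responds to #serve
  def eat
    @food_source.serve
  end
end

Duck.new.eat                          # => "bread"
Duck.new(InfiniteFoodSource.new).eat  # => "infinite crumbs"
```

The duck never names its collaborator's class, only its interface, which is exactly the decoupling the principle asks for.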

And That's It!

Yes, that's it. We are done. No more talking about sports when the Liskov Substitution Principle comes up. No more handstands when someone talks about Dependency Inversion. Nice! Though always keep in mind: these are just guidelines. Principles. They aren't law. If you have a good reason to go against one of them, then do it! If you liked this post, you should follow me on Twitter and sign up for the RSS feed!

"Working with Unix Processes" Review (By Jesse Storimer)

The last thing I want to do is deal with low-level operating system stuff. That's one of the reasons I like Ruby: it keeps me from having to deal with all that. So why the hell would I read "Working with Unix Processes"? Well, first, it was the book of choice for the (fantastic) Ruby Rogues podcast, episode 058. And second: it was a really amazing ride! Jesse manages to get down to the lowest level and still explain everything in terms I actually understand! You can get the book on Amazon for around 17.00€ ($21.00). Side note: "Working with Unix Processes" was first self-published by Jesse Storimer over at workingwithunixprocesses.com, but has now been picked up by The Pragmatic Programmers.

Content

"Working with Unix Processes" concentrates on the rarely used Ruby method Kernel#fork and its companions. Jesse Storimer starts out with a gentle introduction to processes, process IDs and everything else you didn't know you wanted to know about processes. Did you know, for example, that processes belong to a process group? And that a whole process group can be killed at the same time? And that that's basically how a terminal emulator works? No? Neither did I! After he explains everything about processes, he goes on to explain that processes can fork. This is actually a very important feature of all operating systems: every process is spawned from another process, starting from one created at boot. And so he introduces us to parent processes, child processes and orphaned processes, and what those terms mean. After forking, the next important topic is waiting on child processes (for example, to collect the results of multicore processing). Of course, now that we have multiple processes, another important aspect is communication between them. This is done either through signals or through interprocess communication. He then ends "Working with Unix Processes" with an extensive appendix, where he digs deeper into the inner workings of such gems as Resque and Unicorn.
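The fork-and-wait dance the book revolves around fits in a few lines (Unix only; this sketch is mine, not lifted from the book):

```ruby
# fork returns the child's pid in the parent and nil inside the child
child_pid = fork do
  # this block only runs in the child process
  puts "child: pid #{Process.pid}, parent #{Process.ppid}"
end

puts "parent: pid #{Process.pid}, spawned #{child_pid}"

# the parent blocks until the child exits (otherwise the child
# lingers as a zombie until someone reaps it)
Process.wait(child_pid)
```

Inside the block, Process.ppid points back at the very process that called fork, which is the parent/child relationship the book spends its first chapters on.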

The good

One part I loved about the book is that each chapter has a short note on "use in the real world". Here Jesse Storimer highlights projects that use the technique just described. Another nice thing he did was include the Spyglass server: a heavily documented server in Ruby that he developed specifically for the book, to highlight how his techniques can be leveraged. Reading through it is pure joy! A very nice feature of the book is that it is a living book. Since Jesse released it (Dec 20, 2011) he has published 14 new versions (some just fix typos, but others add whole new chapters or new and extended explanations).

The bad

For one, the name. It is called "Working with Unix Processes", but a better name would have been "Working with Unix Processes in Ruby". A lot of the advice he gives is very Ruby-specific, and all examples are in Ruby. I'm not saying it isn't applicable if you use other languages, but this book focuses on Ruby. Of course, this was the audience he had in mind, but it is something everyone should remember when buying the book. I am also not sure what the target audience for the book is. A systems programmer (who doesn't know Ruby) doesn't need to be told how fork works, or what process IDs are. For a Ruby programmer who just wants to program a bit closer to the system, the book is a bit too terse, and the examples too abstract.

Conclusion

Despite its shortcomings, I found it a great book to read. You can read it in 3 hours (aka a rainy Sunday afternoon) and get through it pretty quickly. If you want to dive deeper, the source code for Spyglass and the other mentioned open source projects is right at your fingertips. I recommend this book to everyone who is interested in a more systems-oriented programming approach and wants to know the basics, so they can see when those techniques could help. Ruby knowledge is required, though, or else the book won't make much sense. The topic sounds boring, mundane even, but if you are at least a bit curious you should give it a chance and read it. Sparked your interest? Then don't wait and get the book! If you liked this post, you should follow me on Twitter and sign up for the RSS feed!