Web mashups

According to Wikipedia, “A mashup is a website or web application that seamlessly combines content from more than one source into an integrated experience.” More popularly what it means is that you take content, data or query results from various Internet web applications such as Google Maps, Yahoo Flickr, Amazon.com etc. and put them all together into a single web application that provides a service that is not available from any of the source services. It’s a synergistic approach to building applications that uses existing applications as building blocks. Some call it web application hybrid, but I personally think it’s more than that. ‘Hybrid’ gives an impression that it’s just a sum of the individual (web applications) but I think mashups have much greater potential than being a ‘glue’ for web applications. It’s not 1 + 1 + 1 = 3, it’s more like (or should be) 1 + 1 + 1 = 10.

A typical simple mashup should be something like this :-

  1. Take the job search results from Indeed.com APIs (say of jobs relating to ‘Ruby on Rails’)
  2. Using the information results, map the jobs’ locations using Google Map APIs
  3. At each location, show the company that is returned, and latest news on the company using Yahoo Web Search Service

At a basic level this service is not something that is provided by any of the web applications individually (though Yahoo comes close) and this is a simplistic mashup anyway. With more advanced services being added continually on the Internet every day, we could continually enhance this in a way that might not be anticipated by the original service providers.

I used to do ‘mashups’ during my days with elipva, though at that point in time we didn’t call it that. We had the dotcom names for them then — things like portal, personalization and content aggregation, using technology like screen-scraping, iframes and such. Names seem to have changed, but the ideas have evolved but strangely persisted.

Maybe good ideas stick around? Or we never learn from our ‘failures’? I would rather be more optimistic on this. In any case, I still love a good mashup, and I still love coding them, whatever they’re called.

Over the next few weeks/months (depending on how much free time I have) I’ll try to implement the simple job search mashup I just described above, and will describe how it can be done, step-by-step. The technology used? Ruby on Rails, of course. Stick this page in a bookmark somwhere if you want to see how I do it.

CardsAsia::Payment World 2006

Was at the CardsAsia today, some pretty interesting topics discussed during the conference. The exhibition was disappointing though — it was much smaller than I thought, and there were practically no terminal vendors or card vendors. Ingenico, Hypercom, Verifone etc were all missing, Gemplus, Axalto, G&D, Obethur weren’t there at all.

The conference items were an eye opener though. More in another post. In the meantime, took a snapshot of the Asian payment bootcamp session, which had Aneace as a moderator.

Nintendo DS Lite!

I bought an NDS Lite (light blue) yesterday! Also bought the ‘Age of Empire: Age of Kings’ game, cost me bomb — S$410, but worth every cent! Amazing little game gadget, with touch screen, dual screen, great user interface, built-in wifi. I cannot believe how much good stuff Nintendo managed to cram into something so small!

Projectible in alpha

I’ve wrapped up on the broad strokes on Projectible, finally! This is something I’ve been doing on and off for the past few weeks. Originally a subset of Beehive’s massive set of new features, I never really thought how much real work it took to get a publicly available web application ready for even semi ‘production’ use. Real lessons learnt here on deploying a live Rails application.

Well, what is Projectible? Projectible is an online project management and publication tool, that focuses on an important feature that not many online web project management do — to publish MS Project plan files. Yes, that’s right — Projectible can take a MS Project plan you have and push it to the server, but you’ll need to use a small client application to do this. Besides this, everything else is like any other online project management and publication tool, but I believe this is a crucial feature. MS Project is used by almost everyone I know who manages a project professionally yet most people ignore this and goes on to repeat what MS Project does, online!

So how does it compete with Microsoft’s own Project Server? There isn’t any real competition though — MS Project Server is not avaiable publicly and requires someone to host and maintain it, so it’s often out of question for many except for larger enterprises. Besides MS Project Server is not easy to use or maintain (I know — I use it in Welcome), and it is a bandwidth hog — it’s deployment guide states flatly that it requires 10Mbps to run it normally. It’s not really meant for Internet use though you can do it.

How about other online project management tools like the venerable Basecamp, the original web application which Rails is extracted from? Well, Basecamp is a great tool, but it mainly focuses on basic project management features and for people who are used to MS Project the interface isn’t really that familiar. Besides, Projectible’s main focus is publication of MS Project plan files, something which Basecamp is not into.

Of course, Basecamp and other tools have been around for a while, and I’ve just started Projectible :)

Try it out at http://www.projectible.com !

I’ve also moved all blogs to do with Projectible to http://blog.projectible.com

Evolutionary Algorithms and Genetic Programming [4]

The Mechanics of Genetic Programming

Having gone through the necessary background, let’s go into the actual mechanics of how genetic programming works. The basic mechanics of GP are very similar to those of genetic algorithms (of which GP is really an offshoot). The algorithm for GP can be summarized as:

  1. Randomly generate an initial population P.
  2. Repeat the following sub-steps until an acceptable solution is found or the maximum number of generations is reached:
    • Execute each program in the population and assign it a fitness value indicating how good a solution to the problem it is.
    • Create a new population by selecting members with a bias towards the fitter programs, and then applying crossover and mutation to produce new P.
  3. The program found to give the best solution to the problem is designated as the result of this run of genetic programming.

For any particular problem it is necessary to specify the list of functions and terminals that shall be used to create the P. These in effect constitute the language to be used to solve the problem. The terminals will form the leaf nodes of each tree, since they take no arguments, and the functions will form all other nodes – having one branch for each argument they require.
So for example, lets say we are attempting to solve a problem which requires finding a boolean function that takes the input from three data lines and returns true if an even number of them are true and false otherwise (the even-3-parity rule). We might use the terminal set:

T = { A, B, C }

And the function set:

F = { AND, OR, NOT }

How do we produce the random trees in the initial population? How do we perform crossover and mutation on them later on during reproduction?

Creation

Creating random trees is easily done by a recursive process:

  1. For the root node select a random primitive from the combined set of functions and terminals.
  2. If a terminal is selected the process ends there.
  3. Otherwise for each of the functions arguments produce a random sub-tree by repeating the same process.

Figure 3 : Randomly created program trees

In practice it is necessary to set a limit on the size to which the tree can grow by only selecting from the list of terminals after reaching a certain depth. Also, since trees consisting of only a single terminal are not likely to be very interesting they are usually excluded by forcing the choice of primitive at the root of the tree to be a function.

Recombination/Crossover

Crossover is the random exchange of genetic material between two parents to produce a unique offspring. In GP this is easily done by selecting a random sub-tree from within each of the parent trees, and then swapping them over. For example:

Figure 4 : Recombination/Crossover

(a) and (b) are the parents. Subtree 1 and subtree 2 are the randomly selected sub-trees. After the sub-trees have been swapped (c) and (d) are the resulting children.

Mutation

Mutation is random changes to the genetic material of a single individual. In GP this is again easily performed by deleting a randomly chosen sub-tree and replacing it with a newly created one. For example:

Figure 5 : Mutation

The program tree is now mutated. The original sub-tree is removed and replaced by a new subtree (which has been randomly generated) to produce the final program tree.

Evolutionary Algorithms and Genetic Programming [3]

Required theoretical background

Prefix Notation and S-Expressions

The most common way of writing down a function with two arguments, for example addition, is the infix notation. That is, the two arguments are connected with the operation symbol between them.

A + B

A different method is the so-called prefix (or polish) notation. Here the operation symbol is written down first, followed by its required arguments.

+ A B

While this may be a bit more difficult or just unusual for human eyes, it opens some advantages for computational uses.
The computer language LISP uses so-called symbolic expressions (or S-expressions) composed in prefix notation. Then a simple S-expression could be
(operator argument)

where operator is the name of a function (here with one argument). Argument can be either an atom (i.e. constant or variable) or either another symbolic expression (e.g. S-expressions can be nested).

(operator argument (operator argument) (operator argument))

In LISP, programs and data have the same structure providing an easy means of manipulation and evaluation. This is why LISP is popularly used for implementing genetic programming.

Program trees and Subtrees

A tree in this context is a hierarchical structure consisting of nodes and leaves. The simplest tree consists only of a single leaf. Nodes are inner points of trees and may have more nodes or just leaves connected to them. The topmost point of a tree is called the root of this tree.

A program tree is describes a computer program. Functions are written down as nodes, their arguments as leaves. There is a close relationship between these program trees and S-expression; in fact these trees are just another way of writing down the expressions.

Figure 1 : A program tree representing the expression (+ 1 2)

Figure 2 : A program tree representing the expression (* 3 (+ 1 2))

A subtree is the part of a tree that is under an inner node of this tree. If this tree is cut out from its parent, the inner node becomes a root node and the subtree is a valid tree of its own. Subtrees are equivalent to inner S-expressions in other S-expressions.

is a valid subtree of 

A method to measure the complexity of these trees is the depth, meaning the maximum number of parent nodes a leave might have, or the complete size, meaning the total sum of leaves and nodes a tree contains.

Functions and Terminals

While the term function is self-explanatory, the meaning of terminal is not so obvious. While functions will be the nodes of the trees (or the operators in the S-expressions) and can have other functions as their arguments, the leaves will be formed by terminals, that is symbols that may not be further expanded.
In this work, a variety of functions will be used. Simple arithmetic functions like addition or multiplication, more complex ones like sinus or highly specific functions like IF-FOOD-AHEAD (a conditional branch). Terminals can be variables, constants or specific actions that are to be performed.
The process of selecting the functions and terminals that are needed or useful for finding a solution to a given problem is one of the key steps in the preparation of genetic programming.

Evaluation

Evaluation of these structures is straightforward. Beginning at the root node, the values of all sub-expressions (or subtrees) are computed, descending the tree down to the leaves. Their return values will serve as arguments for the upper node.

(+ 3 ( - 4 2)) = (+ 3 2) = 5

(- (* 2 3) (+ 2 1)) = (- 6 (+ 2 1)) = (-6 3) = 3

If the function represents a conditional branch, it is imperative that just the designated argument is evaluated; the other(s) must be ignored.

Fitness Function

The most difficult and most important concept of genetic programming is the fitness function. The fitness function determines how well a program is able to solve the problem. It varies greatly from one type of program to the next. For example, if one were to create a genetic program to set the time of a clock, the fitness function would simply be the amount of time that the clock is wrong. Unfortunately, few problems have such an easy fitness function; most cases require a slight modification of the problem in order to find the fitness.

Evolutionary Algorithms and Genetic Programming [2]

Genetic Algorithm

Genetic algorithms are a variation of evolutionary algorithms. Genetic algorithms model of machine learning that derives its behavior from a metaphor of the processes of evolution in nature. This is done by the creation within a machine of a population of individuals represented by chromosomes, in essence a set of character strings that are analogous to the base-4 chromosomes that we see in our own DNA.

The individuals in the population then go through a process of evolution. The processes of nature boil down to different individuals competing for resources in the environment. Some are better than others. Those that are better are more likely to survive and propagate their genetic material.

Genetic algorithms are used for a number of different application areas. An example of this would be multidimensional optimization problems in which the character string of the chromosome can be used to encode the values for the different parameters being optimized. In practice, therefore, we can implement this genetic model of computation by having arrays of bits or characters to represent the chromosomes. Simple bit manipulation operations allow the implementation of crossover, mutation and other operations.

Although a substantial amount of research has been performed on variable-length strings and other structures, the majority of work with genetic algorithms is focused on fixed-length character strings. We should focus on both this aspect of fixed-lengthness and the need to encode the representation of the solution being sought as a character string, since these are crucial aspects that distinguish genetic programming, which does not have a fixed length representation and there is typically no encoding of the problem.

When the genetic algorithm is implemented it is usually done in a manner that involves the following cycle:
1.    Evaluate the fitness of all of the individuals in the population.
2.    Create a new population by performing operations such as crossover, fitness-proportionate reproduction and mutation on the individuals whose fitness has just been measured.
3.    Discard the old population and iterate using the new population.

One iteration of this loop is referred to as a generation. There is no theoretical reason for this as an implementation model. Indeed, we do not see this punctuated behavior in populations in nature as a whole, but it is a convenient implementation model. The first generation (generation 0) of this process operates on a population of randomly generated individuals. From there on, the genetic operations, in concert with the fitness measure, operate to improve the population.

Genetic Programming

Genetic programming is a branch of genetic algorithms. The main difference between genetic programming and genetic algorithms is the representation of the solution – genetic algorithms create a string of numbers that represent the solution while genetic programming creates computer programs as the solution.

The objects in genetic programming that constitute the population are not fixed-length character strings that encode possible solutions to the problem at hand, they are programs that, when executed, “are” the candidate solutions to the problem. These programs are expressed in genetic programming as program trees, rather than as lines of code. Thus, for example, the simple program “a + b * c” would be represented as:

Or, to be precise, as suitable data structures linked together to achieve this effect. Because this is a very simple thing to do in the programming language LISP, many GP programmers tend to use LISP. However, this is simply an implementation detail. There are straightforward methods to implement GP using a non-LISP programming environment. The programs in the population are composed of elements from the function set and the terminal set, which are typically fixed sets of symbols selected to be appropriate to the solution of problems in the domain of interest. In GP the crossover operation is implemented by taking randomly selected subtrees in the individuals (selected according to fitness) and exchanging them.   It should be pointed out that GP usually does not use any mutation as a genetic operator.

Further explanation of how genetic programming works is given in the section below.