Intro to Machine Learning in Ruby

Machine learning is a branch of Artificial Intelligence (AI) concerned with the design and development of algorithms that allow computers to learn. It's a very broad subject, so we will focus on one simple example that uses statistical classification.

 

Let's build...

 

In this tutorial we are going to build a simple news classification application that will parse and classify RSS/HTML articles from the Times Live newspaper.

For the job, we will use the nokogiri gem and two Ruby standard libraries: open-uri and rss/2.0.

RSS Parser

To find sources of articles for processing, we could build a complex search engine, or we can simply use the RSS feeds the newspaper provides to discover links. The RssParser class does exactly that: you initialize it with a feed URL and it gives you back the links to all the articles discovered in that feed.
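The original RssParser listing is not reproduced here, so the following is only a minimal sketch of the idea, assuming the feed is RSS 2.0. The method names article_urls and links_from are hypothetical; URI.open and RSS::Parser.parse come from the open-uri and rss libraries.

```ruby
require "open-uri"
require "rss"

# Hypothetical sketch of the RssParser described above.
class RssParser
  def initialize(url)
    @url = url
  end

  # Downloads the feed and returns the links of the articles found in it.
  def article_urls
    self.class.links_from(URI.open(@url).read)
  end

  # Extracts the <link> of every <item> from a feed passed as an XML string.
  def self.links_from(xml)
    feed = RSS::Parser.parse(xml)
    feed.items.map(&:link).compact
  end
end
```

Keeping the XML handling in links_from means the link extraction can be tried on a saved feed without touching the network.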

HTML Parser

Having the article links, we need to parse the page content and extract the meaningful parts of each page. The HtmlParser class is initialized with a page URL and a DOM selector. In this example we use a CSS selector to extract the content from articles; Firebug and jQuery were used to find the selector for the text we are extracting. In this class you will also notice the clean_whitespace method, which cleans the whitespace characters from the extracted text.
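The whitespace cleanup is easy to illustrate on its own. The helper below is an assumed implementation, not the original method: it collapses every run of whitespace into a single space and trims both ends.

```ruby
# Hypothetical stand-in for HtmlParser#clean_whitespace.
def clean_whitespace(text)
  # Collapse runs of spaces, tabs and newlines into one space, then trim.
  text.gsub(/\s+/, " ").strip
end
```

Text extracted from HTML usually carries the indentation and line breaks of the markup, which is why a step like this comes before any word counting.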

Statistical Classifier

We introduce a class that is responsible for classification of articles. It is initialized with a hash that consists of categories (keys) and training data (values).

Training data is used to discover potential relationships between articles and categories. This data should be carefully selected in order to give better classification results. It is created by determining the value for each word in the context of all words for that category (see the train_data() method).
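One plausible reading of "the value for each word in the context of all words for that category" is relative frequency: how often a word occurs divided by the total number of words in the category text. The sketch below makes that assumption explicit; the original train_data() may weight words differently.

```ruby
# Assumed weighting: a word's value is its relative frequency
# within the training text of its category.
def train_data(training_data)
  training_data.each_with_object({}) do |(category, text), weights|
    words = text.downcase.scan(/[a-z']+/)
    counts = words.each_with_object(Hash.new(0)) { |word, c| c[word] += 1 }
    weights[category] = counts.transform_values { |n| n.to_f / words.size }
  end
end

# Toy training data; real training texts would be whole articles.
weights = train_data({ sport: "goal match goal team", health: "doctor diet doctor" })
```

The result maps each category to a hash of word weights, which is the shape a scoring step can consume.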

In this example we use the content of the Wikipedia articles on economy, sport and health as training data for our categories.

When classifying articles we want to compare only meaningful words and ignore other words that do not add any value for a certain category. We (partially) solve this problem using stop words.
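Stop-word filtering can be sketched as follows. The list here is a tiny illustrative sample and the method name meaningful_words is hypothetical; a real list would be much longer.

```ruby
require "set"

# A tiny illustrative stop-word list (assumption, not the original list).
STOP_WORDS = Set.new(%w[a an and are in is of on the to was])

# Splits text into lowercase words and drops the stop words.
def meaningful_words(text)
  text.downcase.scan(/[a-z']+/).reject { |word| STOP_WORDS.include?(word) }
end
```

A Set is used so that each membership check stays cheap even when the stop-word list grows to several hundred entries.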

Finally, the scores() method creates the scores for each category (per text) that we are testing.
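A minimal sketch of the scoring step, assuming the training step produced a hash of word weights per category. The WEIGHTS values are made up for illustration, and classify is a hypothetical convenience method that picks the best-scoring category.

```ruby
# Hypothetical output of the training step: category => { word => weight }.
WEIGHTS = {
  economy: { "market" => 0.5, "bank" => 0.25 },
  sport:   { "goal" => 0.5, "team" => 0.25 }
}

# Sums, per category, the weights of the words that occur in the text.
def scores(text, weights = WEIGHTS)
  words = text.downcase.scan(/[a-z']+/)
  weights.each_with_object({}) do |(category, word_weights), result|
    result[category] = words.sum { |word| word_weights.fetch(word, 0.0) }
  end
end

# Returns the category with the highest score.
def classify(text, weights = WEIGHTS)
  scores(text, weights).max_by { |_category, score| score }.first
end
```

Words unknown to a category contribute nothing to its score, so an article ends up in the category whose training text shares the most heavily weighted vocabulary with it.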

Let's have a go

Here is the script that runs the program:

Although our statistical classification algorithm is very simple, it can give remarkably good results provided the training data is good. For even better results you can try other classification algorithms like Bayesian probability and Latent Semantic Analysis.

If you are interested in a more in-depth example of a news aggregator application, you can check out newsagg on GitHub. It's a simple Sinatra application with a Redis datastore that we put together. It crawls, classifies and creates 'clusters' of articles using statistical algorithms.

If you want to learn more about Machine Learning, check out the books Programming Collective Intelligence (code examples written in Python) and Scripting Intelligence (code examples written in Ruby).

Redis in the NoSQL ecosystem

Redis (REmote DIctionary Server) is an in-memory key-value data store that also supports saving data to disk for persistence. It supports several data types: Strings, Hashes, Lists, Sets and Sorted Sets. It also implements the publish/subscribe messaging paradigm and supports transactions.

All these different options place Redis in the NoSQL ecosystem somewhere between simple caching systems like memcache and feature-heavy document databases like MongoDB and CouchDB. The question is: when do you pick Redis over other NoSQL systems?

Give us some ACID

Before going into the use cases, there is one more important thing to say about Redis. Redis is single threaded, which allows it to be ACID compliant (Atomicity, Consistency, Isolation and Durability). Other NoSQL databases generally don't provide ACID compliance, or provide it only partially. By default Redis trades some durability for speed: when the append-only file is enabled, the default appendfsync policy is everysec, which means data is fsync()ed to disk once per second. Because Redis is very configurable, you can change how often it calls fsync() with the appendfsync configuration directive (with appendfsync always the system fsyncs after every write, which is the slowest but safest option).
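In redis.conf the strictest setting corresponds to the two directives appendonly yes and appendfsync always. As a sketch, the same settings can also be applied at runtime with CONFIG SET. The helper name make_fully_durable is hypothetical, and redis is assumed to be a client object from the redis gem (for example Redis.new).

```ruby
# `redis` is assumed to be a client from the redis gem, e.g. Redis.new.
def make_fully_durable(redis)
  redis.config(:set, "appendonly", "yes")     # enable the append-only file
  redis.config(:set, "appendfsync", "always") # fsync after every write
end
```

Settings changed this way last only until the server restarts unless they are also written to the configuration file.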

When to use Redis?

In your production environment you don't need to switch to Redis. You can just use it for the new things you are implementing. Always pick the right tool for the job. For stable, predictable and relational data, pick a relational database. For temporary, highly dynamic data, pick a NoSQL database; schema changes can be a big problem and can take a very long time in big relational databases.

If you have highly dynamic data that changes often, grows quickly and would otherwise require frequent schema adjustments to store, then Redis can be a good choice.

If you need a more fully featured document-oriented database that allows you to perform range queries, regular expression searches, indexing and MapReduce, you should look at MongoDB, CouchDB or similar. If you need simple caching with better expiration algorithms than Redis has, then you should look at memcache.

Redis Use-Cases

  • Access Logger: When you need to log different activities, Redis is a good solution. Because Redis has to keep all stored objects in memory, don't forget to archive the data to a relational or document database, because it can grow quickly.
  • Counting Downloads: Rubygems uses Redis for counting downloads of gems. See how it's implemented in the Download model.
  • High Score tables: Redis sorted sets keep their members ordered by score, which is very handy for rankings.
  • Who's Online: Use Redis to implement who-is-online logic in your application.
  • Caching: Finding followings, followers or similar is a very expensive operation in relational databases; use Redis to cache this data.
  • Queues: Resque is a Redis-backed Ruby library for creating background jobs, placing them on multiple queues, and processing them later.
  • Live debugging: When you need to do live debugging or roll out new features for production testing to specific users only, the Rollout gem does exactly that.
  • HN style social news site written in Ruby/Sinatra/Redis/jQuery - lamernews.
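To make the High Score tables use case from the list above concrete, here is a minimal sketch built on a sorted set. The class name HighScores and the key name are hypothetical; zadd and zrevrange are methods of the redis gem client, which is assumed to be passed in.

```ruby
# `redis` is assumed to be a client from the redis gem, e.g. Redis.new.
class HighScores
  def initialize(redis, key = "highscores")
    @redis = redis
    @key = key
  end

  # Stores (or updates) the score of a player in the sorted set.
  def record(player, score)
    @redis.zadd(@key, score, player)
  end

  # Returns [[player, score], ...] for the best `count` players, highest first.
  def top(count)
    @redis.zrevrange(@key, 0, count - 1, with_scores: true)
  end
end
```

Because Redis keeps the set ordered as scores are written, reading the top of the table does not require sorting in the application.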

Play with Redis (install, start and stop server)

Redis console

You can use redis-cli to connect to a local or remote Redis server and call commands. Here is an example (first connect to the server using: redis-cli -p 6379):

Redis from Ruby

Here is an example using Ruby to execute commands on a Redis server. You need to install the redis gem first by executing gem install redis.
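The original listing is not reproduced here, so the following is only a short sketch of what such a session might look like. The method name redis_tour and the key names are made up; set, get, incr, rpush and lrange are regular methods of the redis gem client.

```ruby
# `redis` is assumed to be a client from the redis gem, created with e.g.:
#   require "redis"
#   redis = Redis.new(host: "localhost", port: 6379)
def redis_tour(redis)
  redis.set("greeting", "hello")   # store a String
  redis.incr("visits")             # atomically increment a counter
  redis.rpush("queue", "job-1")    # append to a List

  {
    greeting: redis.get("greeting"),
    visits:   redis.get("visits"),
    queue:    redis.lrange("queue", 0, -1)
  }
end
```

Passing the client in as an argument keeps the connection details in one place and makes the method easy to try against a test server.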

Learn more about Redis commands and give some speed to your web applications for free.

Real world Redis example

Finally, let's look at a real-world example of how Redis is used in Rubygems for caching gem download counts. To keep the code snippet short, some code is omitted and/or simplified.

First, in the initializer a new redis object is instantiated as the global variable $redis. This object is used in the Download model for updating the download count for a gem with the self.incr(name) method and for reading the download count for a gem with the self.for_rubygem(name) method. Rubygems uses a Sinatra application, Hostess, to speed up gem downloads. The Sinatra application is registered as middleware in the application.rb config file. This application defines the get "/gems/*.gem" route, which triggers the download count update in the Redis database.
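As a rough sketch of the two model methods described above (not the actual Rubygems source), the counter could look like this. The key layout is an assumption made for illustration, and $redis is assumed to have been set in an initializer, for example with $redis = Redis.new.

```ruby
# Simplified sketch of the Download model; $redis is assumed to be a
# redis gem client created in an initializer.
class Download
  # Hypothetical key layout; the real application may name its keys differently.
  def self.key(name)
    "downloads:rubygem:#{name}"
  end

  # Increments the download counter for the gem and returns the new value.
  def self.incr(name)
    $redis.incr(key(name))
  end

  # Reads the download count for the gem (0 if it was never downloaded).
  def self.for_rubygem(name)
    $redis.get(key(name)).to_i
  end
end
```

INCR is atomic on the server, so concurrent downloads of the same gem cannot lose updates.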

Rubygems tracks more download statistics, such as total downloads, downloads per gem and downloads for a specific gem version. Check out the source code on GitHub for more details.

jQuery Mobile overview

We've all heard it: the browser is the future of mobile applications. Mobile web frameworks have been getting a lot of traction lately. Is this the turning point where these frameworks will finally free developers from the walled garden model of the Apple App Store and from the complexity and resources needed to develop and maintain a separate code base for each platform out there?

It's been almost a month since the 1.0 stable release of jQuery Mobile, the touch-optimized mobile web framework. Only a year after John Resig announced the project in its first alpha stage, the jQuery team managed to deliver a solid release with impressive graded support for a wide range of mobile browsers and platforms.

Built upon the jQuery core and the jQuery UI, the framework aims to deliver a quick and easy way to build unified cross-platform HTML5 based user interfaces with a single code base. All you need to start building applications using the framework is knowledge of some JavaScript and some HTML5.

Building a mobile application with jQuery Mobile is extremely easy and fast

One of the biggest advantages of jQuery Mobile is that even as a newcomer to the framework you can develop a rough version of your application in a matter of days, as opposed to native application development for Android and iOS, where the learning curve is considerably steeper.

To make your application mobile just include the jQuery Mobile files in your header

Add the data attributes to your HTML markup

And this is the result:

[Image: screenshot of the resulting page]

The framework takes care of applying the styles and positioning the elements to fit and scale nicely on any device. With just a couple of lines of HTML we have a template for our mobile application. Of course, you can add your own CSS styles to tweak the look to your liking. Also, with the 1.0 release, support for the ThemeRoller tool has been added, so quickly theming your application is a breeze.

The important parts of the HTML snippet above are the HTML5 data-* attributes. The framework relies on these data attributes to do its magic. Older browsers will ignore these attributes and just render the regular HTML. Basically, the layout of a jQuery Mobile application is defined by the data-role attribute. In the above example we can see that we have divs with the data-roles "header", "content" and "footer", all of which belong to the main page wrapper. We also added a data-role on the list element so that the framework “widgetizes” it.

By default, linking to other pages is done automatically via Ajax in a single-page model. All you need to do is define standard links and the framework will take care of the rest. If the device doesn't support Ajax for some reason, the framework falls back to standard HTTP requests. Another cool feature of the framework is linking within a multi-page document. A single HTML document can contain many page containers, each defined by a data-role of “page”, so a linked page is displayed immediately when clicked. When these techniques are combined with the page transition effects, the user experience feels almost like a native app.

Web vs Native

Currently the biggest argument against wider adoption of mobile web applications is their limited access to the features of the client device. A common way to address this issue is to combine the framework with tools like PhoneGap and deliver the application as a native one. This also reduces the performance problems, since all of the resources can be loaded from the device.

Hopefully we will see quick adoption of the WebAPI specifications proposed by Mozilla, which define a standard set of APIs that would give the browser high-level access to device components: GPS, filesystem, accelerometer, dialer, messaging and so on.

Conclusion

While all of the features provided by the jQuery Mobile framework are really cool and exciting, does the framework really provide a mature, native-like feel and functionality to the applications built with it?

The jQuery team has done a great job of providing a nicely packaged framework with lots of features: a progressive enhancement approach, a theming framework, unified widgets, modular design and great cross-browser compatibility, even for older and less capable browsers and devices. But there is much left to be desired in the performance area. Even when trying the simple examples on high-end devices, the page transition effects and the general user flow feel a little clunky. This is the area where jQuery Mobile falls short when compared to native applications and even to other mobile application frameworks (jQTouch or Sencha), but it will probably be fixed in the near future as the rough edges are polished.