Machine learning is a branch of Artificial Intelligence (AI) concerned with the design and development of algorithms that allow computers to learn. It's a very broad subject, so we will focus on a simple example that uses statistical classification.
Let's build...
In this tutorial we are going to build a simple news classification application that will parse and classify RSS/HTML articles from the Times Live newspaper.
For the job, we will use the nokogiri gem and two Ruby standard libraries: open-uri and rss/2.0.
RSS Parser
To find articles to process, we could build a complex search engine, or we can simply use the RSS feeds the newspaper provides to discover links. The RssParser class does exactly that: you initialize it with a feed URL and it gives you back the links to all the articles found in that feed.
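A minimal sketch of what such a class might look like, assuming the feed is RSS 2.0. The method names article_urls and links_from are illustrative and are not taken from the original code.

```ruby
require "open-uri"
require "rss" # on Ruby 3.0+ this ships as a bundled gem rather than a default library

# Sketch of an RSS link extractor; method names are assumptions.
class RssParser
  def initialize(url)
    @url = url
  end

  # Downloads the feed and returns the article links it contains.
  def article_urls
    xml = URI.open(@url, &:read)
    self.class.links_from(xml)
  end

  # Parses a feed given as an XML string. Validation is switched off
  # because real-world feeds are often not strictly valid.
  def self.links_from(xml)
    feed = RSS::Parser.parse(xml, false)
    feed.items.map(&:link).compact
  end
end
```

Keeping the parsing in links_from, separate from the download, makes it possible to try the class on a saved feed without touching the network.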
HTML Parser
Once we have the article links, we need to parse each page and extract the meaningful parts. The HtmlParser class is initialized with a page URL and a DOM selector. In this example we use a CSS selector to extract the article content; Firebug and jQuery were used to find the selector for the text we want. The class also has a clean_whitespace method, which cleans the whitespace characters from the extracted text.
Statistical Classifier
We introduce a class that is responsible for classification of articles. It is initialized with a hash that consists of categories (keys) and training data (values).
Training data is used to discover potential relationships between articles and categories. This data should be carefully selected in order to give better classification results. It is created by determining the value for each word in the context of all words for that category (see the train_data() method).
In this example we use the content of the Wikipedia articles on economy, sport and health as training data for our categories.
When classifying articles we want to compare only meaningful words and ignore other words that do not add any value for a certain category. We (partially) solve this problem using stop words.
Finally, the scores() method creates the scores for each category (per text) that we are testing.
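To make the idea concrete, here is a small sketch of a classifier along these lines. It assumes that a word's value is its relative frequency within the category's training text and that a text's score is the sum of the values of its words. The stop word list is a tiny illustrative sample, and the classify helper is an addition for convenience.

```ruby
# Sketch of a frequency-based classifier; the weighting is an assumption.
class Classifier
  # A deliberately short sample list; a real one would be much longer.
  STOP_WORDS = %w[a an and are as at be by for from in is it of on or that the this to was with].freeze

  # training: hash of category => training text
  def initialize(training)
    @weights = training.transform_values { |text| train_data(text) }
  end

  # Value of each word = its share of all (non stop) words in the category.
  def train_data(text)
    words = tokenize(text)
    total = words.size.to_f
    words.tally.transform_values { |count| count / total }
  end

  # Score per category = sum of the trained values of the words in the text.
  def scores(text)
    words = tokenize(text)
    @weights.transform_values do |weights|
      words.sum(0.0) { |word| weights.fetch(word, 0.0) }
    end
  end

  # Category with the highest score.
  def classify(text)
    scores(text).max_by { |_category, score| score }.first
  end

  private

  # Lowercases, splits into words and drops stop words.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/) - STOP_WORDS
  end
end
```

For example, Classifier.new(economy: economy_text, sport: sport_text).scores(article) would return a hash with one score per category, where economy_text, sport_text and article are strings you supply.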
Let's have a go
Here is the script that runs the program:
Although our statistical classification algorithm is very simple, it can give remarkably good results provided the training data is good. For even better results you can try other classification algorithms like Bayesian probability and Latent Semantic Analysis.
If you are interested in a more in-depth example of a news aggregator application, you can check out newsagg on GitHub. It's a simple Sinatra application with a Redis datastore that we put together. It crawls, classifies and creates 'clusters' of articles using statistical algorithms.
If you want to learn more about machine learning, check out the books Programming Collective Intelligence (code examples written in Python) and Scripting Intelligence (code examples written in Ruby).
Redis (REmote DIctionary Server) is an in-memory key-value data store that can also persist its data to disk. It supports several data types (Strings, Hashes, Lists, Sets and Sorted Sets), implements the publish/subscribe messaging paradigm and has transactions.
All these different options place Redis in the NoSQL ecosystem somewhere between simple caching systems like memcache and feature-heavy document databases like MongoDB and CouchDB. The question is: when do you pick Redis over other NoSQL systems?
Give us some ACID
Before going into the use-cases, let's say one more important thing about Redis. Redis executes commands on a single thread, so each command runs atomically and in isolation from the others, which is what allows it to be ACID compliant (Atomicity, Consistency, Isolation and Durability). Other NoSQL databases generally don't provide ACID compliance, or they provide it only partially. By default Redis trades some durability for speed: the default fsync() policy is everysec, which means that, with the append-only file enabled, writes are flushed to disk once per second. Because Redis is very configurable, you can change how often it calls fsync() with the appendfsync configuration directive (with appendfsync always the system will fsync data after every write; it's slow but safest!).
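As an illustration, the relevant part of redis.conf might look like the fragment below. Note that appendfsync only has an effect when the append-only file is switched on with appendonly yes.

```conf
# Enable the append-only file so every write is logged to disk
appendonly yes

# fsync policy: pick one
appendfsync everysec   # default: flush once per second
# appendfsync always   # flush after every write: slowest, safest
# appendfsync no       # let the operating system decide when to flush
```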
When to use Redis?
You don't need to switch your whole production environment to Redis. You can just use it for the new things you are implementing. Always pick the right tool for the job. For stable, predictable and relational data, pick a relational database. For temporary, highly dynamic data, pick a NoSQL database; schema changes can be a big problem and can take forever in big relational databases.
If you have highly dynamic data that changes often, grows quickly and would otherwise force you to keep adjusting a schema to store it, then Redis can be a good choice.
If you need a more fully featured document-oriented database that lets you perform range queries, regular expression searches, indexing and MapReduce, you should check out MongoDB, CouchDB or similar. If you need simple caching with better expiration algorithms than Redis has, you should check out memcache.
Redis Use-Cases
Play with Redis (install, start and stop server)
Redis console
You can use redis-cli to connect to a local or remote Redis server and call commands. Here is an example (first connect to the server using: redis-cli -p 6379):
Here is an example using Ruby to execute commands on a Redis server. You need to install the redis gem first by executing gem install redis.
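A short sketch of what this can look like. The helper name redis_demo and the key names are made up for illustration; set, get, incr, rpush and lrange are regular methods of the redis gem's client.

```ruby
# Runs a few basic commands against `redis`, which is expected to be a
# client object from the redis gem (Redis.new).
def redis_demo(redis)
  redis.set("greeting", "hello")   # store a string value
  redis.incr("visits")             # atomically increment a counter
  redis.rpush("queue", "job-1")    # append an element to a list

  {
    greeting: redis.get("greeting"),
    visits:   redis.get("visits").to_i, # Redis returns values as strings
    queue:    redis.lrange("queue", 0, -1)
  }
end

# To run it against a local server (after `gem install redis`):
#   require "redis"
#   p redis_demo(Redis.new(host: "localhost", port: 6379))
```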
Learn more about Redis commands and give some speed to your web applications for free.
Real world Redis example
Finally, let's look at a real-world example of how Redis is used in Rubygems to cache gem download counts. To keep the code snippet short, some code is omitted and/or simplified.
First, a new redis object is instantiated in the initializer as the global variable $redis. This object is used in the Download model to update the download count for a gem with the self.incr(name) method and to read the download count for a gem with the self.for_rubygem(name) method. Rubygems uses a Sinatra application, Hostess, to speed up gem downloads. The Sinatra application is registered as middleware in the application.rb config file. It defines a get "/gems/*.gem" route, which triggers the download count to be updated in the Redis database.
Rubygems tracks more download stats as well: total downloads, downloads per gem, downloads for a specific gem version, etc. Check out the source code on GitHub for more details.
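The shape of the Download model described above can be sketched roughly as follows. This is not the actual Rubygems source: the key names are hypothetical, and $redis is assumed to have been set up in an initializer with something like $redis = Redis.new.

```ruby
# Simplified sketch of a download counter backed by Redis.
class Download
  COUNT_KEY = "downloads" # hypothetical key prefix

  # Called whenever a gem file is served: bumps the overall total
  # and the per-gem counter. INCR is atomic in Redis, so concurrent
  # downloads do not lose updates.
  def self.incr(name)
    $redis.incr(COUNT_KEY)
    $redis.incr("#{COUNT_KEY}:#{name}")
  end

  # Reads the download count for one gem; a missing key counts as 0.
  def self.for_rubygem(name)
    $redis.get("#{COUNT_KEY}:#{name}").to_i
  end
end
```

Because the counters live in Redis, the download route only needs to issue cheap increments and does not have to write to the relational database on every request.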