Machine Learning with Ruby


Posted by Enrico on Apr 27, 2017

Who says that Ruby isn’t designed for machine learning? Well, it probably isn’t. Python, C++ and Java have their own machine learning frameworks and are more commonly used because university labs have historically run their experiments in those three languages. Having said that, I recently had to implement some machine learning algorithms for a couple of my projects, and I discovered that some machine learning solutions exist for Ruby as well.

I've seen companies using machine learning in different scenarios:

  • Sentiment analysis
    Think of getting opinions about any kind of content that could be interesting to your company.
  • Search relevance
    Think of improving the search results on users’ requests.
  • Content moderation
    Think of blog comments, profile pages, posts, photos, text, tweets, videos, and many more.
  • Data collection
    Think of automatically gathering data on your clients, prospects, investors, etc.
  • Data categorization
    Think of automatically categorizing large amounts of any kind of data.
  • Etc.

Based on my findings, two main options are available for Ruby developers who want to use machine learning.

Using the Weka resources under JRuby (Ruby on the JVM)

The easiest way to use the machine learning resources available for Java is to use JRuby instead of standard Ruby. JRuby is an implementation of the Ruby language that runs on the Java Virtual Machine (JVM), which gives you simultaneous access to most Java and Ruby libraries.
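To give an idea of what "simultaneous access" looks like, here is a minimal sketch that sorts a Ruby array with a Java collection. The helper names `jruby?` and `sorted_with_java` are mine, purely for illustration; the Java part only runs when the script is started with JRuby.

```ruby
# Returns true when this script is running on JRuby (i.e. on the JVM).
def jruby?
  RUBY_ENGINE == 'jruby'
end

# Hypothetical helper: mixes a Java collection with ordinary Ruby code.
# Returns nil on other Ruby implementations, where Java classes are unavailable.
def sorted_with_java(words)
  return nil unless jruby?

  require 'java'
  list = java.util.ArrayList.new        # a plain Java object
  words.each { |word| list.add(word) }  # calling a Java method from Ruby
  java.util.Collections.sort(list)      # calling a Java static method
  list.to_a                             # back to a Ruby Array
end

result = sorted_with_java(%w[tennis rugby golf])
puts(result ? result.inspect : 'Not running on JRuby, so Java classes are unavailable.')
```

The "weka" gem relies on this same mechanism to expose the Weka Java classes to Ruby code.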

If working in a JRuby environment is an option, then you can access the machine learning tools available in Java. I suggest using the “Weka” library, simply by adding the “weka” gem. For clarity, “Weka” is a collection of machine learning algorithms developed at the University of Waikato.

In practice, set up a JRuby environment and add the “weka” gem. I also suggest adding the “scalpel” gem, which the code below uses to split text into sentences.

 

 

-------------------------------------------------------------------
# Gemfile

source 'https://rubygems.org'
ruby '2.3.1', engine: 'jruby', engine_version: '9.1.7.0'
gem 'rails', '~> 5.0.0'


platform :jruby do
    github 'jruby/activerecord-jdbc-adapter', branch: 'rails-5' do
        gem 'activerecord-jdbc-adapter'           # Base database adapter
        gem 'activerecord-jdbcpostgresql-adapter' # Postgres adapter
        gem 'activerecord-jdbcmysql-adapter' # MySQL adapter
    end
    gem 'weka'    # this gives access to the weka library
    gem 'scalpel' # this gives access to text processing
end


-------------------------------------------------------------------

 

I had to pull the JDBC adapter from the rails-5 branch in order for my system to work in a Rails 5 environment, although I was not able to make the PostgreSQL database work on Heroku. I hope this issue will be solved soon.

Run "bundle install" and you should be OK.

Then you could use the following code:


text_to_classify = 'This is the new text to classify as tennis.'

# Initialization of the data

    # Creating the dataset to be used in the machine learning algorithm

        instances = create_dataset

    # Definition of training data
    # This should be an array of hashes in the form of [{text: "my_text", class: "class_1"}, ...]

        trainings = training_dataset # To be changed if the subject changes

    # Calculation of the instances from the training data
    # Ex: [0,1,47,rugby, 1,0,48,tennis, 0,1,39,rugby, 1,1,60,tennis, 0,1,23,rugby]

        instances = get_training_instances(trainings, instances)

# Training the algorithm

    classifier = train_algorithm(instances)

# Classifying the new text

    final_result = classify_new_item(text_to_classify, classifier)  
        # => "tennis"


# Gets the distribution of the classes

    classes_distribution = get_distribution(text_to_classify, classifier)
        # => {"tennis"=>0.75, "rugby"=>0.25}


# Evaluating the classifier

    evaluation = classifier.cross_validate(folds: 10).summary
    # This gives a summary of values indicating the quality of the solution


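# Definitions of the functions used above
# (in a plain Ruby script, these must be defined before the calls above are executed)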

    def create_dataset
        # https://github.com/paulgoetze/weka-jruby/wiki/Instances
        article_types = %i(rugby tennis) # This defines the article categories
        attribute_names = get_features("").keys # This gets the feature keys from the given text
        dataset = Weka::Core::Instances.new.with_attributes do
            attribute_names.each do |name|
                numeric(name)
            end
            nominal(:class, values: article_types, class_attribute: true)
        end
    end


    def training_dataset
        # This should be an array of hashes in the form of [{text: "bla", class: "bla"}, ...]
        training_1= {text: "this is my text, and 5 4 it includes 32 rugby for sure", class: "rugby" }
        training_2= {text: "this is my 2text, and it includes tennis for sure", class: "tennis" }
        training_3= {text: "come on, and it 4includes rugby for sure", class: "rugby" }
        training_4= {text: "this is your text, and it f gh tennis and rugby for sure", class: "tennis" }
        training_5= {text: "includes^ rugby 2forhgfg", class: "rugby" }
        training_6= {text: "this is myefgh fgtext, and 5 4 it includes 32 rugby for sure", class: "rugby" }
        training_7= {text: "this isd fglkm my 2text, and irtyrtytsdfsfghfg tennis for sure", class: "tennis" }
        training_8= {text: "come on, and it 4incls ds sdfudes rugby for sure", class: "rugby" }
        training_9= {text: "this is a text and it includes tennis and rugby for sure", class: "tennis" }
        training_10 = {text: "includesbgfh jujt^ rugby 2for sure", class: "rugby" }
        trainings = [training_1, training_2, training_3, training_4, training_5,
                           training_6, training_7, training_8, training_9, training_10]
    end


    def get_features(text)
        {
            # Specific features
                tennis_hints_count: match_count(text, 'tennis'),
                rugby_hints_count: match_count(text, 'rugby'),
            # Semi recurrent
                single_person_hints_count: terms_count(text, %w(I me my)),
                team_hints_count: terms_count(text, %w(we us our team)),
            # Recurrent generic features
                number_count: number_count(text),
                quote_count: quote_count(text),
                capitalized_words_count: capitalized_words_count(text),
                gender_dominance: gender_dominance(text),
                sentences_count: sentences_count(text),
                paragraphs_count: paragraph_count(text),
                words_per_sentence_average: words_per_sentence_average(text),
                text_length: text.length
        }
    end


    def match_count(text, word)
        text.scan(/#{word}/i).count
    end


    def number_count(text)
        text.scan(/\d+[\.,]\d+|\d+/).count
    end


    def quote_count(text)
        text.scan(/"[^"]+"/).count
    end


    def sentences_count(text)
        Scalpel.cut(text).count
    end


    def paragraph_count(text)
        text.split(/\n{2,}/).count
    end


    def capitalized_words_count(text)
        words = text.scan(/[\w'-]+/)
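        # Count only words that start with an uppercase letter
        # (comparing word[0] with word[0].upcase would also count digits and punctuation)
        words.count { |word| word =~ /\A[[:upper:]]/ }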
    end


    def gender_dominance(text)
        terms_count(text, %w(she her)) > terms_count(text, %w(he his)) ? 1 : 0
    end


    def terms_count(text, terms)
        words = text.scan(/[\w'-]+/)
        words.count { |word| terms.include?(word.downcase) }
    end


    def words_per_sentence_average(text)
        sentences = sentences_count(text)
        words = text.scan(/[\w'-]+/)
        # Float division, so that the average is not truncated to an integer
        sentences.zero? ? 0.0 : (words.count.to_f / sentences)
    end


    def get_training_instances(trainings, instances)
        trainings.each do |training|
            # Feature values for this text, e.g. [0, 1, 47, ...]
            features_values = get_features(training[:text]).values
            features_class = training[:class]

            # Note how we have to append the known class
            instances.add_instance(features_values + [features_class])
        end
        instances
    end


    def train_algorithm(instances)
        classifier = Weka::Classifiers::Trees::RandomForest.new
        classifier.use_options('-I 200') # Number of trees. The default is 100; more trees usually improve accuracy but slow down training.
        classifier.train_with_instances(instances)
        return classifier
    end


    def classify_new_item(text_to_classify, classifier)
        # The trailing ' ' is a placeholder for the class, which is still unknown
        text_to_classify_features = get_features(text_to_classify).values + [' ']
        final_result = classifier.classify(text_to_classify_features)
        # => "tennis"
    end


    def get_distribution(text_to_classify, classifier)
        text_to_classify_features = get_features(text_to_classify).values + [' ']
        final_result = classifier.distribution_for(text_to_classify_features)
        # => {"rugby"=>0.255, "tennis"=>0.745}
    end
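
To make the "instances" step more concrete, here is a stripped-down sketch in plain Ruby (no Weka, no Scalpel) that turns texts into rows of feature values followed by the known class. The helpers `count_matches`, `simple_features` and `training_row` and the two sample sentences are made up for this illustration; the real code above uses many more features.

```ruby
# Hypothetical helper: counts case-insensitive occurrences of a literal word.
def count_matches(text, word)
  text.scan(/#{Regexp.escape(word)}/i).count
end

# A reduced, illustrative feature set (three features instead of twelve).
def simple_features(text)
  {
    tennis_hints_count: count_matches(text, 'tennis'),
    rugby_hints_count:  count_matches(text, 'rugby'),
    text_length:        text.length
  }
end

# One training row: the feature values followed by the known class.
def training_row(example)
  simple_features(example[:text]).values + [example[:class]]
end

examples = [
  { text: 'Rugby is a team sport.',    class: 'rugby'  },
  { text: 'She won the tennis match.', class: 'tennis' }
]

rows = examples.map { |example| training_row(example) }
rows.each { |row| p row }
# [0, 1, 22, "rugby"]
# [1, 0, 25, "tennis"]
```

Each row has the same shape as what gets passed to add_instance: numeric features first, class last.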

 

 

Weka lets you choose among many algorithms; I picked RandomForest for this example. More information about using Weka under JRuby can be found in the weka-jruby wiki (https://github.com/paulgoetze/weka-jruby/wiki).
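
A word on the evaluation step: the listing asks for a cross-validation with 10 folds. The general idea is to split the data into k groups, train on k-1 of them, test on the remaining one, and repeat so that every group is used once for testing. Below is a minimal plain-Ruby sketch of that idea. The functions `k_fold_splits` and `cross_validate` are my own illustrative helpers, not the gem's code, and Weka's implementation does more than this (I believe it also shuffles the data, for instance).

```ruby
# Splits `items` into `folds` groups (round-robin) and returns one
# [training_set, test_set] pair per fold.
def k_fold_splits(items, folds)
  groups = Array.new(folds) { [] }
  items.each_with_index { |item, index| groups[index % folds] << item }

  groups.each_index.map do |i|
    test_set = groups[i]
    training_set = (groups[0...i] + groups[(i + 1)..-1]).flatten(1)
    [training_set, test_set]
  end
end

# Hypothetical helper: runs the block once per fold and averages the scores.
# The block should train on its first argument, test on the second,
# and return a score such as the accuracy.
def cross_validate(items, folds)
  scores = k_fold_splits(items, folds).map do |training_set, test_set|
    yield(training_set, test_set)
  end
  scores.inject(0.0) { |total, score| total + score } / scores.size
end

splits = k_fold_splits((1..10).to_a, 5)
splits.each do |training_set, test_set|
  puts "train on #{training_set.size} items, test on #{test_set.size}"
end
# prints "train on 8 items, test on 2" five times
```

With only ten training texts, as in my toy dataset, each of the 10 folds tests on a single item, so the resulting figures should be taken with a grain of salt.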

Using currently available Ruby gems

If using JRuby is not an option, you can still use some of the machine learning gems available. The advantage is that they run on standard Ruby, including Ruby on Rails, as usual; the disadvantage is that their functionality is quite limited.

Here is what I did with the “classifier-reborn” gem.

 

--------------------------------------------------

# Gemfile

gem 'classifier-reborn'


--------------------------------------------------

 

The "classifier-reborn" gem lets you classify using the LSI (Latent Semantic Indexing) and Bayes algorithms. Here is some example code for classifying text with the LSI classifier.

 


# Initialization of the classifier

    classifier = ClassifierReborn::LSI.new

# Training the classifier

    strings = [ ["This text deals with dogs. Dogs.", :dog],
                     ["This text involves dogs too. Dogs! ", :dog],
                     ["This text revolves around cats. Cats.", :cat],
                     ["This text also involves cats. Cats!", :cat],
                     ["This text involves birds. Birds.",:bird ] ]
    strings.each { |x| classifier.add_item x.first, x.last }

# Classification options

    classifier.search("dog", 3)
        # returns => ["This text deals with dogs. Dogs.",
        #             "This text involves dogs too. Dogs! ",
        #             "This text also involves cats. Cats!"]

    classifier.find_related(strings[2], 2)
        # returns => ["This text revolves around cats. Cats.", "This text also involves cats. Cats!"] 

    classifier.classify "This text is also about dogs!"
        # returns => :dog
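
LSI itself relies on a matrix decomposition that is too long to show here. As a rough intuition for how a search can rank texts by similarity, here is a much simpler technique: cosine similarity on raw word counts. To be clear, this is not LSI and not the gem's code; the functions `word_counts`, `cosine_similarity` and `search` and the sample documents are made up for this sketch, and there is no stemming, so "dog" would not match "dogs".

```ruby
# Turns a text into a Hash of lowercase word => number of occurrences.
def word_counts(text)
  text.downcase.scan(/[a-z']+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

# Cosine similarity between two word-count Hashes:
# 0.0 means no word in common, values near 1.0 mean very similar texts.
def cosine_similarity(a, b)
  dot = a.inject(0) { |total, (word, count)| total + count * b.fetch(word, 0) }
  norm = ->(counts) { Math.sqrt(counts.values.inject(0) { |total, c| total + c * c }) }
  denominator = norm.call(a) * norm.call(b)
  denominator.zero? ? 0.0 : dot / denominator
end

# Returns the `limit` documents most similar to the query.
def search(documents, query, limit)
  query_counts = word_counts(query)
  ranked = documents.sort_by { |doc| -cosine_similarity(word_counts(doc), query_counts) }
  ranked.first(limit)
end

documents = ['Dogs chase dogs.', 'Cats chase mice.', 'Birds sing.']
p search(documents, 'dogs', 1)
# ["Dogs chase dogs."]
```

LSI goes further by also relating words that tend to appear in the same texts, so it can match documents that share no literal word with the query.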

 


The following code is an example of how you could use the Bayes classifier. Its main advantage is the ability to return all the classes together with their scores, which gives you a ranking. This is particularly useful if your data can belong to more than one class.
 


# Initialization of the classifier

    classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'

# Training the classifier

    classifier.train "Interesting", "Here are some good words. I hope you love them."
    classifier.train "Uninteresting", "Here are some bad words, I hate you."

# Classification options

    classifier.classify "I hate bad words and you."
        # => 'Uninteresting'

    classifier.classifications("I hate bad words and you.")
        # => a hash of class => score; the scores are log-based (so usually negative) and the highest one wins
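
If you are curious about what happens under the hood, here is a tiny naive Bayes classifier in plain Ruby. It is a sketch under simplifying assumptions (add-one smoothing, no class priors, naive tokenization), not the gem's implementation; the `TinyBayes` class is hypothetical and only mimics the method names used above.

```ruby
# A tiny multinomial naive Bayes classifier with add-one (Laplace) smoothing.
# Illustrative only: class priors are ignored and tokenization is naive.
class TinyBayes
  def initialize(*categories)
    # category => (word => number of times seen in that category)
    @word_counts = categories.map { |category| [category, Hash.new(0)] }.to_h
  end

  def train(category, text)
    tokenize(text).each { |word| @word_counts.fetch(category)[word] += 1 }
  end

  # Returns category => log-score; the higher (closer to zero), the better.
  def classifications(text)
    vocabulary_size = @word_counts.values.flat_map(&:keys).uniq.size
    words = tokenize(text)
    @word_counts.map do |category, counts|
      total = counts.values.inject(0, :+)
      score = words.inject(0.0) do |sum, word|
        sum + Math.log((counts[word] + 1).to_f / (total + vocabulary_size))
      end
      [category, score]
    end.to_h
  end

  def classify(text)
    classifications(text).max_by { |_, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

classifier = TinyBayes.new('Interesting', 'Uninteresting')
classifier.train('Interesting', 'Here are some good words. I hope you love them.')
classifier.train('Uninteresting', 'Here are some bad words, I hate you.')
puts classifier.classify('I hate bad words and you.')
# prints Uninteresting
```

Summing logarithms instead of multiplying many small probabilities avoids numeric underflow on long texts, which is why the scores come out negative.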

 

 


    Enrico Tam

    MBA, PhD, tech entrepreneur, maker
