hoodwink.d enhanced
 

juretta.com

Loading the Netflix Dataset into MySQL | April 24, 2007-->

April 24, 2007

I just couldn't resist. The by now well known Netflix Prize is a great opportunity to do a little bit research about recommendation engines (recommender).

I decided to use a plain MySQL 5 database to store the Netflix dataset even though the common wisdom seems to believe that a database isn't the best solution for this problem.

But first things first. I needed to import the data into the database and i'm going to describe a really fast way to import the Netflix data set.

The training_set directory contains 17770 files with movie ratings. Each filename contains the movie id (mv_0009569.txt is the ratings file for the movie with the id 9569). I did a quick check to ensure that the movie ids contained in the file are matching the movie ids contained in the filename. They do match - so when can skip the first line of each ratings file.

  1. Create the necessary tables to hold the training data (ratings) and movie titles.
  2. Insert the movie titles using simple INSERT statements (Fast enough for 17770 titles)
  3. Insert the movie ratings by using the LOAD DATA INFILE statements for each of the 17770 rating files in the training_set directory.
  4. Optional: Create one or more indices (may take a while to create!)

Download the script here

#!/usr/bin/env ruby
#
#  Created by Stefan Saasen on 2007-04-24.
#  Copyright (c) 2007. All rights reserved.
require 'dbi'           
# Configuration
DB_DATABASE   = 'netflix'
DB_USER       = 'YOUR DB USERNAME'
DB_PASSWORD   = 'YOUR DB PASSWORD'
               
# Directory with netflix dataset
NETFLIX_DOWNLOAD_DIR = File.join(File.dirname(__FILE__), "..", "..", "download")

NETFLIX_MOVIE_TITLES = File.join(NETFLIX_DOWNLOAD_DIR, "movie_titles.txt")
NETFLIX_TRAINING_SET = File.join(NETFLIX_DOWNLOAD_DIR, "training_set")

# Profile...
start = Time.now

# Connect to the MySQL database
DBI.connect("DBI:Mysql:#{DB_DATABASE}", DB_USER, DB_PASSWORD) do |dbh|
  # create the necessary tables
  stmts = DATA.read.split(";").delete_if{|stmt| stmt.strip.empty?}
  stmts.each do |stmt|
    # Execute the current SQL statement
    dbh.do(stmt)
  end
  
  # 1. Insert movie titles into movie_titles table using simple INSERT STATEMENTS
  sql = "INSERT INTO movie_titles (id, year_of_release, title) VALUES(?,?,?)"
  # It would be even more efficient to use INSERT statements that 
  # use VALUES syntax which can insert multiple rows. 
  # e.g. INSERT INTO movie_titles (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
  dbh.prepare(sql) do |sth|
    File.open(NETFLIX_MOVIE_TITLES) do |f|
       f.each do |line|
         # Execute the prepared INSERT Statement
         id, year_of_release, title = line.split(",")
         sth.execute id.to_i, year_of_release.strip, title.strip
       end
    end
  end
  
  # 2. Insert rating data from training_set using the fast LOAD DATA INFILE syntax
  # The filename contains the movie id! Skip the first line (IGNORE 1 LINES)
  sql = "LOAD DATA INFILE ? INTO TABLE ratings FIELDS TERMINATED BY ',' "+
        "IGNORE 1 LINES (customer_id, rating, date) SET movie_id = ?;"
  dbh.prepare(sql) do |sth|
    Dir[NETFLIX_TRAINING_SET + "/*.txt"].each do |trs_file|
      puts "Importing #{trs_file}..."
      if trs_file =~ /([0-9]+)\.txt$/
        movie_id = $1.dup.to_i 
      else
        raise "Missing movie_id (file: #{trs_file})"
      end
      # Execute the prepared statement using the .txt file and the movie_id
      sth.execute File.expand_path(trs_file), movie_id
    end
  end
  
  # Add index for certain columns
  #puts "Creating index on the ratings table - may take a while"
  #["CREATE INDEX mid_idx ON ratings (...)", "CREATE INDEX cid_idx ON ..."].each{|stmt| dbh.do(stmt)}             
end
puts "Import successfully finished!\n"
puts "The netflix data import took #{sprintf("%0.2f", (Time.now - start)/60.0)} minutes"

__END__
DROP DATABASE IF EXISTS netflix;
CREATE DATABASE netflix /*!40100 DEFAULT CHARACTER SET latin1 */;
USE netflix;

DROP TABLE IF EXISTS `movie_titles`;
CREATE TABLE `movie_titles` (
  `id` int(11) NOT NULL,
  `year_of_release` YEAR(4) default NULL,
  `title` VARCHAR(250),
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `ratings`;
CREATE TABLE `ratings` (
  `id` int(11) NOT NULL auto_increment,
  `movie_id` int(11) NOT NULL,
  `customer_id` int(11) NOT NULL,
  `date` date default NULL,
  `rating` int(1),
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;


Diggman

About

juretta.com is the personal workspace of Stefan Saasen. You can send him an email or read more about this site in the „About“ section.

« Previous entry

Attachr.com: OpenID support added
posted about 1 year ago

» Next entry

svn: Can't convert string from 'UTF-8...
posted about 1 year ago

Recent comment

On: “New Zealand Daylight Saving Time Change: Mac OS X and Java

very usefull,thanks,i linked it on my blog ispig:hobix

posted 6 months ago by foxcamel

Look!

Latest links  RSS  

More...