Loading the Netflix Dataset into MySQL
April 24, 2007I just couldn't resist. The by now well known Netflix Prize is a great opportunity to do a little bit research about recommendation engines (recommender).
I decided to use a plain MySQL 5 database to store the Netflix dataset even though the common wisdom seems to believe that a database isn't the best solution for this problem.
But first things first. I needed to import the data into the database and i'm going to describe a really fast way to import the Netflix data set.
The training_set directory contains 17770 files with movie ratings. Each filename contains the movie id (mv_0009569.txt is the ratings file for the movie with the id 9569). I did a quick check to ensure that the movie ids contained in the file are matching the movie ids contained in the filename. They do match - so when can skip the first line of each ratings file.
- Create the necessary tables to hold the training data (ratings) and movie titles.
- Insert the movie titles using simple INSERT statements (Fast enough for 17770 titles)
- Insert the movie ratings by using the
LOAD DATA INFILEstatements for each of the 17770 rating files in the training_set directory. - Optional: Create one or more indices (may take a while to create!)
#!/usr/bin/env ruby
#
# Created by Stefan Saasen on 2007-04-24.
# Copyright (c) 2007. All rights reserved.
require 'dbi'
# Configuration
DB_DATABASE = 'netflix'
DB_USER = 'YOUR DB USERNAME'
DB_PASSWORD = 'YOUR DB PASSWORD'
# Directory with netflix dataset
NETFLIX_DOWNLOAD_DIR = File.join(File.dirname(__FILE__), "..", "..", "download")
NETFLIX_MOVIE_TITLES = File.join(NETFLIX_DOWNLOAD_DIR, "movie_titles.txt")
NETFLIX_TRAINING_SET = File.join(NETFLIX_DOWNLOAD_DIR, "training_set")
# Profile...
start = Time.now
# Connect to the MySQL database
DBI.connect("DBI:Mysql:#{DB_DATABASE}", DB_USER, DB_PASSWORD) do |dbh|
# create the necessary tables
stmts = DATA.read.split(";").delete_if{|stmt| stmt.strip.empty?}
stmts.each do |stmt|
# Execute the current SQL statement
dbh.do(stmt)
end
# 1. Insert movie titles into movie_titles table using simple INSERT STATEMENTS
sql = "INSERT INTO movie_titles (id, year_of_release, title) VALUES(?,?,?)"
# It would be even more efficient to use INSERT statements that
# use VALUES syntax which can insert multiple rows.
# e.g. INSERT INTO movie_titles (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
dbh.prepare(sql) do |sth|
File.open(NETFLIX_MOVIE_TITLES) do |f|
f.each do |line|
# Execute the prepared INSERT Statement
id, year_of_release, title = line.split(",")
sth.execute id.to_i, year_of_release.strip, title.strip
end
end
end
# 2. Insert rating data from training_set using the fast LOAD DATA INFILE syntax
# The filename contains the movie id! Skip the first line (IGNORE 1 LINES)
sql = "LOAD DATA INFILE ? INTO TABLE ratings FIELDS TERMINATED BY ',' "+
"IGNORE 1 LINES (customer_id, rating, date) SET movie_id = ?;"
dbh.prepare(sql) do |sth|
Dir[NETFLIX_TRAINING_SET + "/*.txt"].each do |trs_file|
puts "Importing #{trs_file}..."
if trs_file =~ /([0-9]+)\.txt$/
movie_id = $1.dup.to_i
else
raise "Missing movie_id (file: #{trs_file})"
end
# Execute the prepared statement using the .txt file and the movie_id
sth.execute File.expand_path(trs_file), movie_id
end
end
# Add index for certain columns
#puts "Creating index on the ratings table - may take a while"
#["CREATE INDEX mid_idx ON ratings (...)", "CREATE INDEX cid_idx ON ..."].each{|stmt| dbh.do(stmt)}
end
puts "Import successfully finished!\n"
puts "The netflix data import took #{sprintf("%0.2f", (Time.now - start)/60.0)} minutes"
__END__
DROP DATABASE IF EXISTS netflix;
CREATE DATABASE netflix /*!40100 DEFAULT CHARACTER SET latin1 */;
USE netflix;
DROP TABLE IF EXISTS `movie_titles`;
CREATE TABLE `movie_titles` (
`id` int(11) NOT NULL,
`year_of_release` YEAR(4) default NULL,
`title` VARCHAR(250),
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
DROP TABLE IF EXISTS `ratings`;
CREATE TABLE `ratings` (
`id` int(11) NOT NULL auto_increment,
`movie_id` int(11) NOT NULL,
`customer_id` int(11) NOT NULL,
`date` date default NULL,
`rating` int(1),
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;