Loading the Netflix Dataset into MySQL | April 24, 2007-->
April 24, 2007I just couldn't resist. The by now well known Netflix Prize is a great opportunity to do a little bit research about recommendation engines (recommender).
I decided to use a plain MySQL 5 database to store the Netflix dataset even though the common wisdom seems to believe that a database isn't the best solution for this problem.
But first things first. I needed to import the data into the database and i'm going to describe a really fast way to import the Netflix data set.
The training_set directory contains 17770 files with movie ratings. Each filename contains the movie id (mv_0009569.txt is the ratings file for the movie with the id 9569). I did a quick check to ensure that the movie ids contained in the file are matching the movie ids contained in the filename. They do match - so when can skip the first line of each ratings file.
- Create the necessary tables to hold the training data (ratings) and movie titles.
- Insert the movie titles using simple INSERT statements (Fast enough for 17770 titles)
- Insert the movie ratings by using the
LOAD DATA INFILEstatements for each of the 17770 rating files in the training_set directory. - Optional: Create one or more indices (may take a while to create!)
#!/usr/bin/env ruby # # Created by Stefan Saasen on 2007-04-24. # Copyright (c) 2007. All rights reserved. require 'dbi' # Configuration DB_DATABASE = 'netflix' DB_USER = 'YOUR DB USERNAME' DB_PASSWORD = 'YOUR DB PASSWORD' # Directory with netflix dataset NETFLIX_DOWNLOAD_DIR = File.join(File.dirname(__FILE__), "..", "..", "download") NETFLIX_MOVIE_TITLES = File.join(NETFLIX_DOWNLOAD_DIR, "movie_titles.txt") NETFLIX_TRAINING_SET = File.join(NETFLIX_DOWNLOAD_DIR, "training_set") # Profile... start = Time.now # Connect to the MySQL database DBI.connect("DBI:Mysql:#{DB_DATABASE}", DB_USER, DB_PASSWORD) do |dbh| # create the necessary tables stmts = DATA.read.split(";").delete_if{|stmt| stmt.strip.empty?} stmts.each do |stmt| # Execute the current SQL statement dbh.do(stmt) end # 1. Insert movie titles into movie_titles table using simple INSERT STATEMENTS sql = "INSERT INTO movie_titles (id, year_of_release, title) VALUES(?,?,?)" # It would be even more efficient to use INSERT statements that # use VALUES syntax which can insert multiple rows. # e.g. INSERT INTO movie_titles (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9); dbh.prepare(sql) do |sth| File.open(NETFLIX_MOVIE_TITLES) do |f| f.each do |line| # Execute the prepared INSERT Statement id, year_of_release, title = line.split(",") sth.execute id.to_i, year_of_release.strip, title.strip end end end # 2. Insert rating data from training_set using the fast LOAD DATA INFILE syntax # The filename contains the movie id! Skip the first line (IGNORE 1 LINES) sql = "LOAD DATA INFILE ? INTO TABLE ratings FIELDS TERMINATED BY ',' "+ "IGNORE 1 LINES (customer_id, rating, date) SET movie_id = ?;" dbh.prepare(sql) do |sth| Dir[NETFLIX_TRAINING_SET + "/*.txt"].each do |trs_file| puts "Importing #{trs_file}..." if trs_file =~ /([0-9]+)\.txt$/ movie_id = $1.dup.to_i else raise "Missing movie_id (file: #{trs_file})" end # Execute the prepared statement using the .txt file and the movie_id sth.execute File.expand_path(trs_file), movie_id end end # Add index for certain columns #puts "Creating index on the ratings table - may take a while" #["CREATE INDEX mid_idx ON ratings (...)", "CREATE INDEX cid_idx ON ..."].each{|stmt| dbh.do(stmt)} end puts "Import successfully finished!\n" puts "The netflix data import took #{sprintf("%0.2f", (Time.now - start)/60.0)} minutes" __END__ DROP DATABASE IF EXISTS netflix; CREATE DATABASE netflix /*!40100 DEFAULT CHARACTER SET latin1 */; USE netflix; DROP TABLE IF EXISTS `movie_titles`; CREATE TABLE `movie_titles` ( `id` int(11) NOT NULL, `year_of_release` YEAR(4) default NULL, `title` VARCHAR(250), PRIMARY KEY (`id`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1; DROP TABLE IF EXISTS `ratings`; CREATE TABLE `ratings` ( `id` int(11) NOT NULL auto_increment, `movie_id` int(11) NOT NULL, `customer_id` int(11) NOT NULL, `date` date default NULL, `rating` int(1), PRIMARY KEY (`id`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
