Quick data import and linking in Rails

Some web-applications have to ingest an enormous amount of new data on a regular basis. Import scripts easily become an ever-growing procedural mess, annoying to maintain. In this post I show a bit of code which can be used to simplify and unify such import scripts.

Assume you have a pipeline of post-import steps to run. This can be organized in numerous ways. Simplest is to just have a bunch of methods called one after the other once you have the data loaded:

link_frobnitz
spin_really_fast_around_z_axis
reticulate_splines
deploy_hamsters

Now, assume once in a while one of the steps fail for an unexpected reason. You know, it’s rare data from external sources is as clean as we’d like. So you need to fix a few things and retry the import. However, as datasizes grow and with that the running time of the import, it can be a huge waste redoing all the work because of a misplaced comma made the final deploy_hamsters step fail.

Exceptions are the obvious way to report fatal data-errors, and implicit or explicit transactions to ensure consistency of the import. But how can this easily be combined for a resume-friendly import mechanism?

Enter the bulk importer step runner with trivial progress reporting:

def import_updaters
  all_steps.each do |step_name|
    run_import_step(step_name)
  end
end

private

def run_import_step step_name
  puts "Running #{step_name}"

  ImportModel.transaction do
    self.send(step_name)
  end

rescue => e
  STDERR.print "\nImport error:  #{e.inspect}\n#{e.backtrace.join("\n")}"
  STDERR.print "Please resume at step #{step_name}"
  exit 1
end

protected

def all_steps
  [
    :link_frobnitz,
    :spin_really_fast_around_z_axis,
    :reticulate_splines,
    :deploy_hamsters
  ]
end

Notice you obviously have to change the model-name (ImportModel above) and provide the actual implementation for these individual steps. all_steps returns the list of methods to run, run_import_step runs a single step with error-handling, and import_updaters runs all the relevant updaters.

Easy performance statistics

As a bit of bonus-functionality, the following can be used for reporting import progress with timing-statistics after each step completes:

def report_progress message, &block
  STDERR.print message

  if block_given?
    time = Benchmark.measure { yield }
    formatted_time = "%.2fs" % time.real

    STDERR.puts " - #{formatted_time}"
  else
    STDERR.puts
  end
end

Usage is simple - just call report_progress with a comment to print and a block of code, like this:

def run_import_step step_name
  report_progress "Running #{step_name}" do
    ImportModel.transaction do
      self.send(step_name)
    end
  end
  //...

What do you use to make data-imports easier to manage?