• Quick data import and linking in Rails

    Some web applications have to ingest an enormous amount of new data on a regular basis. Import scripts easily become an ever-growing procedural mess that is annoying to maintain. In this post I show a bit of code which can be used to simplify and unify such import scripts.

    Assume you have a pipeline of post-import steps to run. This can be organized in numerous ways. The simplest is to call a bunch of methods one after the other once the data is loaded:

    link_frobnitz
    spin_really_fast_around_z_axis
    reticulate_splines
    deploy_hamsters

    Now, assume one of the steps occasionally fails for an unexpected reason. You know, it’s rare that data from external sources is as clean as we’d like. So you need to fix a few things and retry the import. However, as data sizes grow, and with them the running time of the import, it becomes a huge waste to redo all the work because a misplaced comma made the final deploy_hamsters step fail.

    Exceptions are the obvious way to report fatal data errors, and implicit or explicit transactions ensure the consistency of the import. But how can the two easily be combined into a resume-friendly import mechanism?

    Enter the bulk importer step runner with trivial progress reporting:

    def import_updaters
      all_steps.each do |step_name|
        run_import_step(step_name)
      end
    end
    
    private
    
    def run_import_step step_name
      puts "Running #{step_name}"
    
      # Each step runs in its own transaction, so a failing step rolls
      # back cleanly while already-completed steps stay committed.
      ImportModel.transaction do
        self.send(step_name)
      end
    
    rescue => e
      STDERR.puts "\nImport error: #{e.inspect}\n#{e.backtrace.join("\n")}"
      STDERR.puts "Please resume at step #{step_name}"
      exit 1
    end
    
    protected
    
    def all_steps
      [
        :link_frobnitz,
        :spin_really_fast_around_z_axis,
        :reticulate_splines,
        :deploy_hamsters
      ]
    end

    Notice you obviously have to change the model name (ImportModel above) and provide the actual implementations of the individual steps. all_steps returns the list of methods to run, run_import_step runs a single step with error handling, and import_updaters runs all the relevant updaters.
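    For concreteness, here is a rough sketch of what one step implementation might look like. All the names (the row hashes, widget codes, the arguments to link_frobnitz) are made up for illustration; in a real Rails import these would be ActiveRecord models, and the raise is exactly what triggers the rollback-and-report behaviour of run_import_step:

```ruby
# Hypothetical step: attach each frobnitz row to the widget it names.
# Plain hashes stand in for ActiveRecord models in this sketch.
def link_frobnitz(frobnitz_rows, widgets_by_code)
  frobnitz_rows.map do |row|
    widget = widgets_by_code[row[:widget_code]]
    # Raising aborts the surrounding transaction, so a bad row
    # leaves no half-linked records behind.
    raise "Unknown widget code: #{row[:widget_code].inspect}" unless widget
    row.merge(widget_id: widget[:id])
  end
end

rows    = [{ name: "a", widget_code: "W1" }]
widgets = { "W1" => { id: 42 } }
link_frobnitz(rows, widgets)
# => [{ name: "a", widget_code: "W1", widget_id: 42 }]
```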

    Easy performance statistics

    As a bit of bonus functionality, the following can be used to report import progress with timing statistics after each step completes:

    require "benchmark"
    
    def report_progress message, &block
      STDERR.print message
    
      if block_given?
        time = Benchmark.measure { yield }
        formatted_time = "%.2fs" % time.real
    
        STDERR.puts " - #{formatted_time}"
      else
        STDERR.puts
      end
    end

    Usage is simple - just call report_progress with a message to print and a block of code, like this:

    def run_import_step step_name
      report_progress "Running #{step_name}" do
        ImportModel.transaction do
          self.send(step_name)
        end
      end
      # ...
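    The error handler above tells you which step to resume at, but import_updaters as shown has no way to start mid-pipeline. One way to close that gap is a hypothetical resume_from parameter that skips the steps already committed. A minimal plain-Ruby sketch, with stub steps standing in for the real ones and the transaction wrapping omitted:

```ruby
# Sketch of resume support. The Importer class, resume_from parameter
# and stub step bodies are illustrative; a real import would wrap each
# step in ImportModel.transaction as shown earlier.
class Importer
  STEPS = %i[link_frobnitz spin_really_fast_around_z_axis
             reticulate_splines deploy_hamsters].freeze

  attr_reader :ran

  def initialize
    @ran = []
  end

  # Pass the step name printed by the error message to skip
  # everything that already ran and was committed.
  def import_updaters(resume_from: nil)
    steps = STEPS
    steps = steps.drop_while { |s| s != resume_from } if resume_from
    steps.each { |step| send(step) }
  end

  # Stub steps that just record their name, for illustration.
  STEPS.each { |name| define_method(name) { @ran << name } }
end

importer = Importer.new
importer.import_updaters(resume_from: :reticulate_splines)
importer.ran  # => [:reticulate_splines, :deploy_hamsters]
```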

    What do you use to make data-imports easier to manage?

  • Easy local code-review with git

    When multiple developers contribute to a project, keeping on top of the constant flow of changes can be a challenge. The following simple review workflow assumes a shared git repository with a fairly linear commit history, that is, one without too many merge commits.

    So, assuming a fairly linear history of commits from multiple developers, how do you easily keep track of what you have already read through and reviewed? Easy, use a local branch as a bookmark. This tiny script makes it trivial to add or update such a branch:

    #!/bin/sh
    NEW_BASE=${1?"Usage: $0 <treeish>"}
    
    git branch --force reviewed "$NEW_BASE" || exit 1
    
    echo "Marked as reviewed: $(git rev-parse --short reviewed)"

    Save this as a new file called reviewed.sh somewhere in your PATH and make it executable with chmod +x reviewed.sh.

    Usage is extremely simple:

    reviewed.sh 369b5cc
    reviewed.sh master
    reviewed.sh v1.0.7
    reviewed.sh HEAD~5

    Running one of these commands marks the given treeish as reviewed, and when you look at your commit history in a visual tool such as git gui or GitX, the reviewed branch indicates how far you have gotten. Note that commit IDs, branch names, tags, and relative revisions can all be used as the argument.

    You can also use this review bookmark from the command line. The following shows all commits added to master since your last review:

    git log reviewed..master --reverse

    You can add --patch to that command to see the full diff of each change. Adding --format=oneline instead shows just the commit IDs and the first line of each commit message.

    Once you’ve read all the latest commits on master, simply do a

    reviewed.sh master
    

    and you’re done.

    Why not use a tag?

    I find it convenient to be able to do a push of all tags to the central repository with

    git push --tags
    

    and this would share such a private review tag. As the reviewed mark is my private reminder of how far into the commit history I have reviewed, sharing it would just confuse other developers.

    Notice: any commits reachable only from the reviewed branch become unreferenced when you mark a new treeish as reviewed. Just something to keep in mind.
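    If that worries you: git keeps a reflog for the reviewed branch, so its old positions (and any commits only reachable from them) can usually be recovered for a while. The branch name rescue below is just an example:

```shell
# List the previous positions of the reviewed branch ...
git reflog show reviewed

# ... and resurrect the position before the last move as a new branch.
git branch rescue "reviewed@{1}"
```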

    How do you keep track of the flow of changes?
