Some web applications have to ingest an enormous amount of new data on a regular basis. Import scripts easily become an ever-growing procedural mess that is annoying to maintain. In this post I show a bit of code that can be used to simplify and unify such import scripts.
Assume you have a pipeline of post-import steps to run. This can be organized in numerous ways. Simplest is to just have a bunch of methods called one after the other once you have the data loaded:
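For illustration, that could look something like this (the step names are invented, not from any particular application):

```ruby
# Hypothetical post-import steps, called one after the other.
def normalize_names
  # clean up whitespace, casing, etc. in imported names
end

def link_categories
  # connect imported records to existing categories
end

def update_counters
  # refresh cached counts after the import
end

def run_post_import
  normalize_names
  link_categories
  update_counters
end
```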
Now, assume once in a while one of the steps fails for an unexpected reason. You know, it’s rare that data from external sources is as clean as we’d like. So you need to fix a few things and retry the import. However, as data sizes grow, and with them the running time of the import, it can be a huge waste to redo all the work just because a misplaced comma made the final step fail.
Exceptions are the obvious way to report fatal data errors, and implicit or explicit transactions ensure consistency of the import. But how can these easily be combined into a resume-friendly import mechanism?
Enter the bulk importer step runner with trivial progress reporting:
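A minimal sketch of such a runner in plain Ruby; a real import would persist the set of completed steps (for example in the database) so a failed run can be resumed, and the step names here are placeholders:

```ruby
require "set"

class ImportModel
  def initialize
    # In a real import this set would be persisted, so a re-run
    # after a crash can skip the steps that already succeeded.
    @completed_steps = Set.new
  end

  # The ordered list of post-import methods to run.
  def all_steps
    [:normalize_names, :link_categories, :update_counters]
  end

  # Runs a single step with error handling. Already-completed
  # steps are skipped, which is what makes a retry cheap.
  def run_import_step(step)
    return true if @completed_steps.include?(step)
    puts "Running #{step}..."
    send(step)
    @completed_steps << step
    true
  rescue => e
    warn "Step #{step} failed: #{e.message}"
    false
  end

  # Runs all the relevant updaters, stopping at the first failure.
  def import_updaters
    all_steps.all? { |step| run_import_step(step) }
  end

  private

  # Placeholder step implementations; replace with real work.
  def normalize_names; end
  def link_categories; end
  def update_counters; end
end
```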
Notice you obviously have to change the model name (`ImportModel` above) and provide the actual implementations for the individual steps. `all_steps` returns the list of methods to run, `run_import_step` runs a single step with error handling, and `import_updaters` runs all the relevant updaters.
Easy performance statistics
As a bit of bonus functionality, a small helper can be used to report import progress with timing statistics after each step completes.
Usage is simple: just call `report_progress` with a comment to print and a block of code, like this:
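Here is a sketch, with the helper included so the example stands alone; the output format and the work done in the block are my guesses:

```ruby
# Prints the comment, runs the block, then reports how long it took.
def report_progress(comment)
  start = Time.now
  puts comment
  result = yield
  puts "  -> completed in %.2f seconds" % [Time.now - start]
  result # pass the block's return value through
end

# Example call; the "step" inside the block is purely illustrative.
report_progress("Summing a million numbers") do
  (1..1_000_000).sum
end
```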
What do you use to make data imports easier to manage?
When multiple developers contribute to a project, keeping on top of the constant flow of changes can be a challenge. The following simple review workflow assumes a shared git repository with a fairly linear commit history, that is, one without too many merge commits.
So, assuming a fairly linear history of commits from multiple developers, how do you easily keep track of what you have already read through and reviewed? Easy, use a local branch as a bookmark. This tiny script makes it trivial to add or update such a branch:
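A minimal reconstruction, assuming the script simply forces a local branch named `reviewed` to point at the given treeish, defaulting to HEAD:

```shell
#!/bin/sh
# Mark the given treeish (commit ID, branch name, tag, or relative
# ref like master~3) as reviewed by force-moving a local 'reviewed'
# branch to it. With no argument, the current HEAD is used.
git branch -f reviewed "${1:-HEAD}"
```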
Save this as a new file called, say, `git-reviewed`, make it executable, and put it somewhere on your `PATH`; git picks up executables named `git-<something>` as subcommands.
Usage is extremely simple:
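Assuming the script is saved as `git-reviewed` on your `PATH`, any treeish works as the argument; the values below are illustrative:

```shell
git reviewed master      # a branch name
git reviewed v1.2        # a tag
git reviewed 1a2b3c4     # a commit ID
git reviewed master~3    # a relative commit ID
```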
Running one of these commands will mark the given treeish as reviewed, and when you look at your commit history in a visual tool, the `reviewed` branch visually indicates how far you have gotten. Note how commit IDs, branch names, tags, and relative commit IDs can all be used as the argument.
You can also use this review bookmark from the command line. The following shows you all commits added to master since your last review:
```shell
git log reviewed..master --reverse
```
You can add `--patch` to that command to see the full diff for each change. Adding `--format=oneline` instead just shows you the commit IDs and the first line of each commit message.
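Spelled out, the two variants look like this:

```shell
# Full diff of each unreviewed commit, oldest first:
git log reviewed..master --reverse --patch

# Just the commit IDs and message summaries:
git log reviewed..master --reverse --format=oneline
```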
Once you’ve read all the latest commits on master, simply run the script again with master as the argument, and you’re done.
Why not use a tag?
I find it convenient to be able to push all tags to the central repository with

```shell
git push --tags
```

and that would share such a private review tag. As this is my private reminder of how far in the commit history I have reviewed, sharing it would just confuse other developers.
Notice: any commits that exist only on the `reviewed` branch become unreferenced when you mark a new treeish as reviewed. Just something to keep in mind.
How do you keep track of the flow of changes?