Friday, September 21, 2007

Custom Processors for ActiveWarehouse ETL

For anyone interested in extending ActiveWarehouse ETL with custom pre- or post-processors, I thought I would share this piece of code that I wrote in August for a personal project I am working on. The example should give you enough detail to create your own custom processors.


# Written by Susan Potter under open source MIT license.
# August 12, 2007.

require 'net/ftp'

module ETL
  module Processor
    # Custom pre-processor to download files via FTP before beginning control process.
    class FtpDownloaderProcessor < ETL::Processor::Processor
      attr_reader :host
      attr_reader :port
      attr_reader :remote_dir
      attr_reader :files
      attr_reader :username
      attr_reader :local_dir

      # configuration options include:
      # * host - hostname or IP address of FTP server (required)
      # * port - port number for FTP server (default: 21)
      # * remote_dir - remote path on FTP server (default: /)
      # * files - list of files to download from FTP server (default: [])
      # * username - username for FTP server authentication (default: anonymous)
      # * password - password for FTP server authentication (default: nil)
      # * local_dir - local output directory to save downloaded files (default: '')
      #
      # As an example you might write something like the following in your control process file:
      #   pre_process :ftp_downloader, {
      #     :host => 'ftp.sec.gov',
      #     :remote_dir => 'edgar/Feed/2007/QTR2',
      #     :files => ['20070402.nc.tar.gz', '20070403.nc.tar.gz', '20070404.nc.tar.gz',
      #                '20070405.nc.tar.gz', '20070406.nc.tar.gz'],
      #     :local_dir => '/data/sec/2007/04',
      #   }
      # The above example will anonymously download via FTP the first week's worth of SEC filing feed data
      # from the second quarter of 2007 and save the files to the local directory +/data/sec/2007/04+.
      def initialize(control, configuration)
        # Let the base class store control and configuration
        # (assumes its initialize takes the same two arguments).
        super
        @host = configuration[:host]
        @port = configuration[:port] || 21
        @remote_dir = configuration[:remote_dir] || '/'
        @files = configuration[:files] || []
        @username = configuration[:username] || 'anonymous'
        @password = configuration[:password]
        @local_dir = configuration[:local_dir] || ''
      end

      def process
        conn = Net::FTP.new
        begin
          # Connect explicitly so that the configured port is honoured.
          conn.connect(@host, @port)
          conn.login(@username, @password)
          @files.each do |f|
            # Binary mode keeps archives such as .tar.gz files intact.
            conn.getbinaryfile(remote_file(f), local_file(f))
          end
        ensure
          conn.close
        end
      end

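      private

      attr_accessor :password

      # An empty local_dir means the current directory; File.join('', name)
      # would otherwise produce an absolute path under /.
      def local_file(name)
        @local_dir.to_s.empty? ? name : File.join(@local_dir, name)
      end

      def remote_file(name)
        File.join(@remote_dir, name)
      end
    end
  end
end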

The key things to note from this example are that, at present, you are required to:
  • define all your custom processors within the ETL::Processor module
  • name your custom processor class in the form XXXXProcessor
  • extend (or really just adhere to the message interface of) the ETL::Processor::Processor class defined in ActiveWarehouse ETL
  • define initialize taking two arguments (look above for guidance)
  • define a process method to do what you need to do before or after the control process runs (for pre- and post-processors respectively)
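
To make the checklist above concrete, here is a minimal sketch of a second processor that follows the same rules. Everything specific to it is my own invention for illustration: the FileCleanupProcessor name, its :dir and :files options, and the stand-in base class, which only exists so the snippet runs without the gem installed and assumes the real base class has a two-argument initialize.

```ruby
require 'fileutils'
require 'tmpdir'

module ETL
  module Processor
    # Stand-in for the real base class so this sketch runs on its own.
    # It is skipped when ActiveWarehouse ETL has already defined it.
    unless const_defined?(:Processor, false)
      class Processor
        def initialize(control, configuration)
          @control = control
          @configuration = configuration
        end
      end
    end

    # Hypothetical post-processor that deletes the named files from a directory.
    class FileCleanupProcessor < ETL::Processor::Processor
      attr_reader :dir
      attr_reader :files

      # Two arguments, as required: the control object and the options hash.
      def initialize(control, configuration)
        super
        @dir = configuration[:dir] || '.'
        @files = configuration[:files] || []
      end

      # Runs once when the processor is invoked; missing files are ignored.
      def process
        @files.each do |name|
          FileUtils.rm_f(File.join(@dir, name))
        end
      end
    end
  end
end

# Exercise the processor by hand against a throwaway directory.
Dir.mktmpdir do |tmp|
  FileUtils.touch(File.join(tmp, 'stale.csv'))
  processor = ETL::Processor::FileCleanupProcessor.new(nil, { :dir => tmp, :files => ['stale.csv'] })
  processor.process
end
```

Going by the pre_process :ftp_downloader example above, I would expect a control file to pick this class up as post_process :file_cleanup, { :dir => '...', :files => [...] }, but I have only run it by hand as shown.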

Hope this helps someone customize ActiveWarehouse more easily, since the only bad thing I have found with ActiveWarehouse is the lack of documentation.
