by Hannu Krosing
In this talk I describe ways to do terabyte-scale, multi-machine data warehousing using PostgreSQL as the "storage and query processing layer" and the "Skype scalability triplets": pl/proxy, pgbouncer and the (largely Python-based) skytools for loading the data into the cluster. Easy map-reduce-style processing of huge data sets using pl/proxy, SQL, pl/pgsql and/or pl/pythonu is demonstrated, and the differences from typical NoSQL map-reduce are shown. Writing the "transform" part of ETL (extract-transform-load) as Python plugins in a near-real-time data collection pipeline for this kind of data warehouse is also demonstrated. Finally, a short comparison with other distributed data processing approaches is given, including guidance on which one to use for which task.
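The map-reduce part boils down to a per-partition "map" function that is fanned out with pl/proxy's RUN ON ALL, plus a plain-SQL "reduce" over the combined result on the proxy node. A minimal sketch, assuming a hypothetical pl/proxy cluster named 'dwcluster' and a page_views table that exists on every partition:

    -- On every partition node: the "map" step, pre-aggregating locally.
    CREATE FUNCTION partial_view_counts(OUT o_url text, OUT o_cnt bigint)
    RETURNS SETOF record AS $$
        SELECT url, count(*) FROM page_views GROUP BY url;
    $$ LANGUAGE sql;

    -- On the proxy node: same signature, but the body is pl/proxy,
    -- which runs the call on all partitions and returns the union of results.
    CREATE FUNCTION partial_view_counts(OUT o_url text, OUT o_cnt bigint)
    RETURNS SETOF record AS $$
        CLUSTER 'dwcluster';
        RUN ON ALL;
    $$ LANGUAGE plproxy;

    -- The "reduce" step is ordinary SQL over the partial results.
    SELECT o_url, sum(o_cnt) AS total_views
      FROM partial_view_counts()
     GROUP BY o_url
     ORDER BY total_views DESC;

The per-partition map function can just as well be written in pl/pgsql or pl/pythonu when plain SQL is not enough; the point of the pattern is that there is no separate job framework, only database functions and SQL aggregation on the proxy.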
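The "transform" plugins are, in this setup, small pieces of Python called for each event on its way from the queue into the warehouse tables. The sketch below only illustrates the shape of such a plugin; the event layout, the names and the fact-table row format are hypothetical, and the surrounding skytools/pgq consumer that would call it is indicated only by a comment:

    import json
    from datetime import datetime, timezone

    class UserEventTransform:
        """Hypothetical ETL 'transform' plugin: turns one raw queue event
        into zero or more rows destined for a warehouse fact table."""

        fact_table = "facts.user_events"

        def transform(self, ev_type, ev_data):
            # ev_data is assumed to be a JSON payload carried by the queue event.
            if ev_type != "user_action":
                return []  # uninteresting events are simply dropped
            payload = json.loads(ev_data)
            return [(
                int(payload["user_id"]),
                payload["action"],
                datetime.fromtimestamp(payload["ts"], tz=timezone.utc),
            )]

    # In the real near-real-time pipeline this transform would be invoked for
    # every event of a batch inside a skytools/pgq consumer, and the resulting
    # rows bulk-loaded (e.g. with COPY) into the fact table.
    if __name__ == "__main__":
        t = UserEventTransform()
        print(t.transform("user_action",
                          '{"user_id": 42, "action": "login", "ts": 1300000000}'))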