Tuesday, 15 April 2014

Creating some Pivotal Cloud Foundry (PCF) PHD services

After installing PHD add on for Pivotal Cloud Foundry 1.1 I quickly created some development services for PHD using the CLI as shown below.

[Tue Apr 15 22:40:08 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-hawq-cf free dev-hawq
Creating service dev-hawq in org pivotal / space development as pas...
OK
[Tue Apr 15 22:42:31 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-hbase-cf free dev-hbase
Creating service dev-hbase in org pivotal / space development as pas...
OK
[Tue Apr 15 22:44:10 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-hive-cf free dev-hive
Creating service dev-hive in org pivotal / space development as pas...
OK
[Tue Apr 15 22:44:22 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-yarn-cf free dev-yarn
Creating service dev-yarn in org pivotal / space development as pas...
OK

Finally using the web console to brow the services in the "Development" space


Wednesday, 9 April 2014

Pivotal Greenplum GPLOAD with multiple CSV files

I recently needed to setup a cron script which loaded CSV files from a directory into Greenplum every 2 minutes. Once loaded the files are moved onto Hadoop for archive purposes. The config below shows how to use GPLOAD data load utility which utilises GPFDIST.

1. Create a load table. In this example the data is then moved to the FACT table once the load is complete

  
drop table rtiadmin.rtitrans_etl4;

CREATE TABLE rtiadmin.rtitrans_etl4 (
    imsi character varying(82),
    subscriber_mccmnc character varying(10),
    msisdn character varying(82),
    imei character varying(50),
    called_digits character varying(50),
    start_datetime integer,
    end_datetime integer,
    first_cell_lac integer,
    first_cell_idsac integer,
    current_cell_lac integer,
    current_cell_idsac integer,
    dr_type integer,
    status character varying(50),
    ingest_time bigint,
    processed_time bigint,
    export_time bigint,
    extra_col text,
    gploaded_time timestamp without time zone
)
WITH (appendonly=true) DISTRIBUTED BY (imsi); 

2. GPLOAD yaml file defined as follows

VERSION: 1.0.0.1
DATABASE: mydb
USER: rtiadmin
HOST: 172.1.1.1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
            - loadhost
         PORT: 8100
         FILE:
          - /data/rti/stage/run/*.csv
    - COLUMNS:
          - imsi : text
          - subscriber_mccmnc : text
          - msisdn : text
          - imei : text
          - called_digits : text
          - start_datetime : text
          - end_datetime : text
          - first_cell_lac : integer
          - first_cell_idsac : integer
          - current_cell_lac : integer
          - current_cell_idsac : integer
          - dr_type : integer
          - status : text
          - ingest_time : bigint
          - processed_time : bigint
          - export_time : bigint
          - extra_col : text
    - FORMAT: text
    - HEADER: false
    - DELIMITER: ','
    - NULL_AS : ''
    - ERROR_LIMIT: 999999
    - ERROR_TABLE: rtiadmin.rtitrans_etl4_err
   OUTPUT:
    - TABLE: rtiadmin.rtitrans_etl4
    - MODE: INSERT
    - MAPPING:
           imsi : imsi
           subscriber_mccmnc : subscriber_mccmnc
           msisdn : msisdn
           imei : imei
           called_digits : called_digits
           start_datetime : substr(start_datetime, 1, 10)::int
           end_datetime : substr(end_datetime, 1, 10)::int
           first_cell_lac : first_cell_lac
           first_cell_idsac : first_cell_idsac
           current_cell_lac : current_cell_lac
           current_cell_idsac : current_cell_idsac
           dr_type : dr_type
           status : status
           ingest_time : ingest_time
           processed_time : processed_time
           export_time : export_time
           extra_col : extra_col
           gploaded_time : current_timestamp
   PRELOAD:
    - TRUNCATE : true 
    - REUSE_TABLES : true
   SQL:
    - AFTER : "insert into rtitrans select * from rtitrans_etl4" 

3. Call GPLOAD as follows

source $HOME/.bash_profile
gpload -f rtidata.yml


Note: We use the ENV variable as $PGPASSWORD which is used during the load if a password is required which was in this demo

Few things worth noting here.

REUSE_TABLES : This ensures the external tables created during the load are maintained and re-used on next load.

TRUNCATE: This clears the load table prior to load and we use this as we COPY the data once the load is finished into the main FACT table using the "AFTER"