Copy option in Snowflake DW
This article will help you to understand different feature/options available with Copy data command in Snowflake.
Understanding Copy Options -
Validation
Size Limit
Return Failed only
TruncateColumn
Force
Default Copy Syntax
Syntax-
Copy into <table_name>
From ExternalStage
Files = (‘File_name_1’,’File_name_2')
FILE FORMAT = <file format name>
Validation Mode -
Validate data files instead of loading them
To check wheater if any error or data is corrupted or not
Syntax -
Copy into <table_name>
From ExternalStage
Files = (‘File_name_1’,’File_name_2')
FILE FORMAT = <file format name>
VALIDATION_MODE = RETURN_n_ROWS | RETURN_ERROR
Lets create a Database and Table name in Snowflake for Data Validation
CREATE OR REPLACE DATABASE COPY_DB;
CREATE OR REPLACE TABLE COPY_DB.PUBLIC.ORDERS (
ORDER_ID VARCHAR(30),
AMOUNT VARCHAR(30),
PROFIT INT,
QUANTITY INT,
CATEGORY VARCHAR(30),
SUBCATEGORY VARCHAR(30));
// Prepare stage object
CREATE OR REPLACE STAGE COPY_DB.PUBLIC.aws_stage_copy
url=’s3://snowflakebucket-copyoption/size/’;
LIST @COPY_DB.PUBLIC.aws_stage_copy;
Before Loading, check the data by running the validation script
COPY INTO COPY_DB.PUBLIC.ORDERS
FROM @aws_stage_copy
file_format= (type = csv field_delimiter=’,’ skip_header=1)
pattern=’.*Order.*’
VALIDATION_MODE = RETURN_ERRORS
Output -
Another which will return ’N’ rows when there are no error in data validation. N = 5
COPY INTO COPY_DB.PUBLIC.ORDERS
FROM @aws_stage_copy
file_format= (type = csv field_delimiter=’,’ skip_header=1)
pattern=’.*Order.*’
VALIDATION_MODE = RETURN_5_ROWS
Output -
What happens when there are bad records??
Let’s create another stage object
// Prepare stage object
CREATE OR REPLACE STAGE COPY_DB.PUBLIC.aws_stage_copy
url=’s3://snowflakebucket-copyoption/returnfailed/’;
Run the validation script -
COPY INTO COPY_DB.PUBLIC.ORDERS
FROM @aws_stage_copy
file_format= (type = csv field_delimiter=’,’ skip_header=1)
pattern=’.*Order.*’
VALIDATION_MODE = RETURN_ERRORS
Output -
Working with Error -
Now we can see our validation get failed. What will be the next step? How we should remove those error so our data would be clean?
There will be lots of question arises after getting an Error. In Snowflake, approach is little bit different. Lets find out -
Method 1
First we will store the Error buy running few scripts -
// Storing rejected /failed results in a table
CREATE OR REPLACE TABLE rejected AS
select rejected_record from table(result_scan(last_query_id())); — Run this scripts it will give you error records from last run query
INSERT INTO rejected
select rejected_record from table(result_scan(last_query_id()));
SELECT * FROM rejected;
Method 2
Suppose we have to load data into target table and also check/save error records. We can achive this by below comand
COPY INTO COPY_DB.PUBLIC.ORDERS
FROM @aws_stage_copy
file_format= (type = csv field_delimiter=’,’ skip_header=1)
pattern=’.*Order.*’
ON_ERROR=CONTINUE
This will load data, only those records which don’t have any error.
Check the Status column. It clearly says there are some bad records in the file due to which its loaded only Partial Data.
select * from table(validate(orders, job_id => ‘_last’));
Error Details and File name can be found in using the select statement.
Key points to be noted -
select * from table(validate(table_name, job_id => Query_id));
If we place ‘_last’ this will give error records from last executed query. If you want for some other query, you will need Query ID of that execution command.
Then create a table and insert records just like Method 1
How to understand what error message is saying -
Run the select command over the table to check what kind of data is loaded and do the sum with rejected table records.
If you notice the error message is nothing but the records which are populated in single column.
Now, we will split the records into multiple and replace those with new records(modified records).
CREATE OR REPLACE TABLE rejected_values as
SELECT
SPLIT_PART(rejected_record,’,’,1) as ORDER_ID,
SPLIT_PART(rejected_record,’,’,2) as AMOUNT,
SPLIT_PART(rejected_record,’,’,3) as PROFIT,
SPLIT_PART(rejected_record,’,’,4) as QUATNTITY,
SPLIT_PART(rejected_record,’,’,5) as CATEGORY,
SPLIT_PART(rejected_record,’,’,6) as SUBCATEGORY
FROM rejected;
Size Limit -
This will help to load data into target table by specifying threshold file size. After the threshold reached data load will get stop. Size in bytes.
Note — The file size limit of for all file which will be loaded.
Syntax -
Copy into <table_name>
From ExternalStage
Files = (‘File_name_1’,’File_name_2')
FILE FORMAT = <file format name>
Size_limit = num;
Return Failed Only -
It is used when you want to specify those files which are failed to return while loading the data.
Syntax -
Copy into <table_name>
From ExternalStage
Files = (‘File_name_1’,’File_name_2')
FILE FORMAT = <file format name>
ON_Error = continue
RETURN_FAILED_ONLY = TRUE; //Default is FALSE
TRUNCATECOLUMS -
Suppose you created a table with column of 10 character length. But while loading the data there is one record which have char length greater then 10 i.e 12. Using this option those extra char will be truncated to avoid error while copying data into the Target table.
In simple terms, extra char will be truncated.
Syntax -
Copy into <table_name>
From ExternalStage
Files = (‘File_name_1’,’File_name_2')
FILE FORMAT = <file format name>
TRUNCATECOLUMNS = True;
//False will produce error while loading data into Target.
Force -
Specifies to load files weather they are already loaded or not changed since they have loaded.
Leads to duplication of data
Syntax -
Copy into <table_name>
From ExternalStage
Files = (‘File_name_1’,’File_name_2')
FILE FORMAT = <file format name>
FORCE = TRUE; //Default is FALSE
Some of my other blog -
Database in Snowflake. If I ask this question… | by Sarvesh Pandey | Jun, 2022 | Medium
Snowflake Architecture and Warehouse | by Sarvesh Pandey | Medium | Medium