Loading Semi-structure and Unstructured Data in Snowflake
Semi-Structured data —
Semi-structured data is information that does not reside in a relational database but that has some organizational properties that make it easier to analyze. With some processes, you can store them in the relation database (it could be very hard for semi-structured data), but Semi-structured exist to ease space. Example: XML data.
Unstructured data —
Unstructured data is a data which is not organized in a predefined manner or does not have a predefined data model; thus it is not a good fit for a mainstream relational database. So for Unstructured data, there are alternative platforms for storing and managing, it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, Text, Media logs.
Now above two points gives a clear picture about Semi-structure and Unstructured Data. Okay, but the question arises here “If it’s a data, How can we process it? How can we get information from it?”
This article and article series will help you to understand and answer this question.
Initially, Semi-structure and Unstructured Data are stored in either in FILES or Streaming data and we have to some how load it into snowflake then we can perform all the process and visualization easily.
Steps to Load data in Snowflake. NOTE — This architecture will follow weather its Snowflake or any other service.
1. Create Stage
2. Load Raw data
3. Analyze and Parse
4. Flatten Load
To handles these types of data we use Data type — Variant which handles unstructured data
Data type — Variant Handles unstructured data
// create stage
CREATE OR REPLACE STAGE MANAGE_DB.EXTERNAL_STAGES.JSONSTAGE
URL = ‘s3://bucketsnowflake-jsondemo’
//create json file format
CREATE OR REPLACE file format MANAGE_DB.FILE_FORMATS.JSONFORMAT
TYPE = JSON;
//create target table with VARIANT data type (single column)
CREATE OR REPLACE table OUR_FIRST_DB.PUBLIC.JSON_RAW (
raw_file variant);
//Loading data into target table
COPY INTO OUR_FIRST_DB.PUBLIC.JSON_RAW
FROM @MANAGE_DB.EXTERNAL_STAGES.JSONSTAGE
file_format= MANAGE_DB.FILE_FORMATS.JSONFORMAT
files = (‘HR_data.json’);
SELECT * FROM OUR_FIRST_DB.PUBLIC.JSON_RAW
Now, this is what JSON looks like in single row or single records in RDBMS but this single records contains many information.
{
“city”: “Bakersfield”,
“first_name”: “Portia”,
“gender”: “Male”,
“id”: 1,
“job”: {
“salary”: 32000,
“title”: “Financial Analyst”
},
“last_name”: “Gioani”,
“prev_company”: [],
“spoken_languages”: [
{
“language”: “Kazakh”,
“level”: “Advanced”
},
{
“language”: “Lao”,
“level”: “Basic”
}
]
}
By looking the JSON structure, first question crosses our mind “How we going to retrieve data from JSON? What would be the format to pull required information?”
Follow the below format –
SELECT COLUMN_NAME:NAME FROM TABLE WHERE CONDITION (NAME/VALUE FORMAT OF JSON)
SELECT RAW_FILE:city FROM OUR_FIRST_DB.PUBLIC.JSON_RAW
SELECT $1:first_name FROM OUR_FIRST_DB.PUBLIC.JSON_RAW //$1 REPRESENTS FIRST COLUMN
SELECT RAW_FILE:first_name::string AS FIRST_NAME FROM OUR_FIRST_DB.PUBLIC.JSON_RAW
//handling array in JSON
SELECT RAW_FILE:prev_company as prev_company FROM OUR_FIRST_DB.PUBLIC.JSON_RAW;
SELECT RAW_FILE:prev_company[1]::STRING as prev_company FROM OUR_FIRST_DB.PUBLIC.JSON_RAW;
//Dealing with hierarchy
select raw_file:spoken_languages from our_first_db.public.json_raw
select raw_file:spoken_languages[0] from our_first_db.public.json_raw
select raw_file:spoken_languages[0]:language from our_first_db.public.json_raw
FOR PARQUTE
//PARQUTE
CREATE OR REPLACE FILE FORMAT MANAGE_DB.FILE_FORMATS.PARQUET_FORMAT
TYPE = ‘parquet’;
CREATE OR REPLACE STAGE MANAGE_DB.EXTERNAL_STAGES.PARQUETSTAGE
url = ‘s3://snowflakeparquetdemo’
FILE_FORMAT = MANAGE_DB.FILE_FORMATS.PARQUET_FORMAT
LIST @MANAGE_DB.EXTERNAL_STAGES.PARQUETSTAGE;
SELECT * FROM @MANAGE_DB.EXTERNAL_STAGES.PARQUETSTAGE;
{
“__index_level_0__”: 7,
“cat_id”: “HOBBIES”,
“d”: 489,
“date”: 1338422400000000,
“dept_id”: “HOBBIES_1”,
“id”: “HOBBIES_1_008_CA_1_evaluation”,
“item_id”: “HOBBIES_1_008”,
“state_id”: “CA”,
“store_id”: “CA_1”,
“value”: 12
}
SELECT $1:cat_id ,$1:__index_level_0__ FROM @MANAGE_DB.EXTERNAL_STAGES.PARQUETSTAGE WHERE $1:__index_level_0__ > 100;
SELECT
$1:__index_level_0__::int as index_level,
$1:cat_id::VARCHAR(50) as category,
DATE($1:date::int ) as Date,
$1:”dept_id”::VARCHAR(50) as Dept_ID,
$1:”id”::VARCHAR(50) as ID,
$1:”item_id”::VARCHAR(50) as Item_ID,
$1:”state_id”::VARCHAR(50) as State_ID,
$1:”store_id”::VARCHAR(50) as Store_ID,
$1:”value”::int as value,
METADATA$FILENAME as FILENAME,
METADATA$FILE_ROW_NUMBER as ROWNUMBER,
TO_TIMESTAMP_NTZ(current_timestamp) as LOAD_DATE //NTZ IS A NO TIMEZONE FUNCTIONS
FROM @MANAGE_DB.EXTERNAL_STAGES.PARQUETSTAGE;
What we have observed, loading data required a Stage table and Target table.
To load data between Stage and Target table we need copy script and use SQL query to pull the required information.
Copy option in Snowflake DW. This article will help you to… | by Sarvesh Pandey | Jul, 2022 | Medium