Loading Semi-structure and Unstructured Data in Snowflake

Sarvesh Pandey
4 min readAug 13, 2022
Loading Semi-structure and Unstructured Data in Snowflake

Semi-Structured data —
Semi-structured data is information that does not reside in a relational database but that has some organizational properties that make it easier to analyze. With some processes, you can store them in the relation database (it could be very hard for semi-structured data), but Semi-structured exist to ease space. Example: XML data.

Unstructured data —
Unstructured data is a data which is not organized in a predefined manner or does not have a predefined data model; thus it is not a good fit for a mainstream relational database. So for Unstructured data, there are alternative platforms for storing and managing, it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, Text, Media logs.

Now above two points gives a clear picture about Semi-structure and Unstructured Data. Okay, but the question arises here “If it’s a data, How can we process it? How can we get information from it?”

This article and article series will help you to understand and answer this question.

Initially, Semi-structure and Unstructured Data are stored in either in FILES or Streaming data and we have to some how load it into snowflake then we can perform all the process and visualization easily.

Steps to Load data in Snowflake. NOTE — This architecture will follow weather its Snowflake or any other service.

1. Create Stage

2. Load Raw data

3. Analyze and Parse

4. Flatten Load

To handles these types of data we use Data type — Variant which handles unstructured data

Data type — Variant Handles unstructured data

// create stage

CREATE OR REPLACE STAGE MANAGE_DB.EXTERNAL_STAGES.JSONSTAGE

URL = ‘s3://bucketsnowflake-jsondemo’

//create json file format

CREATE OR REPLACE file format MANAGE_DB.FILE_FORMATS.JSONFORMAT

TYPE = JSON;

//create target table with VARIANT data type (single column)

CREATE OR REPLACE table OUR_FIRST_DB.PUBLIC.JSON_RAW (

raw_file variant);

//Loading data into target table

COPY INTO OUR_FIRST_DB.PUBLIC.JSON_RAW

FROM @MANAGE_DB.EXTERNAL_STAGES.JSONSTAGE

file_format= MANAGE_DB.FILE_FORMATS.JSONFORMAT

files = (‘HR_data.json’);

SELECT * FROM OUR_FIRST_DB.PUBLIC.JSON_RAW

Loading Semi-structure and Unstructured Data in Snowflake

Now, this is what JSON looks like in single row or single records in RDBMS but this single records contains many information.

{

“city”: “Bakersfield”,

“first_name”: “Portia”,

“gender”: “Male”,

“id”: 1,

“job”: {

“salary”: 32000,

“title”: “Financial Analyst”

},

“last_name”: “Gioani”,

“prev_company”: [],

“spoken_languages”: [

{

“language”: “Kazakh”,

“level”: “Advanced”

},

{

“language”: “Lao”,

“level”: “Basic”

}

]

}

By looking the JSON structure, first question crosses our mind “How we going to retrieve data from JSON? What would be the format to pull required information?”

Follow the below format –

SELECT COLUMN_NAME:NAME FROM TABLE WHERE CONDITION (NAME/VALUE FORMAT OF JSON)

SELECT RAW_FILE:city FROM OUR_FIRST_DB.PUBLIC.JSON_RAW

Loading Semi-structure and Unstructured Data in Snowflake

SELECT $1:first_name FROM OUR_FIRST_DB.PUBLIC.JSON_RAW //$1 REPRESENTS FIRST COLUMN

SELECT RAW_FILE:first_name::string AS FIRST_NAME FROM OUR_FIRST_DB.PUBLIC.JSON_RAW

Loading Semi-structure and Unstructured Data in Snowflake

//handling array in JSON

SELECT RAW_FILE:prev_company as prev_company FROM OUR_FIRST_DB.PUBLIC.JSON_RAW;

SELECT RAW_FILE:prev_company[1]::STRING as prev_company FROM OUR_FIRST_DB.PUBLIC.JSON_RAW;

//Dealing with hierarchy

select raw_file:spoken_languages from our_first_db.public.json_raw

select raw_file:spoken_languages[0] from our_first_db.public.json_raw

select raw_file:spoken_languages[0]:language from our_first_db.public.json_raw

FOR PARQUTE

//PARQUTE

CREATE OR REPLACE FILE FORMAT MANAGE_DB.FILE_FORMATS.PARQUET_FORMAT

TYPE = ‘parquet’;

CREATE OR REPLACE STAGE MANAGE_DB.EXTERNAL_STAGES.PARQUETSTAGE

url = ‘s3://snowflakeparquetdemo’

FILE_FORMAT = MANAGE_DB.FILE_FORMATS.PARQUET_FORMAT

LIST @MANAGE_DB.EXTERNAL_STAGES.PARQUETSTAGE;

SELECT * FROM @MANAGE_DB.EXTERNAL_STAGES.PARQUETSTAGE;

Loading Semi-structure and Unstructured Data in Snowflake

{

“__index_level_0__”: 7,

“cat_id”: “HOBBIES”,

“d”: 489,

“date”: 1338422400000000,

“dept_id”: “HOBBIES_1”,

“id”: “HOBBIES_1_008_CA_1_evaluation”,

“item_id”: “HOBBIES_1_008”,

“state_id”: “CA”,

“store_id”: “CA_1”,

“value”: 12

}

SELECT $1:cat_id ,$1:__index_level_0__ FROM @MANAGE_DB.EXTERNAL_STAGES.PARQUETSTAGE WHERE $1:__index_level_0__ > 100;

SELECT

$1:__index_level_0__::int as index_level,

$1:cat_id::VARCHAR(50) as category,

DATE($1:date::int ) as Date,

$1:”dept_id”::VARCHAR(50) as Dept_ID,

$1:”id”::VARCHAR(50) as ID,

$1:”item_id”::VARCHAR(50) as Item_ID,

$1:”state_id”::VARCHAR(50) as State_ID,

$1:”store_id”::VARCHAR(50) as Store_ID,

$1:”value”::int as value,

METADATA$FILENAME as FILENAME,

METADATA$FILE_ROW_NUMBER as ROWNUMBER,

TO_TIMESTAMP_NTZ(current_timestamp) as LOAD_DATE //NTZ IS A NO TIMEZONE FUNCTIONS

FROM @MANAGE_DB.EXTERNAL_STAGES.PARQUETSTAGE;

Loading Semi-structure and Unstructured Data in Snowflake

What we have observed, loading data required a Stage table and Target table.

To load data between Stage and Target table we need copy script and use SQL query to pull the required information.

Copy option in Snowflake DW. This article will help you to… | by Sarvesh Pandey | Jul, 2022 | Medium

Loading Data in Snowflake. So, on the previous blog we have… | by Sarvesh Pandey | Jun, 2022 | Medium

https://sarveshpandey23.medium.com/snowflake-architecture-and-warehouse-b8d283a1510e?source=your_stories_page-------------------------------------

--

--

Sarvesh Pandey

Hi Everyone, Let me give you my quick Introduction. I am Azure Data Engineer and Microsoft Certified Azure DBA and Trainer.