Snowflake SnowPro Study Guide

2023-09-26


Snowflake Study Guide

I recently obtained my SnowPro certification and I want to share my preparation materials with you. I gathered information from the Snowflake community, documentation, and online training through Udemy and Snowflake. I hope this collection of resources will help you in your journey towards becoming a SnowPro.

Having just passed the test, I strongly advise you to thoroughly read the guides and materials available at https://docs.snowflake.com/en/guides-overview. The test consists of 100 questions, and some of them are quite obscure; I found most of those obscure topics covered in the guides when I went back after taking the test. If you have any questions, feel free to DM me on LinkedIn.

General Information

  • Analytic Data Warehouse
  • Built by industry professionals from Oracle; it has the look and feel of a traditional DBMS
  • SaaS Offering
    • No hardware
    • No maintenance (handled by Snowflake)
    • Cloud only (no on premise)
  • Cloud
  • Decoupled compute and storage
  • Editions
    • Standard - Entry level product
    • Enterprise - Adds functionality such as multi-cluster warehouses, extended Time Travel (up to 90 days), materialized views, row-level and column-level security, object tagging, and classification
    • Business Critical - For organizations that need compliance support for sensitive data (HIPAA, PCI DSS, etc.) and higher levels of availability and failover
    • VPS (Virtual Private Snowflake) - A completely separate, private Snowflake environment
  • Workloads:
    • AI/ML
    • Applications (transactional capability added in 2022)
    • Data Engineering
    • Data Lake
    • Data Warehouse
    • Unistore (Data Cloud)

Costs

  • Function of Storage, Compute, and Edition
  • Costs are in the form of credits
  • Different regions/clouds may have different costs per credit.
    • Generally $2 per credit for Standard, $3 for Enterprise, and $4 for Business Critical
    • Credits are consumed by a running compute cluster
  • Pre-purchased storage capacity is approximately $23/TB per month, depending on the cloud and region
  • Data transfer costs
    • Same cloud, same region - free
    • Same cloud, different region - varies, starting at about $20/TB (US)
    • Different cloud, different region - varies, starting at about $90/TB
  • Snowflake Docs: Pricing
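
As a rough worked example (using the approximate list prices above): an X-Small warehouse consumes 1 credit per hour while running, so two hours on Enterprise is roughly 2 x $3 = $6. Actual credit consumption per warehouse can be checked from the ACCOUNT_USAGE share; a minimal sketch:

-- Credits consumed per warehouse over the last 7 days (ACCOUNT_USAGE views have some latency)
SELECT warehouse_name, SUM(credits_used) AS credits
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;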

Architecture

  • Snowflake is a self managed platform
  • No server or software maintenance
  • Hybrid of traditional shared-disk and shared-nothing database architectures
    • Shared Disk - Central data repository accessible by all nodes
    • Shared-nothing - Using MPP clusters where each node stores a portion of the entire data set locally
  • 3 Layers
    • Database Storage
    • Query Processing
    • Cloud Services

Layer 1: Database Storage

  • Data is stored in an internal, optimized, compressed columnar format (micro-partitions)
  • Snowflake manages how data is stored
  • Objects stored by Snowflake are not directly visible to customers

Layer 2: Query Processing

  • Queries are executed in this layer using virtual warehouses
  • Each virtual warehouse is an independent MPP compute cluster that does not share compute resources with other warehouses

Layer 3: Cloud Services

  • Collection of services that coordinate activities across Snowflake
  • Includes:
    • Authentication
    • Infrastructure Management
    • Metadata Management
    • Query parsing and optimization
    • Access Control
  • Customers don’t have direct access to change this layer
  • Terms associated with this layer are query planning, optimization, and compilation (note: not processing or design)

Connecting to Snowflake

Cache

  • Snowflake caches query results and data to improve performance
  • Metadata Cache
    • Cloud services layer
    • Improves compile time for queries and commonly used tables
  • Result Cache
    • Cloud services layer
    • Holds query results for 24 hours (each reuse resets the retention, up to a maximum of 31 days)
    • Available across warehouses and users
  • Local Disk / Warehouse Cache
    • Query processing layer (lives with the virtual warehouse)
    • Caches data used by SQL queries on the warehouse's local SSD and memory
    • Improves query performance; cleared when the warehouse is suspended
  • Remote Disk
    • Long-term storage (the storage layer itself)
    • Provides data resilience and durability (survives data center failures)
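
To see the result cache at work, run the same query twice; the second execution can return from the result cache without using the warehouse. For benchmarking, the result cache can be turned off per session; a minimal sketch:

-- Disable the result cache for this session (useful when benchmarking warehouse performance)
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Re-enable it (the default)
ALTER SESSION SET USE_CACHED_RESULT = TRUE;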

Virtual Warehouses (Compute Clusters)

  • Cluster of compute resources
  • 2 Types:
    • Standard
    • Snowpark-optimized - Recommended for workloads with large memory requirements
      • X-Small and Small sizes not available
  • Sizes:
    • X-Small
    • Small
    • Medium
    • Large
    • X-Large to 6X-Large
  • Any query or DML operation that processes data requires a running warehouse
  • Auto-suspend: default 10 minutes of inactivity
  • Auto-resume: the warehouse starts automatically as needed
  • Multi-cluster Warehouse
    • 1 to 10 clusters
    • Can autoscale out and in (horizontal scaling)
    • Increases the number of clusters as needed for performance and scales back down when load drops
    • Helps with concurrency, not with individual slow-running queries
  • Scaling Policy
    • Standard - favors starting additional clusters over conserving credits
    • Economy - favors conserving credits over starting additional clusters
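
A minimal multi-cluster warehouse sketch (multi-cluster requires Enterprise edition or higher; the name and settings are illustrative):

-- Multi-cluster warehouse that scales out for concurrency, not for single-query speed
CREATE WAREHOUSE IF NOT EXISTS wh_reporting
WAREHOUSE_SIZE = 'MEDIUM'
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 3          -- horizontal (scale-out) range
SCALING_POLICY = STANDARD      -- favor starting clusters over conserving credits
AUTO_SUSPEND = 300             -- seconds of inactivity before suspending
AUTO_RESUME = TRUE;

-- Resizing (scaling up/down) changes the per-cluster size and is a separate lever
ALTER WAREHOUSE wh_reporting SET WAREHOUSE_SIZE = 'LARGE';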

Data Structures

  • Snowflake is a column-store database that leverages micro-partitions
  • Data Types
    • Numeric Data Types
      • NUMBER - Default precision and scale are (38,0).
      • DECIMAL, NUMERIC - Synonymous with NUMBER.
      • INT, INTEGER, BIGINT, SMALLINT, TINYINT, BYTEINT - Synonymous with NUMBER except precision and scale cannot be specified.
      • FLOAT, FLOAT4, FLOAT8
      • DOUBLE, DOUBLE PRECISION, REAL - Synonymous with FLOAT.
    • String & Binary Data Types
      • VARCHAR - Default (and maximum) is 16,777,216 bytes.
      • CHAR, CHARACTER - Synonymous with VARCHAR except default length is VARCHAR(1).
      • STRING - Synonymous with VARCHAR.
      • TEXT - Synonymous with VARCHAR.
      • BINARY
      • VARBINARY - Synonymous with BINARY.
    • Logical Data Types
      • BOOLEAN - Currently only supported for accounts provisioned after January 25, 2016.
    • Date & Time Data Types
      • DATE
      • DATETIME - Alias for TIMESTAMP_NTZ
      • TIME
      • TIMESTAMP - Alias for one of the TIMESTAMP variations (TIMESTAMP_NTZ by default).
      • TIMESTAMP_LTZ - TIMESTAMP with local time zone; time zone, if provided, is not stored.
      • TIMESTAMP_NTZ - TIMESTAMP with no time zone; time zone, if provided, is not stored.
      • TIMESTAMP_TZ - TIMESTAMP with time zone.
    • Semi-structured Data Types
      • VARIANT
      • OBJECT
      • ARRAY
    • Geospatial Data Types
      • GEOGRAPHY
      • GEOMETRY
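
The semi-structured types are typically loaded into a VARIANT column and queried with path notation plus casts; a small illustrative sketch (table name and JSON are made up):

-- Store JSON in a VARIANT column and query it with colon/bracket notation
CREATE OR REPLACE TABLE demo_json (v VARIANT);

INSERT INTO demo_json
SELECT PARSE_JSON('{"city": "Austin", "pop": 964000, "tags": ["tx", "usa"]}');

SELECT
v:city::STRING AS city,
v:pop::NUMBER AS population,
v:tags[0]::STRING AS first_tag
FROM demo_json;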

Tables

  • All tables are automatically divided into micro-partitions: contiguous units of storage holding between 50 and 500 MB of uncompressed data (roughly 16 MB after compression)
  • Metadata stored per micro-partition: range of values, number of distinct values, additional properties
  • Enables extremely efficient DML and pruning using that metadata
  • Compression occurs after micro-partitioning, with Snowflake selecting the best compression scheme for each column
  • Clustering
    • Tables can be clustered using a clustering key specified in the CREATE or ALTER DDL
    • Automatic Clustering is the Snowflake service that seamlessly and continually manages all reclustering, as needed, of clustered tables.
    • You can ALTER TABLE ... SUSPEND / RESUME RECLUSTER if you want to disable or re-enable reclustering for a table
    • Reclustering is non-blocking and does not block DML on the table
    • Consumes credits, but you don’t have to run a warehouse (it is a serverless service)
  • Search Optimization
    • Table-level setting
    • Uses an additional persistent data structure that is designed for search
    • Only speeds up equality (point-lookup) searches, not range searches
    • Applied to all eligible columns by default
    • Use SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('TABLENAME') to estimate the costs of adding and maintaining search optimization
    • Enable with ALTER TABLE TABLENAME ADD SEARCH OPTIMIZATION
  • Types of Tables
    • Permanent - Explicit drop; 7 days of Fail-safe; Time Travel up to 90 days (Enterprise)
    • Temporary - Remains for the session only; 0 Fail-safe days; 0 or 1 day of Time Travel
    • Transient - Explicit drop; 0 Fail-safe days; 0 or 1 day of Time Travel
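
A short sketch of the table-type and clustering commands above (the table name is illustrative):

-- Transient table: no Fail-safe, at most 1 day of Time Travel
CREATE OR REPLACE TRANSIENT TABLE events_transient (id NUMBER, payload VARIANT);

-- Define a clustering key, then pause and resume automatic reclustering
ALTER TABLE events_transient CLUSTER BY (id);
ALTER TABLE events_transient SUSPEND RECLUSTER;
ALTER TABLE events_transient RESUME RECLUSTER;

-- Inspect how well the table is clustered for a given key
SELECT SYSTEM$CLUSTERING_INFORMATION('events_transient', '(id)');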

Views

  • Types of Views
    • Non-Materialized - Named definition of a query. Results are NOT stored anywhere. Performance is slower than materialized views.
    • Materialized - Defined the same as non-materialized, but behaves like a table. Faster access, but will have additional storage costs.
  • Secure Views
    • Both Materialized and Non-materialized can be a secure view.
    • Allow for additional data privacy
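
A minimal secure view sketch (the underlying table and columns are illustrative); the view definition is hidden from non-owners and some optimizations are bypassed in exchange for privacy:

-- Secure (non-materialized) view exposing only non-sensitive columns
CREATE OR REPLACE SECURE VIEW v_customer_public AS
SELECT id, country
FROM customers;

-- The is_secure column is visible in SHOW VIEWS output
SHOW VIEWS LIKE 'V_CUSTOMER_PUBLIC';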

Data Management

  • Stages
    • Internal - Snowflake managed space, can be used as a lake.
    • External - AWS, GCP, or Azure storage.
    • The PUT command uploads local files into an internal stage
    • COPY INTO loads data from a stage into a table
      • Makes use of a File Format. Each COPY INTO needs a format to interpret the data.
        • Formats with inherent structure, such as Parquet and JSON, can use INFER_SCHEMA.
      • The ON_ERROR option allows you to manage error handling (continue, skip file, abort statement, etc.)
      • Can truncate columns, transform data, etc. on load
    • Staged files can be removed with the REMOVE command.
  • Snowpipe
    • Enables automatic loading of data as soon as it appears in a bucket/stage
    • Data is made available shortly after it lands, without scheduled batch loads
    • Serverless feature that uses Snowflake-managed compute instead of your warehouses (no scaling to worry about)
    • Uses cloud event notifications (or the Snowpipe REST API) to trigger loads
  • Streams
    • A stream object records data manipulation language (DML) changes made to tables (INSERT, UPDATE, DELETE, etc) as well as metadata changes
  • Stored Procs
    • Similar to other DB platforms
  • UDFs - Can be created to extend Snowflake. Supports Java, JavaScript, Python, Scala, and SQL (see the sketch after this list).
    • External Functions
      • An external function calls code that is executed outside Snowflake.
      • A type of UDF
  • Orchestration
    • Tasks - A task can execute any one of the following types of SQL code:
      • Single SQL statement
      • Call to a stored procedure
      • Procedural logic using Snowflake Scripting
  • Classes
    • A Class is similar to a class in object oriented programming and serves as a blueprint for creating instances. An Instance is an object created from a Class. Classes and instances are schema-level objects in Snowflake.
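
A small sketch tying streams, a SQL UDF, and a task together (the source/target table names are illustrative; the warehouse name reuses WH_BLOG from the example code later in this post; the task only does work when the stream has new rows):

-- Track changes on a source table
CREATE OR REPLACE STREAM src_stream ON TABLE src_table;

-- Simple SQL UDF
CREATE OR REPLACE FUNCTION to_usd(amount NUMBER, rate NUMBER)
RETURNS NUMBER
AS 'amount * rate';

-- Scheduled task that consumes the stream only when data is present
CREATE OR REPLACE TASK load_changes
WAREHOUSE = WH_BLOG
SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('src_stream')
AS
INSERT INTO tgt_table
SELECT id, to_usd(amount, 1.1)
FROM src_stream
WHERE metadata$action = 'INSERT';

-- Tasks are created suspended; resume to start the schedule
ALTER TASK load_changes RESUME;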

Security

Roles

  • Roles are a logical grouping of privileges
  • The security structure is Privileges -> Roles -> Users
  • Roles Hierarchy:
    • ACCOUNTADMIN - All SECURITYADMIN and SYSADMIN functions
      • SYSADMIN - Can create warehouses, databases, schemas, etc.
        • Custom Roles - User-created roles should live here
      • SECURITYADMIN - All USERADMIN privileges; can manage objects and grants globally
        • USERADMIN - User and role management only
          • PUBLIC - Pseudo-role automatically granted to every user and role; can create and own objects, which are then available to everyone
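
A minimal RBAC sketch following the hierarchy above (the database and schema names come from the example code later in this post; the role and user names are illustrative; the custom role is granted back to SYSADMIN):

USE ROLE USERADMIN;
CREATE ROLE IF NOT EXISTS analyst;

USE ROLE SECURITYADMIN;
GRANT USAGE ON DATABASE snowflake_blog_temp TO ROLE analyst;
GRANT USAGE ON SCHEMA snowflake_blog_temp.myblog TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA snowflake_blog_temp.myblog TO ROLE analyst;

-- Tie the custom role into the hierarchy and hand it to a user
GRANT ROLE analyst TO ROLE SYSADMIN;
GRANT ROLE analyst TO USER some_user;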

API Auth and Secrets

  • Snowflake supports calling external API code and can store sensitive information such as keys and passwords as secrets
  • A secret is a schema-level object that stores sensitive information, limits access to it using RBAC, and is encrypted using the Snowflake key-encryption hierarchy
  • Secrets can be created and altered (CREATE SECRET / ALTER SECRET); see the sketch below
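
A hedged sketch of creating a generic secret (the name and value are illustrative; other TYPE options exist for passwords and OAuth):

-- Schema-level secret holding an opaque string, protected by RBAC and encrypted by Snowflake
CREATE OR REPLACE SECRET my_api_key
TYPE = GENERIC_STRING
SECRET_STRING = 'replace-with-real-value';

-- Rotate the value later
ALTER SECRET my_api_key SET SECRET_STRING = 'new-value';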

Network Policy and Rules

  • Allows access based on an IP allow list, or restricts access via a block list
  • ACCOUNTADMIN and SECURITYADMIN can create, read, update, and delete network policies
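
A minimal network policy sketch (the CIDR ranges are illustrative):

-- Allow a corporate range but block one address inside it
CREATE NETWORK POLICY corp_only
ALLOWED_IP_LIST = ('192.168.1.0/24')
BLOCKED_IP_LIST = ('192.168.1.99');

-- Apply at the account level (policies can also be applied per user)
ALTER ACCOUNT SET NETWORK_POLICY = 'CORP_ONLY';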

Federated Auth and SSO

  • User authentication is separated from user access through an independent identity provider (IdP)

SCIM

  • SCIM (System for Cross-domain Identity Management) enables automated provisioning and management of users and roles from an identity provider

Private Connectivity

  • Snowflake can work with private link services from the cloud providers (e.g., AWS PrivateLink, Azure Private Link)
  • Requires setup in the cloud provider account
  • Private link is a cloud provider service, not a Snowflake service

Key Management

MFA

  • Not enabled by default
  • The Duo application service communicates through TCP port 443.
  • alter account set allow_client_mfa_caching = true;

Data Governance

Tagging

  • Tags enable data stewards to monitor sensitive data
  • Schema level object that can be assigned to another object
  • Created with CREATE TAG cost_center;
  • Assign at creation time using WITH TAG (cost_center = 'IT');
  • Assign to an existing object using ALTER ... SET TAG cost_center = 'IT';
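
Putting the tagging commands together (the tag values and table name are illustrative):

CREATE TAG IF NOT EXISTS cost_center;

-- Assign at creation time
CREATE OR REPLACE TABLE tagged_tbl (id NUMBER) WITH TAG (cost_center = 'IT');

-- Assign or change the tag on an existing object
ALTER TABLE tagged_tbl SET TAG cost_center = 'FINANCE';

-- Read the tag value back
SELECT SYSTEM$GET_TAG('cost_center', 'tagged_tbl', 'table');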

Data Classification

  • Snowflake has built-in semantic and privacy tags available
    • SNOWFLAKE.CORE.SEMANTIC_CATEGORY
    • SNOWFLAKE.CORE.PRIVACY_CATEGORY
  • Helps to manage PII data tracking
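
A hedged sketch of the classification workflow (the table name comes from the example code later in this post; the category output can be passed straight to the tagging procedure):

-- Analyze columns and suggest semantic / privacy categories
SELECT EXTRACT_SEMANTIC_CATEGORIES('SNOWFLAKE_BLOG_TEMP.MYBLOG.TBL_PLAYER');

-- Apply the suggested SNOWFLAKE.CORE tags to the table's columns
CALL ASSOCIATE_SEMANTIC_CATEGORY_TAGS(
'SNOWFLAKE_BLOG_TEMP.MYBLOG.TBL_PLAYER',
EXTRACT_SEMANTIC_CATEGORIES('SNOWFLAKE_BLOG_TEMP.MYBLOG.TBL_PLAYER'));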

Data Masking and External Tokenization

  • Masking (Dynamic Data Masking)
    • Snowflake supports masking policies as schema-level objects to protect sensitive data from unauthorized access while allowing authorized users to access it at query runtime.
    • Can be used with data sharing
    • Masked values lose their analytical value for unauthorized users
  • External Tokenization
    • Enables accounts to tokenize data before loading it into Snowflake and detokenize the data at query runtime (via external functions).
    • Cannot be used with data sharing
    • Preserves analytical value after de-identification
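
A minimal dynamic masking sketch (the role, table, and column names are illustrative):

-- Policy: authorized roles see the real value, everyone else sees a masked value
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING)
RETURNS STRING ->
CASE WHEN CURRENT_ROLE() IN ('ANALYST_FULL') THEN val ELSE '*** MASKED ***' END;

-- Attach the policy to a column; enforcement happens at query runtime
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

-- Detach it again
ALTER TABLE customers MODIFY COLUMN email UNSET MASKING POLICY;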

Data Sharing

  • Only ACCOUNTADMIN can provision
  • Sharer = Provider
  • User = Consumer
  • Uses the Snowflake services layer and metadata store; no data is copied or moved
  • 1 database per share, but an unlimited number of shares on the account
  • A share consists of
    • Privileges that grant access to the database
    • Privileges that grant access to specific objects
    • The consumer accounts the database and objects are shared with
  • Consumer access is READ ONLY
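
A minimal provider-side sketch using objects from the example code later in this post (the consumer account identifier is illustrative):

-- Create the share and grant read access to the objects it exposes
CREATE SHARE blog_share;
GRANT USAGE ON DATABASE snowflake_blog_temp TO SHARE blog_share;
GRANT USAGE ON SCHEMA snowflake_blog_temp.myblog TO SHARE blog_share;
GRANT SELECT ON TABLE snowflake_blog_temp.myblog.tbl_player TO SHARE blog_share;

-- Add the consumer account(s)
ALTER SHARE blog_share ADD ACCOUNTS = myorg.consumer_account;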

Monitoring

  • Resource monitors can help to impose limits on the number of credits consumed
  • Actions can notify, suspend (after running queries finish), or suspend immediately
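
A minimal resource monitor sketch (the quota and thresholds are illustrative; ACCOUNTADMIN is required to create monitors; the warehouse name reuses WH_BLOG from the example code below):

-- Monthly credit quota with escalating actions
CREATE RESOURCE MONITOR blog_monitor
WITH CREDIT_QUOTA = 100
FREQUENCY = MONTHLY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS
ON 75 PERCENT DO NOTIFY
ON 100 PERCENT DO SUSPEND            -- suspend after running queries finish
ON 110 PERCENT DO SUSPEND_IMMEDIATE; -- cancel running queries and suspend

-- Attach to a warehouse (monitors can also be set at the account level)
ALTER WAREHOUSE WH_BLOG SET RESOURCE_MONITOR = blog_monitor;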

Example Code

Warehouse, Databases, Schemas, Tables, and Views

-- Create a warehouse
CREATE OR REPLACE WAREHOUSE WH_BLOG
WITH
WAREHOUSE_SIZE='XSMALL'
MAX_CLUSTER_COUNT = 1
MIN_CLUSTER_COUNT = 1
SCALING_POLICY = ECONOMY
AUTO_RESUME = TRUE
INITIALLY_SUSPENDED = TRUE;

USE WAREHOUSE WH_BLOG;

-- Create a Database
CREATE OR REPLACE database SNOWFLAKE_BLOG_TEMP;
Use database SNOWFLAKE_BLOG_TEMP;

-- Create a Schema
CREATE OR REPLACE schema myblog;
use schema myblog;

-- Create a Permanent Table
CREATE OR REPLACE table tbl_perm (
id integer,
val1 string
);

-- Insert Values into Table
INSERT INTO tbl_perm VALUES
(1, 'a'),
(2, 'b'),
(3, 'c');

-- Select data from table
Select * from tbl_perm;

-- Create a temporary table
CREATE OR REPLACE TEMPORARY TABLE tbl_perm_temp (
id integer,
val1 string
);

-- Insert Values into Table
INSERT INTO tbl_perm_temp VALUES
(1, 'a'),
(2, 'b'),
(3, 'c');

-- Review Table Metadata
Select * from
INFORMATION_SCHEMA.TABLES
where TABLE_NAME like 'TBL_%';

-- Drop a table
Drop Table tbl_perm;

-- Undrop Table
UNDROP Table tbl_perm;

-- Set data retention (up to 90 days for permanent; transient and temporary allow 0 or 1)
ALTER TABLE tbl_perm set DATA_RETENTION_TIME_IN_DAYS = 7;

-- Grab the last query id
select * from tbl_perm;
select last_query_id();

-- Update a table
update tbl_perm
set val1 = 'a' where id = 1;

-- Time Travel: look at the table as it was before a prior statement (substitute your own query ID from last_query_id())
select * from tbl_perm before (statement => '01af7925-0000-370d-0000-0000941d4241');

-- Show tables
SHOW TABLES;

-- Use the following setup for the views demo
CREATE OR REPLACE TABLE tbl_player (
playerid bigint primary key,
playername string,
playercountry string
);

Insert into tbl_player Values
(1, 'Knight1213', 'US'),
(2, 'Squire123', 'US'),
(3, 'Horse33', 'US');

CREATE OR REPLACE TABLE tbl_games (
gameid bigint primary key,
gamename string,
gamerules string
);

Insert into tbl_games Values
(1, 'Game1', 'Fight'),
(2, 'Game2', 'Fight'),
(3, 'Game3', 'King of the Hill'),
(4, 'Game4', 'Fight'),
(5, 'Game5', 'King of the Hill');


CREATE OR REPLACE TABLE tbl_playergames (
gamid bigint,
playerid bigint,
InstantRank int,
GameDate date
);

Insert into tbl_playergames values
(5, 3, 73, '9/1/2023'),
(4, 1, 31, '9/2/2023'),
(4, 3, 15, '9/1/2024'),
(3, 3, 97, '9/2/2024'),
(5, 1, 22, '9/1/2025'),
(4, 3, 68, '9/2/2025'),
(1, 3, 3, '9/1/2026'),
(2, 1, 85, '9/2/2026'),
(4, 3, 3, '9/1/2027'),
(4, 2, 58, '9/2/2027'),
(2, 3, 8, '9/1/2028'),
(1, 2, 41, '9/2/2028'),
(1, 1, 70, '9/1/2029'),
(1, 2, 96, '9/2/2029'),
(1, 1, 19, '9/1/2030'),
(1, 3, 82, '9/2/2030'),
(1, 2, 3, '9/1/2031'),
(5, 3, 41, '9/2/2031'),
(4, 2, 72, '9/1/2032'),
(4, 2, 92, '9/2/2032'),
(3, 1, 86, '9/1/2033'),
(1, 2, 58, '9/2/2033'),
(4, 2, 8, '9/1/2034'),
(2, 1, 84, '9/2/2034'),
(1, 2, 64, '9/1/2035');

-- Create a view
CREATE or REPLACE VIEW myplayerstats as
select playername, count(gameid) as countgames, count(distinct gameid) as distinctgames, avg(instantrank) as AverageInstantRank
from tbl_player p inner join tbl_playergames pg on p.playerid = pg.playerid
inner join tbl_games g on g.gameid = pg.gamid
group by playername;

select * from myplayerstats;

-- Create a materialized view (a materialized view can only reference a single table)
CREATE or REPLACE MATERIALIZED VIEW mygamestats as
select * from tbl_playergames;

-- create a Secure materialized view
CREATE or REPLACE SECURE MATERIALIZED VIEW mygamestats as
select * from tbl_playergames where gamid > 3;

Show VIEWS;


-- Create an Internal Stage
CREATE STAGE my_blog_stage
COPY_OPTIONS = (ON_ERROR='skip_file');

-- Create an External Stage (an S3 external stage example appears in the Snowpipe section below)

-- From SNOWSQL load data into internal stage
put file://cities.parquet @my_blog_stage;
list @my_blog_stage;

-- Create a parquet file format
CREATE FILE FORMAT my_parquet_format
TYPE = parquet;

-- Infer the schema of the file via parquet
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@my_blog_stage/cities.parquet'
, FILE_FORMAT=>'my_parquet_format'
)
);

-- Create a table from the schema
CREATE TABLE myinferredtable
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@my_blog_stage/cities.parquet',
FILE_FORMAT=>'my_parquet_format'
)
));

-- Copy data from the stage to the newly created table
COPY into myinferredtable
from @my_blog_stage/cities.parquet FILE_FORMAT = (FORMAT_NAME= 'my_parquet_format') MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE;

-- Select from the newly populated table
select * from myinferredtable;

-- Clone a table
CREATE or REPLACE Table myinferredtable_clone CLONE myinferredtable;

-- Cluster a table
Select * from information_schema.Tables
Where TABLE_NAME = 'TBL_PLAYERGAMES';

Select * from TBL_PLAYERGAMES;

ALTER TABLE TBL_PLAYERGAMES CLUSTER by (GAMID, PLAYERID);

-- Create a STAGE for SnowPipe (S3 Example)
CREATE OR REPLACE stage MYBLOG.EXTERNALSTAGES.MySnowPipeStage
URL = 's3://example/example'
STORAGE_INTEGRATION = s3_int
FILE_FORMAT = MYBLOG.EXTERNALSTAGES.CSVformat;

-- Define pipe
CREATE OR REPLACE pipe MYBLOG.MYSNOWPIPE
auto_ingest = TRUE
AS
COPY INTO SNOWFLAKE_BLOG_TEMP.MYBLOG.TBL_PLAYER
FROM @MYBLOG.EXTERNALSTAGES.MySnowPipeStage;

-- Describe pipe
DESC pipe MYBLOG.MYSNOWPIPE;

-- Pull the ARN and set up notifications in AWS (not included in this demo)

-- Create a Stream to monitor inserts
CREATE STREAM my_ext_table_stream ON EXTERNAL TABLE my_ext_table INSERT_ONLY = TRUE;

-- Create a task to insert a timestamp every hour
CREATE OR REPLACE TABLE MYBLOG.MYTIME (
ts TIMESTAMP
);

CREATE TASK mysnowtask
WAREHOUSE = WH_BLOG
SCHEDULE = '60 MINUTE'
AS
INSERT INTO MYTIME(ts) VALUES(CURRENT_TIMESTAMP);