Snowflake Data Types: An Introduction
The Snowflake AI Data Cloud supports numerous data types, providing a fit for every sort of data, including strings, numbers, and even intricate semi-structured formats. This abundance of options is about more than storage: Snowflake supports the majority of fundamental SQL data types, including text, integers, and dates, and can also handle more complex semi-structured and geospatial data.
These Snowflake data types can be used in a variety of contexts, including columns, expressions, local variables, parameters, and other appropriate places. By leveraging Snowflake consulting to choose the right data types, you can significantly improve data integrity, maximize storage efficiency, and improve query performance.
Snowflake requires careful data type selection to balance flexibility, storage efficiency, and query performance. Snowflake natively offers six categories of data types. Let us take a deep dive into the most widely used ones.
1. Numeric Data Types
The numeric data types store various numeric values, such as whole numbers, decimals, and floating-point numbers. Selecting the appropriate numeric type is crucial for ensuring accuracy and optimizing storage and processing efficiency. Let us look at some examples of the numeric data types.
DECIMAL
The DECIMAL data type supports fixed-point decimal arithmetic, allowing us to define the value range through precision and scale. DECIMAL, DEC, and NUMERIC are interchangeable synonyms for NUMBER. The default precision and scale of DECIMAL are the same as for NUMBER, i.e. DECIMAL(38, 0). Let us see this in action.
Consider a table in which “column1” is defined with an explicit precision and scale (precision 5, scale 2) while “column2” has no such metric defined. In that case “column2” is stored with the default, DECIMAL(38,0).
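A minimal sketch of this defaulting behavior (the table and column names are illustrative, not from any real schema):

```sql
-- Hypothetical table: column1 gets explicit precision/scale, column2 relies on defaults
CREATE OR REPLACE TABLE decimal_demo (
    column1 DECIMAL(5,2),   -- up to 3 digits before the decimal point, 2 after
    column2 DECIMAL         -- no precision/scale given: stored as NUMBER(38,0)
);

-- DESC TABLE should report NUMBER(5,2) and NUMBER(38,0) respectively
DESC TABLE decimal_demo;
```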
INT, INTEGER, BIGINT, SMALLINT, TINYINT, BYTEINT
In Snowflake, INTEGER is synonymous with NUMBER but does not allow precision and scale to be specified. When modelling whole numbers, you can choose from various integer types such as INT, BIGINT, SMALLINT, and others. In Snowflake these types are all synonyms for NUMBER(38,0) and behave identically; the different names exist for migration compatibility and to document intent.
For instance, each of these types can store whole numbers of up to 38 digits, without any decimal component. The names give developers guidance on the likely value ranges being modelled, eliminating concerns about potential overflows: INT is suitable for common use cases, BIGINT signals very large numbers, and SMALLINT smaller ranges.
The precision of an integer type, or the total number of digits, has no bearing on Snowflake’s storage capacity or performance. Rather, by compressing numbers and calculating storage needs depending on the range of values within each micro-partition, Snowflake optimizes storage. However, storage can be affected by the scale (digits following the decimal point, which is zero for integer types); bigger scales may take up more space and have an effect on memory usage and processing speed.
If we describe a table containing columns of each of these types, we see that all of them are stored under the same underlying data type, NUMBER(38,0).
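A quick sketch of that equivalence (table and column names are hypothetical):

```sql
-- All integer synonyms resolve to the same underlying type
CREATE OR REPLACE TABLE int_demo (
    a INT,
    b BIGINT,
    c SMALLINT,
    d TINYINT
);

-- Every column should be reported as NUMBER(38,0)
DESC TABLE int_demo;
```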
To put it briefly, these integer types provide many easily understood “buckets” for modelling whole numbers without fractions.
NUMBER
Now that we have covered the two data types above, the concept of NUMBER is relatively easy to grasp. NUMBER is the Snowflake data type governed by the precision and scale that we briefly covered in the topics above.
Precision represents the total number of digits a number can have. For instance, if a column is defined as NUMBER(5,2), the precision is 5, leaving three digits before the decimal point and allowing values from -999.99 to 999.99. Lower precision limits the size of numbers that can be stored.
Scale refers to the number of digits permitted after the decimal point. In the example NUMBER (5,2), the scale is 2, meaning values can look like -99.99 or 123.45. A higher scale allows for greater precision in fractional values.
It shares the behaviour of DECIMAL: when nothing is specified, NUMBER also defaults to (38,0).
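A small sketch tying precision and scale together (names are illustrative):

```sql
CREATE OR REPLACE TABLE number_demo (
    amount NUMBER(5,2),   -- values from -999.99 to 999.99
    plain  NUMBER         -- defaults to NUMBER(38,0)
);

-- Within range: accepted
INSERT INTO number_demo VALUES (123.45, 42);

-- Inserting 1234.5 into "amount" would fail: it needs 4 digits before
-- the decimal point, but NUMBER(5,2) allows only 3
```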
FLOAT, FLOAT4, FLOAT8
The terms FLOAT, FLOAT4, and FLOAT8 are used for compatibility with other systems, but Snowflake treats all three as 64-bit floating-point numbers. Snowflake uses double-precision (64 bit) IEEE 754 floating-point numbers.
Integers can be represented exactly in the range -9007199254740991 to +9007199254740991. For floating-point numbers, the magnitude ranges roughly from 10^-308 to 10^+308, with values between approximately 10^-324 and 10^-308 represented with reduced (subnormal) precision.
Snowflake also supports special values for FLOAT:
- NaN (Not A Number)
- inf (infinity)
- -inf (negative infinity)
These values are case-insensitive and should be enclosed in single quotes. It’s important to note that Snowflake handles ‘NaN’ differently from the IEEE 754 standard, treating all ‘NaN’ values as equal and considering ‘NaN’ to be greater than any other FLOAT value.
Rounding errors should be considered when performing floating-point operations. These errors can differ with each query execution and may be more pronounced when operands have varying precision or scale. They are especially significant in aggregate functions like SUM() or AVG(). To minimize or eliminate these errors, Snowflake advises casting to a fixed-point type before performing aggregation.
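A hedged sketch of that advice, casting a FLOAT column to fixed-point NUMBER before summing (the `orders` table and `price` column are hypothetical):

```sql
-- FLOAT sums can accumulate rounding error; fixed-point sums do not
SELECT
    SUM(price)                       AS float_sum,  -- may show rounding artifacts
    SUM(CAST(price AS NUMBER(38,9))) AS exact_sum   -- fixed-point aggregation
FROM orders;  -- hypothetical table with a FLOAT column "price"
```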
If we insert values into columns declared with these synonyms and then describe the table, the type column in the DESC TABLE output shows FLOAT not only for FLOAT itself but also for its synonyms, such as DOUBLE, DOUBLE PRECISION, and REAL.
DOUBLE, DOUBLE PRECISION, REAL
DOUBLE, DOUBLE PRECISION, and REAL are synonymous with FLOAT, meaning they are treated as 64-bit floating-point numbers. These types are intended to handle values that require large ranges and high precision, making them particularly useful in situations where FLOAT is used to represent real-world data with great accuracy.
2. String and Binary Data Types
This section describes the string/text data types supported by Snowflake, including binary strings, as well as the formats available for string constants and literals. String and binary data types in Snowflake are essential for managing text and binary data. The string types uniformly store Unicode UTF-8 characters, ensuring consistent handling of data across different string types. Below are some of the common string and binary data types in Snowflake:
VARCHAR
VARCHAR is the most flexible string data type in Snowflake, supporting Unicode UTF-8 characters. Its length can be defined up to 16,777,216 bytes (16 MB), and if not specified, it defaults to this maximum.
In terms of performance, there is no difference between the full-length declaration VARCHAR(16777216) and a smaller length. This data type is ideal for storing text of varying lengths, such as names, descriptions, or comments.
In some systems outside of Snowflake, data types like CHAR and VARCHAR store ASCII, while NCHAR and NVARCHAR store Unicode. In Snowflake, VARCHAR and all other string data types store Unicode UTF-8 characters, and there is no difference in Unicode handling between CHAR and NCHAR. Synonyms like NCHAR exist mainly for syntax compatibility when migrating DDL commands to Snowflake.
CHAR, CHARACTER, NCHAR
CHAR (or CHARACTER, NCHAR) is similar to VARCHAR but is generally used for storing fixed-length strings. If no length is specified, it defaults to CHAR(1). Unlike traditional CHAR behavior, Snowflake does not pad shorter strings with spaces to match the maximum length.
When such columns are described, both the defaulted lengths and any explicitly declared lengths appear in the DESC TABLE output.
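A sketch of the defaulting behavior (table and column names are illustrative; Snowflake may report CHAR columns under the VARCHAR type name):

```sql
CREATE OR REPLACE TABLE char_demo (
    c1 CHAR,          -- defaults to CHAR(1)
    c2 CHAR(10),
    v1 VARCHAR,       -- defaults to the 16,777,216 maximum
    v2 VARCHAR(50)
);

-- Expected lengths: 1, 10, 16777216, 50
DESC TABLE char_demo;
```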
STRING, TEXT, NVARCHAR, NVARCHAR2, CHAR VARYING, NCHAR VARYING
The Snowflake data types STRING, TEXT, NVARCHAR, NVARCHAR2, CHAR VARYING, and NCHAR VARYING are all treated as equivalents of VARCHAR. Now, let us try to understand the difference between STRING and VARCHAR.
In Snowflake, STRING and VARCHAR are often used interchangeably, as both are well-suited for handling text data. VARCHAR is typically the go-to choice for variable-length text, while STRING is essentially synonymous with VARCHAR in Snowflake. Unlike some database systems where STRING represents fixed-length character data, in Snowflake, both STRING and VARCHAR function the same way—efficiently accommodating text of any length, whether brief or extensive.
BINARY
The Snowflake data type BINARY is used for binary data, such as images or files, with a maximum size of 8 MB. Unlike VARCHAR, BINARY does not handle Unicode characters, and its length is always measured in bytes.
VARBINARY
The Snowflake data type VARBINARY is a synonym for BINARY and behaves identically, storing binary data with the same 8 MB maximum size; storage adjusts to the actual length of the data.
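A small sketch of round-tripping binary data through hex encoding (the literal is arbitrary):

```sql
-- TO_BINARY converts a hex string to binary; TO_VARCHAR(..., 'HEX') goes back
SELECT
    TO_BINARY('48454C4C4F', 'HEX')                    AS bin_value,
    TO_VARCHAR(TO_BINARY('48454C4C4F', 'HEX'), 'HEX') AS round_trip;
```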
3. Logical Data Types
In Snowflake, the BOOLEAN data type is used to represent true or false values, with an additional “unknown” value represented by NULL. BOOLEAN can be utilized in various parts of a query, such as a SELECT list or WHERE clause, to evaluate logical conditions. When it comes to converting to BOOLEAN, there are two methods: EXPLICIT and IMPLICIT.
Explicit Conversion:
It allows us to convert text strings like “true,” “yes,” or “1” into TRUE, and “false,” “no,” or “0” into FALSE. This conversion is case-insensitive. For numeric values, the rule is straightforward: zero converts to FALSE, while any non-zero value converts to TRUE. Explicit conversion turns TRUE into “true” or 1, and FALSE into “false” or 0.
Implicit Conversion:
It happens automatically, with the system converting strings like “true” or “false” to their BOOLEAN equivalents and applying the same zero/non-zero rule for numeric values. Implicit conversion automatically converts TRUE and FALSE into the strings “true” and “false”.
Both styles of conversion can be exercised directly in a query, alongside ordinary boolean expressions.
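A hedged sketch of boolean evaluation, using explicit conversion functions and a plain comparison (aliases are arbitrary):

```sql
-- Explicit conversion: case-insensitive strings and numbers to BOOLEAN
SELECT
    TO_BOOLEAN('yes')  AS b1,   -- TRUE
    TO_BOOLEAN('0')    AS b2,   -- FALSE
    CAST(5 AS BOOLEAN) AS b3,   -- non-zero number: TRUE
    1 = 2              AS b4;   -- comparison evaluates to FALSE
```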
4. Date and Time Data Types
The Snowflake date and time data types are designed for handling dates, times, and timestamps. Snowflake supports the following date and time data types:
- DATE
- DATETIME
- TIME
- TIMESTAMP_LTZ, TIMESTAMP_NTZ, TIMESTAMP_TZ
For DATE and TIMESTAMP data, Snowflake recommends using years within the range of 1582 to 9999. While Snowflake supports some years outside this range, years before 1582 should be avoided due to limitations of the Gregorian calendar.
DATE
The DATE data type is specifically designed for storing calendar dates. It supports various formats, such as YYYY-MM-DD and DD-MON-YYYY, and can easily convert TIMESTAMP data by ignoring the time component. Snowflake advises using dates between 1582 and 9999 to avoid issues related to historical calendar reforms.
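A brief sketch of those conversions (the literals are arbitrary examples):

```sql
-- Strings in supported formats convert to DATE; casting a TIMESTAMP drops its time part
SELECT
    TO_DATE('2024-03-15')                          AS iso_date,
    TO_DATE('15-MAR-2024', 'DD-MON-YYYY')          AS mon_date,
    CAST('2024-03-15 10:30:00'::TIMESTAMP AS DATE) AS from_ts;  -- time component ignored
```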
DATETIME
In Snowflake, DATETIME is an alias for TIMESTAMP_NTZ, which represents dates and times without accounting for time zones. This data type is ideal for scenarios where time zone information is not essential.
TIME
The Snowflake data type TIME stores time values without a date component. It allows precision for fractional seconds, ranging from 0 (seconds) to 9 (nanoseconds). This type is useful for storing specific times of day, such as schedules or routines, without the need for date information.
When a TIME column is declared with a fractional-seconds precision, DESC TABLE reports that scale alongside the column.
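A sketch of TIME precision in practice (table and column names are illustrative):

```sql
CREATE OR REPLACE TABLE time_demo (
    t_default TIME,      -- default scale 9 (nanoseconds)
    t_seconds TIME(0)    -- whole seconds only
);

INSERT INTO time_demo VALUES ('10:15:30.123456789', '10:15:30');

-- Expected to report TIME(9) and TIME(0)
DESC TABLE time_demo;
```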
TIMESTAMP_LTZ, TIMESTAMP_NTZ, TIMESTAMP_TZ
TIMESTAMP_LTZ (Local Time Zone) stores times in UTC and converts them to the local time zone for display and calculations. It is perfect for applications that require recording the precise moment of an event in a globally consistent manner, irrespective of the event’s location.
TIMESTAMP_NTZ (No Time Zone) records time without any time zone information. This variant is ideal for applications where time zone context is either unnecessary or handled separately.
TIMESTAMP_TZ (Time Zone) stores both the time and the associated time zone, making it ideal for applications that need to account for time zones, such as scheduling events across multiple regions.
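The three variants can be contrasted in one query; a sketch, assuming a session whose time zone has been set explicitly (the timestamps are arbitrary):

```sql
-- The session time zone affects how TIMESTAMP_LTZ values are displayed
ALTER SESSION SET TIMEZONE = 'America/Los_Angeles';

SELECT
    '2024-03-15 10:00:00'::TIMESTAMP_NTZ        AS ntz,  -- wall-clock, no zone
    '2024-03-15 10:00:00'::TIMESTAMP_LTZ        AS ltz,  -- stored in UTC, shown in session zone
    '2024-03-15 10:00:00 +05:30'::TIMESTAMP_TZ  AS tz;   -- keeps its offset
```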
5. Semi-Structured Data Types
Snowflake's semi-structured data types are highly flexible, enabling the storage and manipulation of various data formats. The data types that support semi-structured data are explained below:
VARIANT
VARIANT is a versatile data type capable of holding any other type, including OBJECT and ARRAY. It is especially useful for storing data in formats such as JSON, Avro, ORC, Parquet, or XML. VARIANT has a maximum size of 16 MB and stores both the value and its type, enabling operations without the need for explicit casting.
OBJECT
OBJECT is similar to a JSON object or a dictionary in other programming languages. It consists of key-value pairs, where each key is a VARCHAR and each value is a VARIANT. OBJECT is ideal for representing structured data in a flexible format.
ARRAY
The Snowflake data type ARRAY is similar to arrays in many programming languages, holding an ordered list of elements. An ARRAY can contain zero or more data items, with each element accessed by its position in the array. Each element in an ARRAY is a VARIANT, meaning it can store various types of data. ARRAY is ideal for storing lists or sequences of values, where the elements may differ but are often of the same or compatible types for consistent processing.
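The three types compose naturally; a sketch using PARSE_JSON, with a made-up JSON payload:

```sql
-- PARSE_JSON produces a VARIANT; colon and bracket syntax navigate OBJECTs and ARRAYs
WITH raw AS (
    SELECT PARSE_JSON('{"name": "Ada", "langs": ["SQL", "Python"]}') AS v
)
SELECT
    v:name::STRING     AS name,        -- extract an OBJECT field, cast to STRING
    v:langs[1]::STRING AS second_lang  -- index into an ARRAY (0-based)
FROM raw;
```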
6. Geospatial Data Types
The Snowflake geospatial data types are used to store information about geographic locations and shapes, such as points, lines, and polygons on the Earth's surface. Snowflake offers two data types for storing and analyzing this type of location data:
GEOGRAPHY
Stores data using spherical latitude/longitude coordinates based on the WGS 84 standard. It interprets lines as curved geodesic paths on the Earth’s surface, enabling the natural modelling of real-world geospatial shapes.
GEOMETRY
This data type handles 2D planar Cartesian data, typically from projected coordinate systems. It treats lines as straight segments between points rather than curved paths, and it supports various well-known spatial reference systems for interpreting coordinates. By default, query output for geospatial values is rendered in GeoJSON format.
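A hedged GEOGRAPHY sketch: constructing points from WKT and measuring the geodesic distance between them (the coordinates are arbitrary):

```sql
-- TO_GEOGRAPHY parses WKT; ST_DISTANCE on GEOGRAPHY returns meters along the sphere
SELECT
    TO_GEOGRAPHY('POINT(-122.35 37.55)') AS pt,  -- displayed as GeoJSON by default
    ST_DISTANCE(
        TO_GEOGRAPHY('POINT(-122.35 37.55)'),
        TO_GEOGRAPHY('POINT(-118.24 34.05)')
    ) AS meters_apart;                           -- geodesic distance in meters
```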
Summary
As we have seen, the variety of Snowflake data types makes the platform stand out in handling data of many forms and formats. Think of Snowflake data types as a set of keys on a keyring, each designed to unlock a specific door and solve a particular data challenge. Whether it's numeric precision, text processing, or managing complex JSON structures, each key represents a unique data type tailored for the task. Just as selecting the right key ensures seamless access, choosing the right Snowflake data types unlocks the full potential of your data efficiently and effectively.