At the re:Invent conference held in Las Vegas in November 2016, AWS CEO Andy Jassy announced Athena among a series of new services. Athena’s specialty is supporting interactive SQL queries over data stored in S3 buckets. What sets it apart from Amazon Redshift, which is used for data analysis in a highly structured data warehouse environment, and Amazon EMR, which is used for data analysis on unstructured data, is Athena’s simplicity and ease of use.
Easy to Set Up
To begin with, Athena is serverless. No infrastructure setup is required to start using it: no clusters need to be set up and no data warehouses need to be created. Athena simply asks for the location path of the data stored in an S3 bucket. The location path must be of the form:
s3://<bucket name>/<folder name>
Along with the location, you must specify the schema for the table being created. Once the table is created, you can perform data analysis using SQL.
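As a rough illustration of what a table definition over an S3 location can look like, the sketch below assembles a Hive-style CREATE EXTERNAL TABLE statement as plain text. The helper build_create_table, the table name, the columns, and the bucket are all hypothetical, and the code assumes comma-delimited text files and appends a trailing slash to the folder path. It only builds the statement; it does not talk to AWS.

```python
import re

# Accepts locations shaped like s3://<bucket name>/<folder name>.
S3_LOCATION = re.compile(r"^s3://[^/\s]+/[^\s]+$")


def build_create_table(table, columns, location, delimiter=","):
    """Return Hive-style DDL text for a delimited-text external table.

    Hypothetical helper for illustration; it is not part of any AWS SDK.
    `columns` is a list of (name, hive_type) pairs.
    """
    if not S3_LOCATION.match(location):
        raise ValueError("location must look like s3://<bucket name>/<folder name>")
    # Assumption: point the table at a folder, written with a trailing slash.
    location = location.rstrip("/") + "/"
    column_list = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_list}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}'"
    )


# Made-up table, columns, and bucket.
ddl = build_create_table(
    "weblogs",
    [("request_time", "string"), ("status", "int"), ("bytes_sent", "bigint")],
    "s3://example-bucket/logs",
)
print(ddl)
```

The printed statement could then be pasted into the query editor and adjusted to match the real bucket and file layout.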
Low Cost Option
Athena’s simplicity is matched by cost-effective pricing: you pay only for the amount of data that a query scans during its execution. Currently the price is $5 per TB scanned. That can make Athena much more economical than using EMR for analyzing semi-structured data like log files and website clickstream data.
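The pay-per-scan model is easy to reason about with a little arithmetic. The sketch below uses the $5 per TB rate quoted above and assumes a binary terabyte (1024^4 bytes); it ignores any rounding or per-query minimum that the actual billing may apply. The function name and the data sizes are made up for illustration.

```python
PRICE_PER_TB_USD = 5.00      # rate quoted in the article
BYTES_PER_TB = 1024 ** 4     # assumption: binary terabyte


def estimate_query_cost(bytes_scanned):
    """Estimate the charge in USD for a query that scans `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical queries: one scans 200 GB, the other only 20 GB.
large_scan = estimate_query_cost(200 * 1024 ** 3)
small_scan = estimate_query_cost(20 * 1024 ** 3)
print(f"200 GB scanned: ${large_scan:.4f}")
print(f" 20 GB scanned: ${small_scan:.4f}")
```

Because the charge scales linearly with bytes scanned, anything that shrinks the scan shrinks the bill by the same proportion.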
Supports the SQL Standard for Queries
Behind the scenes, Athena utilizes Presto. Presto is an open source distributed SQL query engine, originally developed at Facebook, that is optimized for low-latency, ad hoc analysis of data. Hence, Athena is able to support ANSI SQL, including complex queries, aggregations, joins and window functions. Backed by highly available and fault-tolerant infrastructure and by the durability of S3, where your data stays, Athena can also scale to the large data volumes characteristic of big data. Athena does not currently support stored procedures or user-defined functions (UDFs).
Drawing on the vast computational resources offered by AWS, Athena typically returns query results within seconds. Athena’s query editor supports interactive analysis: you simply type a query against an Athena table and run it. This is ideal for quick ad hoc analysis, and it also handles more involved analysis with complex queries.
Creating Databases and Tables
Athena uses an internal data catalog to store metadata about databases and tables, similar to a Hive metastore. A database is a logical grouping of tables. Athena takes a schema-on-read approach, which means that table definitions are applied to the data in S3 only when queries are executed. This eliminates the need to load or transform data, saving time and resources. Table definitions and schemas can be deleted without affecting the underlying data in S3.
Databases and tables can be created in Athena using Hive data definition language (DDL) statements, issued through the query editor, a wizard, or a JDBC driver. The JDBC driver also allows Athena to integrate with business intelligence tools. Because Hive supports multiple data formats through its serializer-deserializer (SerDe) libraries, Athena is also able to support a variety of data formats, including CSV, TSV, Parquet, JSON, ORC, Apache web server logs, and text with custom delimiters. Hive’s regular-expression SerDe also lets Athena apply a schema to irregular text by matching each row against a regular expression.
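To give a feel for how a regular expression can turn a raw log line into columns, here is a small conceptual sketch in plain Python. It mimics the idea behind a regex-based SerDe, where each capture group supplies one column in order, but it is not Hive or Athena code. The pattern, the column names, and the sample line are invented for illustration, and Hive uses Java regular expressions, whose syntax differs from Python's in some details.

```python
import re

# Illustrative pattern for the start of an Apache access-log line.
# Each capture group corresponds, in order, to one entry in COLUMNS.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)'
)
COLUMNS = ["client_ip", "request_time", "method", "path", "status", "bytes_sent"]


def parse_line(line):
    """Return a column-name -> value dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Made-up sample line.
sample = '192.0.2.10 - - [05/Dec/2016:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120'
row = parse_line(sample)
print(row)
```

Every value comes back as a string here; in a real table definition the declared column types decide how the captured text is interpreted.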
Partitioning of data is also supported and can reduce the amount of data each query scans, lowering costs further. Compressing data and converting it to an open source columnar format such as Parquet or ORC likewise improves performance and reduces costs with Athena.
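The effect of partitioning on scanned data can be sketched with a toy model. The code below assumes data laid out under Hive-style key=value folders (for example year=2016/month=12/) and simply adds up the bytes in the partitions that match a filter. The folder names and sizes are invented, and the function is a simplification of partition pruning, not a description of how Athena implements it.

```python
# Hypothetical bytes stored under each Hive-style partition folder.
PARTITION_SIZES = {
    "year=2016/month=10/": 40 * 1024 ** 3,
    "year=2016/month=11/": 60 * 1024 ** 3,
    "year=2016/month=12/": 100 * 1024 ** 3,
}


def bytes_scanned(partition_sizes, wanted=None):
    """Sum the bytes in partitions whose key=value pairs match `wanted`.

    With no filter, every partition is read, as with an unpartitioned table.
    """
    wanted = wanted or {}
    total = 0
    for prefix, size in partition_sizes.items():
        # "year=2016/month=12/" -> {"year": "2016", "month": "12"}
        pairs = dict(part.split("=", 1) for part in prefix.strip("/").split("/"))
        if all(pairs.get(key) == value for key, value in wanted.items()):
            total += size
    return total


everything = bytes_scanned(PARTITION_SIZES)
december = bytes_scanned(PARTITION_SIZES, {"year": "2016", "month": "12"})
print(f"full scan: {everything} bytes")
print(f"December only: {december} bytes ({december / everything:.0%} of the full scan)")
```

Since pricing is based on bytes scanned, a query that filters on the partition columns is charged only for the partitions it actually reads.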
Getting Started with Athena
If you have an AWS account, you are automatically signed up for Athena. As of December 2016, Athena is available only in the US East (N. Virginia) and US West (Oregon) regions and should be available in all regions in the coming months. If you do not have an AWS account, LinuxAcademy.com has released a lab for you to start playing with Athena. Also available on LinuxAcademy.com is the nugget titled “Getting Started with Athena”, which guides you through setting up databases and tables and running queries in Athena.