Glue Crawlers don’t correctly recognize Ion data - here’s how you fix that

Amazon Ion is one of the data serialization formats you can use when exporting data from DynamoDB to S3. Recently, I tried to select data from one of these exports with Athena after using a Glue Crawler to create the schema and table. It didn’t work, and I got a weird error message. In this post, I’ll show you how to fix that problem. If you’re not familiar with Ion yet, check out my recent blog post introducing it for more details.

First, I should explain why I want to do this. DynamoDB is a NoSQL database that is great for data with relatively static access patterns. It handles vast amounts of data at ease with predictable latency. That’s great for OLTP workloads, but analytics workloads typically have anything but static access patterns. Because of that, I decided to export the data to S3 and use a service that allows me to throw SQL against data in S3 to fulfill my analytics queries: Athena.

For Athena to be able to read the data, we need a table in the Glue Data Catalog that holds information about the columns and data structures, as well as the storage location and information about how the data should be read or written. If you don’t feel like populating all that information manually, you can use a Crawler to sift through the data in S3 and create tables based on it. That’s what I tried to do. I deployed and ran a crawler that set up the table for me. Everything looked great at first glance.

When I queried the data from Athena, it responded with this beautiful error message:

HIVE_UNSUPPORTED_FORMAT: Unable to create input format

Error 1

Naturally, I tried looking at the table in Athena to see what was going on, but that wasn’t very helpful. It wouldn’t even show me the create table statement:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.NullPointerException

Error 2

At this point, I got a little annoyed. I looked at the table in Glue and got confused. I saw the “Classification: ion” label, which made it seem like the Crawler hat recognized the data format. In the schema below, it had also correctly identified the columns and data types in the files.

Glue Table View

For whatever reason, essential information is missing, though - there is no Serde (Serializer / Deserializer) configured, which would tell the system how the data can be read. A quick look at the documentation showed me that the serializer information just consists of three key-value pairs which are static and in no way computed. I don’t understand why the Crawler failed to add them.

To fix this, you simply add the values in the “Edit Table” dialog in Glue, and it should look like this:

Glue Edit Table Here’s the copy-paste friendly version:

Input Format:
com.amazon.ionhiveserde.formats.IonInputFormat

Output Format:
com.amazon.ionhiveserde.formats.IonOutputFormat

Serde serialization lib
com.amazon.ionhiveserde.IonHiveSerDe

Afterward, we can rerun the original statement and should now see data in Athena:

Athena Query works

I don’t understand why the Crawler doesn’t add this basic information, but after talking to AWS support, it seems this is a known error, and the service team is aware of it. If or when they’ll fix it - as usual, no ETA.

Hopefully, you learned something from this post. I’m looking forward to any questions, feedback, or concerns.

— Maurice

Title Photo by Torbjørn Helgesen on Unsplash

Similar Posts You Might Enjoy

Solving Hive Partition Schema Mismatch Errors in Athena

Working with CSV files and Big Data tools such as AWS Glue and Athena can lead to interesting challenges. In this blog I will explain to you how to solve a particular problem that I encountered in a project - the HIVE_PARTITION_SCHEMA_MISMATCH. - by Maurice Borgmeier

What is Amazon Ion, and how can I read and write it in Python?

Amazon Ion is a data serialization format that was open-sourced by Amazon in 2016 and is used internally at the company. Over time it has also been introduced into some AWS services and is the data format that services like the Quantum Ledger Database (QLDB) use. It has also started to appear in more commonly used services, so I think it’s worth taking a closer look at. This article will explain what Ion is, its benefits, and how you can use it in Python. - by Maurice Borgmeier

Working around Glue's habit of dropping unsuspecting columns

This point explains how to work around Glue’s problem of selective amnesia when creating Dynamic Frames from the Glue data catalog. - by Maurice Borgmeier