How do I configure AWS Kinesis Firehose to be able to copy into Redshift?
Problem
I have set up a Firehose delivery stream and selected the options that I expected would result in the streamed data ending up in Redshift. I have a single Firehose and a single target table in Redshift. The data is showing up correctly in the intermediate S3 bucket but is not making it to Redshift.
I have allowed the Firehose region CIDRs [1] access to the cluster via Redshift security groups. I have given Firehose the credentials to the Redshift cluster and have tested those credentials. The cluster is publicly accessible.
The STL_LOAD_ERRORS table is empty, which is where I would expect to see the errors produced by any COPY statement that was attempted and failed. There are also no DeliveryToRedshift events in CloudWatch. This leads me to believe that no attempt is even making it to Redshift.
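For reference, this is roughly the check I am running against STL_LOAD_ERRORS (a minimal sketch; the column selection is just what I found useful), and it comes back empty:

    -- Most recent COPY failures, newest first (currently returns no rows)
    select starttime, filename, line_number, colname, err_reason
    from stl_load_errors
    order by starttime desc
    limit 10;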
I am using the Redshift COPY options to specify a jsonpaths file for the COPY. I currently have: json "s3://app-event-data/_jsonpaths/_bigtable.jsonpath.json". That file exists in S3 at that location and follows the documented jsonpaths format [3]. I also tried adding credentials to that box, but I don't think that is necessary.
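For context, my understanding is that Firehose assembles the COPY statement from the table name, the column list, and the COPY options box, so the statement it runs should look roughly like this (a sketch: bigtable is my table, the object key is a placeholder, and the FROM and CREDENTIALS clauses are supplied by Firehose):

    COPY bigtable
    FROM 's3://app-event-data/<object key>'  -- intermediate S3 object, supplied by Firehose
    CREDENTIALS '...'                        -- supplied by Firehose
    json 's3://app-event-data/_jsonpaths/_bigtable.jsonpath.json';  -- the COPY options from the form go here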
Update 1
I enabled logging in Redshift, which showed me that there are successful authentications being made from Firehose. I then created a new parameter group with user activity logging enabled and rebooted the cluster. I am waiting for the next log batch now.
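The successful authentications are visible in the connection log; this is a sketch of the query I used to confirm:

    -- Recent connection events; Firehose shows up as repeated
    -- 'authenticated' / 'initiating session' entries from the Firehose CIDR
    select recordtime, event, remotehost, username
    from stl_connection_log
    order by recordtime desc
    limit 20;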
References
- [1] http://docs.aws.amazon.com/firehose/latest/dev/firehose-dg.pdf
- [2] https://aws.amazon.com/kinesis/firehose/
- [3] http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html
Solution 1:
Here are all of the things I initially misconfigured and eventually adjusted to get Firehose to properly copy data into Redshift:
- Make sure the bucket you select is in the same region as your Redshift cluster. You can specify the REGION option if for some reason you want to incur cross-region data transfer charges. Note that you cannot change the region of a bucket after you create it, which is a bit confusing at first because the region selector in the buckets console says "global".
- Firehose will not quote identifiers for you, so if the table name you provide in the form requires quotes, you need to add them yourself. The same goes for column names. Inside the COPY options, use single quotes only (see the sketch after this list).
- Give Firehose access to your Redshift cluster by authorizing the correct CIDR/IP from the Redshift security tab. The CIDR/IP for your Firehose's region is listed in the documentation [1]. You don't need to create a new Cluster Security Group for this; you can just add it to the default one. If you do add a new security group, you need to reconfigure your cluster from the dashboard to use the new security group, because it can't have two.
- Redshift COPY doesn't accept timestamps with time zones outside of UTC. If you are using a format such as RFC 3339, you need to convert the times to UTC first and specify TIMEFORMAT 'auto' in the COPY options (see the sketch after this list).
- In order to have any idea what is going on when things aren't working, you need to enable User Activity Logging in Redshift by:
  - enabling logging (obviously)
  - creating a new parameter group
  - setting enable_user_activity_logging to true
  - modifying your cluster to use the new parameter group
  - rebooting your cluster
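Putting the quoting and timestamp points together, the statement Firehose ends up running should have roughly this shape (a sketch; the table name, column names, and object key are placeholders for my setup):

    -- Identifiers (table and column names) take double quotes when they need
    -- quoting; everything inside the COPY options takes single quotes only.
    COPY "bigtable" ("user_id", "event", "occurred_at")
    FROM 's3://app-event-data/<object key>'  -- supplied by Firehose
    CREDENTIALS '...'                        -- supplied by Firehose
    json 's3://app-event-data/_jsonpaths/_bigtable.jsonpath.json'
    TIMEFORMAT 'auto';
    -- If the bucket really were in another region you would also add, e.g.,
    -- REGION 'us-east-1' and accept the cross-region transfer charges.

You can also confirm from SQL whether any COPY is actually reaching the cluster:

    select starttime, substring(querytxt, 1, 120) as query_start
    from stl_query
    where querytxt ilike '%copy%'
    order by starttime desc
    limit 10;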
Achievement unlocked: enable user activity logging.