How to push data from local SQL Server to Tableau Server on AWS

We are developing Tableau dashboards and deploying the workbooks on an EC2 Windows instance in AWS. One of the data sources is the company SQL Server inside the firewall. The server is managed by IT, and we only have read permission to one of the databases. The current solution is to build the workbooks in Tableau Desktop locally by connecting to the company SQL Server. Before the workbooks are published to Tableau Server, the data is extracted from the data sources, and the static extracts are uploaded along with the workbooks when published.

Instead of linking to static extracted data on Tableau Server, we would like to set up a database on AWS (e.g. PostgreSQL), probably on the same instance, and push the data from the company SQL Server to the AWS database.

There may be a way to push directly from SQL Server to Postgres on AWS, but since we don't have much control over the server, and IT is probably not willing to push data to an external host, this is not an option. What I can think of is as follows:

  1. Set up Postgres on the AWS instance and create tables with the same schemas as the ones in SQL Server.
  2. Extract data from SQL Server and save it as CSV files, one table per file.
  3. Enable file sharing on the AWS Windows instance so that it can read files from the local file system directly.
  4. Load the data from the CSV files into the Postgres tables.
  5. Set up the data connection on Tableau Server on AWS to read data from Postgres.

I don't know if others have come across a situation like this or what their solutions were, but I don't think this is an uncommon scenario. One change would be to have both the local Tableau Desktop and Tableau Server on AWS connect to Postgres on AWS. I'm not sure whether local Tableau Desktop could access Postgres on AWS, though.

We also want to automate the whole process as much as possible. On the local server, I can probably run a Python script as a cron job to frequently export data from SQL Server and save it to CSVs; a rough sketch of what I have in mind is below. On the AWS side, something similar would run to load the data from the CSVs into Postgres. If the files are big, though, importing from CSV into Postgres may be pretty slow. But there seems to be no better way to transfer the files from the local network to the AWS EC2 instance programmatically, since it is a Windows instance.
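
For illustration, a minimal sketch of the export side, assuming pyodbc is installed; the server, database, table, and staging-folder names are all hypothetical:

import csv
import pyodbc

# Hypothetical connection details; adjust driver/server/database to your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=CompanySQLServer;DATABASE=TheDatabase;Trusted_Connection=yes;"
)

TABLES = ["Table1", "Table2", "Table3"]  # hypothetical table names

for table in TABLES:
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM dbo.{table}")
    # One CSV per table, with a header row, as in step 2 of the plan above.
    with open(rf"C:\Staging\{table}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(cursor)

conn.close()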

I am open to any suggestions.


A. Platform choice

If you use a database other than SQL Server on AWS (say Postgres), you need to perform one (or maybe two) conversions:

  1. In the integration from the on-prem SQL Server to the AWS database, you need to map SQL Server datatypes to Postgres datatypes (see the sketch after this list)

  2. I don't know much about Tableau, but if it is currently pointing at SQL Server, you probably need some kind of conversion to point it at Postgres
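
To give a feel for point 1, here is a hedged sketch of the sort of datatype map you would end up maintaining; the entries below are illustrative, not exhaustive:

# Rough SQL Server -> Postgres type mapping (illustrative, not exhaustive).
TYPE_MAP = {
    "int": "integer",
    "bigint": "bigint",
    "bit": "boolean",
    "datetime": "timestamp",
    "datetime2": "timestamp",
    "nvarchar": "varchar",
    "varchar": "varchar",
    "decimal": "numeric",
    "float": "double precision",
    "uniqueidentifier": "uuid",
}

def translate_column(name: str, sqlserver_type: str) -> str:
    """Map one SQL Server column definition to a Postgres equivalent."""
    return f"{name} {TYPE_MAP.get(sqlserver_type.lower(), 'text')}"

# e.g. translate_column("OrderDate", "datetime") -> "OrderDate timestamp"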

These two steps alone might make it worth your while to investigate a SQL Express RDS instance. SQL Express has no licensing cost, but Windows obviously does. You could also run SQL Express on Linux, which would have no licensing costs at all but would require a lot of fiddling about to get running (and I doubt there is a SQL Express Linux RDS available).

B. Integration Approach

  • Any process external to your network (i.e. on the cloud) that is pulling data from your network will need the firewall opened. Assuming this is not an option, that leaves only push-from-on-prem options

  • Just as an aside on this point, Power BI achieves its desktop data integration by using a desktop 'gateway' that coordinates data transfer, meaning that cloud Power BI doesn't need a firewall port opened to get what it needs; the desktop gateway pushes the data out instead

  • Given that we only have push options, we need something on-prem to push the data out. Yes, this could be a cron job on Linux or a Windows scheduled task. Please note, this is where you start creating shadow IT

  • To get data out of SQL Server to be pushed to the cloud, the easiest way is to use BCP.EXE to generate flat files. If these are going into SQL Server, they should be in native format (to save complexity); if they are going to Postgres, they should be tab delimited

  • If these files are being uploaded to SQL Server, then it's just another BCP command to push the native files into SQL Server tables (prior to this, you need to run a SQLCMD.EXE command to truncate the target tables)

So for three tables, assuming you'd installed the free* SQL Server client tools, you'd have a batch file something like this:

REM STEP 1:  Clear staging folder
DEL /Q C:\Staging\*.TXT

REM STEP 2: Generate the export files
REM -T = trusted (single sign-on) connection; -N = native format
BCP database.dbo.Table1 OUT C:\Staging\Table1.TXT -T -S LocalSQLServer -N
BCP database.dbo.Table2 OUT C:\Staging\Table2.TXT -T -S LocalSQLServer -N
BCP database.dbo.Table3 OUT C:\Staging\Table3.TXT -T -S LocalSQLServer -N


REM STEP 3: Clear target tables
REM Your SQL RDS is unlikely to support single sign on 
REM so need to use user/pass here
SQLCMD -U username -P password -S RDSSQLServerName -d databasename -Q"TRUNCATE TABLE Table1; TRUNCATE TABLE Table2; TRUNCATE TABLE Table3;"

REM STEP 4: Push data in
BCP databasename.dbo.Table1 IN C:\Staging\Table1.TXT -U username -P password -S RDSSQLServerName -N
BCP databasename.dbo.Table2 IN C:\Staging\Table2.TXT -U username -P password -S RDSSQLServerName -N
BCP databasename.dbo.Table3 IN C:\Staging\Table3.TXT -U username -P password -S RDSSQLServerName -N

(*I'm pretty sure that BCP and SQLCMD are free... not certain, but you can certainly download the free SQL Server command line tools and see)

If you wanted to push to PostgreSQL instead:

  • in step 2, you'd replace the -N option with -c, which would make the file plain text, tab delimited, readable by anything
  • in steps 3 and 4, you'd need to use the associated Postgres command line tool (psql and its \copy command), but you'd also need to deal with datatypes etc. (which can be a pain - ambiguous date formats alone are always a huge problem); a sketch of one way to do this follows
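
Instead of the psql command line tool, here is a hedged sketch of steps 3 and 4 against Postgres in Python using psycopg2; the connection details, table names, and staging path are assumptions:

import psycopg2

# Hypothetical connection details for the Postgres instance on AWS.
conn = psycopg2.connect(
    host="your-aws-hostname", dbname="databasename",
    user="username", password="password",
)

TABLES = ["Table1", "Table2", "Table3"]  # hypothetical table names

with conn:
    with conn.cursor() as cur:
        for table in TABLES:
            # Step 3: clear the target table.
            cur.execute(f"TRUNCATE TABLE {table}")
            # Step 4: COPY's default text format is tab delimited, which
            # matches BCP -c output; NULL handling may still need attention
            # (e.g. WITH (NULL '')), since BCP writes NULLs as empty fields.
            with open(rf"C:\Staging\{table}.TXT", encoding="utf-8") as f:
                cur.copy_expert(f"COPY {table} FROM STDIN", f)

conn.close()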

Also note that the AWS RDS instance is just another database with a hostname, login, and password. The only thing you have to do is make sure the firewall (security group) on the AWS side is open to accept incoming connections from your IP address

There are many more layers of sophistication you can build into your integration - differential replication, retries, etc. - but given the 'shadow IT' status, this might not be worth the effort

Also be aware of data transfer charges. As far as I know, AWS doesn't actually charge for data uploads (inbound transfer is free), but check the current pricing; if you are replicating a 1G database every day, any per-GB charge is going to add up. (Azure doesn't charge for uploads either, but I'm sure you'll pay in some other way!)


For this type of problem I would strongly recommend use of SymmetricDS - https://www.symmetricds.org/

The main caveat is that the SQL Server database would require the addition of some triggers to track changes, but from that point SymmetricDS will handle pushing the data.

An alternative approach, similar to what you suggested, would be to have a script export the data into CSV files, upload them to S3, and then have an S3 bucket event notification kick off a Lambda function that loads the data when it arrives. A minimal sketch of such a handler is below.
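
This sketch assumes the CSVs are named after their target tables, that psycopg2 is packaged with the function (it is not in the Lambda runtime), and that connection details come from environment variables; all of those are assumptions, not a definitive implementation:

import os

import boto3
import psycopg2  # would need to be bundled with the function or supplied via a layer

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event; assumes one CSV per table,
    # named after the target table (keys with special characters would
    # also need URL-decoding, omitted here).
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]  # e.g. "staging/Table1.csv"
    table = os.path.splitext(os.path.basename(key))[0]

    local_path = f"/tmp/{os.path.basename(key)}"
    s3.download_file(bucket, key, local_path)

    # Hypothetical connection details supplied via environment variables.
    conn = psycopg2.connect(
        host=os.environ["PG_HOST"], dbname=os.environ["PG_DB"],
        user=os.environ["PG_USER"], password=os.environ["PG_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        cur.execute(f"TRUNCATE TABLE {table}")
        with open(local_path) as f:
            cur.copy_expert(
                f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)", f
            )
    conn.close()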