AWS RDS MySQL Slowing down over time

I've read many posts about this topic, but none of them talk about the AWS RDS MySQL database. For the past three days I've been running a Python script on an AWS EC2 instance that writes rows to my AWS RDS MySQL database. I have to write 35 million rows, so I know this will take some time. Periodically I check the performance of the database, and three days later (today) I realized that the database is slowing down. When it started, the first 100,000 rows were written in only 7 minutes (this is an example of the rows I'm working with):

0000002178-14-000056    AccountsPayableCurrent  us-gaap/2014        20131231    0   USD 266099000.0000

After three days, 5,385,662 rows have been written to the database, but now it takes almost 3 hours to write 100,000 rows. What is happening?

The EC2 instance that I'm running is a t2.small. You can check the specs here if you need them: EC2 SPECS. The RDS instance that I'm running is a db.t2.small. Check the specs here: RDS SPECS

I'll attach here some charts about the performance of the database and the EC2 instance: Db CPU / Db Memory / Db Write IOPS / Db Write Throughput / EC2 Network in (bytes) / EC2 Network out (bytes)

It would be great if you could help me. Thanks a lot.

EDIT 1: How am I inserting the rows? As I said before, I have a Python script running on an EC2 instance that reads text files, makes some calculations with the values, and then writes every "new" row to the database. Here is a small piece of my code, showing how I read the text files:

for i in path_list:
  notify("Uploading: " + i)
  num_path = "path/" + i + "/file.txt"
  sub_path = "path/" + i + "/file.txt"

  try:
    sub_dict = {}
    with open(sub_path) as sub_file:
      for line in sub_file:
        line = line.strip().split("\t")
        sub_dict[line[0]] = line[1] # Save cik for every accession number
        sub_dict[line[1] + "-report"] = line[25] # Save report type for every CIK
        sub_dict[line[1] + "-frecuency"] = line[28] # Save frecuency for every CIK

    with open(num_path) as num_file:
      for line in num_file:
        num_row = line.strip().split("\t")

        # Reminder: sometimes in very old reports the cik and accession number do not match.
        # For this reason I have to write the following statement, to save the real cik.

        try:
          cik = sub_dict[num_row[0]]
        except KeyError:
          cik = num_row[0][0:10]

        try: # If there is no value, pass
          value = num_row[7]
          values_dict = {
                  'cik': cik, 
                  'accession': num_row[0][10::].replace("-", ""),  
                  'tag': num_row[1], 
                  'value': value, 
                  'valueid': num_row[6], 
                  'date': num_row[4]
                  }

          sql = ("INSERT INTO table name (id, tag, value_num, value_id, endtime, cik, report, period) "
              "VALUES ('{}', '{}', '{}', '{}', '{}', '{}', '{}', '{}', '{}', '{}')".format(
                  values_dict['cik'] + values_dict['accession'] + values_dict['date'] + values_dict['value'].split(".")[0] + "-" + values_dict['tag'], 
                  values_dict['tag'], 
                  float(values_dict['value']), 
                  values_dict['valueid'], 
                  values_dict['date'], 
                  int(values_dict['cik']), 
                  sub_dict[values_dict['cik'] + "-report"], 
                  sub_dict[values_dict['cik'] + "-frecuency"]
                  ))

          cursor.execute(sql)
          connection.commit()

I know there is no except: to catch the try statements, but this is only a piece of the script. I think the important part is how I insert every row. In cases where I don't need to make calculations with the values, I'll use LOAD DATA INFILE to write the text files to the database. I've just realized that it's probably not a good idea to commit every time I insert a row. I will try to commit after 10,000 rows or so.
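For reference, here is a minimal sketch of what that batched approach could look like, assuming the same cursor and connection objects as in the snippet above and a PyMySQL/mysql-connector style driver; the table and column names are taken from the question, the %s placeholders are the driver's parameter style rather than manual string formatting, and the variable names in the comments are just placeholders for the values already computed in the loop:

insert_sql = ("INSERT INTO table_name (id, tag, value_num, value_id, endtime, cik, report, period) "
              "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")

BATCH_SIZE = 10000   # commit every 10,000 rows, as planned above
batch = []           # list of 8-tuples built inside the parsing loop

def flush(batch):
    if batch:
        cursor.executemany(insert_sql, batch)  # one round trip per chunk instead of per row
        connection.commit()                    # one commit per chunk instead of per row
        batch.clear()

# Inside the inner loop, instead of building the SQL string and committing per row:
#     batch.append((row_id, tag, value_num, value_id, endtime, cik, report, period))
#     if len(batch) >= BATCH_SIZE:
#         flush(batch)
# and call flush(batch) once more after the loop to write the remainder.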


Solution 1:

T2 and T3 instances (including db.t2 and db.t3 instances) use a CPU credit system. When the instance is idle it accumulates CPU credits that it can then spend to run faster for short periods of time (burst performance). Once the credits are depleted it slows down to its baseline performance.
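You can confirm whether this is what is happening by watching the CPUCreditBalance metric for the DB instance, either on the RDS console's Monitoring tab or through CloudWatch. A rough boto3 sketch (the region and DB instance identifier are placeholders you would replace with your own):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")      # your region

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUCreditBalance",                                    # drains toward 0 as burst credits are spent
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db"}],  # your RDS instance identifier
    StartTime=datetime.utcnow() - timedelta(days=3),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])

If the balance dropped to around zero at the point where the inserts slowed down, credit exhaustion is your problem.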

One option is to enable the T2/T3 Unlimited setting in your RDS configuration, which will let the instance run at full speed for as long as needed, but you will pay for the extra credits used.

The other option is to change the instance type to db.m5 or some other non-T2/T3 type that supports consistent performance.

Here is a more in-depth explanation of the CPU credits and how they are accrued and spent: On clarifying t2 and t3 working conditions?

Hope that helps :)

Solution 2:

  • Single-row INSERTs are roughly 10 times slower than 100-row INSERTs or LOAD DATA; see the sketch after this list.

  • UUIDs are slow, especially when the table gets large.

  • UNIQUE indexes need to be checked before finishing an INSERT.

  • Non-unique INDEX updates can be done in the background, but they still add some load.
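To illustrate the LOAD DATA route in the context of the Python script from the question: since the rows are computed in Python anyway, one option is to write each batch of computed rows to a temporary tab-separated file and bulk-load it in a single statement. This is only a sketch, assuming PyMySQL, LOCAL INFILE allowed on both the connection and the server (on RDS this is controlled by the local_infile parameter in the parameter group), and the table/column names from the question:

import os
import tempfile

import pymysql

connection = pymysql.connect(host="your-rds-endpoint", user="user",
                             password="password", database="your_db",
                             local_infile=True)   # client side must allow LOCAL INFILE
cursor = connection.cursor()

def bulk_load(rows):
    # rows: list of 8-tuples matching (id, tag, value_num, value_id, endtime, cik, report, period)
    with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
        for row in rows:
            tmp.write("\t".join(str(value) for value in row) + "\n")
        path = tmp.name
    try:
        cursor.execute(
            "LOAD DATA LOCAL INFILE '{}' INTO TABLE table_name "
            "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
            "(id, tag, value_num, value_id, endtime, cik, report, period)".format(path))
        connection.commit()
    finally:
        os.remove(path)

Each call loads one chunk of rows in a single statement, which is usually far faster than issuing one INSERT per row.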

Please provide SHOW CREATE TABLE and the method used for INSERTing. There may be more tips.

Solution 3:

Each time you commit a transaction, the index(es) need to be updated. The complexity of updating an index is related to the number of rows in the table, so as the number of rows increases, the index updates become progressively slower.

Assuming you are using InnoDB tables, you can do the following:

SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET AUTOCOMMIT = 0;
ALTER TABLE table_name DISABLE KEYS;

Then do the inserts, but batch them so that one statement inserts (e.g.) several dozen rows, like INSERT INTO table_name VALUES (<row1 data>), (<row2 data>), .... When the inserts have finished:

ALTER TABLE table_name ENABLE KEYS;
SET UNIQUE_CHECKS = 1;
SET FOREIGN_KEY_CHECKS = 1;
COMMIT;

You can adjust this for your own situation; for example, if the number of rows is huge then maybe you want to insert half a million rows and then commit. This assumes your database is not 'live' (i.e. with users actively reading/writing to it) while you are doing the inserts, because you're disabling checks that you might otherwise rely on when they enter data.
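Sketched against the Python script from the question (same cursor/connection objects; the table name is a placeholder), that sequence could be wrapped around the batched inserts roughly like this:

def bulk_insert(all_rows):
    # Relax the checks before the bulk load, per the statements above
    for stmt in ("SET FOREIGN_KEY_CHECKS = 0",
                 "SET UNIQUE_CHECKS = 0",
                 "SET AUTOCOMMIT = 0",
                 "ALTER TABLE table_name DISABLE KEYS"):
        cursor.execute(stmt)

    # ... run the batched multi-row INSERTs here, e.g. the executemany sketch under EDIT 1,
    #     committing every few hundred thousand rows if the total is huge ...

    # Restore the checks and commit once the inserts are done
    cursor.execute("ALTER TABLE table_name ENABLE KEYS")
    cursor.execute("SET UNIQUE_CHECKS = 1")
    cursor.execute("SET FOREIGN_KEY_CHECKS = 1")
    connection.commit()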