[Python] How to Improve The Performance of Heavy SQL Joins

python

Background

Software engineering is a multidisciplinary field that encompasses the design, development, testing, and maintenance of software systems. In the realm of software development, one critical aspect that demands continuous attention is database performance tuning. This process involves optimizing the efficiency of database operations, ensuring that the software interacts with the database in the most effective and responsive manner. Today I want to share my experience in improving the performance of heavy SQL join.

Case Study

Alright! First, let me introduce the table schema and provide a sample code. Afterward, we can discuss the issues present in the sample code.

class OceanShipmentExportInfo(models.Model):
    id = models.AutoField(primary_key=True)
    booking_no = models.CharField(max_length=32, null=True, db_index=True)
    
    
class OceanShipment(models.Model):
    class Meta:
        unique_together = ("lookup_id", "subscriber")
   
    subscriber = models.ForeignKey(Subscriber, related_name='shipment_set', null=True, on_delete=models.CASCADE)
    lookup_id = models.CharField(max_length=32, null=True)
    type = models.CharField(max_length=1, default='I', choices=SHIPMENT_TYPE_CHOICES)
    ETA = models.DateTimeField(null=True, db_index=True)
    oe_info = models.OneToOneField(
        'OceanShipmentExportInfo', related_name='shipment', null=True, on_delete=models.SET_NULL
    )
    MBL_NO = models.CharField(max_length=32, db_index=True, null=True)
    HBL_NO = models.CharField(max_length=32, db_index=True, null=True)
    lookup_id = models.CharField(max_length=32, null=True)

def search(search_option, subscriber):
    query = OceanShipment.objects.filter_by_subscriber(None, subscriber).filter(type=search_option.type)

    if search_option.keyword:
        query = query.filter(
            Q(HBL_NO__icontains=search_option.keyword)
            | Q(MBL_NO__icontains=search_option.keyword)
            | Q(oe_info__booking_no__icontains=search_option.keyword)
            | Q(lookup_id=search_option.keyword)
        )

    return query.order_by('-ETA')

Next, let's delve into this piece of code. We have two models named OceanShipmentExportInfo and OceanShipment. In the search function, if there's a keyword to be searched, we look at the following fields: 1. HBL_NO (OceanShipment), 2. MBL_NO (OceanShipment), 3. booking_no (OceanShipmentExportInfo), 4. lookup_id (OceanShipment).

Now, the issue arises because booking_no is in the other model (OceanShipmentExportInfo), and there's a one-to-one relation between OceanShipmentExportInfo and the OceanShipment model. This implies that this query will perform a join operation to complete the search function. The problem is that both OceanShipmentExportInfo and OceanShipment contain millions of records, leading to this query taking almost 7 seconds to complete, which is a very bad performance.

Solutions

When aiming to optimize SQL queries for speed, the immediate consideration is usually indexing. However, upon reviewing the model, it became apparent that all fields used in filtering and joining were already indexed. (Remember, for OneToOneField, Django automatically creates an index on the foreign key field by default.)

Upon closer examination, I identified that the join operation (Q(oe_info__booking_no__icontains=search_option.keyword)) was the bottleneck causing the query to slow down. Remarkably, removing this line resulted in a significant improvement, bringing the query performance to 100 ms.

This led me to explore ways to circumvent SQL JOIN. The solution was a refactoring of the search function, and the results were impressive.

def search(search_option: SearchOption, subscriber=None):
    query = OceanShipment.objects.filter_by_subscriber(None, subscriber).filter(type=search_option.type)
    booking_ids = None
    if search_option.keyword:
        booking_ids = OceanShipmentExportInfo.objects.filter(booking_no=search_option.keyword).values('id')

        if booking_ids:
	        query = query.filter(
	            Q(HBL_NO__icontains=search_option.keyword)
	            | Q(MBL_NO__icontains=search_option.keyword)
	            | Q(oe_info__in=booking_ids)
	            | Q(lookup_id=search_option.keyword)
	        )
        else:
	        query = query.filter(
	            Q(HBL_NO__icontains=search_option.keyword)
	            | Q(MBL_NO__icontains=search_option.keyword)
	            | Q(lookup_id=search_option.keyword)
	        )

As observed, I retrieved the booking_ids through a separate query operation first. Then, if there are booking_ids, we perform the search using only the oe_info foreign key id without needing a JOIN operation. In summary, we utilized two straightforward search queries without any JOIN operations to achieve the same search function. The outcome is remarkably positive, with the overall function performance improving from 7 seconds to 100 ms. That's quite impressive.

Key Takeaways

After implementing this enhancement, I've gained some insights to share with all of you. SQL JOIN is a costly operation; while it's commonly used to address the N+1 problem, we should strive to minimize its usage whenever possible. As demonstrated in this case, employing multiple SQL operations without JOIN can yield better performance, especially when both tables contain a significant number of records.