Background

As software engineers, we sometimes need to write a custom command to run a data migration against the database. I faced such a situation a few days ago: the code couldn't run successfully in production because of performance issues. The main reason was that the dataset is relatively small when developing locally, so performance was never taken into account.

This post is a reminder to myself not to make the same mistake again, and to change how I approach such problems in the future.

Before

We use Python & Django for the sample code here. We fetch all the subscriptions from the subscription model and want to copy part of each subscription's info onto its task. So after selecting all objects from the model, we build a new task model instance for each subscription inside a for loop, and finally use bulk update to write all the data in one go.

Everything seems straightforward and simple, doesn't it? However, there's an underlying issue. We failed to account for the fact that in a production environment, the subscription model can contain millions of rows. Iterating the queryset caches every subscription in memory, and holding all the new task instances in the tasks_to_update list on top of that can lead to memory problems.

from django.core.management.base import BaseCommand

from model import TaskSubscriptionModel, CarrierTaskModel
from managers import TaskManager


class Command(BaseCommand):  # pragma: no cover
    def handle(self, *args, **options):
        # Evaluating this queryset loads every subscription into memory
        subscriptions = TaskSubscriptionModel.objects.all()

        tasks_to_update = []
        for subscription in subscriptions:
            task_id = subscription.task.id

            task = CarrierTaskModel(
                id=task_id,
                subscriber=subscription.subscriber,
                lookup_id=subscription.lookup_id,
                tags=subscription.tags,
                expire_time=subscription.expire_time,
            )
            tasks_to_update.append(task)

        # One giant list, one giant update
        TaskManager().bulk_update(
            tasks_to_update,
            fields=["subscriber", "lookup_id", "tags", "expire_time"],
        )

After

Now, how can we address this issue? The solution is pagination. We divide the subscriptions into smaller pages and then update the data one page at a time. In Django, the Paginator class comes in handy for exactly this. Here's an example.

from django.core.management.base import BaseCommand
from django.core.paginator import Paginator

from model import TaskSubscriptionModel, CarrierTaskModel
from managers import TaskManager


class Command(BaseCommand):  # pragma: no cover
    def handle(self, *args, **options):
        subscriptions = TaskSubscriptionModel.objects.all()
        # Split the queryset into pages of 5,000 subscriptions each
        paginator = Paginator(subscriptions, 5000)

        for num in paginator.page_range:
            page_obj = paginator.get_page(num)
            tasks_to_update = []
            for subscription in page_obj.object_list:
                task_id = subscription.task.id

                task = CarrierTaskModel(
                    id=task_id,
                    subscriber=subscription.subscriber,
                    lookup_id=subscription.lookup_id,
                    tags=subscription.tags,
                    expire_time=subscription.expire_time,
                )
                tasks_to_update.append(task)

            # Only one page of tasks is held in memory per update
            TaskManager().bulk_update(
                tasks_to_update,
                fields=["subscriber", "lookup_id", "tags", "expire_time"],
            )

With this approach, we process 5,000 subscriptions per iteration, keeping memory usage bounded throughout the process.
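As a side note, if counting pages up front feels heavy (Paginator issues a COUNT query, plus one OFFSET/LIMIT query per page), Django's QuerySet.iterator() can achieve a similar effect by streaming rows from the database in chunks. Here's a minimal sketch assuming the same models and TaskManager as above; BATCH_SIZE is a constant I've introduced for illustration.

from django.core.management.base import BaseCommand

from model import TaskSubscriptionModel, CarrierTaskModel
from managers import TaskManager

BATCH_SIZE = 5000


class Command(BaseCommand):  # pragma: no cover
    def handle(self, *args, **options):
        tasks_to_update = []
        # iterator(chunk_size=...) streams rows instead of caching the
        # whole result set, so memory usage stays flat
        for subscription in TaskSubscriptionModel.objects.all().iterator(
            chunk_size=BATCH_SIZE
        ):
            tasks_to_update.append(
                CarrierTaskModel(
                    id=subscription.task.id,
                    subscriber=subscription.subscriber,
                    lookup_id=subscription.lookup_id,
                    tags=subscription.tags,
                    expire_time=subscription.expire_time,
                )
            )
            if len(tasks_to_update) >= BATCH_SIZE:
                # Flush a full batch, then start collecting the next one
                TaskManager().bulk_update(
                    tasks_to_update,
                    fields=["subscriber", "lookup_id", "tags", "expire_time"],
                )
                tasks_to_update = []

        if tasks_to_update:
            # Flush the final, partially filled batch
            TaskManager().bulk_update(
                tasks_to_update,
                fields=["subscriber", "lookup_id", "tags", "expire_time"],
            )

Either way, the key idea is the same: never materialize the full table in memory at once.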

Summary

The next time you need to handle a data migration with a command, remember to paginate the data instead of loading everything into memory in a single loop. This will help you avoid running into the same performance issues.