Taiker

HashiCorp Vault + Kubernetes: Create External Secrets

Paul Liang — Sat, 25 Apr 2026 17:20:21 GMT

Background

Recently, I decided to set up a Kubernetes environment on my old MacBook Air to host my personal web service. As someone new to Kubernetes, I quickly realized that managing secrets properly is one of the most critical yet challenging aspects of running a production-read cluster.

During my research, I discovered that while Kubernetes does provide a built-in Secret mechanism, it's not sufficient for production use case. This led me down the path of exploring external secret management solutions, ultimately landing on HashiCorp Vault combined with External Secrets Operator (ESO).

This post documents my journey and serves as a practical guide for anyone looking to implement a secure, production-grade secret management solution in Kubernetes.

Why we use Vault?

For everyone new to the Kubernetes, you may ask: "Why do we need to install Vault? Doesn't Kubernetes already support secrets?" The short answer is yes, Kubernetes does provide a basic secret storage solution, but it's too basic and unsafe to use in production environments. Let me show you an example to explain why we need Vault or other external secret engine solutions.

apiVersion: v1
kind: Secret
metadata:
  name: my-secret
type: Opaque
data:
  username: YWRtaW4=         # admin
  password: c3VwZXJzZWNyZXQ= # supersecret

The problem? Kubernetes stores these secrets in etcd using only base64 encoding. Base64 is not encryption - it's just encoding. Anyone with access to etcd or the secret manifest can easily decode it:

echo "YWRtaW4=" | base64 -d
# Output: admin

This is extremely dangerous! That's why you need to enable "Encryption at Rest" if you're using Kubernetes Secrets. After you enabling Kubernetes Secret:

from: Secret → base64 → store to etcd
to: Secret → AES encrypt → store etcd

so if Encryption at Rest solves the storage problem, why do we still need Vault? Fair question! For small teams or personal projects, Kubernetes Secrets with Encryption at Rest might be sufficient. However, Vault provides enterprise-grade features that go far beyond just encrypting data at rest:

Dynamic Secret Generation: Vault can generate database credentials, API keys, and certificates on-demand with automatic expiration. This means secrets are short-lived an automatically rotated, reducing the risk window if credentials are compromised.
Advanced Access Control: While Kubernetes uses RBAC (Role-Based Access Control), Vault's policy-based access control is more granular and flexible. You can define policies like "App A can only read secrets from paht X between 9 AM - 5 PM" or "Service B can generate MySQL credentials but only for 1 hour."

3. Centralized Secret Management: If you're running multiple Kubernetes clusters or have services outside of Kubernetes (VMs, cloud functions,
+ etc.), Vault provides a single source of truth for all your secrets across your entire infrastructure.

4. Audit Logging: Vault maintains detailed audit logs of who accessed which secrets and when, which is crucial for compliance requirements.

5. Secret Versioning and Rollback: Unlike Kubernetes Secrets, Vault keeps a full version history and allows you to rollback to previous versions if needed.

From an operational perspective, these capabilities are why most companies choose Vault over the native Kubernetes secret solution for production workloads.

Solutions

After deciding to use Vault, we need to understand how to integrate it with Kubernetes. There are two main approaches:

Vault Sidecar Injection (Vault Agent Injector)
External Secrets Operator (ESO)

What is a Kubernetes Operator?

A Kubernetes Operator is a software extension that uses Custom Resources to manage applications and their components. Think of it as an automated
administrator that continuously monitors and manages specific resources in your cluster.

Comparison: Sidecar Injection vs External Secrets Operator

Category	Sidecar Injection	External Secrets Operator
Secret Location	Only in Vault + Pod memory	Stored in K8s Secret (etcd)
Security Level	Higher	Medium
etcd Exposure	Not stored in etcd	Stored in etcd
GitOps Friendly	Medium	Very
Resource Overhead	Extra container per Pod	One controller only
App Code Change	None	None
Secret Rotation	Automatic live update	Depends on refresh interval
Startup Complexity	Higher	Lower
Debugging	Harder	Easier
Vault Coupling	Tight	Loose
Multi-cluster Scaling	Harder	Easier
Performance Impact	More overhead	Minimal

How Each Solution works

Sidecar Injection:

Vault Agent runs as a sidecar container alognside your application pod
Secrets are fetched from Vault and written to a shared volume at /vault/secrets/...
Your application reads secrets directly from these files
Secrets NEVER touch etcd
Best for: Highly regulated environments (financial systems, healthcare, government)

External Secrets Operator:

ESO controller watches ExternalSecret custom resources
Fetch secrets from Vault and syncs them into native Kubernetes Secrets
Secrets are stored in etcd (encrypted if Encryption at Rest is enabled)
Applications consume secrets as standard Kubernetes Secrets (env vars or mounted volumes)
Best for: Developer-friendly environments, SaaS systems, teams using GitOps

When to Choose External Secrets Operator

Choose ESO if you:

Use ArgoCD or other GitOps tools
Want simplicity and faster onboarding for developers
Manage many microservices
Prefer Kubernetes-native secret UX (developers don't need to know about Vault)

In this post we will use the ESO and Vault to complete the whole external secret engine solution. The data flow would looks like following:

Vault (Encrypted storage)
        ↓
ESO read the secret and sync
        ↓
Kubernetes Secret
        ↓
etcd
        ↓
Pod

Installation

Step 1: Install Vault

We'll use the official HashiCorp Helm chart to install Vault in standalone mode.

$ helm repo add hashicorp https://helm.releases.hashicorp.com
$ helm repo update
$ kubectl create namespace vault

$ helm install vault hashicorp/vault \
  --namespace vault \
  --set "server.dev.enabled=false"

What this does:

server.dev.enabled=false ensures Vault runs in standalone mode (not dev mode)
Standalone mode uses persistent storage and requires manual initialization/unsealing
Dev mode is insecure (auto-unseals, stores data in memory) and should only be used for local testing.

Verify Installation

kubectl get pods -n vault

# Expected output:
# NAME      READY   STATUS    RESTARTS   AGE
# vault-0   0/1     Running   0          30s

Note: The pod shows 0/1 READY because Vault starts in a sealed state. This is a security feature - Vault needs to be initialized and unsealed before it can serve requests.

Step 2: Initialization + Unseal

When Vault first starts, it's in a sealed state. This is a critical security feature:

Sealed = Locked

Vault's encryption keys are themselves encrypted
Cannot decrypt any data
All API operations blocked (except unseal and status)

Unsealed = Unlocked

Encryption keys are available in memory
Can read / write secrets
Fully operational

Why seal? Even if an attacker gains access to Vault's storage (disk/etcd), they cannot read any secrets without the unseal keys.

You will got the unseal key and root toke after executing the operator init. You should store them in other place and if you lost them you must delete vault storage and reinitialize vault for unseal again which may cause all secrets will be lost.

# login to the vault pod
$ kubectl exec -it -n vault vault-0 -- sh
$ vault operator init

Vault generated:

5 Unseal Keys - Using Shamir's Secret Sharing algorithm
1 Root Token - The initial admin token with full permissions

Unsealing requires providing 3 different unseal keys. Run this command 3 times:

# Unsealing requires providing **3 different unseal keys**. Run this command 3 times:
$ vault operator unseal

# After the third key, Vault is unsealed and ready to use!
$ vault login

$ vault status
Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false
Total Shares    5
Threshold       3
Version         1.21.2
Build Date      2026-01-06T08:33:05Z
Storage Type    file
Cluster Name    vault-cluster-ac3113bf
Cluster ID      629983de-2ee2-1ad6-35d2-43773d93efca
HA Enabled      false

You should double check the status is correct (Initialized: true, Sealed: false)

Step 3: Install External Secrets Operator

Now that Vault is running, we need to install the External Secrets Operator (ESO) which will sync secrets from Vault into Kubernetes Secrets.

We'll use the official ESO Helm chart:

$ helm repo add external-secrets https://charts.external-secrets.io
$ helm repo update

$ kubectl create namespace external-secrets

$ helm install external-secrets external-secrets/external-secrets \
  -n external-secrets

Verify Installation

$ kubectl get pods -n external-secrets

# Expected output (wait until all pods are Running):
# NAME                                                READY   STATUS    RESTARTS   AGE
# external-secrets-xxxxxxxxxx-xxxxx                   1/1     Running   0          60s
# external-secrets-cert-controller-xxxxxxxxxx-xxxxx   1/1     Running   0          60s
# external-secrets-webhook-xxxxxxxxxx-xxxxx           1/1     Running   0          60s

ESO deploys three components:

external-secrets: The main controller that watches ExternalSecret resources and syncs secrets from Vault into Kubernetes Secrets.
external-secrets-cert-controller: Manages the TLS certificates used by the webhook server.
external-secrets-webhook: A validating/mutating webhook that validates ESO custom resources before they are accepted by the Kubernetes API server.

Step 4: Enable the Secret Engine in Vault

Before enabling the secret engine, you may ask: what does KV mean, and what's the difference between v1 and v2?

KV stands for Key-Value, and it's the most fundamental secret engine in Vault — essentially a secure key-value store. There are two versions:

KV v1: Simple key-value storage. No version history, no rollback. Overwriting a secret destroys the previous value permanently.
KV v2: Adds versioning and soft delete. Every write creates a new version, and deleted secrets can be recovered. This is the recommended choice for production.

$ kubectl exec -it -n vault vault-0 -- sh
$ vault login 

$ vault secrets enable -path=secret kv-v2
# Success! Enabled the kv-v2 secrets engine at: secret/

The -path=secret flag defines the mount path — this is the prefix you'll use when reading and writing secrets (e.g., `secret/myapp/config`). You can name it anything, but secret is the conventional default.

Now write a test secret to verify the engine is working:

$ vault kv put secret/myapp/config username="admin" password="supersecret"
# == Secret Path ==
# secret/data/myapp/config
#
# ======= Metadata =======
# Key                Value
# ---                -----
# created_time       2026-04-25T00:00:00.000000000Z
# custom_metadata    
# deletion_time      n/a
# destroyed          false
# version            1

Read it back to confirm:

$ vault kv get secret/myapp/config
# == Secret Path ==
# secret/data/myapp/config
#
# ======= Data =======
# Key         Value
# ---         -----
# password    supersecret
# username    admin

Notice that Vault internally stores the secret under secret/data/myapp/config (it inserts /data/ automatically for KV v2). This matters later when you configure ESO to reference the path.

Step 5: Enable Kubernetes Auth

Before moving forward, it's worth understanding why this step exists at all.

When ESO needs to fetch a secret from Vault, it must first provide its identity - Vault doesn't just hand out secrets to anyone. The Kubernetes
auth method lets Vault trust Kubernetes ServiceAccount tokens as a form of identity. Here's the flow:

ESO Pod
  → presents its ServiceAccount JWT to Vault
  → Vault calls Kubernetes TokenReview API to validate the JWT
  → Kubernetes confirms: "yes, this is the external-secrets ServiceAccount"
  → Vault grants access based on the bound role

For Vault to call the kubernetes TokenReview API, it needs three things:

kubernetes_host
token_reviewer_jwt – a ServiceAccount token that vault itself uses to call the TokenReview API.
kubernetes_ca_cert – the CA certificate to verify the kubernetes API sever's TLS certificate.

Fortunately, every pod in kubernetes (including the Vault pod) automatically has all three mounted at /var/run/secrets/kubernetes.io/serviceaccount/, so we can read them directly from inside the Vault pod.

First, enable the Kubernetes auth method:

$ kubectl exec -it -n vault vault-0 -- sh
$ vault login 

$ vault auth enable kubernetes
# Success! Enabled kubernetes auth method at: kubernetes/

Then configure it:

$ vault write auth/kubernetes/config \
    token_reviewer_jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
    kubernetes_host="https://kubernetes.default.svc:443" \
 kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# Success! Data written to: auth/kubernetes/config

Breaking down each parameter:

token_reviwer_jwt: The ServiceAccount JWT of the Vault pod itself. Vault uses this token to authenticate against the Kubernetes API when it calls TokenReview to validate incoming requests.
kubernetes_host: The cluster-internal address of the Kubernetes API server. kubernetes.default.svc is the stable DNS name that resolves to the API server from within the cluster.
Kubernetes_ca_cert: The CA certificate used to verify the TLS certificate presented by the Kubernetes API server. The @ prefix tells Vault to read the value from a file path.

Verify the configuration was applied:

$ vault read auth/kubernetes/config
# Key                       Value
# ---                       -----
# disable_iss_validation    true
# disable_local_ca_jwt      false
# issuer                    n/a
# kubernetes_host           https://kubernetes.default.svc:443
# pem_keys                  []

Step 6: Create the Vault Policy

A Vault policy defines what a token is allowed to to – which paths it can access and what operation it can perform (read, write, list, delete, etc.). Without a policy, even an authenticated token has zero permissions.

ESO only need to read secrets from Vault, so we'll create a minimal read-only policy scoped to the secret/ path we enabled in step 4.

Still inside the Vault pod, create a policy file and apply it:

$ vault policy write eso-policy - <

Breaking down the two path rules:

secret/data/*: This is where the actual secret values live in KV v2. The read capability lets ESO fetch any secret under this mount.
secret/metadata/*: This is where KV v2 stores version metadata. list lets ESO enumerate secrets; read lets it inspect version history. ESO needs this to detect when a secret has been updated.

Verify the policy was created:

$ vault policy read eso-policy
# path "secret/data/*" {
#   capabilities = ["read"]
# }
#
# path "secret/metadata/*" {
#   capabilities = ["read", "list"]
# }

`Step 7: Create the Vault Role (Binding to the ServiceAccount)`

We have a policy (eso-policy) that defines what ESO is allowed to do, but Vault still doesn't know who is allowed to use it. That's what a Vault role does – it binds a Kubernetes ServiceAccount to a policy, completing the trust chain:

Kubernetes ServiceAccount  →  Vault Role  →  Vault Policy  →  Secret Access

First, let's confirm the ServiceAccount that ESO created during installation:

$ kubectl get serviceaccount -n external-secrets

# NAME                                       SECRETS   AGE
# external-secrets                           0         10m
# external-secrets-cert-controller           0         10m
# external-secrets-webhook                   0         10m

The main controller uses the external-secrets ServiceAccount in the external-secrets namespace. That's the identity we need to bind.

Now create the Vault role inside the Vault Pod:

$ vault write auth/kubernetes/role/eso-role \
    bound_service_account_names=external-secrets \
    bound_service_account_namespaces=external-secrets \
    policies=eso-policy \
    ttl=1h
# Success! Data written to: auth/kubernetes/role/eso-role

Breaking down each parameter:

bound_service_account_names: Only tokens belonging to the external-secrets ServiceAccount are allowed to authenticate with this role.
bound_service_account_namespaces: Further restricts the scope to the external-secrets namespace. A ServiceAccount with the same name in a different namespace is still denied.
policies: The policy (or policies, comma-separated) to attach. Here we bind eso-policy from Step 6.
ttl: How long the issued Vault token is valid. After 1 hour, ESO must re-authenticate to get a new token.

Verify the role was created correctly:

$ vault read auth/kubernetes/role/eso-role
# Key                                 Value
# ---                                 -----
# bound_service_account_names         [external-secrets]
# bound_service_account_namespaces    [external-secrets]
# policies                            [eso-policy]
# ttl

`Step 8: Create the ClusterSecretStore`

A ClusterSecretStore is a cluster-wide ESO resource that tells ESO how to connect to Vault – which address to reach, how to authenticate, and which secret engine to use. Think of it as the bridge configuration between ESO and Vault.

The difference between SecretStore and ClusterSecretStore:

SecretStore: Namespace-scoped. Only ExternalSecret resources in the same namespace can use it.
ClusterSecretStore: Cluster-scoped. Any ExternalSecret in any namespace can reference it — ideal when multiple teams or apps share the same Vault backend.

Create the manifest and apply it:

apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "http://vault.vault.svc:8200"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "eso-role"
          serviceAccountRef:
            name: "external-secrets"
            namespace: "external-secrets"

Breaking down the key fields:

server: The in-cluster address of Vault. vault.vault.svc resolves to the Vault service in the vault namespace.
path: The KV mount path we enabled in Step 4.
version: "v2": Tells ESO this is a KV v2 engine, so it will automatically insert /data/ into the path when fetching secrets.
auth.kubernetes.mountPath: The auth method mount path we enabled in Step 5 (kubernetes).
auth.kubernetes.role: The Vault role we created in Step 7 (eso-role).
serviceAccountRef: The ServiceAccount ESO will use to authenticate with Vault. This must match what we bound in the Vault role.

Verify the ClusterSecretStore is ready:

$ kubectl get clustersecretstore vault-backend

# NAME            AGE   STATUS   CAPABILITIES   READY
# vault-backend   30s   Valid    ReadWrite      True

The READY: True and STATUS: Valid confirm that ESO successfully connected to Vault and authenticated using the Kubernetes auth method. If you see Invalid here, double-check the Vault address, role name, and ServiceAccount reference.

The full pipeline is now working end-to-end:

Vault secret (secret/myapp/config)
        ↓  ESO reads via ClusterSecretStore
Kubernetes Secret (myapp-secret)
        ↓
Pod consumes via env vars or mounted volume

`Summary`

In this post, we built a complete, production-oriented secret management pipeline on Kubernetes from scratch. Here is a recap of what we covered:

Why Kubernetes Secrets alone aren't enough — base64 encoding is not encryption, and native Secrets lack dynamic generation, granular access control, audit logging, and versioning.
Why HashiCorp Vault — it addresses all of those gaps with enterprise-grade features, while remaining compatible with Kubernetes through its auth methods.
ESO over Sidecar Injection — we chose the External Secrets Operator for its developer-friendly UX, GitOps compatibility, and Kubernetes-native secret consumption model.

The 8 steps we walked through, and how they connect:





Step
What we did
Why it matters




1
Install Vault via Helm
Standalone mode with persistent storage


2
Initialize and unseal Vault
Required before Vault can serve any requests


3
Install External Secrets Operator
The controller that syncs Vault secrets into Kubernetes


4
Enable KV v2 secret engine
Where secrets are stored in Vault, with versioning support


5
Enable Kubernetes auth
Lets Vault validate Kubernetes ServiceAccount tokens


6
Create a Vault policy
Defines what ESO is allowed to read


7
Create a Vault role
Binds the ESO ServiceAccount to the policy


8
Create ClusterSecretStore + ExternalSecret
Wires ESO to Vault and syncs secrets into Kubernetes



This setup covers a personal cluster or small team environment well. For larger production deployments, the natural next steps would be enabling Vault High Availability, automating unseal with Auto Unseal (e.g., AWS KMS or GCP Cloud KMS), and tightening policies to specific app paths rather than the broad secret/* wildcard used here.

Step	What we did	Why it matters
1	Install Vault via Helm	Standalone mode with persistent storage
2	Initialize and unseal Vault	Required before Vault can serve any requests
3	Install External Secrets Operator	The controller that syncs Vault secrets into Kubernetes
4	Enable KV v2 secret engine	Where secrets are stored in Vault, with versioning support
5	Enable Kubernetes auth	Lets Vault validate Kubernetes ServiceAccount tokens
6	Create a Vault policy	Defines what ESO is allowed to read
7	Create a Vault role	Binds the ESO ServiceAccount to the policy
8	Create ClusterSecretStore + ExternalSecret	Wires ESO to Vault and syncs secrets into Kubernetes



First Spring Boot Application - Todo List (3)
Paul Liang — Sun, 14 Dec 2025 08:14:31 GMT
Background
In the  third post of this series, I want to talk about unit testing, one of the most important part of the software development I thought.
Creaating a developer-friendly unit testing environment and writing high-quality test cases are both essential for building a robust and maintainable SaaS product. Over time, these practices help ensure your service remains reliable as it grows in complexity.
In this post, I'll not only cover how to write unit tests in Java Spring Boot, but also how to set up a clean and efficient testing environment that makes writing and running tests easier for the whole team.
Installation
Make sure you have this (Spring Boot usually adds it already):

    org.springframework.boot
    spring-boot-starter-test
    test
This includes:
JUnit 5
Mockito
AssertJ
Spring Test
Next I want to introduce the H2, H2 is a lightweight relational database written entirely in Java. It belongs to a disposable and in-memory database that exists only for your tests. In Spring Boot, H2 lets you test real database behavior without the cost of running MySQL/Postgres.

    com.h2database
    h2
    test
We also need to add the following application setting for our testing environment.
spring:
  datasource:
    url: jdbc:h2:mem:todolist-test;MODE=PostgreSQL
    driver-class-name: org.h2.Driver
    username: sa
    password:

  liquibase:
    enabled: false

  jpa:
    hibernate:
      ddl-auto: create-drop
    show-sql: false
    properties:
      hibernate:
        format_sql: false

logging:
  level:
    root: WARN

    # Spring framework noise
    org.springframework: WARN

    # Hibernate SQL + binding noise
    org.hibernate.SQL: WARN
    org.hibernate.type.descriptor.sql: WARN

    # JPA bootstrap logs
    org.springframework.orm.jpa: WARN
src/test/resources/application-test.yml
You can change the mode in the database url to your target database, I used the PostgreSQL here. Remenber to add the @ActiveProfiles("test") to the TodolistApplicationTests.java to tell the spring to use the application-test.yml to overwrite the application.yml, which means can load the testing config for unit test.
package com.example.todolist;

import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.ActiveProfiles;

@SpringBootTest
@ActiveProfiles("test")
class TodolistApplicationTests {

	@Test
	void contextLoads() {
	}

}
TodolistApplicationTests.java
Implementation
In this section, we will walk through how to write the unit test in Java Spring Boot. In previous posts we have implemented the controller, service and repository part in Spring. Next we will write the unit tests for each of them, here we go:
Repository
package com.example.todolist.repository;

import com.example.todolist.model.Todo;
import org.springframework.data.jpa.repository.JpaRepository;

public interface TodoRepository extends JpaRepository {}
TodoRepository.java
package com.example.todolist.model;

import jakarta.persistence.*;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

import java.time.Instant;

@Entity
@Table(name = "todolist")
@Getter
@Setter
@NoArgsConstructor
public class Todo {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(nullable = false)
    private String title;

    private String description;

    @Column(nullable = false)
    private boolean completed = false;

    // UTC-safe creation timestamp
    @Column(nullable = false, updatable = false,
            columnDefinition = "TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP")
    private Instant createdAt = Instant.now();

    // Custom constructor for creating new todos
    public Todo(String title, String description) {
        this.title = title;
        this.description = description;
    }

}
Todo.java
We need to define our test plans based the todo model and repository. We are testing:
JPA entity mapping
Auto-generated ID
@Column(nullable = false) constraints
Default values (completed, createdAt)
CRUD behavior (save, find, update, delete)
The whole test code for the repository is
package com.example.todolist.repository;

import com.example.todolist.model.Todo;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.orm.jpa.DataJpaTest;
import org.springframework.test.context.ActiveProfiles;

import java.util.Optional;

import static org.assertj.core.api.Assertions.assertThat;

@DataJpaTest
@ActiveProfiles("test")
class TodoRepositoryTest {

    @Autowired
    private TodoRepository todoRepository;

    @Test
    void shouldSaveAndFindTodo() {
        // Arrange
        Todo todo = new Todo("Write tests", "Learn H2 with Spring Boot");

        // Act
        Todo saved = todoRepository.save(todo);

        // Assert
        assertThat(saved.getId()).isNotNull();
        assertThat(saved.isCompleted()).isFalse();
        assertThat(saved.getCreatedAt()).isNotNull();

        Optional found = todoRepository.findById(saved.getId());

        assertThat(found).isPresent();
        assertThat(found.get().getTitle()).isEqualTo("Write tests");
    }

    @Test
    void shouldUpdateTodo() {
        // Arrange
        Todo todo = todoRepository.save(new Todo("Old title", "Old desc"));

        // Act
        todo.setTitle("New title");
        todo.setCompleted(true);
        Todo updated = todoRepository.save(todo);

        // Assert
        assertThat(updated.getTitle()).isEqualTo("New title");
        assertThat(updated.isCompleted()).isTrue();
    }

    @Test
    void shouldDeleteTodo() {
        // Arrange
        Todo todo = todoRepository.save(new Todo("Delete me", "Temp"));

        // Act
        todoRepository.deleteById(todo.getId());

        // Assert
        assertThat(todoRepository.findById(todo.getId())).isEmpty();
    }

    @Test
    void shouldEnforceNotNullConstraint() {
        // Arrange
        Todo todo = new Todo();
        todo.setDescription("Missing title");

        // Act & Assert
        assertThat(org.junit.jupiter.api.Assertions.assertThrows(
            org.springframework.dao.DataIntegrityViolationException.class,
            () -> {
                todoRepository.save(todo);
                todoRepository.flush();
            }
        )).isNotNull();
    }
}
The first thing I want to talk about is package com.example.todolist.repository;, "why in both TodoRepositoryTest.java and TodoRepository.java have package com.example.todolist.repository;? ", and it brings very important concept in the JAVA:
💡
Java does not care about folders — it cares about packages. Test classes should live in the same package as the code they test.
Usually I use the AAA (Triple-A) concept to write the unit test no matter which programming language I used. What Is the AAA (Triple-A) Concept? It’s a test structure pattern, not a framework feature.
“Set up → Do the thing → Check the result”
There are three parts you need to complete in your unit test: 
Arrange
Act
Assert
That's why this concept called AAA (Triple-A). Remember, if Arrange, Act, and Assert cannot be clearly separated, reconsider the responsibility of the code being tested.
Now we can run the mvn clean test command to check if the testing works correctly.
➜ mvn clean test
...
...
...
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.352 s -- in com.example.todolist.TodolistApplicationTests
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  3.229 s
[INFO] Finished at: 2025-12-13T23:09:29+08:00
[INFO] ------------------------------------------------------------------------
Ok, everything works well now. Next we will complete the unit tests for controller and service.
Service
package com.example.todolist.service;

import com.example.todolist.model.Todo;
import com.example.todolist.repository.TodoRepository;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;

import java.util.List;
import java.util.Optional;

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.*;

@ExtendWith(MockitoExtension.class)
class TodoServiceTest {

    @Mock
    private TodoRepository todoRepository;

    @InjectMocks
    private TodoService todoService;

    @Test
    void getAllTodos_shouldReturnAllTodos() {
        // Arrange
        Todo todo1 = new Todo("Task 1", "Desc 1");
        Todo todo2 = new Todo("Task 2", "Desc 2");

        when(todoRepository.findAll()).thenReturn(List.of(todo1, todo2));

        // Act
        List result = todoService.getAllTodos();

        // Assert
        assertThat(result).hasSize(2);
        assertThat(result).extracting(Todo::getTitle)
                .containsExactly("Task 1", "Task 2");

        verify(todoRepository).findAll();
    }

    @Test
    void createTodo_shouldSaveAndReturnTodo() {
        // Arrange
        Todo todo = new Todo("New Task", "New Desc");

        Todo savedTodo = new Todo("New Task", "New Desc");
        savedTodo.setId(1L);

        when(todoRepository.save(any(Todo.class))).thenReturn(savedTodo);

        // Act
        Todo result = todoService.createTodo(todo);

        // Assert
        assertThat(result.getId()).isEqualTo(1L);
        assertThat(result.getTitle()).isEqualTo("New Task");
        assertThat(result.isCompleted()).isFalse();

        verify(todoRepository).save(todo);
    }

    @Test
    void updateTodo_shouldUpdateExistingTodo() {
        // Arrange
        Long todoId = 1L;

        Todo existing = new Todo("Old Title", "Old Desc");
        existing.setId(todoId);

        Todo updated = new Todo("New Title", "New Desc");
        updated.setCompleted(true);

        when(todoRepository.findById(todoId)).thenReturn(Optional.of(existing));
        when(todoRepository.save(any(Todo.class))).thenAnswer(invocation -> invocation.getArgument(0));

        // Act
        Todo result = todoService.updateTodo(todoId, updated);

        // Assert
        assertThat(result.getTitle()).isEqualTo("New Title");
        assertThat(result.getDescription()).isEqualTo("New Desc");
        assertThat(result.isCompleted()).isTrue();

        verify(todoRepository).findById(todoId);
        verify(todoRepository).save(existing);
    }

    @Test
    void updateTodo_shouldThrowException_whenTodoNotFound() {
        // Arrange
        Long todoId = 99L;
        Todo updated = new Todo("Doesn't matter", "Nope");

        when(todoRepository.findById(todoId)).thenReturn(Optional.empty());

        // Act + Assert
        assertThatThrownBy(() -> todoService.updateTodo(todoId, updated))
                .isInstanceOf(RuntimeException.class)
                .hasMessage("Todo not found");

        verify(todoRepository).findById(todoId);
        verify(todoRepository, never()).save(any());
    }

    @Test
    void deleteTodo_shouldDeleteById() {
        // Arrange
        Long todoId = 1L;

        doNothing().when(todoRepository).deleteById(todoId);

        // Act
        todoService.deleteTodo(todoId);

        // Assert
        verify(todoRepository).deleteById(todoId);
    }
}
TodolistApplicationTests.java
First of all, @ExtendWith(MockitoExtension.class) this line tells JUnit 5 to enable Mockito (Required for @Mock / @InjectMocks) support for this test class. 
What's the difference between the @Mock and @injectMocks?

Annotation

What it does

@Mock

Creates a fake object (a mock)

@InjectMocks

Creates a real object and injects mocks into it
@Mock, Mockito creates a fake implementation of TodoRepository, no real logic runs unless you explicitly stub it. Example:
when(todoRepository.findAll())
    .thenReturn(List.of(todo1, todo2));
@InjectMocks, Mockito creates a real instance of TodoService, looks for fields annotated with @Mock then injects them into the service. For example
@InjectMocks
TodoService todoService;
is effectively to
TodoService todoService = new TodoService(mockTodoRepository);
For the verify()means "Assert that this method was called on the mock.". Essential for void methods and side effects.
Compared to Django/Python:

Python

Mockito

assert mock.called

verify(mock)

assert mock.call_count == 1

times(1)

assert not mock.called

never()
Controller
Here is the unit test code of controller:
package com.example.todolist.controller;

import com.example.todolist.model.Todo;
import com.example.todolist.service.TodoService;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.WebMvcTest;
import org.springframework.boot.test.mock.mockito.MockBean;
import org.springframework.http.MediaType;
import org.springframework.test.web.servlet.MockMvc;

import java.util.List;

import static org.hamcrest.Matchers.hasSize;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.ArgumentMatchers.eq;
import static org.mockito.Mockito.doNothing;
import static org.mockito.Mockito.when;
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.*;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;

@WebMvcTest(TodoController.class)
class TodoControllerTest {

    @Autowired
    private MockMvc mockMvc;

    @MockBean
    private TodoService todoService;

    @Autowired
    private ObjectMapper objectMapper;

    @Test
    void getAllTodos_shouldReturnTodoList() throws Exception {
        // Arrange
        Todo todo1 = new Todo("Task 1", "Desc 1");
        Todo todo2 = new Todo("Task 2", "Desc 2");

        when(todoService.getAllTodos()).thenReturn(List.of(todo1, todo2));

        // Act & Assert
        mockMvc.perform(get("/api/todos"))
                .andExpect(status().isOk())
                .andExpect(jsonPath("$", hasSize(2)))
                .andExpect(jsonPath("$[0].title").value("Task 1"))
                .andExpect(jsonPath("$[1].title").value("Task 2"));
    }

    @Test
    void createTodo_shouldReturnCreatedTodo() throws Exception {
        // Arrange
        Todo request = new Todo("New Task", "New Desc");

        Todo saved = new Todo("New Task", "New Desc");
        saved.setId(1L);

        when(todoService.createTodo(any(Todo.class))).thenReturn(saved);

        // Act & Assert
        mockMvc.perform(post("/api/todos")
                        .contentType(MediaType.APPLICATION_JSON)
                        .content(objectMapper.writeValueAsString(request)))
                .andExpect(status().isOk())
                .andExpect(jsonPath("$.id").value(1L))
                .andExpect(jsonPath("$.title").value("New Task"))
                .andExpect(jsonPath("$.completed").value(false));
    }

    @Test
    void updateTodo_shouldReturnUpdatedTodo() throws Exception {
        // Arrange
        Long todoId = 1L;

        Todo updated = new Todo("Updated Task", "Updated Desc");
        updated.setCompleted(true);
        updated.setId(todoId);

        when(todoService.updateTodo(eq(todoId), any(Todo.class)))
                .thenReturn(updated);

        // Act & Assert
        mockMvc.perform(put("/api/todos/{id}", todoId)
                        .contentType(MediaType.APPLICATION_JSON)
                        .content(objectMapper.writeValueAsString(updated)))
                .andExpect(status().isOk())
                .andExpect(jsonPath("$.title").value("Updated Task"))
                .andExpect(jsonPath("$.completed").value(true));
    }

    @Test
    void deleteTodo_shouldReturnOk() throws Exception {
        // Arrange
        Long todoId = 1L;
        doNothing().when(todoService).deleteTodo(todoId);

        // Act & Assert
        mockMvc.perform(delete("/api/todos/{id}", todoId))
                .andExpect(status().isOk());
    }
}
I want to explain a few common scenarios that are especially useful when writing unit tests.
The first one is:
// Act & Assert
mockMvc.perform(post("/api/todos")
.contentType(MediaType.APPLICATION_JSON)
.content(objectMapper.writeValueAsString(request)))
.andExpect(status().isOk())
.andExpect(jsonPath("$.id").value(1L))
.andExpect(jsonPath("$.title").value("New Task"))
.andExpect(jsonPath("$.completed").value(false));
 The logic behind it is "Simulating an HTTP POST request to your controller and asserting the HTTP response."
mockMvc = a fake HTTP client
.contentType(MediaType.APPLICATION_JSON) = sets the HTTP header
.content(objectMapper.writeValueAsString(request)) = set the request body.
The Assert Chain for couple of andExpect 
.andExpect(jsonPath("$.id").value(1L)) = inspects the JSON response body.
$ = root object
$.id = id field
Next one is:
doNothing().when(todoService).deleteTodo(todoId);
It means that "When deleteTodo(todoId) is called on this mock, do nothing.", you can think we mock and method with do nothing.
In this section, I show you how to use the JUnit + Mockito with H2 in memory database to complete the unit tests for the todo list application, and also share the key concept of how to write the good unit test to you. In the next section, I will show you couple of useful command in developing the unit tests.
Command
In this section, I want to show you couple of useful command in developing the unit tests.
Run the whole unit tests
$ mvn clean test
Run one test class
$ mvn test -Dtest=TodoServiceTest,TodoControllerTest
Run multiple test classes
$ mvn test -Dtest=TodoServiceTest,TodoControllerTest
Run a single test method
$ mvn test -Dtest=TodoServiceTest#updateTodo_shouldUpdateExistingTodo
If you use the vscode to develop the Java Spring Boot project, you can install the Extension Pack for Java which including the following packages:
Language Support for Java
Debugger for Java
Maven for Java
Test Runner for Java 
And you can navigate to you testing panel in the left side bar to trigger running the unit test via UI. 
Takeaway
In this post, I walk you though how to write high-quality unit tests using the Arrange–Act–Assert (AAA) pattern, step by step. Unit testing is a critical part of the modern software development life cycle (SDLC): it helps you validate behavior early, refactor with confidence, and keep regression from slipping into production. By adopting AAA as a consistent testing guideline, you can make your test easier to read, maintain and ultimately make your software projects more reliable over time.


First Spring Boot Application - Todo List (2)
Paul Liang — Sun, 30 Nov 2025 14:38:21 GMT
Background
Continuing from the previous post, where we built the database layer and Liquibase migrations, this article shows how to define REST APIs in our Spring Boot Todo application. We'll walk through creating controllers, connecting them to services and repositories, and exposing CRUD operations that interact cleanly with the database. 
Implementation
Today, we'll structure our API using the classic Spring Boot layered architecture: Controller, Service, and Repository. Each layer focues on a specific responsibility so the code stays clean, modular, and easy to maintain. This separation of concerns prevents business logic and persistence logic from mixing together, which greatly improves the scalability of the application.

  
      flowchart TD
      A[Controller\nREST Endpoints] --> B[Service\nBusiness Logic]
      B --> C[Repository\nData Access Layer]
      C --> D[Database\nPostgreSQL / Any RDBMS]
  
src/
 └── main/
     ├── java/
     │   └── com/example/todolist/
     │       ├── TodolistApplication.java
     │       ├── model/
     │       │   └── Todo.java
     │       ├── repository/
     │       │   └── TodoRepository.java
     │       ├── service/
     │       │   └── TodoService.java
     │       └── controller/
     │           └── TodoController.java
     └── resources/
         ├── application.properties
         └── static/   (optional)
Controller
The controller layer is responsible for defining the API endpoints of the applicaion. Its main purpose is to handle incoming HTTP requests and return the appropriate HTTP responses. In this layer, we design what APIs our service will expose and determine which HTTP methods (GET,POST,PUT,DELTE, etc.) should be used to follow RESTful conventions.
A controller also performs input validation and converts requst payloads into Java objects when necessary. After validating the request, the controller delegates the business logic to the service layer. Once the operation completes,  the controller prepares and returns a well-structured response, typically in JSON, back to client.
package com.example.todolist.controller;

import com.example.todolist.model.Todo;
import com.example.todolist.service.TodoService;
import org.springframework.web.bind.annotation.*;

import java.util.List;

@RestController
@RequestMapping("/api/todos")
public class TodoController {
    private final TodoService todoService;

    public TodoController(TodoService todoService) {
        this.todoService = todoService;
    }

    @GetMapping
    public List getAllTodos() {
        return todoService.getAllTodos();
    }

    @PostMapping
    public Todo createTodo(@RequestBody Todo todo) {
        return todoService.createTodo(todo);
    }

    @PutMapping("/{id}")
    public Todo updateTodo(@PathVariable Long id, @RequestBody Todo todo) {
        return todoService.updateTodo(id, todo);
    }

    @DeleteMapping("/{id}")
    public void deleteTodo(@PathVariable Long id) {
        todoService.deleteTodo(id);
    }
}
Here is our controller code, we will go through it.
private final TodoService todoService;

public TodoController(TodoService todoService) {
    this.todoService = todoService;
}
Spring Boot sees that the @RestController on the class and The constructor that requires TodoService, and Spring will automatically creates (or “injects”) a TodoService object and passes it into the constructor.
The following methods are just simple GET,POST,PUT, and DELETE methods, very intuitive.
Service
Next, we move on to the service layer. As mentioned earlier, this is where we place the core business logic of the application. In the context of our Todo application, the service acts as the bridge between the controller and the repository.
Inside the service class, we inject the TodoRepository using Spring Boot's dependency injection mechanism. With the repository available, we implement the methods required by the controller - such as creating, updating, querying and deleting todo items. The service ensures that the controller stays clean and focused on handling requests, while all business rules and orchestration happen here.
This separation not only keeps the codebase modular and easily to maintain, but also simplifies testing because logic can be unit-tested independently from web or database layer.
package com.example.todolist.service;

import com.example.todolist.model.Todo;
import com.example.todolist.repository.TodoRepository;
import org.springframework.stereotype.Service;
import java.util.List;

@Service
public class TodoService {
    private final TodoRepository todoRepository;

    public TodoService(TodoRepository todoRepository) {
        this.todoRepository = todoRepository;
    }

    public List getAllTodos() {
        return todoRepository.findAll();
    }

    public Todo createTodo(Todo todo) {
        return todoRepository.save(todo);
    }

    public Todo updateTodo(Long id, Todo updatedTodo) {
        return todoRepository.findById(id)
            .map(todo -> {
                todo.setTitle(updatedTodo.getTitle());
                todo.setDescription(updatedTodo.getDescription());
                todo.setCompleted(updatedTodo.isCompleted());
                return todoRepository.save(todo);
            })
            .orElseThrow(() -> new RuntimeException("Todo not found"));
    }

    public void deleteTodo(Long id) {
        todoRepository.deleteById(id);
    }
}
After reviewing the code, some of you may wonder where methods like setTitle, setDescription, and setCompleted are actually implemented. If we go back to the model layer, we can see that the class is annotated with Lombok‘s @Getter and @Setter. These annotations automatically generate the getter and setter methods for every field at compile time, which is why you don't see them explicitly defined in the source code.
💡
In Java, the @ symbol is also used for annotations, annotations are just labels. but frameworks like Spring scan annotations at startup and add behavior.
@Entity
@Table(name = "todolist")
@Getter
@Setter
@NoArgsConstructor
public class Todo {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(nullable = false)
    private String title;

    private String description;

    @Column(nullable = false)
    private boolean completed = false;

    // UTC-safe creation timestamp
    @Column(nullable = false, updatable = false,
            columnDefinition = "TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP")
    private Instant createdAt = Instant.now();

    // Custom constructor for creating new todos
    public Todo(String title, String description) {
        this.title = title;
        this.description = description;
    }

}
Repository
Now, let's look at the repository layer. Unlike Python and Django ORM, where you typically define model-specific query functions manually, Spring Boot makes this part extremely lightweight. All yu need to do is extend the JpaRepository interface, and Spring Data automatically generate the underlying implementation for you at runtime. 
package com.example.todolist.repository;

import com.example.todolist.model.Todo;
import org.springframework.data.jpa.repository.JpaRepository;

public interface TodoRepository extends JpaRepository {}
JpaRepository provides a full set of CRUD operaions out of the box, including:
* save()
* findById()
* findAll()
* deleteById()
* count()
This means you can focus on business logic without writing boilerplate database access code.
P.S.  explains two things:
Todo = emtity type: "This repository works with the Todo table"
Long = ID type: "The primary key type is Long"
Put everything together
Finally, we need to set up the entry point for the Spring Boot application. Once that is in place, I will show you how to use Maven to build and start the Spring Boot service from the command line.
package com.example.todolist;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class TodolistApplication {

	public static void main(String[] args) {
		SpringApplication.run(TodolistApplication.class, args);
	}

}
You can start the service by running the following command from the project's root directory. Just make sure you export your .env variables beforehand. To make things even easier, I've also prepared a Makefile so you can run these commands with simple shortcuts.
mvn spring-boot:run
# Makefile for Liquibase Migration Management
# Provides Django-like commands for database migrations

# Load environment variables from .env file
-include .env
export

.PHONY: help run makemigration migrate migrate-one migrate-to showmigrations rollback rollback-preview fake-migrate fake-migrate-to fake-migrate-preview

# Default target
help:
	@echo "=== Spring Boot Application Commands ==="
	@echo ""
	@echo "Application:"
	@echo "  make run                        - Start Spring Boot application (default port 8080)"
	@echo "  make run PORT=9090              - Start Spring Boot application on custom port"
	@echo ""
	@echo "=== Liquibase Migration Commands ==="
	@echo ""
	@echo "Migration Generation:"
	@echo "  make makemigration              - Generate new migration (auto-numbered)"
	@echo "  make makemigration NAME=example - Generate migration with custom name"
	@echo ""
	@echo "Migration Execution:"
	@echo "  make migrate                    - Apply all pending migrations"
	@echo "  make migrate-one                - Apply only the next pending migration"
	@echo "  make migrate-to NUM=0008        - Migrate to specific version"
	@echo "  make showmigrations             - Show migration status"
	@echo ""
	@echo "Rollback:"
	@echo "  make rollback COUNT=1           - Rollback N changesets (default: 1)"
	@echo "  make rollback-preview COUNT=1   - Preview rollback SQL"
	@echo ""
	@echo "Fake Migrations:"
	@echo "  make fake-migrate               - Mark all pending as executed"
	@echo "  make fake-migrate-to NUM=0008   - Mark up to version as executed"
	@echo "  make fake-migrate-preview       - Preview what would be marked"
	@echo ""

# Variables
CHANGES_DIR := src/main/resources/db/changelog/changes
MVN := mvn
NAME ?= auto_generated
COUNT ?= 1
PORT ?=

# Application Commands
run:
ifdef PORT
	@echo "Starting Spring Boot application on port $(PORT)..."
	@$(MVN) spring-boot:run -Dspring-boot.run.arguments="--server.port=$(PORT)"
else
	@echo "Starting Spring Boot application on default port (8080)..."
	@$(MVN) spring-boot:run
endif

# Auto-detect next migration number
LATEST_NUM := $(shell ls $(CHANGES_DIR) 2>/dev/null | grep -E '^[0-9]+' | sed 's/^0*//' | sed 's/[^0-9].*//' | sort -n | tail -1)
ifeq ($(LATEST_NUM),)
	NEXT_NUM := 1
else
	NEXT_NUM := $(shell echo $$(($(LATEST_NUM) + 1)))
endif
NEXT_FORMATTED := $(shell printf "%04d" $(NEXT_NUM))

# Migration Generation
makemigration:
	@echo "Generating migration $(NEXT_FORMATTED)_$(NAME).yaml..."
	@if [ ! -d "$(CHANGES_DIR)" ]; then \
		echo "Error: Directory $(CHANGES_DIR) does not exist"; \
		exit 1; \
	fi
	@FILEPATH="$(CHANGES_DIR)/$(NEXT_FORMATTED)_$(NAME).yaml"; \
	$(MVN) liquibase:diff -Dliquibase.diffChangeLogFile=$$FILEPATH; \
	if [ -f $$FILEPATH ]; then \
		echo "" >> $$FILEPATH; \
		echo "- changeSet:" >> $$FILEPATH; \
		echo "    id: tag-$(NEXT_FORMATTED)" >> $$FILEPATH; \
		echo "    author: taiker" >> $$FILEPATH; \
		echo "    changes:" >> $$FILEPATH; \
		echo "      - tagDatabase:" >> $$FILEPATH; \
		echo "          tag: \"$(NEXT_FORMATTED)\"" >> $$FILEPATH; \
		echo "✓ Migration created: $$FILEPATH"; \
		echo "✓ Tag $(NEXT_FORMATTED) added"; \
	else \
		echo "Error: Failed to generate migration file"; \
		exit 1; \
	fi

# Migration Execution
migrate:
	@echo "Applying all pending migrations..."
	@$(MVN) liquibase:update

migrate-one:
	@echo "Applying next pending migration..."
	@$(MVN) liquibase:updateCount -Dliquibase.count=1

migrate-to:
	@if [ -z "$(NUM)" ]; then \
		echo "Error: NUM parameter required. Usage: make migrate-to NUM=0008"; \
		exit 1; \
	fi
	@echo "Migrating to version $(NUM)..."
	@$(MVN) liquibase:updateToTag -Dliquibase.toTag=$(NUM)

showmigrations:
	@echo "Checking migration status..."
	@$(MVN) liquibase:status

# Rollback
rollback:
	@echo "Rolling back $(COUNT) changeset(s)..."
	@$(MVN) liquibase:rollback -Dliquibase.rollbackCount=$(COUNT)

rollback-preview:
	@echo "Previewing rollback of $(COUNT) changeset(s)..."
	@$(MVN) liquibase:rollbackSQL -Dliquibase.rollbackCount=$(COUNT)
	@echo ""
	@echo "Preview saved to: target/liquibase/migrate.sql"

# Fake Migrations
fake-migrate:
	@echo "Marking all pending migrations as executed (without running them)..."
	@$(MVN) liquibase:changeLogSync

fake-migrate-to:
	@if [ -z "$(NUM)" ]; then \
		echo "Error: NUM parameter required. Usage: make fake-migrate-to NUM=0008"; \
		exit 1; \
	fi
	@echo "Marking migrations up to $(NUM) as executed (without running them)..."
	@$(MVN) liquibase:changeLogSyncToTag -Dliquibase.toTag=$(NUM)

fake-migrate-preview:
	@echo "Previewing what would be marked as executed..."
	@$(MVN) liquibase:changeLogSyncSQL
	@echo ""
	@echo "Preview saved to: target/liquibase/migrate.sql"
You can use the make run to start the service, and you will find the service has been started and listened to the 8080 port.
$ make run
Starting Spring Boot application...
...
...
2025-11-30T17:39:23.018+08:00  INFO 62999 --- [todolist] [           main] o.h.e.t.j.p.i.JtaPlatformInitiator       : HHH000489: No JTA platform available (set 'hibernate.transaction.jta.platform' to enable JTA platform integration)
2025-11-30T17:39:23.019+08:00  INFO 62999 --- [todolist] [           main] j.LocalContainerEntityManagerFactoryBean : Initialized JPA EntityManagerFactory for persistence unit 'default'
2025-11-30T17:39:23.078+08:00  WARN 62999 --- [todolist] [           main] JpaBaseConfiguration$JpaWebConfiguration : spring.jpa.open-in-view is enabled by default. Therefore, database queries may be performed during view rendering. Explicitly configure spring.jpa.open-in-view to disable this warning
2025-11-30T17:39:23.183+08:00  INFO 62999 --- [todolist] [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port 8080 (http) with context path '/'
2025-11-30T17:39:23.186+08:00  INFO 62999 --- [todolist] [           main] c.example.todolist.TodolistApplication   : Started TodolistApplication in 6.397 seconds (process running for 6.501)
If you want the application to run on a different port, you can specify it when using the make run command, for example:
make run PORT=9090
Once the service is running, open your browser and navigate to:
http://localhost:8080/api/todos
or
http://localhost:9090/api/todos if you chose a custom port.
Since the database is still empty, the API will return an empty list, which means the setup is working correctly.
Deployment
If you want to deploy this application to another environment - such as a server, VM or even a cloud instance - and run it "like a real production service,” you first need to package the project into a JAR file. To do that, run the following Maven command:
export $(cat .env | xargs)
mvn clean package
Spring Boot will create:
$ target/todolist-0.0.1-SNAPSHOT.jar
[INFO] --- jar:3.4.2:jar (default-jar) @ todolist ---
[INFO] Building jar: /Users/taiker/dev/todolist/target/todolist-0.0.1-SNAPSHOT.jar
The version in your JAR name comes from your pom.xml
0.0.1-SNAPSHOT
So the output JAR will always follow:
-.jar
Then you can start the Sprint Boot service using Java:
$ java -jar target/todolist-0.0.1-SNAPSHOT.jar

# or run in different port

$ java -jar target/todolist-0.0.1-SNAPSHOT.jar –server.port=9090
Takeaway
In this post, we walked through how to build RESTful APIs in Spring Boot using the Controller, Service and Repository layers to complete a basic CRUD workflow for our Todo application. We also explored multiple ways to start the service - with and without the Makefile - and covered how to bild a JAR file and run the application like a real deployed service.
In the next post, we'll take the project one step further and focus on writing effective unit tests in Spring Boot, including how to test controllers, services, and repository behavior with clean, maintainable test patterns. 


First Spring Boot Application - Todo List (1)
Paul Liang — Mon, 24 Nov 2025 16:29:46 GMT
Background
After joining a new company, there are always many things to learn from scratch. Today, I want to walk your through how to build your first Java web application using Sprint Boot. In this post, most of the code will be generated with the help of AI, and I will focus on assembling a complete and practical foundation for web application development - including environment configuration with .env, unit testing, ORM integration and more.
This article will be the first in a series documenting how I collaborate with AI to build a fully functional to-do web application. In this initial post, we will focus specifically on managing database migrations in a Sprint Boot application using Liquibase.
Initiation 
At the beginning, we will use the Spring Initializr to help us initiate the basic content of the project. Go to 👉 https://start.spring.io. 
Project Metadata Explained:

Group: A unique identifier for your organization or project

Common pattern: com.companyname.projectname


Artifact: The name of your project’s build artifact (the JAR or WAR file name).

If artifact = demo, your build will produce a file like: demo-0.0.1-SNAPSHOT.jar


Name: The display name of the project (for humans).
Description: A short explanation of your project’s purpose. It appears in your generated pom.xml file for documentation.

Click Generate, unzip it, and open it in your IDE. Your project structure will look like:
➜ tree
.
├── HELP.md
├── mvnw
├── mvnw.cmd
├── pom.xml
├── src
│   ├── main
│   │   ├── java
│   │   │   └── com
│   │   │       └── example
│   │   │           └── todolist
│   │   │               └── TodolistApplication.java
│   │   └── resources
│   │       └── application.properties
│   └── test
│       └── java
│           └── com
│               └── example
│                   └── todolist
│                       └── TodolistApplicationTests.java
└── target
    ├── classes
    │   ├── application.properties
    │   └── com
    │       └── example
    │           └── todolist
    │               └── TodolistApplication.class
    └── test-classes
        └── com
            └── example
                └── todolist
                    └── TodolistApplicationTests.class
File / Folder Explanation 


.mvn — Maven Wrapper Folder

This folder supports the Maven Wrapper, which lets you build the project without installing Maven globally.



src — Source Code Directory

This is where all your code and resources live. src/main/java — your actual application source code. That’s where controllers, services, repositories, etc.



target — Build Output Directory

This folder is automatically created by Maven when you build or run your project. It contains all the compiled code, packaged JAR/WAR files, and temporary build files.



pom.xml

pom.xml stands for Project Object Model file. It’s the central configuration file that defines:

📦 Project structure and metadata
🔗 Dependencies (libraries your app uses)
⚙️ Build configuration (how your app compiles, packages, and runs)
🔁 Plugins (extra tools like testing, packaging, or deployment automation)





Implementation
Let's jump into the implementation. Before we write any code, we need to prepare two things: the project structure and the database dependencies. Below is what a clean Sprint Boot folder layout looks like. Most of this is generated automatically by Sprint Initializr, but we'll add our entity and config files manually...
src/
 ├── main/
 │   ├── java/
 │   │   └── com/example/todolist/
 │   │        ├── TodolistApplication.java       ← main entry point
 │   │        ├── model/                         ← your JPA entity
 │   │        │    └── Todo.java
 │   └── resources/
 │        ├── application.properties             ← DB + JPA config
Add the some dependencies
Before implementing the to-do list application, we need to set up the required dependencies and database connection. In my location environment, I use a .env file to manage config values such as database connection info. Below is an example configuration:
.env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=todolist
DB_USER=paul
DB_PASSWORD=dev
application.properties
spring.application.name=todolist

spring.datasource.url=jdbc:postgresql://${DB_HOST}:${DB_PORT:5432}/${DB_NAME:demo}
spring.datasource.username=${DB_USER:default_user}
spring.datasource.password=${DB_PASSWORD:default_pass}

# Hibernate Settings
spring.jpa.hibernate.ddl-auto=None
spring.jpa.show-sql=true
spring.jpa.properties.hibernate.format_sql=true
I want to take a little of time to explain what's the Hibernate Setting, Hibernate is an ORM (Object-Relational Mapping) Framework, you can think it as the bridge between your java objects to relation database. 
spring.jpa.hibernate.ddl-auto=None
This controls how Hibernate handles DDL (Data Definition Language) → meaning tables, columns, schema generation. The following are different values for this setting.
 - none: Do nothing. Hibernate will not create, update, validate, or drop your database schema.
 - update: Automatically updates schema → adds columns, changes types (not recommended in production).
 - create: Drops all tables and recreates them every time the app starts.
 - create-drop: Like create, but drops the schema when the app stops.
 - validate: Checks entity vs database schema—fails if mismatch but does not modify schema.
spring.jpa.show-sql=true
This tells Hibernate to print SQL statements in the console/logs. Not recommended in production, because it can print sensitive data + cause log flooding.
spring.jpa.properties.hibernate.format_sql=true
Formats SQL logs to be pretty and readable. Works only when show-sql=true
💡
The Jakarta Persistence API (JPA), formerly known as the Java Persistence API, is a Jakarta EE application programming interface specification that describes the management of relational data in enterprise Java applications. It provides a standard approach for object-relational mapping (ORM), simplifying database interactions by enabling developers to map Java objects to database tables. 
Next, edit your pom.xml → inside  section, add:


    org.springframework.boot
    spring-boot-starter-web




    org.springframework.boot
    spring-boot-starter-data-jpa




    org.postgresql
    postgresql
    runtime




    org.springframework.boot
    spring-boot-starter-test
    test
Then reload Maven: (Remember to export your .env to system environmental variable, because the Spring Boot doesn’t read .env automatically)
Create the Model
For the creating the model in the Sprint Boot, I will introduce the Lombok which can help us quickly build the model object without implement the Getter/Setter methods. 
In pom.xml, inside , add:

    org.projectlombok
    lombok
    1.18.32
    provided
 src/main/java/com/example/todolist/model/Todo.java
package com.example.todolist.model;

import jakarta.persistence.*;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

import java.time.Instant;

@Entity
@Table(name = "todolist")
@Getter
@Setter
@NoArgsConstructor
public class Todo {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(nullable = false)
    private String title;

    private String description;

    @Column(nullable = false)
    private boolean completed = false;

    // UTC-safe creation timestamp
    @Column(nullable = false, updatable = false,
            columnDefinition = "TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP")
    private Instant createdAt = Instant.now();

    // Custom constructor for creating new todos
    public Todo(String title, String description) {
        this.title = title;
        this.description = description;
    }

}
We have defined the model schema for the todolist table. For the createdAt field, we use the Instant type rather than LocalDateTime to store the timestamp in UTC. This ensures that the creation time is recorded consistently and is automatically set to the current instant.
Migration Manager
In modern SaaS application, database migration management is a critical part of the development workflow. Each release typically involves schema changes, and keeping track of these changes in version control is essential for maintainability and team collaboration.
Liquibase is an open-source database schema change management tool that helps development and DevOps teams track, version, and automate database migrations. In this post, we will walk through how to install Liquibase and use it to manage database changes in your Spring Boot application. 
To get started, add the following sections to your pom.xml to include the Liquibase dependencies and Maven plugin:

	org.liquibase
	liquibase-core
	4.27.0

    org.liquibase
    liquibase-maven-plugin
    4.27.0

    
        
            src/main/resources/db/changelog/db.changelog-master.yaml
        

        ${env.DB_URL}
        ${env.DB_USER}
        ${env.DB_PASSWORD}
        org.postgresql.Driver

        
            hibernate:spring:com.example.todolist.model?dialect=org.hibernate.dialect.PostgreSQLDialect
        
        
            liquibase.ext.hibernate.database.connection.HibernateDriver
        
    

    

        
        
            org.liquibase.ext
            liquibase-hibernate6
            4.27.0
        

        
        
            org.springframework
            spring-orm
            6.1.3
        

        
        
            org.springframework
            spring-context
            6.1.3
        

        
        
            org.postgresql
            postgresql
            42.7.3
        

    
you can use the following command to verify the installation.
./mvnw liquibase:help
Next we will setup how to generate the ChangeLog for our new model todolist. Create the db.changelog-master.yaml to /src/main/resources/db/changelog and also create the changes folder in the changelog.
todolist/src/main/resources on  main [⇡] is 📦 1 via ☕ v21.0.9
➜ tree
.
├── application.properties
└── db
    └── changelog
        ├── changes
        │   └── 0001-changelog-init.yaml
        └── db.changelog-master.yaml
Folder Structure of the resources
db.changelog-master.yaml
databaseChangeLog:
  - includeAll:
      path: db/changelog/changes/
      relativeToChangelogFile: false
Based on the settings in the db.changelog-master.yaml, we will store all the changelog files in the db/changelog/changes/. New we will use the following commands to generate our first init migration file and apply the update.
export $(cat .env | xargs)
./mvnw liquibase:diff -Dliquibase.diffChangeLogFile=src/main/resources/db/changelog/changes/0001-init.yaml
You will get the following change log:
databaseChangeLog:
- changeSet:
    id: 1763910259559-1
    author: taiker (generated)
    changes:
    - createTable:
        columns:
        - column:
            autoIncrement: true
            constraints:
              nullable: false
              primaryKey: true
              primaryKeyName: todolistPK
            name: id
            type: BIGINT
        - column:
            constraints:
              nullable: false
            name: completed
            type: BOOLEAN
        - column:
            constraints:
              nullable: false
            defaultValueComputed: CURRENT_TIMESTAMP
            name: createdAt
            type: TIMESTAMP WITH TIME ZONE
        - column:
            name: description
            type: VARCHAR(255)
        - column:
            constraints:
              nullable: false
            name: title
            type: VARCHAR(255)
        tableName: todolist
- changeSet:
    id: 1763910259559-2
    author: taiker (generated)
    changes:
    - dropTable:
        tableName: todos

Next we will use the update command to apply our database change log.
export $(cat .env | xargs)
./mvnw liquibase:update
💡
Additional, the liquibase also provide the rollback mechanism, you can refer to the table below.
You can log in to the Postgres to see the result
todolist=# \dt public.*
                List of relations
 Schema |         Name          | Type  | Owner
--------+-----------------------+-------+--------
 public | databasechangelog     | table | taiker
 public | databasechangeloglock | table | taiker
 public | todolist              | table | taiker
(3 rows)
todolist=# \d public.todolist
                                     Table "public.todolist"
   Column    |           Type           | Collation | Nullable |             Default
-------------+--------------------------+-----------+----------+----------------------------------
 id          | bigint                   |           | not null | generated by default as identity
 completed   | boolean                  |           | not null |
 createdAt   | timestamp with time zone |           | not null | now()
 description | character varying(255)   |           |          |
 title       | character varying(255)   |           | not null |
Indexes:
    "todolistPK" PRIMARY KEY, btree (id)
Now you can see that we have successfully created a new table called todolist in PostgreSQL by applying the changelog generated by Liquibase. This is the simplest example of using Liquibase during development. Next, we will add additional fields to the table through multiple changelogs to explore what else Liquibase can do.
The first change we want to make is renaming the createdAt column in the todolist table. According to common database conventions, column names should follow snake_case, so the column should be named created_at instead of  createdAt. This is a perfect opportunity to create another Liquibase changelog to apply this update.
To avoid running into this naming mismatch again in the future, we will also update our Hibernate configuration. By enabling the CamelCaseToUnderscoresNamingStrategy, we can continue using maps them to snake_case columns in the database. This keeps our Java model clean and idiomatic while ensuring consistent database naming conventions.
application.properties
# Use snake_case for database column names
spring.jpa.hibernate.naming.physical-strategy=org.hibernate.boot.model.naming.CamelCaseToUnderscoresNamingStrategy
Generate the following change log for renaming createdAt field, and apply it.
databaseChangeLog:
  - changeSet:
      id: 0002-rename-createdAt-to-created_at
      author: taiker
      changes:
        - renameColumn:
            tableName: todolist
            oldColumnName: createdAt
            newColumnName: created_at
            columnDataType: TIMESTAMP WITH TIME ZONE
Now that we have updated the table schema, we can try using the Liquibase rollback command. After running the rollback, you will see that the database returns to the previous schema exactly as expected.
export $(cat .env | xargs)

# option 1
mvn liquibase:rollback -Dliquibase.rollbackCount=1
todolist=# \d public.todolist
                                     Table "public.todolist"
   Column    |           Type           | Collation | Nullable |             Default
-------------+--------------------------+-----------+----------+----------------------------------
 id          | bigint                   |           | not null | generated by default as identity
 completed   | boolean                  |           | not null |
 createdAt   | timestamp with time zone |           | not null | now()
 description | character varying(255)   |           |          |
 title       | character varying(255)   |           | not null |
Indexes:
    "todolistPK" PRIMARY KEY, btree (id)
Bonus
I use a Makefile to wrap the Maven Liquibase commands, making them much easier to run during development. Below is the Makefile for reference. You can run make help to view all supported commands.
➜ make help
=== Liquibase Migration Commands ===

Migration Generation:
  make makemigration              - Generate new migration (auto-numbered)
  make makemigration NAME=example - Generate migration with custom name

Migration Execution:
  make migrate                    - Apply all pending migrations
  make migrate-one                - Apply only the next pending migration
  make migrate-to NUM=0008        - Migrate to specific version
  make showmigrations             - Show migration status

Rollback:
  make rollback COUNT=1           - Rollback N changesets (default: 1)
  make rollback-preview COUNT=1   - Preview rollback SQL

Fake Migrations:
  make fake-migrate               - Mark all pending as executed
  make fake-migrate-to NUM=0008   - Mark up to version as executed
  make fake-migrate-preview       - Preview what would be marked
# Makefile for Liquibase Migration Management
# Provides Django-like commands for database migrations

# Load environment variables from .env file
-include .env
export

.PHONY: help makemigration migrate migrate-one migrate-to showmigrations rollback rollback-preview fake-migrate fake-migrate-to fake-migrate-preview

# Default target
help:
	@echo "=== Liquibase Migration Commands ==="
	@echo ""
	@echo "Migration Generation:"
	@echo "  make makemigration              - Generate new migration (auto-numbered)"
	@echo "  make makemigration NAME=example - Generate migration with custom name"
	@echo ""
	@echo "Migration Execution:"
	@echo "  make migrate                    - Apply all pending migrations"
	@echo "  make migrate-one                - Apply only the next pending migration"
	@echo "  make migrate-to NUM=0008        - Migrate to specific version"
	@echo "  make showmigrations             - Show migration status"
	@echo ""
	@echo "Rollback:"
	@echo "  make rollback COUNT=1           - Rollback N changesets (default: 1)"
	@echo "  make rollback-preview COUNT=1   - Preview rollback SQL"
	@echo ""
	@echo "Fake Migrations:"
	@echo "  make fake-migrate               - Mark all pending as executed"
	@echo "  make fake-migrate-to NUM=0008   - Mark up to version as executed"
	@echo "  make fake-migrate-preview       - Preview what would be marked"
	@echo ""

# Variables
CHANGES_DIR := src/main/resources/db/changelog/changes
MVN := mvn
NAME ?= auto_generated
COUNT ?= 1

# Auto-detect next migration number
LATEST_NUM := $(shell ls $(CHANGES_DIR) 2>/dev/null | grep -E '^[0-9]+' | sed 's/^0*//' | sed 's/[^0-9].*//' | sort -n | tail -1)
ifeq ($(LATEST_NUM),)
	NEXT_NUM := 1
else
	NEXT_NUM := $(shell echo $$(($(LATEST_NUM) + 1)))
endif
NEXT_FORMATTED := $(shell printf "%04d" $(NEXT_NUM))

# Migration Generation
makemigration:
	@echo "Generating migration $(NEXT_FORMATTED)_$(NAME).yaml..."
	@if [ ! -d "$(CHANGES_DIR)" ]; then \
		echo "Error: Directory $(CHANGES_DIR) does not exist"; \
		exit 1; \
	fi
	@FILEPATH="$(CHANGES_DIR)/$(NEXT_FORMATTED)_$(NAME).yaml"; \
	$(MVN) liquibase:diff -Dliquibase.diffChangeLogFile=$$FILEPATH; \
	if [ -f $$FILEPATH ]; then \
		echo "" >> $$FILEPATH; \
		echo "- changeSet:" >> $$FILEPATH; \
		echo "    id: tag-$(NEXT_FORMATTED)" >> $$FILEPATH; \
		echo "    author: taiker" >> $$FILEPATH; \
		echo "    changes:" >> $$FILEPATH; \
		echo "      - tagDatabase:" >> $$FILEPATH; \
		echo "          tag: \"$(NEXT_FORMATTED)\"" >> $$FILEPATH; \
		echo "✓ Migration created: $$FILEPATH"; \
		echo "✓ Tag $(NEXT_FORMATTED) added"; \
	else \
		echo "Error: Failed to generate migration file"; \
		exit 1; \
	fi

# Migration Execution
migrate:
	@echo "Applying all pending migrations..."
	@$(MVN) liquibase:update

migrate-one:
	@echo "Applying next pending migration..."
	@$(MVN) liquibase:updateCount -Dliquibase.count=1

migrate-to:
	@if [ -z "$(NUM)" ]; then \
		echo "Error: NUM parameter required. Usage: make migrate-to NUM=0008"; \
		exit 1; \
	fi
	@echo "Migrating to version $(NUM)..."
	@$(MVN) liquibase:updateToTag -Dliquibase.toTag=$(NUM)

showmigrations:
	@echo "Checking migration status..."
	@$(MVN) liquibase:status

# Rollback
rollback:
	@echo "Rolling back $(COUNT) changeset(s)..."
	@$(MVN) liquibase:rollback -Dliquibase.rollbackCount=$(COUNT)

rollback-preview:
	@echo "Previewing rollback of $(COUNT) changeset(s)..."
	@$(MVN) liquibase:rollbackSQL -Dliquibase.rollbackCount=$(COUNT)
	@echo ""
	@echo "Preview saved to: target/liquibase/migrate.sql"

# Fake Migrations
fake-migrate:
	@echo "Marking all pending migrations as executed (without running them)..."
	@$(MVN) liquibase:changeLogSync

fake-migrate-to:
	@if [ -z "$(NUM)" ]; then \
		echo "Error: NUM parameter required. Usage: make fake-migrate-to NUM=0008"; \
		exit 1; \
	fi
	@echo "Marking migrations up to $(NUM) as executed (without running them)..."
	@$(MVN) liquibase:changeLogSyncToTag -Dliquibase.toTag=$(NUM)

fake-migrate-preview:
	@echo "Previewing what would be marked as executed..."
	@$(MVN) liquibase:changeLogSyncSQL
	@echo ""
	@echo "Preview saved to: target/liquibase/migrate.sql"
Takeaway
In this post, I walked you through how to start a new Java project using the online Spring Initializr. Unlike man tutorials that build an entire to-do list application in one go, we spent most of our time focusing on how to use Liquibase to manage database schema changes through practical, real-world examples. Understanding database migrations early helps ensure your application remains maintainable as it grows.
In the next post, we will continue building the application and complete the remaining parts of the basic to-do list system using Spring Boot. If you’re interested in building production-ready SaaS applications with Java and Spring Boot, stay tuned for the next article.


[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level Requirement
Paul Liang — Mon, 29 Sep 2025 15:38:23 GMT
Background
In August, I embarked on a new journey by joining a new company, marking the start of another exciting venture. At the same time, I began exploring the world of data — an area full of concepts and tools I had never worked with before. This blog is part of my effort to document that learning process. In particular, I want to share how to build a small, local data solution — complete with authentication — that can run on your own local instance.
Prerequisites
Before jumping into the details, it’s important to have a basic understanding of the three core services we’ll be working with. Below, I’ll provide a brief introduction to each of them.
CloudBeaver
CloudBeaver is a lightweight, web-based database management tool. It allows you to connect to multiple databases through a browser interface, making it easy to explore data, run queries, and manage database objects without installing heavy desktop software. In our setup, CloudBeaver will serve as the main interface to interact with Trino.
Trino
Trino is a high‑performance, distributed SQL query engine designed to run interactive queries at scale. It can connect to a wide variety of data sources — from relational databases like Postgres and MySQL to big data systems and even object storage. In our local setup, Trino serves as the query engine sitting between CloudBeaver (the UI) and the underlying data sources. This means CloudBeaver sends SQL queries to Trino, and Trino takes care of planning, executing, and returning the results.
OPA (Open Policy Agent)
OPA is a general-purpose policy engine that helps enforce fine-grained access control. It uses a declarative policy language called Rego to define what actions are allowed or denied. In our local data solution, OPA will integrate with Trino to ensure that only authorized users can run certain queries or access specific rows of data.
Arch
Based on the image below, you can easily understand the roles and responsibilities of these three services. In the next section, we will walk through how to use these services step by step to build a modern distributed SQL query engine for processing large volumes of data.
Overview of the Arch
Setup
We use Docker Compose to orchestrate the three core services (Trino, CloudBeaver, OPA) for this demo project
services:
  trino:
    image: trinodb/trino:latest
    container_name: trino
    ports:
      - "8080:8080"
    volumes:
      - ./trino/etc:/etc/trino
      - ./trino/data:/data/trino

  cloudbeaver:
    image: dbeaver/cloudbeaver:latest
    container_name: cloudbeaver
    ports:
      - "8978:8978"
    volumes:
      - ./cloudbeaver/workspace:/opt/cloudbeaver/workspace
    depends_on:
      - trino

  opa:
    image: openpolicyagent/opa:0.69.0
    container_name: opa
    ports:
      - "8181:8181"
    command: ["run", "--server", "--addr", "0.0.0.0:8181", "--config-file", "/config.yaml", "--log-level", "debug", "/policy", "/data"]
    volumes:
      - ./opa/policies:/policy
      - ./opa/data:/data
      - ./opa/config.yaml:/config.yaml
docker-compose.yaml
CloudBeaver
CloudBeaver setup is the simplest part. All we need to do is create a folder that Docker Compose can mount, which will be used to store logs and authentication information.
$ mkdir -p cloudbeaver/workspace
Trino
Trino is the heart of the stack, so its setup is a bit more involved than the others. We keep everything under a trino/directory with two subfolders: etc/ for configuration and data/ for runtime state.
trino/etc/ — Configuration
Mounted at /etc/trino/ inside the container. Key files include:
config.properties – Main server settings (coordinator role, memory limits, discovery URI)
node.properties – Node-specific configuration
jvm.config – JVM flags and memory options
log.properties – Logging levels
access-control.properties – Points Trino to OPA for policy decisions
password-authenticator.properties – Enables file-based password authentication
password.db – Local credential store
catalog/tpch.properties – TPCH sample dataset connector
catalog/memory.properties – In-memory connector for quick tests
trino/data/ — Runtime Data
Mounted at /data/trino/ inside the container. Used for:
var/log/ – HTTP and query logs
var/run/launcher.pid – Process ID of the Trino server
Cache and temp files for query execution
This directory ensures Trino can persist logs, state, and temp files across restarts.
Example Config Files
etc/access-control.properties
access-control.name=opa
opa.policy.uri=http://opa:8181/v1/data/authz/trino/allow
opa.policy.batched-uri=http://opa:8181/v1/data/authz/trino/batch_allow
opa.log-requests=true
opa.log-responses=true
etc/access-control.properties
etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=512MB
query.max-memory-per-node=256MB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
etc/config.properties
etc/jvm.config
-server
-Xmx1G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
etc/jvm.config
etc/log.properties
io.trino.plugin.opa.OpaHttpClient=DEBUG
etc/log.properties
etc/node.properties
node.environment=dev
node.id=1
node.data-dir=/data/trino
etc/node.properties
etc/password-authenticator.properties
password-authenticator.name=file
file.password-file=/etc/trino/password.db
etc/password-authenticator.properties
etc/password.db
alice:password123
bob:password123
etc/password.db
Next, let’s introduce the concept of catalogs in Trino.
In Trino, a catalog is a named configuration that connects to a specific data source. Each catalog:
Points to a different data source (e.g., a database, file system, etc.)
Uses a specific connector to communicate with that data source
Has its own configuration settings
Appears as a separate “database” when queried via SQL
In this demo, we’ll use two catalog configurations. Here’s the first:
1. tpch.properties — TPCH Connector
Purpose: Provides sample datasets for testing and demonstration purposes
Connector: TPCH (Transaction Processing Performance Council)
Data: Dynamically generates standard benchmark tables such as customer, orders, lineitem, and more
Configuration: splits-per-node=4 — Controls the level of parallelism used during data generation
etc/catalog/tpch.properties
connector.name=tpch
tpch.splits-per-node=4
etc/catalog/tpch.properties
2. memory.properties - Memory Connector
Purpose: Creates in-memory tables for temporary data storage
Connector: Memory connector for ephemeral data
Use case: Temporary tables, testing, or data processing pipelines
etc/catalog/memory.properties
connector.name=memory
etc/catalog/memory.properties
OPA
Next we configure OPA. Create an opa/ folder and add a trino.rego file to define authorization rules.
package authz.trino

import rego.v1

default allow := false

debug_info := {
    "user": input.context.identity.user,
    "operation": input.action.operation,
    "full_input": input,
}

# Allow Alice to perform any operation
allow if {
    input.context.identity.user == "alice"
}

# Explicitly deny Bob
allow if {
    input.context.identity.user == "bob"
    false
}

# Handle batch operations with filterResources
batch_allow := result if {
    input.action.filterResources
    result := [i |
        input.action.filterResources[i]
        allow
    ]
}

# Handle other types of batch requests
batch_allow := result if {
    input.batch
    result := [allow | input.batch[_]]
}
opa/trino.rego
Explanation of Rules
default allow := false – Deny by default
debug_info – Collects context for debugging
Alice – Full access
Bob – Always denied
batch_allow – Applies the rules to batch requests
User Story
Alice – Can perform any operation, always allowed.
Bob – Always denied.
Other users – Denied by default.
We also need to set the config.yaml for OPA. 
services:
  authz:
    resource: data.authz

decision_logs:
  console: true
  reporting:
    min_delay_seconds: 1
    max_delay_seconds: 1
opa/config.yaml
when OPA evaluates an input, it will look at the policy path data.authz (from your .rego files) to determine allow/deny or any decision output.
This section controls how OPA records and outputs decision logs (audit trail of all policy decisions).
Validation
In this section, we will run all services using Docker Compose and validate that everything works correctly. First, please make sure the folder structure matches the following:
.
├── cloudbeaver
│   └── workspace
├── docker-compose.yaml
├── opa
│   ├── config.yaml
│   └── policies
│       └── trino.rego
└── trino
    ├── data
    └── etc
        ├── access-control.properties
        ├── catalog
        │   ├── memory.properties
        │   └── tpch.properties
        ├── config.properties
        ├── jvm.config
        ├── log.properties
        ├── node.properties
        ├── password-authenticator.properties
        └── password.db
Next, use docker-compose up --build to start the services. If everything goes well, you can open your browser and navigate to CloudBeaver at http://localhost:8978. Follow the on-screen instructions to set up your account and password
Follow the instructions to set up the login info.
After setting up your account and password, let’s go back to our Rego code. Remember we have two users: Alice and Bob. Alice has full permissions for all operations, whereas Bob has no permissions at all. Next, we’ll test whether the OPA service is working correctly by creating these users in CloudBeaver.
Create the users Alice and Bob in CloudBeaver.
Before we going to set up the connection we should describe how authentication normally works within these services:
CloudBeaver: by default manages its own users (in its internal DB or LDAP).
Trino: receives a username when a connection is opened. This is usually configured in the CloudBeaver connection settings (Authentication → Username/Password or JWT).
OPA: only sees what Trino sends in input.context.identity.user.
Since we're using the Community Edition of CloudBeaver, we can't automatically pass user information to Trino. As a result, each CloudBeaver user must configure their own connection using their Trino username and password (or token). 
💡
Bonus Tip: If you don't want to set up a connection for each user in Community Edition, you can enable the Trino's proxy user feature. However, this topic is not covered in this blog post. 
Use the following URL: jdbc:trino://trino:8080/tpch/tiny. Since we are using Docker Compose to start the services, CloudBeaver cannot connect to Trino using localhost—it must reference the service name defined in the Docker network (in this case, trino).
Select the Trino as our connection type.
Set up the connection for Alice
Set up the connection for Bob
Next, we can observe the differences between the two connections. For Alice, you can view all the settings we configured in the Trino environment, including the memory and tpch catalogs. However, for Bob, you won't see anything because all operations are rejected by the OPA service.
The difference between Alice and Bob
For Bob, can't do any operation through connection
Perfect — everything is working as expected (poor Bob 😅). If you run into any issues, feel free to check out my GitHub repository for the complete project and setup instructions.
Takeaway
In this post, we demonstrated how to build a modern distributed SQL query engine with a user-friendly UI and an authorization solution that meets business-level data requirements — not just for handling large datasets, but also for enforcing data access control. 


Recap of bypassing anti-bot mechanism
Paul Liang — Sat, 19 Apr 2025 16:49:45 GMT
Background
I have been with GoFreight for 5 years. During the first two or three years, I was assigned to develop the crawler system for the tracking service. In the beginning, everything went well, and we were able to crawl the information we needed from carrier websites. However, as time passed, more and more carriers implemented anti-bot solutions on their websites, and we started encountering issues. We had to deal with various challenges such as CAPTCHAs, Google reCAPTCHA, CDN protections, etc. This required increasing amounts of time to bypass these detections, otherwise, we couldn’t meet our service SLA. In this blog, I want to take some time to recap the challenges we faced during this period, until we started retrieving data directly from our vendors or carriers.
TLS Handshake
First, I will briefly introduce the TLS Handshake. Transport Layer Security (TLS) is a widely adopted security protocol designed to ensure privacy and data security for communications over the internet. One of its primary use cases is encrypting the communication between web applications and servers. In the following figure, we can see how the client and server establish a secure connection before sending or receiving data, and this process is called the TLS Handshake. 
src: https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/
Fingerprint
Next, I will briefly explain what a fingerprint is with a simple example. A fingerprint is a unique identifier generated from a combination of device, browser, and network characteristics. It’s used to recognize users or bots, even when:
Cookies are disabled
IP addresses change
Users switch to incognito or private mode
Think of it as a digital ID for your device/browser session. This technique is widely used by MarTech and CDN companies for various purposes, including improving ad transfer rates and detecting bots. In this example, we’ll use FingerprintJS to help demonstrate how it works.



    
    
    Bot Detection with FingerprintJS v4


    Bot Detection Demo
    Click the button below to check if you're a bot!
    
    

    
    

You can open your HTML file in the browser. After clicking the button, you’ll see the Fingerprint ID in the developer console. We also print the details of the components for reference. You can review the content of these components to understand which items might affect the Fingerprint. If you open another tab and visit the same file, you may see the same Fingerprint ID because the content of the components remains unchanged. At this point, you should have an initial understanding of how the Fingerprint works.
Bot Detection
Imagine you have a website, and many bots are visiting, crawling everything. Sometimes, this can even cause your website to crash. What can you do? In the past, we often added rate limits based on IP addresses. However, with the rise of proxy services, users can easily apply new IPs from around the world. This makes the IP-based solution less effective today. Luckily, as you might have guessed, here comes Fingerprinting. By calculating the Fingerprint ID based on the browser's components, we can easily identify whether website requests are coming from the same instance.



    
    
    Bot Detection with FingerprintJS v4


    Bot Detection Demo
    Click the button below to check if you're a bot!
    
    

    
    


from flask import Flask, request, jsonify
from flask_cors import CORS
import time

app = Flask(__name__)
CORS(app)

# Store fingerprints to detect rapid, repetitive requests
fingerprint_tracker = {}

# Known bot-like user-agents
BOT_USER_AGENTS = [
    "HeadlessChrome", "bot", "crawl", "spider", "Googlebot", "Bingbot", "Yahoo! Slurp", "DuckDuckBot"
]

# Fake screen resolutions (some bots report unusual screen sizes)
UNREALISTIC_SCREENS = ["0x0", "1x1", "1024x1024"]

# Check if the request is likely from a bot
def is_bot(fingerprint, user_agent, screen_size, languages):
    current_time = time.time()

    # 1. Detect rapid repeated requests (rate limiting)
    if fingerprint in fingerprint_tracker:
        last_request_time = fingerprint_tracker[fingerprint]
        if current_time - last_request_time < 2:  # Less than 2 seconds between requests
            return True, "Suspicious rapid requests detected"
    
    # Update last request time
    fingerprint_tracker[fingerprint] = current_time

    # 2. Check for known bot user-agents
    if any(bot in user_agent for bot in BOT_USER_AGENTS):
        return True, "Bot-like User-Agent detected"

    # 3. Check for unusual screen sizes (some bots use default headless sizes)
    if screen_size in UNREALISTIC_SCREENS:
        return True, "Unrealistic screen resolution detected"

    # 4. Check if the language list is empty (bots often don't send language data)
    if not languages or len(languages) == 0:
        return True, "No language data found"

    return False, "Looks like a human"

@app.route('/check_bot', methods=['POST'])
def check_bot():
    data = request.get_json()
    
    fingerprint = data.get("fingerprint")
    user_agent = data.get("userAgent", "")
    screen_size = data.get("screenSize", "")
    languages = data.get("languages", [])

    bot, message = is_bot(fingerprint, user_agent, screen_size, languages)
    
    return jsonify({"is_bot": bot, "message": message})

if __name__ == '__main__':
    app.run(debug=True)

By default, browsers block cross-origin requests for security reasons (Same-Origin Policy). By doing so, your Flask backend can receive a request from the client on localhost for testing.
Let's delve into the backend code. Now that you have the fingerprint, you can easily build the fingerprint profile. Although we use a dictionary format (fingerprint_tracker) as an example, you can imagine recording this fingerprint information in the database. Each time a request with a specific fingerprint comes in, we can query the database to check its history, helping us detect bots. This example just gives you a basic example of how to use the fingerprint to detect the bot.
TLS Fingerprint
Next, we will introduce the TLS Fingerprint. As we mentioned before, general fingerprinting uses the browser component to calculate its value. TLS Fingerprinting is a technique used to identify clients based on their unique Transport Layer Security (TLS) handshake characteristics. When a client (e.g., browser or bot) connects to a server, it starts with a TLS handshake that includes:
TLS Parameter What It Reveals How It's Used for Fingerprinting
TLS Version e.g., TLS 1.2 or TLS 1.3 Some bots use outdated versions
Cipher Suites List of supported encryption algorithms Bots often have limited options
Extensions Features like ALPN, SNI, GREASE Unique combinations per browser
Elliptic Curves Supported key exchange methods Unusual curves = suspicious
Signature Algorithms Authentication methods used Differ per OS/browser
Order of Fields The sequence of cipher suites/extensions Different per client type
Let's see an example. There are two clients connected to your server:
TLS 1.3
Cipher Suites: [TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384]
Extensions: [server_name, supported_versions, key_share, psk_key_exchange_modes]
Elliptic Curves: [X25519, secp256r1]
Signature Algorithms: [rsa_pss_rsae_sha256, ecdsa_secp256r1_sha256]
Legitimate User (Chrome on Windows)
TLS 1.2
Cipher Suites: [TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384]
Extensions: [server_name]
Elliptic Curves: [secp256r1]
Signature Algorithms: [rsa_pss_rsae_sha256]
Suspicious Bot (Python Requests Library)
Based on this information, we can easily detect that the second request was generated by Python Requests and may belong to a suspicious bot, so we can block it.
I will use mitmproxy to patch the TLS information of the Python request. First, create the debug_hello.py script to check if our request is being patched in mitmproxy, and then use the mitmdump command to start the mitmproxy service locally.
from mitmproxy import tls

EXTENSION_NAMES = {
    0: "server_name",
    10: "supported_groups",
    11: "ec_point_formats",
    13: "signature_algorithms",
    16: "application_layer_protocol_negotiation",
    21: "padding",
    22: "encrypt_then_mac",
    23: "extended_master_secret",
    43: "supported_versions",
    45: "psk_key_exchange_modes",
    49: "post_handshake_auth",
    51: "key_share",
    65281: "renegotiation_info",
}

def readable_extensions(extensions):
    return [
        EXTENSION_NAMES.get(ext_id, f"unknown({ext_id})")
        for ext_id, _ in extensions
    ]

def tls_clienthello(data: tls.ClientHelloData):
    hello = data.client_hello
    print("JA3 Debug:")
    print(f"  - Client: {data.context.client.peername}")
    print(f"  - Cipher Suites: {hello.cipher_suites}")
    print(f"  - Extensions: {readable_extensions(hello.extensions)}")
debug_hello.py
$ mitmdump --mode regular@8082 -s debug_hello.py --set tls_client_hello=chrome_120
Next, we’ll prepare a simple Python request code to send a request with a proxy and examine the result from tls.peet.ws. Take some time to review the differences between using and not using a proxy, especially the significant variations in the ciphers section.
import requests
from pprint import pprint

proxies = {
    "http": "http://127.0.0.1:8082",
    "https": "http://127.0.0.1:8082",
}

response = requests.get("https://tls.peet.ws/api/all", proxies=proxies, verify=False)
# response = requests.get("https://tls.peet.ws/api/all")

pprint(response.json())
{'donate': 'Please consider donating to keep this API running. Visit '
           'https://tls.peet.ws',
 'http1': {'headers': ['Host: tls.peet.ws',
                       'User-Agent: python-requests/2.32.3',
                       'Accept-Encoding: gzip, deflate, br, zstd',
                       'Accept: */*',
                       'Connection: keep-alive']},
 'http_version': 'HTTP/1.1',
 'ip': '103.234.230.84:61132',
 'method': 'GET',
 'tcpip': {'ip': {}, 'tcp': {}},
 'tls': {'ciphers': ['TLS_AES_256_GCM_SHA384',
                     'TLS_CHACHA20_POLY1305_SHA256',
                     'TLS_AES_128_GCM_SHA256',
                     'TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256',
                     'TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256',
                     'TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384',
                     'TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384',
                     'TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256',
                     'TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256',
                     'TLS_DHE_RSA_WITH_AES_128_GCM_SHA256',
                     'TLS_DHE_RSA_WITH_AES_256_GCM_SHA384',
                     'TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256',
                     'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256',
                     'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256',
                     'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA',
                     'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA',
                     'TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384',
                     'TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384',
                     'TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA',
                     'TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA',
                     'TLS_DHE_RSA_WITH_AES_128_CBC_SHA256',
                     'TLS_DHE_RSA_WITH_AES_256_CBC_SHA256',
                     'TLS_RSA_WITH_AES_128_GCM_SHA256',
                     'TLS_RSA_WITH_AES_256_GCM_SHA384',
                     'TLS_RSA_WITH_AES_128_CBC_SHA256',
                     'TLS_RSA_WITH_AES_256_CBC_SHA256',
                     'TLS_RSA_WITH_AES_128_CBC_SHA',
                     'TLS_RSA_WITH_AES_256_CBC_SHA'],
         'client_random': '2059469c2c091fad5cdc5d3923783e2297f8a96de4c9e06efb22572f8294bc2d',
         'extensions': [{'data': '00',
                         'name': 'extensionRenegotiationInfo (boringssl) '
                                 '(65281)'},
                        {'name': 'server_name (0)',
                         'server_name': 'tls.peet.ws'},
                        {'elliptic_curves_point_formats': ['0x00',
                                                           '0x01',
                                                           '0x02'],
                         'name': 'ec_point_formats (11)'},
                        {'name': 'supported_groups (10)',
                         'supported_groups': ['X25519 (29)',
                                              'P-256 (23)',
                                              'X448 (30)',
                                              'P-521 (25)',
                                              'P-384 (24)',
                                              'ffdhe2048 (256)',
                                              'ffdhe3072 (257)',
                                              'ffdhe4096 (258)',
                                              'ffdhe6144 (259)',
                                              'ffdhe8192 (260)']},
                        {'data': '', 'name': 'session_ticket (35)'},
                        {'name': 'application_layer_protocol_negotiation (16)',
                         'protocols': ['http/1.1']},
                        {'data': '', 'name': 'encrypt_then_mac (22)'},
                        {'extended_master_secret_data': '',
                         'master_secret_data': '',
                         'name': 'extended_master_secret (23)'},
                        {'name': 'signature_algorithms (13)',
                         'signature_algorithms': ['ecdsa_secp256r1_sha256',
                                                  'ecdsa_secp384r1_sha384',
                                                  'ecdsa_secp521r1_sha512',
                                                  'ed25519',
                                                  'ed25519',
                                                  'ecdsa_brainpoolP256r1tls13_sha256',
                                                  'ecdsa_brainpoolP384r1tls13_sha384',
                                                  'ecdsa_brainpoolP512r1tls13_sha512',
                                                  'rsa_pss_pss_sha256',
                                                  'rsa_pss_pss_sha384',
                                                  'rsa_pss_pss_sha512',
                                                  'rsa_pss_rsae_sha256',
                                                  'rsa_pss_rsae_sha384',
                                                  'rsa_pss_rsae_sha512',
                                                  'rsa_pkcs1_sha256',
                                                  'rsa_pkcs1_sha384',
                                                  'rsa_pkcs1_sha512',
                                                  '0x303',
                                                  '0x301',
                                                  '0x302',
                                                  '0x402',
                                                  '0x502',
                                                  '0x602']},
                        {'name': 'supported_versions (43)',
                         'versions': ['TLS 1.3', 'TLS 1.2']},
                        {'PSK_Key_Exchange_Mode': 'PSK with (EC)DHE key '
                                                  'establishment (psk_dhe_ke) '
                                                  '(1)',
                         'name': 'psk_key_exchange_modes (45)'},
                        {'name': 'key_share (51)',
                         'shared_keys': [{'X25519 (29)': 'a3ba691321dfea99979785396e5c370ee6ee6a7403cb736d51388c9d65206800'}]}],
         'ja3': '771,4866-4867-4865-49195-49199-49196-49200-52393-52392-158-159-52394-49187-49191-49161-49171-49188-49192-49162-49172-103-107-156-157-60-61-47-53,65281-0-11-10-35-16-22-23-13-43-45-51,29-23-30-25-24-256-257-258-259-260,0-1-2',
         'ja3_hash': '135b770c875c319c3564deacfe0bcc39',
         'ja4': 't13d2812h1_a01be8c064b6_0b298858d6c1',
         'ja4_r': 't13d2812h1_002f,0035,003c,003d,0067,006b,009c,009d,009e,009f,1301,1302,1303,c009,c00a,c013,c014,c023,c024,c027,c028,c02b,c02c,c02f,c030,cca8,cca9,ccaa_000a,000b,000d,0015,0016,0017,0023,002b,002d,0033,ff01_0403,0503,0603,0807,0808,081a,081b,081c,0809,080a,080b,0804,0805,0806,0401,0501,0601,0303,0301,0302,0402,0502,0602',
         'peetprint': '772-771|1.1|29-23-30-25-24-256-257-258-259-260|1027-1283-1539-2055-2056-2074-2075-2076-2057-2058-2059-2052-2053-2054-1025-1281-1537-771-769-770-1026-1282-1538|1||4866-4867-4865-49195-49199-49196-49200-52393-52392-158-159-52394-49187-49191-49161-49171-49188-49192-49162-49172-103-107-156-157-60-61-47-53|0-10-11-13-16-22-23-35-43-45-51-65281',
         'peetprint_hash': 'a81429f9a27d4b2da1c4126a7921174a',
         'session_id': '4f5ff2f21118a79e9af3be3367428189e2c7050a629e24264149dea476e84e7e',
         'tls_version_negotiated': '772',
         'tls_version_record': '771'}}
Takeaways
In this post, I’ve introduced Fingerprint and TLS Fingerprint and provided some simple examples to demonstrate how modern websites and servers use them for bot detection. At the end of this post, I’ll share a few key takeaways:
Python Requests Can Be Upgraded to Look Like Chrome: If you find that your default TLS version is still 1.2 for Python requests, remember to upgrade both Python and OpenSSL to enable TLS 1.3. This upgrade helps you bypass TLS-layer bot detection mechanisms like JA3/JA4 checks.
mitmproxy Presets Emulate Real Browsers and Devices: Including mitmproxy with a fingerprint solution in your crawler engine can help bypass bot detection systems by emulating real browser and device behavior.
OpenSSL is the Default TLS Engine - and it’s Easy to detect: Python, curl, and most CLI tools use OpenSSL for TLS, making their ClientHello predictable and easy to fingerprint unless spoofed. mitmproxy mitigates this issue by terminating TLS itself and sending a new ClientHello to the target using your chosen preset.
Reference 
https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/


Understanding CSRF, SOP, and CORS Security Risks
Paul Liang — Fri, 28 Feb 2025 17:07:53 GMT
Background
In the past, I used to share my newly acquired knowledge through blogs. However, with the emergence of AI (Specifically LLM models), things have changed. Nowadays, everyone is asking AI questions, making traditional internet searches seem less valuable. Despite this shift, I still want to continue sharing through blogs because it allows me to practice my English writing and organize my thoughts in a structured way. However, given the rise of AI, I also plan to adapt my blog writing.
These days, it's fortunate to learn alongside AI—you can ask any question if you’re struggling with a specific topic. The key skill, though, is validating the answers that AI provides to ensure they’re correct and make sense. I believe this skill will become even more important as we live in the AI era, which is why I want to write this blog post with the help of AI. 
I hope you enjoy this post, even though much of the content may be generated by AI. I still aim to share my perspective, and I hope this mindset is helpful to you.
I struggled with web security topics because I had many unanswered questions before. However, thanks to AI, I now have the opportunity to clarify everything; that's why I chose this topic as my first blog, which cooperates with AI, and  I will show you how to engage with AI back and forth to learn a specific topic.
Definitions
The following definitions are generated via AI; if you are familiar with them, you can skip this section.
SOP - Same-Origin Policy
The Same-Origin Policy (SOP) is a browser security mechanism that restricts web pages from accessing data from a different origin (protocol, domain, or port).
Purpose: Prevents cross-site data theft and protects user sessions.
Example: A script on https://example.comcannot read cookies from https://attacker.com.
CSRF - Cross-Site Request Forgery
Cross-Site Request Forgery (CSRF) is an attack that tricks a logged-in user into performing an unwanted action on a trusted website.
How it works: The attack exploits the browser’s automatic cookie handling to send unauthorized requests.
 Example: A victim logs into bank.com, then visits attacker.com, which automatically submits a money transfer request to bank.com using the victim's session.
 Prevention: Use CSRF tokens, SameSite=Strict cookies, and re-authentication for sensitive actions.
CORS - Cross-Origin Resource Sharing
Cross-Origin Resource Sharing (CORS) is a security feature that allows controlled access to resources from different origins by defining which external sites can make requests.
Why it's needed: SOP blocks all cross-origin requests by default, but CORS allows specific trusted domains to access an API.
Example: If api.example.com wants to allow requests from frontend.com, the server must set: Access-Control-Allow-Origin: https://frontend.com
Security risk: Misconfiguring CORS (Access-Control-Allow-Origin: *) can bypass CSRF protection, allowing attackers to send requests using a victim’s session.
Discussion with AI
In the past, I was familiar with the terms listed above individually, but I had questions about why we need to use all these concepts to design website security. One day, while logging into a Django website I was developing, I noticed I received both a sessionID and a csrftoken. The sessionID was marked as HttpOnly, while the csrftoken was not. This sparked my curiosity about why they had different attributes in the browser, so I decided to investigate further.
I opened the ChatGPT and asked the following question:
Q: I used Django to implement the login functionality. After logging in, I received two cookies: csrftoken and sessionid, and I’m curious about the difference between these cookies.
Ans: Django uses sessionid to keep track of logged-in users, and this ID is stored in a cookie on the user's browser. The session ID maps to session data stored on the server (e.g., in a database, cache, or file system). When the user makes subsequent requests, their browser sends the sessionid cookie, and Django reads this cookie, retrieves the session data, and associates it with the request.
The csrftoken is a Cross-Site Request Forgery (CSRF) token used by Django to prevent CSRF attacks. This token is stored in a cookie (csrftoken ) and embedded as a hidden field in the form. When the form is submitted, Django checks if the CSRF token included in the form matches the token stored in the user's cookie. Since browsers automatically include cookies in requests, an attacker could trick a logged-in user into making an unwanted request (e.g., submitting a form). The CSRF token ensures that only forms generated by Django are accepted.
Q: I noticed that the sessionid is HttpOnly, but the csrftoken is not. Why is that?
Ans: In Django, the sessionid cookie is marked as HttpOnly to enhance security. This attribute prevents client-side scripts from accessing the cookie, helping mitigate risks associated with cross-site scripting (XSS) attacks. By making the sessionid cookie HttpOnly, it cannot be accessed via JavaScript, which is why this tag helps prevent XSS risks.
Q: Can I use JavaScript to retrieve the csrftoken and inject it into a hidden form input? If so, that means a hacker could potentially do the same thing, right? Does this mean the CSRF token is ineffective?
Ans: Yes! You can use JavaScript to retrieve the CSRF token from the cookie and inject it into the hidden form input field dynamically, and due to the  CSRF token is stored in a cookie, but only for the same site, that's important, only for the same site, that's why we must set the SameSite=LAX at least within the Django. By doing that, JavaScript running on another origin cannot access the CSRF token due to browser security policies (Same-Origin Policy).
Q: If I configure the wrong settings, whether through Nginx or Django, to allow requests from different origins using Access-Control-Allow-Origin: *, and I don’t set the SameSite attribute on the CSRF token, could this configuration render the CSRF protection ineffective?
Ans: Yes, misconfiguring CORS (Access-Control-Allow-Origin: *) and not setting the SameSite attribute properly can make CSRF protection ineffective. Let’s break it down. Under this scenario, any origin (including attacker.com) can make requests to your API without restrictions. When a victim visits attacker.com, their sessionid and csrftokenare automatically included in cross-origin requests. Now, How an Attacker Exploits This
Victim logs into bank.com (legitimate site)
Victim visits attacker.com
The attacker executes a fetch() request to bank.com
Because of Access-Control-Allow-Origin: *, the request is allowed.
Because cookies are included (no SameSite restriction), the bank processes the request.
Attack Succeeds!
Finally, after asking a few questions, I now have a clear understanding of basic web security concepts like CSRF, SOP, and CORS. I now deeply appreciate the great advantage of today’s era, where asking AI questions helps resolve confusion.
Takeaways
In the final section, I’ll provide a quick summary of today’s topic, with the assistance of AI, of course. Hope you enjoy it. 
Summary
Term Definition Purpose Example of Risk
SOP Restricts cross-origin access Prevents cross-site data theft A script from evil.com tries to read cookies from bank.com (Blocked by SOP)
CSRF Tricks users into making unintended requests Prevents unauthorized actions A victim clicks a malicious link that transfers money from their bank
CORS Allows controlled cross-origin access Enables APIs to be shared securely Access-Control-Allow-Origin: * allows any site to make API requests (Security Risk!)
✅ SOP blocks cross-origin requests by default.
✅ CORS allows specific cross-origin access (if configured correctly).
✅ CSRF protections prevent attackers from hijacking authenticated actions. 


Simple Anomaly Detection for the Text Data
Paul Liang — Mon, 30 Dec 2024 05:38:18 GMT
Background
Recently, our team faced a challenge, how to detect the suspicious incorrect data from our vendor, in other words, our vendor sometimes will provides incorrect raw data to us. Unfortunately, we don’t have a solution to prevent this issue. As a result, we initially pass the incorrect data through the data pipeline across multiple services, leading to a series of data update errors. Most seriously, we end up providing entirely wrong information to our customers, which could increase product churn - something we definitely want to avoid. 
Goal
First, let me introduce what the data looks like. It's a pure text data with the JSON format and includes multiple parts, and each file has slightly different to each other except one. If you’re interested, you can refer to the following link: GitHub Link, and you can also use the online JSON differ tools to help you recognized the difference quickly.
The file data/MAEU244638355-230015009-45b04974f7594ad299fca7fb5138ac36.json contains the anomaly because it has huge difference compare to others, and our goal is to come up with a solution that helps us identify it.
Solutions
How to find the way to identify the anomaly data? The first idea came to my mind is can I use the text similarity? why not use the hash function to do that? as the mention before where the files are all slightly different and only one file is significantly anomalous so a pure hash-based solution will not work effectively. After deciding the direction of solution, next I started to prepare the dataset and build the POC to verify it.
Compute Pairwise Similarity

Normalize JSON files (sort keys, remove timestamps).
Use a similarity metric to compare files.

Cosine Similarity: Compare vectorized features of JSONs.
Structural Similarity: Compare the structure and key counts.



Identify Anomaly

Calculate the average similarity score for each file compared to the rest.
The file with the lowest average similarity is the anomaly.

Based on these ideas, I’ve built the POC as outlined below.  After running the program, as you can see, we successfully detected the incorrect data. You can review the code first; it’s not complicated, but there are some theoretical concepts I’ll explain in the next section.
import os
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def normalize_json(json_data):
    """Normalize JSON for comparison by sorting keys."""
    return json.dumps(json_data, sort_keys=True)

def compute_similarity(json_files):
    """Compute similarity scores between JSON files."""
    normalized_files = []
    
    for file_path in json_files:
        with open(file_path, 'r') as f:
            data = json.load(f)
        normalized_files.append(normalize_json(data))
    
    # Vectorize the normalized JSON files
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(normalized_files)
    
    # Compute pairwise cosine similarity
    similarity_matrix = cosine_similarity(tfidf_matrix)
    return similarity_matrix

def detect_anomaly(json_files):
    """Detect the anomalous file based on similarity scores."""
    similarity_matrix = compute_similarity(json_files)
    avg_similarity = similarity_matrix.mean(axis=1)
    
    # Find the file with the lowest average similarity
    anomaly_index = np.argmin(avg_similarity)
    anomaly_file = json_files[anomaly_index]
    return anomaly_file, avg_similarity



# Load all JSON files from the "data" folder
data_folder = "data"  # Replace with your folder name
json_files = [os.path.join(data_folder, file) for file in os.listdir(data_folder) if file.endswith('.json')]

print(json_files)

# Detect the anomaly
if json_files:
    anomaly_file, avg_similarity = detect_anomaly(json_files)
    print(f"Anomalous file detected: {anomaly_file}")
    print("Average Similarity Scores:", avg_similarity)
else:
    print("No JSON files found in the specified folder.")
➜ python anomaly_detection.py
['data/MAEU244638355-230015009-45b04974f7594ad299fca7fb5138ac36.json', 'data/MAEU244638355-229834907-111a125fff524e568539fa9ee1259cc9.json', 'data/MAEU244638355-226316743-999b976a3e3147cdbefcb89497bcc8b0.json', 'data/MAEU244638355-228799200-1d919c554e0643c7869dcafe06206f12.json', 'data/MAEU244638355-227412648-4ab0fcecc65441b1800033f0ddcddd67.json']
Anomalous file detected: data/MAEU244638355-230015009-45b04974f7594ad299fca7fb5138ac36.json
Average Similarity Scores: [0.91224814 0.96954086 0.96806412 0.97184808 0.96856012]
Implementation Details
Now, let's dive deeper into the code. The POC consists of serveral key steps, each with its specific purpose:
Normalization: JSON files are normalized by sorting keys to ensure a consistent stucture.
TF-IDF Vectorization: Converts the normalized JOSN strings into a vectorized form.
Cosine Similarity: Measures the pairwise similarity between the files.
Anomaly Detection: Identifies the file with lowest average similarity score. 
Next I will explain the TF-IDF and Cosine Similarity, both of them are simple and easy concept for text analysis.
TF-IDF
The TF-IDF stands for Term Frequency-Inverse Document Frequency, a common technique in natural language processing (NLP) to represent text data in a way that reflects the importance of word in a collection of documents.
Term Frequency (TF)
Measure how often a term (word) appears in a document
Formular:
$$ TF(t) = {\text{Number of times term t appears in a document} \over \text{Total number of terms in the document}} $$
Inverse Document Frequency (IDF)
Measure how important a term is. Terms that occur in many documents are less significant.
Formular:
$$ IDF(t) = {log{\text{Total number of documents} \over \text{Number of documents containing t}}} $$
TF-IDF Weight
Combine TF and IDF to assign a weight to each term in each document.
Formular:
$$ \text{TF-IDF(t)} = {\text{TF(t)} \over \text{IDF(t)}} $$
What the tfidf_matrix Contains
The tfidf_matrix is a 2D sparse matrix where:
Rows represent the documents (in this case, normalized JSON files).
Columns represent unique terms (words) across all documents.
Each entry (i, j) in the matrix is the TF-IDF weight of the term j in document i.
Why use the TF-IDF
Handles Variability in JSON Files: Converts textual differences (e.g., slight variations in keys or values) into comparable numerical data.
Ignores Common Words: Words that appear in all files (e.g., "is", "a") get lower weights, focusing comparisons on more distinctive terms.
Captures Structure: By flattening and normalizing JSON into strings, TF-IDF indirectly captures the structure and content differences between files.
Cosine Similarity
When translating data from text to vectors, there are many mathematical formulas and concepts that can guide us. Here, we choose cosine similarity, which is commonly used to calculate the similarity between two vectors. You can refer to the figure below to visualize this.
Ref: https://pub.aimind.so/understanding-cosine-similarity-and-cosine-distance-in-depth-cc91eac3ef2
The cosine similarity formular comes from the definitin of dot product, in the 2D space the definition of dot product is 
$$ {{\mathbf{A} \cdot \mathbf{B}} = {|\mathbf{A}| |\mathbf{B}|}}{\cos(\theta)} $$
then we can get the cosine similarity is
$$ \text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|} $$
Takeaways
In this blog, I demonstrated how to use TF-IDF and Cosine Similarity with JSON normalization to build a simple anomaly detection system for a text dataset. We also explained what TF-IDF and Cosine Similarity are, and I hope these concepts will be useful to you in the future.
Reference
https://github.com/TaikerLiang/anomaly-detection-poc
https://pub.aimind.so/understanding-cosine-similarity-and-cosine-distance-in-depth-cc91eac3ef2


How do I Take Notes for My Daily Work
Paul Liang — Sun, 20 Oct 2024 09:33:53 GMT
Background
Today, I will show you how I take notes for my daily work. To begin, I'd like to introduce my role and job to give you a better understanding of my responsibility. I am an engineer at a SaaS company, where I spend half of my time focused on system and product operations.
To expain more detailed, when customers use our product and encounter any unexpected outcomes, they may reach out to our support team for assistance. If the support team is unable to resolve the issue, it gets escalated to me. My job is to investigate the problem, whether it's a bug or an error related to a new feature we've implemented. 
We are developing a software service to address the complexities of the supply chain, which is a challenging domain. As a result, out product team may sometimes misunderstand certain aspects, leading to features being delivered with issues which may cause our customer confusion. 
Defense/Offense engineering
For my personally challenges would be how can I quick understand the code behind the certain feature across multiple Github Repos, and I found I am not good at memorize something, so although some of code I have read before, but I still forget it next time, it led me need to read it and understand it agian which really take me a lot of time on it, so that's why I need some note tools to help me overcome this challenges.
VS-Code Extensions
I use VS-Code as my primary IDE for development, and to overcome my challenges, I'm always on the lookout for extensions that can help me be more productive when reading code. Below are two of the extensions I find most useful:
Bookmarks: The Bookmarks extension is invaluable when working across multiple files or large projects. It allows me to place bookmarks within my code, making it easier to quickly jump to important sections. I often use bookmarks to tag the entry points of certain features or the main business logic functions. This way, I can easily navigate to the exact part of the code I need without wasting time searching through files. It significantly improves my workflow, especially when dealing with multiple repositories or complex codebases.
Line Note: Line Note is an incredibly useful extension that allows me to add inline comments or annotations directly to the code. These notes are private, meaning only I can view and edit them, which is particularly helpful for adding context or reminders that are relevant to me personally. For example, when investigating a bug or trying to understand a piece of business logic, I use Line Note to jot down my thoughts and observations. These annotations help me retain important information and allow me to quickly recall why certain decisions were made or what I need to focus on next.
By using these two extensions together, I can quickly locate the code I need and access my personal notes, which saves me a lot of time and effort. This combination helps me retain knowledge from previous investigations, making it easier to solve my daily tasks across multiple product services.
The screenshots of Bookmarks and Line Note
Notion for Daily To-Do List
Notion has been my primary tool for notes and documentation for a long time. I primarily use it for two purpose: 1. quick notes, and 2. my daily to-do list. 
Quick notes refer to the need to record information temporarily, such as ideas I want to jot down immediately to avoid forgetting them, or when my supervisor or teammates require my assistance and I need  a place to note it for future reference.
For my daily to-do list, I created a custom template in Notion. You might wonder why I chose to build my own to-do list when there're so many available apps. I have tried various to-do list apps, but I always questioned why I needed to use an additional app for my daily tasks. My requirements are quite simple, focusing on two main points: 1. displaying today's to-do list, allowing me to archive completed task with a click, and 2. showing all to-do items so I can keep track of upcoming tasks and any incomplete ones. 
I’ll share the Notion template via the following link; if you’re interested, feel free to try it out.
Notion – The all-in-one workspace for your notes, tasks, wikis, and databases.
A new tool that blends your everyday work apps into one. It’s the all-in-one workspace for you and your team
Notion
To-Do List Template
NotebookLM for Quick Learning
NotebookLM is a new AI notebook tool created by Google. It's still in the beta version, but I see huge potential in it for helping users quickly learn about various topics. NotebookLM allows you to upload documents and then ask questions about them, making it much easier to understand complex information without needing to manually search through notes or documentation. 
The screenshots of NotebookLM
Although NotebookLM is still in its early stages, I believe it represents the future of personalized learning tools, and I am trying to integrates NotebookLM with my existing workflow. I am experimenting with combining it with Notion, as I think this could be a powerful way to build a personal knowledge database. By storing notes in Notion and using NotebookLM to explore and understand them in depth, I can create a seamless loop of capturing information and then efficiently revisiting and internalizing it. This combination has the potential to make my learning process more systematic and help me retain information for the long term.
Key Takeaways
In this blog, I shared how I use various notebook tools to manage my daily tasks, and I hope these insights will be helpful to you.
VS-Code extensions:  extensions like Bookmarks and Line Note help enhance productivity by enabling efficient navigation and personalized notes within the codebase.
Notion for task management: Notion is a versatile tool for managing daily tasks and quick notes, offering custom templates that meet specific needs without added complexity.
NotebookLM: NotebookLM's interactive features make learning more engaging by providing personalized insights and breaking down complex topics.


How to Debug Your GitHub Action in Your Local Environment
Paul Liang — Sun, 18 Aug 2024 17:35:33 GMT
Background
Recently, I needed to use GitHub Actions to complete the CI/CD flow for a project. Based on my previous experiences, I've encountered numerous challenges in validating GitHub Actions, which required me to submit multiple commits to check if they were functioning properly, as shown in the following picture.
This time, I started looking for a tool or solution that could allow me to validate the script in my local environment and save a lot of time. In this blog, I will show you how I use Act to complete and validate GitHub Action scripts in my local environment. 
Act is a tool that can run your GitHub Actions locally! 
ACT Installation
I used the homebrew to install the ACT on my Mac. 
$ brew install act
Usage guide
I will use the repo of github-action-demo to demonstrate some basic usage and complete the following case study section.
GitHub - cplee/github-actions-demo: Demonstration of GitHub Actions
Demonstration of GitHub Actions. Contribute to cplee/github-actions-demo development by creating an account on GitHub.
GitHubcplee
$ git clone git@github.com:cplee/github-actions-demo.git
$ cd github-actions-demo
$ act -l --container-architecture linux/amd64
INFO[0000] Using docker host 'unix:///var/run/docker.sock', and daemon socket 'unix:///var/run/docker.sock'
Stage  Job ID  Job name  Workflow name  Workflow file  Events
0      test    test      CI             main.yml       push
💡
By default act will run all jobs in every workflow in .github/workflows.
First, we run the act -l command, which will show all the workflows within the repo; as we can see, there is only one workflow called main.yml and the content of main.yml would be:
name: CI
on: push

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-node@v1
    - run: npm install
    - run: npm test
It's pretty simple: just run an ubuntu-latest environment, install the necessary packages, and then try to pass all the sts within the tests folder. Next, let's focus on the second line of this file: on: push. What does this mean? It means this CI workflow will only be triggered when a push event occurs. 
Let us do some simple experiments:
$ act pull_request
INFO[0000] Using docker host 'unix:///var/run/docker.sock', and daemon socket 'unix:///var/run/docker.sock'
Error: Could not find any stages to run. View the valid jobs with `act --list`. Use `act --help` to find how to filter by Job ID/Workflow/Event Name
We used Act to mock the pull_request event, but encountered an error message stating that no stages were found to run. This is expected because the workflow is only triggered by the push event. Next, we will use Act to mock the push event and see what happens.
➜ act --container-architecture linux/amd64 push
INFO[0000] Using docker host 'unix:///var/run/docker.sock', and daemon socket 'unix:///var/run/docker.sock'
[CI/test] 🚀  Start image=catthehacker/ubuntu:act-latest
[CI/test]   🐳  docker pull image=catthehacker/ubuntu:act-latest platform=linux/amd64 username= forcePull=true
[CI/test] using DockerAuthConfig authentication for docker pull
...
...
...
As we can see, it trigger the workflow successfully, pretty simple right? Ok, now we know how to use the act to trigger the workflow, next we will delve into a more complicated case in the next section.
Case Study
In this case, we have several goals to complete within the CI flow:
Run the tests 
Upload a file to the AWS S3
Build the docker image and push it to AWS ECR
The original main.yml already covers the goal of running the tests, so we'll skip that part. For the second goal, uploading a file to AWS S3, we want to use Act to test the CI flow in our local environment. Therefore, we also need a local AWS service to help complete the testing environment. LocalStack is a tool that allows you to develop and test your AWS applications locally, reducing development time.
For the second and third goals, we will integrate LocalStack with Act to complete the entire CI flow in our local environment.
Please follow the instructions on the LocalStack official website to install and set up the necessary credentials; I won't cover that part here.
Upload a file to S3
Next, we use the AWS command to create a S3 bucket demo . Later, we will upload a file to this bucket when we run the CI flow.
$ aws --endpoint-url=http://localhost:4566 --profile localstack s3 mb s3://demo
make_bucket: demo
$ aws --endpoint-url=http://localhost:4566 --profile localstack s3 ls
2024-07-21 13:13:24 demo
After successfully creating the bucket, we will add some AWS-related scripts to complete the CI process.
name: CI
on: push

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup the node
        uses: actions/setup-node@v1

      - name: Install packages
        run: npm install

      - name: Run the tests
        run: npm test

      - name: Install AWS CLI
        run: |
          sudo apt-get update
          sudo apt-get install -y awscli

      - name: Configure AWS CLI
        run: |
          mkdir -p ~/.aws
          echo "[profile localstack]" > ~/.aws/config
          echo "region = us-east-1" >> ~/.aws/config
          echo "output=json" >> ~/.aws/config
          echo "[localstack]" > ~/.aws/credentials
          echo "aws_access_key_id = test" >> ~/.aws/credentials
          echo "aws_secret_access_key = test" >> ~/.aws/credentials
          echo "Complete Configure AWS CLI"

      - name: Upload file to S3
        run: |
          aws --endpoint-url=http://localhost:4566 --profile localstack s3 cp src/upload.txt s3://demo

Then, re-running the act push command to see the result. 
$ aws --endpoint-url=http://localhost:4566 --profile localstack s3 ls s3://demo
2024-08-18 10:50:11          0 upload.txt
Great! Now, we can upload the file to the S3 successfully, and let's move on to the next section.
Build the Docker image and push it to AWS ECR.
At the beginning of this section, we need to use the LocalStack to create an ECR repository first.
$ aws --endpoint-url=http://localhost:4566 --profile localstack ecr create-repository --repository-name demo
Please note that ecr create-repository is a pro feature of LocalStack. If you want to use it, please consider applying for their hobby subscription.
Then we need to get the repositoryUri for pushing the docker image to the ECR later, here is another command to retrieve the repositoryUri info.
$ aws --endpoint-url=http://localhost:4566 --profile localstack ecr describe-repositories
{
    "repositories": [
        {
            "repositoryArn": "arn:aws:ecr:us-east-1:000000000000:repository/demo",
            "registryId": "000000000000",
            "repositoryName": "demo",
            "repositoryUri": "000000000000.dkr.ecr.us-east-1.localhost.localstack.cloud:4566/demo",
            "createdAt": "2024-08-18T14:36:16+08:00",
            "imageTagMutability": "MUTABLE",
            "imageScanningConfiguration": {
                "scanOnPush": false
            },
            "encryptionConfiguration": {
                "encryptionType": "AES256"
            }
        }
    ]
After completing the setup, we need to add some steps to finish this section.
name: CI
on: push

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup the node
        uses: actions/setup-node@v1

      - name: Install packages
        run: npm install

      - name: Run the tests
        run: npm test

      - name: Install AWS CLI
        run: |
          sudo apt-get update
          sudo apt-get install -y awscli

      - name: Configure AWS CLI
        run: |
          mkdir -p ~/.aws
          echo "[profile localstack]" > ~/.aws/config
          echo "region = us-east-1" >> ~/.aws/config
          echo "output=json" >> ~/.aws/config
          echo "[localstack]" > ~/.aws/credentials
          echo "aws_access_key_id = test" >> ~/.aws/credentials
          echo "aws_secret_access_key = test" >> ~/.aws/credentials
          echo "Complete Configure AWS CLI"

      - name: Log in to LocalStack ECR
        run: |
          aws ecr get-login-password --region us-east-1 --endpoint-url=http://localhost:4566 --profile localstack

      - name: Build Docker image
        run: |
          docker build -t github-action-demo:latest .
      
      - name: Tag Docker image
        run: |
          docker tag github-action-demo:latest localhost:4566/demo/github-action-demo:latest

      - name: Push Docker image to LocalStack ECR
        run: |
          docker push 000000000000.dkr.ecr.us-east-1.localhost.localstack.cloud:4566/demo/github-action-demo:latest

      - name: Upload file to S3
        run: |
          aws --endpoint-url=http://localhost:4566 --profile localstack s3 cp src/upload.txt s3://demo

Next, we will trigger the act push again to see the result.
$ act push
...
...
...
[CI/build]   ✅  Success - Main Build Docker image
[CI/build] ⭐ Run Main Tag Docker image
[CI/build]   🐳  docker exec cmd=[bash --noprofile --norc -e -o pipefail /var/run/act/workflow/8] user= workdir=
[CI/build]   ✅  Success - Main Tag Docker image
[CI/build] ⭐ Run Main Push Docker image to LocalStack ECR
[CI/build]   🐳  docker exec cmd=[bash --noprofile --norc -e -o pipefail /var/run/act/workflow/9] user= workdir=
| The push refers to repository [000000000000.dkr.ecr.us-east-1.localhost.localstack.cloud:4566/demo/github-action-demo]
0fe414dcbcfc: Layer already exists
905029b2faaa: Layer already exists
85dfe2a5d583: Layer already exists
41d31c4461e4: Layer already exists
e8ec0d6af932: Layer already exists
eb8031fe07bc: Layer already exists
67580865a2da: Layer already exists
66fe03bf3f29: Layer already exists
f752cb05a39e: Layer already exists
20f026ae0a91: Layer already exists
f21c087a3964: Layer already exists
cedb364ef937: Layer already exists
latest: digest: sha256:36dbf0d1894f7af04c8fefba107cb2cd0c609112ad17c71af6d91529cba5a61c size: 2839
[CI/build]   ✅  Success - Main Push Docker image to LocalStack ECR
[CI/build] ⭐ Run Main Upload file to S3
[CI/build]   🐳  docker exec cmd=[bash --noprofile --norc -e -o pipefail /var/run/act/workflow/10] user= workdir=
upload: src/upload.txt to s3://demo/upload.txt                  ng
[CI/build]   ✅  Success - Main Upload file to S3
[CI/build] Cleaning up container for job build
[CI/build] 🏁  Job succeeded

INFO    ️📣 A newer version of 'act' is available - consider ugrading to 0.2.65.
Congratulations! We can now successfully run the full CI flow to complete our job.
Key Takeaways
In this post, we demonstrate how to use Act to develop GitHub Action scripts in your local environment with LocalStack. Doing so can save a lot of time when validating the scripts, enjoy!
Reference
https://www.youtube.com/watch?v=OW121yjV1IM


Compare multiple serializer solutions within the Django application
Paul Liang — Mon, 10 Jun 2024 06:06:56 GMT
Background
This blog is inspired by the article "Improve Serialization Performance in Django Rest Framework". The author compared various serializer solutions in the Django Rest Framework, such as simple functions and regular serializers. Unfortunately, after reading this article, I noted that the development packages used are somewhat outdated (versions: they use Python 3.7, Django 2.1.1, and Django Rest Framework 3.9.4.). Consequently, I have decided to recreate these environments using the latest versions of the packages to enhance their credibility.
Introduction 
Before delving into the detailed experiments, I will list the methods I plan to compare, along with the versions of packages in my development environment:   
Serializer solutions
Data Class
Regular Serializer
Model Serializer
Simple function
Pydantic 
In recent years, Pydantic has emerged as the most widely used data validation library for Python, which is why I've included it in my comparison list. 
Versions in My Local Environment
Python 3.10
Django 5.0.6
Django Rest Framework 3.15.1
Pydantic 2.7.3
Environment Setup
I created the two models called Product and Order within a new Django project for these experiments, and there is a relation between the Product and Order which can make the experiments more relevant to daily operations.
models.py
from django.db import models


class Product(models.Model):
    id = models.AutoField(primary_key=True)
    name = models.CharField(max_length=100)
    price = models.DecimalField(max_digits=10, decimal_places=2)
    stock = models.IntegerField()
    remark = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.name

class Order(models.Model):
    id = models.AutoField(primary_key=True)
    product = models.ForeignKey(Product, on_delete=models.CASCADE)
    quantity = models.IntegerField()
    order_date = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return f"Order {self.id} for {self.product.name}"
And I also create a Django command for populating data.
populate_data.py
import random
from django.utils import timezone
from django.core.management.base import BaseCommand
from myapp.models import Product, Order

class Command(BaseCommand):
    help = 'Populate the database with test data'

    def handle(self, *args, **kwargs):
        Product.objects.all().delete()
        Order.objects.all().delete()
        
        # Create products
        products = []
        for i in range(1000):
            product = Product(
                name=f'Product {i}',
                price=random.uniform(10.0, 100.0),
                stock=random.randint(1, 100),
                remark='Remark for product',
                created_at=timezone.now()
            )
            products.append(product)
        
        # Bulk create products with a batch size of 500
        Product.objects.bulk_create(products, batch_size=500)

        # Fetch all products to get their IDs
        all_products = list(Product.objects.all())
        
        # Create orders
        orders = []
        for i in range(10000):
            order = Order(
                
                product=random.choice(all_products),
                quantity=random.randint(1, 10),
                order_date=timezone.now()
            )
            orders.append(order)
        
        # Bulk create orders with a batch size of 500
        Order.objects.bulk_create(orders, batch_size=500)
        
        self.stdout.write(self.style.SUCCESS('Successfully populated the database with test data'))
Next, we need to run the following commands to complete data preparation.
$ python manage.py makemigrations
$ python manage.py migrate
$ python manage.py populate_data
Experiment 
As previously mentioned, we have various methods for comparison. In this section, we will implement the necessary serializer solutions.
Data Class
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ProductData:
    id: int
    name: str
    price: float
    stock: int
    remark: str
    created_at: datetime

@dataclass
class OrderData:
    id: int
    product: ProductData
    quantity: int
    order_date: datetime
Regular Serializer
from rest_framework import serializers

class ProductSerializer(serializers.Serializer):
    id = serializers.IntegerField()
    name = serializers.CharField(max_length=100)
    price = serializers.DecimalField(max_digits=10, decimal_places=2)
    stock = serializers.IntegerField()
    remark = serializers.CharField()
    created_at = serializers.DateTimeField()
    
class OrderSerializer(serializers.Serializer):
    id = serializers.IntegerField()
    product = ProductSerializer()
    quantity = serializers.IntegerField()
    order_date = serializers.DateTimeField()
Model Serializer
from rest_framework import serializers
from .models import Product, Order

class ProductModelSerializer(serializers.ModelSerializer):
    class Meta:
        model = Product
        fields = ['id', 'name', 'price', 'stock', 'remark', 'created_at']

class OrderModelSerializer(serializers.ModelSerializer):
    product = ProductModelSerializer()

    class Meta:
        model = Order
        fields = ['id', 'product', 'quantity', 'order_date']
Simple Function
from typing import Dict, Any

def serialize_product(product: Product) -> Dict[str, Any]:
    return {
        'id': product.id,
        'name': product.name,
        'price': float(product.price),
        'stock': product.stock,
        'remark': product.remark,
        'created_at': product.created_at.isoformat() if product.created_at else None,
    }

def serialize_order(order: Order) -> Dict[str, Any]:
    return {
        'id': order.id,
        'product': serialize_product(order.product),
        'quantity': order.quantity,
        'order_date': order.order_date.isoformat() if order.order_date else None,
    }
Pydantic
from pydantic import BaseModel
from typing import List
from datetime import datetime

class ProductDataPydantic(BaseModel):
    id: int
    name: str
    price: float
    stock: int
    remark: str
    created_at: datetime

class OrderDataPydantic(BaseModel):
    id: int
    product: ProductDataPydantic
    quantity: int
    order_date: datetime
Experiment Results
For this experiment I prepare a Django command and use the  line_profiler to help me measure the performance for different methods. As you can see below I have the five functions corresponding to different methods for performance evaluation. 
profile_serialization.py
from django.core.management.base import BaseCommand
from django.db import connection
from myapp.models import Product, Order
from myapp.dataclasses import ProductData, OrderData
from myapp.serializers import ProductSerializer, ProductModelSerializer, OrderSerializer, OrderModelSerializer, serialize_product, serialize_order
from myapp.pydantic_models import ProductDataPydantic, OrderDataPydantic

class Command(BaseCommand):
    help = 'Profile serialization performance'

    def handle(self, *args, **kwargs):
        # Read data from the database with select_related
        orders = list(Order.objects.select_related('product').all())
        print("length of orders:", len(orders))

        # Dataclass serialization
        self.profile_dataclass_serialization(orders)
        
        # Django Serializer
        self.profile_django_serializer(orders)
        
        # Django ModelSerializer
        self.profile_django_model_serializer(orders)
        
        # Simple Function-based Serialization
        self.profile_simple_function(orders)

        # Pydantic Serialization
        self.profile_pydantic_serialization(orders)

    @profile
    def profile_dataclass_serialization(self, orders):
        order_data = [
            OrderData(
                id=order.id,
                product=ProductData(
                    id=order.product.id,
                    name=order.product.name,
                    price=float(order.product.price),
                    stock=order.product.stock,
                    remark=order.product.remark,
                    created_at=order.product.created_at
                ),
                quantity=order.quantity,
                order_date=order.order_date
            ) for order in orders
        ]

    @profile
    def profile_django_serializer(self, orders):
        order_serializer = OrderSerializer(orders, many=True)
        order_data = [order for order in order_serializer.data]

    @profile
    def profile_django_model_serializer(self, orders):
        order_model_serializer = OrderModelSerializer(orders, many=True)
        order_data = [order for order in order_model_serializer.data]

    @profile
    def profile_simple_function(self, orders):
        simple_serialized_orders = [serialize_order(order) for order in orders]

    @profile
    def profile_pydantic_serialization(self, orders):
        order_data = [
            OrderDataPydantic(
                id=order.id,
                product=ProductDataPydantic(
                    id=order.product.id,
                    name=order.product.name,
                    price=float(order.product.price),
                    stock=order.product.stock,
                    remark=order.product.remark,
                    created_at=order.product.created_at
                ),
                quantity=order.quantity,
                order_date=order.order_date
            ) for order in orders
        ]
You can run the following command to get the final results, and the table shows the results of different methods.
$ kernprof -l -v manage.py profile_serialization



Method
Result (seconds)
Data Validation




Dataclass serialization
0.05273  s
N


Django Serializer
0.443061 s
Y


Django ModelSerializer
0.423694 s
Y


Simple Function-based Serialization
0.032002 s
N


Pydantic Serialization
0.086857 s
Y



It's not surprising that simple function-based serialization performed the best in this comparison due to its simplicity and lack of data validation capabilities. However, I was somewhat surprised to find that the performance of Django's impression that ModelSerializer would perform worse, but the test showed nearly the same performance for both (Serializer and ModelSerializer). This is intriguing, and we will explore this scenario in the next section. 
Upon reviewing the results table again, the most appealing solution to me is Pydantic Serialization; it performs the best among all serialization methods with data validation functionality. Remarkably, there is not much difference in performance between Pydantic and serializations without data validation. This is why Pydantic has recently become the most popular serialization/data validation tool in Python.
Further Discussion
In this section, we will discuss the performance results of the Serializer and ModelSerializer. Upon investigation, both have similar implementation approaches. There are two parts to the implementation:
Field Handling: Both Serializer and ModelSerializer handle fields in a similar manner once they are defined. For read operations, both serializers iterate over the fields and generate the output dictionary.
Field Definitions: While Serializer requires explicit field definitions, ModelSerializer introspects the model and automatically creates the fields
I believe Django and DRF have addressed performance issues with ModelSerializer, which is reflected in today's experiment results. ModelSerializer has a slight overhead during the initialization phase due to model introspection. However, this overhead is negligible during read operations because it happens only once, resulting in performance that is almost identical to that of Serializer.
Conclusion
I took some time to complete this experiment, but I acknowledge there may be some omissions or deficiencies. I welcome your corrections. At the end of this post, I believe there is a key takeaway for you. 
💡
Stop using Django Serializer solutions including both regular serializer and Model Serializer. Pydantic would be a better choice for implementing serialization functionality.
Reference
https://hakibenita.com/django-rest-framework-slow#modelserializer



How to Modularize Your Airflow Project
Paul Liang — Sun, 03 Mar 2024 18:38:30 GMT
Background
As AI & ML technologies mature, more and more companies are beginning to explore the integration of these technologies into their internal products. Bringing AI & ML solutions to companies poses a significant challenge in recent years.
This post will introduce how to use Airflow to build a small MLOps project aimed at providing a machine learning (ML) pipeline to reduce the cost of deploying ML models. Additionally, we will take some time to discuss modularization in Airflow.
For this demonstration, we have selected the Titanic competition, a renowned challenge on Kaggle, as our example dataset. For detailed installation instructions and settings, please refer to my Github repo.
ML Pipeline Lifecycle
As you can see, I have defined the basic pipeline as shown in the figure above. However, due to differences in business models and architectures among different companies, you may need to make some adjustments to fit your current architecture.
I referred to several articles and videos to define the basic pipeline flow for beginners:
Step 1: We will load the raw data from S3, and here we will use LocalStack to help us implement this function in our local environment. 
Step 2: Prepare the training data for the model, including data preprocessing, feature selection, etc. 
Step 3: Export the training data for model training. 
Step 4: Perform model training. 
Step 5: Export the evaluation results to users via notification solutions such as Slack or Telegram messages. 
Step 6: Export the model to S3.
Then I divided these steps within the pipeline to the four groups:
Data Preparation Group
Model Training Group
Model Deployment Group
Notification Group
All implementations revolve around this pipeline flow and these four groups, and these definitions serve as the fundamental concepts of modularization."
Modularization
Next, let's delve into modularization within the Airflow project. Most articles and online resources tend to place all related code in a single file. However, from my perspective, this approach becomes challenging to maintain as the project grows. Therefore, I have explored how to modularize within Airflow myself and developed a final solution.
For each task (i.e., the problem you want to solve using ML), we will create a main DAG file and related functions, which will be organized into different files to achieve modularization.
dags: This directory contains the main DAG (Directed Acyclic Graph) functions.
dags/func: Within this directory, sub-folders are organized for modularization. Each folder name corresponds to a main DAG function. Inside each sub-folder, the entire ML pipeline is divided into different steps, with each step having its own file to complete the related function.
The tree structure looks like:
./dags
├── __init__.py
├── func
│   ├── __init__.py
│   └── titanic
│       ├── __init__.py
│       ├── data_preparation.py
│       ├── model_deployment.py
│       ├── model_training.py
│       └── notification.py
└── titanic.py
We have a main DAG file named titanic and a sub-folder under func also named titanic. This structure signifies that all functions within this folder are related to the main DAG titanic. This approach allows us to easily organize the folder architecture for multiple ML tasks.
titanic.py
from datetime import datetime, timedelta
import json
import pickle
import base64

import pandas as pd
from airflow.decorators import task
from airflow.decorators import dag
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.transfers.local_to_s3 import LocalFilesystemToS3Operator
from airflow.utils.task_group import TaskGroup

from func.titanic.data_preparation import load_data_from_s3, data_preprocessing, export_the_training_data_to_s3, delete_temp_file
from func.titanic.model_training import get_dataset, train_model, evaluate_model
from func.titanic.model_deployment import save_model


@dag(schedule_interval='@daily', start_date=days_ago(1), catchup=False, tags=['example'])
def titanic_flow():


    with TaskGroup("data_preparation_group") as data_preparation:
        data = load_data_from_s3()
        train_data = data_preprocessing(data) # type: ignore
        tmp_filename = export_the_training_data_to_s3(train_data) # type: ignore

        upload_to_s3_task = LocalFilesystemToS3Operator(
            task_id='upload_to_s3',
            filename=tmp_filename, 
            dest_key='titanic/train.csv',
            dest_bucket='airflow',
            aws_conn_id='aws_localstack',  # Airflow AWS connection ID which can be created through the UI
            replace=True,
        )

        tmp_filename >> upload_to_s3_task
        upload_to_s3_task >> delete_temp_file(tmp_filename)


    with TaskGroup("model_training") as model_training:
        dataset = get_dataset(data=train_data)
        model = train_model(dataset['X'], dataset['y'])
        results = evaluate_model(model, dataset['X'], dataset['y'])


    with TaskGroup("model_deployment") as model_deployment:
        # Save the trained model to a temporary file
        model_file_path = save_model(model)

        # Task to upload the model to S3
        upload_model_to_s3 = LocalFilesystemToS3Operator(
            task_id='upload_model_to_s3',
            filename=model_file_path,
            dest_key='titanic/titanic_model.pkl',
            dest_bucket='airflow',
            aws_conn_id='aws_localstack', 
            replace=True,
        )

        results >> model_file_path >> upload_model_to_s3


dag = titanic_flow()
As you can see, the main DAG function's result is now very simple and clear. We have successfully abstracted away the details into other functions, allowing the main DAG function to focus solely on the data flow and data passing. Not bad, right?
Some Tips
Passing data between each step was the most challenging part during the implementation of this demo project. Perhaps because I was not very familiar with Airflow at the time, I often encountered XComArg-related error messages like the following. This occurred because Airflow automatically wraps your return value in an XComArg. If you want to return multiple values in a single function, please remember to return them in dictionary format instead of tuple format.
Argument of type "XComArg" cannot be assigned to parameter "data" of type "DataFrame"
  "XComArg" is incompatible with "DataFrame"PylancereportGeneralTypeIssues
(variable) train_data: XComArg
Key Takeaways
In this post, I have demonstrated how to build a simple MLOps pipeline using Python and Airflow. We also discussed modularization in Airflow, although it may not be the optimal solution (as I am still exploring better approaches), I believe I have shared some valuable insights that may prompt you to rethink your approach.
Next, there are several areas where this demo project can be further improved, including:
DAG and unit testing
Container development environment setup
Multi-model comparison flow
Deployment with tags
Notification implementation
Integration of CI/CD into the MLOps workflow
Debugging production code
In conclusion, I hope you found my ideas on modularizing your Airflow project helpful, and I look forward to continuing to explore the world of Airflow and MLOps in the future. Cheers!
Reference
https://www.kaggle.com/competitions/titanic
https://www.kaggle.com/code/alexisbcook/titanic-tutorial
https://docs.localstack.cloud/overview/
https://proclusacademy.com/blog/practical/k-fold-cross-validation-sklearn/


[Python] How to Improve The Performance of Heavy SQL Joins
Paul Liang — Sun, 14 Jan 2024 14:18:35 GMT
Background
Software engineering is a multidisciplinary field that encompasses the design, development, testing, and maintenance of software systems. In the realm of software development, one critical aspect that demands continuous attention is database performance tuning. This process involves optimizing the efficiency of database operations, ensuring that the software interacts with the database in the most effective and responsive manner. Today I want to share my experience in improving the performance of heavy SQL join.
Case Study
Alright! First, let me introduce the table schema and provide a sample code. Afterward, we can discuss the issues present in the sample code.
class OceanShipmentExportInfo(models.Model):
    id = models.AutoField(primary_key=True)
    booking_no = models.CharField(max_length=32, null=True, db_index=True)
    
    
class OceanShipment(models.Model):
    class Meta:
        unique_together = ("lookup_id", "subscriber")
   
    subscriber = models.ForeignKey(Subscriber, related_name='shipment_set', null=True, on_delete=models.CASCADE)
    lookup_id = models.CharField(max_length=32, null=True)
    type = models.CharField(max_length=1, default='I', choices=SHIPMENT_TYPE_CHOICES)
    ETA = models.DateTimeField(null=True, db_index=True)
    oe_info = models.OneToOneField(
        'OceanShipmentExportInfo', related_name='shipment', null=True, on_delete=models.SET_NULL
    )
    MBL_NO = models.CharField(max_length=32, db_index=True, null=True)
    HBL_NO = models.CharField(max_length=32, db_index=True, null=True)
    lookup_id = models.CharField(max_length=32, null=True)
  
def search(search_option, subscriber):
    query = OceanShipment.objects.filter_by_subscriber(None, subscriber).filter(type=search_option.type)

    if search_option.keyword:
        query = query.filter(
            Q(HBL_NO__icontains=search_option.keyword)
            | Q(MBL_NO__icontains=search_option.keyword)
            | Q(oe_info__booking_no__icontains=search_option.keyword)
            | Q(lookup_id=search_option.keyword)
        )

    return query.order_by('-ETA')
Next, let's delve into this piece of code. We have two models named OceanShipmentExportInfo and OceanShipment. In the search function, if there's a keyword to be searched, we look at the following fields: 1. HBL_NO (OceanShipment), 2. MBL_NO (OceanShipment), 3. booking_no (OceanShipmentExportInfo), 4. lookup_id (OceanShipment).
Now, the issue arises because booking_no is in the other model (OceanShipmentExportInfo), and there's a one-to-one relation between OceanShipmentExportInfo and the OceanShipment model. This implies that this query will perform a join operation to complete the search function. The problem is that both OceanShipmentExportInfo and OceanShipment contain millions of records, leading to this query taking almost 7 seconds to complete, which is a very bad performance. 
Solutions
When aiming to optimize SQL queries for speed, the immediate consideration is usually indexing. However, upon reviewing the model, it became apparent that all fields used in filtering and joining were already indexed. (Remember, for OneToOneField, Django automatically creates an index on the foreign key field by default.)
Upon closer examination, I identified that the join operation (Q(oe_info__booking_no__icontains=search_option.keyword)) was the bottleneck causing the query to slow down. Remarkably, removing this line resulted in a significant improvement, bringing the query performance to 100 ms.
This led me to explore ways to circumvent SQL JOIN. The solution was a refactoring of the search function, and the results were impressive.
def search(search_option: SearchOption, subscriber=None):
    query = OceanShipment.objects.filter_by_subscriber(None, subscriber).filter(type=search_option.type)
    booking_ids = None
    if search_option.keyword:
        booking_ids = OceanShipmentExportInfo.objects.filter(booking_no=search_option.keyword).values('id')

        if booking_ids:
	        query = query.filter(
	            Q(HBL_NO__icontains=search_option.keyword)
	            | Q(MBL_NO__icontains=search_option.keyword)
	            | Q(oe_info__in=booking_ids)
	            | Q(lookup_id=search_option.keyword)
	        )
        else:
	        query = query.filter(
	            Q(HBL_NO__icontains=search_option.keyword)
	            | Q(MBL_NO__icontains=search_option.keyword)
	            | Q(lookup_id=search_option.keyword)
	        )
As observed, I retrieved the booking_ids through a separate query operation first. Then, if there are booking_ids, we perform the search using only the oe_info foreign key id without needing a JOIN operation. In summary, we utilized two straightforward search queries without any JOIN operations to achieve the same search function. The outcome is remarkably positive, with the overall function performance improving from 7 seconds to 100 ms. That's quite impressive.
Key Takeaways
After implementing this enhancement, I've gained some insights to share with all of you. SQL JOIN is a costly operation; while it's commonly used to address the N+1 problem, we should strive to minimize its usage whenever possible. As demonstrated in this case, employing multiple SQL operations without JOIN can yield better performance, especially when both tables contain a significant number of records.



[Python] Case Study for Handling Python Singleton Errors
Paul Liang — Sat, 30 Sep 2023 14:18:32 GMT
Background
Recently, I encountered an unusual bug, and upon investigation, we discovered that the root cause was related to the singleton object we had implemented earlier. This bug was deeply concealed, making it quite challenging to uncover. In this blog post, I will narrate the entire journey. Let's dive in.
Singleton
First, we need to know what the singleton is and how to implement it in Python. The Singleton pattern restricts the instantiation of a class to a single instance and provides a global point of access to that instance, and the following shows a simple example of creating a singleton object in Python:
class ConfigGetter:
    _instance = None
    _config_cache = {}

    def __init__(self) -> None:
        print("__init__")

    def __new__(cls, *args, **kwargs):
        print("cls._instance", cls._instance)
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance
We use the magic method __new__ to complete the singleton in Python. __new__ is a class method (i.e., it's defined on the class itself) and is responsible for creating a new instance of the class. It's called before __init__, and its primary purpose is to return a new instance of the class.
class TestView(APIView):
    def get(self, request):
        config_getter = ConfigGetter()        
        return Response(status=status.HTTP_200_OK)
Next, we use a simple API to test whether the singleton object works fine or not. We start a web service (Django) and call the API twice.  You can see the output like the following:
Django version 3.2.15, using settings 'config.settings.dev'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
cls._instance None
init
cls._instance 
init
the first time, the cls._instance was None, but the second time, it showed it was a ConfigGetter object.  The singleton object appears to work correctly, but please be cautious of the __init__ will still execute twice, which means the attributes of the singleton object may vary depending on your __init__. 
Case Study
Alright, after demonstrating how to implement the singleton in Python, let's return to the story we initially intended to share. We have a configuration table that controls which crawlers are enabled. When we execute the code, it checks this configuration table using the ConfigGetter object. 
Here's an example of the code:
class ConfigGetter:
    _instance = None
    _config_cache = {}

    def __new__(cls, *args, **kwargs):
        print("cls._instance", cls._instance)
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_config(self, scac_code: str) -> CarrierConfig:
        if scac_code not in self._config_cache:
            config = self._prepare_config(scac_code=scac_code)
            self._config_cache[scac_code] = config

        return self._config_cache[scac_code]
 
The _prepare_config the function will load the data from the database, and as you can see, the smart us at that time was to utilize the process cache and singleton object to reduce the query times for better performance.  So, what problems does this approach introduce?"
The answer is that we can't immediately apply database changes to the code. To elaborate, if I update the config table and want the program to apply these changes immediately, can we achieve this with the code we designed above? Absolutely not. So, can we implement it correctly while also reducing database queries?
Problem definition
Okay, let's recap our problem. We have a web service running on Django, and we want to use a config table to control which crawler is currently enabled. We have two key criteria to fulfill:
After updating the config table, the program should immediately apply these changes.
We need to ensure good performance. We don't want the program to query the config table every time because these changes are infrequent.
In the beginning, we proposed a simple solution for handling this case; we reset the _config_cache after we updated the config table. It will look like:
class ConfigGetter:
    _instance = None
    _config_cache = {}

    def __new__(cls, *args, **kwargs):
        print("cls._instance", cls._instance)
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_config(self, scac_code: str) -> CarrierConfig:
        if scac_code not in self._config_cache:
            config = self._prepare_config(scac_code=scac_code)
            self._config_cache[scac_code] = config

        return self._config_cache[scac_code]
        
    def reset(self):
        self._config_cache = {}
 
But after giving it some thought, this solution may not work as expected. In a Django application that runs with multiple processes, each process will indeed have its own Singleton object. This behavior is because each process operates independently and maintains its own separate memory space. 
What does that mean? If we have 5 processes in a Django application, it's hard to reset _config_cache for all processes, so here's the next solution: 
Could we create a singleton object across all processes?
import multiprocessing

class Singleton:
    _instance = None
    _config_cache = {}
    _lock = multiprocessing.Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance
        
    def get_config(self, scac_code: str) -> CarrierConfig:
        if scac_code not in self._config_cache:
        config = self._prepare_config(scac_code=scac_code)
        self._config_cache[scac_code] = config

        return self._config_cache[scac_code]

    def reset(self):
        self._config_cache = {}
 
Alright, this seems to be shaping up nicely, doesn't it? We proceeded with testing and reflecting on whether there were any potential issues with this solution. Another concern surfaced: if we implement this Singleton pattern and run both the Django and Celery processes within the same Python interpreter instance, they will indeed share the same Singleton object, all thanks to multiprocessing.Lock(). 
However, if you are running the Django and Celery processes in separate interpreter instances (for example, running them on separate servers), they will not share the same Singleton object. In this case, you would need to use a different method to share the Singleton instance across processes, such as using a shared memory object or a separate server process to manage the Singleton instance.
Summary
In the end, we opted for the Redis cache solution to resolve this issue. After updating the config table, we clear the Redis cache, forcing the program to query the table and rebuild the Redis cache. I hope you found this journey insightful, and if you have any great ideas or better solutions, please leave a comment; I would greatly appreciate it. Thanks for reading!


Use Nginx To Enable CORS for Multiple Origins
Paul Liang — Sat, 09 Sep 2023 07:46:18 GMT
Background
Recently, I needed to hand over my previous side project to others, and I started to think about how could I complete it in an easy, simple, and low-communication cost way. That's when the term 'docker-compose' popped into my head. So, I decided to complete the task with 'docker-compoes'.  
Soon enough, problems started cropping up. This side project consists of three services: the client, console, and API service, each with its own corresponding URLs, like these:
client: domain-a.com
console: console.domain-a.com
server: api.domain-a.com 
In appearance, this setting will hit the CORS issue, we will briefly introduce the CORS later. To address it, we initially applied a quick fix by adding the 'Access-Control-Allow-Origin *' setting to our Nginx.
However, this solution isn't perfect because we actually want to permit access to the resource from multiple origins only. With 'Access-Control-Allow-Origin *,' essentially, anyone on the internet can attempt to access our resources, which poses more risk.
In this post, I'll share how to allow access from multiple origins to your resources specifically. Let's dive in.
CORS 
First, let's get a handle on what CORS is. CORS stands for Cross-Origin Resource Sharing. Why did we need it? Well, it all goes back to the early days when people hadn't quite caught on to this problem. See, hackers would attempt to create phishing websites, like those horoscope or farm news sites you might have seen. Sneakily tucked behind these innocent-looking sites, they'd slip in some script code to access resources from other websites using your browser's cookies. This is what's known as CSRF.
CORS, along with the same-site attribute for cookies, can effectively prevent CSRF on the browser side. I won't delve deeper into this aspect; if you're interested, there are plenty of resources available on the internet.
Now, returning to CORS, what does the error actually look like?
Why? How does the browser figure out that a request has hit the CORS limitation? Well, the browser breaks down the URL into three parts: the scheme, domain, and port. When all three components match, it's considered the 'same origin.' Here are some examples:
https://domain-a.com is my website.
schema: https
domain: domain-a
port: 443
Next, let's examine whether the following request complies with CORS.
http://domain-a.com → not the same origin (schema is not the same)
https://domain-a.com/mike → same origin.
https://news.domain-a.com → not the same origin (domain is not the same)
https://domain-a.com:81 → not the same origin (port is different)
https://domain-b.com → not the same origin (domain is not the same)
By this point, you should have a basic understanding of what CORS is. Next section, I will show you how to solve it via Nginx.
Nginx Setting
Most of the time, we opt for Nginx as our web server due to its various advantages, like being lightweight, simple, and stable, among others. Now, considering our prior understanding of CORS rules, how does the server manage which requests can access its resources?
This is where the Access-Control-Request-*headers come into play. The server determines what to send back as Access-Control-Allow- headers. Based on these headers, the browser makes the call on which requests can bypass the CORS restriction.
Here is the simplest solution you can find on the internet.
location / {
    
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods 'GET, POST, PUT, DELETE, OPTIONS';
    add_header Access-Control-Allow-Headers 'DNT,X-Mx-ReqToken,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Authorization';

    # the remaning setthings
    ...
    ...
    ...
}
This is clearly not what we want. So, here's another solution that I've come up with recently. We employ a preflighted requests handler to communicate to the browser under which conditions we can permit CORS. Typically, the Access-Control-Allow-Origin header only accepts a single origin value. Unfortunately, for our side project, we need to allow multiple origins. To meet this requirement, we utilize the 'map' function in Nginx.
According to the Nginx website, the map function is used to create a new variable whose value depends on the values of one or more of the source variables specified in the first parameter. For more details, you can check this link.
map $http_origin $allowed_origin {
    ~*^http://localhost:3030$      $http_origin;
    ~*^http://localhost:4000$      $http_origin;
}

location / {

    # Preflighted requests
    if ($request_method = OPTIONS) {
        add_header "Access-Control-Allow-Origin" $allowed_origin;
        add_header "Access-Control-Allow-Methods" "GET, PUT, POST, DELETE, OPTIONS";
        add_header "Access-Control-Allow-Headers" "Authorization, Content-Type";
        add_header "Access-Control-Max-Age" 86400;
        return 204;
    }

    if ($allowed_origin != "") {
        add_header "Access-Control-Allow-Origin" $allowed_origin;
    }

    # the remaning setthings
    ...
    ...
    ...
}
By doing this, we can greenlight several origins to navigate through the CORS barrier when working locally. What's even more crucial is that we can seamlessly run multiple services and have them communicate with the API service via localhost with different ports using docker-compose.
Summary
In this post, we've presented a solution that enables multiple origins to navigate the CORS restriction in a more secure way. We'd like to reiterate the importance of avoiding the wildcard solution Access-Control-Allow-Origin * in your Nginx configuration, and hope this post is helpful to you, thanks.
Reference
https://medium.com/starbugs/弄懂同源政策-same-origin-policy-與跨網域-cors-e2e5c1a53a19

Annotation	What it does
@Mock	Creates a fake object (a mock)
@InjectMocks	Creates a real object and injects mocks into it

Term	Definition	Purpose	Example of Risk
SOP	Restricts cross-origin access	Prevents cross-site data theft	A script from `evil.com` tries to read cookies from `bank.com` (Blocked by SOP)
CSRF	Tricks users into making unintended requests	Prevents unauthorized actions	A victim clicks a malicious link that transfers money from their bank
CORS	Allows controlled cross-origin access	Enables APIs to be shared securely	`Access-Control-Allow-Origin: *` allows any site to make API requests (Security Risk!)

Python	Mockito
assert mock.called	verify(mock)
assert mock.call_count == 1	times(1)
assert not mock.called	never()

TLS Parameter	What It Reveals	How It's Used for Fingerprinting
TLS Version	e.g., `TLS 1.2` or `TLS 1.3`	Some bots use outdated versions
Cipher Suites	List of supported encryption algorithms	Bots often have limited options
Extensions	Features like ALPN, SNI, GREASE	Unique combinations per browser
Elliptic Curves	Supported key exchange methods	Unusual curves = suspicious
Signature Algorithms	Authentication methods used	Differ per OS/browser
Order of Fields	The sequence of cipher suites/extensions	Different per client type

Method	Result (seconds)	Data Validation
Dataclass serialization	0.05273 s	N
Django Serializer	0.443061 s	Y
Django ModelSerializer	0.423694 s	Y
Simple Function-based Serialization	0.032002 s	N
Pydantic Serialization	0.086857 s	Y