<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Taiker]]></title><description><![CDATA[If you're not making someone else's life better, then you're wasting your time.]]></description><link>https://blog.taiker.space/</link><image><url>https://blog.taiker.space/favicon.png</url><title>Taiker</title><link>https://blog.taiker.space/</link></image><generator>Ghost 5.57</generator><lastBuildDate>Fri, 17 Apr 2026 03:56:29 GMT</lastBuildDate><atom:link href="https://blog.taiker.space/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[First Spring Boot Application - Todo List (3)]]></title><description><![CDATA[<h1 id="background">Background</h1><p>In the third post of this series, I want to talk about unit testing, one of the most important parts of software development, in my opinion.</p><p>Creating a developer-friendly unit testing environment and writing high-quality test cases are both essential for building a robust and maintainable SaaS product.</p>]]></description><link>https://blog.taiker.space/first-spring-boot-application-todo-list-3/</link><guid isPermaLink="false">692e4bc4e18ac2500b9df274</guid><category><![CDATA[Web]]></category><category><![CDATA[Spring Boot]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Sun, 14 Dec 2025 08:14:31 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1763393434899-bd2837aab221?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8YWxsfDI5M3x8fHx8fHx8MTc2NDY0MTgxOHw&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img 
src="https://images.unsplash.com/photo-1763393434899-bd2837aab221?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8YWxsfDI5M3x8fHx8fHx8MTc2NDY0MTgxOHw&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="First Spring Boot Application - Todo List (3)"><p>In the third post of this series, I want to talk about unit testing, one of the most important parts of software development, in my opinion.</p><p>Creating a developer-friendly unit testing environment and writing high-quality test cases are both essential for building a robust and maintainable SaaS product. Over time, these practices help ensure your service remains reliable as it grows in complexity.</p><p>In this post, I&apos;ll not only cover how to write unit tests in Java Spring Boot, but also how to set up a clean and efficient testing environment that makes writing and running tests easier for the whole team.</p><h1 id="installation">Installation</h1><p>Make sure you have this dependency (Spring Boot usually adds it already):</p><pre><code class="language-XML">&lt;dependency&gt;
    &lt;groupId&gt;org.springframework.boot&lt;/groupId&gt;
    &lt;artifactId&gt;spring-boot-starter-test&lt;/artifactId&gt;
    &lt;scope&gt;test&lt;/scope&gt;
&lt;/dependency&gt;</code></pre><p>This includes:</p><ul><li>JUnit 5</li><li>Mockito</li><li>AssertJ</li><li>Spring Test</li></ul><p>Next, I want to introduce <code>H2</code>, a lightweight relational database written entirely in Java. It is a disposable, in-memory database that exists only for your tests. In Spring Boot, H2 lets you test <strong>real database behavior</strong> without the cost of running MySQL/Postgres.</p><pre><code class="language-XML">&lt;dependency&gt;
    &lt;groupId&gt;com.h2database&lt;/groupId&gt;
    &lt;artifactId&gt;h2&lt;/artifactId&gt;
    &lt;scope&gt;test&lt;/scope&gt;
&lt;/dependency&gt;</code></pre><p>We also need to add the following application settings for our testing environment.</p><figure class="kg-card kg-code-card"><pre><code class="language-YML">spring:
  datasource:
    url: jdbc:h2:mem:todolist-test;MODE=PostgreSQL
    driver-class-name: org.h2.Driver
    username: sa
    password:

  liquibase:
    enabled: false

  jpa:
    hibernate:
      ddl-auto: create-drop
    show-sql: false
    properties:
      hibernate:
        format_sql: false

logging:
  level:
    root: WARN

    # Spring framework noise
    org.springframework: WARN

    # Hibernate SQL + binding noise
    org.hibernate.SQL: WARN
    org.hibernate.type.descriptor.sql: WARN

    # JPA bootstrap logs
    org.springframework.orm.jpa: WARN
</code></pre><figcaption><b>src/test/resources/application-test.yml</b></figcaption></figure><p>You can change the <code>MODE</code> in the database URL to match your target database; I used PostgreSQL here. Remember to add <code>@ActiveProfiles(&quot;test&quot;)</code> to <code>TodolistApplicationTests.java</code> to tell Spring to use <code>application-test.yml</code> to override <code>application.yml</code>, so the testing configuration is loaded for unit tests.</p><figure class="kg-card kg-code-card"><pre><code class="language-JAVA">package com.example.todolist;

import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.ActiveProfiles;

@SpringBootTest
@ActiveProfiles(&quot;test&quot;)
class TodolistApplicationTests {

	@Test
	void contextLoads() {
	}

}
</code></pre><figcaption>TodolistApplicationTests.java</figcaption></figure><h1 id="implementation">Implementation</h1><p>In this section, we will walk through how to write unit tests in Java Spring Boot. In previous posts, we implemented the controller, service, and repository layers. Next, we will write unit tests for each of them. Here we go:</p><h3 id="repository">Repository</h3><figure class="kg-card kg-code-card"><pre><code class="language-JAVA">package com.example.todolist.repository;

import com.example.todolist.model.Todo;
import org.springframework.data.jpa.repository.JpaRepository;

public interface TodoRepository extends JpaRepository&lt;Todo, Long&gt; {}
</code></pre><figcaption>TodoRepository.java</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-JAVA">package com.example.todolist.model;

import jakarta.persistence.*;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

import java.time.Instant;

@Entity
@Table(name = &quot;todolist&quot;)
@Getter
@Setter
@NoArgsConstructor
public class Todo {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(nullable = false)
    private String title;

    private String description;

    @Column(nullable = false)
    private boolean completed = false;

    // UTC-safe creation timestamp
    @Column(nullable = false, updatable = false,
            columnDefinition = &quot;TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP&quot;)
    private Instant createdAt = Instant.now();

    // Custom constructor for creating new todos
    public Todo(String title, String description) {
        this.title = title;
        this.description = description;
    }

}</code></pre><figcaption>Todo.java</figcaption></figure><p>We need to define our test plans based on the Todo model and repository. We are testing:</p><ul><li>JPA entity mapping</li><li>Auto-generated ID</li><li>@Column(nullable = false) constraints</li><li>Default values (completed, createdAt)</li><li>CRUD behavior (save, find, update, delete)</li></ul><p>The full test code for the repository is:</p><pre><code class="language-JAVA">package com.example.todolist.repository;

import com.example.todolist.model.Todo;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.orm.jpa.DataJpaTest;
import org.springframework.test.context.ActiveProfiles;

import java.util.Optional;

import static org.assertj.core.api.Assertions.assertThat;

@DataJpaTest
@ActiveProfiles(&quot;test&quot;)
class TodoRepositoryTest {

    @Autowired
    private TodoRepository todoRepository;

    @Test
    void shouldSaveAndFindTodo() {
        // Arrange
        Todo todo = new Todo(&quot;Write tests&quot;, &quot;Learn H2 with Spring Boot&quot;);

        // Act
        Todo saved = todoRepository.save(todo);

        // Assert
        assertThat(saved.getId()).isNotNull();
        assertThat(saved.isCompleted()).isFalse();
        assertThat(saved.getCreatedAt()).isNotNull();

        Optional&lt;Todo&gt; found = todoRepository.findById(saved.getId());

        assertThat(found).isPresent();
        assertThat(found.get().getTitle()).isEqualTo(&quot;Write tests&quot;);
    }

    @Test
    void shouldUpdateTodo() {
        // Arrange
        Todo todo = todoRepository.save(new Todo(&quot;Old title&quot;, &quot;Old desc&quot;));

        // Act
        todo.setTitle(&quot;New title&quot;);
        todo.setCompleted(true);
        Todo updated = todoRepository.save(todo);

        // Assert
        assertThat(updated.getTitle()).isEqualTo(&quot;New title&quot;);
        assertThat(updated.isCompleted()).isTrue();
    }

    @Test
    void shouldDeleteTodo() {
        // Arrange
        Todo todo = todoRepository.save(new Todo(&quot;Delete me&quot;, &quot;Temp&quot;));

        // Act
        todoRepository.deleteById(todo.getId());

        // Assert
        assertThat(todoRepository.findById(todo.getId())).isEmpty();
    }

    @Test
    void shouldEnforceNotNullConstraint() {
        // Arrange
        Todo todo = new Todo();
        todo.setDescription(&quot;Missing title&quot;);

        // Act &amp; Assert
        org.junit.jupiter.api.Assertions.assertThrows(
            org.springframework.dao.DataIntegrityViolationException.class,
            () -&gt; {
                todoRepository.save(todo);
                todoRepository.flush();
            }
        );
    }
}</code></pre><p>The first thing I want to talk about is <code>package com.example.todolist.repository;</code>: why do both TodoRepositoryTest.java and TodoRepository.java declare <code>package com.example.todolist.repository;</code>? This brings up a very important concept in Java:</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Java does <b>not</b> care about folders &#x2014; it cares about <b>packages</b>. <b>Test classes should live in the same package as the code they test.</b></div></div><p>I usually use the AAA (Triple-A) pattern to write unit tests, no matter which programming language I&#x2019;m using. What is the AAA (Triple-A) concept? It&#x2019;s a test structure pattern, not a framework feature.</p><blockquote><strong>&#x201C;Set up &#x2192; Do the thing &#x2192; Check the result&#x201D;</strong></blockquote><p>There are three parts you need to complete in your unit test: </p><ol><li>Arrange</li><li>Act</li><li>Assert</li></ol><p>That&apos;s why this concept is called AAA (Triple-A). Remember, if Arrange, Act, and Assert cannot be clearly separated, reconsider the responsibility of the code being tested.</p><p>Now we can run the <code>mvn clean test</code> command to check that the tests work correctly.</p><pre><code class="language-TEXT">&#x279C; mvn clean test
...
...
...
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.352 s -- in com.example.todolist.TodolistApplicationTests
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  3.229 s
[INFO] Finished at: 2025-12-13T23:09:29+08:00
[INFO] ------------------------------------------------------------------------
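The AAA structure is independent of JUnit, Mockito, or any other framework. Here is a minimal, framework-free plain-Java sketch of the three phases; the `Counter` class is hypothetical, invented purely for illustration:

```java
// A minimal Arrange-Act-Assert sketch with no test framework.
// The Counter class is hypothetical, used only to illustrate the pattern.
public class AaaDemo {
    static class Counter {
        private int value;
        void increment() { value++; }
        int getValue() { return value; }
    }

    public static void main(String[] args) {
        // Arrange: build the object under test and its inputs
        Counter counter = new Counter();

        // Act: invoke exactly the behavior being verified
        counter.increment();
        counter.increment();

        // Assert: check the observable outcome
        if (counter.getValue() != 2) {
            throw new AssertionError("expected 2 but was " + counter.getValue());
        }
        System.out.println("AAA demo passed: value=2");
    }
}
```

If the three phases of a test cannot be labeled this cleanly, that is usually a sign the code under test is doing too much.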
</code></pre><p>OK, everything works well now. Next, we will complete the unit tests for the service and controller.</p><h3 id="service">Service</h3><figure class="kg-card kg-code-card"><pre><code class="language-JAVA">package com.example.todolist.service;

import com.example.todolist.model.Todo;
import com.example.todolist.repository.TodoRepository;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;

import java.util.List;
import java.util.Optional;

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.*;

@ExtendWith(MockitoExtension.class)
class TodoServiceTest {

    @Mock
    private TodoRepository todoRepository;

    @InjectMocks
    private TodoService todoService;

    @Test
    void getAllTodos_shouldReturnAllTodos() {
        // Arrange
        Todo todo1 = new Todo(&quot;Task 1&quot;, &quot;Desc 1&quot;);
        Todo todo2 = new Todo(&quot;Task 2&quot;, &quot;Desc 2&quot;);

        when(todoRepository.findAll()).thenReturn(List.of(todo1, todo2));

        // Act
        List&lt;Todo&gt; result = todoService.getAllTodos();

        // Assert
        assertThat(result).hasSize(2);
        assertThat(result).extracting(Todo::getTitle)
                .containsExactly(&quot;Task 1&quot;, &quot;Task 2&quot;);

        verify(todoRepository).findAll();
    }

    @Test
    void createTodo_shouldSaveAndReturnTodo() {
        // Arrange
        Todo todo = new Todo(&quot;New Task&quot;, &quot;New Desc&quot;);

        Todo savedTodo = new Todo(&quot;New Task&quot;, &quot;New Desc&quot;);
        savedTodo.setId(1L);

        when(todoRepository.save(any(Todo.class))).thenReturn(savedTodo);

        // Act
        Todo result = todoService.createTodo(todo);

        // Assert
        assertThat(result.getId()).isEqualTo(1L);
        assertThat(result.getTitle()).isEqualTo(&quot;New Task&quot;);
        assertThat(result.isCompleted()).isFalse();

        verify(todoRepository).save(todo);
    }

    @Test
    void updateTodo_shouldUpdateExistingTodo() {
        // Arrange
        Long todoId = 1L;

        Todo existing = new Todo(&quot;Old Title&quot;, &quot;Old Desc&quot;);
        existing.setId(todoId);

        Todo updated = new Todo(&quot;New Title&quot;, &quot;New Desc&quot;);
        updated.setCompleted(true);

        when(todoRepository.findById(todoId)).thenReturn(Optional.of(existing));
        when(todoRepository.save(any(Todo.class))).thenAnswer(invocation -&gt; invocation.getArgument(0));

        // Act
        Todo result = todoService.updateTodo(todoId, updated);

        // Assert
        assertThat(result.getTitle()).isEqualTo(&quot;New Title&quot;);
        assertThat(result.getDescription()).isEqualTo(&quot;New Desc&quot;);
        assertThat(result.isCompleted()).isTrue();

        verify(todoRepository).findById(todoId);
        verify(todoRepository).save(existing);
    }

    @Test
    void updateTodo_shouldThrowException_whenTodoNotFound() {
        // Arrange
        Long todoId = 99L;
        Todo updated = new Todo(&quot;Doesn&apos;t matter&quot;, &quot;Nope&quot;);

        when(todoRepository.findById(todoId)).thenReturn(Optional.empty());

        // Act + Assert
        assertThatThrownBy(() -&gt; todoService.updateTodo(todoId, updated))
                .isInstanceOf(RuntimeException.class)
                .hasMessage(&quot;Todo not found&quot;);

        verify(todoRepository).findById(todoId);
        verify(todoRepository, never()).save(any());
    }

    @Test
    void deleteTodo_shouldDeleteById() {
        // Arrange
        Long todoId = 1L;

        doNothing().when(todoRepository).deleteById(todoId);

        // Act
        todoService.deleteTodo(todoId);

        // Assert
        verify(todoRepository).deleteById(todoId);
    }
}</code></pre><figcaption>TodoServiceTest.java</figcaption></figure><p>First of all, the <code>@ExtendWith(MockitoExtension.class)</code> line tells JUnit 5 to enable Mockito support for this test class (required for @Mock / @InjectMocks).</p><p>What&apos;s the difference between <code>@Mock</code> and <code>@InjectMocks</code>?</p><!--kg-card-begin: html--><table><thead><tr><th>
<p class="p1"><strong>Annotation</strong></p>
</th><th>
<p class="p1"><strong>What it does</strong></p>
</th></tr></thead><tbody><tr><td>
<p class="p1">@Mock</p>
</td><td>
<p class="p1"><strong>Creates a fake object</strong><span class="s1"> (a mock)</span></p>
</td></tr><tr><td>
<p class="p1">@InjectMocks</p>
</td><td>
<p class="p1"><span class="s1"><strong>Creates a real object</strong></span> and injects mocks into it</p>
</td></tr></tbody></table><!--kg-card-end: html--><p>With @<strong>Mock</strong>, Mockito creates a fake implementation of TodoRepository; no real logic runs unless you explicitly stub it. Example:</p><pre><code>when(todoRepository.findAll())
    .thenReturn(List.of(todo1, todo2));</code></pre><p>With @<strong>InjectMocks</strong>, Mockito creates a real instance of TodoService, looks for fields annotated with @Mock, and injects them into the service. For example,</p><pre><code>@InjectMocks
TodoService todoService;</code></pre><p>is effectively equivalent to</p><pre><code>TodoService todoService = new TodoService(mockTodoRepository);</code></pre><p><code>verify()</code> means <strong>&quot;assert that this method was called on the mock&quot;</strong>. It is essential for void methods and side effects.</p><p>Compared to Django/Python:</p><!--kg-card-begin: html--><table><thead><tr><th>
<p class="p1"><strong>Python</strong></p>
</th><th>
<p class="p1"><strong>Mockito</strong></p>
</th></tr></thead><tbody><tr><td>
<p class="p1">assert mock.called</p>
</td><td>
<p class="p1">verify(mock)</p>
</td></tr><tr><td>
<p class="p1">assert mock.call_count == 1</p>
</td><td>
<p class="p1">times(1)</p>
</td></tr><tr><td>
<p class="p1">assert not mock.called</p>
</td><td>
<p class="p1">never()</p>
</td></tr></tbody></table><!--kg-card-end: html--><h3 id="controller">Controller</h3><p>Here is the unit test code for the controller:</p><pre><code class="language-Java">package com.example.todolist.controller;

import com.example.todolist.model.Todo;
import com.example.todolist.service.TodoService;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.WebMvcTest;
import org.springframework.boot.test.mock.mockito.MockBean;
import org.springframework.http.MediaType;
import org.springframework.test.web.servlet.MockMvc;

import java.util.List;

import static org.hamcrest.Matchers.hasSize;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.ArgumentMatchers.eq;
import static org.mockito.Mockito.doNothing;
import static org.mockito.Mockito.when;
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.*;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;

@WebMvcTest(TodoController.class)
class TodoControllerTest {

    @Autowired
    private MockMvc mockMvc;

    @MockBean
    private TodoService todoService;

    @Autowired
    private ObjectMapper objectMapper;

    @Test
    void getAllTodos_shouldReturnTodoList() throws Exception {
        // Arrange
        Todo todo1 = new Todo(&quot;Task 1&quot;, &quot;Desc 1&quot;);
        Todo todo2 = new Todo(&quot;Task 2&quot;, &quot;Desc 2&quot;);

        when(todoService.getAllTodos()).thenReturn(List.of(todo1, todo2));

        // Act &amp; Assert
        mockMvc.perform(get(&quot;/api/todos&quot;))
                .andExpect(status().isOk())
                .andExpect(jsonPath(&quot;$&quot;, hasSize(2)))
                .andExpect(jsonPath(&quot;$[0].title&quot;).value(&quot;Task 1&quot;))
                .andExpect(jsonPath(&quot;$[1].title&quot;).value(&quot;Task 2&quot;));
    }

    @Test
    void createTodo_shouldReturnCreatedTodo() throws Exception {
        // Arrange
        Todo request = new Todo(&quot;New Task&quot;, &quot;New Desc&quot;);

        Todo saved = new Todo(&quot;New Task&quot;, &quot;New Desc&quot;);
        saved.setId(1L);

        when(todoService.createTodo(any(Todo.class))).thenReturn(saved);

        // Act &amp; Assert
        mockMvc.perform(post(&quot;/api/todos&quot;)
                        .contentType(MediaType.APPLICATION_JSON)
                        .content(objectMapper.writeValueAsString(request)))
                .andExpect(status().isOk())
                .andExpect(jsonPath(&quot;$.id&quot;).value(1L))
                .andExpect(jsonPath(&quot;$.title&quot;).value(&quot;New Task&quot;))
                .andExpect(jsonPath(&quot;$.completed&quot;).value(false));
    }

    @Test
    void updateTodo_shouldReturnUpdatedTodo() throws Exception {
        // Arrange
        Long todoId = 1L;

        Todo updated = new Todo(&quot;Updated Task&quot;, &quot;Updated Desc&quot;);
        updated.setCompleted(true);
        updated.setId(todoId);

        when(todoService.updateTodo(eq(todoId), any(Todo.class)))
                .thenReturn(updated);

        // Act &amp; Assert
        mockMvc.perform(put(&quot;/api/todos/{id}&quot;, todoId)
                        .contentType(MediaType.APPLICATION_JSON)
                        .content(objectMapper.writeValueAsString(updated)))
                .andExpect(status().isOk())
                .andExpect(jsonPath(&quot;$.title&quot;).value(&quot;Updated Task&quot;))
                .andExpect(jsonPath(&quot;$.completed&quot;).value(true));
    }

    @Test
    void deleteTodo_shouldReturnOk() throws Exception {
        // Arrange
        Long todoId = 1L;
        doNothing().when(todoService).deleteTodo(todoId);

        // Act &amp; Assert
        mockMvc.perform(delete(&quot;/api/todos/{id}&quot;, todoId))
                .andExpect(status().isOk());
    }
}</code></pre><p>I want to explain a few common scenarios that are especially useful when writing unit tests.</p><p>The first one is:</p><pre><code class="language-Java">// Act &amp; Assert
mockMvc.perform(post(&quot;/api/todos&quot;)
                .contentType(MediaType.APPLICATION_JSON)
                .content(objectMapper.writeValueAsString(request)))
        .andExpect(status().isOk())
        .andExpect(jsonPath(&quot;$.id&quot;).value(1L))
        .andExpect(jsonPath(&quot;$.title&quot;).value(&quot;New Task&quot;))
        .andExpect(jsonPath(&quot;$.completed&quot;).value(false));</code></pre><p>The logic behind it is &quot;<strong>simulating an HTTP POST request to your controller and asserting the HTTP response</strong>&quot;:</p><ul><li>mockMvc = a fake HTTP client</li><li>.contentType(MediaType.APPLICATION_JSON) = sets the HTTP header</li><li>.content(objectMapper.writeValueAsString(request)) = sets the request body</li><li>The chain of <code>andExpect</code> calls forms the assert phase</li><li>.andExpect(jsonPath(&quot;$.id&quot;).value(1L)) = inspects the JSON response body</li><li>$ = the root object</li><li>$.id = the id field</li></ul><p>The next one is:</p><pre><code>doNothing().when(todoService).deleteTodo(todoId);
</code></pre><p>It means &quot;<strong>when deleteTodo(todoId) is called on this mock, do nothing</strong>&quot;; in other words, we mock the method to do nothing.</p><p>In this section, I showed you how to use JUnit and Mockito with the H2 in-memory database to complete the unit tests for the todo list application, and also shared the key concepts of writing good unit tests. In the next section, I will show you a couple of useful commands for developing unit tests.</p><h1 id="command">Command</h1><p>In this section, I want to show you a couple of commands that are useful when developing unit tests.</p><ul><li><strong>Run the whole test suite</strong></li></ul><pre><code class="language-Bash">$ mvn clean test</code></pre><ul><li><strong>Run one test class</strong></li></ul><pre><code class="language-Bash">$ mvn test -Dtest=TodoServiceTest</code></pre><ul><li><strong>Run multiple test classes</strong></li></ul><pre><code class="language-Bash">$ mvn test -Dtest=TodoServiceTest,TodoControllerTest</code></pre><ul><li><strong>Run a single test method</strong></li></ul><pre><code class="language-Bash">$ mvn test -Dtest=TodoServiceTest#updateTodo_shouldUpdateExistingTodo</code></pre><p>If you use VS Code to develop your Java Spring Boot project, you can install the <code>Extension Pack for Java</code>, which includes the following extensions:</p><ul><li>Language Support for Java</li><li>Debugger for Java</li><li>Maven for Java</li><li>Test Runner for Java </li></ul><p>And you can navigate to the testing panel in the left sidebar to trigger unit test runs via the UI. 
</p><figure class="kg-card kg-image-card"><img src="https://blog.taiker.space/content/images/2025/12/image.png" class="kg-image" alt="First Spring Boot Application - Todo List (3)" loading="lazy" width="1747" height="966" srcset="https://blog.taiker.space/content/images/size/w600/2025/12/image.png 600w, https://blog.taiker.space/content/images/size/w1000/2025/12/image.png 1000w, https://blog.taiker.space/content/images/size/w1600/2025/12/image.png 1600w, https://blog.taiker.space/content/images/2025/12/image.png 1747w" sizes="(min-width: 720px) 720px"></figure><h1 id="takeaway">Takeaway</h1><p>In this post, I walked you through how to write high-quality unit tests using the <strong>Arrange&#x2013;Act&#x2013;Assert (AAA)</strong> pattern, step by step. Unit testing is a critical part of the modern <strong>software development life cycle (SDLC)</strong>: it helps you validate behavior early, refactor with confidence, and keep regressions from slipping into production. By adopting AAA as a consistent testing guideline, you can make your tests easier to read and maintain, and ultimately make your software projects more reliable over time.</p>]]></content:encoded></item><item><title><![CDATA[First Spring Boot Application - Todo List (2)]]></title><description><![CDATA[<h1 id="background">Background</h1><p>Continuing from the previous post, where we built the database layer and Liquibase migrations, this article shows how to define REST APIs in our Spring Boot Todo application. 
We&apos;ll walk through creating controllers, connecting them to services and repositories, and exposing CRUD operations that interact cleanly with</p>]]></description><link>https://blog.taiker.space/lombok-and-boilerplate/</link><guid isPermaLink="false">6908b9c0e18ac2500b9ded96</guid><category><![CDATA[Java]]></category><category><![CDATA[Web]]></category><category><![CDATA[Spring Boot]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Sun, 30 Nov 2025 14:38:21 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1587300003388-59208cc962cb?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEzfHxkb2d8ZW58MHx8fHwxNzY0NTEzNjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1587300003388-59208cc962cb?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEzfHxkb2d8ZW58MHx8fHwxNzY0NTEzNjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="First Spring Boot Application - Todo List (2)"><p>Continuing from the previous post, where we built the database layer and Liquibase migrations, this article shows how to define REST APIs in our Spring Boot Todo application. We&apos;ll walk through creating controllers, connecting them to services and repositories, and exposing CRUD operations that interact cleanly with the database. </p><h1 id="implementation">Implementation</h1><p>Today, we&apos;ll structure our API using the classic Spring Boot layered architecture: Controller, Service, and Repository. Each layer focuses on a specific responsibility so the code stays clean, modular, and easy to maintain. This separation of concerns prevents business logic and persistence logic from mixing together, which greatly improves the scalability of the application.</p><!--kg-card-begin: html--><div style="display: flex; justify-content: center;">
  <div class="mermaid">
      flowchart TD
      A[Controller\nREST Endpoints] --&gt; B[Service\nBusiness Logic]
      B --&gt; C[Repository\nData Access Layer]
      C --&gt; D[Database\nPostgreSQL / Any RDBMS]
  </div>
</div><!--kg-card-end: html--><pre><code class="language-text">src/
 &#x2514;&#x2500;&#x2500; main/
     &#x251C;&#x2500;&#x2500; java/
     &#x2502;   &#x2514;&#x2500;&#x2500; com/example/todolist/
     &#x2502;       &#x251C;&#x2500;&#x2500; TodolistApplication.java
     &#x2502;       &#x251C;&#x2500;&#x2500; model/
     &#x2502;       &#x2502;   &#x2514;&#x2500;&#x2500; Todo.java
     &#x2502;       &#x251C;&#x2500;&#x2500; repository/
     &#x2502;       &#x2502;   &#x2514;&#x2500;&#x2500; TodoRepository.java
     &#x2502;       &#x251C;&#x2500;&#x2500; service/
     &#x2502;       &#x2502;   &#x2514;&#x2500;&#x2500; TodoService.java
     &#x2502;       &#x2514;&#x2500;&#x2500; controller/
     &#x2502;           &#x2514;&#x2500;&#x2500; TodoController.java
     &#x2514;&#x2500;&#x2500; resources/
         &#x251C;&#x2500;&#x2500; application.properties
         &#x2514;&#x2500;&#x2500; static/   (optional)</code></pre><h2 id="controller">Controller</h2><p>The controller layer is responsible for defining the API endpoints of the application. Its main purpose is to handle incoming HTTP requests and return the appropriate HTTP responses. In this layer, we design what APIs our service will expose and determine which HTTP methods (GET, POST, PUT, DELETE, etc.) should be used to follow RESTful conventions.</p><p>A controller also performs input validation and converts request payloads into Java objects when necessary. After validating the request, the controller delegates the business logic to the service layer. Once the operation completes, the controller prepares and returns a well-structured response, typically in JSON, back to the client.</p><pre><code class="language-Java">package com.example.todolist.controller;

import com.example.todolist.model.Todo;
import com.example.todolist.service.TodoService;
import org.springframework.web.bind.annotation.*;

import java.util.List;

@RestController
@RequestMapping(&quot;/api/todos&quot;)
public class TodoController {
    private final TodoService todoService;

    public TodoController(TodoService todoService) {
        this.todoService = todoService;
    }

    @GetMapping
    public List&lt;Todo&gt; getAllTodos() {
        return todoService.getAllTodos();
    }

    @PostMapping
    public Todo createTodo(@RequestBody Todo todo) {
        return todoService.createTodo(todo);
    }

    @PutMapping(&quot;/{id}&quot;)
    public Todo updateTodo(@PathVariable Long id, @RequestBody Todo todo) {
        return todoService.updateTodo(id, todo);
    }

    @DeleteMapping(&quot;/{id}&quot;)
    public void deleteTodo(@PathVariable Long id) {
        todoService.deleteTodo(id);
    }
}</code></pre><p>Here is our controller code; let&apos;s go through it.</p><pre><code class="language-Java">private final TodoService todoService;

public TodoController(TodoService todoService) {
    this.todoService = todoService;
}</code></pre><p>Spring Boot sees the <code>@RestController</code> annotation on the class and the constructor that requires a TodoService, and it automatically creates (or &#x201C;injects&#x201D;) a TodoService object and passes it into the constructor.</p><p>The remaining methods are simple GET, POST, PUT, and DELETE handlers, and they are very intuitive.</p><h2 id="service">Service</h2><p>Next, we move on to the service layer. As mentioned earlier, this is where we place the core business logic of the application. In the context of our Todo application, the service acts as the bridge between the controller and the repository.</p><p>Inside the service class, we inject the TodoRepository using Spring Boot&apos;s dependency injection mechanism. With the repository available, we implement the methods required by the controller - such as creating, updating, querying, and deleting todo items. The service ensures that the controller stays clean and focused on handling requests, while all business rules and orchestration happen here.</p><p>This separation not only keeps the codebase modular and easy to maintain, but also simplifies testing because business logic can be unit-tested independently from the web or database layers.</p><pre><code class="language-Java">package com.example.todolist.service;

import com.example.todolist.model.Todo;
import com.example.todolist.repository.TodoRepository;
import org.springframework.stereotype.Service;
import java.util.List;

@Service
public class TodoService {
    private final TodoRepository todoRepository;

    public TodoService(TodoRepository todoRepository) {
        this.todoRepository = todoRepository;
    }

    public List&lt;Todo&gt; getAllTodos() {
        return todoRepository.findAll();
    }

    public Todo createTodo(Todo todo) {
        return todoRepository.save(todo);
    }

    public Todo updateTodo(Long id, Todo updatedTodo) {
        return todoRepository.findById(id)
            .map(todo -&gt; {
                todo.setTitle(updatedTodo.getTitle());
                todo.setDescription(updatedTodo.getDescription());
                todo.setCompleted(updatedTodo.isCompleted());
                return todoRepository.save(todo);
            })
            .orElseThrow(() -&gt; new RuntimeException(&quot;Todo not found&quot;));
    }
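
    // Tip: throwing a bare RuntimeException here surfaces to the client as an
    // HTTP 500; a custom exception annotated with
    // @ResponseStatus(HttpStatus.NOT_FOUND) would report the missing todo as a
    // 404 instead.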

    public void deleteTodo(Long id) {
        todoRepository.deleteById(id);
    }
}</code></pre><p>After reviewing the code, some of you may wonder where methods like <code>setTitle</code>, <code>setDescription</code>, and <code>setCompleted</code> are actually implemented. If we go back to the model layer, we can see that the class is annotated with Lombok&apos;s <code>@Getter</code> and <code>@Setter</code>. These annotations automatically generate the getter and setter methods for every field at compile time, which is why you don&apos;t see them explicitly defined in the source code.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">In Java, the @ symbol is used for <b>annotations</b>. Annotations are just labels, but frameworks like Spring <b>scan annotations at startup</b> and add behavior based on them.</div></div><pre><code class="language-Java">@Entity
@Table(name = &quot;todolist&quot;)
@Getter
@Setter
@NoArgsConstructor
public class Todo {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(nullable = false)
    private String title;

    private String description;

    @Column(nullable = false)
    private boolean completed = false;

    // UTC-safe creation timestamp
    @Column(nullable = false, updatable = false,
            columnDefinition = &quot;TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP&quot;)
    private Instant createdAt = Instant.now();

    // Custom constructor for creating new todos
    public Todo(String title, String description) {
        this.title = title;
        this.description = description;
    }

}</code></pre><h2 id="repository">Repository</h2><p>Now, let&apos;s look at the repository layer. Unlike Python&apos;s Django ORM, where you typically define model-specific query functions manually, Spring Boot makes this part extremely lightweight. All you need to do is extend the <code>JpaRepository</code> interface, and Spring Data automatically generates the underlying implementation for you at runtime. </p><pre><code class="language-Java">package com.example.todolist.repository;

import com.example.todolist.model.Todo;
import org.springframework.data.jpa.repository.JpaRepository;

public interface TodoRepository extends JpaRepository&lt;Todo, Long&gt; {}</code></pre><p><code>JpaRepository</code> provides a full set of CRUD operations out of the box, including:</p><pre><code class="language-text">* save()
* findById()
* findAll()
* deleteById()
* count()</code></pre><p>This means you can focus on business logic without writing boilerplate database access code.</p><p>P.S. <code>&lt;Todo, Long&gt;</code> specifies two things:</p><ol><li>Todo = entity type: &quot;This repository works with the Todo table&quot;</li><li>Long = ID type: &quot;The primary key type is Long&quot;</li></ol><h2 id="put-everything-together">Put everything together</h2><p>Finally, we need to set up the entry point for the Spring Boot application. Once that is in place, I will show you how to use Maven to build and start the Spring Boot service from the command line.</p><pre><code class="language-Java">package com.example.todolist;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class TodolistApplication {

	public static void main(String[] args) {
		SpringApplication.run(TodolistApplication.class, args);
	}

}
</code></pre><p>You can start the service by running the following command from the project&apos;s root directory. Just make sure you export your <code>.env</code> variables beforehand. To make things even easier, I&apos;ve also prepared a Makefile so you can run these commands with simple shortcuts.</p><pre><code class="language-shell">mvn spring-boot:run</code></pre><pre><code class="language-Makefile"># Makefile for Liquibase Migration Management
# Provides Django-like commands for database migrations

# Load environment variables from .env file
-include .env
export
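
# Note: the leading dash in "-include" makes make skip .env silently if the
# file is missing, and the bare "export" passes every variable defined here
# down to child processes such as mvn.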

.PHONY: help run makemigration migrate migrate-one migrate-to showmigrations rollback rollback-preview fake-migrate fake-migrate-to fake-migrate-preview

# Default target
help:
	@echo &quot;=== Spring Boot Application Commands ===&quot;
	@echo &quot;&quot;
	@echo &quot;Application:&quot;
	@echo &quot;  make run                        - Start Spring Boot application (default port 8080)&quot;
	@echo &quot;  make run PORT=9090              - Start Spring Boot application on custom port&quot;
	@echo &quot;&quot;
	@echo &quot;=== Liquibase Migration Commands ===&quot;
	@echo &quot;&quot;
	@echo &quot;Migration Generation:&quot;
	@echo &quot;  make makemigration              - Generate new migration (auto-numbered)&quot;
	@echo &quot;  make makemigration NAME=example - Generate migration with custom name&quot;
	@echo &quot;&quot;
	@echo &quot;Migration Execution:&quot;
	@echo &quot;  make migrate                    - Apply all pending migrations&quot;
	@echo &quot;  make migrate-one                - Apply only the next pending migration&quot;
	@echo &quot;  make migrate-to NUM=0008        - Migrate to specific version&quot;
	@echo &quot;  make showmigrations             - Show migration status&quot;
	@echo &quot;&quot;
	@echo &quot;Rollback:&quot;
	@echo &quot;  make rollback COUNT=1           - Rollback N changesets (default: 1)&quot;
	@echo &quot;  make rollback-preview COUNT=1   - Preview rollback SQL&quot;
	@echo &quot;&quot;
	@echo &quot;Fake Migrations:&quot;
	@echo &quot;  make fake-migrate               - Mark all pending as executed&quot;
	@echo &quot;  make fake-migrate-to NUM=0008   - Mark up to version as executed&quot;
	@echo &quot;  make fake-migrate-preview       - Preview what would be marked&quot;
	@echo &quot;&quot;

# Variables
CHANGES_DIR := src/main/resources/db/changelog/changes
MVN := mvn
NAME ?= auto_generated
COUNT ?= 1
PORT ?=

# Application Commands
run:
ifdef PORT
	@echo &quot;Starting Spring Boot application on port $(PORT)...&quot;
	@$(MVN) spring-boot:run -Dspring-boot.run.arguments=&quot;--server.port=$(PORT)&quot;
else
	@echo &quot;Starting Spring Boot application on default port (8080)...&quot;
	@$(MVN) spring-boot:run
endif

# Auto-detect next migration number
LATEST_NUM := $(shell ls $(CHANGES_DIR) 2&gt;/dev/null | grep -E &apos;^[0-9]+&apos; | sed &apos;s/^0*//&apos; | sed &apos;s/[^0-9].*//&apos; | sort -n | tail -1)
ifeq ($(LATEST_NUM),)
	NEXT_NUM := 1
else
	NEXT_NUM := $(shell echo $$(($(LATEST_NUM) + 1)))
endif
NEXT_FORMATTED := $(shell printf &quot;%04d&quot; $(NEXT_NUM))

# Migration Generation
makemigration:
	@echo &quot;Generating migration $(NEXT_FORMATTED)_$(NAME).yaml...&quot;
	@if [ ! -d &quot;$(CHANGES_DIR)&quot; ]; then \
		echo &quot;Error: Directory $(CHANGES_DIR) does not exist&quot;; \
		exit 1; \
	fi
	@FILEPATH=&quot;$(CHANGES_DIR)/$(NEXT_FORMATTED)_$(NAME).yaml&quot;; \
	$(MVN) liquibase:diff -Dliquibase.diffChangeLogFile=$$FILEPATH; \
	if [ -f $$FILEPATH ]; then \
		echo &quot;&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;- changeSet:&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;    id: tag-$(NEXT_FORMATTED)&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;    author: taiker&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;    changes:&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;      - tagDatabase:&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;          tag: \&quot;$(NEXT_FORMATTED)\&quot;&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;&#x2713; Migration created: $$FILEPATH&quot;; \
		echo &quot;&#x2713; Tag $(NEXT_FORMATTED) added&quot;; \
	else \
		echo &quot;Error: Failed to generate migration file&quot;; \
		exit 1; \
	fi

# Migration Execution
migrate:
	@echo &quot;Applying all pending migrations...&quot;
	@$(MVN) liquibase:update

migrate-one:
	@echo &quot;Applying next pending migration...&quot;
	@$(MVN) liquibase:updateCount -Dliquibase.count=1

migrate-to:
	@if [ -z &quot;$(NUM)&quot; ]; then \
		echo &quot;Error: NUM parameter required. Usage: make migrate-to NUM=0008&quot;; \
		exit 1; \
	fi
	@echo &quot;Migrating to version $(NUM)...&quot;
	@$(MVN) liquibase:updateToTag -Dliquibase.toTag=$(NUM)

showmigrations:
	@echo &quot;Checking migration status...&quot;
	@$(MVN) liquibase:status

# Rollback
rollback:
	@echo &quot;Rolling back $(COUNT) changeset(s)...&quot;
	@$(MVN) liquibase:rollback -Dliquibase.rollbackCount=$(COUNT)

rollback-preview:
	@echo &quot;Previewing rollback of $(COUNT) changeset(s)...&quot;
	@$(MVN) liquibase:rollbackSQL -Dliquibase.rollbackCount=$(COUNT)
	@echo &quot;&quot;
	@echo &quot;Preview saved to: target/liquibase/migrate.sql&quot;

# Fake Migrations
fake-migrate:
	@echo &quot;Marking all pending migrations as executed (without running them)...&quot;
	@$(MVN) liquibase:changeLogSync

fake-migrate-to:
	@if [ -z &quot;$(NUM)&quot; ]; then \
		echo &quot;Error: NUM parameter required. Usage: make fake-migrate-to NUM=0008&quot;; \
		exit 1; \
	fi
	@echo &quot;Marking migrations up to $(NUM) as executed (without running them)...&quot;
	@$(MVN) liquibase:changeLogSyncToTag -Dliquibase.toTag=$(NUM)

fake-migrate-preview:
	@echo &quot;Previewing what would be marked as executed...&quot;
	@$(MVN) liquibase:changeLogSyncSQL
	@echo &quot;&quot;
	@echo &quot;Preview saved to: target/liquibase/migrate.sql&quot;
</code></pre><p>You can use <code>make run</code> to start the service; you will see that it has started and is listening on port 8080.</p><pre><code class="language-text">$ make run
Starting Spring Boot application...
...
...
2025-11-30T17:39:23.018+08:00  INFO 62999 --- [todolist] [           main] o.h.e.t.j.p.i.JtaPlatformInitiator       : HHH000489: No JTA platform available (set &apos;hibernate.transaction.jta.platform&apos; to enable JTA platform integration)
2025-11-30T17:39:23.019+08:00  INFO 62999 --- [todolist] [           main] j.LocalContainerEntityManagerFactoryBean : Initialized JPA EntityManagerFactory for persistence unit &apos;default&apos;
2025-11-30T17:39:23.078+08:00  WARN 62999 --- [todolist] [           main] JpaBaseConfiguration$JpaWebConfiguration : spring.jpa.open-in-view is enabled by default. Therefore, database queries may be performed during view rendering. Explicitly configure spring.jpa.open-in-view to disable this warning
2025-11-30T17:39:23.183+08:00  INFO 62999 --- [todolist] [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port 8080 (http) with context path &apos;/&apos;
2025-11-30T17:39:23.186+08:00  INFO 62999 --- [todolist] [           main] c.example.todolist.TodolistApplication   : Started TodolistApplication in 6.397 seconds (process running for 6.501)</code></pre><p>If you want the application to run on a different port, you can specify it when using the make run command, for example:</p><pre><code class="language-shell">make run PORT=9090</code></pre><p>Once the service is running, open your browser and navigate to:</p><pre><code class="language-text">http://localhost:8080/api/todos
or
http://localhost:9090/api/todos if you chose a custom port.</code></pre><p>Since the database is still empty, the API will return an empty list, which means the setup is working correctly.</p><figure class="kg-card kg-image-card"><img src="https://blog.taiker.space/content/images/2025/11/image-1.png" class="kg-image" alt="First Spring Boot Application - Todo List (2)" loading="lazy" width="691" height="215" srcset="https://blog.taiker.space/content/images/size/w600/2025/11/image-1.png 600w, https://blog.taiker.space/content/images/2025/11/image-1.png 691w"></figure><h1 id="deployment">Deployment</h1><p>If you want to deploy this application to another environment - such as a server, a VM, or a cloud instance - and run it like a real production service, you first need to package the project into a JAR file. To do that, run the following Maven command:</p><pre><code class="language-shell">export $(cat .env | xargs)
mvn clean package</code></pre><p>Spring Boot will create:</p><pre><code class="language-text">$ target/todolist-0.0.1-SNAPSHOT.jar
[INFO] --- jar:3.4.2:jar (default-jar) @ todolist ---
[INFO] Building jar: /Users/taiker/dev/todolist/target/todolist-0.0.1-SNAPSHOT.jar</code></pre><p>The version in your JAR name comes from your pom.xml:</p><pre><code class="language-text">&lt;version&gt;0.0.1-SNAPSHOT&lt;/version&gt;</code></pre><p>So the output JAR name will always follow the pattern:</p><pre><code class="language-text">&lt;artifactId&gt;-&lt;version&gt;.jar</code></pre><p>Then you can start the Spring Boot service using Java:</p><pre><code class="language-text">$ java -jar target/todolist-0.0.1-SNAPSHOT.jar
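# (export your .env variables in this shell first, e.g. export $(cat .env | xargs),
#  because the application reads its database settings from the environment)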

# or run on a different port

$ java -jar target/todolist-0.0.1-SNAPSHOT.jar --server.port=9090</code></pre><h1 id="takeaway">Takeaway</h1><p>In this post, we walked through how to build RESTful APIs in Spring Boot using the Controller, Service, and Repository layers to complete a basic CRUD workflow for our Todo application. We also explored multiple ways to start the service - with and without the Makefile - and covered how to build a JAR file and run the application like a real deployed service.</p><p>In the next post, we&apos;ll take the project one step further and focus on writing effective unit tests in Spring Boot, including how to test controllers, services, and repository behavior with clean, maintainable test patterns. </p>]]></content:encoded></item><item><title><![CDATA[First Spring Boot Application - Todo List (1)]]></title><description><![CDATA[<h1 id="background">Background</h1><p>After joining a new company, there are always many things to learn from scratch. Today, I want to walk you through how to build your first Java web application using Spring Boot. 
In this post, most of the code will be generated with the help of AI, and I</p>]]></description><link>https://blog.taiker.space/first-spring-boot-application-todo-list/</link><guid isPermaLink="false">6904c21de18ac2500b9debb4</guid><category><![CDATA[Java]]></category><category><![CDATA[Spring Boot]]></category><category><![CDATA[Web]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Mon, 24 Nov 2025 16:29:46 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1521017432531-fbd92d768814?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fGNhZmV8ZW58MHx8fHwxNzY0NTEzODEwfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1521017432531-fbd92d768814?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fGNhZmV8ZW58MHx8fHwxNzY0NTEzODEwfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="First Spring Boot Application - Todo List (1)"><p>After joining a new company, there are always many things to learn from scratch. Today, I want to walk you through how to build your first Java web application using Spring Boot. In this post, most of the code will be generated with the help of AI, and I will focus on assembling a complete and practical foundation for web application development - including environment configuration with .env, unit testing, ORM integration, and more.</p><p>This article will be the first in a series documenting how I collaborate with AI to build a fully functional to-do web application. In this initial post, we will focus specifically on managing database migrations in a Spring Boot application using Liquibase.</p><h1 id="initiation">Initiation </h1><p>At the beginning, we will use the <strong>Spring Initializr </strong>to generate the basic skeleton of the project. 
Go to &#x1F449; <a href="https://start.spring.io/?ref=blog.taiker.space">https://start.spring.io</a>. </p><p>Project Metadata Explained:</p><!--kg-card-begin: markdown--><ul>
<li>Group: A unique identifier for your organization or project
<ul>
<li>Common pattern: com.companyname.projectname</li>
</ul>
</li>
<li>Artifact: The name of your project&#x2019;s build artifact (the JAR or WAR file name).
<ul>
<li>If artifact = demo, your build will produce a file like: <code>demo-0.0.1-SNAPSHOT.jar</code></li>
</ul>
</li>
<li>Name: The display name of the project (for humans).</li>
<li>Description: A short explanation of your project&#x2019;s purpose. It appears in your generated pom.xml file for documentation.</li>
</ul>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.taiker.space/content/images/2025/10/image.png" class="kg-image" alt="First Spring Boot Application - Todo List (1)" loading="lazy" width="717" height="742" srcset="https://blog.taiker.space/content/images/size/w600/2025/10/image.png 600w, https://blog.taiker.space/content/images/2025/10/image.png 717w"></figure><p>Click <strong>Generate</strong>, unzip it, and open it in your IDE. Your project structure will look like:</p><pre><code class="language-plain text">&#x279C; tree
.
&#x251C;&#x2500;&#x2500; HELP.md
&#x251C;&#x2500;&#x2500; mvnw
&#x251C;&#x2500;&#x2500; mvnw.cmd
&#x251C;&#x2500;&#x2500; pom.xml
&#x251C;&#x2500;&#x2500; src
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; main
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; java
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; com
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; example
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;         &#x2514;&#x2500;&#x2500; todolist
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;             &#x2514;&#x2500;&#x2500; TodolistApplication.java
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; resources
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; application.properties
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; test
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; java
&#x2502;&#xA0;&#xA0;         &#x2514;&#x2500;&#x2500; com
&#x2502;&#xA0;&#xA0;             &#x2514;&#x2500;&#x2500; example
&#x2502;&#xA0;&#xA0;                 &#x2514;&#x2500;&#x2500; todolist
&#x2502;&#xA0;&#xA0;                     &#x2514;&#x2500;&#x2500; TodolistApplicationTests.java
&#x2514;&#x2500;&#x2500; target
    &#x251C;&#x2500;&#x2500; classes
    &#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; application.properties
    &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; com
    &#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; example
    &#x2502;&#xA0;&#xA0;         &#x2514;&#x2500;&#x2500; todolist
    &#x2502;&#xA0;&#xA0;             &#x2514;&#x2500;&#x2500; TodolistApplication.class
    &#x2514;&#x2500;&#x2500; test-classes
        &#x2514;&#x2500;&#x2500; com
            &#x2514;&#x2500;&#x2500; example
                &#x2514;&#x2500;&#x2500; todolist
                    &#x2514;&#x2500;&#x2500; TodolistApplicationTests.class</code></pre><h1 id="file-folder-explanation">File / Folder Explanation </h1><!--kg-card-begin: markdown--><ul>
<li>
<p>.mvn &#x2014; Maven Wrapper Folder</p>
<ul>
<li>This folder supports the Maven Wrapper, which lets you build the project without installing Maven globally.</li>
</ul>
</li>
<li>
<p>src &#x2014; Source Code Directory</p>
<ul>
<li>This is where all your code and resources live. <code>src/main/java</code> &#x2014; your actual application source code. That&#x2019;s where controllers, services, repositories, etc.</li>
</ul>
</li>
<li>
<p>target &#x2014; Build Output Directory</p>
<ul>
<li>This folder is automatically created by Maven when you build or run your project. It contains all the compiled code, packaged JAR/WAR files, and temporary build files.</li>
</ul>
</li>
<li>
<p>pom.xml</p>
<ul>
<li>pom.xml stands for Project Object Model file. It&#x2019;s the central configuration file that defines:
<ol>
<li>&#x1F4E6; Project structure and metadata</li>
<li>&#x1F517; Dependencies (libraries your app uses)</li>
<li>&#x2699;&#xFE0F; Build configuration (how your app compiles, packages, and runs)</li>
<li>&#x1F501; Plugins (extra tools like testing, packaging, or deployment automation)</li>
</ol>
</li>
</ul>
</li>
</ul>
<!--kg-card-end: markdown--><h1 id="implementation">Implementation</h1><p>Let&apos;s jump into the implementation. Before we write any code, we need to prepare two things: the project structure and the database dependencies. Below is what a clean Spring Boot folder layout looks like. Most of this is generated automatically by Spring Initializr, but we&apos;ll add our entity and config files manually.</p><pre><code class="language-plain text">src/
 &#x251C;&#x2500;&#x2500; main/
 &#x2502;   &#x251C;&#x2500;&#x2500; java/
 &#x2502;   &#x2502;   &#x2514;&#x2500;&#x2500; com/example/todolist/
 &#x2502;   &#x2502;        &#x251C;&#x2500;&#x2500; TodolistApplication.java       &#x2190; main entry point
 &#x2502;   &#x2502;        &#x251C;&#x2500;&#x2500; model/                         &#x2190; your JPA entity
 &#x2502;   &#x2502;        &#x2502;    &#x2514;&#x2500;&#x2500; Todo.java
 &#x2502;   &#x2514;&#x2500;&#x2500; resources/
&#x2502;        &#x251C;&#x2500;&#x2500; application.properties             &#x2190; DB + JPA config</code></pre><h3 id="add-the-some-dependencies"><strong>Add some dependencies</strong></h3><p>Before implementing the to-do list application, we need to set up the required dependencies and the database connection. In my local environment, I use a <code>.env</code> file to manage config values such as database connection info. Below is an example configuration:</p><blockquote>.env</blockquote><pre><code class="language-plain text">DB_HOST=localhost
DB_PORT=5432
DB_NAME=todolist
DB_USER=paul
DB_PASSWORD=dev</code></pre><blockquote>application.properties</blockquote><pre><code class="language-plain text">spring.application.name=todolist

spring.datasource.url=jdbc:postgresql://${DB_HOST}:${DB_PORT:5432}/${DB_NAME:demo}
spring.datasource.username=${DB_USER:default_user}
spring.datasource.password=${DB_PASSWORD:default_pass}
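
# The ${VAR:fallback} placeholder syntax resolves VAR from the environment and
# falls back to the value after the colon when VAR is not set.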

# Hibernate Settings
spring.jpa.hibernate.ddl-auto=none
spring.jpa.show-sql=true
spring.jpa.properties.hibernate.format_sql=true</code></pre><p>Let me take a moment to explain these Hibernate settings. Hibernate is an ORM (Object-Relational Mapping) framework; you can think of it as the bridge between your Java objects and a relational database. </p><ol><li><strong>spring.jpa.hibernate.ddl-auto=none</strong><br>This controls how Hibernate handles DDL (Data Definition Language) &#x2192; meaning tables, columns, schema generation. The setting accepts the following values:<br>&#x2003;- none: Do nothing. Hibernate will not create, update, validate, or drop your database schema.<br>&#x2003;- update: Automatically updates the schema &#x2192; adds columns, changes types (not recommended in production).<br>&#x2003;- create: Drops all tables and recreates them every time the app starts.<br>&#x2003;- create-drop: Like create, but drops the schema when the app stops.<br>&#x2003;- validate: Checks the entities against the database schema; fails on a mismatch but does not modify the schema.</li><li><strong>spring.jpa.show-sql=true</strong><br>This tells Hibernate to print SQL statements in the console/logs. Not recommended in production, because it can print sensitive data and cause log flooding.</li><li><strong>spring.jpa.properties.hibernate.format_sql=true</strong><br>Formats SQL logs to be pretty and readable. Works only when show-sql=true.</li></ol><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">The Jakarta Persistence API (JPA), formerly known as the Java Persistence API, is a Jakarta EE application programming interface specification that describes the management of relational data in enterprise Java applications. 
It provides a standard approach for object-relational mapping (ORM), simplifying database interactions by enabling developers to map Java objects to database tables.&#xA0;</div></div><p>Next, edit your pom.xml &#x2192; inside &lt;dependencies&gt; section, add:</p><pre><code class="language-xml">&lt;!-- Spring Boot Web --&gt;
&lt;dependency&gt;
    &lt;groupId&gt;org.springframework.boot&lt;/groupId&gt;
    &lt;artifactId&gt;spring-boot-starter-web&lt;/artifactId&gt;
&lt;/dependency&gt;

&lt;!-- Spring Data JPA --&gt;
&lt;dependency&gt;
    &lt;groupId&gt;org.springframework.boot&lt;/groupId&gt;
    &lt;artifactId&gt;spring-boot-starter-data-jpa&lt;/artifactId&gt;
&lt;/dependency&gt;

&lt;!-- PostgreSQL Driver --&gt;
&lt;dependency&gt;
    &lt;groupId&gt;org.postgresql&lt;/groupId&gt;
    &lt;artifactId&gt;postgresql&lt;/artifactId&gt;
    &lt;scope&gt;runtime&lt;/scope&gt;
&lt;/dependency&gt;

&lt;!-- Testing (already included but keep it) --&gt;
&lt;dependency&gt;
    &lt;groupId&gt;org.springframework.boot&lt;/groupId&gt;
    &lt;artifactId&gt;spring-boot-starter-test&lt;/artifactId&gt;
    &lt;scope&gt;test&lt;/scope&gt;
&lt;/dependency&gt;</code></pre><p>Then reload Maven. (Remember to export your .env variables to the system environment, because Spring Boot doesn&#x2019;t read .env automatically.)</p><h3 id="create-the-model">Create the Model</h3><p>To create the model in Spring Boot, I will introduce Lombok, which can help us quickly build the model class without implementing the getter/setter methods ourselves. </p><p>In pom.xml, inside &lt;dependencies&gt;, add:</p><pre><code class="language-xml">&lt;dependency&gt;
    &lt;groupId&gt;org.projectlombok&lt;/groupId&gt;
    &lt;artifactId&gt;lombok&lt;/artifactId&gt;
    &lt;version&gt;1.18.32&lt;/version&gt;
    &lt;scope&gt;provided&lt;/scope&gt;
&lt;/dependency&gt;</code></pre><blockquote> src/main/java/com/example/todolist/model/Todo.java</blockquote><pre><code class="language-Java">package com.example.todolist.model;

import jakarta.persistence.*;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

import java.time.Instant;

@Entity
@Table(name = &quot;todolist&quot;)
@Getter
@Setter
@NoArgsConstructor
public class Todo {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(nullable = false)
    private String title;

    private String description;

    @Column(nullable = false)
    private boolean completed = false;

    // UTC-safe creation timestamp
    @Column(nullable = false, updatable = false,
            columnDefinition = &quot;TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP&quot;)
    private Instant createdAt = Instant.now();

    // Custom constructor for creating new todos
    public Todo(String title, String description) {
        this.title = title;
        this.description = description;
    }

}</code></pre><p>We have defined the model schema for the <code>todolist</code> table. For the <code>createdAt</code> field, we use the <code>Instant</code> type rather than <code>LocalDateTime</code> to store the timestamp in UTC. This ensures that the creation time is recorded consistently and is automatically set to the current instant.</p><h3 id="migration-manager">Migration Manager</h3><p>In modern SaaS applications, database migration management is a critical part of the development workflow. Each release typically involves schema changes, and keeping track of these changes in version control is essential for maintainability and team collaboration.</p><p>Liquibase is an open-source database schema change management tool that helps development and DevOps teams track, version, and automate database migrations. In this post, we will walk through how to install Liquibase and use it to manage database changes in your Spring Boot application. </p><p>To get started, add the following sections to your <code>pom.xml</code> to include the Liquibase dependencies and Maven plugin:</p><pre><code class="language-xml">&lt;dependency&gt;
	&lt;groupId&gt;org.liquibase&lt;/groupId&gt;
	&lt;artifactId&gt;liquibase-core&lt;/artifactId&gt;
	&lt;version&gt;4.27.0&lt;/version&gt;
&lt;/dependency&gt;</code></pre><pre><code class="language-xml">&lt;plugin&gt;
    &lt;groupId&gt;org.liquibase&lt;/groupId&gt;
    &lt;artifactId&gt;liquibase-maven-plugin&lt;/artifactId&gt;
    &lt;version&gt;4.27.0&lt;/version&gt;

    &lt;configuration&gt;
        &lt;changeLogFile&gt;
            src/main/resources/db/changelog/db.changelog-master.yaml
        &lt;/changeLogFile&gt;

        &lt;url&gt;${env.DB_URL}&lt;/url&gt;
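        &lt;!-- Note: DB_URL must be a full JDBC URL. The earlier .env example only
             defines DB_HOST/DB_PORT/DB_NAME, so add something like
             DB_URL=jdbc:postgresql://localhost:5432/todolist to your .env. --&gt;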
        &lt;username&gt;${env.DB_USER}&lt;/username&gt;
        &lt;password&gt;${env.DB_PASSWORD}&lt;/password&gt;
        &lt;driver&gt;org.postgresql.Driver&lt;/driver&gt;

        &lt;referenceUrl&gt;
            hibernate:spring:com.example.todolist.model?dialect=org.hibernate.dialect.PostgreSQLDialect
        &lt;/referenceUrl&gt;
        &lt;referenceDriver&gt;
            liquibase.ext.hibernate.database.connection.HibernateDriver
        &lt;/referenceDriver&gt;
    &lt;/configuration&gt;

    &lt;dependencies&gt;

        &lt;!-- Hibernate integration for Liquibase --&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;org.liquibase.ext&lt;/groupId&gt;
            &lt;artifactId&gt;liquibase-hibernate6&lt;/artifactId&gt;
            &lt;version&gt;4.27.0&lt;/version&gt;
        &lt;/dependency&gt;

        &lt;!-- Spring ORM support --&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;org.springframework&lt;/groupId&gt;
            &lt;artifactId&gt;spring-orm&lt;/artifactId&gt;
            &lt;version&gt;6.1.3&lt;/version&gt;
        &lt;/dependency&gt;

        &lt;!-- Spring context (required for Hibernate integration) --&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;org.springframework&lt;/groupId&gt;
            &lt;artifactId&gt;spring-context&lt;/artifactId&gt;
            &lt;version&gt;6.1.3&lt;/version&gt;
        &lt;/dependency&gt;

        &lt;!-- PostgreSQL JDBC driver --&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;org.postgresql&lt;/groupId&gt;
            &lt;artifactId&gt;postgresql&lt;/artifactId&gt;
            &lt;version&gt;42.7.3&lt;/version&gt;
        &lt;/dependency&gt;

    &lt;/dependencies&gt;
&lt;/plugin&gt;</code></pre><p>You can use the following command to verify the installation:</p><pre><code class="language-shell">./mvnw liquibase:help</code></pre><p>Next, we will set up changelog generation for our new <code>todolist</code> model. Create <code>db.changelog-master.yaml</code> in <code>/src/main/resources/db/changelog</code>, and also create a <code>changes</code> folder inside <code>changelog</code>.</p><figure class="kg-card kg-code-card"><pre><code class="language-plain text">todolist/src/main/resources on &#xE0A0; main [&#x21E1;] is &#x1F4E6; 1 via &#x2615; v21.0.9
&#x279C; tree
.
&#x251C;&#x2500;&#x2500; application.properties
&#x2514;&#x2500;&#x2500; db
    &#x2514;&#x2500;&#x2500; changelog
        &#x251C;&#x2500;&#x2500; changes
        &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; 0001-changelog-init.yaml
        &#x2514;&#x2500;&#x2500; db.changelog-master.yaml</code></pre><figcaption><strong>Folder Structure of the resources</strong></figcaption></figure><blockquote>db.changelog-master.yaml</blockquote><pre><code>databaseChangeLog:
  - includeAll:
      path: db/changelog/changes/
      relativeToChangelogFile: false
</code></pre><p>Based on the settings in <code>db.changelog-master.yaml</code>, all changelog files will be stored in <strong><code>db/changelog/changes/</code></strong>. Next, we will use the following commands to generate our first init migration file and apply the update.</p><pre><code class="language-shell">export $(cat .env | xargs)
./mvnw liquibase:diff -Dliquibase.diffChangeLogFile=src/main/resources/db/changelog/changes/0001-init.yaml</code></pre><p>You will get the following change log:</p><pre><code class="language-yaml">databaseChangeLog:
- changeSet:
    id: 1763910259559-1
    author: taiker (generated)
    changes:
    - createTable:
        columns:
        - column:
            autoIncrement: true
            constraints:
              nullable: false
              primaryKey: true
              primaryKeyName: todolistPK
            name: id
            type: BIGINT
        - column:
            constraints:
              nullable: false
            name: completed
            type: BOOLEAN
        - column:
            constraints:
              nullable: false
            defaultValueComputed: CURRENT_TIMESTAMP
            name: createdAt
            type: TIMESTAMP WITH TIME ZONE
        - column:
            name: description
            type: VARCHAR(255)
        - column:
            constraints:
              nullable: false
            name: title
            type: VARCHAR(255)
        tableName: todolist
- changeSet:
    id: 1763910259559-2
    author: taiker (generated)
    changes:
    - dropTable:
        tableName: todos

</code></pre><p>Next we will use the update command to apply our database change log.</p><pre><code class="language-shell">export $(cat .env | xargs)
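# (optional) preview the SQL the update would run without applying it;
# the updateSQL goal writes the statements to target/liquibase/migrate.sql
./mvnw liquibase:updateSQL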
./mvnw liquibase:update</code></pre><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Additionally, Liquibase also provides a rollback mechanism; you can refer to the table below.</div></div><figure class="kg-card kg-image-card"><img src="https://blog.taiker.space/content/images/2025/11/image.png" class="kg-image" alt="First Spring Boot Application - Todo List (1)" loading="lazy" width="792" height="363" srcset="https://blog.taiker.space/content/images/size/w600/2025/11/image.png 600w, https://blog.taiker.space/content/images/2025/11/image.png 792w" sizes="(min-width: 720px) 720px"></figure><p>You can log in to Postgres to see the result:</p><pre><code>todolist=# \dt public.*
                List of relations
 Schema |         Name          | Type  | Owner
--------+-----------------------+-------+--------
 public | databasechangelog     | table | taiker
 public | databasechangeloglock | table | taiker
 public | todolist              | table | taiker
(3 rows)</code></pre><pre><code class="language-text">todolist=# \d public.todolist
                                     Table &quot;public.todolist&quot;
   Column    |           Type           | Collation | Nullable |             Default
-------------+--------------------------+-----------+----------+----------------------------------
 id          | bigint                   |           | not null | generated by default as identity
 completed   | boolean                  |           | not null |
 createdAt   | timestamp with time zone |           | not null | now()
 description | character varying(255)   |           |          |
 title       | character varying(255)   |           | not null |
Indexes:
    &quot;todolistPK&quot; PRIMARY KEY, btree (id)</code></pre><p>Now you can see that we have successfully created a new table called <code>todolist</code> in PostgreSQL by applying the changelog generated by Liquibase. This is the simplest example of using Liquibase during development. Next, we will add additional fields to the table through multiple changelogs to explore what else Liquibase can do.</p><p>The first change we want to make is renaming the <code>createdAt</code> column in the todolist table. According to common database conventions, column names should follow snake_case, so the column should be named <code>created_at</code> instead of <code>createdAt</code>. This is a perfect opportunity to create another Liquibase changelog to apply this update.</p><p>To avoid running into this naming mismatch again in the future, we will also update our Hibernate configuration. By enabling the <code>CamelCaseToUnderscoresNamingStrategy</code>, we can continue using camelCase field names in our Java entities while Hibernate maps them to snake_case columns in the database. This keeps our Java model clean and idiomatic while ensuring consistent database naming conventions.</p><blockquote>application.properties</blockquote><pre><code># Use snake_case for database column names
spring.jpa.hibernate.naming.physical-strategy=org.hibernate.boot.model.naming.CamelCaseToUnderscoresNamingStrategy</code></pre><p>Generate the following change log for renaming <code>createdAt</code> field, and apply it.</p><pre><code>databaseChangeLog:
  - changeSet:
      id: 0002-rename-createdAt-to-created_at
      author: taiker
      changes:
        - renameColumn:
            tableName: todolist
            oldColumnName: createdAt
            newColumnName: created_at
            columnDataType: TIMESTAMP WITH TIME ZONE
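      # Liquibase can auto-generate the inverse of renameColumn on rollback,
      # but an explicit rollback block could also be declared (illustrative):
      # rollback:
      #   - renameColumn:
      #       tableName: todolist
      #       oldColumnName: created_at
      #       newColumnName: createdAt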
</code></pre><p>Now that we have updated the table schema, we can try using the Liquibase rollback command. After running the rollback, you will see that the database returns to the previous schema exactly as expected.</p><pre><code class="language-shell">export $(cat .env | xargs)

# option 1
mvn liquibase:rollback -Dliquibase.rollbackCount=1

# option 2: roll back to a tag created with tagDatabase
# (illustrative; assumes a tag named &quot;0001&quot; exists)
mvn liquibase:rollback -Dliquibase.rollbackTag=0001</code></pre><pre><code class="language-text">todolist=# \d public.todolist
                                     Table &quot;public.todolist&quot;
   Column    |           Type           | Collation | Nullable |             Default
-------------+--------------------------+-----------+----------+----------------------------------
 id          | bigint                   |           | not null | generated by default as identity
 completed   | boolean                  |           | not null |
 createdAt   | timestamp with time zone |           | not null | now()
 description | character varying(255)   |           |          |
 title       | character varying(255)   |           | not null |
Indexes:
    &quot;todolistPK&quot; PRIMARY KEY, btree (id)</code></pre><h1 id="bonus">Bonus</h1><p>I use a Makefile to wrap the Maven Liquibase commands, making them much easier to run during development. Below is the Makefile for reference. You can run <code>make help</code> to view all supported commands.</p><pre><code class="language-text">&#x279C; make help
=== Liquibase Migration Commands ===

Migration Generation:
  make makemigration              - Generate new migration (auto-numbered)
  make makemigration NAME=example - Generate migration with custom name

Migration Execution:
  make migrate                    - Apply all pending migrations
  make migrate-one                - Apply only the next pending migration
  make migrate-to NUM=0008        - Migrate to specific version
  make showmigrations             - Show migration status

Rollback:
  make rollback COUNT=1           - Rollback N changesets (default: 1)
  make rollback-preview COUNT=1   - Preview rollback SQL

Fake Migrations:
  make fake-migrate               - Mark all pending as executed
  make fake-migrate-to NUM=0008   - Mark up to version as executed
  make fake-migrate-preview       - Preview what would be marked</code></pre><pre><code class="language-makefile"># Makefile for Liquibase Migration Management
# Provides Django-like commands for database migrations

# Load environment variables from .env file
-include .env
export

.PHONY: help makemigration migrate migrate-one migrate-to showmigrations rollback rollback-preview fake-migrate fake-migrate-to fake-migrate-preview

# Default target
help:
	@echo &quot;=== Liquibase Migration Commands ===&quot;
	@echo &quot;&quot;
	@echo &quot;Migration Generation:&quot;
	@echo &quot;  make makemigration              - Generate new migration (auto-numbered)&quot;
	@echo &quot;  make makemigration NAME=example - Generate migration with custom name&quot;
	@echo &quot;&quot;
	@echo &quot;Migration Execution:&quot;
	@echo &quot;  make migrate                    - Apply all pending migrations&quot;
	@echo &quot;  make migrate-one                - Apply only the next pending migration&quot;
	@echo &quot;  make migrate-to NUM=0008        - Migrate to specific version&quot;
	@echo &quot;  make showmigrations             - Show migration status&quot;
	@echo &quot;&quot;
	@echo &quot;Rollback:&quot;
	@echo &quot;  make rollback COUNT=1           - Rollback N changesets (default: 1)&quot;
	@echo &quot;  make rollback-preview COUNT=1   - Preview rollback SQL&quot;
	@echo &quot;&quot;
	@echo &quot;Fake Migrations:&quot;
	@echo &quot;  make fake-migrate               - Mark all pending as executed&quot;
	@echo &quot;  make fake-migrate-to NUM=0008   - Mark up to version as executed&quot;
	@echo &quot;  make fake-migrate-preview       - Preview what would be marked&quot;
	@echo &quot;&quot;

# Variables
CHANGES_DIR := src/main/resources/db/changelog/changes
MVN := mvn
NAME ?= auto_generated
COUNT ?= 1

# Auto-detect next migration number
LATEST_NUM := $(shell ls $(CHANGES_DIR) 2&gt;/dev/null | grep -E &apos;^[0-9]+&apos; | sed &apos;s/^0*//&apos; | sed &apos;s/[^0-9].*//&apos; | sort -n | tail -1)
ifeq ($(LATEST_NUM),)
	NEXT_NUM := 1
else
	NEXT_NUM := $(shell echo $$(($(LATEST_NUM) + 1)))
endif
NEXT_FORMATTED := $(shell printf &quot;%04d&quot; $(NEXT_NUM))

# Migration Generation
makemigration:
	@echo &quot;Generating migration $(NEXT_FORMATTED)_$(NAME).yaml...&quot;
	@if [ ! -d &quot;$(CHANGES_DIR)&quot; ]; then \
		echo &quot;Error: Directory $(CHANGES_DIR) does not exist&quot;; \
		exit 1; \
	fi
	@FILEPATH=&quot;$(CHANGES_DIR)/$(NEXT_FORMATTED)_$(NAME).yaml&quot;; \
	$(MVN) liquibase:diff -Dliquibase.diffChangeLogFile=$$FILEPATH; \
	if [ -f $$FILEPATH ]; then \
		echo &quot;&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;- changeSet:&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;    id: tag-$(NEXT_FORMATTED)&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;    author: taiker&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;    changes:&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;      - tagDatabase:&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;          tag: \&quot;$(NEXT_FORMATTED)\&quot;&quot; &gt;&gt; $$FILEPATH; \
		echo &quot;&#x2713; Migration created: $$FILEPATH&quot;; \
		echo &quot;&#x2713; Tag $(NEXT_FORMATTED) added&quot;; \
	else \
		echo &quot;Error: Failed to generate migration file&quot;; \
		exit 1; \
	fi

# Migration Execution
migrate:
	@echo &quot;Applying all pending migrations...&quot;
	@$(MVN) liquibase:update

migrate-one:
	@echo &quot;Applying next pending migration...&quot;
	@$(MVN) liquibase:updateCount -Dliquibase.count=1

migrate-to:
	@if [ -z &quot;$(NUM)&quot; ]; then \
		echo &quot;Error: NUM parameter required. Usage: make migrate-to NUM=0008&quot;; \
		exit 1; \
	fi
	@echo &quot;Migrating to version $(NUM)...&quot;
	@$(MVN) liquibase:updateToTag -Dliquibase.toTag=$(NUM)

showmigrations:
	@echo &quot;Checking migration status...&quot;
	@$(MVN) liquibase:status

# Rollback
rollback:
	@echo &quot;Rolling back $(COUNT) changeset(s)...&quot;
	@$(MVN) liquibase:rollback -Dliquibase.rollbackCount=$(COUNT)

rollback-preview:
	@echo &quot;Previewing rollback of $(COUNT) changeset(s)...&quot;
	@$(MVN) liquibase:rollbackSQL -Dliquibase.rollbackCount=$(COUNT)
	@echo &quot;&quot;
	@echo &quot;Preview saved to: target/liquibase/migrate.sql&quot;

# Fake Migrations
fake-migrate:
	@echo &quot;Marking all pending migrations as executed (without running them)...&quot;
	@$(MVN) liquibase:changeLogSync

fake-migrate-to:
	@if [ -z &quot;$(NUM)&quot; ]; then \
		echo &quot;Error: NUM parameter required. Usage: make fake-migrate-to NUM=0008&quot;; \
		exit 1; \
	fi
	@echo &quot;Marking migrations up to $(NUM) as executed (without running them)...&quot;
	@$(MVN) liquibase:changeLogSyncToTag -Dliquibase.toTag=$(NUM)

fake-migrate-preview:
	@echo &quot;Previewing what would be marked as executed...&quot;
	@$(MVN) liquibase:changeLogSyncSQL
	@echo &quot;&quot;
	@echo &quot;Preview saved to: target/liquibase/migrate.sql&quot;
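
# (extra, illustrative) preview the SQL for all pending migrations without applying them
migrate-preview:
	@echo &quot;Previewing SQL for pending migrations...&quot;
	@$(MVN) liquibase:updateSQL
	@echo &quot;&quot;
	@echo &quot;Preview saved to: target/liquibase/migrate.sql&quot;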
</code></pre><h1 id="takeaway">Takeaway</h1><p>In this post, I walked you through how to start a new Java project using the online Spring Initializr. Unlike many tutorials that build an entire to-do list application in one go, we spent most of our time focusing on how to use Liquibase to manage database schema changes through practical, real-world examples. Understanding database migrations early helps ensure your application remains maintainable as it grows.</p><p>In the next post, we will continue building the application and complete the remaining parts of the basic to-do list system using Spring Boot. If you&#x2019;re interested in building production-ready SaaS applications with Java and Spring Boot, stay tuned for the next article.</p>]]></content:encoded></item><item><title><![CDATA[[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level Requirement]]></title><description><![CDATA[<h1 id="background">Background</h1><p>In August, I embarked on a new journey by joining a new company, marking the start of another exciting venture. At the same time, I began exploring the world of data &#x2014; an area full of concepts and tools I had never worked with before. 
This blog is part</p>]]></description><link>https://blog.taiker.space/how-to-build-the-modern-sql-query-engine/</link><guid isPermaLink="false">68d793e3e18ac2500b9de9b5</guid><category><![CDATA[Data]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Mon, 29 Sep 2025 15:38:23 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1530541930197-ff16ac917b0e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDE0fHxjYW1wfGVufDB8fHx8MTc1OTA0NzY2Nnww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1530541930197-ff16ac917b0e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDE0fHxjYW1wfGVufDB8fHx8MTc1OTA0NzY2Nnww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level Requirement"><p>In August, I embarked on a new journey by joining a new company, marking the start of another exciting venture. At the same time, I began exploring the world of data &#x2014; an area full of concepts and tools I had never worked with before. This blog is part of my effort to document that learning process. In particular, I want to share how to build a small, local data solution &#x2014; complete with authentication &#x2014; that can run on your own local instance.</p><h1 id="prerequisites">Prerequisites</h1><p>Before jumping into the details, it&#x2019;s important to have a basic understanding of the three core services we&#x2019;ll be working with. Below, I&#x2019;ll provide a brief introduction to each of them.</p><h3 id="cloudbeaver">CloudBeaver</h3><p><a>CloudBeaver</a> is a lightweight, web-based database management tool. It allows you to connect to multiple databases through a browser interface, making it easy to explore data, run queries, and manage database objects without installing heavy desktop software. 
In our setup, CloudBeaver will serve as the main interface to interact with Trino.</p><h3 id="trino">Trino</h3><p><a>Trino</a> is a high&#x2011;performance, distributed SQL query engine designed to run interactive queries at scale. It can connect to a wide variety of data sources &#x2014; from relational databases like Postgres and MySQL to big data systems and even object storage. In our local setup, Trino serves as the <strong>query engine</strong> sitting between CloudBeaver (the UI) and the underlying data sources. This means CloudBeaver sends SQL queries to Trino, and Trino takes care of planning, executing, and returning the results.</p><h3 id="opa-open-policy-agent">OPA (Open Policy Agent)</h3><p><a>OPA</a> is a general-purpose policy engine that helps enforce fine-grained access control. It uses a declarative policy language called Rego to define what actions are allowed or denied. In our local data solution, OPA will integrate with Trino to ensure that only authorized users can run certain queries or access specific rows of data.</p><h1 id="arch">Arch</h1><p>Based on the image below, you can easily understand the roles and responsibilities of these three services. 
In the next section, we will walk through how to use these services step by step to build a modern distributed SQL query engine for processing large volumes of data.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2025/09/Generated-Image-September-29--2025---10_37PM.png" class="kg-image" alt="[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level Requirement" loading="lazy" width="1024" height="1024" srcset="https://blog.taiker.space/content/images/size/w600/2025/09/Generated-Image-September-29--2025---10_37PM.png 600w, https://blog.taiker.space/content/images/size/w1000/2025/09/Generated-Image-September-29--2025---10_37PM.png 1000w, https://blog.taiker.space/content/images/2025/09/Generated-Image-September-29--2025---10_37PM.png 1024w" sizes="(min-width: 720px) 720px"><figcaption>Overview of the Arch</figcaption></figure><h1 id="setup">Setup</h1><p>We use <strong>Docker Compose</strong> to orchestrate the three core services (Trino, CloudBeaver, OPA) for this demo project.</p><figure class="kg-card kg-code-card"><pre><code class="language-yaml">services:
  trino:
    image: trinodb/trino:latest
    container_name: trino
    ports:
      - &quot;8080:8080&quot;
    volumes:
      - ./trino/etc:/etc/trino
      - ./trino/data:/data/trino

  cloudbeaver:
    image: dbeaver/cloudbeaver:latest
    container_name: cloudbeaver
    ports:
      - &quot;8978:8978&quot;
    volumes:
      - ./cloudbeaver/workspace:/opt/cloudbeaver/workspace
    depends_on:
      - trino

  opa:
    image: openpolicyagent/opa:0.69.0
    container_name: opa
    ports:
      - &quot;8181:8181&quot;
    command: [&quot;run&quot;, &quot;--server&quot;, &quot;--addr&quot;, &quot;0.0.0.0:8181&quot;, &quot;--config-file&quot;, &quot;/config.yaml&quot;, &quot;--log-level&quot;, &quot;debug&quot;, &quot;/policy&quot;, &quot;/data&quot;]
    volumes:
      - ./opa/policies:/policy
      - ./opa/data:/data
      - ./opa/config.yaml:/config.yaml</code></pre><figcaption>docker-compose.yaml</figcaption></figure><h2 id="cloudbeaver-1">CloudBeaver</h2><p>CloudBeaver setup is the simplest part. All we need to do is create a folder that Docker Compose can mount, which will be used to store logs and authentication information.</p><pre><code class="language-plain text">$ mkdir -p cloudbeaver/workspace</code></pre><h2 id="trino-1">Trino</h2><p>Trino is the <strong>heart of the stack</strong>, so its setup is a bit more involved than the others. We keep everything under a <code>trino/</code>directory with two subfolders: <code>etc/</code> for configuration and <code>data/</code> for runtime state.</p><p><code><strong>trino/etc/</strong></code><strong> &#x2014; Configuration</strong><br>Mounted at <code>/etc/trino/</code> inside the container. Key files include:</p><ul><li><code>config.properties</code> &#x2013; Main server settings (coordinator role, memory limits, discovery URI)</li><li><code>node.properties</code> &#x2013; Node-specific configuration</li><li><code>jvm.config</code> &#x2013; JVM flags and memory options</li><li><code>log.properties</code> &#x2013; Logging levels</li><li><code>access-control.properties</code> &#x2013; Points Trino to OPA for policy decisions</li><li><code>password-authenticator.properties</code> &#x2013; Enables file-based password authentication</li><li><code>password.db</code> &#x2013; Local credential store</li><li><code>catalog/tpch.properties</code> &#x2013; TPCH sample dataset connector</li><li><code>catalog/memory.properties</code> &#x2013; In-memory connector for quick tests</li></ul><p><code><strong>trino/data/</strong></code><strong> &#x2014; Runtime Data</strong><br>Mounted at <code>/data/trino/</code> inside the container. 
Used for:</p><ul><li><code>var/log/</code> &#x2013; HTTP and query logs</li><li><code>var/run/launcher.pid</code> &#x2013; Process ID of the Trino server</li><li>Cache and temp files for query execution</li></ul><p>This directory ensures Trino can persist logs, state, and temp files across restarts.</p><h3 id="example-config-files">Example Config Files</h3><blockquote><strong>etc/access-control.properties</strong></blockquote><figure class="kg-card kg-code-card"><pre><code class="language-plain text">access-control.name=opa
opa.policy.uri=http://opa:8181/v1/data/authz/trino/allow
opa.policy.batched-uri=http://opa:8181/v1/data/authz/trino/batch_allow
opa.log-requests=true
opa.log-responses=true</code></pre><figcaption><strong>etc/access-control.properties</strong></figcaption></figure><blockquote><strong>etc/config.properties</strong></blockquote><figure class="kg-card kg-code-card"><pre><code class="language-plain text">coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=512MB
query.max-memory-per-node=256MB
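# (possible addition, not part of this demo config) to enforce the file-based
# password login over plain HTTP, Trino typically also needs:
# http-server.authentication.type=PASSWORD
# http-server.authentication.allow-insecure-over-http=true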
discovery-server.enabled=true
discovery.uri=http://localhost:8080</code></pre><figcaption><strong>etc/config.properties</strong></figcaption></figure><blockquote><strong>etc/jvm.config</strong></blockquote><figure class="kg-card kg-code-card"><pre><code class="language-plain text">-server
-Xmx1G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError</code></pre><figcaption><strong>etc/jvm.config</strong></figcaption></figure><blockquote><strong>etc/log.properties</strong></blockquote><figure class="kg-card kg-code-card"><pre><code class="language-plain text">io.trino.plugin.opa.OpaHttpClient=DEBUG</code></pre><figcaption><strong>etc/log.properties</strong></figcaption></figure><blockquote><strong>etc/node.properties</strong></blockquote><figure class="kg-card kg-code-card"><pre><code class="language-plain text">node.environment=dev
node.id=1
node.data-dir=/data/trino</code></pre><figcaption><strong>etc/node.properties</strong></figcaption></figure><blockquote><strong>etc/password-authenticator.properties</strong></blockquote><figure class="kg-card kg-code-card"><pre><code class="language-plain text">password-authenticator.name=file
file.password-file=/etc/trino/password.db</code></pre><figcaption><strong>etc/password-authenticator.properties</strong></figcaption></figure><blockquote><strong>etc/password.db</strong></blockquote><figure class="kg-card kg-code-card"><pre><code class="language-plain text">alice:password123
bob:password123</code></pre><figcaption><strong>etc/password.db</strong></figcaption></figure><p><strong>Next, let&#x2019;s introduce the concept of catalogs in Trino.</strong></p><p>In Trino, a <em>catalog</em> is a named configuration that connects to a specific data source. Each catalog:</p><ul><li>Points to a different data source (e.g., a database, file system, etc.)</li><li>Uses a specific connector to communicate with that data source</li><li>Has its own configuration settings</li><li>Appears as a separate &#x201C;database&#x201D; when queried via SQL</li></ul><p>In this demo, we&#x2019;ll use two catalog configurations. Here&#x2019;s the first:</p><hr><h3 id="1-tpchproperties-%E2%80%94-tpch-connector"><strong>1. tpch.properties &#x2014; TPCH Connector</strong></h3><ul><li><strong>Purpose</strong>: Provides sample datasets for testing and demonstration purposes</li><li><strong>Connector</strong>: TPCH (Transaction Processing Performance Council)</li><li><strong>Data</strong>: Dynamically generates standard benchmark tables such as customer, orders, lineitem, and more</li><li><strong>Configuration</strong>: <code>splits-per-node=4</code> &#x2014; Controls the level of parallelism used during data generation</li></ul><blockquote>etc/catalog/tpch.properties</blockquote><figure class="kg-card kg-code-card"><pre><code class="language-plain text">connector.name=tpch
tpch.splits-per-node=4</code></pre><figcaption>etc/catalog/tpch.properties</figcaption></figure><h3 id="2-memorypropertiesmemory-connector">2. memory.properties - Memory Connector</h3><ul><li><strong>Purpose</strong>: Creates in-memory tables for temporary data storage</li><li><strong>Connector</strong>: Memory connector for ephemeral data</li><li><strong>Use case</strong>: Temporary tables, testing, or data processing pipelines</li></ul><blockquote>etc/catalog/memory.properties</blockquote><figure class="kg-card kg-code-card"><pre><code class="language-plain text">connector.name=memory</code></pre><figcaption>etc/catalog/memory.properties</figcaption></figure><h2 id="opa">OPA</h2><p>Next, we configure OPA. Create an <code>opa/policies/</code> folder and add a <code>trino.rego</code> file inside it to define authorization rules.</p><figure class="kg-card kg-code-card"><pre><code class="language-rego">package authz.trino

import rego.v1

default allow := false

debug_info := {
    &quot;user&quot;: input.context.identity.user,
    &quot;operation&quot;: input.action.operation,
    &quot;full_input&quot;: input,
}

# Allow Alice to perform any operation
allow if {
    input.context.identity.user == &quot;alice&quot;
}
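
# (illustrative, not used in this demo) instead of all-or-nothing users, a
# read-only user could be matched on the operation name, e.g.:
#
#   allow if {
#       input.context.identity.user == &quot;carol&quot;
#       input.action.operation in {&quot;ExecuteQuery&quot;, &quot;AccessCatalog&quot;, &quot;SelectFromColumns&quot;}
#   }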

# Explicitly deny Bob: the literal false below means this rule can never grant access
allow if {
    input.context.identity.user == &quot;bob&quot;
    false
}

# Handle batch operations with filterResources
batch_allow := result if {
    input.action.filterResources
    result := [i |
        input.action.filterResources[i]
        allow
    ]
}

# Handle other types of batch requests
batch_allow := result if {
    input.batch
    result := [allow | input.batch[_]]
}</code></pre><figcaption>opa/policies/trino.rego</figcaption></figure><p><strong>Explanation of Rules</strong></p><ul><li><code>default allow := false</code> &#x2013; Deny by default</li><li><code>debug_info</code> &#x2013; Collects context for debugging</li><li>Alice &#x2013; Full access</li><li>Bob &#x2013; Always denied</li><li><code>batch_allow</code> &#x2013; Applies the rules to batch requests</li></ul><p><strong>User Story</strong></p><ol><li><strong>Alice</strong> &#x2013; Can perform any operation, always allowed.</li><li><strong>Bob</strong> &#x2013; Always denied.</li><li><strong>Other users</strong> &#x2013; Denied by default.</li></ol><p>We also need to set up the <code>config.yaml</code> for OPA.</p><figure class="kg-card kg-code-card"><pre><code class="language-yaml">services:
  authz:
    resource: data.authz

decision_logs:
  console: true
  reporting:
    min_delay_seconds: 1
    max_delay_seconds: 1</code></pre><figcaption>opa/config.yaml</figcaption></figure><p>When OPA evaluates an input, it will look at the policy path <code>data.authz</code> (from your <code>.rego</code> files) to determine allow/deny or any other decision output.</p><p>The <code>decision_logs</code> section controls how OPA records and outputs <strong>decision logs</strong> (an audit trail of all policy decisions).</p><h1 id="validation">Validation</h1><p>In this section, we will run all services using Docker Compose and validate that everything works correctly. First, please make sure the folder structure matches the following:</p><pre><code class="language-plain text">.
&#x251C;&#x2500;&#x2500; cloudbeaver
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; workspace
&#x251C;&#x2500;&#x2500; docker-compose.yaml
&#x251C;&#x2500;&#x2500; opa
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; config.yaml
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; policies
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; trino.rego
&#x2514;&#x2500;&#x2500; trino
    &#x251C;&#x2500;&#x2500; data
    &#x2514;&#x2500;&#x2500; etc
        &#x251C;&#x2500;&#x2500; access-control.properties
        &#x251C;&#x2500;&#x2500; catalog
        &#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; memory.properties
        &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; tpch.properties
        &#x251C;&#x2500;&#x2500; config.properties
        &#x251C;&#x2500;&#x2500; jvm.config
        &#x251C;&#x2500;&#x2500; log.properties
        &#x251C;&#x2500;&#x2500; node.properties
        &#x251C;&#x2500;&#x2500; password-authenticator.properties
        &#x2514;&#x2500;&#x2500; password.db</code></pre><p>Next, use <code>docker-compose up --build</code> to start the services. If everything goes well, you can open your browser and navigate to CloudBeaver at <code>http://localhost:8978</code>. Follow the on-screen instructions to set up your account and password.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2025/09/image.png" class="kg-image" alt="[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level Requirement" loading="lazy" width="2000" height="802" srcset="https://blog.taiker.space/content/images/size/w600/2025/09/image.png 600w, https://blog.taiker.space/content/images/size/w1000/2025/09/image.png 1000w, https://blog.taiker.space/content/images/size/w1600/2025/09/image.png 1600w, https://blog.taiker.space/content/images/size/w2400/2025/09/image.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Follow the instructions to set up the login info.</figcaption></figure><p>After setting up your account and password, let&#x2019;s go back to our Rego code. Remember we have two users: Alice and Bob. Alice has full permissions for all operations, whereas Bob has no permissions at all. 
Next, we&#x2019;ll test whether the OPA service is working correctly by creating these users in CloudBeaver.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2025/09/image-3.png" class="kg-image" alt="[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level Requirement" loading="lazy" width="2000" height="996" srcset="https://blog.taiker.space/content/images/size/w600/2025/09/image-3.png 600w, https://blog.taiker.space/content/images/size/w1000/2025/09/image-3.png 1000w, https://blog.taiker.space/content/images/size/w1600/2025/09/image-3.png 1600w, https://blog.taiker.space/content/images/2025/09/image-3.png 2226w" sizes="(min-width: 720px) 720px"><figcaption>Create the users Alice and Bob in CloudBeaver.</figcaption></figure><p>Before setting up the connection, we should describe how authentication normally works across these services:</p><ul><li><strong>CloudBeaver</strong>: by default manages its own users (in its internal DB or LDAP).</li><li><strong>Trino</strong>: receives a username when a connection is opened. This is usually configured in the CloudBeaver connection settings (Authentication &#x2192; Username/Password or JWT).</li><li><strong>OPA</strong>: only sees what Trino sends in <code>input.context.identity.user</code>.</li></ul><p>Since we&apos;re using the Community Edition of CloudBeaver, we can&apos;t automatically pass user information to Trino. As a result, each CloudBeaver user must configure their own connection using their Trino username and password (or token). </p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Bonus Tip: If you don&apos;t want to set up a connection for each user in Community Edition, you can enable Trino&apos;s proxy user feature. 
However, this topic is not covered in this blog post.&#xA0;</div></div><p>Use the following URL: <code>jdbc:trino://trino:8080/tpch/tiny</code>. Since we are using Docker Compose to start the services, CloudBeaver cannot connect to Trino using localhost&#x2014;it must reference the service name defined in the Docker network (in this case, trino).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2025/09/image-1.png" class="kg-image" alt="[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level Requirement" loading="lazy" width="2000" height="1303" srcset="https://blog.taiker.space/content/images/size/w600/2025/09/image-1.png 600w, https://blog.taiker.space/content/images/size/w1000/2025/09/image-1.png 1000w, https://blog.taiker.space/content/images/size/w1600/2025/09/image-1.png 1600w, https://blog.taiker.space/content/images/2025/09/image-1.png 2364w" sizes="(min-width: 720px) 720px"><figcaption>Select the Trino as our connection type.</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2025/09/image-6.png" class="kg-image" alt="[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level Requirement" loading="lazy" width="2000" height="987" srcset="https://blog.taiker.space/content/images/size/w600/2025/09/image-6.png 600w, https://blog.taiker.space/content/images/size/w1000/2025/09/image-6.png 1000w, https://blog.taiker.space/content/images/size/w1600/2025/09/image-6.png 1600w, https://blog.taiker.space/content/images/size/w2400/2025/09/image-6.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Set up the connection for Alice</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2025/09/image-7.png" class="kg-image" alt="[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level 
Requirement" loading="lazy" width="2000" height="943" srcset="https://blog.taiker.space/content/images/size/w600/2025/09/image-7.png 600w, https://blog.taiker.space/content/images/size/w1000/2025/09/image-7.png 1000w, https://blog.taiker.space/content/images/size/w1600/2025/09/image-7.png 1600w, https://blog.taiker.space/content/images/size/w2400/2025/09/image-7.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Set up the connection for Bob</figcaption></figure><p>Next, we can observe the differences between the two connections. For Alice, you can view all the settings we configured in the Trino environment, including the <code>memory</code> and <code>tpch</code> catalogs. However, for Bob, you won&apos;t see anything because all operations are rejected by the OPA service.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2025/09/image-8.png" class="kg-image" alt="[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level Requirement" loading="lazy" width="1316" height="888" srcset="https://blog.taiker.space/content/images/size/w600/2025/09/image-8.png 600w, https://blog.taiker.space/content/images/size/w1000/2025/09/image-8.png 1000w, https://blog.taiker.space/content/images/2025/09/image-8.png 1316w" sizes="(min-width: 720px) 720px"><figcaption>The difference between Alice and Bob</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2025/09/image-9.png" class="kg-image" alt="[Data] How to Build the Modern Distributed SQL Query Engine for Business-Level Requirement" loading="lazy" width="2000" height="852" srcset="https://blog.taiker.space/content/images/size/w600/2025/09/image-9.png 600w, https://blog.taiker.space/content/images/size/w1000/2025/09/image-9.png 1000w, https://blog.taiker.space/content/images/size/w1600/2025/09/image-9.png 1600w, 
https://blog.taiker.space/content/images/size/w2400/2025/09/image-9.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Bob can&apos;t perform any operation through the connection</figcaption></figure><p>Perfect &#x2014; everything is working as expected (poor Bob &#x1F605;). If you run into any issues, feel free to check out my <a href="https://github.com/TaikerLiang/blog-opa-demo/tree/bk_20251002?ref=blog.taiker.space">GitHub repository</a> for the complete project and setup instructions.</p><h1 id="takeaway">Takeaway</h1><p>In this post, we demonstrated how to build a modern distributed SQL query engine with a user-friendly UI and an authorization solution that meets business-level data requirements &#x2014; not just for handling large datasets, but also for enforcing data access control. </p>]]></content:encoded></item><item><title><![CDATA[Recap of bypassing anti-bot mechanism]]></title><description><![CDATA[<h1 id="background">Background</h1><p>I have been with GoFreight for 5 years. During the first two or three years, I was assigned to develop the crawler system for the tracking service. In the beginning, everything went well, and we were able to crawl the information we needed from carrier websites. 
However, as time</p>]]></description><link>https://blog.taiker.space/recap-of-bypassing-anti-bot-mechanism/</link><guid isPermaLink="false">67e80011e18ac2500b9de769</guid><category><![CDATA[Web]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Sat, 19 Apr 2025 16:49:45 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1552381810-0e4a585b44ca?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDV8fGJ5cGFzc3xlbnwwfHx8fDE3NDMyNTc2NDB8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1552381810-0e4a585b44ca?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDV8fGJ5cGFzc3xlbnwwfHx8fDE3NDMyNTc2NDB8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Recap of bypassing anti-bot mechanism"><p>I have been with GoFreight for 5 years. During the first two or three years, I was assigned to develop the crawler system for the tracking service. In the beginning, everything went well, and we were able to crawl the information we needed from carrier websites. However, as time passed, more and more carriers implemented anti-bot solutions on their websites, and we started encountering issues. We had to deal with various challenges such as CAPTCHAs, Google reCAPTCHA, CDN protections, etc. This required increasing amounts of time to bypass these detections; otherwise, we couldn&#x2019;t meet our service SLA. In this blog, I want to take some time to recap the challenges we faced during this period, before we started retrieving data directly from our vendors or carriers.</p><h1 id="tls-handshake">TLS Handshake</h1><p>First, I will briefly introduce the TLS Handshake. Transport Layer Security (TLS) is a widely adopted security protocol designed to ensure privacy and data security for communications over the internet. 
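</p><p>Python&apos;s standard <code>ssl</code> module exposes the client-side parameters that would go into such a handshake. Here is a minimal sketch; the exact version bounds and cipher list depend on your Python and OpenSSL build:</p><pre><code class="language-python">import ssl

# Build the default client context and list what it would offer in a handshake.
ctx = ssl.create_default_context()
print(ctx.minimum_version, ctx.maximum_version)  # e.g. TLSVersion.TLSv1_2 TLSVersion.TLSv1_3
for cipher in ctx.get_ciphers()[:5]:
    print(cipher['name'])</code></pre><p>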
One of its primary use cases is encrypting the communication between web applications and servers. In the following figure, we can see how the client and server establish a secure connection before sending or receiving data, and this process is called the TLS Handshake. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2025/04/tls-ssl-handshake.png" class="kg-image" alt="Recap of bypassing anti-bot mechanism" loading="lazy" width="2000" height="1143" srcset="https://blog.taiker.space/content/images/size/w600/2025/04/tls-ssl-handshake.png 600w, https://blog.taiker.space/content/images/size/w1000/2025/04/tls-ssl-handshake.png 1000w, https://blog.taiker.space/content/images/size/w1600/2025/04/tls-ssl-handshake.png 1600w, https://blog.taiker.space/content/images/size/w2400/2025/04/tls-ssl-handshake.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>src: https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/</figcaption></figure><h1 id="fingerprint">Fingerprint</h1><p>Next, I will briefly explain what a fingerprint is with a simple example. A fingerprint is a unique identifier generated from a combination of device, browser, and network characteristics. It&#x2019;s used to recognize users or bots, even when:</p><ul><li>Cookies are disabled</li><li>IP addresses change</li><li>Users switch to incognito or private mode</li></ul><p>Think of it as a digital ID for your device/browser session. This technique is widely used by MarTech and CDN companies for various purposes, including improving ad transfer rates and detecting bots. In this example, we&#x2019;ll use FingerprintJS to help demonstrate how it works.</p><pre><code class="language-html">&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;title&gt;Bot Detection with FingerprintJS v4&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;h1&gt;Bot Detection Demo&lt;/h1&gt;
    &lt;p&gt;Click the button below to check if you&apos;re a bot!&lt;/p&gt;
    &lt;button id=&quot;checkBtn&quot;&gt;Check Me&lt;/button&gt;
    &lt;p id=&quot;result&quot;&gt;&lt;/p&gt;

    &lt;!-- Load FingerprintJS as a module --&gt;
    &lt;script type=&quot;module&quot;&gt;
        import FingerprintJS from &apos;https://openfpcdn.io/fingerprintjs/v4&apos;;

        async function checkFingerprint() {
            const fp = await FingerprintJS.load();
            const result = await fp.get({ extendedResult: true });

            console.log(&quot;Fingerprint ID:&quot;, result.visitorId);
            console.log(&quot;Detailed Components:&quot;, result.components);
            console.log(&quot;Confidence Score:&quot;, result.confidence.score);
        }


        // Attach event listener after DOM loads
        document.getElementById(&quot;checkBtn&quot;).addEventListener(&quot;click&quot;, checkFingerprint);
    &lt;/script&gt;
&lt;/body&gt;
&lt;/html&gt;</code></pre><figure class="kg-card kg-image-card"><img src="https://blog.taiker.space/content/images/2025/04/Screenshot-2025-04-14-at-10.56.48-PM.png" class="kg-image" alt="Recap of bypassing anti-bot mechanism" loading="lazy" width="840" height="783" srcset="https://blog.taiker.space/content/images/size/w600/2025/04/Screenshot-2025-04-14-at-10.56.48-PM.png 600w, https://blog.taiker.space/content/images/2025/04/Screenshot-2025-04-14-at-10.56.48-PM.png 840w" sizes="(min-width: 720px) 720px"></figure><p>You can open your HTML file in the browser. After clicking the button, you&#x2019;ll see the Fingerprint ID in the developer console. We also print the details of the components for reference. You can review the content of these components to understand which items might affect the Fingerprint. If you open another tab and visit the same file, you may see the same Fingerprint ID because the content of the components remains unchanged. At this point, you should have an initial understanding of how fingerprinting works.</p><h1 id="bot-detection">Bot Detection</h1><p>Imagine you have a website, and many bots are visiting, crawling everything. Sometimes, this can even cause your website to crash. What can you do? In the past, we often added rate limits based on IP addresses. However, with the rise of proxy services, users can easily obtain new IPs from around the world. This makes the IP-based solution less effective today. Luckily, as you might have guessed, here comes Fingerprinting. By calculating the Fingerprint ID based on the browser&apos;s components, we can easily identify whether website requests are coming from the same instance.</p><pre><code class="language-html">&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;title&gt;Bot Detection with FingerprintJS v4&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;h1&gt;Bot Detection Demo&lt;/h1&gt;
    &lt;p&gt;Click the button below to check if you&apos;re a bot!&lt;/p&gt;
    &lt;button id=&quot;checkBtn&quot;&gt;Check Me&lt;/button&gt;
    &lt;p id=&quot;result&quot;&gt;&lt;/p&gt;

    &lt;!-- Load FingerprintJS as a module --&gt;
    &lt;script type=&quot;module&quot;&gt;
        import FingerprintJS from &apos;https://openfpcdn.io/fingerprintjs/v4&apos;;

        async function checkFingerprint() {
            const fp = await FingerprintJS.load();
            const result = await fp.get({ extendedResult: true });

            console.log(&quot;Fingerprint ID:&quot;, result.visitorId);
            console.log(&quot;Detailed Components:&quot;, result.components);
            console.log(&quot;Confidence Score:&quot;, result.confidence.score);

            const fingerprint = result.visitorId;

            // Send to backend
            fetch(&apos;http://127.0.0.1:5000/check_bot&apos;, {
                method: &apos;POST&apos;,
                headers: { &apos;Content-Type&apos;: &apos;application/json&apos; },
                body: JSON.stringify({ 
                    fingerprint: fingerprint,
                    userAgent: navigator.userAgent,
                    screenSize: `${screen.width}x${screen.height}`,
                    languages: navigator.languages
                })
            })
            .then(response =&gt; response.json())
            .then(data =&gt; {
                document.getElementById(&quot;result&quot;).innerText = data.message;
            });
        }


        // Attach event listener after DOM loads
        document.getElementById(&quot;checkBtn&quot;).addEventListener(&quot;click&quot;, checkFingerprint);
    &lt;/script&gt;
&lt;/body&gt;
&lt;/html&gt;
</code></pre><pre><code class="language-python">from flask import Flask, request, jsonify
from flask_cors import CORS
import time

app = Flask(__name__)
CORS(app)

# Store fingerprints to detect rapid, repetitive requests
fingerprint_tracker = {}

# Known bot-like user-agents
BOT_USER_AGENTS = [
    &quot;HeadlessChrome&quot;, &quot;bot&quot;, &quot;crawl&quot;, &quot;spider&quot;, &quot;Googlebot&quot;, &quot;Bingbot&quot;, &quot;Yahoo! Slurp&quot;, &quot;DuckDuckBot&quot;
]

# Fake screen resolutions (some bots report unusual screen sizes)
UNREALISTIC_SCREENS = [&quot;0x0&quot;, &quot;1x1&quot;, &quot;1024x1024&quot;]

# Check if the request is likely from a bot
def is_bot(fingerprint, user_agent, screen_size, languages):
    current_time = time.time()

    # 1. Detect rapid repeated requests (rate limiting)
    if fingerprint in fingerprint_tracker:
        last_request_time = fingerprint_tracker[fingerprint]
        if current_time - last_request_time &lt; 2:  # Less than 2 seconds between requests
            return True, &quot;Suspicious rapid requests detected&quot;
    
    # Update last request time
    fingerprint_tracker[fingerprint] = current_time

    # 2. Check for known bot user-agents
    if any(bot in user_agent for bot in BOT_USER_AGENTS):
        return True, &quot;Bot-like User-Agent detected&quot;

    # 3. Check for unusual screen sizes (some bots use default headless sizes)
    if screen_size in UNREALISTIC_SCREENS:
        return True, &quot;Unrealistic screen resolution detected&quot;

    # 4. Check if the language list is empty (bots often don&apos;t send language data)
    if not languages:
        return True, &quot;No language data found&quot;

    return False, &quot;Looks like a human&quot;

@app.route(&apos;/check_bot&apos;, methods=[&apos;POST&apos;])
def check_bot():
    data = request.get_json()
    
    fingerprint = data.get(&quot;fingerprint&quot;)
    user_agent = data.get(&quot;userAgent&quot;, &quot;&quot;)
    screen_size = data.get(&quot;screenSize&quot;, &quot;&quot;)
    languages = data.get(&quot;languages&quot;, [])

    bot, message = is_bot(fingerprint, user_agent, screen_size, languages)
    
    return jsonify({&quot;is_bot&quot;: bot, &quot;message&quot;: message})

if __name__ == &apos;__main__&apos;:
    app.run(debug=True)

</code></pre><p>By default, <strong>browsers block cross-origin requests</strong> for security reasons (the Same-Origin Policy). Enabling CORS with <code>flask_cors</code> is what allows your Flask backend to receive requests from the client page on localhost during testing.</p><p>Let&apos;s delve into the backend code. Now that you have the fingerprint, you can easily build a fingerprint profile. Although we use an in-memory dictionary (<code>fingerprint_tracker</code>) as an example, you can imagine recording this fingerprint information in a database. Each time a request with a specific fingerprint comes in, we can query the database to check its history, helping us detect bots. This is just a basic demonstration of how to use a fingerprint to detect bots.</p><h1 id="tls-fingerprint">TLS Fingerprint</h1><p>Next, we will introduce the TLS Fingerprint. As mentioned before, general fingerprinting uses browser components to calculate its value. TLS Fingerprinting is a technique used to identify clients based on their unique Transport Layer Security (TLS) handshake characteristics. 
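</p><p>Before looking at the individual fields, it helps to know the end product: the JA3 method concatenates the ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats) into one string and MD5-hashes it, which is where the <code>ja3</code> and <code>ja3_hash</code> values in the API output later in this post come from. Here is a simplified Python sketch with illustrative field values, not values captured from a real client:</p><pre><code class="language-python">import hashlib

# Illustrative ClientHello field IDs (decimal), not captured from a real client.
tls_version = 771                    # 0x0303, the TLS 1.2 record version
cipher_suites = [4865, 4866, 4867]
extensions = [0, 10, 11, 13, 43, 51]
elliptic_curves = [29, 23, 24]
point_formats = [0]

def dashed(values):
    return '-'.join(str(v) for v in values)

ja3 = ','.join([str(tls_version), dashed(cipher_suites), dashed(extensions),
                dashed(elliptic_curves), dashed(point_formats)])
print(ja3)  # 771,4865-4866-4867,0-10-11-13-43-51,29-23-24,0
print(hashlib.md5(ja3.encode()).hexdigest())</code></pre><p>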
When a client (e.g., browser or bot) connects to a server, it starts with a <strong>TLS handshake</strong> that includes:</p><!--kg-card-begin: html--><table class="min-w-full" data-start="638" data-end="1389"><thead data-start="638" data-end="739"><tr data-start="638" data-end="739"><th data-start="638" data-end="663"><strong data-start="640" data-end="657">TLS Parameter</strong></th><th data-start="663" data-end="699"><strong data-start="665" data-end="684">What It Reveals</strong></th><th data-start="699" data-end="739"><strong data-start="701" data-end="737">How It&apos;s Used for Fingerprinting</strong></th></tr></thead><tbody data-start="834" data-end="1389"><tr data-start="834" data-end="924"><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="834" data-end="858"><strong data-start="836" data-end="851">TLS Version</strong></td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="858" data-end="889">e.g., <code data-start="866" data-end="875">TLS 1.2</code> or <code data-start="879" data-end="888">TLS 1.3</code></td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="889" data-end="924">Some bots use outdated versions</td></tr><tr data-start="925" data-end="1026"><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="925" data-end="949"><strong data-start="927" data-end="944">Cipher Suites</strong></td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="949" data-end="991">List of supported encryption algorithms</td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="991" data-end="1026">Bots often have limited options</td></tr><tr data-start="1027" data-end="1120"><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1027" data-end="1051"><strong data-start="1029" data-end="1043">Extensions</strong></td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1051" data-end="1085">Features like 
ALPN, SNI, GREASE</td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1085" data-end="1120">Unique combinations per browser</td></tr><tr data-start="1121" data-end="1209"><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1121" data-end="1145"><strong data-start="1123" data-end="1142">Elliptic Curves</strong></td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1145" data-end="1178">Supported key exchange methods</td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1178" data-end="1209">Unusual curves = suspicious</td></tr><tr data-start="1210" data-end="1292"><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1210" data-end="1237"><strong data-start="1212" data-end="1236">Signature Algorithms</strong></td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1237" data-end="1267">Authentication methods used</td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1267" data-end="1292">Differ per OS/browser</td></tr><tr data-start="1293" data-end="1389"><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1293" data-end="1317"><strong data-start="1295" data-end="1314">Order of Fields</strong></td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1317" data-end="1360">The sequence of cipher suites/extensions</td><td class="max-w-[calc(var(--thread-content-max-width)*2/3)]" data-start="1360" data-end="1389">Different per client type</td></tr></tbody></table><!--kg-card-end: html--><p>Let&apos;s see an example. There are two clients connected to your server:</p><figure class="kg-card kg-code-card"><pre><code class="language-text">TLS 1.3
Cipher Suites: [TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384]
Extensions: [server_name, supported_versions, key_share, psk_key_exchange_modes]
Elliptic Curves: [X25519, secp256r1]
Signature Algorithms: [rsa_pss_rsae_sha256, ecdsa_secp256r1_sha256]</code></pre><figcaption>Legitimate User (Chrome on Windows)</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-text">TLS 1.2
Cipher Suites: [TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384]
Extensions: [server_name]
Elliptic Curves: [secp256r1]
Signature Algorithms: [rsa_pss_rsae_sha256]</code></pre><figcaption>Suspicious Bot (Python Requests Library)</figcaption></figure><p>Based on this information, we can easily detect that the second request was generated by Python Requests and may belong to a suspicious bot, so we can block it.</p><p>I will use mitmproxy to patch the TLS information of the Python request. First, create the <code>debug_hello.py</code> script to check if our request is being patched in <code>mitmproxy</code>, and then use the <code>mitmdump</code> command to start the mitmproxy service locally.</p><figure class="kg-card kg-code-card"><pre><code class="language-Python">from mitmproxy import tls

EXTENSION_NAMES = {
    0: &quot;server_name&quot;,
    10: &quot;supported_groups&quot;,
    11: &quot;ec_point_formats&quot;,
    13: &quot;signature_algorithms&quot;,
    16: &quot;application_layer_protocol_negotiation&quot;,
    21: &quot;padding&quot;,
    22: &quot;encrypt_then_mac&quot;,
    23: &quot;extended_master_secret&quot;,
    43: &quot;supported_versions&quot;,
    45: &quot;psk_key_exchange_modes&quot;,
    49: &quot;post_handshake_auth&quot;,
    51: &quot;key_share&quot;,
    65281: &quot;renegotiation_info&quot;,
}

def readable_extensions(extensions):
    return [
        EXTENSION_NAMES.get(ext_id, f&quot;unknown({ext_id})&quot;)
        for ext_id, _ in extensions
    ]

def tls_clienthello(data: tls.ClientHelloData):
    hello = data.client_hello
    print(&quot;JA3 Debug:&quot;)
    print(f&quot;  - Client: {data.context.client.peername}&quot;)
    print(f&quot;  - Cipher Suites: {hello.cipher_suites}&quot;)
    print(f&quot;  - Extensions: {readable_extensions(hello.extensions)}&quot;)
</code></pre><figcaption>debug_hello.py</figcaption></figure><pre><code class="language-Bash">$ mitmdump --mode regular@8082 -s debug_hello.py --set tls_client_hello=chrome_120</code></pre><p>Next, we&#x2019;ll prepare a simple Python requests script that sends a request through the proxy, and examine the result from tls.peet.ws. Take some time to review the differences between using and not using a proxy, especially the significant variations in the <code>ciphers</code> section.</p><pre><code class="language-Python">import requests
from pprint import pprint

proxies = {
    &quot;http&quot;: &quot;http://127.0.0.1:8082&quot;,
    &quot;https&quot;: &quot;http://127.0.0.1:8082&quot;,
}

response = requests.get(&quot;https://tls.peet.ws/api/all&quot;, proxies=proxies, verify=False)
# response = requests.get(&quot;https://tls.peet.ws/api/all&quot;)

pprint(response.json())</code></pre><pre><code class="language-Text">{&apos;donate&apos;: &apos;Please consider donating to keep this API running. Visit &apos;
           &apos;https://tls.peet.ws&apos;,
 &apos;http1&apos;: {&apos;headers&apos;: [&apos;Host: tls.peet.ws&apos;,
                       &apos;User-Agent: python-requests/2.32.3&apos;,
                       &apos;Accept-Encoding: gzip, deflate, br, zstd&apos;,
                       &apos;Accept: */*&apos;,
                       &apos;Connection: keep-alive&apos;]},
 &apos;http_version&apos;: &apos;HTTP/1.1&apos;,
 &apos;ip&apos;: &apos;103.234.230.84:61132&apos;,
 &apos;method&apos;: &apos;GET&apos;,
 &apos;tcpip&apos;: {&apos;ip&apos;: {}, &apos;tcp&apos;: {}},
 &apos;tls&apos;: {&apos;ciphers&apos;: [&apos;TLS_AES_256_GCM_SHA384&apos;,
                     &apos;TLS_CHACHA20_POLY1305_SHA256&apos;,
                     &apos;TLS_AES_128_GCM_SHA256&apos;,
                     &apos;TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256&apos;,
                     &apos;TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256&apos;,
                     &apos;TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384&apos;,
                     &apos;TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384&apos;,
                     &apos;TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256&apos;,
                     &apos;TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256&apos;,
                     &apos;TLS_DHE_RSA_WITH_AES_128_GCM_SHA256&apos;,
                     &apos;TLS_DHE_RSA_WITH_AES_256_GCM_SHA384&apos;,
                     &apos;TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256&apos;,
                     &apos;TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256&apos;,
                     &apos;TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256&apos;,
                     &apos;TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA&apos;,
                     &apos;TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA&apos;,
                     &apos;TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384&apos;,
                     &apos;TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384&apos;,
                     &apos;TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA&apos;,
                     &apos;TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA&apos;,
                     &apos;TLS_DHE_RSA_WITH_AES_128_CBC_SHA256&apos;,
                     &apos;TLS_DHE_RSA_WITH_AES_256_CBC_SHA256&apos;,
                     &apos;TLS_RSA_WITH_AES_128_GCM_SHA256&apos;,
                     &apos;TLS_RSA_WITH_AES_256_GCM_SHA384&apos;,
                     &apos;TLS_RSA_WITH_AES_128_CBC_SHA256&apos;,
                     &apos;TLS_RSA_WITH_AES_256_CBC_SHA256&apos;,
                     &apos;TLS_RSA_WITH_AES_128_CBC_SHA&apos;,
                     &apos;TLS_RSA_WITH_AES_256_CBC_SHA&apos;],
         &apos;client_random&apos;: &apos;2059469c2c091fad5cdc5d3923783e2297f8a96de4c9e06efb22572f8294bc2d&apos;,
         &apos;extensions&apos;: [{&apos;data&apos;: &apos;00&apos;,
                         &apos;name&apos;: &apos;extensionRenegotiationInfo (boringssl) &apos;
                                 &apos;(65281)&apos;},
                        {&apos;name&apos;: &apos;server_name (0)&apos;,
                         &apos;server_name&apos;: &apos;tls.peet.ws&apos;},
                        {&apos;elliptic_curves_point_formats&apos;: [&apos;0x00&apos;,
                                                           &apos;0x01&apos;,
                                                           &apos;0x02&apos;],
                         &apos;name&apos;: &apos;ec_point_formats (11)&apos;},
                        {&apos;name&apos;: &apos;supported_groups (10)&apos;,
                         &apos;supported_groups&apos;: [&apos;X25519 (29)&apos;,
                                              &apos;P-256 (23)&apos;,
                                              &apos;X448 (30)&apos;,
                                              &apos;P-521 (25)&apos;,
                                              &apos;P-384 (24)&apos;,
                                              &apos;ffdhe2048 (256)&apos;,
                                              &apos;ffdhe3072 (257)&apos;,
                                              &apos;ffdhe4096 (258)&apos;,
                                              &apos;ffdhe6144 (259)&apos;,
                                              &apos;ffdhe8192 (260)&apos;]},
                        {&apos;data&apos;: &apos;&apos;, &apos;name&apos;: &apos;session_ticket (35)&apos;},
                        {&apos;name&apos;: &apos;application_layer_protocol_negotiation (16)&apos;,
                         &apos;protocols&apos;: [&apos;http/1.1&apos;]},
                        {&apos;data&apos;: &apos;&apos;, &apos;name&apos;: &apos;encrypt_then_mac (22)&apos;},
                        {&apos;extended_master_secret_data&apos;: &apos;&apos;,
                         &apos;master_secret_data&apos;: &apos;&apos;,
                         &apos;name&apos;: &apos;extended_master_secret (23)&apos;},
                        {&apos;name&apos;: &apos;signature_algorithms (13)&apos;,
                         &apos;signature_algorithms&apos;: [&apos;ecdsa_secp256r1_sha256&apos;,
                                                  &apos;ecdsa_secp384r1_sha384&apos;,
                                                  &apos;ecdsa_secp521r1_sha512&apos;,
                                                  &apos;ed25519&apos;,
                                                  &apos;ed25519&apos;,
                                                  &apos;ecdsa_brainpoolP256r1tls13_sha256&apos;,
                                                  &apos;ecdsa_brainpoolP384r1tls13_sha384&apos;,
                                                  &apos;ecdsa_brainpoolP512r1tls13_sha512&apos;,
                                                  &apos;rsa_pss_pss_sha256&apos;,
                                                  &apos;rsa_pss_pss_sha384&apos;,
                                                  &apos;rsa_pss_pss_sha512&apos;,
                                                  &apos;rsa_pss_rsae_sha256&apos;,
                                                  &apos;rsa_pss_rsae_sha384&apos;,
                                                  &apos;rsa_pss_rsae_sha512&apos;,
                                                  &apos;rsa_pkcs1_sha256&apos;,
                                                  &apos;rsa_pkcs1_sha384&apos;,
                                                  &apos;rsa_pkcs1_sha512&apos;,
                                                  &apos;0x303&apos;,
                                                  &apos;0x301&apos;,
                                                  &apos;0x302&apos;,
                                                  &apos;0x402&apos;,
                                                  &apos;0x502&apos;,
                                                  &apos;0x602&apos;]},
                        {&apos;name&apos;: &apos;supported_versions (43)&apos;,
                         &apos;versions&apos;: [&apos;TLS 1.3&apos;, &apos;TLS 1.2&apos;]},
                        {&apos;PSK_Key_Exchange_Mode&apos;: &apos;PSK with (EC)DHE key &apos;
                                                  &apos;establishment (psk_dhe_ke) &apos;
                                                  &apos;(1)&apos;,
                         &apos;name&apos;: &apos;psk_key_exchange_modes (45)&apos;},
                        {&apos;name&apos;: &apos;key_share (51)&apos;,
                         &apos;shared_keys&apos;: [{&apos;X25519 (29)&apos;: &apos;a3ba691321dfea99979785396e5c370ee6ee6a7403cb736d51388c9d65206800&apos;}]}],
         &apos;ja3&apos;: &apos;771,4866-4867-4865-49195-49199-49196-49200-52393-52392-158-159-52394-49187-49191-49161-49171-49188-49192-49162-49172-103-107-156-157-60-61-47-53,65281-0-11-10-35-16-22-23-13-43-45-51,29-23-30-25-24-256-257-258-259-260,0-1-2&apos;,
         &apos;ja3_hash&apos;: &apos;135b770c875c319c3564deacfe0bcc39&apos;,
         &apos;ja4&apos;: &apos;t13d2812h1_a01be8c064b6_0b298858d6c1&apos;,
         &apos;ja4_r&apos;: &apos;t13d2812h1_002f,0035,003c,003d,0067,006b,009c,009d,009e,009f,1301,1302,1303,c009,c00a,c013,c014,c023,c024,c027,c028,c02b,c02c,c02f,c030,cca8,cca9,ccaa_000a,000b,000d,0015,0016,0017,0023,002b,002d,0033,ff01_0403,0503,0603,0807,0808,081a,081b,081c,0809,080a,080b,0804,0805,0806,0401,0501,0601,0303,0301,0302,0402,0502,0602&apos;,
         &apos;peetprint&apos;: &apos;772-771|1.1|29-23-30-25-24-256-257-258-259-260|1027-1283-1539-2055-2056-2074-2075-2076-2057-2058-2059-2052-2053-2054-1025-1281-1537-771-769-770-1026-1282-1538|1||4866-4867-4865-49195-49199-49196-49200-52393-52392-158-159-52394-49187-49191-49161-49171-49188-49192-49162-49172-103-107-156-157-60-61-47-53|0-10-11-13-16-22-23-35-43-45-51-65281&apos;,
         &apos;peetprint_hash&apos;: &apos;a81429f9a27d4b2da1c4126a7921174a&apos;,
         &apos;session_id&apos;: &apos;4f5ff2f21118a79e9af3be3367428189e2c7050a629e24264149dea476e84e7e&apos;,
         &apos;tls_version_negotiated&apos;: &apos;772&apos;,
         &apos;tls_version_record&apos;: &apos;771&apos;}}</code></pre><h1 id="takeaways">Takeaways</h1><p>In this post, I&#x2019;ve introduced Fingerprint and TLS Fingerprint and provided some simple examples to demonstrate how modern websites and servers use them for bot detection. At the end of this post, I&#x2019;ll share a few key takeaways:</p><ul><li>Python Requests Can Be Upgraded to Look Like Chrome: If you find that your default TLS version is still 1.2 for Python requests, remember to upgrade both Python and OpenSSL to enable TLS 1.3. This upgrade helps you bypass TLS-layer bot detection mechanisms like JA3/JA4 checks.</li><li>mitmproxy Presets Emulate Real Browsers and Devices: Including mitmproxy with a fingerprint solution in your crawler engine can help bypass bot detection systems by emulating real browser and device behavior.</li><li>OpenSSL is the Default TLS Engine - and it&#x2019;s Easy to detect: Python, curl, and most CLI tools use OpenSSL for TLS, making their ClientHello predictable and easy to fingerprint unless spoofed. mitmproxy mitigates this issue by terminating TLS itself and sending a new ClientHello to the target using your chosen preset.</li></ul><h1 id="reference">Reference </h1><ul><li><a href="https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/?ref=blog.taiker.space">https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Understanding CSRF, SOP, and CORS Security Risks]]></title><description><![CDATA[<h1 id="background">Background</h1><p>In the past, I used to share my newly acquired knowledge through blogs. However, with the emergence of AI (Specifically LLM models), things have changed. Nowadays, everyone is asking AI questions, making traditional internet searches seem less valuable. 
Despite this shift, I still want to continue sharing through blogs</p>]]></description><link>https://blog.taiker.space/understanding-csrf-sop-and-cors-security-risks/</link><guid isPermaLink="false">67a870dce18ac2500b9de5b7</guid><category><![CDATA[web security]]></category><category><![CDATA[software engineering]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Fri, 28 Feb 2025 17:07:53 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1495714096525-285e85481946?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDIzfHxzZWN1cml0eXxlbnwwfHx8fDE3MzkwOTEwOTJ8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1495714096525-285e85481946?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDIzfHxzZWN1cml0eXxlbnwwfHx8fDE3MzkwOTEwOTJ8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Understanding CSRF, SOP, and CORS Security Risks"><p>In the past, I used to share my newly acquired knowledge through blogs. However, with the emergence of AI (Specifically LLM models), things have changed. Nowadays, everyone is asking AI questions, making traditional internet searches seem less valuable. Despite this shift, I still want to continue sharing through blogs because it allows me to practice my English writing and organize my thoughts in a structured way. However, given the rise of AI, I also plan to adapt my blog writing.</p><p>These days, it&apos;s fortunate to learn alongside AI&#x2014;you can ask any question if you&#x2019;re struggling with a specific topic. The key skill, though, is validating the answers that AI provides to ensure they&#x2019;re correct and make sense. I believe this skill will become even more important as we live in the AI era, which is why I want to write this blog post with the help of AI. 
</p><p>I hope you enjoy this post, even though much of the content may be generated by AI. I still aim to share my perspective, and I hope this mindset is helpful to you.</p><p>I struggled with web security topics because I had many unanswered questions before. However, thanks to AI, I now have the opportunity to clarify everything; that&apos;s why I chose this topic as my first blog written in collaboration with AI, and I will show you how to engage with AI back and forth to learn a specific topic.</p><h1 id="definitions">Definitions</h1><p>The following definitions are generated via AI; if you are familiar with them, you can skip this section.</p><h2 id="sopsame-origin-policy">SOP - Same-Origin Policy</h2><p>The <strong>Same-Origin Policy (SOP)</strong> is a <strong>browser security mechanism</strong> that <strong>restricts web pages from accessing data from a different origin</strong> (protocol, domain, or port).</p><ul><li><strong>Purpose</strong>: Prevents <strong>cross-site data theft</strong> and protects user sessions.</li><li><strong>Example</strong>: A script on <code>https://example.com</code> <strong>cannot</strong> read cookies from <code>https://attacker.com</code>.</li></ul><h2 id="csrfcross-site-request-forgery">CSRF - <strong>Cross-Site Request Forgery</strong></h2><p><strong>Cross-Site Request Forgery (CSRF)</strong> is an <strong>attack that tricks a logged-in user into performing an unwanted action</strong> on a trusted website.</p><ul><li><strong>How it works</strong>: The attack exploits the <strong>browser&#x2019;s automatic cookie handling</strong> to send <strong>unauthorized</strong> requests.</li><li> <strong>Example</strong>: A victim logs into <code>bank.com</code>, then visits <code>attacker.com</code>, which <strong>automatically submits a money transfer request</strong> to <code>bank.com</code> using the victim&apos;s session.</li><li> <strong>Prevention</strong>: Use <strong>CSRF tokens</strong>, <code>SameSite=Strict</code> cookies,
and <strong>re-authentication for sensitive actions</strong>.</li></ul><h2 id="corscross-origin-resource-sharing">CORS - Cross-Origin Resource Sharing</h2><p><strong>Cross-Origin Resource Sharing (CORS)</strong> is a security feature that <strong>allows controlled access to resources from different origins</strong> by defining which external sites can make requests.</p><ul><li><strong>Why it&apos;s needed</strong>: SOP blocks <strong>all</strong> cross-origin requests by default, but CORS allows <strong>specific</strong> trusted domains to access an API.</li><li><strong>Example</strong>: If <code>api.example.com</code> wants to allow requests from <code>frontend.com</code>, the server must set: <code>Access-Control-Allow-Origin: https://frontend.com</code></li><li><strong>Security risk</strong>: <strong>Misconfiguring CORS (<code>Access-Control-Allow-Origin: *</code>) can bypass CSRF protection</strong>, allowing attackers to send requests using a victim&#x2019;s session.</li></ul><h1 id="discussion-with-ai">Discussion with AI</h1><p>In the past, I was familiar with the terms listed above individually, but I had questions about why we need all these concepts to design website security. One day, while logging into a Django website I was developing, I noticed I received both a <code>sessionID</code> and a <code>csrftoken</code>. The <code>sessionID</code> was marked as <code>HttpOnly</code>, while the <code>csrftoken</code> was not. This sparked my curiosity about why they had different attributes in the browser, so I decided to investigate further.</p><p>I opened ChatGPT and asked the following question:</p><p><strong>Q</strong>: I used Django to implement the login functionality.
After logging in, I received two cookies: <code>csrftoken</code> and <code>sessionid</code>, and I&#x2019;m curious about the difference between these cookies.</p><p><strong>Ans</strong>: Django uses <code>sessionid</code> to keep track of logged-in users, and this ID is stored in a cookie on the user&apos;s browser. The session ID maps to session data stored on the server (e.g., in a database, cache, or file system). When the user makes subsequent requests, their browser sends the <code>sessionid</code> cookie, and Django reads this cookie, retrieves the session data, and associates it with the request.</p><p>The <code>csrftoken</code> is a Cross-Site Request Forgery (CSRF) token used by Django to prevent CSRF attacks. This token is stored in a cookie (<code>csrftoken</code> ) and embedded as a hidden field in the form. When the form is submitted, Django checks if the CSRF token included in the form matches the token stored in the user&apos;s cookie. Since browsers automatically include cookies in requests, an attacker could trick a logged-in user into making an unwanted request (e.g., submitting a form). The CSRF token ensures that only forms generated by Django are accepted.</p><p><strong>Q</strong>: I noticed that the <code>sessionid</code> is HttpOnly, but the <code>csrftoken</code> is not. Why is that?</p><p><strong>Ans</strong>: In Django, the <code>sessionid</code> cookie is marked as <code>HttpOnly</code> to enhance security. This attribute prevents client-side scripts from accessing the cookie, helping mitigate risks associated with cross-site scripting (XSS) attacks. By making the <code>sessionid</code> cookie <code>HttpOnly</code>, it cannot be accessed via JavaScript, which is why this tag helps prevent XSS risks.</p><p><strong>Q</strong>: Can I use JavaScript to retrieve the <code>csrftoken</code> and inject it into a hidden form input? If so, that means a hacker could potentially do the same thing, right? 
Does this mean the CSRF token is ineffective?</p><p><strong>Ans</strong>: Yes! You can use JavaScript to retrieve the CSRF token from the cookie and inject it into the hidden form input field dynamically. The key point is that the CSRF token cookie is readable only by pages on the same site, which is why, within Django, we must set at least <code>SameSite=Lax</code>. With that in place, JavaScript running on another origin cannot access the CSRF token because of browser security policies (the Same-Origin Policy).</p><p><strong>Q</strong>: If I configure the wrong settings, whether through Nginx or Django, to allow requests from different origins using <code>Access-Control-Allow-Origin: *</code>, and I don&#x2019;t set the <code>SameSite</code> attribute on the CSRF token, could this configuration render the CSRF protection ineffective?</p><p><strong>Ans</strong>: Yes, misconfiguring CORS (<code>Access-Control-Allow-Origin: *</code>) and not setting the <code>SameSite</code> attribute properly can make CSRF protection ineffective. Let&#x2019;s break it down. Under this scenario, any origin (including attacker.com) can make requests to your API without restrictions. When a victim visits <code>attacker.com</code>, their <code>sessionid</code> and <code>csrftoken</code> are automatically included in cross-origin requests. Here is how an attacker exploits this:</p><ol><li>Victim logs into <code>bank.com</code> (legitimate site).</li><li>Victim visits <code>attacker.com</code>.</li><li>The attacker executes a <code>fetch()</code> request to <code>bank.com</code>.</li><li>Because of <code>Access-Control-Allow-Origin: *</code>, the request is allowed.</li><li>Because cookies are included (no <code>SameSite</code> restriction), the bank processes the request.</li><li>The attack succeeds!</li></ol><p>Finally, after asking a few questions, I now have a clear understanding of basic web security concepts like CSRF, SOP, and CORS.
I now deeply appreciate the great advantage of today&#x2019;s era, where asking AI questions helps resolve confusion.</p><h1 id="takeaways">Takeaways</h1><p>In the final section, I&#x2019;ll provide a quick summary of today&#x2019;s topic, with the assistance of AI, of course. Hope you enjoy it. </p><h3 id="summary"><strong>Summary</strong></h3><!--kg-card-begin: html--><table data-start="1804" data-end="2434"><thead data-start="1804" data-end="1870"><tr data-start="1804" data-end="1870"><th data-start="1804" data-end="1816"><strong data-start="1806" data-end="1814">Term</strong></th><th data-start="1816" data-end="1833"><strong data-start="1818" data-end="1832">Definition</strong></th><th data-start="1833" data-end="1847"><strong data-start="1835" data-end="1846">Purpose</strong></th><th data-start="1847" data-end="1870"><strong data-start="1849" data-end="1868">Example of Risk</strong></th></tr></thead><tbody data-start="1930" data-end="2434"><tr data-start="1930" data-end="2089"><td><strong data-start="1932" data-end="1939">SOP</strong></td><td>Restricts cross-origin access</td><td>Prevents cross-site data theft</td><td>A script from <code data-start="2022" data-end="2032">evil.com</code> tries to read cookies from <code data-start="2060" data-end="2070">bank.com</code> (Blocked by SOP)</td></tr><tr data-start="2090" data-end="2254"><td><strong data-start="2092" data-end="2100">CSRF</strong></td><td>Tricks users into making unintended requests</td><td>Prevents unauthorized actions</td><td>A victim clicks a malicious link that transfers money from their bank</td></tr><tr data-start="2255" data-end="2434"><td><strong data-start="2257" data-end="2265">CORS</strong></td><td>Allows controlled cross-origin access</td><td>Enables APIs to be shared securely</td><td><code data-start="2346" data-end="2378">Access-Control-Allow-Origin: *</code> allows any site to make API requests (Security Risk!)</td></tr></tbody></table><!--kg-card-end: html--><p>&#x2705; <strong>SOP 
blocks cross-origin requests</strong> by default.<br>&#x2705; <strong>CORS allows specific cross-origin access (if configured correctly).</strong><br>&#x2705; <strong>CSRF protections prevent attackers from hijacking authenticated actions.</strong> </p>]]></content:encoded></item><item><title><![CDATA[Simple Anomaly Detection for the Text Data]]></title><description><![CDATA[<h1 id="background">Background</h1><p>Recently, our team faced a challenge: how to detect suspicious, incorrect data from our vendor. In other words, our vendor sometimes provides incorrect raw data to us. Unfortunately, we don&#x2019;t have a solution to prevent this issue. As a result, we initially passed the incorrect</p>]]></description><link>https://blog.taiker.space/simple-anomaly-detection-for-the-text-data/</link><guid isPermaLink="false">6762aa0de18ac2500b9de3b6</guid><category><![CDATA[Data]]></category><category><![CDATA[python]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Mon, 30 Dec 2024 05:38:18 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1630651394339-b6522d8e4cab?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDV8fGFub21hbHl8ZW58MHx8fHwxNzM0NTE5MzM0fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1630651394339-b6522d8e4cab?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDV8fGFub21hbHl8ZW58MHx8fHwxNzM0NTE5MzM0fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Simple Anomaly Detection for the Text Data"><p>Recently, our team faced a challenge: how to detect suspicious, incorrect data from our vendor. In other words, our vendor sometimes provides incorrect raw data to us. Unfortunately, we don&#x2019;t have a solution to prevent this issue. 
As a result, we initially passed the incorrect data through the data pipeline across multiple services, leading to a series of data update errors. Most seriously, we ended up providing entirely wrong information to our customers, which could increase product churn - something we definitely want to avoid. </p><h1 id="goal">Goal</h1><p>First, let me introduce what the data looks like. It&apos;s pure text data in JSON format, made up of multiple parts, and each file differs slightly from the others except one. If you&#x2019;re interested, you can refer to the following link: <a href="https://github.com/TaikerLiang/anomaly-detection-poc/tree/main/data?ref=blog.taiker.space" rel="noopener">GitHub Link</a>, and you can also use an online JSON diff tool to help you spot the differences quickly.</p><p>The file <code>data/MAEU244638355-230015009-45b04974f7594ad299fca7fb5138ac36.json</code> contains the anomaly because it differs significantly from the others, and our goal is to come up with a solution that helps us identify it.</p><h1 id="solutions">Solutions</h1><p>How do we identify the anomalous data? The first idea that came to my mind was <code>can I use text similarity?</code> Why not use a hash function instead? As mentioned before, the files are all slightly different from each other and only one file is significantly anomalous, so a pure hash-based solution will not work effectively. After deciding on the direction, I started preparing the dataset and building a POC to verify it.</p><!--kg-card-begin: markdown--><p><strong>Compute Pairwise Similarity</strong></p>
<ul>
<li>Normalize JSON files (sort keys, remove timestamps).</li>
<li>Use a similarity metric to compare files.
<ul>
<li>Cosine Similarity: Compare vectorized features of JSONs.</li>
<li>Structural Similarity: Compare the structure and key counts.</li>
</ul>
</li>
</ul>
<p><strong>Identify Anomaly</strong></p>
<ul>
<li>Calculate the average similarity score for each file compared to the rest.</li>
<li>The file with the lowest average similarity is the anomaly.</li>
</ul>
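The normalization step above can be sketched in a few lines. This is a minimal illustration only: the `VOLATILE_KEYS` names below are hypothetical examples, not fields taken from the actual vendor payloads.

```python
import json

# Hypothetical volatile fields; the real vendor payloads may use different names.
VOLATILE_KEYS = {"timestamp", "updated_at"}

def strip_volatile(data):
    """Recursively drop volatile keys so they don't dominate the comparison."""
    if isinstance(data, dict):
        return {k: strip_volatile(v) for k, v in data.items() if k not in VOLATILE_KEYS}
    if isinstance(data, list):
        return [strip_volatile(item) for item in data]
    return data

def canonicalize(data):
    """Serialize with sorted keys so key order never affects similarity."""
    return json.dumps(strip_volatile(data), sort_keys=True)

a = {"b": 1, "a": 2, "timestamp": "2024-12-30T05:38:18Z"}
b = {"a": 2, "b": 1, "timestamp": "2024-12-18T00:00:00Z"}
print(canonicalize(a) == canonicalize(b))  # True: identical once normalized
```

With files in this canonical form, any remaining textual difference reflects a real content difference, which is exactly what the similarity metric should measure.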
<!--kg-card-end: markdown--><p>Based on these ideas, I&#x2019;ve built the POC as outlined below. &#xA0;After running the program, as you can see, we successfully detected the incorrect data. You can review the code first; it&#x2019;s not complicated, but there are some theoretical concepts I&#x2019;ll explain in the next section.</p><pre><code class="language-Python">import os
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def normalize_json(json_data):
    &quot;&quot;&quot;Normalize JSON for comparison by sorting keys.&quot;&quot;&quot;
    return json.dumps(json_data, sort_keys=True)

def compute_similarity(json_files):
    &quot;&quot;&quot;Compute similarity scores between JSON files.&quot;&quot;&quot;
    normalized_files = []
    
    for file_path in json_files:
        with open(file_path, &apos;r&apos;) as f:
            data = json.load(f)
        normalized_files.append(normalize_json(data))
    
    # Vectorize the normalized JSON files
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(normalized_files)
    
    # Compute pairwise cosine similarity
    similarity_matrix = cosine_similarity(tfidf_matrix)
    return similarity_matrix

def detect_anomaly(json_files):
    &quot;&quot;&quot;Detect the anomalous file based on similarity scores.&quot;&quot;&quot;
    similarity_matrix = compute_similarity(json_files)
    avg_similarity = similarity_matrix.mean(axis=1)
    
    # Find the file with the lowest average similarity
    anomaly_index = np.argmin(avg_similarity)
    anomaly_file = json_files[anomaly_index]
    return anomaly_file, avg_similarity



# Load all JSON files from the &quot;data&quot; folder
data_folder = &quot;data&quot;  # Replace with your folder name
json_files = [os.path.join(data_folder, file) for file in os.listdir(data_folder) if file.endswith(&apos;.json&apos;)]

print(json_files)

# Detect the anomaly
if json_files:
    anomaly_file, avg_similarity = detect_anomaly(json_files)
    print(f&quot;Anomalous file detected: {anomaly_file}&quot;)
    print(&quot;Average Similarity Scores:&quot;, avg_similarity)
else:
    print(&quot;No JSON files found in the specified folder.&quot;)
</code></pre><pre><code class="language-shell">&#x279C; python anomaly_detection.py
[&apos;data/MAEU244638355-230015009-45b04974f7594ad299fca7fb5138ac36.json&apos;, &apos;data/MAEU244638355-229834907-111a125fff524e568539fa9ee1259cc9.json&apos;, &apos;data/MAEU244638355-226316743-999b976a3e3147cdbefcb89497bcc8b0.json&apos;, &apos;data/MAEU244638355-228799200-1d919c554e0643c7869dcafe06206f12.json&apos;, &apos;data/MAEU244638355-227412648-4ab0fcecc65441b1800033f0ddcddd67.json&apos;]
Anomalous file detected: data/MAEU244638355-230015009-45b04974f7594ad299fca7fb5138ac36.json
Average Similarity Scores: [0.91224814 0.96954086 0.96806412 0.97184808 0.96856012]</code></pre><h1 id="implementation-details">Implementation Details</h1><p>Now, let&apos;s dive deeper into the code. The POC consists of several key steps, each with its specific purpose:</p><ol><li><strong>Normalization: </strong>JSON files are normalized by sorting keys to ensure a consistent structure.</li><li><strong>TF-IDF Vectorization</strong>: Converts the normalized JSON strings into a vectorized form.</li><li><strong>Cosine Similarity</strong>: Measures the pairwise similarity between the files.</li><li><strong>Anomaly Detection</strong>: Identifies the file with the lowest average similarity score. </li></ol><p>Next, I will explain <strong>TF-IDF</strong> and <strong>Cosine Similarity</strong>; both are simple concepts for text analysis.</p><p><strong>TF-IDF</strong></p><p>TF-IDF stands for <strong>Term Frequency-Inverse Document Frequency</strong>, a common technique in natural language processing (NLP) to represent text data in a way that reflects the importance of a word in a collection of documents.</p><p><strong>Term Frequency (TF)</strong></p><ul><li>Measures how often a term (word) appears in a document.</li><li>Formula:</li></ul><!--kg-card-begin: markdown--><p>$$ TF(t) = {\text{Number of times term t appears in a document} \over \text{Total number of terms in the document}} $$</p>
<!--kg-card-end: markdown--><p><strong>Inverse Document Frequency (IDF)</strong></p><ul><li>Measures how important a term is. Terms that occur in many documents are less significant.</li><li>Formula:</li></ul><!--kg-card-begin: markdown--><p>$$ IDF(t) = \log{\text{Total number of documents} \over \text{Number of documents containing t}} $$</p>
<!--kg-card-end: markdown--><p><strong>TF-IDF Weight</strong></p><ul><li>Combines TF and IDF to assign a weight to each term in each document.</li><li>Formula:</li></ul><!--kg-card-begin: markdown--><p>$$ \text{TF-IDF}(t) = \text{TF}(t) \times \text{IDF}(t) $$</p>
<!--kg-card-end: markdown--><p><strong>What the tfidf_matrix Contains</strong></p><p>The <strong><code>tfidf_matrix</code></strong> is a 2D sparse matrix where:</p><ul><li>Rows represent the documents (in this case, normalized JSON files).</li><li>Columns represent unique terms (words) across all documents.</li><li>Each entry (i, j) in the matrix is the <strong>TF-IDF weight</strong> of the term <code>j</code> in document <code>i</code>.</li></ul><p><strong>Why use TF-IDF</strong></p><ul><li><strong>Handles Variability in JSON Files</strong>: Converts textual differences (e.g., slight variations in keys or values) into comparable numerical data.</li><li><strong>Ignores Common Words</strong>: Words that appear in all files (e.g., &quot;is&quot;, &quot;a&quot;) get lower weights, focusing comparisons on more distinctive terms.</li><li><strong>Captures Structure</strong>: By flattening and normalizing JSON into strings, TF-IDF indirectly captures the structure and content differences between files.</li></ul><h3 id="cosine-similarity">Cosine Similarity</h3><p>When translating data from text to vectors, there are many mathematical formulas and concepts that can guide us. Here, we choose cosine similarity, which is commonly used to calculate the similarity between two vectors. You can refer to the figure below to visualize this.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2024/12/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png" class="kg-image" alt="Simple Anomaly Detection for the Text Data" loading="lazy" width="488" height="376"><figcaption>Ref: https://pub.aimind.so/understanding-cosine-similarity-and-cosine-distance-in-depth-cc91eac3ef2</figcaption></figure><p>The cosine similarity formula comes from the definition of the dot product; in Euclidean space, the dot product is defined as</p><!--kg-card-begin: markdown--><p>$$ \mathbf{A} \cdot \mathbf{B} = |\mathbf{A}| |\mathbf{B}| \cos(\theta) $$</p>
<!--kg-card-end: markdown--><p>from which we obtain the cosine similarity:</p><!--kg-card-begin: markdown--><p>$$ \text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|} $$</p>
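The formula translates directly into a few lines of NumPy. This toy example (the vectors are chosen purely for illustration) checks the two boundary cases: parallel vectors score 1 and orthogonal vectors score 0.

```python
import numpy as np

def cosine_sim(a, b):
    """cos(theta) = (A . B) / (|A| |B|), straight from the dot-product identity."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_sim(a, 2 * a))                       # ~1.0: same direction
print(cosine_sim(a, np.array([-2.0, 1.0, 0.0])))  # 0.0: orthogonal
```

Because the norms divide out, scaling a vector does not change its score; cosine similarity compares direction only, which is why document length matters less than word distribution.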
<!--kg-card-end: markdown--><h1 id="takeaways">Takeaways</h1><p>In this blog, I demonstrated how to use TF-IDF and Cosine Similarity with JSON normalization to build a simple anomaly detection system for a text dataset. We also explained what TF-IDF and Cosine Similarity are, and I hope these concepts will be useful to you in the future.</p><h1 id="reference">Reference</h1><ul><li><a href="https://github.com/TaikerLiang/anomaly-detection-poc?ref=blog.taiker.space">https://github.com/TaikerLiang/anomaly-detection-poc</a></li><li><a href="https://pub.aimind.so/understanding-cosine-similarity-and-cosine-distance-in-depth-cc91eac3ef2?ref=blog.taiker.space">https://pub.aimind.so/understanding-cosine-similarity-and-cosine-distance-in-depth-cc91eac3ef2</a></li></ul>]]></content:encoded></item><item><title><![CDATA[How do I Take Notes for My Daily Work]]></title><description><![CDATA[<h1 id="background">Background</h1><p>Today, I will show you how I take notes for my daily work. To begin, I&apos;d like to introduce my role and job to give you a better understanding of my responsibility. 
I am an engineer at a SaaS company, where I spend half of my time</p>]]></description><link>https://blog.taiker.space/notebook_tools/</link><guid isPermaLink="false">6713d085e18ac2500b9de1f7</guid><category><![CDATA[Note]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Sun, 20 Oct 2024 09:33:53 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1523634921619-37ce98c1877f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDI3fHxub3RlfGVufDB8fHx8MTcyOTM1Mzk4M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1523634921619-37ce98c1877f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDI3fHxub3RlfGVufDB8fHx8MTcyOTM1Mzk4M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="How do I Take Notes for My Daily Work"><p>Today, I will show you how I take notes for my daily work. To begin, I&apos;d like to introduce my role and job to give you a better understanding of my responsibilities. I am an engineer at a SaaS company, where I spend half of my time focused on system and product operations.</p><p>To explain in more detail, when customers use our product and encounter any unexpected outcomes, they may reach out to our support team for assistance. If the support team is unable to resolve the issue, it gets escalated to me. My job is to investigate the problem, whether it&apos;s a bug or an error related to a new feature we&apos;ve implemented. </p><p>We are developing a software service to address the complexities of the supply chain, which is a challenging domain. As a result, our product team may sometimes misunderstand certain aspects, leading to features being delivered with issues that may confuse our customers. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2024/10/image.png" class="kg-image" alt="How do I Take Notes for My Daily Work" loading="lazy" width="978" height="259" srcset="https://blog.taiker.space/content/images/size/w600/2024/10/image.png 600w, https://blog.taiker.space/content/images/2024/10/image.png 978w" sizes="(min-width: 720px) 720px"><figcaption>Defense/Offense engineering</figcaption></figure><p>My personal challenge is quickly understanding the code behind a given feature across multiple GitHub repos. I have also found that I am not good at memorizing things: even code I have read before, I forget by the next time, forcing me to read and understand it all over again, which takes a lot of my time. That&apos;s why I need note-taking tools to help me overcome these challenges.</p><h1 id="vs-code-extensions">VS-Code Extensions</h1><p>I use VS-Code as my primary IDE for development, and to overcome my challenges, I&apos;m always on the lookout for extensions that can help me be more productive when reading code. Below are two of the extensions I find most useful:</p><p><strong><a href="https://marketplace.visualstudio.com/items?itemName=alefragnani.Bookmarks&amp;ref=blog.taiker.space">Bookmarks</a></strong>: The Bookmarks extension is invaluable when working across multiple files or large projects. It allows me to place bookmarks within my code, making it easier to quickly jump to important sections. I often use bookmarks to tag the entry points of certain features or the main business logic functions. This way, I can easily navigate to the exact part of the code I need without wasting time searching through files. 
It significantly improves my workflow, especially when dealing with multiple repositories or complex codebases.</p><p><strong><a href="https://marketplace.visualstudio.com/items?itemName=tkrkt.linenote&amp;ref=blog.taiker.space">Line Note</a></strong>: Line Note is an incredibly useful extension that allows me to add inline comments or annotations directly to the code. These notes are private, meaning only I can view and edit them, which is particularly helpful for adding context or reminders that are relevant to me personally. For example, when investigating a bug or trying to understand a piece of business logic, I use Line Note to jot down my thoughts and observations. These annotations help me retain important information and allow me to quickly recall why certain decisions were made or what I need to focus on next.</p><p>By using these two extensions together, I can quickly locate the code I need and access my personal notes, which saves me a lot of time and effort. This combination helps me retain knowledge from previous investigations, making it easier to solve my daily tasks across multiple product services.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2024/10/image-1.png" class="kg-image" alt="How do I Take Notes for My Daily Work" loading="lazy" width="1456" height="718" srcset="https://blog.taiker.space/content/images/size/w600/2024/10/image-1.png 600w, https://blog.taiker.space/content/images/size/w1000/2024/10/image-1.png 1000w, https://blog.taiker.space/content/images/2024/10/image-1.png 1456w" sizes="(min-width: 720px) 720px"><figcaption>The screenshots of Bookmarks and Line Note</figcaption></figure><h1 id="notion-for-daily-to-do-list">Notion for Daily To-Do List</h1><p>Notion has been my primary tool for notes and documentation for a long time. I primarily use it for two purposes: 1. quick notes, and 2. my daily to-do list. 
</p><ul><li>Quick notes refer to the need to record information temporarily, such as ideas I want to jot down immediately to avoid forgetting them, or when my supervisor or teammates require my assistance and I need a place to note it for future reference.</li><li>For my daily to-do list, I created a custom template in Notion. You might wonder why I chose to build my own to-do list when there are so many available apps. I have tried various to-do list apps, but I always questioned why I needed to use an additional app for my daily tasks. My requirements are quite simple, focusing on two main points: 1. displaying today&apos;s to-do list, allowing me to archive completed tasks with a click, and 2. showing all to-do items so I can keep track of upcoming tasks and any incomplete ones. </li></ul><p>I&#x2019;ll share the Notion template via the following link; if you&#x2019;re interested, feel free to try it out.</p><figure class="kg-card kg-bookmark-card kg-card-hascaption"><a class="kg-bookmark-container" href="https://taiker.notion.site/To-Do-List-Template-f032b000273347abace2cf5e8c0dc6d1?ref=blog.taiker.space"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Notion &#x2013; The all-in-one workspace for your notes, tasks, wikis, and databases.</div><div class="kg-bookmark-description">A new tool that blends your everyday work apps into one. It&#x2019;s the all-in-one workspace for you and your team</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://taiker.notion.site/images/logo-ios.png" alt="How do I Take Notes for My Daily Work"><span class="kg-bookmark-author">Notion</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://www.notion.so/images/meta/default.png" alt="How do I Take Notes for My Daily Work"></div></a><figcaption>To-Do List Template</figcaption></figure><h1 id="notebooklm-for-quick-learning">NotebookLM for Quick Learning</h1><p>NotebookLM is a new AI notebook tool created by Google. 
It&apos;s still in beta, but I see huge potential in it for helping users quickly learn about various topics. NotebookLM allows you to upload documents and then ask questions about them, making it much easier to understand complex information without needing to manually search through notes or documentation. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.taiker.space/content/images/2024/10/image-2.png" class="kg-image" alt="How do I Take Notes for My Daily Work" loading="lazy" width="2000" height="1090" srcset="https://blog.taiker.space/content/images/size/w600/2024/10/image-2.png 600w, https://blog.taiker.space/content/images/size/w1000/2024/10/image-2.png 1000w, https://blog.taiker.space/content/images/size/w1600/2024/10/image-2.png 1600w, https://blog.taiker.space/content/images/size/w2400/2024/10/image-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The screenshots of NotebookLM</figcaption></figure><p>Although NotebookLM is still in its early stages, I believe it represents the future of personalized learning tools, and I am trying to integrate NotebookLM with my existing workflow. I am experimenting with combining it with Notion, as I think this could be a powerful way to build a personal knowledge database. By storing notes in Notion and using NotebookLM to explore and understand them in depth, I can create a seamless loop of capturing information and then efficiently revisiting and internalizing it. 
This combination has the potential to make my learning process more systematic and help me retain information for the long term.</p><h1 id="key-takeaways">Key Takeaways</h1><p>In this blog, I shared how I use various notebook tools to manage my daily tasks, and I hope these insights will be helpful to you.</p><ul><li><strong>VS-Code extensions</strong>: &#xA0;extensions like Bookmarks and Line Note help enhance productivity by enabling efficient navigation and personalized notes within the codebase.</li><li><strong>Notion for task management</strong>: Notion is a versatile tool for managing daily tasks and quick notes, offering custom templates that meet specific needs without added complexity.</li><li><strong>NotebookLM</strong>: NotebookLM&apos;s interactive features make learning more engaging by providing personalized insights and breaking down complex topics.</li></ul>]]></content:encoded></item><item><title><![CDATA[How to Debug Your GitHub Action in Your Local Environment]]></title><description><![CDATA[<h1 id="background">Background</h1><p>Recently, I needed to use GitHub Actions to complete the CI/CD flow for a project. 
Based on my previous experiences, I&apos;ve encountered numerous challenges in validating GitHub Actions, which required me to submit multiple commits to check if they were functioning properly, as shown in the</p>]]></description><link>https://blog.taiker.space/how-to-debug-your-github-action-in-your-local-environment/</link><guid isPermaLink="false">6600f8c8e18ac2500b9ddc9e</guid><category><![CDATA[Devops]]></category><category><![CDATA[productivity]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Sun, 18 Aug 2024 17:35:33 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1518107616985-bd48230d3b20?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDI4fHxnaXRodWJ8ZW58MHx8fHwxNzExMzM5NzgxfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1518107616985-bd48230d3b20?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDI4fHxnaXRodWJ8ZW58MHx8fHwxNzExMzM5NzgxfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="How to Debug Your GitHub Action in Your Local Environment"><p>Recently, I needed to use GitHub Actions to complete the CI/CD flow for a project. 
Based on my previous experiences, I&apos;ve encountered numerous challenges in validating GitHub Actions, which required me to submit multiple commits to check if they were functioning properly, as shown in the following picture.</p><figure class="kg-card kg-image-card"><img src="https://blog.taiker.space/content/images/2024/07/Screenshot-2024-07-20-at-10.52.56-PM.png" class="kg-image" alt="How to Debug Your GitHub Action in Your Local Environment" loading="lazy" width="960" height="794" srcset="https://blog.taiker.space/content/images/size/w600/2024/07/Screenshot-2024-07-20-at-10.52.56-PM.png 600w, https://blog.taiker.space/content/images/2024/07/Screenshot-2024-07-20-at-10.52.56-PM.png 960w" sizes="(min-width: 720px) 720px"></figure><p>This time, I started looking for a tool or solution that could allow me to validate the script in my local environment and save a lot of time. In this blog, I will show you how I use <a href="https://github.com/nektos/act?ref=blog.taiker.space">Act</a> to complete and validate GitHub Action scripts in my local environment. </p><blockquote>Act is a tool that can run your GitHub Actions locally! </blockquote><h1 id="act-installation">ACT Installation</h1><p>I used Homebrew to install Act on my Mac. </p><pre><code class="language-shell">$ brew install act</code></pre><h1 id="usage-guide">Usage guide</h1><p>I will use the <code>github-action-demo</code> repo to demonstrate some basic usage and complete the following case study section.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/cplee/github-actions-demo.git?ref=blog.taiker.space"><div class="kg-bookmark-content"><div class="kg-bookmark-title">GitHub - cplee/github-actions-demo: Demonstration of GitHub Actions</div><div class="kg-bookmark-description">Demonstration of GitHub Actions. 
Contribute to cplee/github-actions-demo development by creating an account on GitHub.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.com/fluidicon.png" alt="How to Debug Your GitHub Action in Your Local Environment"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">cplee</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/27a9e5add25adcad4c21c0df6d74d187f0fb372be2e61e08416d8b5c5e5c8c98/cplee/github-actions-demo" alt="How to Debug Your GitHub Action in Your Local Environment"></div></a></figure><pre><code class="language-shell">$ git clone git@github.com:cplee/github-actions-demo.git
$ cd github-actions-demo
$ act -l --container-architecture linux/amd64
INFO[0000] Using docker host &apos;unix:///var/run/docker.sock&apos;, and daemon socket &apos;unix:///var/run/docker.sock&apos;
Stage  Job ID  Job name  Workflow name  Workflow file  Events
0      test    test      CI             main.yml       push</code></pre><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">By default <code>act</code> will run all jobs in every workflow in <code>.github/workflows</code>.</div></div><p>First, we run the <code>act -l</code> command, which will show all the workflows within the repo; as we can see, there is only one workflow called <code>main.yml</code> and the content of <code>main.yml</code> would be:</p><pre><code>name: CI
on: push

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-node@v1
    - run: npm install
    - run: npm test</code></pre><p>It&apos;s pretty simple: just run an <code>ubuntu-latest</code> environment, install the necessary packages, and then try to pass all the tests within the <code>tests</code> folder. Next, let&apos;s focus on the second line of this file: <code>on: push</code>. What does this mean? It means this CI workflow will only be triggered when a <code>push</code> event occurs. </p><p>Let us do some simple experiments:</p><pre><code>$ act pull_request
INFO[0000] Using docker host &apos;unix:///var/run/docker.sock&apos;, and daemon socket &apos;unix:///var/run/docker.sock&apos;
Error: Could not find any stages to run. View the valid jobs with `act --list`. Use `act --help` to find how to filter by Job ID/Workflow/Event Name</code></pre><p>We used <code>Act</code> to mock the <code>pull_request</code> event, but encountered an error message stating that no stages were found to run. This is expected because the workflow is only triggered by the <code>push</code> event. Next, we will use Act to mock the <code>push</code> event and see what happens.</p><pre><code>&#x279C; act --container-architecture linux/amd64 push
INFO[0000] Using docker host &apos;unix:///var/run/docker.sock&apos;, and daemon socket &apos;unix:///var/run/docker.sock&apos;
[CI/test] &#x1F680;  Start image=catthehacker/ubuntu:act-latest
[CI/test]   &#x1F433;  docker pull image=catthehacker/ubuntu:act-latest platform=linux/amd64 username= forcePull=true
[CI/test] using DockerAuthConfig authentication for docker pull
...
...
...</code></pre><p>As we can see, it triggers the workflow successfully; pretty simple, right? Now that we know how to use <code>act</code> to trigger a workflow, we will delve into a more complicated case in the next section.</p><h1 id="case-study">Case Study</h1><p>In this case, we have several goals to complete within the CI flow:</p><ol><li>Run the tests</li><li>Upload a file to AWS S3</li><li>Build the Docker image and push it to AWS ECR</li></ol><p>The original <code>main.yml</code> already covers the goal of running the tests, so we&apos;ll skip that part. For the second goal, uploading a file to AWS S3, we want to use Act to test the CI flow in our local environment, so we also need a local AWS service to complete the testing environment. LocalStack is a tool that allows you to develop and test your AWS applications locally, reducing development time.</p><p>For the second and third goals, we will integrate LocalStack with Act to complete the entire CI flow in our local environment.</p><p>Please follow the instructions on the LocalStack official website to install and set up the necessary credentials; I won&apos;t cover that part here.</p><h3 id="upload-a-file-to-s3">Upload a file to S3</h3><p>Next, we use the AWS CLI to create an S3 bucket called <code>demo</code>. Later, we will upload a file to this bucket when we run the CI flow.</p><pre><code class="language-shell">$ aws --endpoint-url=http://localhost:4566 --profile localstack s3 mb s3://demo
make_bucket: demo
$ aws --endpoint-url=http://localhost:4566 --profile localstack s3 ls
2024-07-21 13:13:24 demo</code></pre><p>After successfully creating the bucket, we will add some AWS-related scripts to complete the CI process.</p><pre><code>name: CI
on: push

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup the node
        uses: actions/setup-node@v1

      - name: Install packages
        run: npm install

      - name: Run the tests
        run: npm test

      - name: Install AWS CLI
        run: |
          sudo apt-get update
          sudo apt-get install -y awscli

      - name: Configure AWS CLI
        run: |
          mkdir -p ~/.aws
          echo &quot;[profile localstack]&quot; &gt; ~/.aws/config
          echo &quot;region = us-east-1&quot; &gt;&gt; ~/.aws/config
          echo &quot;output=json&quot; &gt;&gt; ~/.aws/config
          echo &quot;[localstack]&quot; &gt; ~/.aws/credentials
          echo &quot;aws_access_key_id = test&quot; &gt;&gt; ~/.aws/credentials
          echo &quot;aws_secret_access_key = test&quot; &gt;&gt; ~/.aws/credentials
          echo &quot;Complete Configure AWS CLI&quot;

      - name: Upload file to S3
        run: |
          aws --endpoint-url=http://localhost:4566 --profile localstack s3 cp src/upload.txt s3://demo

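      # Note (added for this walkthrough, not part of the original workflow):
      # the test/test credentials above are only safe because LocalStack ignores
      # them; for a real AWS target, pass secrets to act at run time instead,
      # e.g. act push -s AWS_SECRET_ACCESS_KEY or act push --secret-file my.secrets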
</code></pre><p>Then, re-run the <code>act push</code> command and check the result. </p><pre><code class="language-shell">$ aws --endpoint-url=http://localhost:4566 --profile localstack s3 ls s3://demo
2024-08-18 10:50:11          0 upload.txt</code></pre><p>Great! Now we can upload the file to S3 successfully; let&apos;s move on to the next section.</p><h3 id="build-the-docker-image-and-push-it-to-aws-ecr">Build the Docker image and push it to AWS ECR.</h3><p>At the beginning of this section, we need to use LocalStack to create an ECR repository first.</p><pre><code class="language-shell">$ aws --endpoint-url=http://localhost:4566 --profile localstack ecr create-repository --repository-name demo</code></pre><blockquote>Please note that <code>ecr create-repository</code> is a pro feature of LocalStack. If you want to use it, please consider applying for their hobby subscription.</blockquote><p>Then we need the <code>repositoryUri</code> for pushing the Docker image to ECR later; here is another command to retrieve the <code>repositoryUri</code> info.</p><pre><code>$ aws --endpoint-url=http://localhost:4566 --profile localstack ecr describe-repositories
{
    &quot;repositories&quot;: [
        {
            &quot;repositoryArn&quot;: &quot;arn:aws:ecr:us-east-1:000000000000:repository/demo&quot;,
            &quot;registryId&quot;: &quot;000000000000&quot;,
            &quot;repositoryName&quot;: &quot;demo&quot;,
            &quot;repositoryUri&quot;: &quot;000000000000.dkr.ecr.us-east-1.localhost.localstack.cloud:4566/demo&quot;,
            &quot;createdAt&quot;: &quot;2024-08-18T14:36:16+08:00&quot;,
            &quot;imageTagMutability&quot;: &quot;MUTABLE&quot;,
            &quot;imageScanningConfiguration&quot;: {
                &quot;scanOnPush&quot;: false
            },
            &quot;encryptionConfiguration&quot;: {
                &quot;encryptionType&quot;: &quot;AES256&quot;
            }
        }
    ]
}</code></pre><p>After completing the setup, we need to add some steps to finish this section.</p><pre><code>name: CI
on: push

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup the node
        uses: actions/setup-node@v1

      - name: Install packages
        run: npm install

      - name: Run the tests
        run: npm test

      - name: Install AWS CLI
        run: |
          sudo apt-get update
          sudo apt-get install -y awscli

      - name: Configure AWS CLI
        run: |
          mkdir -p ~/.aws
          echo &quot;[profile localstack]&quot; &gt; ~/.aws/config
          echo &quot;region = us-east-1&quot; &gt;&gt; ~/.aws/config
          echo &quot;output=json&quot; &gt;&gt; ~/.aws/config
          echo &quot;[localstack]&quot; &gt; ~/.aws/credentials
          echo &quot;aws_access_key_id = test&quot; &gt;&gt; ~/.aws/credentials
          echo &quot;aws_secret_access_key = test&quot; &gt;&gt; ~/.aws/credentials
          echo &quot;Complete Configure AWS CLI&quot;

      - name: Log in to LocalStack ECR
        run: |
          aws ecr get-login-password --region us-east-1 --endpoint-url=http://localhost:4566 --profile localstack

      - name: Build Docker image
        run: |
          docker build -t github-action-demo:latest .
      
      - name: Tag Docker image
        run: |
          docker tag github-action-demo:latest localhost:4566/demo/github-action-demo:latest

      - name: Push Docker image to LocalStack ECR
        run: |
          docker push 000000000000.dkr.ecr.us-east-1.localhost.localstack.cloud:4566/demo/github-action-demo:latest

      - name: Upload file to S3
        run: |
          aws --endpoint-url=http://localhost:4566 --profile localstack s3 cp src/upload.txt s3://demo

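      # Note (added for this walkthrough, not part of the original workflow):
      # the login step above only prints the password. Against a registry that
      # actually enforces authentication, you would normally pipe it into
      # docker login --username AWS --password-stdin followed by the registry URL.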
</code></pre><p>Next, we will trigger the <code>act push</code> again to see the result.</p><pre><code>$ act push
...
...
...
[CI/build]   &#x2705;  Success - Main Build Docker image
[CI/build] &#x2B50; Run Main Tag Docker image
[CI/build]   &#x1F433;  docker exec cmd=[bash --noprofile --norc -e -o pipefail /var/run/act/workflow/8] user= workdir=
[CI/build]   &#x2705;  Success - Main Tag Docker image
[CI/build] &#x2B50; Run Main Push Docker image to LocalStack ECR
[CI/build]   &#x1F433;  docker exec cmd=[bash --noprofile --norc -e -o pipefail /var/run/act/workflow/9] user= workdir=
| The push refers to repository [000000000000.dkr.ecr.us-east-1.localhost.localstack.cloud:4566/demo/github-action-demo]
0fe414dcbcfc: Layer already exists
905029b2faaa: Layer already exists
85dfe2a5d583: Layer already exists
41d31c4461e4: Layer already exists
e8ec0d6af932: Layer already exists
eb8031fe07bc: Layer already exists
67580865a2da: Layer already exists
66fe03bf3f29: Layer already exists
f752cb05a39e: Layer already exists
20f026ae0a91: Layer already exists
f21c087a3964: Layer already exists
cedb364ef937: Layer already exists
latest: digest: sha256:36dbf0d1894f7af04c8fefba107cb2cd0c609112ad17c71af6d91529cba5a61c size: 2839
[CI/build]   &#x2705;  Success - Main Push Docker image to LocalStack ECR
[CI/build] &#x2B50; Run Main Upload file to S3
[CI/build]   &#x1F433;  docker exec cmd=[bash --noprofile --norc -e -o pipefail /var/run/act/workflow/10] user= workdir=
upload: src/upload.txt to s3://demo/upload.txt
[CI/build]   &#x2705;  Success - Main Upload file to S3
[CI/build] Cleaning up container for job build
[CI/build] &#x1F3C1;  Job succeeded

INFO    &#xFE0F;&#x1F4E3; A newer version of &apos;act&apos; is available - consider ugrading to 0.2.65.</code></pre><p>Congratulations! We can now successfully run the full CI flow to complete our job.</p><h1 id="key-takeaways">Key Takeaways</h1><p>In this post, we demonstrate how to use <a href="https://github.com/nektos/act?ref=blog.taiker.space">Act</a> to develop GitHub Action scripts in your local environment with LocalStack. Doing so can save a lot of time when validating the scripts, enjoy!</p><h1 id="reference">Reference</h1><ul><li><a href="https://www.youtube.com/watch?v=OW121yjV1IM&amp;ref=blog.taiker.space">https://www.youtube.com/watch?v=OW121yjV1IM</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Compare multiple serializer solutions within the Django application]]></title><description><![CDATA[<h1 id="background">Background</h1><p>This blog is inspired by the article &quot;<a href="https://hakibenita.com/django-rest-framework-slow?ref=blog.taiker.space#simple-function">Improve Serialization Performance in Django Rest Framework</a>&quot;. The author compared various serializer solutions in the Django Rest Framework, such as simple functions and regular serializers. 
Unfortunately, after reading this article, I noted that the development packages used are somewhat outdated</p>]]></description><link>https://blog.taiker.space/compare-multiple-serializer-solutions-within-the-django-application/</link><guid isPermaLink="false">6665aa69e18ac2500b9ddcb1</guid><category><![CDATA[python]]></category><category><![CDATA[Django]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Mon, 10 Jun 2024 06:06:56 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1717823369944-fb6b290a0485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8YWxsfDk4fHx8fHx8Mnx8MTcxNzkzODg0MHw&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1717823369944-fb6b290a0485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8YWxsfDk4fHx8fHx8Mnx8MTcxNzkzODg0MHw&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Compare multiple serializer solutions within the Django application"><p>This blog is inspired by the article &quot;<a href="https://hakibenita.com/django-rest-framework-slow?ref=blog.taiker.space#simple-function">Improve Serialization Performance in Django Rest Framework</a>&quot;. The author compared various serializer solutions in the Django Rest Framework, such as simple functions and regular serializers. Unfortunately, after reading this article, I noted that the development packages used are somewhat outdated (versions: they use Python 3.7, Django 2.1.1, and Django Rest Framework 3.9.4.). 
Consequently, I decided to recreate the experiments using the latest package versions to make the results more credible.</p><h1 id="introduction">Introduction </h1><p>Before delving into the detailed experiments, I will list the methods I plan to compare, along with the package versions in my development environment:</p><h3 id="serializer-solutions">Serializer solutions</h3><ul><li>Data Class</li><li>Regular Serializer</li><li>Model Serializer</li><li>Simple function</li><li>Pydantic </li></ul><p>In recent years, Pydantic has emerged as the most widely used data validation library for Python, which is why I&apos;ve included it in my comparison list. </p><h3 id="versions-in-my-local-environment">Versions in My Local Environment</h3><ul><li>Python 3.10</li><li>Django 5.0.6</li><li>Django Rest Framework 3.15.1</li><li>Pydantic 2.7.3</li></ul><h1 id="environment-setup">Environment Setup</h1><p>I created two models, <code>Product</code> and <code>Order</code>, within a new Django project for these experiments, with a relation between <code>Product</code> and <code>Order</code> that makes the experiments more representative of daily operations.</p><blockquote>models.py</blockquote><pre><code class="language-python">from django.db import models


class Product(models.Model):
    id = models.AutoField(primary_key=True)
    name = models.CharField(max_length=100)
    price = models.DecimalField(max_digits=10, decimal_places=2)
    stock = models.IntegerField()
    remark = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.name

class Order(models.Model):
    id = models.AutoField(primary_key=True)
    product = models.ForeignKey(Product, on_delete=models.CASCADE)
    quantity = models.IntegerField()
    order_date = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return f&quot;Order {self.id} for {self.product.name}&quot;
</code></pre><p>And I also create a Django command for populating data.</p><blockquote>populate_data.py</blockquote><pre><code class="language-python">import random
from django.utils import timezone
from django.core.management.base import BaseCommand
from myapp.models import Product, Order

class Command(BaseCommand):
    help = &apos;Populate the database with test data&apos;

    def handle(self, *args, **kwargs):
        Product.objects.all().delete()
        Order.objects.all().delete()
        
        # Create products
        products = []
        for i in range(1000):
            product = Product(
                name=f&apos;Product {i}&apos;,
                price=random.uniform(10.0, 100.0),
                stock=random.randint(1, 100),
                remark=&apos;Remark for product&apos;,
                created_at=timezone.now()
            )
            products.append(product)
        
        # Bulk create products with a batch size of 500
        Product.objects.bulk_create(products, batch_size=500)

        # Fetch all products to get their IDs
        all_products = list(Product.objects.all())
        
        # Create orders
        orders = []
        for i in range(10000):
            order = Order(
                
                product=random.choice(all_products),
                quantity=random.randint(1, 10),
                order_date=timezone.now()
            )
            orders.append(order)
        
        # Bulk create orders with a batch size of 500
        Order.objects.bulk_create(orders, batch_size=500)
        
        self.stdout.write(self.style.SUCCESS(&apos;Successfully populated the database with test data&apos;))</code></pre><p>Next, we need to run the following commands to complete data preparation.</p><pre><code class="language-shell_script">$ python manage.py makemigrations
$ python manage.py migrate
$ python manage.py populate_data</code></pre><h1 id="experiment">Experiment </h1><p>As previously mentioned, we have various methods for comparison. In this section, we will implement the necessary serializer solutions.</p><h4 id="data-class">Data Class</h4><pre><code class="language-python">from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ProductData:
    id: int
    name: str
    price: float
    stock: int
    remark: str
    created_at: datetime

@dataclass
class OrderData:
    id: int
    product: ProductData
    quantity: int
    order_date: datetime
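
# Aside (my addition, not in the original post): dataclasses perform no
# runtime type checking, so a value of the wrong type is stored silently.
# That missing validation is part of why this approach benchmarks so fast.
@dataclass
class _NoCheck:
    price: float

assert _NoCheck(price=[1, 2]).price == [1, 2]  # wrong type, yet no error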
</code></pre><h4 id="regular-serializer">Regular Serializer</h4><pre><code class="language-python">from rest_framework import serializers

class ProductSerializer(serializers.Serializer):
    id = serializers.IntegerField()
    name = serializers.CharField(max_length=100)
    price = serializers.DecimalField(max_digits=10, decimal_places=2)
    stock = serializers.IntegerField()
    remark = serializers.CharField()
    created_at = serializers.DateTimeField()
    
class OrderSerializer(serializers.Serializer):
    id = serializers.IntegerField()
    product = ProductSerializer()
    quantity = serializers.IntegerField()
    order_date = serializers.DateTimeField()</code></pre><h4 id="model-serializer">Model Serializer</h4><pre><code class="language-python">from rest_framework import serializers
from .models import Product, Order

class ProductModelSerializer(serializers.ModelSerializer):
    class Meta:
        model = Product
        fields = [&apos;id&apos;, &apos;name&apos;, &apos;price&apos;, &apos;stock&apos;, &apos;remark&apos;, &apos;created_at&apos;]

class OrderModelSerializer(serializers.ModelSerializer):
    product = ProductModelSerializer()

    class Meta:
        model = Order
        fields = [&apos;id&apos;, &apos;product&apos;, &apos;quantity&apos;, &apos;order_date&apos;]</code></pre><h4 id="simple-function">Simple Function</h4><pre><code class="language-python">from typing import Dict, Any

def serialize_product(product: Product) -&gt; Dict[str, Any]:
    return {
        &apos;id&apos;: product.id,
        &apos;name&apos;: product.name,
        &apos;price&apos;: float(product.price),
        &apos;stock&apos;: product.stock,
        &apos;remark&apos;: product.remark,
        &apos;created_at&apos;: product.created_at.isoformat() if product.created_at else None,
    }

def serialize_order(order: Order) -&gt; Dict[str, Any]:
    return {
        &apos;id&apos;: order.id,
        &apos;product&apos;: serialize_product(order.product),
        &apos;quantity&apos;: order.quantity,
        &apos;order_date&apos;: order.order_date.isoformat() if order.order_date else None,
    }
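
# Aside (my addition, not in the original post): the float() and isoformat()
# calls above are what make these dicts JSON-safe, because json.dumps cannot
# handle Decimal or datetime values on its own.
import json
from decimal import Decimal

try:
    json.dumps([Decimal(1)])  # raises TypeError: not JSON serializable
except TypeError:
    pass
assert isinstance(json.dumps([float(Decimal(1))]), str)  # fine after float()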
</code></pre><h4 id="pydantic">Pydantic</h4><pre><code class="language-python">from pydantic import BaseModel
from typing import List
from datetime import datetime

class ProductDataPydantic(BaseModel):
    id: int
    name: str
    price: float
    stock: int
    remark: str
    created_at: datetime

class OrderDataPydantic(BaseModel):
    id: int
    product: ProductDataPydantic
    quantity: int
    order_date: datetime</code></pre><h1 id="experiment-results">Experiment Results</h1><p>For this experiment, I prepared a Django command and used the <a href="https://pypi.org/project/line-profiler/?ref=blog.taiker.space">line_profiler</a> to measure the performance of the different methods. As you can see below, there are five functions corresponding to the different methods under evaluation. </p><blockquote>profile_serialization.py</blockquote><pre><code class="language-python">from django.core.management.base import BaseCommand
from django.db import connection
from myapp.models import Product, Order
from myapp.dataclasses import ProductData, OrderData
from myapp.serializers import ProductSerializer, ProductModelSerializer, OrderSerializer, OrderModelSerializer, serialize_product, serialize_order
from myapp.pydantic_models import ProductDataPydantic, OrderDataPydantic

class Command(BaseCommand):
    help = &apos;Profile serialization performance&apos;

    def handle(self, *args, **kwargs):
        # Read data from the database with select_related
        orders = list(Order.objects.select_related(&apos;product&apos;).all())
        print(&quot;length of orders:&quot;, len(orders))

        # Dataclass serialization
        self.profile_dataclass_serialization(orders)
        
        # Django Serializer
        self.profile_django_serializer(orders)
        
        # Django ModelSerializer
        self.profile_django_model_serializer(orders)
        
        # Simple Function-based Serialization
        self.profile_simple_function(orders)

        # Pydantic Serialization
        self.profile_pydantic_serialization(orders)

    @profile
    def profile_dataclass_serialization(self, orders):
        order_data = [
            OrderData(
                id=order.id,
                product=ProductData(
                    id=order.product.id,
                    name=order.product.name,
                    price=float(order.product.price),
                    stock=order.product.stock,
                    remark=order.product.remark,
                    created_at=order.product.created_at
                ),
                quantity=order.quantity,
                order_date=order.order_date
            ) for order in orders
        ]

    @profile
    def profile_django_serializer(self, orders):
        order_serializer = OrderSerializer(orders, many=True)
        order_data = [order for order in order_serializer.data]

    @profile
    def profile_django_model_serializer(self, orders):
        order_model_serializer = OrderModelSerializer(orders, many=True)
        order_data = [order for order in order_model_serializer.data]

    @profile
    def profile_simple_function(self, orders):
        simple_serialized_orders = [serialize_order(order) for order in orders]

    @profile
    def profile_pydantic_serialization(self, orders):
        order_data = [
            OrderDataPydantic(
                id=order.id,
                product=ProductDataPydantic(
                    id=order.product.id,
                    name=order.product.name,
                    price=float(order.product.price),
                    stock=order.product.stock,
                    remark=order.product.remark,
                    created_at=order.product.created_at
                ),
                quantity=order.quantity,
                order_date=order.order_date
            ) for order in orders
        ]</code></pre><p>You can run the following command to get the final results, and the table shows the results of different methods.</p><pre><code class="language-shell_script">$ kernprof -l -v manage.py profile_serialization</code></pre><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th><strong>Method</strong></th>
<th><strong>Result (seconds)</strong></th>
<th style="text-align:center"><strong>Data Validation</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Dataclass serialization</td>
<td>0.05273 s</td>
<td style="text-align:center">N</td>
</tr>
<tr>
<td>Django Serializer</td>
<td>0.443061 s</td>
<td style="text-align:center">Y</td>
</tr>
<tr>
<td>Django ModelSerializer</td>
<td>0.423694 s</td>
<td style="text-align:center">Y</td>
</tr>
<tr>
<td>Simple Function-based Serialization</td>
<td>0.032002 s</td>
<td style="text-align:center">N</td>
</tr>
<tr>
<td>Pydantic Serialization</td>
<td>0.086857 s</td>
<td style="text-align:center">Y</td>
</tr>
</tbody>
</table>
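For reference, the post never shows the OrderData/ProductData dataclasses or the serialize_order helper that the benchmark relies on. A minimal sketch consistent with the fields used in the profiling code above (the real definitions in the project may differ) could look like:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ProductData:
    id: int
    name: str
    price: float
    stock: int
    remark: Optional[str]
    created_at: datetime


@dataclass
class OrderData:
    id: int
    product: ProductData
    quantity: int
    order_date: datetime


def serialize_order(order) -> dict:
    # Plain function-based serialization: build the dict by hand.
    # No validation is performed, which is why this row benchmarks fastest.
    return {
        "id": order.id,
        "product": {
            "id": order.product.id,
            "name": order.product.name,
            "price": float(order.product.price),
            "stock": order.product.stock,
            "remark": order.product.remark,
            "created_at": order.product.created_at,
        },
        "quantity": order.quantity,
        "order_date": order.order_date,
    }
```

Because this approach only builds plain dicts, it trades away all data validation, which matches the "N" entries in the table above.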
<!--kg-card-end: markdown--><p>It&apos;s not surprising that simple function-based serialization performed the best in this comparison due to its simplicity and lack of data validation capabilities. However, I was somewhat surprised by the results for Django&apos;s serializers: my impression was that ModelSerializer would perform worse, but the test showed nearly the same performance for both (Serializer and ModelSerializer). This is intriguing, and we will explore this scenario in the next section. </p><p>Upon reviewing the results table again, the most appealing solution to me is Pydantic Serialization; it performs the best among all serialization methods with data validation functionality. Remarkably, there is not much difference in performance between Pydantic and serializations without data validation. This is why Pydantic has recently become the most popular serialization/data validation tool in Python.</p><h1 id="further-discussion">Further Discussion</h1><p>In this section, we will discuss the performance results of the Serializer and ModelSerializer. Upon investigation, both have similar implementation approaches. There are two parts to the implementation:</p><ul><li>Field Handling: Both <code>Serializer</code> and <code>ModelSerializer</code> handle fields in a similar manner once they are defined. <strong>For read operations</strong>, both serializers iterate over the fields and generate the output dictionary.</li><li>Field Definitions: While <code>Serializer</code> requires explicit field definitions, <code>ModelSerializer</code> introspects the model and automatically creates the fields.</li></ul><p>I believe Django and DRF have addressed performance issues with <code>ModelSerializer</code>, which is reflected in today&apos;s experiment results. <code>ModelSerializer</code> has a slight overhead during the initialization phase due to model introspection. 
However, this overhead is negligible during read operations because it happens only once, resulting in performance that is almost identical to that of <code>Serializer</code>.</p><h1 id="conclusion">Conclusion</h1><p>I took some time to complete this experiment, but I acknowledge there may be some omissions or deficiencies. I welcome your corrections. At the end of this post, I believe there is one key takeaway for you. </p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Stop using Django REST Framework serializer solutions, including both the regular Serializer and ModelSerializer. Pydantic would be a better choice for implementing serialization functionality.</div></div><h1 id="reference">Reference</h1><ul><li><a href="https://hakibenita.com/django-rest-framework-slow?ref=blog.taiker.space#modelserializer">https://hakibenita.com/django-rest-framework-slow#modelserializer</a></li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[How to Modularize Your Airflow Project]]></title><description><![CDATA[<h1 id="background">Background</h1><p>As AI &amp; ML technologies mature, more and more companies are beginning to explore the integration of these technologies into their internal products. 
Bringing AI &amp; ML solutions to companies poses a significant challenge in recent years.</p><p>This post will introduce how to use Airflow to build a small</p>]]></description><link>https://blog.taiker.space/how-to-modularize-your-airflow-project/</link><guid isPermaLink="false">65e2c535e18ac2500b9dda78</guid><category><![CDATA[python]]></category><category><![CDATA[ML]]></category><category><![CDATA[Data]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Sun, 03 Mar 2024 18:38:30 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1501820488136-72669149e0d4?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDI1fHxjYXR8ZW58MHx8fHwxNzA5Mjc3MTg5fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1501820488136-72669149e0d4?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDI1fHxjYXR8ZW58MHx8fHwxNzA5Mjc3MTg5fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="How to Modularize Your Airflow Project"><p>As AI &amp; ML technologies mature, more and more companies are beginning to explore the integration of these technologies into their internal products. Bringing AI &amp; ML solutions to companies poses a significant challenge in recent years.</p><p>This post will introduce how to use Airflow to build a small MLOps project aimed at providing a machine learning (ML) pipeline to reduce the cost of deploying ML models. Additionally, we will take some time to discuss modularization in Airflow.</p><p>For this demonstration, we have selected the <a href="https://www.kaggle.com/competitions/titanic?ref=blog.taiker.space">Titanic competition</a>, a renowned challenge on Kaggle, as our example dataset. 
For detailed installation instructions and settings, please refer to my <a href="https://github.com/TaikerLiang/MLOps-Demo?tab=readme-ov-file&amp;ref=blog.taiker.space">Github repo</a>.</p><h1 id="ml-pipeline-lifecycle">ML Pipeline Lifecycle</h1><figure class="kg-card kg-image-card"><img src="https://blog.taiker.space/content/images/2024/03/Screenshot-2024-03-02-at-3.03.46-PM.png" class="kg-image" alt="How to Modularize Your Airflow Project" loading="lazy" width="1656" height="896" srcset="https://blog.taiker.space/content/images/size/w600/2024/03/Screenshot-2024-03-02-at-3.03.46-PM.png 600w, https://blog.taiker.space/content/images/size/w1000/2024/03/Screenshot-2024-03-02-at-3.03.46-PM.png 1000w, https://blog.taiker.space/content/images/size/w1600/2024/03/Screenshot-2024-03-02-at-3.03.46-PM.png 1600w, https://blog.taiker.space/content/images/2024/03/Screenshot-2024-03-02-at-3.03.46-PM.png 1656w" sizes="(min-width: 720px) 720px"></figure><p>As you can see, I have defined the basic pipeline as shown in the figure above. However, due to differences in business models and architectures among different companies, you may need to make some adjustments to fit your current architecture.</p><p>I referred to several articles and videos to define the basic pipeline flow for beginners:</p><ul><li>Step 1: We will load the raw data from S3, and here we will use LocalStack to help us implement this function in our local environment. </li><li>Step 2: Prepare the training data for the model, including data preprocessing, feature selection, etc. </li><li>Step 3: Export the training data for model training. </li><li>Step 4: Perform model training. </li><li>Step 5: Export the evaluation results to users via notification solutions such as Slack or Telegram messages. 
</li><li>Step 6: Export the model to S3.</li></ul><p>Then I divided these steps within the pipeline into four groups:</p><ol><li>Data Preparation Group</li><li>Model Training Group</li><li>Model Deployment Group</li><li>Notification Group</li></ol><p>All implementations revolve around this pipeline flow and these four groups, and these definitions serve as the fundamental concepts of modularization.</p><h1 id="modularization">Modularization</h1><p>Next, let&apos;s delve into modularization within the Airflow project. Most articles and online resources tend to place all related code in a single file. However, from my perspective, this approach becomes challenging to maintain as the project grows. Therefore, I have explored how to modularize within Airflow myself and developed a final solution.</p><p>For each task (i.e., the problem you want to solve using ML), we will create a main DAG file and related functions, which will be organized into different files to achieve modularization.</p><ul><li>dags: This directory contains the main DAG (Directed Acyclic Graph) functions.</li><li>dags/func: Within this directory, sub-folders are organized for modularization. Each folder name corresponds to a main DAG function. Inside each sub-folder, the entire ML pipeline is divided into different steps, with each step having its own file to complete the related function.</li></ul><p>The tree structure looks like:</p><pre><code>./dags
&#x251C;&#x2500;&#x2500; __init__.py
&#x251C;&#x2500;&#x2500; func
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; __init__.py
&#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; titanic
&#x2502;&#xA0;&#xA0;     &#x251C;&#x2500;&#x2500; __init__.py
&#x2502;&#xA0;&#xA0;     &#x251C;&#x2500;&#x2500; data_preparation.py
&#x2502;&#xA0;&#xA0;     &#x251C;&#x2500;&#x2500; model_deployment.py
&#x2502;&#xA0;&#xA0;     &#x251C;&#x2500;&#x2500; model_training.py
&#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; notification.py
&#x2514;&#x2500;&#x2500; titanic.py</code></pre><p>We have a main DAG file named <code>titanic</code> and a sub-folder under <code>func</code> also named <code>titanic</code>. This structure signifies that all functions within this folder are related to the main DAG <code>titanic</code>. This approach allows us to easily organize the folder architecture for multiple ML tasks.</p><blockquote>titanic.py</blockquote><pre><code class="language-Python">from datetime import datetime, timedelta
import json
import pickle
import base64

import pandas as pd
from airflow.decorators import task
from airflow.decorators import dag
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.transfers.local_to_s3 import LocalFilesystemToS3Operator
from airflow.utils.task_group import TaskGroup

from func.titanic.data_preparation import load_data_from_s3, data_preprocessing, export_the_training_data_to_s3, delete_temp_file
from func.titanic.model_training import get_dataset, train_model, evaluate_model
from func.titanic.model_deployment import save_model


@dag(schedule_interval=&apos;@daily&apos;, start_date=days_ago(1), catchup=False, tags=[&apos;example&apos;])
def titanic_flow():


    with TaskGroup(&quot;data_preparation_group&quot;) as data_preparation:
        data = load_data_from_s3()
        train_data = data_preprocessing(data) # type: ignore
        tmp_filename = export_the_training_data_to_s3(train_data) # type: ignore

        upload_to_s3_task = LocalFilesystemToS3Operator(
            task_id=&apos;upload_to_s3&apos;,
            filename=tmp_filename, 
            dest_key=&apos;titanic/train.csv&apos;,
            dest_bucket=&apos;airflow&apos;,
            aws_conn_id=&apos;aws_localstack&apos;,  # Airflow AWS connection ID which can be created through the UI
            replace=True,
        )

        tmp_filename &gt;&gt; upload_to_s3_task
        upload_to_s3_task &gt;&gt; delete_temp_file(tmp_filename)


    with TaskGroup(&quot;model_training&quot;) as model_training:
        dataset = get_dataset(data=train_data)
        model = train_model(dataset[&apos;X&apos;], dataset[&apos;y&apos;])
        results = evaluate_model(model, dataset[&apos;X&apos;], dataset[&apos;y&apos;])


    with TaskGroup(&quot;model_deployment&quot;) as model_deployment:
        # Save the trained model to a temporary file
        model_file_path = save_model(model)

        # Task to upload the model to S3
        upload_model_to_s3 = LocalFilesystemToS3Operator(
            task_id=&apos;upload_model_to_s3&apos;,
            filename=model_file_path,
            dest_key=&apos;titanic/titanic_model.pkl&apos;,
            dest_bucket=&apos;airflow&apos;,
            aws_conn_id=&apos;aws_localstack&apos;, 
            replace=True,
        )

        results &gt;&gt; model_file_path &gt;&gt; upload_model_to_s3


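
# Tip in practice (illustrative comment, not part of the original DAG):
# Airflow wraps each @task&apos;s return value in a single XComArg, so a task
# that needs to hand several values downstream should return a dict rather
# than a tuple; get_dataset above returns the dataset as a dict with keys
# &apos;X&apos; and &apos;y&apos;, which is why the model_training group can
# subscript it as dataset[&apos;X&apos;] and dataset[&apos;y&apos;].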
dag = titanic_flow()</code></pre><p>As you can see, the main DAG function&apos;s result is now very simple and clear. We have successfully abstracted away the details into other functions, allowing the main DAG function to focus solely on the data flow and data passing. Not bad, right?</p><h1 id="some-tips">Some Tips</h1><p>Passing data between each step was the most challenging part during the implementation of this demo project. Perhaps because I was not very familiar with Airflow at the time, I often encountered XComArg-related error messages like the following. This occurred because Airflow automatically wraps your return value in an XComArg. If you want to return multiple values in a single function, please remember to return them in dictionary format instead of tuple format.</p><pre><code>Argument of type &quot;XComArg&quot; cannot be assigned to parameter &quot;data&quot; of type &quot;DataFrame&quot;
  &quot;XComArg&quot; is incompatible with &quot;DataFrame&quot; Pylance(reportGeneralTypeIssues)
(variable) train_data: XComArg</code></pre><h1 id="key-takeaways">Key Takeaways</h1><p>In this post, I have demonstrated how to build a simple MLOps pipeline using Python and Airflow. We also discussed modularization in Airflow, although it may not be the optimal solution (as I am still exploring better approaches), I believe I have shared some valuable insights that may prompt you to rethink your approach.</p><p>Next, there are several areas where this demo project can be further improved, including:</p><ol><li>DAG and unit testing</li><li>Container development environment setup</li><li>Multi-model comparison flow</li><li>Deployment with tags</li><li>Notification implementation</li><li>Integration of CI/CD into the MLOps workflow</li><li>Debugging production code</li></ol><p>In conclusion, I hope you found my ideas on modularizing your Airflow project helpful, and I look forward to continuing to explore the world of Airflow and MLOps in the future. Cheers!</p><h1 id="reference">Reference</h1><ul><li><a href="https://www.kaggle.com/competitions/titanic?ref=blog.taiker.space" rel="nofollow">https://www.kaggle.com/competitions/titanic</a></li><li><a href="https://www.kaggle.com/code/alexisbcook/titanic-tutorial?ref=blog.taiker.space" rel="nofollow">https://www.kaggle.com/code/alexisbcook/titanic-tutorial</a></li><li><a href="https://docs.localstack.cloud/overview/?ref=blog.taiker.space" rel="nofollow">https://docs.localstack.cloud/overview/</a></li><li><a href="https://proclusacademy.com/blog/practical/k-fold-cross-validation-sklearn/?ref=blog.taiker.space" rel="nofollow">https://proclusacademy.com/blog/practical/k-fold-cross-validation-sklearn/</a></li></ul>]]></content:encoded></item><item><title><![CDATA[[Python] How to Improve The Performance of Heavy SQL Joins]]></title><description><![CDATA[<h1 id="background">Background</h1><p>Software engineering is a multidisciplinary field that encompasses the design, development, testing, and maintenance of software 
systems. In the realm of software development, one critical aspect that demands continuous attention is database performance tuning. This process involves optimizing the efficiency of database operations, ensuring that the software interacts with</p>]]></description><link>https://blog.taiker.space/how-to-improve-the-performance-of-heavy-sql-join/</link><guid isPermaLink="false">65a3985ea3f91488698f7e60</guid><category><![CDATA[python]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Sun, 14 Jan 2024 14:18:35 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1619545493446-b378e885c6de?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDd8fGhlYXZ5fGVufDB8fHx8MTcwNTIyMDU0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1619545493446-b378e885c6de?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDd8fGhlYXZ5fGVufDB8fHx8MTcwNTIyMDU0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="[Python] How to Improve The Performance of Heavy SQL Joins"><p>Software engineering is a multidisciplinary field that encompasses the design, development, testing, and maintenance of software systems. In the realm of software development, one critical aspect that demands continuous attention is database performance tuning. This process involves optimizing the efficiency of database operations, ensuring that the software interacts with the database in the most effective and responsive manner. Today I want to share my experience in improving the performance of heavy SQL join.</p><h1 id="case-study">Case Study</h1><p>Alright! First, let me introduce the table schema and provide a sample code. Afterward, we can discuss the issues present in the sample code.</p><pre><code class="language-Python">class OceanShipmentExportInfo(models.Model):
    id = models.AutoField(primary_key=True)
    booking_no = models.CharField(max_length=32, null=True, db_index=True)
    
    
class OceanShipment(models.Model):
    class Meta:
        unique_together = (&quot;lookup_id&quot;, &quot;subscriber&quot;)
   
    subscriber = models.ForeignKey(Subscriber, related_name=&apos;shipment_set&apos;, null=True, on_delete=models.CASCADE)
    lookup_id = models.CharField(max_length=32, null=True)
    type = models.CharField(max_length=1, default=&apos;I&apos;, choices=SHIPMENT_TYPE_CHOICES)
    ETA = models.DateTimeField(null=True, db_index=True)
    oe_info = models.OneToOneField(
        &apos;OceanShipmentExportInfo&apos;, related_name=&apos;shipment&apos;, null=True, on_delete=models.SET_NULL
    )
    MBL_NO = models.CharField(max_length=32, db_index=True, null=True)
    HBL_NO = models.CharField(max_length=32, db_index=True, null=True)
  
</code></pre><pre><code class="language-Python">def search(search_option, subscriber):
    query = OceanShipment.objects.filter_by_subscriber(None, subscriber).filter(type=search_option.type)

    if search_option.keyword:
        query = query.filter(
            Q(HBL_NO__icontains=search_option.keyword)
            | Q(MBL_NO__icontains=search_option.keyword)
            | Q(oe_info__booking_no__icontains=search_option.keyword)
            | Q(lookup_id=search_option.keyword)
        )

    return query.order_by(&apos;-ETA&apos;)</code></pre><p>Next, let&apos;s delve into this piece of code. We have two models named OceanShipmentExportInfo and OceanShipment. In the search function, if there&apos;s a keyword to be searched, we look at the following fields: 1. HBL_NO (OceanShipment), 2. MBL_NO (OceanShipment), 3. booking_no (OceanShipmentExportInfo), 4. lookup_id (OceanShipment).</p><p>Now, the issue arises because booking_no is in the other model (OceanShipmentExportInfo), and there&apos;s a one-to-one relation between <code>OceanShipmentExportInfo</code> and the <code>OceanShipment</code> model. This implies that this query will perform a join operation to complete the search function. The problem is that both <code>OceanShipmentExportInfo</code> and <code>OceanShipment</code> contain millions of records, leading to this query taking almost 7 seconds to complete, which is a very bad performance. </p><h1 id="solutions">Solutions</h1><p>When aiming to optimize SQL queries for speed, the immediate consideration is usually indexing. However, upon reviewing the model, it became apparent that all fields used in filtering and joining were already indexed. (Remember, for <code>OneToOneField</code>, Django automatically creates an index on the foreign key field by default.)</p><p>Upon closer examination, I identified that the join operation (Q(oe_info__booking_no__icontains=search_option.keyword)) was the bottleneck causing the query to slow down. Remarkably, removing this line resulted in a significant improvement, bringing the query performance to 100 ms.</p><p>This led me to explore ways to circumvent SQL JOIN. The solution was a refactoring of the search function, and the results were impressive.</p><pre><code class="language-Python">def search(search_option: SearchOption, subscriber=None):
    query = OceanShipment.objects.filter_by_subscriber(None, subscriber).filter(type=search_option.type)
    booking_ids = None
    if search_option.keyword:
        booking_ids = OceanShipmentExportInfo.objects.filter(booking_no=search_option.keyword).values(&apos;id&apos;)

        if booking_ids:
            query = query.filter(
                Q(HBL_NO__icontains=search_option.keyword)
                | Q(MBL_NO__icontains=search_option.keyword)
                | Q(oe_info__in=booking_ids)
                | Q(lookup_id=search_option.keyword)
            )
        else:
            query = query.filter(
                Q(HBL_NO__icontains=search_option.keyword)
                | Q(MBL_NO__icontains=search_option.keyword)
                | Q(lookup_id=search_option.keyword)
            )

    return query.order_by(&apos;-ETA&apos;)
</code></pre><p>As observed, I retrieved the booking_ids through a separate query operation first. Then, if there are booking_ids, we perform the search using only the oe_info foreign key id without needing a JOIN operation. In summary, we utilized two straightforward search queries without any JOIN operations to achieve the same search function. The outcome is remarkably positive, with the overall function performance improving from 7 seconds to 100 ms. That&apos;s quite impressive.</p><h1 id="key-takeaways">Key Takeaways</h1><p>After implementing this enhancement, I&apos;ve gained some insights to share with all of you. SQL JOIN is a costly operation; while it&apos;s commonly used to address the N+1 problem, we should strive to minimize its usage whenever possible. As demonstrated in this case, employing multiple SQL operations without JOIN can yield better performance, especially when both tables contain a significant number of records.</p><p></p>]]></content:encoded></item><item><title><![CDATA[[Python] Case Study for Handling Python Singleton Errors]]></title><description><![CDATA[<h1 id="background">Background</h1><p>Recently, I encountered an unusual bug, and upon investigation, we discovered that the root cause was related to the singleton object we had implemented earlier. This bug was deeply concealed, making it quite challenging to uncover. In this blog post, I will narrate the entire journey. 
Let&apos;s</p>]]></description><link>https://blog.taiker.space/python-singleton-error-use-case/</link><guid isPermaLink="false">64eda89ca3f91488698f7825</guid><category><![CDATA[python]]></category><category><![CDATA[Web]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Sat, 30 Sep 2023 14:18:32 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1695486739035-c6008878d493?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8YWxsfDYyfHx8fHx8Mnx8MTY5NjA2NjQ3OHw&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1695486739035-c6008878d493?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8YWxsfDYyfHx8fHx8Mnx8MTY5NjA2NjQ3OHw&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="[Python] Case Study for Handling Python Singleton Errors"><p>Recently, I encountered an unusual bug, and upon investigation, we discovered that the root cause was related to the singleton object we had implemented earlier. This bug was deeply concealed, making it quite challenging to uncover. In this blog post, I will narrate the entire journey. Let&apos;s dive in.</p><h1 id="singleton">Singleton</h1><p>First, we need to know what the singleton is and how to implement it in Python. The Singleton pattern restricts the instantiation of a class to a single instance and provides a global point of access to that instance, and the following shows a simple example of creating a singleton object in Python:</p><pre><code class="language-Python">class ConfigGetter:
    _instance = None
    _config_cache = {}

    def __init__(self) -&gt; None:
        print(&quot;__init__&quot;)

    def __new__(cls, *args, **kwargs):
        print(&quot;cls._instance&quot;, cls._instance)
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance</code></pre><p>We use the magic method <code>__new__</code> to complete the singleton in Python. <code>__new__</code> is a class method (i.e., it&apos;s defined on the class itself) and is responsible for creating a new instance of the class. It&apos;s called before <code>__init__</code>, and its primary purpose is to return a new instance of the class.</p><pre><code class="language-Python">class TestView(APIView):
    def get(self, request):
        config_getter = ConfigGetter()        
        return Response(status=status.HTTP_200_OK)
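
# Quick sanity check (illustrative, not in the original post): constructing
# ConfigGetter twice yields the very same object, because __new__ hands back
# the cached instance both times:
#
#   a = ConfigGetter()
#   b = ConfigGetter()
#   assert a is b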
</code></pre><p>Next, we use a simple API to test whether the singleton object works fine or not. We start a web service (Django) and call the API twice. &#xA0;You can see the output like the following:</p><pre><code>Django version 3.2.15, using settings &apos;config.settings.dev&apos;
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
cls._instance None
__init__
cls._instance &lt;src.ConfigGetter object at 0x10aee0e50&gt;
__init__</code></pre><p>The first time, the cls._instance was None, but the second time, it showed it was a ConfigGetter object. &#xA0;The singleton object appears to work correctly, but please be cautious: <code>__init__</code> will still execute twice, which means the attributes of the singleton object may vary depending on your <code>__init__</code>. </p><h1 id="case-study">Case Study</h1><p>Alright, after demonstrating how to implement the singleton in Python, let&apos;s return to the story we initially intended to share. We have a configuration table that controls which crawlers are enabled. When we execute the code, it checks this configuration table using the <code>ConfigGetter</code> object. </p><p>Here&apos;s an example of the code:</p><pre><code class="language-Python">class ConfigGetter:
    _instance = None
    _config_cache = {}

    def __new__(cls, *args, **kwargs):
        print(&quot;cls._instance&quot;, cls._instance)
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_config(self, scac_code: str) -&gt; CarrierConfig:
        if scac_code not in self._config_cache:
            config = self._prepare_config(scac_code=scac_code)
            self._config_cache[scac_code] = config

        return self._config_cache[scac_code]
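
    # _prepare_config is not shown in the post; a hypothetical sketch (the
    # model name and fields below are assumptions, not the real code):
    #
    # def _prepare_config(self, scac_code: str) -&gt; CarrierConfig:
    #     row = CrawlerConfigModel.objects.get(scac_code=scac_code)  # one query
    #     return CarrierConfig(scac_code=row.scac_code, enabled=row.enabled)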
 </code></pre><p>The <code>_prepare_config</code> the function will load the data from the database, and as you can see, the smart us at that time was to utilize the process cache and singleton object to reduce the query times for better performance. &#xA0;So, what problems does this approach introduce?&quot;</p><p>The answer is that we can&apos;t immediately apply database changes to the code. To elaborate, if I update the config table and want the program to apply these changes immediately, can we achieve this with the code we designed above? Absolutely not. So, can we implement it correctly while also reducing database queries?</p><h1 id="problem-definition">Problem definition</h1><p>Okay, let&apos;s recap our problem. We have a web service running on Django, and we want to use a config table to control which crawler is currently enabled. We have two key criteria to fulfill:</p><ol><li>After updating the config table, the program should immediately apply these changes.</li><li>We need to ensure good performance. We don&apos;t want the program to query the config table every time because these changes are infrequent.</li></ol><p>In the beginning, we proposed a simple solution for handling this case; we reset the <code>_config_cache</code> after we updated the config table. It will look like:</p><pre><code class="language-Python">class ConfigGetter:
    _instance = None
    _config_cache = {}

    def __new__(cls, *args, **kwargs):
        print(&quot;cls._instance&quot;, cls._instance)
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_config(self, scac_code: str) -&gt; CarrierConfig:
        if scac_code not in self._config_cache:
            config = self._prepare_config(scac_code=scac_code)
            self._config_cache[scac_code] = config

        return self._config_cache[scac_code]
        
    def reset(self):
        self._config_cache = {}
 </code></pre><p>But after giving it some thought, this solution may not work as expected. In a Django application that runs with multiple processes, each process will indeed have its own Singleton object. This behavior is because each process operates independently and maintains its own separate memory space. </p><p>What does that mean? If we have 5 processes in a Django application, it&apos;s hard to reset <code>_config_cache</code> for all processes, so here&apos;s the next solution: </p><p>Could we create a singleton object across all processes?</p><pre><code class="language-Python">import multiprocessing

class Singleton:
    _instance = None
    _config_cache = {}
    _lock = multiprocessing.Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance
        
    def get_config(self, scac_code: str) -&gt; CarrierConfig:
        if scac_code not in self._config_cache:
            config = self._prepare_config(scac_code=scac_code)
            self._config_cache[scac_code] = config

        return self._config_cache[scac_code]

    def reset(self):
        self._config_cache = {}
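
    # Caveat (added note): the Lock only makes instance creation race-free;
    # each forked worker process still keeps its own copy of _instance and
    # _config_cache in its own memory space, so calling reset() in one
    # process does not clear the cache held by the others.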
 </code></pre><p>Alright, this seems to be shaping up nicely, doesn&apos;t it? We proceeded with testing and reflecting on whether there were any potential issues with this solution. Another concern surfaced: if we implement this Singleton pattern and run both the Django and Celery processes within the same Python interpreter instance, they will indeed share the same Singleton object, all thanks to <code>multiprocessing.Lock()</code>. </p><p>However, if you are running the Django and Celery processes in separate interpreter instances (for example, running them on separate servers), they will not share the same Singleton object. In this case, you would need to use a different method to share the Singleton instance across processes, such as using a shared memory object or a separate server process to manage the Singleton instance.</p><h1 id="summary">Summary</h1><p>In the end, we opted for the Redis cache solution to resolve this issue. After updating the config table, we clear the Redis cache, forcing the program to query the table and rebuild the Redis cache. I hope you found this journey insightful, and if you have any great ideas or better solutions, please leave a comment; I would greatly appreciate it. Thanks for reading!</p>]]></content:encoded></item><item><title><![CDATA[Use Nginx To Enable CORS for Multiple Origins]]></title><description><![CDATA[<h1 id="background">Background</h1><p>Recently, I needed to hand over my previous side project to others, and I started to think about how could I complete it in an easy, simple, and low-communication cost way. That&apos;s when the term &apos;docker-compose&apos; popped into my head. 
So, I decided to complete</p>]]></description><link>https://blog.taiker.space/nginx-allow-multiple-origin/</link><guid isPermaLink="false">64edb8bba3f91488698f782c</guid><category><![CDATA[Web]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Sat, 09 Sep 2023 07:46:18 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1682686578023-dc680e7a3aeb?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wxfDF8YWxsfDF8fHx8fHwyfHwxNjk0MTg3MTAwfA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1682686578023-dc680e7a3aeb?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wxfDF8YWxsfDF8fHx8fHwyfHwxNjk0MTg3MTAwfA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Use Nginx To Enable CORS for Multiple Origins"><p>Recently, I needed to hand over my previous side project to others, and I started to think about how I could complete it in an easy, simple, and low-communication-cost way. That&apos;s when the term &apos;docker-compose&apos; popped into my head. So, I decided to complete the task with &apos;docker-compose&apos;. &#xA0;</p><p>Soon enough, problems started cropping up. This side project consists of three services: the client, console, and API service, each with its own corresponding URL, like these:</p><ul><li>client: domain-a.com</li><li>console: console.domain-a.com</li><li>server: api.domain-a.com </li></ul><p>On the surface, this setup will run into the CORS issue; we will briefly introduce CORS later. To address it, we initially applied a quick fix by adding the &apos;Access-Control-Allow-Origin *&apos; setting to our Nginx.</p><p>However, this solution isn&apos;t perfect because we actually want to permit access to the resource from multiple origins only. 
With &apos;Access-Control-Allow-Origin *&apos;, essentially anyone on the internet can attempt to access our resources, which poses a greater risk.</p><p>In this post, I&apos;ll share how to allow access to your resources from multiple specific origins. Let&apos;s dive in.</p><h1 id="cors">CORS </h1><p>First, let&apos;s get a handle on what CORS is. CORS stands for Cross-Origin Resource Sharing. Why do we need it? Well, it all goes back to the early days when people hadn&apos;t quite caught on to this problem. See, hackers would attempt to create phishing websites, like those horoscope or farm news sites you might have seen. Sneakily tucked behind these innocent-looking sites, they&apos;d slip in some script code to access resources from other websites using your browser&apos;s cookies. This is what&apos;s known as CSRF.</p><p>CORS, along with the SameSite attribute for cookies, can effectively prevent CSRF on the browser side. I won&apos;t delve deeper into this aspect; if you&apos;re interested, there are plenty of resources available on the internet.</p><p>Now, returning to CORS, what does the error actually look like?</p><figure class="kg-card kg-image-card"><img src="https://blog.taiker.space/content/images/2023/09/image.png" class="kg-image" alt="Use Nginx To Enable CORS for Multiple Origins" loading="lazy" width="1350" height="117" srcset="https://blog.taiker.space/content/images/size/w600/2023/09/image.png 600w, https://blog.taiker.space/content/images/size/w1000/2023/09/image.png 1000w, https://blog.taiker.space/content/images/2023/09/image.png 1350w" sizes="(min-width: 720px) 720px"></figure><p>Why? How does the browser figure out that a request has hit the CORS limitation? Well, the browser breaks down the URL into three parts: the scheme, domain, and port. 
When all three components match, it&apos;s considered the &apos;same origin&apos;. Here are some examples:</p><p>https://domain-a.com is my website.</p><ul><li>scheme: https</li><li>domain: domain-a.com</li><li>port: 443</li></ul><p>Next, let&apos;s examine whether the following URLs count as the same origin.</p><ul><li>http://domain-a.com &#x2192; not the same origin (scheme is not the same)</li><li>https://domain-a.com/mike &#x2192; same origin.</li><li>https://news.domain-a.com &#x2192; not the same origin (domain is not the same)</li><li>https://domain-a.com:81 &#x2192; not the same origin (port is different)</li><li>https://domain-b.com &#x2192; not the same origin (domain is not the same)</li></ul><p>By this point, you should have a basic understanding of what CORS is. In the next section, I will show you how to solve it via Nginx.</p><h1 id="nginx-setting">Nginx Setting</h1><p>Most of the time, we opt for Nginx as our web server due to its various advantages, like being lightweight, simple, and stable, among others. Now, considering our prior understanding of CORS rules, how does the server manage which requests can access its resources?</p><p>This is where the <code>Access-Control-Request-*</code> headers come into play. The server determines what to send back as <code>Access-Control-Allow-*</code> headers. Based on these headers, the browser makes the call on which requests can bypass the CORS restriction.</p><p>Here is the simplest solution you can find on the internet.</p><pre><code>location / {
    
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods &apos;GET, POST, PUT, DELETE, OPTIONS&apos;;
    add_header Access-Control-Allow-Headers &apos;DNT,X-Mx-ReqToken,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Authorization&apos;;

    # the remaining settings
    ...
    ...
    ...
}
</code></pre><p>This is clearly not what we want. So, here&apos;s another solution that I&apos;ve come up with recently. We employ a preflight request handler to communicate to the browser under which conditions we can permit CORS. Typically, the <code>Access-Control-Allow-Origin</code> header only accepts a single origin value. Unfortunately, for our side project, we need to allow multiple origins. To meet this requirement, we utilize the &apos;map&apos; directive in Nginx.</p><p>According to the Nginx website, the map directive is used to create a new variable whose value depends on the values of one or more of the source variables specified in the first parameter. For more details, you can check <a href="http://nginx.org/en/docs/http/ngx_http_map_module.html?ref=blog.taiker.space">this link</a>.</p><pre><code>map $http_origin $allowed_origin {
    ~*^http://localhost:3030$      $http_origin;
    ~*^http://localhost:4000$      $http_origin;
}
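
# Note: the map above is evaluated per request. It matches the Origin request
# header ($http_origin) against the whitelisted patterns and copies it into
# $allowed_origin on a match; with no default entry defined, $allowed_origin
# stays an empty string for every other origin.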

location / {

    # Preflighted requests
    if ($request_method = OPTIONS) {
        add_header &quot;Access-Control-Allow-Origin&quot; $allowed_origin;
        add_header &quot;Access-Control-Allow-Methods&quot; &quot;GET, PUT, POST, DELETE, OPTIONS&quot;;
        add_header &quot;Access-Control-Allow-Headers&quot; &quot;Authorization, Content-Type&quot;;
        add_header &quot;Access-Control-Max-Age&quot; 86400;
        return 204;
    }

    if ($allowed_origin != &quot;&quot;) {
        add_header &quot;Access-Control-Allow-Origin&quot; $allowed_origin;
        # The allowed origin varies per request, so tell caches not to reuse
        # this response (and its CORS header) across different origins.
        add_header &quot;Vary&quot; &quot;Origin&quot;;
    }

    # the remaining settings
    ...
    ...
    ...
}
</code></pre><p>By doing this, we can greenlight several specific origins to pass through the CORS barrier when working locally. What&apos;s even more crucial is that we can seamlessly run multiple services and have them communicate with the API service via localhost on different ports using docker-compose.</p><h1 id="summary">Summary</h1><p>In this post, we&apos;ve presented a solution that enables multiple origins to pass the CORS restriction in a more secure way. We&apos;d like to reiterate the importance of avoiding the wildcard setting <code>Access-Control-Allow-Origin *</code> in your Nginx configuration. We hope this post is helpful to you. Thanks!</p><h1 id="reference">Reference</h1><ul><li><a href="https://medium.com/starbugs/%E5%BC%84%E6%87%82%E5%90%8C%E6%BA%90%E6%94%BF%E7%AD%96-same-origin-policy-%E8%88%87%E8%B7%A8%E7%B6%B2%E5%9F%9F-cors-e2e5c1a53a19?ref=blog.taiker.space">https://medium.com/starbugs/&#x5F04;&#x61C2;&#x540C;&#x6E90;&#x653F;&#x7B56;-same-origin-policy-&#x8207;&#x8DE8;&#x7DB2;&#x57DF;-cors-e2e5c1a53a19</a></li></ul>]]></content:encoded></item><item><title><![CDATA[[Python] How To Implement a “Dry Run Mode” In Django]]></title><description><![CDATA[<h1 id="background">Background</h1><p>In the previous post, I mentioned how the &quot;Pagination&quot; concept changed my mind about implementing the data migration via command. 
In this post, I want to introduce the second concept, which has also been very useful for implementing high-quality data migration commands, called</p>]]></description><link>https://blog.taiker.space/how-to-implement-a-dry-run-mode-in-django/</link><guid isPermaLink="false">64ee2b79a3f91488698f7839</guid><category><![CDATA[python]]></category><dc:creator><![CDATA[Paul Liang]]></dc:creator><pubDate>Thu, 31 Aug 2023 15:53:56 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1546387903-6d82d96ccca6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDYwfHx3aGl0ZXxlbnwwfHx8fDE2OTMzMzA0ODZ8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://images.unsplash.com/photo-1546387903-6d82d96ccca6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDYwfHx3aGl0ZXxlbnwwfHx8fDE2OTMzMzA0ODZ8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="[Python] How To Implement a &#x201C;Dry Run Mode&#x201D; In Django"><p>In the previous post, I mentioned how the &quot;Pagination&quot; concept changed my mind about implementing the data migration via command. In this post, I want to introduce the second concept, which has also been very useful for implementing high-quality data migration commands, called <code>Dry Run</code>.</p><p>A <code>Dry Run</code> refers to the process of simulating the execution of a program without actually executing it on the intended target or environment. Next, we will use some examples to illustrate how to implement a <code>Dry Run</code> with a Django command.</p><h1 id="before">Before</h1><p>Okay, we use the same example as the previous post. 
We fetch all the subscriptions from the subscription model and want to copy part of the subscription info over to the task. After selecting all objects from the model, we build a task instance for each subscription in a for loop, and finally use a bulk update to write all the data at once.</p><pre><code class="language-Python">from django.core.management.base import BaseCommand

from model import TaskSubscriptionModel, CarrierTaskModel
from managers import TaskManager


class Command(BaseCommand):  # pragma: no cover
    def handle(self, *args, **options):
        subscriptions = TaskSubscriptionModel.objects.all()
        tasks_to_update = []
        for subscription in subscriptions:
            task_id = subscription.task.id
            task = CarrierTaskModel(
                id=task_id,
                subscriber=subscription.subscriber,
                lookup_id=subscription.lookup_id,
                tags=subscription.tags,
                expire_time=subscription.expire_time,
            )
            tasks_to_update.append(task)

        TaskManager().bulk_update(
            tasks_to_update,
            fields=[&quot;subscriber&quot;, &quot;lookup_id&quot;, &quot;tags&quot;, &quot;expire_time&quot;],
        )
</code></pre><p>What additional issue is present in this example, apart from what we discussed in the previous post? We don&apos;t know how much data will be updated until the command is executed, and by then it may be too late. When performing potentially risky database operations, is there a way to ensure everything progresses smoothly upfront? Can we obtain a preview of how much data will be updated and, if it appears satisfactory, then proceed with the update?</p><h1 id="after">After</h1><p>That&apos;s the value of the <code>Dry Run</code>. Now, let&apos;s see how it works.</p><pre><code class="language-python">from contextlib import contextmanager
import argparse

from django.db.transaction import atomic
from django.core.management.base import BaseCommand

from model import TaskSubscriptionModel, CarrierTaskModel
from managers import TaskManager

class DoRollback(Exception):
    pass


@contextmanager
def rollback_atomic():
    try:
        with atomic():
            yield
            raise DoRollback()
    except DoRollback:
        pass
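
# How rollback_atomic behaves: atomic() opens a real transaction, the wrapped
# code runs for real, and raising DoRollback just before the block would
# commit forces atomic() to roll everything back; the except clause then
# swallows the marker exception. Every query executes; only the commit is
# skipped. Illustrative sketch (SomeModel is a stand-in name):
#
#   with rollback_atomic():
#       SomeModel.objects.create(...)  # the INSERT runs, then is rolled back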


class Command(BaseCommand):
    def add_arguments(self, parser: argparse.ArgumentParser) -&gt; None:
        parser.add_argument(
            &quot;--dry-run&quot;,
            dest=&quot;dry_run&quot;,
            action=&quot;store_true&quot;,
            default=False,
            help=&quot;Actually edit the database or not&quot;,
        )

    def handle(self, *args, **options):
        dry_run = options[&quot;dry_run&quot;]

        prefix = &quot;In the dry run mode, &quot; if dry_run else &quot;&quot;
        atomic_context = rollback_atomic() if dry_run else atomic()

        with atomic_context:
            subscriptions = TaskSubscriptionModel.objects.all()
            tasks_to_update = []
            for subscription in subscriptions:
                task_id = subscription.task.id
                task = CarrierTaskModel(
                    id=task_id,
                    subscriber=subscription.subscriber,
                    lookup_id=subscription.lookup_id,
                    tags=subscription.tags,
                    expire_time=subscription.expire_time,
                )
                tasks_to_update.append(task)

            # The prefix makes it obvious in the output whether this was a rehearsal.
            print(prefix + &quot;number of tasks to update:&quot;, len(tasks_to_update))

            TaskManager().bulk_update(
                tasks_to_update,
                fields=[&quot;subscriber&quot;, &quot;lookup_id&quot;, &quot;tags&quot;, &quot;expire_time&quot;],
            )
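
# Rehearse first, then run for real (your_command is a placeholder; the real
# name is whatever this module is saved as under management/commands/):
#
#   python manage.py your_command --dry-run
#   python manage.py your_command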
</code></pre><p>What&apos;s the key element of the Dry Run process? The answer is &quot;Rollback&quot;. We can leverage this database feature in conjunction with a context manager to achieve it. For the details, please refer to the linked article in the reference section; I won&apos;t delve into it extensively here.</p><p>This approach allows us to employ the &quot;--dry-run&quot; option to control whether the command actually carries out the database operation or not.</p><h1 id="summary">Summary</h1><p>In this post, we discussed what <code>Dry Run</code> is and how to leverage it to enhance the quality of data migration commands. By combining <code>Pagination</code> (as discussed in the previous post) with <code>Dry Run</code>, you can create commands that are even more robust. Enjoy the process!</p><h1 id="reference">Reference</h1><ul><li><a href="https://adamj.eu/tech/2022/10/13/dry-run-mode-for-data-imports-in-django/?ref=blog.taiker.space">https://adamj.eu/tech/2022/10/13/dry-run-mode-for-data-imports-in-django/</a></li></ul>]]></content:encoded></item></channel></rss>