Almost every project I have inherited started its data life in a CSV file, and almost every one of them outgrew it painfully. CSV is a wonderful interchange format and a terrible system of record. The transition to a real database is not just a storage swap — it is the moment you decide how the rest of your application talks to data for the next several years. Here is how I make that transition cleanly.
Where CSV Actually Breaks
CSV works right up until two things happen at once: concurrency and relationships. The failure modes are predictable:
- No concurrent writes: two processes appending to the same file corrupt it or silently drop rows.
- No types: everything is a string, so
"0123"becomes123and dates parse differently on every machine. - No integrity: nothing stops a row from referencing a customer that does not exist.
- No real queries: any non-trivial lookup means loading the whole file into memory and filtering.
You can paper over these for a while. You should not. The cost of migrating grows with every feature you build on top of the flat file.
The right time to move off CSV is the first time you write code to join two CSVs in application memory. That is the database asking to be born.
Design a Normalized Schema First
Before touching application code, model the data. Normalization here means each fact lives in exactly one place, with foreign keys enforcing the relationships the CSV only implied. A typical starting point:
CREATE TABLE customers (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
email TEXT NOT NULL UNIQUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE orders (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
customer_id BIGINT NOT NULL REFERENCES customers(id),
status TEXT NOT NULL CHECK (status IN ('pending','paid','shipped','cancelled')),
total_cents INTEGER NOT NULL CHECK (total_cents >= 0),
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_orders_status ON orders (status) WHERE status = 'pending';Notice the constraints doing work the CSV never could: UNIQUE on email, a CHECK on status, money stored as integer cents to avoid floating-point drift, and a partial index for the common "show me pending orders" query. The schema is now enforcing your business rules instead of hoping the application does.
Hide Storage Behind a Repository Interface
The mistake I see most is scattering raw SQL through the codebase. The moment you do that, your storage choice is welded to your business logic. The repository pattern puts a typed interface between them, so the domain depends on an abstraction, not on PostgreSQL.
export interface OrderRepository {
findById(id: number): Promise<Order | null>;
findPendingByCustomer(customerId: number): Promise<Order[]>;
create(input: NewOrder): Promise<Order>;
updateStatus(id: number, status: OrderStatus): Promise<void>;
}
// Postgres implementation lives behind the interface
class PgOrderRepository implements OrderRepository {
constructor(private readonly db: Pool) {}
async findPendingByCustomer(customerId: number): Promise<Order[]> {
const { rows } = await this.db.query(
`SELECT id, customer_id, status, total_cents, created_at
FROM orders
WHERE customer_id = $1 AND status = 'pending'`,
[customerId],
);
return rows.map(toOrder);
}
// ...
}Your services depend on OrderRepository, never on the Pool. This is not abstraction for its own sake — it is what lets you write fast unit tests against an in-memory implementation, and it is the seam through which you migrated from CSV in the first place. The CSV reader and the Postgres client are just two implementations of the same contract.
Migrations and Round Trips
Two operational habits keep this clean once it is live.
First, treat schema changes as versioned migrations checked into the repo, applied in order, never edited by hand against production. Every schema change is a reviewable file with an up step, and your database state is reproducible from zero.
Second, minimize round trips. The repository interface makes it tempting to call findById in a loop, recreating the N+1 problem you escaped CSV to avoid. Instead, expose batch methods — findByIds(ids: number[]) — and let the implementation issue a single WHERE id = ANY($1) query. The interface stays clean while the implementation stays efficient.
The repository pattern gets dismissed as ceremony, but its real value shows up the day requirements change: a new caching layer, a read replica, an analytics export. Each becomes a new implementation behind a stable interface rather than a refactor that touches every call site. That stability is the entire point of a clean data layer.