Load batches using a running sum

So you have designed your beautiful data mart and you want your load procedure to run as fast as it can, but maybe not load too much at once? I’ve tried my luck with running sums, and I’m happy with the result.

What exactly do we want to achieve? The simple answer is: no matter how much data we have to load, we want reasonable runtimes. Most of the time, the shape of the data you load will not differ much; it’s just the amount that varies. That’s good news, because it means you can determine a reasonable number of rows per batch with a bit of testing. Let’s say one million is perfect in our scenario. I know it’s obvious, but I’ll say it: usually, new data flows into our data warehouse as time goes by. As such, we will usually load using a timestamp, be it the natural date when our data was created or the date we loaded it into our DWH.

What we want to do now is create date batches that contain as many rows as possible while staying below one million. In my example, I will use the load date and a very simple design, with a stage table as the source and a data mart table as the target.
Disclaimer: I’m skipping indexes, log messages, transactions and complex checks here, but don’t forget that these exist in the real world 😉
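
To make the snippets below easier to follow, here is a minimal sketch of the two tables involved. The schema and table names are taken from the queries in this post; the col1–col3 types are placeholder assumptions, and real tables would of course carry more columns.

--Placeholder DDL, assuming the stages and data_marts schemas already exist
create table stages.mystage (
    col1            int           null
  , col2            varchar(50)   null
  , col3            decimal(18,2) null
  , stage_load_date datetime2     not null
);

create table data_marts.mytable (
    col1            int           null
  , col2            varchar(50)   null
  , col3            decimal(18,2) null
  , table_load_date datetime2     not null
);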

--A regular delta load, using the load time stamps, nothing special here.
declare @max_table_load_date datetime2 = (select max(table_load_date) from data_marts.mytable);

select
    col1
  , col2
  , col3
  , ...
  , stage_load_date
into #delta
from stages.mystage
where stage_load_date > @max_table_load_date
;
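
One edge case worth a thought: on the very first load, data_marts.mytable is empty, max() returns null, and stage_load_date > null matches nothing. A small guard, for example with isnull, handles that:

--Fall back to a floor date while the target is still empty (first load)
declare @max_table_load_date datetime2 =
  (select isnull(max(table_load_date), '00010101') from data_marts.mytable);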

But what if a single stage_load_date contains so few rows that looping over each date individually would generate more overhead than necessary? Then we want to bundle several dates into one batch.

--First we want to know how many rows we have per stage_load_date. We will use this information later

select
    stage_load_date
  , count(*) as cnt
into #load_dates
from #delta
group by stage_load_date
;
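
Before looping, you can eyeball the running sum that will drive the batch boundaries:

--Running sum of row counts per load date; the loop below slices this into batches
select
    stage_load_date
  , cnt
  , sum(cnt) over (order by stage_load_date) as runsum
from #load_dates
order by stage_load_date
;

With daily counts of, say, 400,000, 350,000 and 300,000, the running sums are 400,000, 750,000 and 1,050,000: the first two dates fit below one million and form the first batch.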

--Now we are going to select the latest date whose running sum still fits our row budget. The running sum is the key to the selection.

declare @max_batch_date datetime2;

while exists (select 1 from #load_dates)
begin
  select
    @max_batch_date = max(sub.stage_load_date)
  from (
    select
        stage_load_date
      , sum(cnt) over (order by stage_load_date) as runsum
    from #load_dates
  ) sub
  where sub.runsum <= 1000000 --that's one million
  ;

  --There is still data to be loaded, but what if a single date holds more than a million rows? Then no runsum fits the limit, @max_batch_date stays null, and we load that date on its own.
  if (@max_batch_date is null)
    set @max_batch_date = (select min(stage_load_date) from #load_dates);

  select
    *
  from #delta
  where stage_load_date <= @max_batch_date
  ;
  --Or do whatever you want to do with your delta. I'm sure there are plenty of different things you can do before inserting it into the data mart (see the insert sketch after the loop).

  --The last part should be the following. We also delete the processed dates
  --from #delta, otherwise the next iteration's <= filter would pick them up again.
  delete d
  from #delta d
  where stage_load_date <= @max_batch_date;

  delete ld
  from #load_dates ld
  where stage_load_date <= @max_batch_date;

end;
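
For completeness, the placeholder select inside the loop could simply be an insert into the target. A minimal sketch, assuming the column names from the first snippet:

--Sketch only: swap this in for the placeholder select inside the loop
insert into data_marts.mytable (col1, col2, col3, table_load_date)
select
    col1
  , col2
  , col3
  , stage_load_date
from #delta
where stage_load_date <= @max_batch_date
;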

I really like this solution because it’s easy to implement and the overhead is very light on resources. I’m using it on SQL Server 2019 with columnstore indexes, and the loop overhead is barely noticeable.