What factors determine the size of the repositories?
The size of the repositories depends on several factors, including how often data is collected and how long it is kept. You can reduce repository size, for example, by configuring Spotlight with longer collection intervals for each alarm, or by lowering the number of days that data collected at shorter intervals is stored.
For example, Playback data is kept for 7 days by default, but you can decrease this to 3 days or less, which leaves room to collect data more often.
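The trade-off between retention and collection interval can be illustrated with a rough estimate. The sketch below is not a Spotlight formula: it assumes storage grows linearly with the number of samples kept, and the function name, the 60-second interval, and the 1024 bytes per sample are hypothetical values chosen only for illustration. Only the 7-day and 3-day retention periods come from the text above.

```python
SECONDS_PER_DAY = 86_400


def estimate_repository_bytes(retention_days, interval_seconds, bytes_per_sample):
    """Rough size estimate, assuming one fixed-size sample per collection interval.

    All three inputs are illustrative assumptions, not Spotlight settings.
    """
    samples_per_day = SECONDS_PER_DAY / interval_seconds
    return retention_days * samples_per_day * bytes_per_sample


# Hypothetical baseline: 7 days of retention, one sample every 60 seconds.
baseline = estimate_repository_bytes(7, 60, 1024)

# Keep data for 3 days instead of 7, same interval.
shorter_retention = estimate_repository_bytes(3, 60, 1024)

# Keep 7 days of data, but collect half as often.
longer_interval = estimate_repository_bytes(7, 120, 1024)

# Keep 3 days of data and collect twice as often as the baseline.
short_and_frequent = estimate_repository_bytes(3, 30, 1024)

for label, size in [
    ("baseline", baseline),
    ("shorter retention", shorter_retention),
    ("longer interval", longer_interval),
    ("3 days, collected twice as often", short_and_frequent),
]:
    print(f"{label}: {size / 1_000_000:.2f} MB")
```

Under this simplified model, halving the collection frequency halves the estimate, and cutting retention from 7 days to 3 leaves enough headroom to collect twice as often while still staying below the baseline.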