To understand this scenario, it is necessary to explain the deduplication process. Data is stored in the Chunk Store in logical units called pages. The deduplicator splits the stream into chunks, and for each chunk a query against the Chunk Index determines whether that chunk is already stored in a page in the Chunk Store. If the chunk is not present, it is added to the currently open page (which will in turn be added to the Chunk Store) and a new Chunk Index entry is created. The chunk is then recorded in the Manifest for the stream, and processing of that chunk is complete.
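This flow can be sketched roughly as follows. It is a minimal illustration only, not the product's actual code: the names (ChunkStore, chunk_index, manifests, PAGE_CAPACITY) and the fixed-size chunking are assumptions made for the example.

```python
import hashlib

PAGE_CAPACITY = 4  # chunks per page, kept small for illustration (an assumption)


class ChunkStore:
    """Simplified model: pages of chunks, a chunk index, and per-stream manifests."""

    def __init__(self):
        self.pages = []        # committed pages, each a list of chunk payloads
        self.open_page = []    # the single currently open page
        self.chunk_index = {}  # chunk digest -> (page number, offset within page)
        self.manifests = {}    # stream name -> ordered list of chunk digests

    def add_chunk(self, stream, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.chunk_index:
            # New chunk: append it to the open page and index it.
            self.open_page.append(data)
            self.chunk_index[digest] = (len(self.pages), len(self.open_page) - 1)
            if len(self.open_page) == PAGE_CAPACITY:
                # The full page becomes part of the Chunk Store; a new page opens.
                self.pages.append(self.open_page)
                self.open_page = []
        # Whether new or already stored, the chunk is recorded in the stream's Manifest.
        self.manifests.setdefault(stream, []).append(digest)


def deduplicate(store, stream_name, payload, chunk_size=8):
    """Split a stream into fixed-size chunks and run each through the Chunk Index."""
    for i in range(0, len(payload), chunk_size):
        store.add_chunk(stream_name, payload[i:i + chunk_size])
```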
Running deduplicators in parallel is supported, but only one page is open at a time, so multiple deduplicators finding new chunks insert them into that shared page in turn, interleaving chunks from different streams. The more deduplicators run concurrently, the greater the potential for fragmentation of each stream's data.
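To see how the shared open page interleaves data, the sketch below continues the example above by alternating chunks from two streams, roughly as two concurrent deduplicators would; the strict round-robin scheduling is a simplification, not how the scheduler actually behaves.

```python
store = ChunkStore()
streams = {
    "stream-A": b"AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD",
    "stream-B": b"11111111222222223333333344444444",
}

# Pre-split each stream into 8-byte chunks, then interleave them chunk by
# chunk, as two concurrent deduplicators sharing one open page effectively do.
chunks = {name: [data[i:i + 8] for i in range(0, len(data), 8)]
          for name, data in streams.items()}
for a, b in zip(chunks["stream-A"], chunks["stream-B"]):
    store.add_chunk("stream-A", a)
    store.add_chunk("stream-B", b)

# Each stream's chunks now land in every page rather than being grouped
# together; written sequentially, each stream would have fit in one page.
for name in streams:
    pages_used = {store.chunk_index[h][0] for h in store.manifests[name]}
    print(name, "spans pages", sorted(pages_used))
```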
The result of this fragmentation is slower retrieval (restores) of the stream from the Chunk Store and longer run times for Garbage Collection, because retirements leave many small holes spread widely across the Chunk Store instead of being grouped into a smaller number of pages.