I have been in the vendor world for many years now and it’s great. I really enjoy my job and the companies that I have worked for.
What really gets me is some of the blatant lies and mistruths that get reported, especially when they come from so-called analysts or the like. I recently read an Edison article on HP Thin Deduplication and laughed out loud more than once at the claims in the incredibly biased article.
Let’s start with “Post processing removes any direct performance impact”. It may remove it at the time of ingest, but it does not remove the impact it has on CPU and memory altogether. Normally post-process deduplication is a scheduled event that will run when you schedule it. What happens if you run a busy workload during that process? It definitely impacts performance.
Flash helps enable deduplication to happen inline, as long as you also have a decent amount of cache and an OE that can efficiently handle metadata. As capacities get bigger, so do your metadata table needs. NetApp was the first to bring deduplication to market and it was, and still is, a post process. I have seen the impact first hand when this goes wrong because it hasn’t been managed correctly.
Next let’s cover the cost of flash. Data reduction is an integral part of moving to flash, and also an integral part of the TCO of flash, so it should be mentioned. However, what’s key is that when you talk to customers about this, you validate your data. The company that I work for, Pure Storage, will give you an average data reduction of 3:1 for database, 6:1 for VSI and 10:1 for VDI workloads. This information is gathered from our cloud assist portal, which collects data from all our arrays in the wild and reports on the non-thin-provisioned capacity. The article discussed repeatedly claims that deduplication ratios are easily 10:1, whereas HP states online that their average data reduction is 4:1. Why is that important? Well, if you base your cost per GB on data reduction, then 10:1 will make your cost per GB look a lot better than 4:1, but which of the two figures is accurate? The Pure information, like the NetApp ticker information, is consistent messaging across the board. It is key that you look at this information when making your decision, and that it is consistent and verifiable.
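To make the cost-per-GB point concrete, here is a rough sketch of the arithmetic. The raw price of $10 per GB is a made-up figure purely for illustration, and the function name is my own. Only the 4:1 and 10:1 ratios come from the discussion above.

```python
def cost_per_effective_gb(raw_cost_per_gb: float, reduction_ratio: float) -> float:
    """Cost per GB of usable (post-reduction) capacity.

    reduction_ratio is the N in an N:1 data reduction claim.
    """
    if reduction_ratio <= 0:
        raise ValueError("reduction ratio must be positive")
    return raw_cost_per_gb / reduction_ratio


# Hypothetical raw price, assumed only for this example.
RAW_COST_PER_GB = 10.0

for ratio in (4.0, 10.0):
    effective = cost_per_effective_gb(RAW_COST_PER_GB, ratio)
    print(f"{ratio:.0f}:1 reduction -> ${effective:.2f} per effective GB")
```

With that assumed raw price, the same array works out to $2.50 per effective GB at 4:1 and $1.00 at 10:1. The quoted ratio alone moves the headline price by a factor of 2.5, which is why it matters which number is the verifiable one.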
Take a look at the competitive differences, where they rate everyone and claim HP is the best. All four vendors rated have very different data reduction methodologies. Some, like Pure, include deduplication and compression but don’t count thin provisioning; HP counts thin provisioning and deduplication. XtremIO and SolidFire are different again. How can they say that theirs is the best?
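As a minimal sketch of why the methodology matters, the snippet below computes two ratios for the same hypothetical volume: one that counts only data actually written, and one that also counts provisioned-but-unwritten space. The capacities are invented for illustration and are not figures from any vendor. The function names are my own.

```python
def reduction_excluding_thin(written_gb: float, stored_gb: float) -> float:
    """Ratio of data the hosts actually wrote to what lands on flash."""
    return written_gb / stored_gb


def reduction_including_thin(provisioned_gb: float, stored_gb: float) -> float:
    """Ratio of provisioned capacity to what lands on flash.

    Unwritten (thin) space is counted as if it had been 'reduced'.
    """
    return provisioned_gb / stored_gb


# Hypothetical volume, assumed only for this example.
provisioned_gb = 10_000.0  # size presented to the hosts
written_gb = 4_000.0       # data the hosts actually wrote
stored_gb = 1_000.0        # physical space used after dedup/compression

print(f"excluding thin provisioning: {reduction_excluding_thin(written_gb, stored_gb):.1f}:1")
print(f"including thin provisioning: {reduction_including_thin(provisioned_gb, stored_gb):.1f}:1")
```

The same array and the same data give 4.0:1 under one method and 10.0:1 under the other, so ratios from vendors that count different things cannot be ranked against each other directly.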
The report said HP looked at telemetry data from tens of thousands of systems. Really? The report states that an analysis was done between 16KiB and 4KiB block sizes and that there was little difference between the two, with modest savings of 15%. That information was gained from telemetry data from their phone home system. I have seen similar data sent from NetApp, Hitachi and Pure, and I would be very surprised if this was true. The data sent is extremely dense and takes masses of processing power just to manage fault calls, let alone do deep statistical analysis. I would like to see some more information on this. IMHO it is simply a very expensive exercise to change the block size. Look at what happened with EMC XtremIO recently: going from XIOS 2.4 to 3.0 was a destructive upgrade, moving from 4KB to 8KB block sizes.
That’s enough of a rant for now, but I encourage you to do your homework before investing in any new technology. When you look at analyst reports, they are all paid for, but some are more biased than others.
My advice: ask your vendor to prove it. Put a controller on the floor and run real-world workloads, not synthetic ones, on your own data.