Improving 3D NAND Flash Memory Lifetime - new paper
 Editor:- August 28, 2018  - A new
twist using
RAID ideas in
SSD controllers has
surfaced recently in a research paper - 
Improving
3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process
Variation (pdf)	 by Yixin Luo and Saugata Ghose (Carnegie Mellon
University),   Yu Cai (SK Hynix),  Erich F. Haratsch (Seagate Technology) and 
Onur Mutlu (ETH Zürich) - which was   presented at the SIGMETRICS
conference in June 2018.
 
 The authors say that in tall 3D NAND (30 layers and upwards) the raw
error rate in blocks in the middle layers is significantly worse (6x) than in
the top layer. To enable more reliable and faster SSDs using 3D NAND for
enterprise applications they propose a new type of RAID - LI-RAID. ...
read the article (pdf)
 
 
 wrapping up 40 years of memories about  endurance
 
 Editor:-
July 20, 2018 -
wrapping up SSD
endurance (selective memories from 40 years of thinking about endurance) is
my  new blog on the home page of StorageSearch.com
 
 This
may be  my last article on endurance. No more. Ever. I promise.  (I may have
said that before but this time I really mean it.) ...read the article
 
 
 reliability aspects of 100TB SAS SSDs
 
 Editor:- March
19, 2018 - Nimbus Data
Systems  has made another significant advance in the development of
multipetabyte energy-efficient solid state storage racks with the  
announcement
today  that it's sampling 100TB 3.5
SAS SSDs  with
unlimited DWPD.
 
 The
ExaDrive
DC100 has balanced  performance 100K
IOPS R/W and
up to 500 MBps throughput and consumes 0.1 watts/TB - which Nimbus says is 85%
lower than competing drives used in similar array applications - such as  the  
Micron's 7.68TB 5100
SATA SSD.
 
 ExaDrive technology and reliability?
 
 I
asked Thomas
Isakovich, CEO and founder of Nimbus some questions about the new
ExaDrive technology.
 
 Editor - The
50TB models
announced by your flash partners last year used planar 2D flash.     Does
the 100TB family use 3D flash?     Knowing the answer one way or another will
enable some people to make their own judgements about incremental upsides in the
next year or so's roadmap. And also form a view about specification stability
and reliability.
 
 Tom Isakovich - Yes 3D flash for the ExaDrive DC.
 
 Editor
- The issue of cost per drive is an interesting one too. But the companies you
were working with last year have experience in processes which can produce a
high confidence reliable SSD for high value, mission critical markets (like
military) in which the reliability of every single SSD is critical. So my guess
would be that integrators who have a serious interest in the ExaDrive DC100
will be looking at the cost of drive failures on a system population
basis, and that the value of fewer drives and less heat per TB is more important
than the headline cost of a single failed drive.
 
 Tom Isakovich - I have
an interesting subject for you to consider on the topic of "reliability".
Namely, is an SSD any less reliable than an all-flash array? I contend that it
is not. In fact, an SSD is more reliable.
 
     Our ExaDrive DC has flash redundancy internally, with the ability to
lose about 8% of flash dies without any downtime, data loss or capacity
reduction. This is analogous to
RAID in a traditional
all-flash array that protects against media failure. So on the notion of media
redundancy, they are equally redundant. 
I'm thinking more on this. But empirically, an SSD is more
reliable than a System. The user can achieve desired redundancy in their overall
architecture, taking this into consideration.
 
 The ExaDrive DC has a 2.5 million hour MTBF with no moving parts.
That is about 6 times longer than the typical all-flash array (which includes
many active and moving parts). All-flash arrays have integrated power supplies,
active controllers, fans, and other components prone to failure.
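 
 As a rough illustration of what those MTBF figures imply, MTBF can be converted into an annualized failure rate. This is only a sketch: it assumes a constant failure rate (an exponential model), which neither Nimbus nor this article states, and the function name and the derived array MTBF (2.5 million hours divided by 6) are assumptions made for the example.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days x 24 hours


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Chance of at least one failure in a year of continuous operation,
    assuming an exponential (constant failure rate) model."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


ssd_mtbf = 2_500_000        # hours, as quoted for the ExaDrive DC
array_mtbf = ssd_mtbf / 6   # assumption: "about 6 times" shorter, per the quote

ssd_afr = annualized_failure_rate(ssd_mtbf)
array_afr = annualized_failure_rate(array_mtbf)

print(f"SSD AFR:   {ssd_afr:.2%}")
print(f"Array AFR: {array_afr:.2%}")
```

 Under those assumptions the SSD works out at roughly 0.35% per year and the array at roughly 2.1% per year. MTBF is a population statistic, so these numbers say nothing about how long any single drive will last.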
 
 See also:-
rackmount SSDs
 
 
 a Survey of Techniques for Architecting  Hybrid Flash  based SSDs
 
 Editor:-
December 20, 2017 - This month I received a copy of a  new (to me)  paper -
a Survey
of Techniques for Architecting SLC/MLC/TLC Hybrid Flash Memory based SSDs (27
pages pdf) - from  Sparsh Mittal,
Assistant Professor at Indian Institute of
Technology  Hyderabad who is among the co-authors of this  significant 
reference document.
 
 Although the primary purpose of the paper is to
record the comparative design tradeoffs between different memory types in the
same SSD, it also looks at the tactical use of virtualized pSLC.
 
 There
are over 60 cited references to external papers - so it's a rich source of 
ideas for SSD designers.
 
 Here's just a single sentence:- "It is
noteworthy that the technique of Jimenez et al. [33] converts MLC blocks into
SLC when they exhaust their lifetime to benefit from the high endurance of SLC
blocks. By comparison, other soft partitioning techniques perform
mode-transition." ...read the
article (27 pages pdf)
 
 
 the failure to make enough working memory chips
 
 Editor:-
September 7, 2017 - The biggest failure in the SSD market in recent times  was
the collective  failure of all the leading  memory companies to manage the
introduction of their new 3D  technologies in a way which aligned with   past
roadmap predictions and expectations. I discussed the causes and consequences in
2 articles on StorageSearch.com.
 
 
 the reliability difference in solo  industrial SSDs
 
 Editor:-
July 14, 2017 -
Reliability is one
of the concerns which got me interested in SSDs in the late 1980s, and the other
factor was raw speed
- sometimes - but not always - both in the same project. And different ways of
looking at reliability are one of the recurring themes which I notice in stories
about the industrial
SSD market.
 
 Earlier this year I had noticed a statement in one of
the
customer case studies on
the web site of Cactus
Technologies which talked about having delivered 200,000 high
reliability flash storage cards to a customer "without a reported failure".
And from time to time I wondered what did that really mean?
 
 So this
week I asked Steve
Larrivee,  VP Sales & Marketing at Cactus what was the time period
behind the story?
 
 Steve said - "The 200,000 cards were delivered
over a 2 year period over 5 years ago without one reported failure."
 
 Editor's
comments:- I thought  this was an impressive retrospective story and for
customers with applications  where the reliability of each solo SSD is   
critical it's a more convincing  positioning statement about the design and
manufacturing capabilities of the SSD creator than  any  forward reaching
promises can be.
 
 After our exchange of emails Steve wrote a new blog
about this -
Would
Memory Failure Be Catastrophic to your business? - which   included
additional anecdotal failure rates for the same application which happened when
the customer switched to a lower cost memory SSD design from a  competing high
quality supplier.
 
 trust and
services marketing related to enterprise SSD systems
 why
was it so hard to compile a simple list of military SSD companies?
 
 
 hard delays from invisibly fixed soft flash array errors can break
enterprise apps - says Enmotus - arguing need for better storage analytics
 
 Editor:-
June 15, 2017 - Using SSDs as its prime  example - but with a warning shot
towards the future adoption of  NVDIMMs - a new blog -
storage
analytics impact performance and scaling - by Jim O'Reilly  - on
the Enmotus blog site -
describes how soft errors can contribute to application failure due to
unexpected sluggish response times even when the data is automatically repaired
by SSD controllers and
when the self-aware status of the SSDs is that they are all  working exactly as
designed.
 
 That's the needs analysis argument for storage analytics
such as the software from Enmotus
which supports the company's FuzeDrive
Virtual SSD.
 
 Jim says  - "Storage analytics gather data on the
fly from a wide list of "virtual sensors" and is able to not only
build a picture of physical storage devices and connections, but also of the
compute instance performances and VLANs in the cluster.  This data is
continually crunched looking for aberrant behavior." 
...read
the article
 
 Editor's comments:- in my 2012 article -
will SSDs end
bottlenecks? - I said "Bottlenecks in the pure SSD datacenter will be
much more serious than in the HDD world - because responding slowly will be
equivalent to transaction failure."
 
 And in a  2011 article  -
the new SSD uncertainty
principle - I shared the (new to me) wisdom collected by long term
reliability studies of enterprise flash done by
STEC - that  many
flavors of  flash controller management contained within them the seeds of
performance crashes which would only become apparent after years of use as the
data integrity algorithms escalated to progressively more retries and stronger
ECC to deliver reliable data from wearing out (but still usable)  flash.
 
 So I agree with Jim O'Reilly. You do need more sophisticated datasystems
analytics than whether or not an SSD has failed.
 
 The variable  quality
of latency can be a source of
incredibly 
long delays  in server  DRAM too.
 
 
 Soft-Error Mitigation  for PCM  and STT-RAM
 
 Editor:-
February 21, 2017  - There's a vast body of knowledge about data integrity
issues in
nand flash memories. The
underlying problems
and fixes have been one of the underpinnings  of
SSD controller design.
   But what about newer emerging nvms such as PCM  and STT-RAM?
 
 You
know that memories are real when you can read hard data about what goes wrong -
because physics detests a perfect storage device.
 
 A new paper -
a Survey of Soft-Error
Mitigation Techniques for Non-Volatile Memories  (pdf) - by Sparsh Mittal,
Assistant Professor at Indian Institute of
Technology  Hyderabad  -  describes  the nature of soft error  problems in
these new memory types and shows why system level architectures will  be needed
to make them usable. Among other things:-
 
scrubbing in MLC PCM would be required in almost every cycle to keep the
error rate at an acceptable level
 
read disturbance errors are expected to become the most severe bottleneck
in STT-RAM scaling and performance
 
...read the article (pdf)
 
 Microsemi's rad tolerant FPGAs orbit  Jupiter
 
 Editor:-
 September 20, 2016 - Microsemi
today 
announced
that its radiation-tolerant  FPGAs   are in use  on NASA's 
Juno Spacecraft within the
space vehicle's command and control systems, and in various instruments which
have now been deployed and are returning scientific data. Juno recently  entered
Jupiter's orbit after a 5 year journey.
 
 See also:-
Juno
mission (pdf),  
data
chips in space
 
 
 relating NVMdurance's machine learning to  manual tuning
 
 Editor:-
July 29, 2016 - Nearly every  SSD  in the market today from the smallest SSD on
a chip to the bewildering array of rackmount systems can be viewed as a choice
of how to select and mix the raw ingredients of SSD IP and integrate them into
products which (for better or worse) match up to and satisfy user needs. How
these decisions are made depends on the DNA of the product marketers, the
technology teams, familiarity and ease of access to some technologies rather
than others, business pressures and timing, the willingness to take risks, and
sometimes - just luck.
 
 But all products - no matter how complex they
appear - can be analyzed as a specific set of choices made from the architecture
and IP selections which are possible.
 
 In many articles in the past I've
shown you how - whether you're looking at the design of SSDs or systems - there
are rarely more than 2, 3 or 6 raw available  decisions which determine each
piece of the jigsaw.  And I know from the feedback I get from SSD specifiers and
architects that these simple classifications can be useful in helping to compare
different products and even in choosing which competitive approaches are similar
enough to make comparisons worthwhile.
 
 But when you get down into the
details of implementation at each layer in the product design -   every one  of
these dimensional options which go into the permutations blender to shape the
total product identity - can itself  be complex and multilayered.
 
 Take
the example of the raw magic tuning numbers which enable the raw R/W program,
erase, threshold voltages, shaping and timing parameters inside a flash memory.
The question of how much and when has been at the heart of what makes some SSDs
better than others ever since flash was first used in SSDs.
 
 Some SSD
designers   have spent  their whole careers measuring and modeling how these
choices interact with the flash cell and can be tweaked to improve speed, power
consumption and reliability. You can get a flavor of this in my article -
adaptive R/W
and DSP ECC IP.
 
 In a conversation  with
NVMdurance's CEO -
Pearse Coyle
earlier this year (April 2016)  almost the first thing I did was try to relate
and place the work they were doing within the simple  frameworks I'd written
about before.
 
 So I asked him how similar it was to something which I
wrote a long article  about in 
April 2012 - when
SMART announced a
range of SandForce
driven SSDs which had 5x higher  endurance - while using exactly the
same industry  controller - but using 	magic tuning numbers which they had
learned from analyzing the adaptive settings from their own adaptive controller
design.
 
 Pearse said - yes - he knew that work. And what NVMdurance was
doing was the same type of thing.
 
 He said that some leading companies
which had the flash talent had done similar things in their proprietary SSDs
before.
 
 Pearse told me that as the complexity of flash increased -
with more layers and TLC   - it was becoming harder for designers to manually
(or using human expertise) guarantee they were choosing the optimum magic
numbers - because there were now so many variables involved.
 
 Pearse
said that what was different about NVMdurance was that  they were delivering the
magic numbers based on characterising a sample of typically 100 devices and then
performing machine based simulations to see which numbers would work best -
while also using a multi-stage life cycle model - which was designed to use 
different tuning after a fractional amount of the expected endurance had been
used.
 
 As far as he knew from his conversations with memory companies -
no-one else had made the same kinds of investments in this machine intensive
modeling - and that was the key difference - because NVMdurance had a proven
process for delivering good tuning numbers over a variety of memory generations
and types.
 
 I hoped at the time that someone would write a paper saying
more about it. Tom
Coughlin  has done that.
 
 Machine
learning enables longer life high capacity SSDs (pdf) - published this week 
describes the background principles and operation of NVMdurance's pathfinder and
plotter software tools and shows you how NVMdurance have tackled this complex
tuning problem to deliver a software delivered IP which can give endurance
results which are similar to adaptive R/W controllers but which don't
need such expensive  processors or  such complex run-time firmware. ...read
the article (pdf)
 
 
 can memory do more?
 
 Editor:- June 17, 2016 - in a new
blog on StorageSearch.com -
 I ask - where
are we heading with memory intensive systems and software?
 
 All the
marketing noise coming from the DIMM wars market (flash as RAM and Optane etc)
obscures some important underlying strategic and philosophical  questions about
the future of SSD.
 
 When all storage is memory - are there still design
techniques which can push the boundaries of what we assume memory can do?
 
 Can we think of   software as a     heat pump   to  manage the entropy
of memory arrays? (Nature of the memory - not just the heat of  its  data.)
 
 Should
we be asking    more from memory systems?  ...read the blog
 
 
 It's not  worth paying more for SLC  reliability in PCIe SSDs says
Google field study
 
 Editor:- February 26, 2016 - A 6 year
study of 
PCIe SSDs used by
Google (spanning millions of drive days and chips from 4 different flash
vendors)  concluded that SLC drives were not more reliable than MLC.
 
 An important conclusion re RAS is the importance of being able to map
out bad chips within the SSD architecture. This is because somewhere between
2% and 7% of enterprise PCIe SSDs (depending on where they were used) developed
at least one bad chip during the first 4 years in the field - which without such
remapping would necessitate replacing the failed SSD.
 
 The source  is -
Flash
Reliability in Production: the Expected and the Unexpected (pdf) - by Bianca
Schroeder University of Toronto, Raghav Lagisetty and  
Arif Merchant, Google.
 
 This is just  one of a set of
papers
    which was  presented February 22 - 25 , 2016 at  the
14th USENIX Conference on
File and Storage Technologies.
 
 Editor's comments:- For more
like this see the  news
archive - June 2015 which had a story about  a  large scale study of PCIe
SSD failures within Facebook.
Mirabilis discusses role of
deployment  level simulation to optimize   reliability delivered by SSD
controller design tweaks 
 Editor:- August 16, 2015 - "A
diligent system designer can extend the life of an SSD by upto 60% by proper
control of over-provisioning, thus reducing  TCO" says Deepak Shankar,
 Mirabilis Design  in his
recent paper
Extending
the Lifetime of SSD Controllers (pdf) which discusses the role of
application and deployment  level simulations to   explore  the impact of
changing   brews  in  controller
 architectural cocktails.
 
 See also:-
SSD
  overprovisioning articles 2003 to 2015
 
 
 bath tub curve    is not the most useful way of thinking about
PCIe SSD failures   - according to a  large scale study  within  Facebook
 
 Editor:-
June 15, 2015 - A recently published research study -
Large-Scale
Study of Flash Memory Failures in the Field (pdf) - which analyzed   
failure rates of  PCIe
SSDs used in Facebook's infrastructure over a 4 year period - yields some
very useful insights into  the user experience of large populations of
enterprise flash.
 
 Among the many findings:-
 
Read disturbance errors - seem to be very well managed in the enterprise SSDs
studied.
 The authors said  they "did  not observe a statistically
significant  difference  in the failure rate between SSDs that have read the
most amount of data versus those that have read the least amount of data."
 
Higher operational temperatures mostly led to increased failure rates,
but the effect was more pronounced for SSDs which didn't use aggressive data
throttling techniques - which could prevent runaway temperatures by
throttling back their write performance.
More data written by the hosts to the SSDs  over time  - mostly resulted in
more  failures - but the authors noted that in some of the platforms studied -
more data written  resulted in lower failure rates. 
 This was
attributed to the fact some SSD software  implementations  work better at
reducing write amplification when they are exposed to more workload patterns.
 
Unlike the classic bathtub curve failure model which applies to hard drives
- SSDs can be characterized as having an early warning phase - which comes
before an early failure weed out phase of the worst drives in the population
and which precedes the onset of predicted endurance based wearout.
 In
this aspect - a small percentage of  rogue SSDs account for a disproportionately
high  percentage of the total data errors in the population.
  The
report contains plenty of raw data and graphs which can be a valuable  resource 
 for SSD designers and software writers to help them understand how they can  
tailor their efforts towards achieving  more reliable operation.   ...read
the article (pdf)
 
 See also:-
SSD  Reliability
 
 
 HDD failure rates analyzed by models
 
 Editor:- May 27,
2015 - The    reliability of
hard drives in a
cloud related business
(online backup)  is  revealed in a new report -
Hard Drive
Reliability Stats by Backblaze
which includes results   for over 42,000 drives analyzed  across 21 drive
models.
 
 The failure  distribution in the recent quarter  is   model 
and age specific rather than manufacturer specific - which is to say that you
can't say that Seagate is always better or worse than Western Digital. The table
also gives you insights into drive improvements for this type of application.
Failure rates in the quarter were:-
 
up to 1 year old - worst model - 13%
2 - 3 years - worst model - 27%
3 - 4 years - worst model - 3%
5 years - worst model - 32%
 
The data seems to fit in with the
bath tub curve model - with high infant mortality, high failures at the end and
best reliability in the in between periods. ...read the article
 
 high availability enterprise SSD arrays
 
 Editor:- 
January 26, 2012 - due to the growing number of oems in the high availability
rackmount  SSD market
StorageSearch.com today 
published a new directory focusing on
HA enterprise SSD
arrays.
 
 The new directory will make it easier for users to locate  
specialist HA SSD vendors, related news and articles.
 
 
 Pushing data  reliability up hard drive hill
 
 Editor:-
 July 4,  2011 - Why didn't
hard drives get more
reliable?
Enterprise users are still replacing hard drives according to cycles that
haven't changed much since RAID
became common in the 1990s. So why didn't HDD makers do something to make their
drives better?
 
 Error correction code inventor Phil White -
founder of ECC
Technologies has recently published a 
rant
/ blog  in which he describes the 25 years of  rejections he's had from  
leading HDD makers - and the reasons they said they didn't want to use his
patented algorithm - which he says could    increase data integrity and the life
of hard drives (and maybe SSDs too.)  It makes interesting reading for any other
wannabe inventors out there too.  ...read
Phil White's  article
 
 But I think another reason for past
rejections might simply have been  market  economics.
 
 The capacity
versus the cost of HDDs has improved  so much throughout that period - and at
the same time data capacity needs have grown - maybe the user  value proposition
didn't make sense.
 
 If you (RAID user)  find that all your  5 year old
drives are still working  (instead of being replaced) - how much is that really
worth?  By now those 5 year old drives   might only represent  3% to
10%  of  the new storage capacity  you  need anyway.  (The reliability 
value proposition is different outside the service engineer frequented zone - but I
don't want to get side-tracked into
SSD market
models here.)
 
 Looking ahead at the future  of the HDD market my own
view is that whatever the industry does with respect to reliability won't tip
the balance against
SSDs
in the enterprise.
 
 The best bet for the  future of   hard drive
makers  is in consumer products where fashion ranks  higher up the reason to
buy list than longevity. Most people I know replace their notebook pcs, tvs
and phones not because the old ones have stopped working - but because the new
ones have lifestyle  features which make them more desirable.
 
 
 optimizing SSD architecture to cope with  flash  plane errors
 
 Editor:-
 May 26, 2011 -  a new slant on
SSD reliability
architectures is revealed today by Texas Memory Systems
who explained  how their patented Variable Stripe RAID  technology is used in
their recently launched PCIe SSD card - the
RamSan-70.
 
 TMS 
does a 1 month  burn-in of flash memory   prior to shipment.  (One of the
reasons cited for its   use
of SLC rather than
MLC BTW.) 
Through its QA processes the  company has acquired   real-world failure  data
for several generations of flash
memory  and used this to  model and characterize the failure modes which
occur in high IOPs  SSDs.
 
 Most enterprise SSDs use a simple type of
classic  RAID which groups
flash media into "stripes" containing equal numbers of chips. RAID
technology can reconstruct data from a failed Flash chip. Typically, when a chip
or part of a chip fails, the RAID algorithm uses a spare chip as a virtual
replacement for the broken chip. But once the SSD is out of spare chips, it
needs to be replaced.
 
 VSR technology allows the number of chips to
vary among stripes, so bad chips can simply be bypassed using a smaller stripe
size. Additionally, VSR provides greater stripe size granularity, so a stripe
could exclude a small part of a chip rather than having to exclude an
entire chip if only part of it failed - a "plane error". With VSR
technology, TMS says its SSD products will continue operating  longer in the
installed base.
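 
 The general idea of parity striping with a variable stripe width can be sketched in a few lines. This is not TMS's patented VSR implementation, whose details are not given here. It is a minimal illustration using single XOR parity, and the function names (`make_stripe`, `rebuild`, `layout`) and the tiny 2 byte blocks are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks into one block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recover the member at lost_index by XOR-ing all surviving members."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


def layout(data_blocks, healthy_chips):
    """Hypothetical variable-width layout: one stripe member per healthy chip,
    so each stripe holds (healthy_chips - 1) data blocks plus parity."""
    width = healthy_chips - 1
    return [make_stripe(data_blocks[i:i + width])
            for i in range(0, len(data_blocks), width)]


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00", b"\x0f\xf0"]

stripe = make_stripe(data)       # 4 data blocks + 1 parity across 5 chips
recovered = rebuild(stripe, 2)   # pretend the chip holding block 2 failed

narrower = layout(data, healthy_chips=4)  # carry on with 4 chips, no spare used
```

 With a fixed stripe width the failed member would have to be replaced by a spare chip. Here the same data is simply laid out again across the 4 remaining chips, at the cost of a higher parity overhead per stripe.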
 
 Dan Scheel, President of Texas Memory Systems explained why their
technology   increases reliability.
 
 "...Consider a hypothetical
SSD made up of 25 individual flash chips. If a plane failure occurs that
disables 1/8  of one chip, a traditional RAID system would remove a full 4% of
the raw Flash capacity. TMS VSR technology bypasses the failure and only reduces
the raw flash capacity by 0.5%, an 8x improvement. TMS tests show that plane
failures are the 2nd  most common kind of flash device failures, so it is very
important to be able to handle them without wasting working flash."
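 
 The arithmetic in that quote can be checked directly. The sketch below uses only the numbers in the quote (25 chips, a plane failure disabling 1/8 of one chip); the variable names are my own.

```python
from fractions import Fraction

chips = 25
plane_fraction = Fraction(1, 8)    # share of one chip lost to a plane failure

raid_loss = Fraction(1, chips)     # traditional RAID retires the whole chip
vsr_loss = plane_fraction / chips  # VSR retires only the failed plane

print(f"traditional RAID loses {float(raid_loss):.1%} of raw flash")
print(f"VSR loses {float(vsr_loss):.1%} of raw flash")
print(f"improvement: {raid_loss / vsr_loss}x")
```

 That gives 4% against 0.5%, which is the 8x figure quoted.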
 
 Editor's
comments:- by wasting less capacity than simpler RAID solutions - more
usable capacity remains available for traditional 
bad block
management. This extra capacity comes from the over provisioning budget -
a figure which varies according to each SSD design (as discussed in my recent
flash iceberg syndrome article) but
is 30% for TMS.
 
 
 what happens in SSDs when power goes down? - and why you should
care
 
 Editor:-  February 24, 2011  - StorageSearch.com today published 
a new article -
 SSD  power is
going  down! -  which surveys power down management design factors  in 
SSDs.
 
 Why should you care what happens in an SSD when the power goes
down?
 
 This important design feature - which barely rates a mention in
most SSD datasheets and press releases - is really important  in determining SSD
data  integrity and operational reliability. This article  will help you
understand why some  SSDs which work perfectly well in one type of  application
might fail in others...  even when the changes in the operational environment
appear to be negligible. If you thought
endurance
was the end of the SSD
reliability story - think again. ...read the
article
 
 
 Business opportunities from Intel's   imperfect bridge chips
 
 Editor:-
 February 9, 2011  - 
Intel
Knowingly Sells Faulty Chipsets. are they Crazy? is a new article on PCWorld.com which discusses   how Intel
is dealing with the issue of a bridge chip with known defects in some
SATA ports.
 
 I
rarely  read that publication because my interests are enterprise storage and
SSDs - but the   author Keir Thomas
had linked to StorageSearch.com from another  recent  article he wrote - 
Seagate:
SSDs  are Doomed (at Least for Now) - which showed up in my web stats.
 
 When
I started my storage
reliability directory in 2006 - I knew that large storage vendors would ship
flaky SSDs and hard
drives - but I assumed that would be due to the  unwitting and creeping  use of 
inappropriate
design
and testing methodologies
- rather than deliberate business decisions.
 
 Another
characteristic  of  this Intel chip  is that if oems populate all the
RAM slots which it "supports"
- the speed drops down to unattractive levels.
 
 But that's  not  bad
news for everyone. Adrian Proctor,
VP of Marketing at Viking
told me last month it means there's a growing population of DIMM slots on
motherboards which can't be used for RAM - but could be used instead to save
space and power by installing their
SATADIMM
SSDs   to replace HDDs as   boot drives. Other companies make
1 inch and smaller SSDs
too.
 
 
 comparing  SSD and  HDD failure rates in retail
 
 Editor:-
December 10, 2010 -  the   failure rates for
SSDs and
hard drives in  the 
retail channel are compared in a recent article which is part of a regular
feature on the French website HARDWARE.FR.
Because
many consumer  SSD
designs have been flaky - the  apparent  similarities suggested in the
French  report should not be taken to be typical of SSDs as a whole.
 
 On
the contrary - a  much bigger difference in field 
reliability is
suggested by the business models of
industrial
SSD makers  and
enterprise
server SSD makers  for whom better 
reliability is
part of the value proposition - and by anecdotal reports which I've had from
many  data recovery
companies.
 
 
 10,000x more reliable than RAID?
 
 Editor:-  August 26,
2010 - Amplidata
claims that its
BitSpread
 technology is 10,000x    more reliable than current
RAID based technologies
and requires 3x less storage.
 
 Is another  new way of fixing
reliability 
problems in hard disk
arrays  worth the effort just as we approach the end of the
hard disk market's life?
- I doubt it. See why  in - 
this way to the
petabyte SSD.
 
 
 how to make "SSD reliability" believable - marketing
case study
 
 Editor:- July 29, 2010 -  StorageSearch.com today  published
  a new article - 
the cultivation and
nurturing  of  "reliability"  in a 2.5" SSD brand.
 
 Reliability is an
important factor in many applications which use
SSDs.  But can you trust an
SSD brand just because it claims to be reliable?
 
 As we've seen in
recent years - in the rush for the
SSD market bubble -
many design teams which previously had little or no experience of   SSDs were
tasked with designing such products - and the result has been successive waves
of flaky SSDs and
SSDs whose specifications
couldn't be relied on to remain stable and which in many products quickly
degraded at customer sites.
 
 As part of an education series for SSD
product marketers - this new case study describes how one company - which didn't
have the conventional background  to start off with - managed to equate their
brand of SSD with reliability in the minds of designers in the embedded systems
market. ...read
the article
 
 
 Anobit aims at  SandForce SSD SoCs slots
 
 Editor:-
June 15, 2010 - Anobit
announced it is sampling
SSDs based on its patented Memory
Signal Processing technology which provide 20x improvement in operational
life for MLC SSDs in high IOPS server environments.
 
 Based on
proprietary algorithms that compensate for the physical limitations of NAND
flash, Anobit's MSP technology extends standard MLC
endurance
from approximately 3K read/write cycles to over 50K cycles -   to make MLC
technology suitable for high-duty cycle applications. This guarantees drive
write endurance of 10 full disk writes per day, for 5 years, or 7,300TBs
for a 400GB drive, with fully random data (worst-case conditions).
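 
 The endurance figures above are easy to cross-check. This sketch uses decimal units (1TB = 1,000GB). The comparison with the 50K cycle rating is my own back of envelope addition and ignores write amplification, which Anobit's announcement does not quantify.

```python
def lifetime_writes_tb(capacity_gb: float, drive_writes_per_day: float, years: float) -> float:
    """Total host data written over the warranty period, in TB (decimal)."""
    return capacity_gb * drive_writes_per_day * 365 * years / 1000


total_tb = lifetime_writes_tb(capacity_gb=400, drive_writes_per_day=10, years=5)
full_drive_writes = 10 * 365 * 5   # host-level full disk writes over 5 years

rated_cycles = 50_000              # "over 50K cycles" from the announcement
headroom = rated_cycles / full_drive_writes

print(f"{total_tb:,.0f} TB written, {full_drive_writes:,} full disk writes")
print(f"rated cycles / host drive writes = {headroom:.2f}")
```

 The 7,300TB figure matches. The 18,250 full disk writes sit well inside the 50K cycle rating, and the ratio between the two (a little under 3) is the room left for write amplification and other overheads under these assumptions.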
 
 First-generation Anobit Genesis SSDs deliver 20,000 IOPS random
write and 30,000 IOPS random read, with 180MB/s sustained write and 220MB/s
sustained read.
 
 Anobit  says that some of the world's largest NAND
manufacturers, consumer electronics vendors and storage solution providers
currently utilize Anobit's MSP technology in their products.
 
 "For too long, the high prices of SLC SSDs and
concerns about
MLC SSD endurance have slowed the adoption of
flash memory storage in
the enterprise. Anobit Genesis SSDs effectively neutralize both of these
concerns," said Prof. Ehud Weinstein, Anobit CEO. "By delivering true
enterprise-class SSD
reliability at affordable MLC SSD prices, Anobit Genesis SSDs unlock the
full promise of solid-state enterprise storage."
 
 Editor's comments:- superficially the endurance delivered by
Anobit's SSD controller
is better than that obtainable  from
SandForce - whereas
the performance lead is the other way around.  For most oems  what will be more
important is that they do not need to be locked into a single technology
supplier to get adequate metrics for their MLC SSD product lines.
 
 
 flash SSD integrity architectures  for space-craft
 
 Editor:-
 April 13, 2010 -  for those interested in
flash SSD data
integrity issues -  Phil White,  President  of ECC Technologies  has
released a  white paper -
NAND Flash
Memories for Spacecraft (doc).
 
 Phil has been working with ECC for
almost 37 years and his company is developing future ECC designs to
allow systems architects   to develop
NAND flash memories that
are highly reliable
and fault-tolerant even if the NAND flash chips themselves are not so  reliable.
 
 NASA is using ECC Tek's designs in
multiple missions.  2 of the designs are in space at the present time and are
working perfectly.  Phil White  recently wrote a document for NASA and
JPL which outlines how to design NAND
Flash memories for spacecraft.  The  22 page "preview" document 
excludes confidential  data but gives a taste of  the technology available for
licensing.   ...read the
article
 
 
 XLC promises  "enterprise" hybrid x4 SSDs
 
 Editor:-
 April 1, 2010 - XLC
Disk announced details of a paper  it will discuss   later  this month
at the NV
Memories Workshop (UC San Diego)  called - "Paramagnetic Effects on
Trapped Charge Diffusion with Applications for x4 Data Integrity."
 
 The
company says its findings could have applications in the enterprise storage
market by solving the data integrity problems in  x4 MLC SSDs within a new class
of hybrid storage drives. ...read
more
 
 
 New  Integrity Tool for Old  Tape Archives
 
 Editor:-
January 18, 2010 -  Crossroads
Systems today 
announced
details of ArchiveVerify -  a new monitoring option for its
ReadVerify Appliance
that safeguards the future readability of data
backed up on
tape.
 
 "In our experience, the Achilles' heel of a data recovery
strategy is often the uncertainty of the data's readability, and this single
point of failure can render the entire restore process useless," adds
Bernd Krieger, Managing Director, at Crossroads Europe.
 
 Editor's comments:- Crossroads was originally a specialist in
the SAN  router business.
In recent years it has done a lot of work in the area of
storage reliability.
I've read lots of their whitepapers   which describe their research and products
addressing data integrity.  Although there has been a historic trend for users
to migrate away  from
tape to disk backup - many  super users  of  huge 
tape libraries (with the
biggest archives) will be the last to migrate away - due to logistics and cost.
It's those kinds of users who can benefit most from automated tools or services
which increase the data integrity they achieve  and cut down media waste and
unrecoverable events.
 
 
 New article - Data Integrity Challenges in flash  SSD Design
 
 Editor:-
October 12, 2009 - StorageSearch.com
today  published a new article called  -
Data Integrity
Challenges in flash  SSD Design - written by Kent Smith, Senior
Director, Product Marketing, SandForce.
 
 Since
bursting onto the SSD scene
in April 2009, 
SandForce has achieved remarkably
high  reader popularity.
 How did  a company whose business is designing
SSD controllers 
achieve this? - especially when the direct market for its products  today
numbers less than 1,000  oems.
 
 The answer is - that if you want to know
 what the future of 2.5"
  enterprise SATA SSDs might look like - you have to look at the
leading   technology cores that will affect this market.  Even if you're not
planning to use SandForce based products yourself - you can't afford to ignore
them - because they are setting the agenda    in this market.
 
 Reliability is the
next new thing 
for SSD designers and users to start worrying about.  A common theme you will
hear from all fast SSD
companies  is that the faster you make an SSD go - the more effort  you 
have to put into understanding and engineering data integrity to  eliminate the
risk of "silent errors."   ...read the article
 
 
 Real World Reliability in High Performance Storage
 
 Editor:-
August 20, 2009 - Density Dynamics
 published  a    whitepaper  called - 
Real
World Reliability in High Performance Storage (pdf).
 
 It compares
real world failure rates for
HDDs and
flash SSDs with
predicted MTBF and
endurance
data and suggests that the big discrepancies reported by users  are due to the
nature of their workloads.  In this respect it suggests
RAM SSDs are better in
heavy IOPS
apps  - even taking into account the MTBFs of batteries and UPS like components.
 
 It also cites my own article 
RAM Cache Ratios
in flash SSDs.
 
 
 Why Consumers Can Expect More Flaky Flash SSDs!
 
 Editor:-
August 10, 2009 - a
new article published
today on   StorageSearch.com
explains why the  consumer  flash SSD quality problem is  not going to get
better  any time soon.
 
 You know what I mean. Product recalls,     
firmware upgrades,   performance downgrades  and  bad behavior which users did
not anticipate from reading  glowing magazine product reviews.  And that's if
they can get hold of the new products in the first place.
 
 We predicted
this unreliability  scenario many  years ago. And you have to get used to it.
The new article explains why it's happening and gives  some suggested
workarounds for navigating in  a world of  imperfect flash  SSD  product
marketing. ...read
the article
 
 
 Ramtron's  F-RAM Casualty of Auto Market  Crash
 
 Editor:-
May 7,  2009  -   Ramtron
said its  revenue 
declined
26% in the 1st quarter of 2009 compared to the year ago period.
 
 A
sharp decline in orders from the automotive  market was cited as a principal
cause.
 
 Ramtron also announced an update on a legal suit related to  
in-field failures of one of its   F-RAM memory products in an unspecified
application. (In July 2008  Ramtron confirmed that specific batches of product
had  failed due to manufacturing 
process
defects in one of its partners' fabs.)
 
 Ramtron also announced
today that, over the next 2 years, it will transition the manufacturing of
products that are currently being built at Fujitsu's chip foundry located in
Iwate, Japan to its foundry at Texas Instruments in Dallas, Texas and to its
newest foundry at IBM Corp  in Essex Junction, Vermont.
 
 
 Why You Need Better ECC Inside the SSD
 
 Editor:- April
16, 2009 - this week  SandForce
published an article on the subject of  effective 
error correction in flash
SSDs.
 
 I like it because it resonates well with the thinking that
led me to publish this reliability page 3 years ago.
 
 At that time - I
was concerned with the theoretical inadequacy of  error correction used inside
hard drives. (Something
which has since been confirmed in practice and reported    in some of the papers
cited at the top of this page.)
 
 SandForce's short article shows you the
consequences - in terms of uncorrectable errors - if you use "industry
standard" strength ECC.  And that's part of the sales pitch for their  10-to-the-minus-something-better
errors protection in their new  SSD controller.
 
 
 How Good  SSD Controllers Manage Flash Data Integrity
 
 Editor:-
April 3, 2009 - SNIA
has published a new white paper - 
"NAND
Flash Solid State Storage for the Enterprise -  an in-depth Look at Reliability."
(pdf)
 
 It's co-authored by:-  Jonathan Thatcher (Fusion-io), Tom
Coughlin (Coughlin Associates), Jim Handy (Objective
Analysis) and Neal Ekker (Texas Memory Systems).
 
 The
article contains the best  integrated explanation I've seen of the design
trade-offs for    error correction schemes and how they affect bit error rates
compared to the  raw  uncorrected results.  It goes on to   explain the
importance of the SSD controller  and memory architecture  (dispersing data
among many chips) and how these can  improve  data integrity by managing read
disturb errors. It also discusses  wear-leveling and write amplification which
have been well covered elsewhere. ...read
the article
 
 See also:- 
SSD  Reliability -
Understanding Data Failure Modes in Large Solid State Storage Arrays
 
 
 SSD Bookmarks from Texas Memory Systems
 
 Editor:-
March 16, 2009 - Texas
Memory Systems' President, Woody Hutsell   - shares his
SSD Bookmarks    
with readers of 
StorageSearch.com.
 
 Those
who know the SSD industry well, mostly think of  TMS as a company which makes
very fast SSDs  for    accelerating
SAN resident  applications.
But in the many discussions I've had with Woody Hutsell during the past decade -
"reliability" has also been a frequent topic in our conversations.
 
 That's because when you manufacture products which pack more memory chips than
anyone else has ever put into a single box - all those "10 to the minus
something" numbers which relate   physics to semiconductor memory
effects - add up to design problems which are  far from theoretical.  TMS has
been engineering solid state storage  systems for
30 years.
So I was not surprised to see an in depth paper about reliability being one of
the articles in this  
list of bookmarks.
 
 
 New Tool Acts as Bouncer for   Up Market Tape Joints
 
 Boulder,
Colo. - February 3, 2009 - Spectra Logic has extended its Media
Lifecycle Management   technology outside the library with a new reader -  now
shipping.
 
 The MLM Reader (approx $2,500)  is a portable device
that allows customers to check tape health on any computer through
USB, without loading the
tape into a library, and
is designed to proactively identify faulty tape media before it is required for
a data restore.  It  tracks over 30 non-volatile statistics about data tapes,
such as export details; remaining capacity; encryption information; number of
reads and writes; date of last access; born-on date; and cleaning log.   ...Spectra Logic profile
 
 
 SiliconSystems  Proposes New Methodology for  Realistically
Predicting Flash SSD Reliability
 
 Editor:-
December 15, 2008 - Gary Drossel, VP Product Planning at SiliconSystems
has written a   new article    -  "NAND Evolution and its Effects on SSD
Useable Life."
 
 This is probably one of the  3 most
significant articles on the subject of
flash SSD
reliability which have been published in recent years. Starting with a tour of 
the state of the art in the  flash SSD market and technology the paper
introduces several  new concepts to help systems designers understand why
current wear usage models don't give a complete picture.
 
Write amplification -  is a measure of the efficiency of the SSD
controller. Write amplification defines the number of writes the controller
makes to the NAND for every write from the host system.  
The paper discusses
the  theoretical expected lifetimes and amplification factors  for several
applications and concludes that measurement of wear-out in real applications is
the best  way to understand what is happening. It suggests that systems
designers can use the company's SiliconDrive (which includes real-time on-chip 
endurance monitoring)  as an endurance analysis  design tool. By simply 
plugging in SiliconDrive(s) to a new application for a day, week or month - the
percentage of wear-out can be measured - and corrective steps taken (in software
design or overprovisioning) to correct reliability problems.
Wear-leveling efficiency - reflects the maximum deviation of the
most-worn block to the least worn block over time.  
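
 The two terms defined in this item, write amplification and wear-leveling efficiency, can be sketched in a few lines. This is one plausible reading of the definitions, not the SiliconSystems method: the function names are hypothetical, the counters are made-up sample values, and the wear-leveling figure is taken as a simple difference in erase counts.

```python
def write_amplification(nand_bytes_written, host_bytes_written):
    """Writes the controller makes to the NAND for every write from the host."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written


def wear_leveling_spread(erase_counts):
    """Deviation between the most-worn and least-worn block, in erase cycles."""
    return max(erase_counts) - min(erase_counts)


# Hypothetical counters, not measurements from any real drive.
print(write_amplification(nand_bytes_written=3_000, host_bytes_written=1_000))  # 3.0
print(wear_leveling_spread([120, 118, 125, 119]))  # 7
```

 A write amplification of 1.0 would mean the controller writes exactly what the host sends. Higher values use up the flash's rated cycles proportionally faster.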
 What
isn't stated in the article - but is a logical  inference  - is that even if
your product design goal is to buy SSDs from other oems - the SiliconDrives can
be used in your  design process to capture information  in a non invasive manner
which is difficult or impossible to collect using other instrumentation.  ...read the
article (pdf), ...SiliconSystems
profile, storage
reliability
 
 
 iStor Unlocks High Availability Features in Installed iSCSI ASICs
 
 IRVINE, Calif. - October 7, 2008 -
 iStor Networks, Inc.    has begun shipping  a new version of its
software, v2.5, as a no-cost upgrade for all its iSCSI storage solutions.
 
 This software will provide dual-controller
iS512 systems with the
ability to automatically detect malfunctions in the operational controller and
to switch to the redundant controller without loss of data, function or
performance.
 
 "This new software capitalized on the patented
capabilities of iStor's ASIC technology enabling HA capability with no
impact upon system performance before, during or after a controller failure."
said Jim Wayda, iStor's VP of Software Development.  "iStor designed its
controllers from the very beginning to deliver advanced functionality such as HA
and we are very proud that we have been able to demonstrate the investment
protection inherent in iStor's approach of implementation..." 
...iStor  profile,
iSCSI, 
storage reliability
 
 
 Can You Trust Your Flash SSD's Specs?
 
 Editor:- July 9, 2008  -
STORAGEsearch.com today published a new article which asks - Can you
trust your flash SSD specs?
 
 The
flash SSD market
opens up  tremendous opportunities for   systems integrators  to leverage solid
state disk technology. But due to the diversity of products in the market and
lack of industry standards - it's got tremendous risks as well.
 
 The
product which you carefully qualified may not be identical to the one that's
going into your production line for a variety of reasons... ...read the article
 
 
 Preparing for the Next Phase in the SSD Market Revolution
 
 Editor:-  June 25, 2008 - STORAGEsearch.com
 today  called for new papers  on the theme -  "Understanding Data
Failure Modes in Large Solid State Storage Arrays".
 
 Multi-terabyte
solid state storage arrays are seeping into the server environment in the same
way that RAID systems did
back in the early 1990s.
 
 But just as those RAID pioneers learned that
there was a lot more to making a reliable disk array than stuffing a bunch of PC
 hard disks into a
box with a fan and a  
power supply - so too will multi-terabyte
SSD users discover that
problems which are undetectable or do no harm  in small SSDs  can lead to
serious data corruption risks when those same  SSDs are scaled up without the
right   architecture and sometimes with it in place too.
 
 I know from  
the emails I get that many  readers  think that once they've looked at the
single issue of flash
endurance - they've covered the bases for enterprise SSDs.
 
 That's
why storagesearch.com is planning to publish a collection of definitive
technology articles to help guide the industry through this risky transition
process.
 
 The  new articles  will provide users with the   theoretical
justifications they need when they are faced with the difficult economic choices
that come from deploying   different types of SSDs (with  different cost models)
in diverse applications within their organizations.  ...read the article
 
 
 Disk Error Correction Company Gets $22 million Funding
 
 Santa Clara, Calif. -  April 9,
2008 - Link_A_Media Devices Corp  secured $22 million in Series
B financing.
 
 The funding round, led by
AIG SunAmerica Ventures,
was secured from 4 additional financial and corporate investors -
KeyNote Ventures,
NEC Electronics,
Micron  and
Seagate.
 
 Link_A_Media Devices is developing a new class of
chip controller resident
data recovery solutions  for 
HDDs and
SSDs. These   are
designed to exceed the performance of conventional methods deployed in
peripheral storage devices, as well as provide adaptive features that can be
used during manufacturing to improve drive yields and product margins.  
...Link_A_Media
Devices profile
 
 Editor's comments:- MLC flash SSDs have 
high internal
error rates and are currently unrecoverable. It looks like Link_A_Media's
technology could improve the odds of
data recovery in
failed devices which incorporate its technology (as well as reducing data errors
while the SSD is still operational.)
 
 Another side effect of their 
technology  may be better
performance in
flash SSDs.
 
 Link_A_Media
 says their IOP
Buster architecture enables scalability within the controller to address
various segments of SSD applications seamlessly.  It enables faster Read and
Write transfers.
 
 
 Spectra Libraries will Log Tape Health Metrics
 
 SNW,
ORLANDO, FL  - April 8, 2008 - Spectra Logic announced details of its
soon to be released new Media Lifecycle Management  software for its  tape
library customers.
 
 MLM will reduce backup failures  by tracking
more than 30 pieces of information about individual LTO tapes and logging this
on the tape's built in flash chip. Information such as: born-on date, number
of reads and writes, error rate, media quality, date of last access, application
usage, encryption information, cleaning log and remaining capacity are tracked.
MLM and BlueScale are compatible with all major
backup applications.  
...Spectra Logic
profile
 
 Editor's comments:- already past the decline and now in
the fall years of the tape
library market it looks like customers will get all kinds of useful
information and services which they probably would have liked to have before.
This sounds similar in concept to the
SMART
logs in hard disks
and SiSMART
 in SiliconSystems'
flash SSDs.
 
 
 Pillar's Petabyte Arrays are 99.999% Available
 
 San
Jose, Calif. - April 7, 2008  - Pillar Data Systems  today announced
availability of the Pillar Axiom 500MC - a mission critical storage system.
 
 The Pillar Axiom 500MC delivers up to 192GB of cache, with the ability
to scale capacity to 1.6 petabytes. The system supports both
fibre channel and
SATA disk drives. 
Pillar   guarantees 99.999% availability.   ...Pillar profile
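
 For readers who want to translate "five nines" into something tangible, the sketch below converts an availability percentage into a yearly downtime budget. It is generic arithmetic, not Pillar's definition of how availability is measured. The function name is an assumption and leap years are ignored.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes_per_year(availability_percent):
    """Downtime budget per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


print(round(max_downtime_minutes_per_year(99.999), 2))  # 5.26
```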
 
 
 Does Unhappy Notebook Maker Have  High Rate of SSD Flash Backs?
 
 Editor:-
March 19, 2008 - a report discussed  in an article on CNET saying  that
flash  SSDs in notebooks are incurring double digit  customer reject rates  has
been dismissed by Dell as "untrue."
 
 
 Study  Enumerates Key Factors in    Disk Array Failures
 
 Editor:-
March 6, 2008 - a recently published paper called - Are Disks the Dominant
Contributor for Storage Failures? - reports on a 3 year study of   nearly 2
million operating disks.
 
 Among the many findings:-  the
annualized failure rate in near-line systems which mostly use
SATA disks is
approximately twice as high as in systems  which mostly use
fibre-channel  disks. 
But   other factors such as datapath resilience, presence or absence of
RAID and
reliability of the
rack system components are just as significant contributors to storage
reliability as the hard
disks themselves. 
...read
the article
 
 
 Are MLC SSDs Ever  Safe  in Enterprise Apps?
 
 Editor:- February
27, 2008 -  STORAGEsearch.com published a new article today called - Are
MLC SSDs Ever  Safe  in Enterprise Apps?
 
 This is a follow up
article to  the popular  
SSD Myths and
Legends which,  in early 2007, demolished the myth that  flash memory 
wear-out (a  comfort blanket beloved  by many  
RAM SSD makers)
precluded the  use of flash  in heavy duty datacenters.
 
 This new
article  looks at the risks posed by   MLC Nand Flash SSDs which have recently 
hatched from their breeding ground as chip modules  in  cellphones  and morphed
into  hard disk  form
factors.  It  starts down a familiar lane but an unexpected  technology  twist
(which arrived in my email this morning) takes you to a  startling new  world of
possibilities.  ...read the article
 
 
 WEDC Targets Medical CompactFlash Market
 
 Phoenix, AZ -  December 19,
2007 - White Electronic Designs Corp  is leveraging its defense industry
experience and expertise to develop high-reliability modules for the growing
portable medical device market.
 
 According to the U.S. Census
Bureau, there will be an expected 40 million persons in the U.S. over the age of
65 by 2010, driving the need for portable medical devices, especially for home
use. The portable medical device market is  driven by the same requirements
and expectations as the defense segment; such as high quality and reliability,
shorter development cycles, a well-defined and documented supply chain and
extended product lifecycles.  Among other products  WEDC designs and
manufactures   one of the industry's first medical series  CompactFlash cards.
...White Electronic
Designs profile
 
 Editor's comments:- WEDC has also recently
published a paper 
Is All
CompactFlash Really Created Equal? (pdf) which uses the medical
instrumentation market as the backdrop for a discussion about
flash SSDs similar
to those concerns analyzed in
SSD Myths and
Legends - "write endurance" - which looked at the enterprise
server market.
 
 
 Patent May Suit  High Reliability  SSD OEMs
 
 MINNETONKA, MN - November 23, 2007 - ECC
Technologies, Inc.   announces that its parallel Reed-Solomon   error
correction designs and US Patent are immediately available for licensing.
 
 PRS encoder and decoder designs allow parallel I/O storage devices to
be designed with automatic, built-in backup (fault-tolerance).  PRS applied to
flash SSDs (for
example) enables SSDs to be designed that can tolerate NAND Flash chip failures.
 PRS can also be applied to Hard Disk Arrays.   Potential licensees can read
about the PRS technology applied to
SSDs  and to
HDDs on these
preceding links. ...ECC
Technologies profile,
storage reliability
 
 Editor's
comments:- in the early days of a fast growing technology market most
vendors are too busy growing their  revenue by selling products to customers.
But when markets get big enough or growth rates slow down - another round kicks
in - of harvesting money from those who succeeded in the market - but didn't
protect themselves properly with patents.
 
 When I was a young engineer
several designs of mine did get patented. In one particular company  I remember
being asked to leaf through some 10 year old logbooks of my predecessors to find
some prior art to help nullify a competitor's potential attack. I always
preferred doing things my own way - so I grumbled   at being asked to delve into
these dusty old files.  But  I did find what my boss was looking for.
 
 
 Panasas Solution Targets  RAID Unreliability
 
 FREMONT, CA - October
9, 2007 - Panasas, Inc.  announced the Panasas Tiered Parity
Architecture which the company claims is  the most significant extension to disk
array data reliability since Panasas CTO Garth Gibson's pioneering RAID
research at UC-Berkeley in 1988.
 
 With the release of the
ActiveScale 3.2 operating environment, Panasas will offer an innovative
end-to-end Tiered-Parity architecture that addresses the primary causes of
storage reliability
problems and provides the industry's first end-to-end data integrity checking
capability.
 
 Traditional
RAID implementations
protect against disk failures by calculating and storing parity data along with
the original data.
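
 As a reminder of what "calculating and storing parity" means in a traditional single-parity array, here is a minimal sketch using byte-wise XOR. It illustrates the general RAID 4/5 idea only, not the Panasas Tiered Parity design, and the block contents and function names are invented for the example.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equal-length blocks; the result is the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])


data = [b"disk", b"fail", b"test"]  # three hypothetical data blocks
parity = xor_parity(data)

# Pretend the middle block is lost and rebuild it from the other two plus parity.
recovered = rebuild_missing([data[0], data[2]], parity)
print(recovered == data[1])  # True
```

 The rebuild step has to read every surviving block in full, which is why a media error on a surviving disk during a rebuild matters so much.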
 
 In the past 10 years, individual disk drives have
become approximately 10x more reliable and over 250x denser than those protected
by the first generation RAID designs in the late 1980s. Unfortunately, the
number of disk media failures expected during each read over the surface of a
disk grows proportionately with the massive increase in density and has now
become the most common failure mode for RAID. A RAID disk failure can cause loss
of all the data in a volume which may be tens of terabytes   or more. Recovery
of the lost data from tape
(assuming it is all backed up) can take days or even weeks.
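
 The density argument above can be made concrete with a rough model. The sketch below assumes a fixed unrecoverable bit error rate and independent errors. Both are simplifications, and the 1e-14 rate and the two capacities (chosen to be 250x apart) are illustrative assumptions rather than figures from Panasas.

```python
import math


def p_media_error_on_full_read(capacity_bytes, unrecoverable_bit_error_rate):
    """Probability of at least one unrecoverable read error over a full-disk read."""
    bits = capacity_bytes * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits * math.log1p(-unrecoverable_bit_error_rate))


ASSUMED_RATE = 1e-14  # illustrative datasheet-style figure, not from the source
for capacity_gb in (4, 1000):  # hypothetical capacities 250x apart
    p = p_media_error_on_full_read(capacity_gb * 10**9, ASSUMED_RATE)
    print(f"{capacity_gb}GB full read: {p:.4%} chance of a media error")
```

 While the probabilities remain small, the chance of hitting an error scales almost linearly with capacity, which is the effect the press release describes.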
 
 Other
storage system vendors recognize this same issue and apply RAID 6, often called
double parity RAID, to address this problem. Double parity schemes only treat
the symptom of the failure, not the cause, and they carry substantial cost
and performance penalties, which will only get worse as disk drive densities
continue to increase.
 
 Panasas Tiered Parity architecture directly
addresses the root cause of the problem, not the symptom. Solving the storage
reliability problem caused by these new 1TB and larger disks allows Panasas to
build larger and more reliable storage systems that let users get more value from
their data and are less expensive for IT to support.
 
 "The challenges with storage system reliability today have
little to do with overall disk reliability, which is what RAID was designed to
address in 1988. The issues that we see today are directly related to disk
density and require new approaches. Most secondary disk failures today are the
result of media errors, which have become 250x more likely to occur during a
RAID failed-disk rebuild over the last 10 years," said Garth Gibson, CTO of
Panasas. "Tiered Parity allows us to tackle media errors with an
architecture that can counter the effects of increasing disk density. It also
solves data path reliability challenges beyond those addressed by traditional
RAID and extends parity checking out to the client or server node. Tiered Parity
provides the only end-to-end data integrity checking capability in the industry."
  ...Panasas profile
 
 Editor's
comments:- the problem of data corruption in large data sets because of  
obsolete  technology assumptions built into   hard disks, interface and RAID 
products has been looming for several years. You can see articles and research
about this on the  storage
reliability page.
 
 Is the solution   more reliable hard drives?
better interfaces? or a  smarter storage OS? Users can't wait another 5 years
for ideal solutions because the symptoms are there today when you look. The
Panasas solution sounds like a pragmatic tactical approach for some customers -
but the industry is a long way from a better storage reliability  mousetrap.
 
 
 Why Sun will Shine with  a New Lustre
 
 SANTA
CLARA, Calif -  September 12, 2007 -  Sun Microsystems, Inc.  today said
it will acquire the majority of Cluster File Systems, Inc.'s
intellectual property and business assets, including the Lustre File System.
 
 Sun intends to add support for  Solaris OS on Lustre and plans to
continue enhancing Lustre on Linux and Solaris OS across multi vendor hardware
platforms.  
...Sun Microsystems profile,
Acquired storage companies
 
 Editor's
comments:- I hadn't heard of this company before. A sure sign  that they
were heading straight for the
gone away storage
companies list without any deviations en route. Here's what I picked up from
their web site present and
past.
 
 The  
Lustre
product description (pdf)  says -  "the Lustre architecture was first
developed  at Carnegie Mellon University as a research project in 1999."
The company's website started in about 2001 and they released Lustre 1.0 in
2003. By
2004
they had a product ready for a bigger market.
 
 Strangely enough Solaris
support isn't listed as a strong  feature in their   recent
roadmap. So why does Sun
want this technology? - Well - even if you're not in the supercomputer business
- some technologies which start there eventually trickle down to the rest of us.
"Zero single points of failure" - mentioned on their home page - is a
good enough reason. As I wrote in my
7 year storage market
predictions (2005)  storage
reliability is going to become a major headache in enterprise storage in the
next 5 years.
 
 See also:- Robin
Harris's blog which explains the business background to CFS - "why
aren't they rich?"
 
 
 Tapewise Enterprise Checks Tape Media Errors
 
 Farnborough
 UK -  September 18, 2007 - Data Product Services today announces the
release of Tapewise Enterprise.
 
 Tapewise is software that writes
data to a tape and then reads it again, tracking any errors, soft recoverable
ones or unrecoverable ones, that occur. It streams a whole tape through a drive
in this way and, with its Tape Error Map   technology, produces a 3D graph
showing errors encountered along the length of a tape when data was being read
and written.
 
 The user can decide what an acceptable error rate is
and that boundary will be shown on the graph with any error rates above the
user-defined norm instantly visible.   The software supports a large number of
tape formats: 3480; 3490; DLT; SDLT; 3590; 9840; 9940; T10000; LTOs 1, 2 and 3
and 3592.  Costs start at $16,000 approx.  A free 14-day evaluation copy is
available. ...Data Product
Services profile, 
Tape drives,
Storage Testers
 
 
 Noise Damping Techniques for PATA SSDs
 
 Editor:-
August 10, 2007 -  SiliconSystems today published a new white paper 
called -   "Noise Damping Techniques for PATA SSDs   in Military-Embedded
Systems."
 
 This article looks at electronic signal integrity
issues in integrating high speed PATA SSDs. It helps electronic designers
understand  how factors such as ground bounce, loading, power supply noise and
signal trace mismatches can lead to false data or even device damage. Examples
given in the tutorial style commentary  include scope shots and logic analyzer
traces. ...read
the article, ...SiliconSystems
profile, storage  chips,
storage analyzers
 
 Editor's
comments:-  the article gives a good grounding (couldn't resist that one) in
the signal quality factors  needed to  get high
reliability
operation and is equally relevant to
hard disks. To simplify
the 20 page document:- if you connect reliable electronic modules using
unreliable signal paths - that will compromise the integrity of the data. Logic
states are virtual - but digital signals are real and can have completely
different shapes to what you expect if you don't follow basic rules.
 
 
 Squeak! - Green Storage - What's Green. What's Not
 
 Editor:- June 24,
2007 - STORAGEsearch.com today published a new article - Green Storage -
Trends and Predictions.
 
 There's a lot of nonsense in the media
about so called "Green Storage". This article blows away the
puffery
and clears the air for a better view of forward looking  green data storage 
technologies. Reliability gets an honorable mention. Find out what's really
green - and what's not. ...read the article
 
 
 Hard Drive
Unreliability Costs are Reason to Switch to SSDs
 
 Aliso
Viejo, Calif., May 30, 2007 - SiliconSystems, Inc.  today announced the
publication of a white paper called - "Solid-State Storage is a
Cost-Effective Replacement for Hard Drives in Many Applications."
 
 The paper  cites data from  Google
 and Carnegie Mellon University that
indicates  hard drive
field failure rates are up to 15x greater than quoted in disk
manufacturer data sheets.  The white paper was developed by SiliconSystems to
educate OEMs about the numerous technical and business decisions they must
successfully navigate to select the best storage solution for their application.
 ...read
the article (pdf),
...SiliconSystems
profile
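
 To see what a "15x" gap means in practice, it helps to turn a datasheet MTBF into an annualized failure rate (AFR). The sketch below uses the standard constant-failure-rate conversion. The 1,000,000 hour MTBF is a hypothetical datasheet value chosen for illustration, not a number taken from the white paper.

```python
import math

HOURS_PER_YEAR = 8760


def annualized_failure_rate(mtbf_hours):
    """AFR implied by an MTBF, assuming a constant (exponential) failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


ASSUMED_MTBF_HOURS = 1_000_000  # hypothetical datasheet value for illustration
datasheet_afr = annualized_failure_rate(ASSUMED_MTBF_HOURS)
print(f"datasheet AFR: {datasheet_afr:.2%}")
print(f"15x that rate: {15 * datasheet_afr:.2%}")
```

 Under these assumptions the datasheet implies an AFR a little under 1% a year, so a field rate 15x higher would be well into double digits.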
 
 Editor's note:-
storage reliability
is a type 4 application in our 
SSD Market
Adoption Model.
 
 
 Debunking Misconceptions in SSD Longevity
 
 Editor:- May 11,
2007 - BiTMICRO Networks today published a new article called - "Debunking
Misconceptions in SSD Longevity."
 
 It cites lifetime
predictions from my own popular article - 
SSD Myths and
Legends - "write endurance" and fires a warning shot  aimed at
some competitors by saying  "some
flash SSD makers
have even quoted higher write endurance ratings than those provided by
manufacturers of their flash
memory components."
 
 That's certainly true - but  I knew when
writing my article that endurance varies from batch to batch of flash chips
within the same  semiconductor fab process. Some SSD oems 
sample test and
reject chips which are at the lower end of the distribution curve. That
means their worst case numbers are better than would be the case by simply
accepting merchant quality flash chips. Although starting from a different base
of assumptions -
BiTMICRO's
article  "conclude(s) that fears about the endurance limitations of
SSDs are rightfully fading away."
 
 
 Seagate Drops  Notebook Drives
 
 SCOTTS
VALLEY, Calif - March 12, 2007 - Seagate Technology today announced the
worldwide   availability of a  7,200 RPM   hard drive   with free-fall
protection for beefed-up laptop durability.
 
 Momentus 7200.2 delivers up to 160GB of capacity and has a 
SATA interface. The
hard drive is also
offered with an optional free-fall sensor to help prevent drive damage and data
loss upon impact if a laptop PC is dropped. The sensor works by detecting any
changes in acceleration equal to the force of gravity, then parking the head off
the disc to prevent contact with the platter in a free fall of as little as 8
inches.
...Seagate profile
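 
 The 8 inch figure implies a short reaction window. As a rough sketch (assuming the laptop is dropped from rest and ignoring air resistance, neither of which is stated in the announcement), the time available to detect the fall and park the head is:

```python
import math

G = 9.80665         # standard gravity in m/s^2
INCH_TO_M = 0.0254  # metres per inch


def free_fall_time(height_m: float) -> float:
    """Seconds taken to fall height_m from rest, ignoring air resistance."""
    return math.sqrt(2.0 * height_m / G)


drop_height_m = 8 * INCH_TO_M  # the 8 inch drop quoted above
window_s = free_fall_time(drop_height_m)

print(f"time to fall 8 inches: {window_s * 1000:.0f} ms")
```

 That works out to roughly a fifth of a second for the sensor to react and the head to park.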
 
 Editor's comments:- Hitachi revealed details
about its similar 
ESP drop sensor
in  2005. The drop sensor  approach is better than nothing, but doesn't get
around the unavoidable fact that hard disks can break when dropped.
 
 Another
approach is that of  Olixir
Technologies who have marketed repackaged high performance hard drives which
can be dropped repeatedly onto
a concrete floor from 6 feet and still survive.
 
 But
solid state disks are
inherently even tougher than that because there are no internal moving parts to
crash together. That's
why they have been used in spaceships, helicopters and missiles. In 2006 
In-Stat predicted that
half of all mobile computers would use SSDs (instead of
hard disks) by 2013.
It's not just the ruggedness and better power consumption. A
video
by Samsung demonstrates the advantages more graphically.
 
 
 Hard Disk MTBF Specs   Incredible - Say 
User Reports
 
 Editor:-
 February 28, 2007 - an article published today in Channel Insider - "Hard
Disk MTBF: Flap or Farce?" - casts serious doubt on the inflated MTBF claims made
by all hard disk manufacturers.
 
 Reviewing a number of recently
published reliability studies from end users - the author
David
Morgenstern  says "...there's a gap between the reliability
expectations of manufacturers and customers. The current MTBF model isn't
accounting accurately for how drives are handled in the field and how they
function inside systems."  ...read
the article, storage
reliability
 
 
 Google Reports on HDD Reliability
 
 Editor:-
February 20, 2007 - Researchers at Google published a paper at
the recent Usenix conference about hard disk reliability and failure
prediction - based on their own experiences as a large user of hard disk drives.
 
 The fascinating paper describes how Google measured available metrics
and status reports generated by the drives themselves and how this correlated
with actual failure patterns. One of the key insights in the report is  Google's
view of how useful
SMART
parameters were for predicting failures.
 
 "Our results are
surprising, if not somewhat disappointing. Out of all failed drives, over 56% of
them have no count in any of the four strong SMART signals, namely scan errors,
reallocation count, offline reallocation, and probational count. In other words,
models based only on those signals can never predict more than half of the
failed drives... ...even when we add all remaining SMART parameters (except
temperature) we still find that over 36% of all failed drives had zero counts on
all variables." ...read
the article, Hard disk
drives,  storage
reliability
 
 PS - the measured data on the percentage of disks which
fail each year  over a 5 year cycle under various conditions is essential
reading for disk to disk backup
contingency planning.
 
 
 Agere Halves Power Consumption for Mobile HDD Interface
 
 ALLENTOWN, Pa - February 6,
2007 -  Agere Systems  has begun shipping a new fully functional
90-nanometer TrueStore read channel.
 
 The TrueStore RC1300 uses half the  current required by the previous
generation of read channel chip technology in this market segment and is  25%
faster.  It targets the 1.8-inch and smaller
HDD form factor that
provides critical data storage of 20 to 160 gigabytes in a wide variety of
consumer devices.  
...Agere Systems profile
 
 
 STORAGEsearch.com Launches a New Strategic Directory  -  Storage
Reliability
 
 Editor:- June 20, 2006 - STORAGEsearch.com
today launched a new directory dedicated to the subject of "Storage
Reliability".
 
 Reliability was named as one of the 3 most important
 future  trends in storage   in my 
state of the storage market
article published last year. In that article  I also predicted that
uncorrectable  failures in storage systems (due to embedded design assumptions
made in  earlier generations) could, if not dealt with by drive and interface
designers,  pose a  more serious threat to enterprise computer systems
than the Y2K bug in the late 1990s.
 
 In addition to covering
news about what the industry is doing to improve reliability in future drives,
media and interfaces, STORAGEsearch has invited CTOs and technical directors of
leading companies to write special articles about this subject - which will
appear in the months ahead.
 
 When most people think about storage
reliability - they think about MTBF and thermal factors.
 
 If an
individual drive isn't reliable enough - wrap it in a
RAID. If heat reduces the
life of the disks - then cool them with more fans. If a memory system or
interface  is critical to an application - cocoon it with error detection and
correction codes. Those are approaches which have worked adequately  for the
past few decades - but they are not good enough any more.
 
 The demands
for storage reliability are growing.  Non stop applications need data that can
be trusted to be available on demand.
Compliance
dictates that data should be readable not just years - but possibly decades
after it was created. Meanwhile storage components, interfaces and systems are
increasing in speed and capacity - while many of them are using error correction
thinking that comes from earlier generations when data sets were smaller. As
storage gets bigger - users face the risk of having uncorrectable errors in the
heartland of their decision making data. That's why - all over the industry -
manufacturers are starting to talk about new storage reliability initiatives.
 
 There's
also the risk that new storage technologies which get rushed to serve the needs
of the consumer market - have not in fact been tested long enough to guarantee
that they will not fail or start to corrupt data in the  timeframe that 
enterprise customers care about.
 
 Wrapping arrays of  consumer
disks based on new 2
year proven media  technology  in a big "enterprise" box - cannot
guarantee that the data will still be readable in 5 years' time. This is not a
worry for consumers. They'll throw a failed disk  away or buy a new one.  But if
your enterprise owns thousands of these disks (hidden by virtualization) it
could be a big headache when the crumbly nature of the storage defects starts to
hit the news.  This is 
another of the many concerns we'll be covering in these pages.
Storage media have
failed in the past and been withdrawn because they didn't meet their original
extrapolated lifetimes. Lessons from past errors are not always learned - they
can be forgotten, and the errors recur.
 
 Storage reliability is changing.
If you are interested  - I hope you'll stay tuned to the new storage reliability
 channel here on the mouse site - as we report on these exciting developments in
the months ahead.
 
 
 Why Solaris will Get 128 Bit Addresses
 
 Editor:-
May 1, 2006 -  an article today  in InformationWeek.com  discusses the
Zettabyte File System - a new 128 bit addressing scheme for Solaris.
 
 The
article says that apart from the obvious advantage of being able to access more
storage, Sun is apparently thinking about building error correction into the
new address scheme.
 
 In a
market forecast
published last year in STORAGEsearch.com
- storage reliability and failures were cited as one of the most important
long-term problems which OEMs and users will have to deal with.
 
 The
cause of the problem is that storage interfaces as well as  modules and
components (like disks,
tapes,
optical drives etc) use
error correcting schemes which were designed for the  much smaller and slower
architectures of the past. As storage systems expand - new algorithms and
correction schemes will be needed to guarantee that users don't get affected by 
data failures which are uncorrectable using today's products and
protection schemes.
 
 It's good to see that Sun is working proactively
on one aspect of the problem. I've talked to many storage manufacturers  about
the upcoming reliability problem - which could be more serious than the Y2K
threat - if not dealt with in advance. Sun is highly sensitive to data 
reliability concerns.  
Problems
with its own SPARC server cache memory design back in 2001 - were cited at
the time  by many large users as reasons for considering a switch  to Intel and
PowerPC  based systems.
 
 See also:-
SPARC Product Directory
 
 
 Hard Disk  Sector Size May Change
 
 SUNNYVALE,
Calif - March 23, 2006 - IDEMA  today announced the results of an
industry committee assembled to identify a new and longer sector standard for
future magnetic hard disk drives.
 
 The committee recommended
replacing the 30-year standard of 512 bytes with sectors able to store
4,096 bytes.    Dr. Ed Grochowski, executive director of IDEMA US, reported that
adopting a 4K byte sector length facilitates further increases in data density
for hard drives which will increase storage capacity for users while continuing
to reduce cost per gigabyte.
 
 "Increasing areal density of newer magnetic
hard disk drives
requires a more robust error correction code, and this can be more efficiently
applied to 4,096 byte sector lengths," explained Dr. Martin Hassner from
Hitachi GST and IDEMA Committee member. 
...IDEMA profile
 
 
 Whitepaper Measures  ROI  of Disk Defragmentation
 
 Burbank,
CA - January 24, 2006 - Diskeeper recently sponsored IDC to
write a whitepaper called - "Defragmentation's Hidden Value for the
Enterprise."
 
 This measured the ROI of defragmentation
software in real customer sites. During the reliability test, the servers that
were defragmenting files automatically had a higher uptime (5 to 10%) than the
servers that didn't have defragmentation software automatically running. 
...read
the article (pdf), ...Diskeeper
 profile
 
 
 ProStor Systems Unveils New Backup
Technology
 
 BOULDER,
CO  -  November 2,  2005  -  ProStor Systems  made its public debut
today by introducing the firm's RDX removable disk backup  technology.
 
 The RDX removable cartridge uses the same 2.5" hard disk media platters  
found in notebook computers and provides initial capacity up to 400GB
(compressed). That will increase in line with conventional hard disk
technology.  But the difference is that RDX uses a new patent-pending
error correcting format, which makes the data  1,000 times more
recoverable than in a standard hard drive. ProStor says this means that  
RDX-stored data will be   readable even after the cartridge has been archived
and non-operating for more than a decade.    ...ProStor Systems profile,
Removable  Storage,
Disk to disk backup,
Storage People
 
 Editor's
comments:- the reliability of embedded storage modules and components such
as
disk drives,
tape drives and
optical disks  will become
an important issue for users in the
next 7 years.
 
 These
products rely on  inbuilt error correction algorithms which were designed over a
decade ago - when storage capacities were much smaller. All those "ten to
the minus something" numbers  which you see quoted for error rates sound
good - except that when your enterprise is managing Petabytes of data, at ever
higher connection speeds, then you will start seeing uncorrectable data
failures occurring every year - inside the storage, and beyond the scope
of your RAID or other
protection scheme to correct.  ProStor is one of a new generation of storage
manufacturers addressing this problem, and we'll soon publish a directory
section dedicated to storage reliability issues such as this.
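 
 The arithmetic behind that "ten to the minus something" point can be sketched as follows. The bit error rate of 1 in 10^14 is an assumed, illustrative value (not taken from ProStor or any specific data sheet), errors are treated as independent, and the function names are hypothetical.

```python
import math


def expected_unrecoverable_errors(bytes_read: float, bit_error_rate: float) -> float:
    """Expected number of unrecoverable bit errors when reading bytes_read bytes."""
    return bytes_read * 8 * bit_error_rate


def prob_at_least_one_error(bytes_read: float, bit_error_rate: float) -> float:
    """Chance of at least one unrecoverable error, assuming independent errors."""
    # 1 - exp(-x), computed with expm1 for accuracy when x is small.
    return -math.expm1(-expected_unrecoverable_errors(bytes_read, bit_error_rate))


ASSUMED_BER = 1e-14  # illustrative assumption: 1 unrecoverable error per 10^14 bits read
TERABYTE = 1e12      # bytes
PETABYTE = 1e15      # bytes

for label, size in (("1 TB", TERABYTE), ("1 PB", PETABYTE)):
    errors = expected_unrecoverable_errors(size, ASSUMED_BER)
    chance = prob_at_least_one_error(size, ASSUMED_BER)
    print(f"{label}: expected errors {errors:.2f}, chance of at least one {chance:.1%}")
```

 With these assumed numbers an error rate that looks negligible at the Terabyte scale turns into dozens of expected uncorrectable errors for every Petabyte read.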