The Big SSD Research Post¶
This article was written at the end of 2014. Most of the content will remain relevant in the future but there may be aspects which become outdated as hardware and usage practices evolve.
I have purchased an Intel 730 Series 480GB SSD with the intention of using it in my Fedora desktop machine. There are many articles online that discuss how to configure an SSD, what to put on it, which partition structure to use, which mount options to choose and so on. I wanted to set the drive up to get the best out of it and my system as a whole. What follows are the results of my research and the decisions I have made based upon it. The writing style is based on my own note taking and is primarily for my own reference. It is my hope that by posting it here it may be of use to others.
What do I put on my SSD?¶
There are many posts/articles online asking this question and the answers to them vary. What became apparent immediately was the age of some of the discussions. Information from 3-4 years ago based the decisions on drive size (32-128GB).
An SSD with a limited amount of capacity is often set up alongside a traditional HDD.
A typical partition layout for a smaller SSD & HDD:
SSD: /boot /
HDD: swap /var /tmp /home /data
/boot and / hold the main operating system and programs. SSDs are sometimes referred to as ‘boot drives’, as having the OS and applications on them allows the system to boot and respond faster.
Swap wouldn’t be placed on the SSD because frequent use of the swap space would put additional wear onto the SSD.
/var contains log files from the syslog daemon. Frequently writing small amounts of log data increases SSD wear. /tmp in much the same way holds small temporary data that increases wear. /home is occasionally placed on the SSD depending on the space required by the user. The HDD would be used for music, photos, large files and other user generated data.
Why should I think about SSD writes?¶
SSD writes are not in themselves anything to worry about; they are however something to understand.
The flash memory that is used for data storage only has a limited number of program (write)/erase cycles over its lifetime. This can be in the region of 50,000 to 100,000 cycles, or much lower, in the 3,000 to 5,000 range. You would think that as time goes by the number of cycles the flash can sustain would be going up. You would be wrong.
The number of cycles the flash can sustain is partially based on the scale of manufacturing process used to make the flash. Over time the flash memory being manufactured has been getting physically smaller and smaller, with its write endurance decreasing. The advantage is that with smaller individual bits you can pack more of them onto a chip and fit more of them into a drive. This is why 1TB+ drives are starting to become affordable.
With only a finite number of writes how do we stop the SSD wearing out?¶
If the SSD is spreading the write cycles out, why consider them?¶
The SSD is only managing an even distribution of write cycles over its flash memory. What it isn’t doing is trying to reduce the number of write cycles. We previously mentioned that storing log files and frequently changing data on the drive increases the number of write cycles. But there is another effect to consider for these writes, something called ‘Write Amplification’.
Write Amplification = data written to flash memory / data written by host
A write amplification factor is based upon how much data actually gets written to the flash memory when we try to write some data to it. A write amplification value of 1 would mean that if the host sent 1MB of data to the SSD then 1MB worth of data would be written. A write amplification of 2 would mean that if the host sent the same 1MB of data to the SSD then 2MB of data would be written. So what we really want to have is a write amplification factor that is as low as possible.
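The two cases above can be checked with a few lines of Python. This is only a sketch of the formula; the function name `write_amplification` is my own and the byte counts are the illustrative 1MB/2MB figures from the text, not measurements from a real drive.

```python
def write_amplification(flash_bytes_written, host_bytes_written):
    """Return data written to flash divided by data written by the host."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return flash_bytes_written / host_bytes_written


MB = 1024 * 1024

# Host sends 1MB and the drive writes 1MB to flash.
print(write_amplification(1 * MB, 1 * MB))  # 1.0

# Host sends 1MB but the drive ends up writing 2MB to flash.
print(write_amplification(2 * MB, 1 * MB))  # 2.0
```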
The Wikipedia article outlines a number of factors affecting write amplification. There are a few we will consider a little later.
Garbage Collection
Over-Provisioning
Free User Space
TRIM
Partition Alignment
The overall aim is to reduce the write amplification effect to as low a figure as possible, so that when we do write data we don’t end up making significantly more writes to the flash memory than we absolutely need to store our data.
What is Garbage Collection?¶
Simply put it is a process by which units of flash memory (pages) that have had data written to them which is no longer needed (e.g. from a deleted file) are reclaimed (erased) and made ready for storing data again.
The pages of flash memory are organised in blocks. The act of reclaiming some of the pages in a block of flash memory causes the good pages to get copied (written) to a new block (with free pages), whilst the original block is completely erased ready for new data. The Wikipedia Write Amplification article has a nice graphic of this.
Garbage collection can be done internally by the SSD and can be triggered at file system level. Garbage collection is important in that it prevents unwanted/deleted data from being repeatedly copied over time as blocks are rewritten, which would increase the effect of write amplification.
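The copy-then-erase step can be pictured with a toy model. This is a heavily simplified sketch and not how any particular SSD controller works: a block is just a list of page states, and the names `garbage_collect`, "valid", "stale" and "free" are my own.

```python
def garbage_collect(block):
    """Reclaim one block of flash in a toy model.

    `block` is a list of page states: "valid", "stale" or "free".
    Valid pages are copied (rewritten) into a fresh block and the
    original block is erased as a whole.
    Returns (new_block, erased_block, pages_copied).
    """
    valid_pages = [page for page in block if page == "valid"]
    # The fresh block receives the valid pages; the rest of it stays free.
    new_block = valid_pages + ["free"] * (len(block) - len(valid_pages))
    # Erasing works on whole blocks, so every page becomes free again.
    erased_block = ["free"] * len(block)
    return new_block, erased_block, len(valid_pages)


block = ["valid", "stale", "stale", "valid"]
new_block, erased_block, copied = garbage_collect(block)
print(new_block)     # ['valid', 'valid', 'free', 'free']
print(erased_block)  # ['free', 'free', 'free', 'free']
print(copied)        # 2
```

The `copied` count is the extra writing the drive had to do that the host never asked for, which is where garbage collection feeds into write amplification.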
What is TRIM?¶
TRIM is an ATA command which allows the operating system to tell the SSD which blocks of stored data are no longer needed, something it otherwise wouldn’t know about. By telling the SSD which blocks contain invalid/deleted data it can erase them through its garbage collection process and make them ready for writing again. Having the blocks ready for new writes allows the drive to write faster, as it doesn’t have to go through the erase cycle before writing.
TRIM hasn’t always been supported by SSDs, but most of the models on the market today will have support for it. It is still worth checking.
Over-Provisioning & Free User Space¶
On a drive with limited space the only space used for new writes is that which is not occupied by valid data. The TRIM command would be used to tell the drive to garbage collect and keep this space available. But a drive operates better when it has more space to work with.
Over time, as data is stored on the drive, certain blocks will reach their write limit and cease to function. SSDs (like HDDs) have extra designed-in capacity to handle these losses. When a block fails a new block will be allocated to take its place. The new block is taken from the over-provisioned capacity.
Over-provision(%) = ((physical capacity – user capacity)/user capacity)*100
This extra capacity is used for wear levelling, garbage collection and to help reduce write amplification. The more over-provisioned capacity you have the better, as the wear can be more evenly distributed and the more free blocks can be available for writing at any given time.
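As a quick worked example of the over-provisioning formula, here is a sketch in Python. The capacities are made-up round numbers chosen for illustration, not the real figures for any drive, and the function name is my own.

```python
def over_provisioning_percent(physical_capacity, user_capacity):
    """Over-provision(%) = ((physical - user) / user) * 100."""
    if user_capacity <= 0:
        raise ValueError("user_capacity must be positive")
    return (physical_capacity - user_capacity) / user_capacity * 100


# Hypothetical drive: 128 units of flash, 120 units exposed to the user.
print(round(over_provisioning_percent(128, 120), 2))  # 6.67

# Same hypothetical drive with only 100 units partitioned. This assumes
# the drive can treat the unpartitioned space as spare area.
print(round(over_provisioning_percent(128, 100), 2))  # 28.0
```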
Why should I care about Partition Alignment?¶
A Brief Pause¶
So after all that we know that SSD flash has a limited number of write cycles and there are things that can increase/decrease the number of writes performed. There are steps the manufacturer has taken to assist the drive and by setting up the drive carefully ourselves we can assist this further.
It took me the best part of 3 evenings’ research to get to this point and I still wasn’t sure what to store on my SSD. My concern was over how much data I would be writing to the drive. So it was time to look at how I use my existing system.
The use of journald (part of systemd) will cut down on log writes to the SSD. By default it synchronises to disk every 5 minutes, which cuts down on smaller writes. If however a message of priority CRIT, ALERT or EMERG is logged then it is synchronised immediately.
Journald settings may need to be optimised.
By default journald has the option “ForwardToSyslog” enabled. If a syslog daemon (rsyslog/syslog-ng) is running then the log data could be written to disk twice.
Only using Journald would appear to be the better option for a workstation/desktop machine where you don’t need the log forwarding capabilities of a syslog daemon.
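To see which of these settings are explicitly set on a machine, the journald configuration file (normally /etc/systemd/journald.conf) can be read as an INI-style file. The sketch below parses a sample string rather than the real file; the sample values are illustrative only and the helper name `journal_settings` is my own.

```python
import configparser

# Illustrative contents only; a real journald.conf will differ.
SAMPLE_CONF = """\
[Journal]
#Storage=auto
SyncIntervalSec=5m
ForwardToSyslog=no
"""


def journal_settings(conf_text):
    """Return the uncommented options from the [Journal] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the CamelCase option names as written
    parser.read_string(conf_text)
    if not parser.has_section("Journal"):
        return {}
    return dict(parser["Journal"])


settings = journal_settings(SAMPLE_CONF)
print(settings.get("ForwardToSyslog"))  # no
print(settings.get("SyncIntervalSec"))  # 5m
```

Options that are commented out do not appear in the result, which means journald is using its built-in default for them.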
Monitoring HDD/SSD I/O¶
Two command-line tools are available for monitoring i/o usage on your system.
iotop – shows current i/o usage by running processes
iostat – reports i/o statistics
MB_wrtn/uptime_in_days = MB/day written.
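The same figure can be worked out from the kernel counters directly. The sketch below assumes the Linux /proc/diskstats layout, where the tenth column is the number of 512-byte sectors written since boot, and it treats 1MB as 1,000,000 bytes. The sample line is made up for illustration, and the function names are my own.

```python
SECTOR_BYTES = 512       # /proc/diskstats always counts 512-byte sectors
SECONDS_PER_DAY = 86400

# Made-up sample in /proc/diskstats format (not real measurements).
SAMPLE_DISKSTATS = (
    "   8       0 sda 120000 3000 9000000 50000 "
    "400000 20000 58593750 900000 0 300000 950000\n"
)


def mb_written(diskstats_text, device):
    """Return MB written to `device` since boot."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields[2] is the device name, fields[9] is sectors written.
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9]) * SECTOR_BYTES / 1_000_000
    raise KeyError(device)


def mb_per_day(total_mb, uptime_seconds):
    """MB written divided by uptime in days."""
    return total_mb / (uptime_seconds / SECONDS_PER_DAY)


total = mb_written(SAMPLE_DISKSTATS, "sda")
print(total)                                             # 30000.0
print(round(mb_per_day(total, 9 * SECONDS_PER_DAY), 1))  # 3333.3
```

On a live system the text would come from reading /proc/diskstats, and the uptime in seconds from the first number in /proc/uptime.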
The Intel 730 Series 480GB SSD specifications indicate the drive is designed for a minimum of 5 years’ useful life based upon up to 70GB of writes per day.
Current HDD I/O numbers¶
Based on my standard usage over 9 days, with the machine operating 24/7 I have written approximately 30GB of data to my primary hard drive. The drive contains the /, /boot, swap, /home, /var partitions. Other drives in the machine are for media storage only and contain no system files. This gives me a write usage of around 3.3GB per day.
Most of the 3.3GB per day usage is down to the Podcasts that are downloaded by Gpodder. These range in size from a few hundred MB to just over a GB. Putting such downloads on a HDD rather than an SSD would reduce the data written to the drive considerably.
Going by the numbers of the SSD write endurance and the current usage based upon a HDD I can expect the drive to stay well below the daily write threshold and shouldn’t have to worry about wear.
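The comparison can be put into numbers. This is a rough sketch that treats the rated 70GB per day over 5 years as a simple total write budget and ignores write amplification, so it is an estimate rather than a prediction. The function names are my own.

```python
RATED_GB_PER_DAY = 70   # from the drive specification quoted above
RATED_YEARS = 5
MEASURED_GB = 30        # written to my current HDD...
MEASURED_DAYS = 9       # ...over this many days


def rated_total_gb(gb_per_day, years):
    """Total host writes implied by the rating (365-day years)."""
    return gb_per_day * 365 * years


def years_to_reach(total_gb, gb_per_day):
    """Years needed to write `total_gb` at a steady daily rate."""
    return total_gb / gb_per_day / 365


daily = MEASURED_GB / MEASURED_DAYS
budget = rated_total_gb(RATED_GB_PER_DAY, RATED_YEARS)

print(f"{daily:.1f} GB/day measured")
print(f"{daily / RATED_GB_PER_DAY:.1%} of the rated daily writes")
print(f"{budget} GB rated total")
print(f"{years_to_reach(budget, daily):.0f} years to reach the rated total")
```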
What do we know now that we didn’t before?¶
Well, the choice of system logger can have an effect. It happens that Journald is the Fedora default and it will serve the purpose I need it to.
SSD wear through normal OS usage should not be a concern. However downloading large files to the SSD forms a large part of my writes. The I/O tools only monitor the writes from the OS; they don’t factor in the write amplification that might occur internally on the SSD. However given the drive specs and my level of usage this isn’t a concern.
So I know that I can use an SSD pretty much like I can my HDD.
How should I partition my SSD?¶
Time for a revision of the partition layout that began this article.
SSD: /boot / /home /var /tmp
HDD: swap /data /downloads
We have learnt that using Journald reduces the write frequency of log files, so /var can go onto the SSD. It turns out that /tmp is often mounted as a RAM file system and hence doesn’t write to the drive, so onto the SSD it goes. /home contains user configurations and documents; why not speed up access to these? SSD it is then. / and /boot are the core system and mainly static files, so putting these on the SSD means the system and programs will load faster.
Swap can go onto the HDD, with plenty of RAM I hardly use it. There have been occasions in the past where I have had a badly behaving program eat up the RAM and cause writes to swap. Placing swap onto the HDD protects the SSD from the effects of this should it occur again.
Data and downloads are limited more by the playback and download speeds. Given the larger sizes, putting them on a HDD makes sense. In addition the podcasts can come in at a few GB per week and I delete them after watching, so using the HDD avoids that SSD wear.
Which File system?¶
Ext4 is a solid and reliable file system. I have used Ext2, Ext3 and Ext4 for years without any issues. There are alternative file systems out there which may, for certain workloads, be more efficient/faster. The newer file systems have not been around long enough for my liking to have been fully tested. Fedora 21, about to be released as I write this, will still default to Ext4. Btrfs is being looked at as a Fedora default for the future, but they are not willing to commit to it as a default yet.
/etc/fstab outline based on previous partitions
/dev/sda1  /boot       ext4  defaults,noatime  1 2
/dev/sda2  /           ext4  defaults,noatime  1 1
/dev/sda3  /home       ext4  defaults,noatime  1 2
/dev/sda4  /var        ext4  defaults,noatime  1 2
/dev/sdb1  none        swap  defaults          0 0
/dev/sdb2  /data       ext4  defaults,noatime  1 2
/dev/sdb3  /downloads  ext4  defaults,noatime  1 2
Note: /tmp is mounted as tmpfs by the systemd tmp.mount unit
Mounting with noatime is advised since nothing really relies on atime functionality. If you have a practical need for atime functionality, using the relatime option is advised. https://tytso.livejournal.com/61830.html
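A small check like the following can confirm that no ext4 entry has been left without noatime. It is only a sketch that parses fstab-style text; the sample mirrors the outline above with one entry deliberately left without noatime, and the function name is my own.

```python
# Sample based on the outline above; /data deliberately lacks noatime.
SAMPLE_FSTAB = """\
# device    mountpoint  type  options           dump pass
/dev/sda2   /           ext4  defaults,noatime  1 1
/dev/sda3   /home       ext4  defaults,noatime  1 2
/dev/sdb1   none        swap  defaults          0 0
/dev/sdb2   /data       ext4  defaults          1 2
"""


def mounts_missing_option(fstab_text, option, fstypes=("ext4",)):
    """Return mount points of the given types that lack `option`."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        mountpoint, fstype, options = fields[1], fields[2], fields[3]
        if fstype in fstypes and option not in options.split(","):
            missing.append(mountpoint)
    return missing


print(mounts_missing_option(SAMPLE_FSTAB, "noatime"))  # ['/data']
```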
I am not using the discard mount option for the SSD in my case as I have a drive which has a good amount of manufacturer over-provisioning and I will be leaving unpartitioned space to add to this. A scheduled fstrim command perhaps once a week should suffice.
How to create/align the partitions?¶
The simplest way to both create and align the partitions is to create them with Gparted. When creating a partition you are given an alignment option with the choices of MiB (default), Cylinder and None. The MiB default is what we want, as it aligns partitions to mebibyte (1,048,576 byte) boundaries.
To check an existing partition table on /dev/sda run:
$ fdisk -lu /dev/sda
The fdisk output will indicate non-aligned partitions. I have not investigated how to realign partitions as I don’t have any misaligned ones.
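The start sectors that fdisk lists can also be checked by hand. The sketch below assumes 512-byte logical sectors, which is what fdisk reports for many drives but should be confirmed from its output. The function name is my own.

```python
MIB = 1024 * 1024  # 1,048,576 bytes


def is_mib_aligned(start_sector, sector_bytes=512):
    """True if a partition starting at `start_sector` sits on a MiB boundary."""
    return (start_sector * sector_bytes) % MIB == 0


# 2048 * 512 = 1,048,576 bytes, exactly 1 MiB in.
print(is_mib_aligned(2048))  # True

# 63 was a common start sector under the old cylinder alignment.
print(is_mib_aligned(63))    # False
```

On Linux the start sector of a partition can also be read from /sys/block/sda/sda1/start, which is given in 512-byte units.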
Without the research process I would not even have considered I/O schedulers. The I/O scheduler determines the order in which read and write requests are sent to the physical drives. There is a difference between how HDD (rotational) and SSD (non-rotational) drives behave.
Three common schedulers keep showing up: CFQ, NOOP and DEADLINE.
CFQ (Completely Fair Queuing) – Tends to be the default scheduler and there is nothing wrong with it. It is however designed with the mechanical limitations of rotational media in mind. The distance between data on rotating media means that there can be a delay between wanting to read/write data and that area of the media rotating within range of the drive’s read/write heads. It only remains within range for a brief period before rotating out of range for the rest of the rotational cycle. SSDs don’t have this limitation and schedulers with a different approach can be considered.
NOOP – Is simple, it passes along requests quickly without much thought. It is faster on SSD’s than CFQ.
DEADLINE – A latency-orientated scheduler and a little more thoughtful than NOOP.
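To find out which scheduler a drive is currently using, the kernel exposes a line such as "noop deadline [cfq]" in /sys/block/sda/queue/scheduler, with the active scheduler in square brackets. The sketch below only parses that format; the sample strings are examples rather than output from my machine, and the function name is my own.

```python
def parse_scheduler(text):
    """Split a queue/scheduler line into (active, available).

    The active scheduler is the name wrapped in square brackets.
    """
    names = text.split()
    active = None
    available = []
    for name in names:
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


print(parse_scheduler("noop deadline [cfq]"))
# ('cfq', ['noop', 'deadline', 'cfq'])

print(parse_scheduler("noop [deadline] cfq"))
# ('deadline', ['noop', 'deadline', 'cfq'])
```

The neighbouring file /sys/block/sda/queue/rotational contains 1 for a rotational drive and 0 for a non-rotational one, which is a handy way to tell the HDD and SSD apart.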
How do I set the I/O Scheduler?¶
The scheduler can be set through udev rules. I have yet to detail the changes here; the Debian article has some information on this.