I have purchased an Intel 730 Series 480GB SSD with the intention of using it in my Fedora desktop machine. There are many articles online that discuss how to configure an SSD, what to put on it, which partition structure to use, which mount options to choose and so on. I wanted to set the drive up to get the best out of it and my system as a whole. What follows are the results of my research and the decisions I have made based upon it. The writing style is based on my own note taking and is primarily for my own reference. It is my hope that by posting it here it may be of use to others.
What do I put on my SSD?
There are many posts/articles online asking this question and the answers to them vary. What became apparent immediately was the age of some of the discussions. Information from 3-4 years ago based the decisions on drive size (32-128GB).
An SSD with a limited amount of capacity is often set up alongside a traditional HDD.
A typical partition layout for a smaller SSD & HDD:
/boot & / are the main operating system and program partitions/directory structure. SSDs are sometimes referred to as 'boot drives' because having the OS and applications on them allows the system to boot and respond faster.
Swap wouldn't be placed on the SSD because frequent use of the swap space would put additional wear onto the SSD.
/var contains log files from the syslog daemon. Frequently writing small amounts of log data increases SSD wear. /tmp in much the same way has small temporary data that increases wear. /home is occasionally placed on the SSD depending on the space required by the user. The HDD would be used for music, photos, large files and other user generated data.
SSDs have been increasing in size over time and the space limitation isn't such an important factor. What does remain is the consideration of the amount of data being written to an SSD.
Why should I think about SSD writes?
SSD writes are not anything to worry about; they are, however, something to understand.
The flash memory that is used for data storage only has a limited number of program (write)/erase cycles over its lifetime. This can be in the region of 50,000 to 100,000 cycles, or as low as 3,000 to 5,000. You would think that as time goes by the number of cycles the flash can sustain would be going up. You would be wrong.
The number of cycles the flash can sustain is partially based on the scale of manufacturing process used to make the flash. Over time the flash memory being manufactured has been getting physically smaller and smaller, with its write endurance decreasing. The advantage is that with smaller individual bits you can pack more of them onto a chip and fit more of them into a drive. This is why 1TB+ drives are starting to become affordable.
With only a finite number of writes how do we stop the SSD wearing out?
If the same bit of flash memory kept being used it would very quickly fail and data would be lost. The way to avoid this is by using different bits of flash memory over time, a technique called 'Wear Leveling'. The SSD looks after itself and spreads the write cycles across all of its flash memory as evenly as it can.
If the SSD is spreading the write cycles out, why consider them?
The SSD is only managing an even distribution of write cycles over its flash memory. What it isn't doing is trying to reduce the number of write cycles. We previously mentioned that storing log files and frequently changing data on the drive increases the number of write cycles. But there is another effect to consider for these writes, something called 'Write Amplification'.
Write Amplification = data written to flash memory / data written by host
A write amplification factor is based upon how much data actually gets written to the flash memory when we try to write some data to it. A write amplification value of 1 would mean that if the host sent 1MB of data to the SSD then 1MB worth of data would be written. A write amplification of 2 would mean that if the host sent the same 1MB of data to the SSD then 2MB of data would be written. So what we really want to have is a write amplification factor that is as low as possible.
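As a quick worked example of the formula above, here is a small shell sketch. The function name write_amplification is made up for illustration, both inputs are assumed to be in the same unit, and awk is only there for the floating-point division.

```shell
#!/bin/sh
# Hypothetical helper: write amplification = flash writes / host writes.
# Both arguments must use the same unit (MB here).
write_amplification() {
    awk -v flash="$1" -v host="$2" 'BEGIN { printf "%.2f\n", flash / host }'
}

# The host sends 1MB and the drive ends up writing 2MB to flash.
write_amplification 2 1
```

With these inputs it prints 2.00, the second case described above.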
The Wikipedia article outlines a number of factors affecting write amplification. There are a few we will consider a little later: garbage collection, TRIM, over-provisioning and free user space.
The overall aim is to reduce the write amplification effect to as low a figure as possible, so that when we do write data we don't end up making significantly more writes to the flash memory than we absolutely need to store our data.
What is Garbage Collection?
Simply put it is a process by which units of flash memory (pages) that have had data written to them which is no longer needed (e.g. from a deleted file) are reclaimed (erased) and made ready for storing data again.
The pages of flash memory are organised in blocks. The act of reclaiming some of the pages in a block of flash memory causes the good pages to get copied (written) to a new block (with free pages), whilst the original block is completely erased ready for new data. The Wikipedia Write Amplification article has a nice graphic of this.
Garbage collection can be done internally by the SSD and can be triggered at file system level. Garbage collection is important in that it prevents unwanted/deleted data from being repeatedly copied over time as blocks are rewritten, which would increase the effect of write amplification.
What is TRIM?
TRIM is a SATA command which allows the operating system to tell the SSD which blocks of stored data are no longer needed, something the drive would otherwise not know. Once the SSD knows which blocks contain invalid/deleted data, it can erase them through its garbage collection process and make them ready for writing again. Having blocks ready for new writes allows the drive to write faster, as it doesn't have to go through the erase cycle before writing.
TRIM hasn't always been supported by SSDs, but most of the models on the market today will have support for it. It is still worth checking.
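One way to do that check from a running Linux system is sketched below. It relies on the kernel's sysfs attribute discard_granularity, where a value of 0 means the device does not support discard. The function name supports_trim, the device name sda and the optional second argument (an alternative sysfs root, handy for testing) are my own assumptions.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel reports discard (TRIM) support.
# $1 = device name such as sda; $2 = sysfs root (defaults to /sys).
supports_trim() {
    gran_file="${2:-/sys}/block/$1/queue/discard_granularity"
    [ -r "$gran_file" ] && [ "$(cat "$gran_file")" -gt 0 ]
}

if supports_trim sda; then
    echo "sda: TRIM supported"
else
    echo "sda: TRIM not reported"
fi
```

The same information is available from lsblk --discard /dev/sda (non-zero DISC-GRAN and DISC-MAX columns), or as root from hdparm -I /dev/sda, which lists a "Data Set Management TRIM supported" line on drives that have it.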
Over-Provisioning & Free User Space
On a drive with limited space the only space used for new writes is that which is not occupied by valid data. The TRIM command would be used to tell the drive to garbage collect and keep this space available. But a drive operates better when it has more space to work with.
Over time, as data is stored on the drive, certain blocks will reach their write limit and cease to function. SSDs (like HDDs) have extra capacity designed in to handle these losses. When a block fails, a new block is allocated to take its place. The new block is taken from the over-provisioned capacity.
Over-provision(%) = ((physical capacity – user capacity)/user capacity)*100
On the Intel 730 Series 480GB SSD, the physical capacity (based on the flash chips inside) is 528GB.
Over-provision(%) = ((528 – 480)/480)*100 = 10%
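The same calculation as a small shell sketch, in case you want to plug in the figures for a different drive. The function name overprovision_pct is hypothetical and both capacities are assumed to be in GB.

```shell
#!/bin/sh
# Hypothetical helper: over-provisioning percentage.
# $1 = physical capacity, $2 = user capacity, both in GB.
overprovision_pct() {
    awk -v phys="$1" -v user="$2" \
        'BEGIN { printf "%.1f\n", (phys - user) / user * 100 }'
}

# The Intel 730 Series 480GB figures from above.
overprovision_pct 528 480
```

For the 528GB/480GB figures this prints 10.0.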
This extra capacity is used for wear levelling, garbage collection and to help reduce write amplification. The more over-provisioned capacity you have the better, as the wear can be more evenly distributed and the more free blocks can be available for writing at any given time.
If more over-provisioning is desired it can be created by reducing the amount of the user storage space used for data. The formatted capacity of a 480GB drive is around 447GB. But instead of having all that 447GB in user accessible partitions, we could leave some space and only use 400GB as formatted storage. That unused 47GB is free space that the drive will use in addition to its over-provisioning pool. Because it holds no valid data, its blocks are simply seen as empty blocks ready for writing. This means that even with 400GB of data filling the drive there is plenty left for the wear levelling, garbage collection and to reduce write amplification.
It is rare, however, to fill up all the space on the drive. It might take many months/years of storing files, photos etc. to begin to use all of the space. Free blocks on the drive all look the same; it doesn't matter whether they come from manufacturer over-provisioning or formatted over-provisioning. Free space on the drive is a dynamic form of over-provisioning. It will reduce over time as the drive gets fuller, but until the drive is filled it can be used to spread the wear and improve performance. Having some free space is a good thing.
Why should I care about Partition Alignment?
Aligning the partitions on the disk can improve performance. If a block of filesystem data is spread across 2 blocks of flash, then both need to be read and written. If the file system block is aligned so that it fits cleanly inside a single block of flash only one block needs to be read and written. The links below explain this with some nice diagrams.
A Brief Pause
So after all that we know that SSD flash has a limited number of write cycles and there are things that can increase/decrease the number of writes performed. There are steps the manufacturer has taken to assist the drive and by setting up the drive carefully ourselves we can assist this further.
It took me the best part of 3 evenings' research to get to this point and I still wasn't sure what to store on my SSD. My concern was over how much data I would be writing to the drive. So it was time to look at how I use my existing system.
The use of journald (part of systemd) will cut down on log writes to the SSD. By default it synchronises to disk every 5 minutes, which cuts down on smaller writes. If, however, a message of priority CRIT, ALERT or EMERG is logged then it is synchronised immediately.
Journald settings may need to be optimised.
Journald has the option “ForwardToSyslog” enabled by default. If a syslog daemon (rsyslog/syslog-ng) is running then the log data could be written to disk twice.
Only using Journald would appear to be the better option for a workstation/desktop machine where you don't need the log forwarding capabilities of a syslog daemon.
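A minimal sketch of the journald settings discussed above, written as a shell function that simply prints a config fragment. ForwardToSyslog and SyncIntervalSec are real journald.conf options, but the values shown are only examples and the function name print_journald_settings is made up. Defaults differ between systemd versions, so check man journald.conf on your own release first.

```shell
#!/bin/sh
# Print a journald.conf fragment. The values are illustrative assumptions.
print_journald_settings() {
    cat <<'EOF'
[Journal]
# Do not hand every message to a syslog daemon as well.
ForwardToSyslog=no
# Flush to disk at this interval (CRIT and above still sync at once).
SyncIntervalSec=5m
EOF
}

print_journald_settings
```

The printed settings would be merged into /etc/systemd/journald.conf, followed by systemctl restart systemd-journald as root to apply them.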
Monitoring HDD/SSD I/O
Two command-line tools are available for monitoring I/O usage on your system.
iotop – shows current I/O usage by running processes
iostat – reports I/O statistics
Note: on Fedora, iostat is part of the sysstat package.
iostat -m outputs a report on I/O since the last reboot. MB_wrtn is the total MB written to a device since the last reboot. Run the command $ uptime to see how long the system has been running since then.
MB_wrtn/uptime_in_days = MB/day written.
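The same figure can be worked out without iostat, straight from the kernel counters. This is a rough sketch, assuming the device is called sda. It reads the "sectors written" column of /proc/diskstats (always counted in 512-byte units) and the uptime in seconds from /proc/uptime. The function name mb_per_day is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: average MB written per day.
# $1 = sectors written (512-byte units), $2 = uptime in seconds.
mb_per_day() {
    awk -v sectors="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", (sectors * 512 / 1048576) / (secs / 86400) }'
}

# Read the live counters for one device (sda is an assumption).
dev=sda
if [ -r /proc/diskstats ] && [ -r /proc/uptime ]; then
    sectors=$(awk -v d="$dev" '$3 == d { print $10 }' /proc/diskstats)
    uptime_secs=$(awk '{ print $1 }' /proc/uptime)
    if [ -n "$sectors" ]; then
        mb_per_day "$sectors" "$uptime_secs"
    fi
fi
```

Like iostat, this only sees what the OS has written since boot, so it is most meaningful after the machine has been up for several days of normal use.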
Intel 730 Series 480GB SSD specifications indicate the drive is designed for a minimum of 5 years useful life based upon up to 70GB of writes per day.
Current HDD I/O numbers
Based on my standard usage over 9 days, with the machine operating 24/7 I have written approximately 30GB of data to my primary hard drive. The drive contains the /, /boot, swap, /home, /var partitions. Other drives in the machine are for media storage only and contain no system files. This gives me a write usage of around 3.3GB per day.
Most of the 3.3GB per day usage is down to the podcasts that are downloaded by gPodder. These range in size from a few hundred MB to just over a GB. Putting such downloads on a HDD rather than an SSD would reduce the data written to the drive considerably.
Going by the numbers of the SSD write endurance and the current usage based upon a HDD I can expect the drive to stay well below the daily write threshold and shouldn't have to worry about wear.
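To put a rough number on that, the sketch below treats the rated figure (70GB per day for 5 years) as a total write budget and asks how long it would last at my measured rate. That is a simplification of my own: it ignores write amplification and is not how Intel states the specification. The function name years_to_rated_total is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: years until the rated write total is reached.
# $1 = rated GB/day, $2 = rated years, $3 = measured GB/day.
years_to_rated_total() {
    awk -v rated="$1" -v years="$2" -v actual="$3" \
        'BEGIN { printf "%.1f\n", rated * years / actual }'
}

# Rated 70GB/day for 5 years, measured 3.3GB/day.
years_to_rated_total 70 5 3.3
```

With my numbers this prints 106.1, so host writes alone are unlikely to be what retires the drive.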
What do we know now that we didn't before?
Well, the choice of system logger can have an effect. It happens that Journald is the Fedora default and it will serve the purpose I need it to.
SSD wear through normal OS usage should not be a concern. However, downloading large files to the SSD forms a large part of my writes. The I/O tools only monitor the writes from the OS; they do not factor in the write amplification that might occur internally on the SSD. However, given the drive specs and my level of usage, this isn't a concern.
So I know that I can use an SSD pretty much like I can my HDD.
How should I partition my SSD?
A Linux Questions discussion recommended:
- Put everything that benefits from fast access onto an SSD.
- On modern SSDs wear-out is no longer the main issue (it's size/price)
- Music/Video files don't benefit from fast access due to the nature of playback.
- Downloads are capped by LAN/Internet connection speed, not by the storage medium.
- Use Ext4 partitions.
Time for a revision of the partition layout that began this article.
We have learnt that using Journald reduces the write frequency of log files, so /var can go onto the SSD. It turns out that /tmp is often mounted as a RAM file system and hence doesn't write to the drive, so onto the SSD it goes. /home contains user configurations and documents. Why not speed up access to these? SSD it is then. / and /boot are the core system and mainly static files; putting these on the SSD means the system and programs will load faster.
Swap can go onto the HDD, with plenty of RAM I hardly use it. There have been occasions in the past where I have had a badly behaving program eat up the RAM and cause writes to swap. Placing swap onto the HDD protects the SSD from the effects of this should it occur again.
Data and downloads are limited more by the playback and download speeds. Given the larger sizes, putting them on a HDD makes sense. In addition, the podcasts can come in at a few GB per week and I delete them after watching, so using the HDD avoids that SSD wear.
Which File system?
Ext4 is a solid and reliable file system. I have used Ext2, Ext3 and Ext4 for years without any issues. There are alternative file systems out there which may, for certain workloads, be more efficient/faster. The newer file systems have not been around long enough for my liking to have been fully tested. Fedora 21, about to be released as I write this, will still default to Ext4. Btrfs is being looked at as a Fedora default for the future, but if they are not willing to commit to it as a default yet, then neither am I.
My /etc/fstab entries for this layout:
/dev/sda1 /boot ext4 defaults,noatime 1 2
/dev/sda2 / ext4 defaults,noatime 1 1
/dev/sda3 /home ext4 defaults,noatime 1 2
/dev/sda4 /var ext4 defaults,noatime 1 2
/dev/sdb1 none swap defaults 0 0
/dev/sdb2 /data ext4 defaults,noatime 1 2
/dev/sdb3 /downloads ext4 defaults,noatime 1 2
Note: /tmp is mounted as tmpfs by the systemd tmp.mount unit
Mounting with noatime is advised since nothing really relies on atime functionality. If you have a practical need for atime functionality, using the relatime option is advised. http://tytso.livejournal.com/61830.html
I am not using the discard mount option for the SSD in my case as I have a drive which has a good amount of manufacturer over-provisioning and I will be leaving unpartitioned space to add to this. A scheduled fstrim command perhaps once a week should suffice.
How to create/align the partitions?
The simplest way to both create and align the partitions is to create them with Gparted. When creating a partition you are given an alignment option with the choices of MiB (default), Cylinder, None. The MiB default is what we want: it aligns partitions to mebibyte (1,048,576 byte) boundaries.
To check an existing partition table on /dev/sda run:
$ fdisk -lu /dev/sda
The fdisk output will indicate non-aligned partitions. I have not investigated how to realign partitions as I don't have any misaligned ones.
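The check itself is simple arithmetic, so here is a small sketch of it: a partition is MiB-aligned when its start offset in bytes divides evenly by 1,048,576. The function name is_mib_aligned is made up, and the sector size defaults to 512 bytes, the unit that fdisk -lu reports on most drives.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a partition start is MiB-aligned.
# $1 = start sector, $2 = sector size in bytes (defaults to 512).
is_mib_aligned() {
    [ $(( $1 * ${2:-512} % 1048576 )) -eq 0 ]
}

# 63 is the old DOS-style start sector, 2048 the modern default.
for start in 63 2048; do
    if is_mib_aligned "$start"; then
        echo "start sector $start: aligned"
    else
        echo "start sector $start: NOT aligned"
    fi
done
```

The start sector of an existing partition can be taken from the Start column of the fdisk output or read from /sys/block/sda/sda1/start. parted offers a similar check with parted /dev/sda align-check optimal 1.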
Without the research process I would not even have considered I/O schedulers. The I/O scheduler determines the order in which read and write requests are sent to the physical drives. There is a difference between how HDD (rotational) and SSD (non-rotational) drives behave.
Three common schedulers keep showing up, CFQ, NOOP and DEADLINE.
CFQ (Completely Fair Queuing) - Tends to be the default scheduler and there is nothing wrong with it. It is however designed with the mechanical limitations of rotational media in mind. The distance between data on rotating media means that there can be a delay between wanting to read/write data and waiting for that area of the media to rotate within range of the drive's read/write heads. It only remains within range for a brief period before rotating out of range for the rest of the rotational cycle. SSDs don't have this limitation and schedulers with a different approach can be considered.
NOOP – Is simple, it passes along requests quickly without much thought. It is faster on SSDs than CFQ.
DEADLINE - A latency-orientated scheduler and a little more thoughtful than NOOP.
A Phoronix comparison on schedulers showed Deadline to be the fastest on an SSD. This confirms what is echoed elsewhere. DEADLINE it is for the SSD and CFQ for the HDD.
How do I set the I/O Scheduler?
This is done by changing udev rules; a Debian article has some information on this.
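A minimal sketch of what such udev rules could look like, matching the decision above (deadline for the SSD, cfq for the HDD). The rules key off the kernel's rotational flag rather than a fixed device name. The function name print_scheduler_rules is made up, and the scheduler names are the ones available on kernels of this era; newer multi-queue kernels use different names such as mq-deadline, bfq and none.

```shell
#!/bin/sh
# Print udev rules that choose a scheduler from the rotational flag.
# The scheduler names match the kernels discussed in this article.
print_scheduler_rules() {
    cat <<'EOF'
# Non-rotational (SSD): deadline
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"
# Rotational (HDD): cfq
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq"
EOF
}

print_scheduler_rules
```

The output would be saved as a file under /etc/udev/rules.d/ (a name such as 60-ioscheduler.rules is only a suggestion). The active scheduler for a drive is shown in square brackets by cat /sys/block/sda/queue/scheduler.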
How do I schedule the fstrim job?
Detail setting via cron and/or systemd.
Apply TRIM via a systemd service
The util-linux package provides fstrim.service and fstrim.timer systemd unit files. Enabling the timer will activate the service weekly, which will then trim all mounted filesystems on devices that support the discard operation.
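Enabling the timer comes down to two systemctl calls, sketched below. The wrapper function enable_weekly_trim and the DRY_RUN switch are my own additions so the commands can be previewed without touching the system; the unit name fstrim.timer is the one described above.

```shell
#!/bin/sh
# Enable and start the weekly trim timer (needs root).
# Setting DRY_RUN=echo prints the commands instead of running them.
enable_weekly_trim() {
    ${DRY_RUN:-} systemctl enable fstrim.timer
    ${DRY_RUN:-} systemctl start fstrim.timer
}

# Preview only; run enable_weekly_trim as root with DRY_RUN unset to apply.
DRY_RUN=echo
enable_weekly_trim
```

Once enabled, systemctl list-timers shows when the next run is due, and a one-off trim of a single filesystem can be done as root with fstrim -v /.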