summaryrefslogtreecommitdiff
path: root/fs/fs-writeback.c
AgeCommit message (Collapse)Author
2012-03-28Merge tag 'writeback-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull trivial writeback fixes from Wu Fengguang: "They've been tested in linux-next for 20 days actually." * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Remove outdated comment fs: Remove bogus wait in write_inode_now()
2012-03-24Merge tag 'module-for-3.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull cleanup of fs/ and lib/ users of module.h from Paul Gortmaker: "Fix up files in fs/ and lib/ dirs to only use module.h if they really need it. These are trivial in scope vs the work done previously. We now have things where any few remaining cleanups can be farmed out to arch or subsystem maintainers, and I have done so when possible. What is remaining here represents the bits that don't clearly lie within a single arch/subsystem boundary, like the fs dir and the lib dir. Some duplicate includes arising from overlapping fixes from independent subsystem maintainer submissions are also quashed." Fix up trivial conflicts due to clashes with other include file cleanups (including some due to the previous bug.h cleanup pull). * tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: lib: reduce the use of module.h wherever possible fs: reduce the use of module.h wherever possible includecheck: delete any duplicate instances of module.h
2012-03-21writeback: Remove outdated commentJan Kara
The comment is hopelessly outdated and misplaced. We no longer have 'bdi' part of writeback work, the comment about blockdev super is outdated, comment about throttling as well. Information about list handling is in more detail at queue_io(). So just move the bit about older_than_this to close to move_expired_inodes() and remove the rest. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-03-21fs: Remove bogus wait in write_inode_now()Jan Kara
inode_sync_wait() in write_inode_now() is just bogus. That function waits for I_SYNC bit to be cleared but writeback_single_inode() clears the bit on return so the wait is effectivelly a nop unless someone else submits the inode for writeback again. All the waiting write_inode_now() needs is achieved by using WB_SYNC_ALL writeback mode. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-03-20Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "It's indeed trivial -- mostly documentation updates and a bunch of typo fixes from Masanari. There are also several linux/version.h include removals from Jesper." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits) kcore: fix spelling in read_kcore() comment constify struct pci_dev * in obvious cases Revert "char: Fix typo in viotape.c" init: fix wording error in mm_init comment usb: gadget: Kconfig: fix typo for 'different' Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c" writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header writeback: fix typo in the writeback_control comment Documentation: Fix multiple typo in Documentation tpm_tis: fix tis_lock with respect to RCU Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c" Doc: Update numastat.txt qla4xxx: Add missing spaces to error messages compiler.h: Fix typo security: struct security_operations kerneldoc fix Documentation: broken URL in libata.tmpl Documentation: broken URL in filesystems.tmpl mtd: simplify return logic in do_map_probe() mm: fix comment typo of truncate_inode_pages_range power: bq27x00: Fix typos in comment ...
2012-03-07writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment headerFengguang Wu
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-28fs: reduce the use of module.h wherever possiblePaul Gortmaker
For files only using THIS_MODULE and/or EXPORT_SYMBOL, map them onto including export.h -- or if the file isn't even using those, then just delete the include. Fix up any implicit include dependencies that were being masked by module.h along the way. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-02-01writeback: fix NULL bdi->dev in trace writeback_single_inodeWu Fengguang
bdi_prune_sb() resets sb->s_bdi to default_backing_dev_info when the tearing down the original bdi. Fix trace_writeback_single_inode to use sb->s_bdi=default_backing_dev_info rather than bdi->dev=NULL for a teared down bdi. Cc: <stable@kernel.org> Reported-by: Rabin Vincent <rabin@rab.in> Tested-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2012-01-10Merge branch 'writeback-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c writeback: balanced_rate cannot exceed write bandwidth writeback: do strict bdi dirty_exceeded writeback: avoid tiny dirty poll intervals writeback: max, min and target dirty pause time writeback: dirty ratelimit - think time compensation btrfs: fix dirtied pages accounting on sub-page writes writeback: fix dirtied pages accounting on redirty writeback: fix dirtied pages accounting on sub-page writes writeback: charge leaked page dirties to active tasks writeback: Include all dirty inodes in background writeback
2012-01-08Merge branch 'pm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm * 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits) PM / Hibernate: Implement compat_ioctl for /dev/snapshot PM / Freezer: fix return value of freezable_schedule_timeout_killable() PM / shmobile: Allow the A4R domain to be turned off at run time PM / input / touchscreen: Make st1232 use device PM QoS constraints PM / QoS: Introduce dev_pm_qos_add_ancestor_request() PM / shmobile: Remove the stay_on flag from SH7372's PM domains PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode PM: Drop generic_subsys_pm_ops PM / Sleep: Remove forward-only callbacks from AMBA bus type PM / Sleep: Remove forward-only callbacks from platform bus type PM: Run the driver callback directly if the subsystem one is not there PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412. PM / Sleep: Merge internal functions in generic_ops.c PM / Sleep: Simplify generic system suspend callbacks PM / Hibernate: Remove deprecated hibernation snapshot ioctls PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled() ARM: S3C64XX: Implement basic power domain support PM / shmobile: Use common always on power domain governor ... Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused XBT_FORCE_SLEEP bit
2012-01-08writeback: move MIN_WRITEBACK_PAGES to fs-writeback.cWu Fengguang
Fix compile error fs/fs-writeback.c:515:33: error: ‘PAGE_CACHE_SHIFT’ undeclared (first use in this function) Reported-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2012-01-03fs: move code out of buffer.cAl Viro
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-12-25Merge branch 'pm-sleep' into pm-for-linusRafael J. Wysocki
* pm-sleep: (51 commits) PM: Drop generic_subsys_pm_ops PM / Sleep: Remove forward-only callbacks from AMBA bus type PM / Sleep: Remove forward-only callbacks from platform bus type PM: Run the driver callback directly if the subsystem one is not there PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers PM / Sleep: Merge internal functions in generic_ops.c PM / Sleep: Simplify generic system suspend callbacks PM / Hibernate: Remove deprecated hibernation snapshot ioctls PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled() PM / Sleep: Recommend [un]lock_system_sleep() over using pm_mutex directly PM / Sleep: Replace mutex_[un]lock(&pm_mutex) with [un]lock_system_sleep() PM / Sleep: Make [un]lock_system_sleep() generic PM / Sleep: Use the freezer_count() functions in [un]lock_system_sleep() APIs PM / Freezer: Remove the "userspace only" constraint from freezer[_do_not]_count() PM / Hibernate: Replace unintuitive 'if' condition in kernel/power/user.c with 'else' Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to block the freezer PM / Sleep: Unify diagnostic messages from device suspend/resume ACPI / PM: Do not save/restore NVS on Asus K54C/K54HR PM / Hibernate: Remove deprecated hibernation test modes PM / Hibernate: Thaw processes in SNAPSHOT_CREATE_IMAGE ioctl test path ... Conflicts: kernel/kmod.c
2011-12-21Merge branch 'master' into pm-sleepRafael J. Wysocki
* master: (848 commits) SELinux: Fix RCU deref check warning in sel_netport_insert() binary_sysctl(): fix memory leak mm/vmalloc.c: remove static declaration of va from __get_vm_area_node ipmi_watchdog: restore settings when BMC reset oom: fix integer overflow of points in oom_badness memcg: keep root group unchanged if creation fails nilfs2: potential integer overflow in nilfs_ioctl_clean_segments() nilfs2: unbreak compat ioctl cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask evm: prevent racing during tfm allocation evm: key must be set once during initialization mmc: vub300: fix type of firmware_rom_wait_states module parameter Revert "mmc: enable runtime PM by default" mmc: sdhci: remove "state" argument from sdhci_suspend_host x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT IB/qib: Correct sense on freectxts increment and decrement RDMA/cma: Verify private data length cgroups: fix a css_set not found bug in cgroup_attach_proc oprofile: Fix uninitialized memory access when writing to writing to oprofilefs Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel" ... Conflicts: kernel/cgroup_freezer.c
2011-12-18writeback: Include all dirty inodes in background writebackJan Kara
Current livelock avoidance code makes background work to include only inodes that were dirtied before background writeback has started. However background writeback can be running for a long time and thus excluding newly dirtied inodes can eventually exclude significant portion of dirty inodes making background writeback inefficient. Since background writeback avoids livelocking the flusher thread by yielding to any other work, there is no real reason why background work should not include all dirty inodes so change the logic in wb_writeback(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-12-18writeback: show writeback reason with __print_symbolicWu Fengguang
This makes the binary trace understandable by trace-cmd. CC: Dave Chinner <david@fromorbit.com> CC: Curt Wohlgemuth <curtw@google.com> CC: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-11-29writeback: Fix issue on make htmldocsMarcos Paulo de Souza
Document the @reason parameter to make "make htmldocs" happy. Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-11-21freezer: implement and use kthread_freezable_should_stop()Tejun Heo
Writeback and thinkpad_acpi have been using thaw_process() to prevent deadlock between the freezer and kthread_stop(); unfortunately, this is inherently racy - nothing prevents freezing from happening between thaw_process() and kthread_stop(). This patch implements kthread_freezable_should_stop() which enters refrigerator if necessary but is guaranteed to return if kthread_stop() is invoked. Both thaw_process() users are converted to use the new function. Note that this deadlock condition exists for many of freezable kthreads. They need to be converted to use the new should_stop or freezable workqueue. Tested with synthetic test case. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br> Cc: Jens Axboe <axboe@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com>
2011-10-31writeback: Add a 'reason' to wb_writeback_workCurt Wohlgemuth
This creates a new 'reason' field in a wb_writeback_work structure, which unambiguously identifies who initiates writeback activity. A 'wb_reason' enumeration has been added to writeback.h, to enumerate the possible reasons. The 'writeback_work_class' and tracepoint event class and 'writeback_queue_io' tracepoints are updated to include the symbolic 'reason' in all trace events. And the 'writeback_inodes_sbXXX' family of routines has had a wb_stats parameter added to them, so callers can specify why writeback is being started. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-31writeback: send work item to queue_io, move_expired_inodesCurt Wohlgemuth
Instead of sending ->older_than_this to queue_io() and move_expired_inodes(), send the entire wb_writeback_work structure. There are other fields of a work item that are useful in these routines and in tracepoints. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: per-bdi background thresholdWu Fengguang
One thing puzzled me is that in JBOD case, the per-disk writeout performance is smaller than the corresponding single-disk case even when they have comparable bdi_thresh. Tracing shows find that in single disk case, bdi_writeback is always kept high while in JBOD case, it could drop low from time to time and correspondingly bdi_reclaimable could sometimes rush high. The fix is to watch bdi_reclaimable and kick background writeback as soon as it goes high. This resembles the global background threshold but in per-bdi manner. The trick is, as long as bdi_reclaimable does not go high, bdi_writeback naturally won't go low because bdi_reclaimable+bdi_writeback ~= bdi_thresh. With less fluctuated writeback pages, JBOD performance is observed to increase noticeably in various cases. vmstat:nr_written values before/after patch: 3.1.0-rc4-wo-underrun+ 3.1.0-rc4-bgthresh3+ ------------------------ ------------------------ 125596480 +25.9% 158179363 JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X 61790815 +110.4% 130032231 JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X 58853546 -0.1% 58823828 JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X 110159811 +24.7% 137355377 JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X 69544762 +10.8% 77080047 JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X 50644862 +0.5% 50890006 JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X 42677090 +28.0% 54643527 JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X 47491324 +13.3% 53785605 JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X 52548986 +0.9% 53001031 JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X 26783091 +36.8% 36650248 JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X 35526347 +14.0% 40492312 JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X 44670723 -1.1% 44177606 JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X 127996037 +22.4% 156719990 JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X 57518856 +3.8% 59677625 JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X 51919909 +12.2% 58269894 JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X 86410514 +79.0% 154660433 JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X 40132519 +38.6% 55617893 JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X 48423248 +7.5% 52042927 JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X 206041046 +44.1% 296846536 JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X 72312903 -19.4% 58272885 JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X 50635672 -0.5% 50384787 JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X 68308534 +115.7% 147324758 JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X 57882933 +14.5% 66269621 JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X 52183472 +12.8% 58855181 JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X 53788956 +94.2% 104460352 JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X 44493342 +35.5% 60298210 JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X 42641209 +18.9% 50681038 JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: add bg_threshold parameter to __bdi_update_bandwidth()Wu Fengguang
No behavior change. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-31don't busy retry the inode on failed grab_super_passive()Wu Fengguang
This fixes a soft lockup on conditions a) the flusher is working on a work by __bdi_start_writeback(), while b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which grab sb->s_umount and enqueue a new work for the flusher to execute The s_umount grabbed by (b) will fail the grab_super_passive() in (a). Then if the inode is requeued, wb_writeback() will busy retry on it. As a result, wb_writeback() loops for ever without releasing wb->list_lock, which further blocks other tasks. Fix the busy loop by redirtying the inode. This may undesirably delay the writeback of the inode, however most likely it will be picked up soon by the queued work by writeback_inodes_sb*(), sync_inodes_sb() or even writeback_inodes_wb(). bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.html Reported-by: Christoph Hellwig <hch@infradead.org> Tested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits) mm: properly reflect task dirty limits in dirty_exceeded logic writeback: don't busy retry writeback on new/freeing inodes writeback: scale IO chunk size up to half device bandwidth writeback: trace global_dirty_state writeback: introduce max-pause and pass-good dirty limits writeback: introduce smoothed global dirty limit writeback: consolidate variable names in balance_dirty_pages() writeback: show bdi write bandwidth in debugfs writeback: bdi write bandwidth estimation writeback: account per-bdi accumulated written pages writeback: make writeback_control.nr_to_write straight writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr() writeback: trace event writeback_queue_io writeback: trace event writeback_single_inode writeback: remove .nonblocking and .encountered_congestion writeback: remove writeback_control.more_io writeback: skip balance_dirty_pages() for in-memory fs writeback: add bdi_dirty_limit() kernel-doc writeback: avoid extra sync work at enqueue time writeback: elevate queue_io() into wb_writeback() ... Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c
2011-07-24writeback: don't busy retry writeback on new/freeing inodesWu Fengguang
Fix a system hang bug introduced by commit b7a2441f9966 ("writeback: remove writeback_control.more_io") and e8dfc3058 ("writeback: elevate queue_io() into wb_writeback()") easily reproducible with high memory pressure and lots of file creation/deletions, for example, a kernel build in limited memory. It hangs when some inode is in the I_NEW, I_FREEING or I_WILL_FREE state, the flusher will get stuck busy retrying that inode, never releasing wb->list_lock. The lock in turn blocks all kinds of other tasks when they are trying to grab it. As put by Jan, it's a safe change regarding data integrity. I_FREEING or I_WILL_FREE inodes are written back by iput_final() and it is reclaim code that is responsible for eventually removing them. So writeback code can safely ignore them. I_NEW inodes should move out of this state when they are fully set up and in the writeback round following that, we will consider them for writeback. So the change makes sense. CC: Jan Kara <jack@suse.cz> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-20superblock: move pin_sb_for_writeback() to fs/super.cDave Chinner
The per-sb shrinker has the same requirement as the writeback threads of ensuring that the superblock is usable and pinned for the time it takes to run the work. Both need to take a passive reference to the sb, take a read lock on the s_umount lock and then only continue if an unmount is not in progress. pin_sb_for_writeback() does this exactly, so move it to fs/super.c and rename it to grab_super_passive() and exporting it via fs/internal.h for all the VFS code to be able to use. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-09writeback: scale IO chunk size up to half device bandwidthWu Fengguang
Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a concern of not holding I_SYNC for too long. (At least, that was the comment previously.) This doesn't make sense now because the only time we wait for I_SYNC is if we are calling sync or fsync, and in that case we need to write out all of the data anyway. Previously there may have been other code paths that waited on I_SYNC, but not any more. -- Theodore Ts'o So remove the MAX_WRITEBACK_PAGES constraint. The writeback pages will adapt to as large as the storage device can write within 500ms. XFS is observed to do IO completions in a batch, and the batch size is equal to the write chunk size. To avoid dirty pages to suddenly drop out of balance_dirty_pages()'s dirty control scope and create large fluctuations, the chunk size is also limited to half the control scope. The balance_dirty_pages() control scrope is [(background_thresh + dirty_thresh) / 2, dirty_thresh] which is by default [15%, 20%] of global dirty pages, whose range size is dirty_thresh / DIRTY_FULL_SCOPE. The adpative write chunk size will be rounded to the nearest 4MB boundary. http://bugzilla.kernel.org/show_bug.cgi?id=13930 CC: Theodore Ts'o <tytso@mit.edu> CC: Dave Chinner <david@fromorbit.com> CC: Chris Mason <chris.mason@oracle.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: introduce smoothed global dirty limitWu Fengguang
The start of a heavy weight application (ie. KVM) may instantly knock down determine_dirtyable_memory() if the swap is not enabled or full. global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi dirty thresholds that are _much_ lower than the global/bdi dirty pages. balance_dirty_pages() will then heavily throttle all dirtiers including the light ones, until the dirty pages drop below the new dirty thresholds. During this _deep_ dirty-exceeded state, the system may appear rather unresponsive to the users. About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty threshold to heavy dirtiers than light ones, and the dirty pages will be throttled around the heavy dirtiers' dirty threshold and reasonably below the light dirtiers' dirty threshold. In this state, only the heavy dirtiers will be throttled and the dirty pages are carefully controlled to not exceed the light dirtiers' dirty threshold. However if the threshold itself suddenly drops below the number of dirty pages, the light dirtiers will get heavily throttled. So introduce global_dirty_limit for tracking the global dirty threshold with policies - follow downwards slowly - follow up in one shot global_dirty_limit can effectively mask out the impact of sudden drop of dirtyable memory. It will be used in the next patch for two new type of dirty limits. Note that the new dirty limits are not going to avoid throttling the light dirtiers, but could limit their sleep time to 200ms. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: bdi write bandwidth estimationWu Fengguang
The estimation value will start from 100MB/s and adapt to the real bandwidth in seconds. It tries to update the bandwidth only when disk is fully utilized. Any inactive period of more than one second will be skipped. The estimated bandwidth will be reflecting how fast the device can writeout when _fully utilized_, and won't drop to 0 when it goes idle. The value will remain constant at disk idle time. At busy write time, if not considering fluctuations, it will also remain high unless be knocked down by possible concurrent reads that compete for the disk time and bandwidth with async writes. The estimation is not done purely in the flusher because there is no guarantee for write_cache_pages() to return timely to update bandwidth. The bdi->avg_write_bandwidth smoothing is very effective for filtering out sudden spikes, however may be a little biased in long term. The overheads are low because the bdi bandwidth update only occurs at 200ms intervals. The 200ms update interval is suitable, because it's not possible to get the real bandwidth for the instance at all, due to large fluctuations. The NFS commits can be as large as seconds worth of data. One XFS completion may be as large as half second worth of data if we are going to increase the write chunk to half second worth of data. In ext4, fluctuations with time period of around 5 seconds is observed. And there is another pattern of irregular periods of up to 20 seconds on SSD tests. That's why we are not only doing the estimation at 200ms intervals, but also averaging them over a period of 3 seconds and then go further to do another level of smoothing in avg_write_bandwidth. CC: Li Shaohua <shaohua.li@intel.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: make writeback_control.nr_to_write straightWu Fengguang
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(), and initialize the struct writeback_control there. struct writeback_control is basically designed to control writeback of a single file, but we keep abuse it for writing multiple files in writeback_sb_inodes() and its callers. It immediately clean things up, e.g. suddenly wbc.nr_to_write vs work->nr_pages starts to make sense, and instead of saving and restoring pages_skipped in writeback_sb_inodes it can always start with a clean zero value. It also makes a neat IO pattern change: large dirty files are now written in the full 4MB writeback chunk size, rather than whatever remained quota in wbc->nr_to_write. Acked-by: Jan Kara <jack@suse.cz> Proposed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: trace event writeback_queue_ioWu Fengguang
Note that it adds a little overheads to account the moved/enqueued inodes from b_dirty to b_io. The "moved" accounting may be later used to limit the number of inodes that can be moved in one shot, in order to keep spinlock hold time under control. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: trace event writeback_single_inodeWu Fengguang
It is valuable to know how the dirty inodes are iterated and their IO size. "writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0" - "state" reflects inode->i_state at the end of writeback_single_inode() - "index" reflects mapping->writeback_index after the ->writepages() call - "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode() - "wrote" is the number of pages actually written v2: add trace event writeback_single_inode_requeue as proposed by Dave. CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: remove writeback_control.more_ioWu Fengguang
When wbc.more_io was first introduced, it indicates whether there are at least one superblock whose s_more_io contains more IO work. Now with the per-bdi writeback, it can be replaced with a simple b_more_io test. Acked-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: avoid extra sync work at enqueue timeWu Fengguang
This removes writeback_control.wb_start and does more straightforward sync livelock prevention by setting .older_than_this to prevent extra inodes from being enqueued in the first place. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: elevate queue_io() into wb_writeback()Wu Fengguang
Code refactor for more logical code layout. No behavior change. - remove the mis-named __writeback_inodes_sb() - wb_writeback()/writeback_inodes_wb() will decide when to queue_io() before calling __writeback_inodes_wb() Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: split inode_wb_list_lock into bdi_writeback.list_lockChristoph Hellwig
Split the global inode_wb_list_lock into a per-bdi_writeback list_lock, as it's currently the most contended lock in the system for metadata heavy workloads. It won't help for single-filesystem workloads for which we'll need the I/O-less balance_dirty_pages, but at least we can dedicate a cpu to spinning on each bdi now for larger systems. Based on earlier patches from Nick Piggin and Dave Chinner. It reduces lock contentions to 1/4 in this test case: 10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram lock_stat version 0.3 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- vanilla 2.6.39-rc3: inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23 ------------------ inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85 inode_wb_list_lock 34 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49 inode_wb_list_lock 12893 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0 inode_wb_list_lock 10702 [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a ------------------ inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85 inode_wb_list_lock 19 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49 inode_wb_list_lock 5550 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0 inode_wb_list_lock 8511 [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157 2.6.39-rc3 + patch: &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37 ------------------------ &(&wb->list_lock)->rlock 10 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86 &(&wb->list_lock)->rlock 1493 [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150 &(&wb->list_lock)->rlock 3652 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f &(&wb->list_lock)->rlock 1412 [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223 ------------------------ &(&wb->list_lock)->rlock 3 [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b &(&wb->list_lock)->rlock 6 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86 &(&wb->list_lock)->rlock 2061 [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf &(&wb->list_lock)->rlock 2629 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: refill b_io iff emptyWu Fengguang
There is no point to carry different refill policies between for_kupdate and other type of works. Use a consistent "refill b_io iff empty" policy which can guarantee fairness in an easy to understand way. A b_io refill will setup a _fixed_ work set with all currently eligible inodes and start a new round of walk through b_io. The "fixed" work set means no new inodes will be added to the work set during the walk. Only when a complete walk over b_io is done, new inodes that are eligible at the time will be enqueued and the walk be started over. This procedure provides fairness among the inodes because it guarantees each inode to be synced once and only once at each round. So all inodes will be free from starvations. This change relies on wb_writeback() to keep retrying as long as we made some progress on cleaning some pages and/or inodes. Without that ability, the old logic on background works relies on aggressively queuing all eligible inodes into b_io at every time. But that's not a guarantee. The below test script completes a slightly faster now: 2.6.39-rc3 2.6.39-rc3-dyn-expire+ ------------------------------------------------ all elapsed 256.043 252.367 stddev 24.381 12.530 tar elapsed 30.097 28.808 dd elapsed 13.214 11.782 #!/bin/zsh cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/ umount /dev/sda7 mkfs.xfs -f /dev/sda7 mount /dev/sda7 /fs echo 3 > /proc/sys/vm/drop_caches tic=$(cat /proc/uptime|cut -d' ' -f2) cd /fs time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 & time dd if=/dev/zero of=/fs/zero bs=1M count=1000 & wait sync tac=$(cat /proc/uptime|cut -d' ' -f2) echo elapsed: $((tac - tic)) It maintains roughly the same small vs. large file writeout shares, and offers large files better chances to be written in nice 4M chunks. Analyzes from Dave Chinner in great details: Let's say we have lots of inodes with 100 dirty pages being created, and one large writeback going on. We expire 8 new inodes for every 1024 pages we write back. With the old code, we do: b_more_io (large inode) -> b_io (1l) 8 newly expired inodes -> b_io (1l, 8s) writeback large inode 1024 pages -> b_more_io b_more_io (large inode) -> b_io (8s, 1l) 8 newly expired inodes -> b_io (8s, 1l, 8s) writeback 8 small inodes 800 pages 1 large inode 224 pages -> b_more_io b_more_io (large inode) -> b_io (8s, 1l) 8 newly expired inodes -> b_io (8s, 1l, 8s) ..... Your new code: b_more_io (large inode) -> b_io (1l) 8 newly expired inodes -> b_io (1l, 8s) writeback large inode 1024 pages -> b_more_io (b_io == 8s) writeback 8 small inodes 800 pages b_io empty: (1800 pages written) b_more_io (large inode) -> b_io (1l) 14 newly expired inodes -> b_io (1l, 14s) writeback large inode 1024 pages -> b_more_io (b_io == 14s) writeback 10 small inodes 1000 pages 1 small inode 24 pages -> b_more_io (1l, 1s(24)) writeback 5 small inodes 500 pages b_io empty: (2548 pages written) b_more_io (large inode) -> b_io (1l, 1s(24)) 20 newly expired inodes -> b_io (1l, 1s(24), 20s) ...... Rough progression of pages written at b_io refill: Old code: total large file % of writeback 1024 224 21.9% (fixed) New code: total large file % of writeback 1800 1024 ~55% 2550 1024 ~40% 3050 1024 ~33% 3500 1024 ~29% 3950 1024 ~26% 4250 1024 ~24% 4500 1024 ~22.7% 4700 1024 ~21.7% 4800 1024 ~21.3% 4800 1024 ~21.3% (pretty much steady state from here) Ok, so the steady state is reached with a similar percentage of writeback to the large file as the existing code. Ok, that's good, but providing some evidence that is doesn't change the shared of writeback to the large should be in the commit message ;) The other advantage to this is that we always write 1024 page chunks to the large file, rather than smaller "whatever remains" chunks. CC: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: the kupdate expire timestamp should be a moving targetWu Fengguang
Dynamically compute the dirty expire timestamp at queue_io() time. writeback_control.older_than_this used to be determined at entrance to the kupdate writeback work. This _static_ timestamp may go stale if the kupdate work runs on and on. The flusher may then stuck with some old busy inodes, never considering newly expired inodes thereafter. This has two possible problems: - It is unfair for a large dirty inode to delay (for a long time) the writeback of small dirty inodes. - As time goes by, the large and busy dirty inode may contain only _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks delaying the expired dirty pages to the end of LRU lists, triggering the evil pageout(). Nevertheless this patch merely addresses part of the problem. v2: keep policy changes inside wb_writeback() and keep the wbc.older_than_this visibility as suggested by Dave. CC: Dave Chinner <david@fromorbit.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: try more writeback as long as something was writtenWu Fengguang
writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that they only populate possibly a subset of eligible inodes into b_io at entrance time. When the queued set of inodes are all synced, they just return, possibly with all queued inode pages written but still wbc.nr_to_write > 0. For kupdate and background writeback, there may be more eligible inodes sitting in b_dirty when the current set of b_io inodes are completed. So it is necessary to try another round of writeback as long as we made some progress in this round. When there are no more eligible inodes, no more inodes will be enqueued in queue_io(), hence nothing could/will be synced and we may safely bail. For example, imagine 100 inodes i0, i1, i2, ..., i90, i91, i99 At queue_io() time, i90-i99 happen to be expired and moved to s_io for IO. When finished successfully, if their total size is less than MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will quit the background work (w/o this patch) while it's still over background threshold. This will be a fairly normal/frequent case I guess. Now that we do tagged sync and update inode->dirtied_when after the sync, this change won't livelock sync(1). I actually tried to write 1 page per 1ms with this command write-and-fsync -n10000 -S 1000 -c 4096 /fs/test and do sync(1) at the same time. The sync completes quickly on ext4, xfs, btrfs. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: introduce writeback_control.inodes_writtenWu Fengguang
The flusher works on dirty inodes in batches, and may quit prematurely if the batch of inodes happen to be metadata-only dirtied: in this case wbc->nr_to_write won't be decreased at all, which stands for "no pages written" but also mis-interpreted as "no progress". So introduce writeback_control.inodes_written to count the inodes get cleaned from VFS POV. A non-zero value means there are some progress on writeback, in which case more writeback can be tried. Acked-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: update dirtied_when for synced inode to prevent livelockWu Fengguang
Explicitly update .dirtied_when on synced inodes, so that they are no longer considered for writeback in the next round. It can prevent both of the following livelock schemes: - while true; do echo data >> f; done - while true; do touch f; done (in theory) The exact livelock condition is, during sync(1): (1) no new inodes are dirtied (2) an inode being actively dirtied On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX. When finished, it will be redirty_tail()ed because it's still dirty and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when on condition (1). The sync work will then revisit it on the next queue_io() and find it eligible again because its old ->dirtied_when predates the sync work start time. We'll do more aggressive "keep writeback as long as we wrote something" logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") will no longer be enough to stop sync livelock. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stageWu Fengguang
sync(2) is performed in two stages: the WB_SYNC_NONE sync and the WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and do livelock prevention for it, too. Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance using page tagging") is a partial fix in that it only fixed the WB_SYNC_ALL phase livelock. Although ext4 is tested to no longer livelock with commit f446daaea9, it may due to some "redirty_tail() after pages_skipped" effect which is by no means a guarantee for _all_ the file systems. Note that writeback_inodes_sb() is called by not only sync(), they are treated the same because the other callers also need livelock prevention. Impact: It changes the order in which pages/inodes are synced to disk. Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode until finished with the current inode. Acked-by: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-05-27fs: pass exact type of data dirties to ->dirty_inodeChristoph Hellwig
Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or anything else, so that the filesystem can track internally if it needs to push out a transaction for fdatasync or not. This is just the prototype change with no user for it yet. I plan to push large XFS changes for the next merge window, and getting this trivial infrastructure in this window would help a lot to avoid tree interdependencies. Also remove incorrect comments that ->dirty_inode can't block. That has been changed a long time ago, and many implementations rely on it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-24fs: pull inode->i_lock up out of writeback_single_inodeDave Chinner
First thing we do in writeback_single_inode() is take the i_lock and the last thing we do is drop it. A caller already holds the i_lock, so pull the i_lock out of writeback_single_inode() to reduce the round trips on this lock during inode writeback. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-24fs: move i_wb_list out from under inode_lockDave Chinner
Protect the inode writeback list with a new global lock inode_wb_list_lock and use it to protect the list manipulations and traversals. This lock replaces the inode_lock as the inodes on the list can be validity checked while holding the inode->i_lock and hence the inode_lock is no longer needed to protect the list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-24fs: move i_sb_list out from under inode_lockDave Chinner
Protect the per-sb inode list with a new global lock inode_sb_list_lock and use it to protect the list manipulations and traversals. This lock replaces the inode_lock as the inodes on the list can be validity checked while holding the inode->i_lock and hence the inode_lock is no longer needed to protect the list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-24fs: protect inode->i_state with inode->i_lockDave Chinner
Protect inode state transitions and validity checks with the inode->i_lock. This enables us to make inode state transitions independently of the inode_lock and is the first step to peeling away the inode_lock from the code. This requires that __iget() is done atomically with i_state checks during list traversals so that we don't race with another thread marking the inode I_FREEING between the state check and grabbing the reference. Also remove the unlock_new_inode() memory barrier optimisation required to avoid taking the inode_lock when clearing I_NEW. Simplify the code by simply taking the inode->i_lock around the state change and wakeup. Because the wakeup is no longer tricky, remove the wake_up_inode() function and open code the wakeup where necessary. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-13fs/fs-writeback.c: fix sync_inodes_sb() return value kernel-docStefan Hajnoczi
The sync_inodes_sb() function does not have a return value. Remove the outdated documentation comment. Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13sync_inode_metadata: fix commentAndrew Morton
Use correct function name, remove incorrect apostrophe Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>