path: root/kernel
Age    Commit message    Author
2011-04-28next_pidmap: fix overflow conditionLinus Torvalds
commit c78193e9c7bcbf25b8237ad0dec82f805c4ea69b upstream. next_pidmap() just quietly accepted whatever 'last' pid that was passed in, which is not all that safe when one of the users is /proc. Admittedly the proc code should do some sanity checking on the range (and that will be the next commit), but that doesn't mean that the helper functions should just do that pidmap pointer arithmetic without checking the range of its arguments. So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1" doesn't really matter, the for-loop does check against the end of the pidmap array properly (it's only the actual pointer arithmetic overflow case we need to worry about, and going one bit beyond isn't going to overflow). [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ] Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com> Analyzed-by: Robert Święcki <robert@swiecki.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
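A minimal userspace sketch of the clamping idea, with an illustrative PID_MAX_LIMIT value and a hypothetical helper name (this is not the kernel's pid code): bound the caller-controlled index before it feeds pointer arithmetic.

```c
#include <stdio.h>

#define PID_MAX_LIMIT (4 * 1024 * 1024)        /* illustrative upper bound */

/* Clamp a caller-supplied 'last' pid before it is used for pointer
 * arithmetic, so an absurd value (e.g. fed in via /proc) cannot push the
 * computed map pointer out of range. */
static unsigned int clamp_last_pid(unsigned int last)
{
    return last < PID_MAX_LIMIT ? last : PID_MAX_LIMIT;
}

int main(void)
{
    printf("%u\n", clamp_last_pid(42u));          /* 42 */
    printf("%u\n", clamp_last_pid(0xffffffffu));  /* clamped to 4194304 */
    return 0;
}
```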
2011-03-31Relax si_code check in rt_sigqueueinfo and rt_tgsigqueueinfoRoland Dreier
[ upstream commit 243b422 ] Commit da48524eb206 ("Prevent rt_sigqueueinfo and rt_tgsigqueueinfo from spoofing the signal code") made the check on si_code too strict. There are several legitimate places where glibc wants to queue a negative si_code different from SI_QUEUE: - This was first noticed with glibc's aio implementation, which wants to queue a signal with si_code SI_ASYNCIO; the current kernel causes glibc's tst-aio4 test to fail because rt_sigqueueinfo() fails with EPERM. - Further examination of the glibc source shows that getaddrinfo_a() wants to use SI_ASYNCNL (which the kernel does not even define). The timer_create() fallback code wants to queue signals with SI_TIMER. As suggested by Oleg Nesterov <oleg@redhat.com>, loosen the check to forbid only the problematic SI_TKILL case. Reported-by: Klaus Dittrich <kladit@arcor.de> Acked-by: Julien Tinnes <jln@google.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: <stable@kernel.org> Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
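The relaxed policy reads naturally as a predicate. A standalone sketch of that rule follows (an illustration of the policy described above, not the kernel's source): any negative si_code is permitted except SI_TKILL.

```c
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>

/* Relaxed policy: userland may queue any negative si_code (SI_QUEUE,
 * SI_ASYNCIO, SI_TIMER, ...) but never SI_TKILL, which would let it spoof
 * the trusted pid/uid of a tkill(). */
static bool si_code_permitted(int si_code)
{
    return si_code < 0 && si_code != SI_TKILL;
}

int main(void)
{
    printf("SI_QUEUE   -> %d\n", si_code_permitted(SI_QUEUE));   /* 1 */
    printf("SI_ASYNCIO -> %d\n", si_code_permitted(SI_ASYNCIO)); /* 1 */
    printf("SI_TKILL   -> %d\n", si_code_permitted(SI_TKILL));   /* 0 */
    printf("SI_USER    -> %d\n", si_code_permitted(SI_USER));    /* 0 */
    return 0;
}
```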
2011-03-31perf: Fix tear-down of inherited group eventsPeter Zijlstra
[ upstream commit 38b435b16c36b0d863efcf3f07b34a6fac9873fd ] When destroying inherited events, we need to destroy groups too, otherwise the event iteration in perf_event_exit_task_context() will miss group siblings and we leak events with all the consequences. Reported-and-tested-by: Vince Weaver <vweaver1@eecs.utk.edu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: <stable@kernel.org> # .35+ LKML-Reference: <1300196470.2203.61.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-31PM / Hibernate: Make default image size depend on total RAM sizeRafael J. Wysocki
[ upstream commit ac5c24ec1e983313ef0015258fba6f630e54e7cf ] The default hibernation image size is currently hard-coded and equal to 500 MB, which is not a reasonable default on many contemporary systems. Make it equal to 2/5 of the total RAM size (this is slightly below the maximum, i.e. 1/2 of the total RAM size, and seems to be generally suitable). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Tested-by: M. Vefa Bicakci <bicave@superonline.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
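A quick userspace illustration of the new default, computed from sysinfo(); the kernel derives the same 2/5 fraction from its own page counts, so this is only a sketch of the arithmetic.

```c
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
    struct sysinfo si;

    if (sysinfo(&si) != 0)
        return 1;

    /* total RAM in bytes, then take 2/5 of it as the suggested image size */
    unsigned long long total = (unsigned long long)si.totalram * si.mem_unit;
    unsigned long long image = total / 5 * 2;

    printf("total RAM: %llu MB, default image size: %llu MB\n",
           total >> 20, image >> 20);
    return 0;
}
```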
2011-03-31PM / Hibernate: Improve comments in hibernate_preallocate_memory()Rafael J. Wysocki
[ upstream commit 266f1a25eff5ff98c498d7754a419aacfd88f71c ] One comment in hibernate_preallocate_memory() is wrong, so fix it and add one more comment to clarify the meaning of the fixed one. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2011-03-31perf: Better fit max unprivileged mlock pages for tools needsFrederic Weisbecker
commit 880f57318450dbead6a03f9e31a1468924d6dd88 upstream. The maximum amount of locked memory that an unprivileged user can reserve is 512 kB = 128 pages by default, scaled by the number of online CPUs, which fits well with tools that use 128 data pages by default. However, tools actually use 129 pages, because they need one more for the user control page. Thus the default mlock threshold is not sufficient for the default tools' needs, and we always end up evaluating the constant mlock rlimit policy, which doesn't scale with the number of online CPUs. Hence, on systems that have more than 16 CPUs, we exceed the rlimit threshold and fail to mmap: $ perf record ls Error: failed to mmap with 1 (Operation not permitted) Just increase the maximum unprivileged mlock threshold by one page so that it covers the perf tools' needs even beyond 16 CPUs. Reported-by: Han Pingtian <phan@redhat.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Stephane Eranian <eranian@google.com> LKML-Reference: <1300904979-5508-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
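A rough sketch of the arithmetic behind the one-page bump, assuming the per-CPU scaling described above (the real sysctl plumbing differs): perf maps 128 data pages plus one control page per buffer.

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page_kb = sysconf(_SC_PAGESIZE) / 1024;
    long cpus    = sysconf(_SC_NPROCESSORS_ONLN);

    /* perf tools mmap 128 data pages plus one user control page per buffer */
    long needed_per_cpu_kb = (128 + 1) * page_kb;

    /* the unprivileged budget is a per-CPU quota scaled by online CPUs;
     * the fix raises the quota from 128 to 129 pages */
    long old_budget_kb = 128 * page_kb * cpus;
    long new_budget_kb = (128 + 1) * page_kb * cpus;

    printf("perf needs %ld kB per cpu; old budget %ld kB, new budget %ld kB\n",
           needed_per_cpu_kb, old_budget_kb, new_budget_kb);
    return 0;
}
```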
2011-03-31Prevent rt_sigqueueinfo and rt_tgsigqueueinfo from spoofing the signal codeJulien Tinnes
commit da48524eb20662618854bb3df2db01fc65f3070c upstream. Userland should be able to trust the pid and uid of the sender of a signal if the si_code is SI_TKILL. Unfortunately, the kernel has historically allowed sigqueueinfo() to send any si_code at all (as long as it was negative - to distinguish it from kernel-generated signals like SIGILL etc.), so it could spoof a SI_TKILL with incorrect siginfo values. Happily, it looks like glibc has always set si_code to the appropriate SI_QUEUE, so there is probably no actual user code that ever uses anything but the appropriate SI_QUEUE flag. So just tighten the check for si_code (we used to allow any negative value), and add a (one-time) warning in case there are binaries out there that might depend on using other si_code values. Signed-off-by: Julien Tinnes <jln@google.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
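A small userspace probe of the resulting behaviour on kernels carrying this check (and the later relax above): SI_QUEUE is accepted, while a spoofed SI_TKILL is refused with EPERM. Behaviour may differ on newer kernels; this is a sketch only.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    siginfo_t info;

    signal(SIGUSR1, SIG_IGN);            /* don't die on our own signal */

    memset(&info, 0, sizeof(info));
    info.si_signo = SIGUSR1;
    info.si_code  = SI_QUEUE;            /* what glibc's sigqueue() uses */
    info.si_pid   = getpid();
    info.si_uid   = getuid();

    if (syscall(SYS_rt_sigqueueinfo, getpid(), SIGUSR1, &info) == 0)
        puts("SI_QUEUE: accepted");

    info.si_code = SI_TKILL;             /* spoof attempt */
    if (syscall(SYS_rt_sigqueueinfo, getpid(), SIGUSR1, &info) < 0)
        perror("SI_TKILL: rejected");    /* EPERM with these checks */

    return 0;
}
```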
2011-03-31smp_call_function_many: handle concurrent clearing of maskMilton Miller
commit 723aae25d5cdb09962901d36d526b44d4be1051c upstream. Mike Galbraith reported finding a lockup ("perma-spin bug") where the cpumask passed to smp_call_function_many was cleared by other cpu(s) while a cpu was preparing its call_data block, resulting in no cpu being left to clear the last ref and unlock the block. Having cpus clear their bit asynchronously could be useful on a mask of cpus that might have a translation context, or cpus that need a push to complete an rcu window. Instead of adding a BUG_ON and requiring yet another cpumask copy, just detect the race and handle it. Note: arch_send_call_function_ipi_mask must still handle an empty cpumask because the data block is globally visible before that arch callback is made. And (obviously) there are no guarantees as to which cpus are notified if the mask is changed during the call; only cpus that were online and had their mask bit set during the whole call are guaranteed to be called. Reported-by: Mike Galbraith <efault@gmx.de> Reported-by: Jan Beulich <JBeulich@novell.com> Acked-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2011-03-31call_function_many: add missing orderingMilton Miller
commit 45a5791920ae643eafc02e2eedef1a58e341b736 upstream. Paul McKenney's review pointed out two problems with the barriers in the 2.6.38 update to the smp call function many code. First, a barrier that would force the func and info members of data to be visible before their consumption in the interrupt handler was missing. This can be solved by adding an smp_wmb between setting the func and info members and setting the cpumask; this will pair with the existing and required smp_rmb ordering the cpumask read before the read of refs. This placement avoids the need for a second smp_rmb in the interrupt handler, which would be executed on each of the N cpus executing the call request. (I was thinking this barrier was present but it was not.) Second, the previous write to refs (establishing the zero that the interrupt handler was testing from all cpus) was performed by a third-party cpu. This would invoke transitivity which, as a recent or concurrent addition to memory-barriers.txt now explicitly states, would require a full smp_mb(). However, we know the cpumask will only be set by one cpu (the data owner) and any previous iteration of the mask would have been cleared by the reading cpu. By redundantly writing refs to 0 on the owning cpu before the smp_wmb, the write to refs will follow the same path as the writes that set the cpumask, which in turn allows us to keep the barrier in the interrupt handler an smp_rmb instead of promoting it to an smp_mb (which would be executed by N cpus for each of the possible M elements on the list). I moved and expanded the comment about our (ab)use of the rcu list primitives for the concurrent walk earlier into this function. I considered moving the first two paragraphs to the queue list head and lock, but felt it would have been too disconnected from the code. Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2011-03-31call_function_many: fix list delete vs add raceMilton Miller
commit e6cd1e07a185d5f9b0aa75e020df02d3c1c44940 upstream. Peter pointed out there was nothing preventing the list_del_rcu in smp_call_function_interrupt from running before the list_add_rcu in smp_call_function_many. Fix this by not setting refs until we have gotten the lock for the list. Take advantage of the wmb in list_add_rcu to save an explicit additional one. I tried to force this race with a udelay before the lock & list_add and by mixing all 64 online cpus with just 3 random cpus in the mask, but was unsuccessful. Still, inspection shows a valid race, and the fix is an extension of the existing protection window in the current code. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2011-03-31ftrace: Fix memory leak with function graph and cpu hotplugSteven Rostedt
commit 868baf07b1a259f5f3803c1dc2777b6c358f83cf upstream. When the function graph tracer starts, it needs to make a special stack for each task to save the real return values of the tasks. All running tasks have this stack created, as well as any new tasks. On CPU hot plug, the new idle task will allocate a stack as well when init_idle() is called. The problem is that cpu hotplug does not create a new idle_task. Instead it uses the idle task that existed when the cpu went down. ftrace_graph_init_task() will add a new ret_stack to the task that is given to it. Because a clone will make the task have a stack of its parent, it does not check whether the task's ret_stack is already NULL or not. When the CPU hotplug code starts a CPU up again, it will allocate a new stack even though one already existed for it. The solution is to treat the idle_task specially. In fact, the function_graph code already does, just not at init_idle(). Instead of using ftrace_graph_init_task() for the idle task (that function expects the task to be a clone), add a separate ftrace_graph_init_idle_task(). Also, we will create a per_cpu ret_stack that is used by the idle task. When we call ftrace_graph_init_idle_task() it will check if the idle task's ret_stack is NULL; if it is, it will assign it the per_cpu ret_stack. Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2011-03-31cpuset: add a missing unlock in cpuset_write_resmask()Li Zefan
commit b75f38d659e6fc747eda64cb72f3920e29dd44a4 upstream. Don't forget to release cgroup_mutex if alloc_trial_cpuset() fails. [akpm@linux-foundation.org: avoid multiple return points] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
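The bug class here is an error path that leaves a lock held. A userspace analogue with a pthread mutex and a single unlock exit (names are illustrative, not cpuset code):

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t group_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *alloc_trial(size_t n) { return malloc(n); }

static int write_resmask(size_t n)
{
    int ret = 0;
    void *trial;

    pthread_mutex_lock(&group_mutex);
    trial = alloc_trial(n);
    if (!trial) {
        ret = -ENOMEM;
        goto out_unlock;        /* with the bug, this path returned
                                   without releasing the mutex */
    }
    /* ... apply the new mask ... */
    free(trial);
out_unlock:
    pthread_mutex_unlock(&group_mutex);
    return ret;
}

int main(void)
{
    printf("write_resmask: %d\n", write_resmask(64));
    return 0;
}
```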
2011-03-31clockevents: Prevent oneshot mode when broadcast device is periodicThomas Gleixner
commit 3a142a0672b48a853f00af61f184c7341ac9c99d upstream. When the per cpu timer is marked CLOCK_EVT_FEAT_C3STOP, then we only can switch into oneshot mode, when the backup broadcast device supports oneshot mode as well. Otherwise we would try to switch the broadcast device into an unsupported mode unconditionally. This went unnoticed so far as the current available broadcast devices support oneshot mode. Seth unearthed this problem while debugging and working around an hpet related BIOS wreckage. Add the necessary check to tick_is_oneshot_available(). Reported-and-tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <alpine.LFD.2.00.1102252231200.2701@localhost6.localdomain6> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31genirq: Disable the SHIRQ_DEBUG call in request_threaded_irq for nowThomas Gleixner
commit 6d83f94db95cfe65d2a6359cccdf61cf087c2598 upstream. With CONFIG_SHIRQ_DEBUG=y we call a newly installed interrupt handler in request_threaded_irq(). The original implementation (commit a304e1b8) called the handler _BEFORE_ it was installed, but that caused problems with handlers calling disable_irq_nosync(). See commit 377bf1e4. It's braindead in the first place to call disable_irq_nosync in shared handlers, but .... Moving this call after we installed the handler looks innocent, but it is very subtly broken on SMP. Interrupt handlers rely on the fact that the irq core prevents reentrancy. Now this debug call violates that promise because we run the handler w/o the IRQ_INPROGRESS protection - which we cannot apply here because that would result in a possibly forever masked interrupt line. A concurrent real hardware interrupt on a different CPU results in handler reentrancy and can lead to complete wreckage, which was unfortunately observed in reality and took a fricking long time to debug. Leave the code here for now. We want this debug feature, but that's not easy to fix. We really should get rid of those disable_irq_nosync() abusers and remove that function completely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Anton Vorontsov <avorontsov@ru.mvista.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31PM / Hibernate: Return error code when alloc_image_page() failsStanislaw Gruszka
commit 2e725a065b0153f0c449318da1923a120477633d upstream. Currently we return 0 in swsusp_alloc() when alloc_image_page() fails. Fix that. Also remove unneeded "error" variable since the only useful value of error is -ENOMEM. [rjw: Fixed up the changelog and changed subject.] Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
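The shape of the fix, sketched with hypothetical names in plain C: an allocation loop must propagate -ENOMEM to its caller rather than report success.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate 'nr' pages; on failure, undo partial work and report -ENOMEM
 * (the bug pattern was returning 0 here regardless). Illustrative only. */
static int alloc_pages_array(void **pages, int nr)
{
    for (int i = 0; i < nr; i++) {
        pages[i] = malloc(4096);
        if (!pages[i]) {
            while (i--)
                free(pages[i]);
            return -ENOMEM;
        }
    }
    return 0;
}

int main(void)
{
    void *pages[4];
    int ret = alloc_pages_array(pages, 4);

    printf("alloc: %d\n", ret);
    if (ret == 0)
        for (int i = 0; i < 4; i++)
            free(pages[i]);
    return 0;
}
```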
2011-03-31CRED: Fix memory and refcount leaks upon security_prepare_creds() failureTetsuo Handa
commit fb2b2a1d37f80cc818fd4487b510f4e11816e5e1 upstream. In prepare_kernel_cred() since 2.6.29, put_cred(new) is called without assigning new->usage when security_prepare_creds() returned an error. As a result, memory for new and refcount for new->{user,group_info,tgcred} are leaked because put_cred(new) won't call __put_cred() unless old->usage == 1. Fix these leaks by assigning new->usage (and new->subscribers which was added in 2.6.32) before calling security_prepare_creds(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
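An illustration of the ordering rule behind this fix, using an invented cred-like structure: initialise the reference count before the first call that can fail, so the error path's put actually frees.

```c
#include <stdio.h>
#include <stdlib.h>

struct cred_like {
    int usage;
    void *security;
};

static void put_obj(struct cred_like *c)
{
    if (--c->usage == 0) {      /* only drops to zero if usage was set */
        free(c->security);
        free(c);
    }
}

static struct cred_like *prepare_obj(int fail)
{
    struct cred_like *c = calloc(1, sizeof(*c));
    if (!c)
        return NULL;
    c->usage = 1;               /* set BEFORE the fallible step */
    if (fail) {                 /* stands in for security_prepare_creds() */
        put_obj(c);             /* now actually frees instead of leaking */
        return NULL;
    }
    return c;
}

int main(void)
{
    struct cred_like *c = prepare_obj(0);
    printf("prepared: %p\n", (void *)c);
    if (c)
        put_obj(c);
    prepare_obj(1);             /* failure path, no leak */
    return 0;
}
```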
2011-03-31CRED: Fix BUG() upon security_cred_alloc_blank() failureTetsuo Handa
commit 2edeaa34a6e3f2c43b667f6c4f7b27944b811695 upstream. In cred_alloc_blank() since 2.6.32, abort_creds(new) is called with new->security == NULL and new->magic == 0 when security_cred_alloc_blank() returns an error. As a result, BUG() will be triggered if SELinux is enabled or CONFIG_DEBUG_CREDENTIALS=y. If CONFIG_DEBUG_CREDENTIALS=y, BUG() is called from __invalid_creds() because cred->magic == 0. Failing that, BUG() is called from selinux_cred_free() because selinux_cred_free() is not expecting cred->security == NULL. This does not affect smack_cred_free(), tomoyo_cred_free() or apparmor_cred_free(). Fix these bugs by (1) Set new->magic before calling security_cred_alloc_blank(). (2) Handle null cred->security in creds_are_invalid() and selinux_cred_free(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2011-03-31kernel/user.c: add lock release annotation on free_user()Namhyung Kim
commit 571428be550fbe37160596995e96ad398873fcbd upstream. free_user() releases uidhash_lock but was missing annotation. Add it. This removes following sparse warnings: include/linux/spinlock.h:339:9: warning: context imbalance in 'free_user' - unexpected unlock kernel/user.c:120:6: warning: context imbalance in 'free_uid' - wrong count at exit Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
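A sketch of what the sparse annotation expresses, with a userspace stand-in for uidhash_lock; the __releases() fallback here mirrors the kernel's compiler.h convention but is an assumption of this example.

```c
/* Declare that a function releases a lock it did not take, so sparse's
 * context tracking balances; the empty fallback keeps plain builds happy. */
#ifdef __CHECKER__
# define __releases(x) __attribute__((context(x, 1, 0)))
#else
# define __releases(x)
#endif

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t uidhash_lock = PTHREAD_MUTEX_INITIALIZER;

static void free_user_like(void) __releases(&uidhash_lock);

static void free_user_like(void)
{
    /* ... tear down the object while still holding the lock ... */
    pthread_mutex_unlock(&uidhash_lock);
}

int main(void)
{
    pthread_mutex_lock(&uidhash_lock);
    free_user_like();
    puts("lock released by callee");
    return 0;
}
```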
2011-03-31sched: Use group weight, idle cpu metrics to fix imbalances during idleSuresh Siddha
Commit: aae6d3ddd8b90f5b2c8d79a2b914d1706d124193 upstream Currently we consider a sched domain to be well balanced when the imbalance is less than the domain's imbalance_pct. As the number of cores and threads are increasing, current values of imbalance_pct (for example 25% for a NUMA domain) are not enough to detect imbalances like: a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads), 24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another socket. Leading to an idle HT cpu. b) On a hypothetical 2 socket NHM-EX system (each socket having 8 cores and 16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one socket and 7 on another socket. Leaving one core in a socket idle whereas in another socket we have a core having both its HT siblings busy. While this issue can be fixed by decreasing the domain's imbalance_pct (by making it a function of number of logical cpus in the domain), it can potentially cause more task migrations across sched groups in an overloaded case. Fix this by using imbalance_pct only during newly_idle and busy load balancing. And during idle load balancing, check if there is an imbalance in the number of idle cpus across the busiest and this sched_group or if the busiest group has more tasks than its weight that the idle cpu in this_group can pull. Reported-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched, cgroup: Fixup broken cgroup movementPeter Zijlstra
Commit: b2b5ce022acf5e9f52f7b78c5579994fdde191d4 upstream Dima noticed that we fail to correct the ->vruntime of sleeping tasks when we move them between cgroups. Reported-by: Dima Zavin <dima@android.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> Tested-by: Mike Galbraith <efault@gmx.de> LKML-Reference: <1287150604.29097.1513.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Export account_system_vtime()Ingo Molnar
Commit: b7dadc38797584f6203386da1947ed5edf516646 upstream KVM uses it for example: ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined! Cc: Venkatesh Pallipadi <venki@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2011-03-31sched: Call tick_check_idle before __irq_enterVenkatesh Pallipadi
Commit: d267f87fb8179c6dba03d08b91952e81bc3723c7 upstream When CPU is idle and on first interrupt, irq_enter calls tick_check_idle() to notify interruption from idle. But, there is a problem if this call is done after __irq_enter, as all routines in __irq_enter may find stale time due to yet to be done tick_check_idle. Specifically, trace calls in __irq_enter when they use global clock and also account_system_vtime change in this patch as it wants to use sched_clock_cpu() to do proper irq timing. But, tick_check_idle was moved after __irq_enter intentionally to prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a: irq: call __irq_enter() before calling the tick_idle_check Impact: avoid spurious ksoftirqd wakeups Moving tick_check_idle() before __irq_enter and wrapping it with local_bh_enable/disable would solve both the problems. Fixed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Remove irq time from available CPU powerVenkatesh Pallipadi
Commit: aa483808516ca5cacfa0e5849691f64fec25828e upstream The idea was suggested by Peter Zijlstra here: http://marc.info/?l=linux-kernel&m=127476934517534&w=2 irq time is technically not available to the tasks running on the CPU. This patch removes irq time from CPU power, piggybacking on sched_rt_avg_update(). Tested this by keeping CPU X busy with a network intensive task having 75% of a single CPU in irq processing (hard+soft) on a 4-way system. And start seven cycle soakers on the system. Without this change, there will be two tasks on each CPU. With this change, there is a single task on irq busy CPU X and the remaining 7 tasks are spread around among the other 3 CPUs. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Do not account irq time to current taskVenkatesh Pallipadi
Commit: 305e6835e05513406fa12820e40e4a8ecb63743c upstream Scheduler accounts both softirq and interrupt processing times to the currently running task. This means, if the interrupt processing was for some other task in the system, then the current task ends up being penalized as it gets shorter runtime than otherwise. Change sched task accounting to account only actual task time from the currently running task. Now update_curr() modifies the delta_exec to depend on rq->clock_task. Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But that's for later. This change will impact scheduling behavior in interrupt heavy conditions. Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2 spending 75%+ of its time in irq processing and CPU 3 spending around 35% of its time running the nc task. Now, if I run another CPU intensive task on CPU 2, without this change /proc/<pid>/schedstat shows 100% of time accounted to this task. With this change, it rightly shows less than 25% accounted to this task, as the remaining time is actually spent on irq processing. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq timeVenkatesh Pallipadi
Commit: b52bfee445d315549d41eacf2fa7c156e7d153d5 upstream s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does the fine granularity accounting of user, system, hardirq, softirq times. Adding that option on archs like x86 will be challenging however, given the state of TSC reliability on various platforms and also the overhead it will add in syscall entry exit. Instead, add a lighter variant that only does finer accounting of hardirq and softirq times, providing precise irq times (instead of timer tick based samples). This accounting is added with a new config option CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not interested in paying the perf penalty. This accounting is based on sched_clock, with the code being generic. So, other archs may find it useful as well. This patch just adds the core logic and does not enable this logic yet. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Add a PF flag for ksoftirqd identificationVenkatesh Pallipadi
Commit: 6cdd5199daf0cb7b0fcc8dca941af08492612887 upstream To account softirq time cleanly in scheduler, we need to identify whether softirq is invoked in ksoftirqd context or softirq at hardirq tail context. Add PF_KSOFTIRQD for that purpose. As all PF flag bits are currently taken, create space by moving one of the infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along with some other state fields. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Fix softirq time accountingVenkatesh Pallipadi
Commit: 75e1056f5c57050415b64cb761a3acc35d91f013 upstream Peter Zijlstra found a bug in the way softirq time is accounted in VIRT_CPU_ACCOUNTING on this thread: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html The problem is, softirq processing uses local_bh_disable internally. There is no way, later in the flow, to differentiate between whether a softirq is being processed or bh has merely been disabled. So, a hardirq when bh is disabled results in time being wrongly accounted as softirq. Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING as well, as account_system_time() in normal tick-based accounting also uses softirq_count, which will be set even when not in softirq with bh disabled. Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq count for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq processing. The patch below does that and adds an API in_serving_softirq() which returns whether we are currently processing softirq or not. It also changes one of the usages of softirq_count in net/sched/cls_cgroup.c to in_serving_softirq. It looks like many usages of in_softirq really want in_serving_softirq. Those changes can be made individually on a case by case basis. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
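A toy model of the distinction this patch introduces (an illustration, not the kernel's preempt_count code): bh-disable raises the softirq count by 2*SOFTIRQ_OFFSET, serving a softirq adds a single SOFTIRQ_OFFSET, and in_serving_softirq() tests that low bit.

```c
#include <stdio.h>

#define SOFTIRQ_OFFSET         0x100u
#define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET)

static unsigned int count;      /* stand-in for the per-task preempt count */

static int in_serving_softirq(void) { return count & SOFTIRQ_OFFSET; }

int main(void)
{
    count += SOFTIRQ_DISABLE_OFFSET;            /* local_bh_disable() */
    printf("bh disabled, serving? %d\n", !!in_serving_softirq());   /* 0 */

    count += SOFTIRQ_OFFSET;                    /* enter softirq handler */
    printf("in handler,  serving? %d\n", !!in_serving_softirq());   /* 1 */

    count -= SOFTIRQ_OFFSET;                    /* leave handler */
    count -= SOFTIRQ_DISABLE_OFFSET;            /* local_bh_enable() */
    return 0;
}
```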
2011-03-31sched: Drop group_capacity to 1 only if local group has extra capacityNikhil Rao
Commit: 75dd321d79d495a0ee579e6249ebc38ddbb2667f upstream When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1 only if the local group has extra capacity. The extra check prevents the case where you always pull from the heaviest group when it is already under-utilized (possible with a large weight task outweighs the tasks on the system). For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task, and each task is running on one core. In this case, we observe the following events when balancing at the NUMA domain: - find_busiest_group() will always pick the sched group containing the niced task to be the busiest group. - find_busiest_queue() will then always pick one of the cpus running the nice0 task (never picks the cpu with the nice -15 task since weighted_cpuload > imbalance). - The load balancer fails to migrate the task since it is the running task and increments sd->nr_balance_failed. - It repeats the above steps a few more times until sd->nr_balance_failed > 5, at which point it kicks off the active load balancer, wakes up the migration thread and kicks the nice 0 task off the cpu. The load balancer doesn't stop until we kick out all nice 0 tasks from the sched group, leaving you with 3 idle cpus and one cpu running the nice -15 task. When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are not relevant because the niced task has a very large weight. In this patch, we add an extra condition to the "if(prefer_sibling)" check in update_sd_lb_stats(). We drop the capacity of a group only if the local group has extra capacity, ie. nr_running < group_capacity. This patch preserves the original intent of the prefer_siblings check (to spread tasks across the system in low utilization scenarios) and fixes the case above. It helps in the following ways: - In low utilization cases (where nr_tasks << nr_cpus), we still drop group_capacity down to 1 if we prefer siblings. - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most likely be > sgs.group_capacity. - When balancing large weight tasks, if the local group does not have extra capacity, we do not pick the group with the niced task as the busiest group. This prevents failed balances, active migration and the under-utilization described above. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Force balancing on newidle balance if local group has capacityNikhil Rao
Commit: fab476228ba37907ad75216d0fd9732ada9c119e upstream This patch forces a load balance on a newly idle cpu when the local group has extra capacity and the busiest group does not have any. It improves system utilization when balancing tasks with a large weight differential. Under certain situations, such as a niced down task (i.e. nice = -15) in the presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and kicks away other tasks because of its large weight. This leads to sub-optimal utilization of the machine. Even though the sched group has capacity, it does not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL. With this patch, if the local group has extra capacity, we shortcut the checks in f_b_g() and try to pull a task over. A sched group has extra capacity if the group capacity is greater than the number of running tasks in that group. Thanks to Mike Galbraith for discussions leading to this patch and for the insight to reuse SD_NEWIDLE_BALANCE. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
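A plain-C sketch of the "extra capacity" notion used by this and the previous patch (illustrative structures, not scheduler code): a group has spare room when it runs fewer tasks than its capacity, and a newly idle cpu in such a group may pull even when the usual checks would bail out.

```c
#include <stdio.h>

struct group_stats {
    unsigned int nr_running;
    unsigned int group_capacity;
};

static int has_extra_capacity(const struct group_stats *g)
{
    return g->nr_running < g->group_capacity;
}

int main(void)
{
    struct group_stats local   = { .nr_running = 1, .group_capacity = 4 };
    struct group_stats busiest = { .nr_running = 4, .group_capacity = 4 };

    if (has_extra_capacity(&local) && !has_extra_capacity(&busiest))
        puts("force newidle balance: pull a task into the local group");
    return 0;
}
```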
2011-03-31sched: Set group_imb only if a task can be pulled from the busiest cpuNikhil Rao
Commit: 2582f0eba54066b5e98ff2b27ef0cfa833b59f54 upstream When cycling through sched groups to determine the busiest group, set group_imb only if the busiest cpu has more than 1 runnable task. This patch fixes the case where two cpus in a group have one runnable task each, but there is a large weight differential between these two tasks. The load balancer is unable to migrate any task from this group, and hence does not consider this group to be imbalanced. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com> [ small code readability edits ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Do not consider SCHED_IDLE tasks to be cache hotNikhil Rao
Commit: ef8002f6848236de5adc613063ebeabddea8a6fb upstream This patch adds a check in task_hot to return if the task has SCHED_IDLE policy. SCHED_IDLE tasks have very low weight, and when run with regular workloads, are typically scheduled many milliseconds apart. There is no need to consider these tasks hot for load balancing. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: suppress RCU lockdep splat in task_fork_fairPaul E. McKenney
Commit: b0a0f667a349247bd7f05f806b662a25653822bc upstream > =================================================== > [ INFO: suspicious rcu_dereference_check() usage. ] > --------------------------------------------------- > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection! > > other info that might help us debug this: > > rcu_scheduler_active = 1, debug_locks = 1 > 1 lock held by ifup/23517: > #0: (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108 > > stack backtrace: > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5 > Call Trace: > [<c075e219>] ? printk+0xf/0x16 > [<c0455842>] lockdep_rcu_dereference+0x74/0x7d > [<c0426854>] task_group+0x6d/0x79 > [<c042686e>] set_task_rq+0xe/0x57 > [<c042f79e>] task_fork_fair+0x57/0x108 > [<c042e965>] sched_fork+0x82/0xf9 > [<c04334b3>] copy_process+0x569/0xe8e > [<c0433ef0>] do_fork+0x118/0x262 > [<c076302f>] ? do_page_fault+0x16a/0x2cf > [<c044b80c>] ? up_read+0x16/0x2a > [<c04085ae>] sys_clone+0x1b/0x20 > [<c04030a5>] ptregs_clone+0x15/0x30 > [<c0402f1c>] ? sysenter_do_call+0x12/0x38 Here a newly created task is having its runqueue assigned. The new task is not yet on the tasklist, so cannot go away. This is therefore a false positive, suppress with an RCU read-side critical section. Reported-by: Ben Greear <greearb@candelatech.com Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Tested-by: Ben Greear <greearb@candelatech.com Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Give CPU bound RT tasks preferencestable-bot for Steven Rostedt
From: Steven Rostedt <srostedt@redhat.com> Commit: b3bc211cfe7d5fe94b310480d78e00bea96fbf2a upstream If a high priority task is waking up on a CPU that is running a lower priority task that is bound to a CPU, see if we can move the high RT task to another CPU first. Note, if all other CPUs are running higher priority tasks than the CPU-bound current task, then it will be preempted regardless. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Gregory Haskins <ghaskins@novell.com> LKML-Reference: <20100921024138.888922071@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Try not to migrate higher priority RT tasksSteven Rostedt
Commit: 43fa5460fe60dea5c610490a1d263415419c60f6 upstream When first working on the RT scheduler design, we concentrated on keeping all CPUs running RT tasks instead of having multiple RT tasks on a single CPU waiting for the migration thread to move them. Instead we take a more proactive stance and push or pull RT tasks from one CPU to another on wakeup or scheduling. When an RT task wakes up on a CPU that is running another RT task, instead of preempting it and killing the cache of the running RT task, we look to see if we can migrate the RT task that is waking up, even if the RT task waking up is of higher priority. This may sound a bit odd, but RT tasks should be limited in migration by the user anyway. But in practice, people do not do this, which causes high prio RT tasks to bounce around the CPUs. This becomes even worse when we have priority inheritance, because a high prio task can block on a lower prio task and boost its priority. When the lower prio task wakes up the high prio task, if it happens to be on the same CPU it will migrate off of it. But in reality, the above does not happen much either, because the wake up of the lower prio task, which has already been boosted, if it was on the same CPU as the higher prio task, it would then migrate off of it. But anyway, we do not want to migrate them either. To examine the scheduling, I created a test program and examined it under kernelshark. The test program created CPU * 2 threads, where each thread had a different priority. The program takes different options. The options used in this change log was to have priority inheritance mutexes or not. All threads did the following loop: static void grab_lock(long id, int iter, int l) { ftrace_write("thread %ld iter %d, taking lock %d\n", id, iter, l); pthread_mutex_lock(&locks[l]); ftrace_write("thread %ld iter %d, took lock %d\n", id, iter, l); busy_loop(nr_tasks - id); ftrace_write("thread %ld iter %d, unlock lock %d\n", id, iter, l); pthread_mutex_unlock(&locks[l]); } void *start_task(void *id) { [...] while (!done) { for (l = 0; l < nr_locks; l++) { grab_lock(id, i, l); ftrace_write("thread %ld iter %d sleeping\n", id, i); ms_sleep(id); } i++; } [...] } The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes to the ftrace buffer to help analyze via ftrace. The higher the id, the higher the prio, the shorter it does the busy loop, but the longer it spins. This is usually the case with RT tasks, the lower priority tasks usually run longer than higher priority tasks. At the end of the test, it records the number of loops each thread took, as well as the number of voluntary preemptions, non-voluntary preemptions, and number of migrations each thread took, taking the information from /proc/$$/sched and /proc/$$/status. Running this on a 4 CPU processor, the results without changes to the kernel looked like this: Task vol nonvol migrated iterations Signed-off-by: Andi Kleen <ak@linux.intel.com> ---- --- ------ -------- ---------- 0: 53 3220 1470 98 1: 562 773 724 98 2: 752 933 1375 98 3: 749 39 697 98 4: 758 5 515 98 5: 764 2 679 99 6: 761 2 535 99 7: 757 3 346 99 total: 5156 4977 6341 787 Each thread regardless of priority migrated a few hundred times. The higher priority tasks, were a little better but still took quite an impact. 
By letting higher priority tasks bump the lower prio task from the CPU, things changed a bit: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 37 2835 1937 98 1: 666 1821 1865 98 2: 654 1003 1385 98 3: 664 635 973 99 4: 698 197 352 99 5: 703 101 159 99 6: 708 1 75 99 7: 713 1 2 99 total: 4843 6594 6748 789 The total # of migrations did not change (several runs showed the difference all within the noise). But we now see a dramatic improvement to the higher priority tasks. (kernelshark showed that the watchdog timer bumped the highest priority task to give it the 2 count. This was actually consistent with every run). Notice that the # of iterations did not change either. The above was with priority inheritance mutexes. That is, when the higher prority task blocked on a lower priority task, the lower priority task would inherit the higher priority task (which shows why task 6 was bumped so many times). When not using priority inheritance mutexes, the current kernel shows this: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 56 3101 1892 95 1: 594 713 937 95 2: 625 188 618 95 3: 628 4 491 96 4: 640 7 468 96 5: 631 2 501 96 6: 641 1 466 96 7: 643 2 497 96 total: 4458 4018 5870 765 Not much changed with or without priority inheritance mutexes. But if we let the high priority task bump lower priority tasks on wakeup we see: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 115 3439 2782 98 1: 633 1354 1583 99 2: 652 919 1218 99 3: 645 713 934 99 4: 690 3 3 99 5: 694 1 4 99 6: 720 3 4 99 7: 747 0 1 100 Which shows a even bigger change. The big difference between task 3 and task 4 is because we have only 4 CPUs on the machine, causing the 4 highest prio tasks to always have preference. Although I did not measure cache misses, and I'm sure there would be little to measure since the test was not data intensive, I could imagine large improvements for higher priority tasks when dealing with lower priority tasks. Thus, I'm satisfied with making the change and agreeing with what Gregory Haskins argued a few years ago when we first had this discussion. One final note. All tasks in the above tests were RT tasks. Any RT task will always preempt a non RT task that is running on the CPU the RT task wants to run on. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> LKML-Reference: <20100921024138.605460343@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Increment cache_nice_tries only on periodic lbVenkatesh Pallipadi
Commit: 58b26c4c025778c09c7a1438ff185080e11b7d0a upstream scheduler uses cache_nice_tries as an indicator to do cache_hot and active load balance, when normal load balance fails. Currently, this value is changed on any failed load balance attempt. That ends up being not so nice to workloads that enter/exit idle often, as they do more frequent new_idle balance and that pretty soon results in cache hot tasks being pulled in. Making the cache_nice_tries ignore failed new_idle balance seems to make better sense. With that only the failed load balance in periodic load balance gets accounted and the rate of accumulation of cache_nice_tries will not depend on idle entry/exit (short running sleep-wakeup kind of tasks). This reduces movement of cache_hot tasks. schedstat diff (after-before) excerpt from a workload that has frequent and short wakeup-idle pattern (:2 in cpu col below refers to NEWIDLE idx) This snapshot was across ~400 seconds. Without this change: domainstats: domain0 cpu cnt bln fld imb gain hgain nobusyq nobusyg 0:2 306487 219575 73167 110069413 44583 19070 1172 218403 1:2 292139 194853 81421 120893383 50745 21902 1259 193594 2:2 283166 174607 91359 129699642 54931 23688 1287 173320 3:2 273998 161788 93991 132757146 57122 24351 1366 160422 4:2 289851 215692 62190 83398383 36377 13680 851 214841 5:2 316312 222146 77605 117582154 49948 20281 988 221158 6:2 297172 195596 83623 122133390 52801 21301 929 194667 7:2 283391 178078 86378 126622761 55122 22239 928 177150 8:2 297655 210359 72995 110246694 45798 19777 1125 209234 9:2 297357 202011 79363 119753474 50953 22088 1089 200922 10:2 278797 178703 83180 122514385 52969 22726 1128 177575 11:2 272661 167669 86978 127342327 55857 24342 1195 166474 12:2 293039 204031 73211 110282059 47285 19651 948 203083 13:2 289502 196762 76803 114712942 49339 20547 1016 195746 14:2 264446 169609 78292 115715605 50459 21017 982 168627 15:2 260968 163660 80142 116811793 51483 21281 1064 162596 With this change: domainstats: domain0 cpu cnt bln fld imb gain hgain nobusyq nobusyg 0:2 272347 187380 77455 105420270 24975 1 953 186427 1:2 267276 172360 86234 116242264 28087 6 1028 171332 2:2 259769 156777 93281 123243134 30555 1 1043 155734 3:2 250870 143129 97627 127370868 32026 6 1188 141941 4:2 248422 177116 64096 78261112 22202 2 757 176359 5:2 275595 180683 84950 116075022 29400 6 778 179905 6:2 262418 162609 88944 119256898 31056 4 817 161792 7:2 252204 147946 92646 122388300 32879 4 824 147122 8:2 262335 172239 81631 110477214 26599 4 864 171375 9:2 261563 164775 88016 117203621 28331 3 849 163926 10:2 243389 140949 93379 121353071 29585 2 909 140040 11:2 242795 134651 98310 124768957 30895 2 1016 133635 12:2 255234 166622 79843 104696912 26483 4 746 165876 13:2 244944 151595 83855 109808099 27787 3 801 150794 14:2 241301 140982 89935 116954383 30403 6 845 140137 15:2 232271 128564 92821 119185207 31207 4 1416 127148 Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31sched: Move sched_avg_update() to update_cpu_load()Suresh Siddha
Commit: da2b71edd8a7db44fe1746261410a981f3e03632 upstream Currently sched_avg_update() (which updates rt_avg stats in the rq) is getting called from scale_rt_power() (in the load balance context) which doesn't take rq->lock. Fix it by moving the sched_avg_update() to more appropriate update_cpu_load() where the CFS load gets updated as well. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31kernel/smp.c: fix smp_call_function_many() SMP raceAnton Blanchard
commit 6dc19899958e420a931274b94019e267e2396d3e upstream. I noticed a failure where we hit the following WARN_ON in generic_smp_call_function_interrupt: if (!cpumask_test_and_clear_cpu(cpu, data->cpumask)) continue; data->csd.func(data->csd.info); refs = atomic_dec_return(&data->refs); WARN_ON(refs < 0); <------------------------- We atomically tested and cleared our bit in the cpumask, and yet the number of cpus left (ie refs) was 0. How can this be? It turns out commit 54fdade1c3332391948ec43530c02c4794a38172 ("generic-ipi: make struct call_function_data lockless") is at fault. It removes locking from smp_call_function_many and in doing so creates a rather complicated race. The problem comes about because: - The smp_call_function_many interrupt handler walks call_function.queue without any locking. - We reuse a percpu data structure in smp_call_function_many. - We do not wait for any RCU grace period before starting the next smp_call_function_many. Imagine a scenario where CPU A does two smp_call_functions back to back, and CPU B does an smp_call_function in between. We concentrate on how CPU C handles the calls: CPU A CPU B CPU C CPU D smp_call_function smp_call_function_interrupt walks call_function.queue sees data from CPU A on list smp_call_function smp_call_function_interrupt walks call_function.queue sees (stale) CPU A on list smp_call_function int clears last ref on A list_del_rcu, unlock smp_call_function reuses percpu *data A data->cpumask sees and clears bit in cpumask might be using old or new fn! decrements refs below 0 set data->refs (too late!) The important thing to note is since the interrupt handler walks a potentially stale call_function.queue without any locking, then another cpu can view the percpu *data structure at any time, even when the owner is in the process of initialising it. The following test case hits the WARN_ON 100% of the time on my PowerPC box (having 128 threads does help :) #include <linux/module.h> #include <linux/init.h> #define ITERATIONS 100 static void do_nothing_ipi(void *dummy) { } static void do_ipis(struct work_struct *dummy) { int i; for (i = 0; i < ITERATIONS; i++) smp_call_function(do_nothing_ipi, NULL, 1); printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id()); } static struct work_struct work[NR_CPUS]; static int __init testcase_init(void) { int cpu; for_each_online_cpu(cpu) { INIT_WORK(&work[cpu], do_ipis); schedule_work_on(cpu, &work[cpu]); } return 0; } static void __exit testcase_exit(void) { } module_init(testcase_init) module_exit(testcase_exit) MODULE_LICENSE("GPL"); MODULE_AUTHOR("Anton Blanchard"); I tried to fix it by ordering the read and the write of ->cpumask and ->refs. In doing so I missed a critical case but Paul McKenney was able to spot my bug thankfully :) To ensure we arent viewing previous iterations the interrupt handler needs to read ->refs then ->cpumask then ->refs _again_. Thanks to Milton Miller and Paul McKenney for helping to debug this issue. [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ] [miltonm@bga.com: remove excess tests] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. 
McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-31ptrace: use safer wake up on ptrace_detach()Tejun Heo
commit 01e05e9a90b8f4c3997ae0537e87720eb475e532 upstream. The wake_up_process() call in ptrace_detach() is spurious and not interlocked with the tracee state. IOW, the tracee could be running or sleeping in any place in the kernel by the time wake_up_process() is called. This can lead to the tracee waking up unexpectedly which can be dangerous. The wake_up is spurious and should be removed but for now reduce its toxicity by only waking up if the tracee is in TRACED or STOPPED state. This bug can possibly be used as an attack vector. I don't think it will take too much effort to come up with an attack which triggers oops somewhere. Most sleeps are wrapped in condition test loops and should be safe but we have quite a number of places where sleep and wakeup conditions are expected to be interlocked. Although the window of opportunity is tiny, ptrace can be used by non-privileged users and with some loading the window can definitely be extended and exploited. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2011-02-06posix-cpu-timers: workaround to suppress the problems with mt execOleg Nesterov
commit e0a70217107e6f9844628120412cb27bb4cea194 upstream. posix-cpu-timers.c correctly assumes that the dying process does posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD timers from signal->cpu_timers list. But, it also assumes that timer->it.cpu.task is always the group leader, and thus the dead ->task means the dead thread group. This is obviously not true after de_thread() changes the leader. After that almost every posix_cpu_timer_ method has problems. It is not simple to fix this bug correctly. First of all, I think that timer->it.cpu should use struct pid instead of task_struct. Also, the locking should be reworked completely. In particular, tasklist_lock should not be used at all. This all needs a lot of nontrivial and hard-to-test changes. Change __exit_signal() to do posix_cpu_timers_exit_group() when the old leader dies during exec. This is not the fix, just the temporary hack to hide the problem for 2.6.37 and stable. IOW, this is obviously wrong but this is what we currently have anyway: cpu timers do not work after mt exec. In theory this change adds another race. The exiting leader can detach the timers which were attached to the new leader. However, the window between de_thread() and release_task() is small, we can pretend that sys_timer_create() was called before de_thread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2011-02-06Sched: fix skip_clock_update optimizationMike Galbraith
commit f26f9aff6aaf67e9a430d16c266f91b13a5bff64 upstream. idle_balance() drops/retakes rq->lock, leaving the previous task vulnerable to set_tsk_need_resched(). Clear it after we return from balancing instead, and in setup_thread_stack() as well, so no successfully descheduled or never scheduled task has it set. Need resched confused the skip_clock_update logic, which assumes that the next call to update_rq_clock() will come nearly immediately after being set. Make the optimization robust against the case of waking a sleeper before it successfully deschedules, by checking that the current task has not been dequeued before setting the flag, since it is that useless clock update we're trying to save, and clear unconditionally in schedule() proper instead of conditionally in put_prev_task(). Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Andi Kleen <ak@linux.intel.com> Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com> Tested-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1291802742.1417.9.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-06fix freeing user_struct in user cacheHillf Danton
commit 4ef9e11d6867f88951e30db910fa015300e31871 upstream. When the race on adding an entry to the user cache is lost, the user_struct newly allocated from the slab is freed without putting the user namespace. Since a reference to the user namespace has already been taken with a get, the corresponding put has to be issued. Signed-off-by: Hillf Danton <dhillf@gmail.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
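In alloc_uid() the lost-race path frees the unused entry, so it also has to drop the namespace reference taken for it; roughly (a sketch, assuming the race is detected by re-finding an existing entry under uidhash_lock):

    /* alloc_uid(): somebody else already inserted this uid */
    if (up) {
        put_user_ns(new->user_ns);          /* balance the earlier get_user_ns() */
        kmem_cache_free(uid_cachep, new);   /* free the losing allocation */
    }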
2011-02-06tracing: Fix panic when lseek() called on "trace" opened for writingSlava Pestov
commit 364829b1263b44aa60383824e4c1289d83d78ca7 upstream. The file_ops struct for the "trace" special file defined llseek as seq_lseek(). However, if the file was opened for writing only, seq_open() was not called, and the seek would dereference a null pointer, file->private_data. This patch introduces a new wrapper for seq_lseek() which checks if the file descriptor is opened for reading first. If not, it does nothing. Signed-off-by: Slava Pestov <slavapestov@google.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <1290640396-24179-1-git-send-email-slavapestov@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
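The wrapper can be as simple as the sketch below (the function name here is illustrative):

    static loff_t tracing_seek(struct file *file, loff_t offset, int origin)
    {
        /* seq_lseek() needs the seq_file set up by seq_open(), which only
         * happens when the file is opened for reading */
        if (file->f_mode & FMODE_READ)
            return seq_lseek(file, offset, origin);
        else
            return 0;
    }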
2011-02-06sched: Cure more NO_HZ load average woesPeter Zijlstra
commit 0f004f5a696a9434b7214d0d3cbd0525ee77d428 upstream. There's a long-running regression that proved difficult to fix and which is hitting certain people and is rather annoying in its effects. Damien reported that after 74f5187ac8 (sched: Cure load average vs NO_HZ woes) his load average is unnaturally high; he also noted that even with that patch reverted the load average numbers are not correct. The problem is that the previous patch only solved half the NO_HZ problem: it addressed going into NO_HZ mode, not coming out of it. This patch implements that missing half. When coming out of NO_HZ mode there are two important things to take care of: - Folding the pending idle delta into the global active count. - Correctly aging the averages for the idle-duration. So with this patch the NO_HZ interaction should be complete and behaviour between CONFIG_NO_HZ=[yn] should be equivalent. Furthermore, this patch slightly changes the load average computation by adding a rounding term to the fixed point multiplication. Reported-by: Damien Wyart <damien.wyart@free.fr> Reported-by: Tim McGrath <tmhikaru@gmail.com> Tested-by: Damien Wyart <damien.wyart@free.fr> Tested-by: Orion Poplawski <orion@cora.nwra.com> Tested-by: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1291129145.32004.874.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
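The rounding term mentioned at the end amounts to adding half a unit in the fixed-point representation before the final shift; roughly (a sketch of the fixed-point helper, using the kernel's FIXED_1/FSHIFT constants):

    static unsigned long
    calc_load(unsigned long load, unsigned long exp, unsigned long active)
    {
        load *= exp;
        load += active * (FIXED_1 - exp);
        load += 1UL << (FSHIFT - 1);   /* round to nearest instead of truncating */
        return load >> FSHIFT;
    }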
2011-02-06printk: Fix wake_up_klogd() vs cpu hotplugHeiko Carstens
commit 49f4138346b3cec2706adff02658fe27ceb1e46f upstream. wake_up_klogd() may get called from preemptible context but uses __raw_get_cpu_var() to write to a per cpu variable. If it gets preempted between getting the address and writing to it, the cpu in question could be offline if the process gets scheduled back and hence writes to the per cpu data of an offline cpu. This buggy behaviour was introduced with fa33507a "printk: robustify printk, fix #2" which was supposed to fix a "using smp_processor_id() in preemptible" warning. Let's use this_cpu_write() instead which disables preemption and makes sure that the outlined scenario cannot happen. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <20101126124247.GC7023@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
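The fix boils down to replacing the raw per-cpu access with this_cpu_write(), which handles preemption itself; roughly:

    void wake_up_klogd(void)
    {
        if (waitqueue_active(&log_wait))
            this_cpu_write(printk_pending, 1);  /* preemption-safe per-cpu write */
    }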
2011-02-06user_ns: Introduce user_ns_map_uid and user_ns_map_gid.Eric W. Biederman
Upstream commit 5c1469de7545a35a16ff2b902e217044a7d2f8a5 Define what happens when we view a uid from one user_namespace in another user_namespace. - If the user namespaces are the same, no mapping is necessary. - For most cases of difference, use overflowuid and overflowgid, the uid and gid currently used for 16bit apis when we have a 32bit uid that does not fit in 16bits. Effectively the situation is the same: we want to return a uid or gid that is not assigned to any user. - For the case when we happen to be mapping the uid or gid of the creator of the target user namespace, use uid 0 and gid 0, as confusing that user with root is not a problem. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andi Kleen <ak@linux.intel.com>
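The uid side of this mapping could look roughly like the sketch below (a sketch based on the description above; the walk up the creator chain is an assumption about how the creator relationship is detected):

    uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid)
    {
        struct user_namespace *tmp;

        if (likely(to == cred->user->user_ns))
            return uid;                 /* same namespace: no mapping needed */

        /* Is cred->user the creator of the target user_ns or of one of
         * its ancestors?  Then mapping it to root is fine. */
        for (tmp = to; tmp != &init_user_ns; tmp = tmp->creator->user_ns) {
            if (cred->user == tmp->creator)
                return (uid_t)0;
        }

        /* No useful relationship: return a uid not assigned to any user */
        return overflowuid;
    }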
2011-02-06PM / Hibernate: Fix PM_POST_* notification with user-space suspendTakashi Iwai
commit 1497dd1d29c6a53fcd3c80f7ac8d0e0239e7389e upstream. The user-space hibernation interface sends a wrong notification after the image restoration because of a thinko in the file flag check. RDONLY corresponds to hibernation and WRONLY to restoration, confusingly. Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
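The notifier selection in snapshot_release() therefore has to key off O_RDONLY for hibernation; roughly (a sketch of the corrected check):

    /* snapshot_release(): the device is opened O_RDONLY for hibernation
     * (reading the image) and O_WRONLY for restoration (writing it) */
    pm_notifier_call_chain(data->mode == O_RDONLY ?
                           PM_POST_HIBERNATION : PM_POST_RESTORE);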
2011-02-06nohz: Fix get_next_timer_interrupt() vs cpu hotplugHeiko Carstens
commit dbd87b5af055a0cc9bba17795c9a2b0d17795389 upstream. This fixes a bug as seen on 2.6.32 based kernels where timers got enqueued on offline cpus. If a cpu goes offline it might still have pending timers. These will be migrated during CPU_DEAD handling after the cpu is offline. However, while the cpu is going offline it will schedule the idle task, which will then call tick_nohz_stop_sched_tick(). That function in turn will call get_next_timer_interrupt() to figure out if the tick of the cpu can be stopped or not. If it turns out that the next tick is just one jiffy off (delta_jiffies == 1) tick_nohz_stop_sched_tick() incorrectly assumes that the tick should not stop and takes an early exit, and thus it won't update the load balancer cpu. Just afterwards the cpu will be killed and the load balancer cpu could be the offline cpu. On 2.6.32 based kernels get_nohz_load_balancer() gets called to decide on which cpu a timer should be enqueued (see __mod_timer()), which leads to the possibility that timers get enqueued on an offline cpu. These will never expire and can cause a system hang. This has been observed on 2.6.32 kernels. On current kernels __mod_timer() uses get_nohz_timer_target() which doesn't have that problem. However, there might be other problems because of the too early exit from tick_nohz_stop_sched_tick() in case a cpu goes offline. The easiest and probably safest fix seems to be to let get_next_timer_interrupt() just lie and let it say there isn't any pending timer if the current cpu is offline. I also thought of moving migrate_[hr]timers() from CPU_DEAD to CPU_DYING, but seeing that there already have been fixes at least in the hrtimer code in this area, I'm afraid that this could add new subtle bugs. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <20101201091109.GA8984@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
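The "just lie" early exit described above could look roughly like this (a sketch; the exact value used upstream to signal "no pending timer" may differ):

    /* at the top of get_next_timer_interrupt(): */
    if (cpu_is_offline(smp_processor_id()))
        return now + NEXT_TIMER_MAX_DELTA;  /* pretend no timer is pending;
                                             * pending timers will be migrated
                                             * to an online cpu in CPU_DEAD */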
2011-02-06nohz: Fix printk_needs_cpu() return value on offline cpusHeiko Carstens
commit 61ab25447ad6334a74e32f60efb135a3467223f8 upstream. This patch fixes a hang observed with 2.6.32 kernels where timers got enqueued on offline cpus. printk_needs_cpu() may return 1 if called on offline cpus. When a cpu gets offlined it schedules the idle process which, before killing its own cpu, will call tick_nohz_stop_sched_tick(). That function in turn will call printk_needs_cpu() in order to check if the local tick can be disabled. On offline cpus this function should naturally return 0 since, regardless of whether the tick gets disabled or not, the cpu will be dead shortly after. That is besides the fact that __cpu_disable() should already have made sure that no interrupts on the offlined cpu will be delivered anyway. In this case it prevents tick_nohz_stop_sched_tick() from calling select_nohz_load_balancer(). No idea if that really is a problem. However, what made me debug this is that on 2.6.32 the function get_nohz_load_balancer() is used within __mod_timer() to select a cpu on which a timer gets enqueued. If printk_needs_cpu() returns 1 then the nohz_load_balancer cpu doesn't get updated when a cpu gets offlined. It may contain the cpu number of an offline cpu. In turn, timers get enqueued on an offline cpu and, not very surprisingly, they never expire and cause system hangs. This has been observed on 2.6.32 kernels. On current kernels __mod_timer() uses get_nohz_timer_target() which doesn't have that problem. However, there might be other problems because of the too early exit from tick_nohz_stop_sched_tick() in case a cpu goes offline. The easiest way to fix this is just to test if the current cpu is offline and call printk_tick() directly, which clears the condition. Alternatively, I tried a cpu hotplug notifier which would clear the condition; however, between calling the notifier function and printk_needs_cpu() something could have called printk() again and the problem would be back. This seems to be the safest fix. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <20101126120235.406766476@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
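The resulting check is small; roughly (a sketch matching the description above):

    int printk_needs_cpu(int cpu)
    {
        if (cpu_is_offline(cpu))
            printk_tick();              /* flush and clear the pending condition */
        return __this_cpu_read(printk_pending);
    }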
2010-12-14genirq: Fix incorrect proc spurious outputKenji Kaneshige
commit 25c9170ed64a6551beefe9315882f754e14486f4 upstream. Since commit a1afb637 (switch /proc/irq/*/spurious to seq_file) all /proc/irq/XX/spurious files show the information of irq 0. The current irq_spurious_proc_open() passes NULL to single_open() as the 3rd argument, which is used as the IRQ number in irq_spurious_proc_show(). Because of this, every /proc/irq/XX/spurious file shows IRQ 0 information regardless of the IRQ number. To fix the problem, irq_spurious_proc_open() must pass the appropriate data (the IRQ number) to single_open(). Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> LKML-Reference: <4CF4B778.90604@jp.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
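The fix amounts to handing the proc entry's data to single_open() instead of NULL; roughly (a sketch using the PDE() helper available in kernels of that era):

    static int irq_spurious_proc_open(struct inode *inode, struct file *file)
    {
        /* pass the per-irq data stored in the proc entry instead of NULL */
        return single_open(file, irq_spurious_proc_show, PDE(inode)->data);
    }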
2010-12-14PM / Hibernate: Fix memory corruption related to swapRafael J. Wysocki
commit c9e664f1fdf34aa8cede047b206deaa8f1945af0 upstream. There is a problem that swap pages allocated before the creation of a hibernation image can be released and used for storing the contents of different memory pages while the image is being saved. Since the kernel stored in the image doesn't know of that, it causes memory corruption to occur after resume from hibernation, especially on systems with relatively small RAM that need to swap often. This issue can be addressed by keeping the GFP_IOFS bits clear in gfp_allowed_mask during the entire hibernation, including the saving of the image, until the system is finally turned off or the hibernation is aborted. Unfortunately, for this purpose it's necessary to rework the way in which the hibernate and suspend code manipulates gfp_allowed_mask. This change is based on an earlier patch from Hugh Dickins. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-by: Ondrej Zary <linux@rainbow-software.org> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
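The reworked gfp_allowed_mask handling can be pictured as a save-and-restrict step when the image is created and a restore step once hibernation is over or aborted; roughly (a sketch of the helper pair, with saved_gfp_mask assumed to be a static in the page allocator):

    static gfp_t saved_gfp_mask;

    void pm_restrict_gfp_mask(void)
    {
        WARN_ON(!mutex_is_locked(&pm_mutex));
        WARN_ON(saved_gfp_mask);
        saved_gfp_mask = gfp_allowed_mask;
        gfp_allowed_mask &= ~GFP_IOFS;   /* no I/O or FS allocations while the image exists */
    }

    void pm_restore_gfp_mask(void)
    {
        WARN_ON(!mutex_is_locked(&pm_mutex));
        if (saved_gfp_mask) {
            gfp_allowed_mask = saved_gfp_mask;
            saved_gfp_mask = 0;
        }
    }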