Discussion: small patch for pageout. Comments?
Larry McVoy
2017-11-30 17:34:24 UTC
In a recent numa meeting that Scott called, Jeff suggested a small
patch to the pageout daemon (included below).

The difference it makes for me is rather dramatic. If I arrange to
thrash the crap out of memory, then without this patch the kernel is so
borked, with all the processes in disk wait, that I can't kill them and
I can't reboot; my only option is to power off.

With the patch there is still some borkage: the kernel randomly kills
processes because it is out of memory. It should kill one of my
processes that is causing the problem, but it doesn't; it killed random
stuff like dhclient, getty (which logged me out), etc.

But the system is responsive.

What the patch does is say "if we have more than one core, don't sleep
in pageout; just keep running until we've freed enough memory".

Comments?

--lm

diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index 4ecae5ad5fd..f59a09e96e2 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1815,10 +1815,18 @@ vm_pageout_worker(void *arg)
* (page reclamation) scan, then increase the level
* and scan again now. Otherwise, sleep a bit and
* try again later.
+ * LM: per discussions with the numa team, don't
+ * sleep if we have at least 2 cpus, just keep
+ * scanning. This makes a HUGE difference when
+ * the system is thrashing on memory, it's the
+ * difference between usable and borked.
*/
mtx_unlock(&vm_page_queue_free_mtx);
- if (pass >= 1)
- pause("psleep", hz / VM_INACT_SCAN_RATE);
+ if (pass >= 1) {
+ if (mp_ncpus < 2) {
+ pause("psleep", hz /VM_INACT_SCAN_RATE);
+ }
+ }
pass++;
} else {
/*
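
For anyone who wants to reproduce the "thrash the crap out of memory"
load: a trivial thrasher along the lines below is enough. This is an
illustrative sketch, not necessarily the exact program used here; the
idea is to allocate more anonymous memory than the machine has and keep
dirtying every page.

/*
 * thrash.c -- hypothetical reproducer: allocate more anonymous memory
 * than physical RAM and keep dirtying every page, so the pagedaemon
 * and the laundry thread can never catch up.
 *
 * Usage: ./thrash <GiB>   (pass a size larger than physical memory)
 */
#include <stdlib.h>

int
main(int argc, char **argv)
{
	size_t len, off;
	char *p;

	/* Size in GiB; must exceed physical RAM to force paging. */
	len = (size_t)(argc > 1 ? atoi(argv[1]) : 8) << 30;
	if ((p = malloc(len)) == NULL)
		return (1);
	for (;;) {
		/* Touch (and dirty) one byte in every page. */
		for (off = 0; off < len; off += 4096)
			p[off]++;
	}
}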
Larry McVoy
2017-11-30 17:42:40 UTC
https://reviews.freebsd.org/D13308
Warner Losh
2017-11-30 18:37:35 UTC
Just to confirm why this patch works.

For UP systems, we have to pause here to allow work to complete;
otherwise we never switch to the threads that complete the I/Os. For MP,
however, we can continue to schedule more work because that work can be
completed on other CPUs. This parallelism greatly increases the pageout
rate, allowing the system to keep up better when some ass-hat process
(or processes) is thrashing memory.

I'm pretty sure the UP case was also designed not to flood the lower
layers with work, starving other consumers. Does this result in undue
flooding, and would we get better results if we could schedule up to the
right amount of work rather than flooding in the MP case?
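
(Purely to illustrate the alternative, as a toy userland model with
hypothetical names, not FreeBSD code: instead of "never pause when MP",
bound the number of outstanding pageout I/Os and pause only when that
cap is hit, so the scan keeps running but the lower layers are never
flooded.)

/*
 * throttle.c -- toy model of a bounded pageout scan.
 */
#include <stdio.h>

#define INFLIGHT_MAX	32	/* assumed cap on outstanding pageout I/Os */

static int inflight;		/* I/Os launched but not yet completed */
static int free_pages;
static const int free_target = 1000;

/* Stand-in for queueing one pageout write to the lower layers. */
static void
launch_pageout_io(void)
{
	inflight++;
}

/* Stand-in for pausing until the in-flight writes complete. */
static void
pause_for_completions(void)
{
	free_pages += inflight;
	inflight = 0;
}

int
main(void)
{
	while (free_pages < free_target) {
		if (inflight < INFLIGHT_MAX)
			launch_pageout_io();	/* keep scanning */
		else
			pause_for_completions();	/* throttle, don't flood */
	}
	printf("target met: %d pages freed\n", free_pages);
	return (0);
}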

Warner

Larry McVoy
2017-11-30 18:49:23 UTC
Post by Warner Losh
Just to confirm why this patch works.
For UP systems, we have to pause here to allow work to complete;
otherwise we never switch to the threads that complete the I/Os. For MP,
however, we can continue to schedule more work because that work can be
completed on other CPUs. This parallelism greatly increases the pageout
rate, allowing the system to keep up better when some ass-hat process
(or processes) is thrashing memory.
Yep.
Post by Warner Losh
I'm pretty sure the UP case was also designed not to flood the lower
layers with work, starving other consumers. Does this result in undue
flooding, and would we get better results if we could schedule up to the
right amount of work rather than flooding in the MP case?
I dunno if there is a "right amount". I could make it a little smarter
by keeping track of how many pages we freed and sleeping if we freed
none in a scan (which seems really unlikely).

All I know for sure is that without this you can lock up the system to
the point it takes a power cycle to unwedge it. With this the system
is responsive.

Rather than worrying about the smartness, I'd argue this is an
improvement; ship it, and then I can go look at how the system decides
to kill processes (because that's currently busted).
Konstantin Belousov
2017-11-30 19:32:13 UTC
Post by Larry McVoy
Rather than worrying about the smartness, I'd argue this is an
improvement; ship it, and then I can go look at how the system decides
to kill processes (because that's currently busted).
The issue with the current OOM is that it is too conservative and hard
to tune automatically. It tries to measure the pagedaemon's progress and
cautiously steps back if some forward progress is made. It is apparently
hard to make it more aggressive without reintroducing the too-eager
behaviour of the previous OOM, where small machines with little or no
swap got OOM triggered without much justification.

I have a patch that complements the pagedaemon progress detection with
simple but apparently effective detection of page-fault-time allocation
stalls. It worked for me on some relatively large machines (~64-128G)
with too many multi-gig processes and no swap. Such a load made the
machines catatonic: OOM did trigger, but only after a long reshuffling
of pages from the active to the inactive queues, or after hand-tuning
vm.pageout_oom_seq (lowering it, with the risk of allowing false
positives). With the patch, I got them to recover in 30 seconds (this
is configurable).

diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
index ece496407c2..560983c632e 100644
--- a/sys/vm/vm_fault.c
+++ b/sys/vm/vm_fault.c
@@ -134,6 +134,16 @@ static void vm_fault_dontneed(const struct faultstate *fs, vm_offset_t vaddr,
static void vm_fault_prefault(const struct faultstate *fs, vm_offset_t addra,
int backward, int forward);

+static int vm_pfault_oom_attempts = 3;
+SYSCTL_INT(_vm, OID_AUTO, pfault_oom_attempts, CTLFLAG_RWTUN,
+ &vm_pfault_oom_attempts, 0,
+ "");
+
+static int vm_pfault_oom_wait = 10;
+SYSCTL_INT(_vm, OID_AUTO, pfault_oom_wait, CTLFLAG_RWTUN,
+ &vm_pfault_oom_wait, 0,
+ "");
+
static inline void
release_page(struct faultstate *fs)
{
@@ -531,7 +541,7 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
vm_pindex_t retry_pindex;
vm_prot_t prot, retry_prot;
int ahead, alloc_req, behind, cluster_offset, error, era, faultcount;
- int locked, nera, result, rv;
+ int locked, nera, oom, result, rv;
u_char behavior;
boolean_t wired; /* Passed by reference. */
bool dead, hardfault, is_first_object_locked;
@@ -543,6 +553,8 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
hardfault = false;

RetryFault:;
+ oom = 0;
+RetryFault_oom:;

/*
* Find the backing store object and offset into it to begin the
@@ -787,7 +799,17 @@ RetryFault:;
}
if (fs.m == NULL) {
unlock_and_deallocate(&fs);
- VM_WAITPFAULT;
+ vm_waitpfault(vm_pfault_oom_wait);
+ if (vm_pfault_oom_wait < 0 ||
+ oom < vm_pfault_oom_wait) {
+ oom++;
+ goto RetryFault_oom;
+ }
+ if (bootverbose)
+ printf(
+ "proc %d (%s) failed to alloc page on fault, starting OOM\n",
+ curproc->p_pid, curproc->p_comm);
+ vm_pageout_oom(VM_OOM_MEM);
goto RetryFault;
}
}
diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index 4711af9d16d..67c8ddd4b92 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -2724,7 +2724,7 @@ vm_page_alloc_fail(vm_object_t object, int req)
* this balance without careful testing first.
*/
void
-vm_waitpfault(void)
+vm_waitpfault(int wp)
{

mtx_lock(&vm_page_queue_free_mtx);
@@ -2734,7 +2734,7 @@ vm_waitpfault(void)
}
vm_pages_needed = true;
msleep(&vm_cnt.v_free_count, &vm_page_queue_free_mtx, PDROP | PUSER,
- "pfault", 0);
+ "pfault", wp * hz);
}

struct vm_pagequeue *
diff --git a/sys/vm/vm_pageout.h b/sys/vm/vm_pageout.h
index 2cdb1492fab..bf09d7142d0 100644
--- a/sys/vm/vm_pageout.h
+++ b/sys/vm/vm_pageout.h
@@ -96,11 +96,10 @@ extern bool vm_pages_needed;
* Signal pageout-daemon and wait for it.
*/

-extern void pagedaemon_wakeup(void);
+void pagedaemon_wakeup(void);
#define VM_WAIT vm_wait()
-#define VM_WAITPFAULT vm_waitpfault()
-extern void vm_wait(void);
-extern void vm_waitpfault(void);
+void vm_wait(void);
+void vm_waitpfault(int wp);

#ifdef _KERNEL
int vm_pageout_flush(vm_page_t *, int, int, int, int *, boolean_t *);
Larry McVoy
2017-11-30 20:30:47 UTC
I'll give this patch a try and compare to the pageout patch.
Konstantin Belousov
2017-11-30 20:37:25 UTC
Post by Larry McVoy
I'll give this patch a try and compare to the pageout patch.
What do you mean by 'compare to ...' ?

BTW, below are some fixes which I found when re-reading the changes.
The first version mixed up the counters and timeouts: it bounded the
retry loop by vm_pfault_oom_wait instead of vm_pfault_oom_attempts.
Better to use this one.

diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
index ece496407c2..48b3934ddd5 100644
--- a/sys/vm/vm_fault.c
+++ b/sys/vm/vm_fault.c
@@ -134,6 +134,16 @@ static void vm_fault_dontneed(const struct faultstate *fs, vm_offset_t vaddr,
static void vm_fault_prefault(const struct faultstate *fs, vm_offset_t addra,
int backward, int forward);

+static int vm_pfault_oom_attempts = 3;
+SYSCTL_INT(_vm, OID_AUTO, pfault_oom_attempts, CTLFLAG_RWTUN,
+ &vm_pfault_oom_attempts, 0,
+ "");
+
+static int vm_pfault_oom_wait = 10;
+SYSCTL_INT(_vm, OID_AUTO, pfault_oom_wait, CTLFLAG_RWTUN,
+ &vm_pfault_oom_wait, 0,
+ "");
+
static inline void
release_page(struct faultstate *fs)
{
@@ -531,7 +541,7 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
vm_pindex_t retry_pindex;
vm_prot_t prot, retry_prot;
int ahead, alloc_req, behind, cluster_offset, error, era, faultcount;
- int locked, nera, result, rv;
+ int locked, nera, oom, result, rv;
u_char behavior;
boolean_t wired; /* Passed by reference. */
bool dead, hardfault, is_first_object_locked;
@@ -543,6 +553,8 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
hardfault = false;

RetryFault:;
+ oom = 0;
+RetryFault_oom:;

/*
* Find the backing store object and offset into it to begin the
@@ -787,7 +799,17 @@ RetryFault:;
}
if (fs.m == NULL) {
unlock_and_deallocate(&fs);
- VM_WAITPFAULT;
+ if (vm_pfault_oom_attempts < 0 ||
+ oom < vm_pfault_oom_attempts) {
+ oom++;
+ vm_waitpfault(vm_pfault_oom_wait);
+ goto RetryFault_oom;
+ }
+ if (bootverbose)
+ printf(
+ "proc %d (%s) failed to alloc page on fault, starting OOM\n",
+ curproc->p_pid, curproc->p_comm);
+ vm_pageout_oom(VM_OOM_MEM);
goto RetryFault;
}
}
diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index 4711af9d16d..67c8ddd4b92 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -2724,7 +2724,7 @@ vm_page_alloc_fail(vm_object_t object, int req)
* this balance without careful testing first.
*/
void
-vm_waitpfault(void)
+vm_waitpfault(int wp)
{

mtx_lock(&vm_page_queue_free_mtx);
@@ -2734,7 +2734,7 @@ vm_waitpfault(void)
}
vm_pages_needed = true;
msleep(&vm_cnt.v_free_count, &vm_page_queue_free_mtx, PDROP | PUSER,
- "pfault", 0);
+ "pfault", wp * hz);
}

struct vm_pagequeue *
diff --git a/sys/vm/vm_pageout.h b/sys/vm/vm_pageout.h
index 2cdb1492fab..bf09d7142d0 100644
--- a/sys/vm/vm_pageout.h
+++ b/sys/vm/vm_pageout.h
@@ -96,11 +96,10 @@ extern bool vm_pages_needed;
* Signal pageout-daemon and wait for it.
*/

-extern void pagedaemon_wakeup(void);
+void pagedaemon_wakeup(void);
#define VM_WAIT vm_wait()
-#define VM_WAITPFAULT vm_waitpfault()
-extern void vm_wait(void);
-extern void vm_waitpfault(void);
+void vm_wait(void);
+void vm_waitpfault(int wp);

#ifdef _KERNEL
int vm_pageout_flush(vm_page_t *, int, int, int, int *, boolean_t *);
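
A note on tuning, assuming the patch is applied: both knobs are
CTLFLAG_RWTUN, so they can be set at runtime or as loader tunables, e.g.

	sysctl vm.pfault_oom_attempts=3
	sysctl vm.pfault_oom_wait=10

With these defaults a faulting thread gives up after at most 3 waits of
10 seconds each, which is where the ~30 second recovery mentioned above
comes from; setting vm.pfault_oom_attempts to -1 makes the fault path
retry forever, disabling the page-fault-side OOM trigger entirely.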
Larry McVoy
2017-11-30 20:39:11 UTC
Perhaps I misunderstood; I thought your patch was meant to replace the
pageout patch. Is that not the case?
Konstantin Belousov
2017-11-30 21:04:19 UTC
Post by Larry McVoy
Perhaps I misunderstood; I thought your patch was meant to replace the
pageout patch. Is that not the case?
The patch aims to fix the situation where OOM triggers too late.

It might have some effect when the pagedaemon cannot cope with the speed
of page allocations, but it would act in the rather rude manner of
triggering OOM when some process cannot get a free page for a long time
because the page daemon does not provide enough free pages and other
allocators steal the limited supply.
Mark Johnston
2017-11-30 20:47:50 UTC
Post by Larry McVoy
I dunno if there is a "right amount". I could make it a little smarter
by keeping track of how many pages we freed and sleeping if we freed
none in a scan (which seems really unlikely).
This situation can happen if the inactive queue is full of dirty pages.
A problem with your patch is that we might not give the laundry thread
(the thread responsible for writing the contents of dirty pages to disk
and returning them to the inactive queue for the page daemon to free)
enough time to write out dirty pages. In that case we might trigger the
OOM killer prematurely; in fact this scenario is what motivated r300865.
So I would argue that we do in fact need to sleep if the page daemon is
failing to make progress, in order to give time for I/O to complete.
Larry McVoy
2017-11-30 20:50:41 UTC
Post by Mark Johnston
This situation can happen if the inactive queue is full of dirty pages.
A problem with your patch is that we might not give the laundry thread
(the thread responsible for writing the contents of dirty pages to disk
and returning them to the inactive queue for the page daemon to free)
enough time to write out dirty pages. In that case we might trigger the
OOM killer prematurely; in fact this scenario is what motivated r300865.
So I would argue that we do in fact need to sleep if the page daemon is
failing to make progress, in order to give time for I/O to complete.
OK, that sounds reasonable. So what defines progress? v_dfree not
increasing? Is one page freed progress?
Mark Johnston
2017-11-30 21:01:31 UTC
Post by Larry McVoy
OK, that sounds reasonable. So what defines progress? v_dfree not
increasing? Is one page freed progress?
One page freed is progress. We currently invoke the OOM killer only when
the page daemon is making no progress. This turns out to be too
conservative, which is what kib's patch attempts to address. wrt
your patch, I'm saying that I think we should still sleep after a scan
that failed to free any pages.
Larry McVoy
2017-11-30 21:09:19 UTC
Post by Mark Johnston
One page freed is progress. We currently invoke the OOM killer only when
the page daemon is making no progress. This turns out to be too
conservative, which is what kib's patch attempts to address. wrt
your patch, I'm saying that I think we should still sleep after a scan
that failed to free any pages.
Something like this?

--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1752,6 +1752,7 @@ vm_pageout_worker(void *arg)
struct vm_domain *domain;
int domidx, pass;
bool target_met;
+ u_int dfree;

domidx = (uintptr_t)arg;
domain = &vm_dom[domidx];
@@ -1776,6 +1777,7 @@ vm_pageout_worker(void *arg)
*/
while (TRUE) {
mtx_lock(&vm_page_queue_free_mtx);
+ dfree = VM_CNT_FETCH(v_dfree);

/*
* Generally, after a level >= 1 scan, if there are enough
@@ -1815,10 +1817,22 @@ vm_pageout_worker(void *arg)
* (page reclamation) scan, then increase the level
* and scan again now. Otherwise, sleep a bit and
* try again later.
+ *
+ * If we have more than one CPU this pause is not
+ * helpful, it just decreases the rate at which we
+ * clean pages. On a uniprocessor we want to pause
+ * to let the user level processes get some time to
+ * run. We also don't keep banging on the page tables
+ * if we didn't manage to free any in the last pass.
*/
mtx_unlock(&vm_page_queue_free_mtx);
- if (pass >= 1)
- pause("psleep", hz / VM_INACT_SCAN_RATE);
+ if (pass >= 1) {
+ dfree = VM_CNT_FETCH(v_dfree) - dfree;
+ if ((dfree == 0) || (mp_ncpus < 2)) {
+if (!dfree) printf("Sleeping because pass %d didn't find anything\n", pass);
+ pause("psleep", hz / VM_INACT_SCAN_RATE);
+ }
+ }
pass++;
} else {
/*
Larry McVoy
2017-11-30 21:15:24 UTC
Look at that, Mark was right:

Sleeping because pass 1 didn't find anything
Sleeping because pass 2 didn't find anything
Sleeping because pass 1 didn't find anything
Sleeping because pass 3 didn't find anything
Larry McVoy
2017-11-30 21:19:02 UTC
But we're borked again; this patch is not enough. Gonna try adding the
OOM stuff.
Post by Larry McVoy
Sleeping because pass 1 didn't find anything
Sleeping because pass 2 didn't find anything
Sleeping because pass 1 didn't find anything
Sleeping because pass 3 didn't find anything
Larry McVoy
2017-11-30 21:40:05 UTC
Welp, that doesn't help either. So with my patch plus KIB's patch the
system is wedged: I can't kill the processes, can't do anything other
than reboot.

I guess that's motivation to learn the remote console stuff.
Post by Larry McVoy
But we're borked again; this patch is not enough. Gonna try adding the
OOM stuff.