Discussion:
bus_dmamap_sync() for bounced client buffers from user address space
Svatopluk Kraus
2015-04-24 13:13:05 UTC
DMA can be done on a client buffer from user address space, for
example through bus_dmamap_load_uio() when uio->uio_segflg is
UIO_USERSPACE. Such a client buffer can bounce, and then it must be
copied to and from the bounce buffer in bus_dmamap_sync().

The current implementations (on all archs) do not take into account
that bus_dmamap_sync() is, in general, asynchronous for POSTWRITE and
POSTREAD. It can be asynchronous for PREWRITE and PREREAD too, for
example in driver implementations where operations on DMA client
buffers are buffered. In those cases a simple bcopy() is not correct.

The current (x86) implementation looks like this:

-----------------------------
struct bounce_page {
        vm_offset_t vaddr;       /* kva of bounce buffer */
        bus_addr_t  busaddr;     /* Physical address */
        vm_offset_t datavaddr;   /* kva of client data */
        bus_addr_t  dataaddr;    /* client physical address */
        bus_size_t  datacount;   /* client data count */
        STAILQ_ENTRY(bounce_page) links;
};

if ((op & BUS_DMASYNC_PREWRITE) != 0) {
        while (bpage != NULL) {
                if (bpage->datavaddr != 0) {
                        bcopy((void *)bpage->datavaddr,
                            (void *)bpage->vaddr,
                            bpage->datacount);
                } else {
                        physcopyout(bpage->dataaddr,
                            (void *)bpage->vaddr,
                            bpage->datacount);
                }
                bpage = STAILQ_NEXT(bpage, links);
        }
        dmat->bounce_zone->total_bounced++;
}
-----------------------------

There are two issues:

(1) datavaddr is not always the KVA of the client data; sometimes it
is a UVA.
(2) bcopy() can be used only if datavaddr is a KVA or if map->pmap is
the current pmap.

Note that bus_dmamap_load_uio() with uio->uio_segflg set to
UIO_USERSPACE implies that the physical pages used are in-core and
wired; see bus_dma(9).

There is no public interface to check whether map->pmap is the
current pmap, so one solution is the following:

if (bpage->datavaddr >= VM_MIN_KERNEL_ADDRESS) {
        bcopy(...);
} else {
        physcopy(...);
}

If a public pmap_is_current() existed, then another solution would be
the following:

if (bpage->datavaddr != 0 && pmap_is_current(map->pmap)) {
        bcopy(...);
} else {
        physcopy(...);
}

The second solution requires that a context switch does not happen
during a bus_dmamap_sync() called from an interrupt routine; however,
IMO, that is guaranteed.

Note that map->pmap should always be kernel_pmap when datavaddr >=
VM_MIN_KERNEL_ADDRESS.
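
For illustration, a minimal sketch of the PREWRITE loop with the
second solution applied (pmap_is_current() is assumed to exist, and
the physcopyout() fallback assumes dataaddr is recorded even when
datavaddr is set):

-----------------------------
/*
 * Sketch only: copy the client data into the bounce page, using
 * bcopy() when the recorded VA is usable in this context and falling
 * back to a physical copy otherwise.
 */
while (bpage != NULL) {
        if (bpage->datavaddr != 0 &&
            (bpage->datavaddr >= VM_MIN_KERNEL_ADDRESS ||
            pmap_is_current(map->pmap))) {
                bcopy((void *)bpage->datavaddr,
                    (void *)bpage->vaddr, bpage->datacount);
        } else {
                physcopyout(bpage->dataaddr,
                    (void *)bpage->vaddr, bpage->datacount);
        }
        bpage = STAILQ_NEXT(bpage, links);
}
-----------------------------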

Comments, different solutions, or objections?

Svatopluk Kraus
Jason Harmening
2015-04-24 22:50:15 UTC
A couple of comments:

--POSTWRITE and POSTREAD are only asynchronous if you call them from an
asynchronous context.
For a driver that's already performing DMA operations on usermode memory,
it seems likely that there's going to be *some* place where you can call
bus_dmamap_sync() and be guaranteed to be executing in the context of the
process that owns the memory. Then a regular bcopy will be safe and
inexpensive, assuming the pages have been properly vslock-ed/vm_map_wire-d.
That's usually whatever read/write/ioctl operation spawned the DMA transfer
in the first place. So, in those cases can you not just move the
POSTREAD/POSTWRITE sync from the "DMA-finished" interrupt to the
d_read/d_write/d_ioctl that waits on the "DMA-finished" interrupt?
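
As a sketch of that pattern (the foo_* names, the softc fields, and
the load callback are made up for the example), the read path might
look like this:

-----------------------------
/*
 * Hypothetical driver path: the interrupt handler only wakes us up;
 * the POSTREAD sync and unload happen here, in the context of the
 * process that owns the user buffer.
 */
static int
foo_read(struct foo_softc *sc, struct uio *uio)
{
        int error;

        error = bus_dmamap_load_uio(sc->foo_dmat, sc->foo_map, uio,
            foo_load_cb, sc, BUS_DMA_NOWAIT);
        if (error != 0)
                return (error);
        bus_dmamap_sync(sc->foo_dmat, sc->foo_map, BUS_DMASYNC_PREREAD);
        foo_start_dma(sc);

        /* Sleep until the "DMA-finished" interrupt wakes us up. */
        error = tsleep(&sc->foo_dma_done, PRIBIO, "foodma", hz);

        /* Still in the owning process: a bcopy-based sync is safe. */
        bus_dmamap_sync(sc->foo_dmat, sc->foo_map, BUS_DMASYNC_POSTREAD);
        bus_dmamap_unload(sc->foo_dmat, sc->foo_map);
        return (error);
}
-----------------------------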

--physcopyin/physcopyout aren't trivial. They go through uiomove_fromphys,
which often uses sfbufs to create temporary KVA mappings for the physical
pages. sf_buf_alloc() can sleep (unless SFB_NOWAIT is specified, which
means it can fail and which uiomove_fromphys does not specify for good
reason); that makes it unsafe for use in either a threaded interrupt or a
filter. Perhaps the physcopyout path could be changed to use pmap_qenter
directly in this case, but that can still be expensive in terms of TLB
shootdowns.

Checking against VM_MIN_KERNEL_ADDRESS seems sketchy; it eliminates the
chance to use a much-less-expensive bcopy in cases where the sync is
happening in correct process context.

Context-switching during bus_dmamap_sync() shouldn't be an issue. In a
filter interrupt, curproc will be completely arbitrary but none of this
stuff should be called in a filter anyway. Otherwise, if you're in a
kernel thread (including an ithread), curproc should be whatever proc was
supplied when the thread was created. That's usually proc0, which only has
kernel address space. IOW, even if a context-switch happens sometime
during bus_dmamap_sync, any pmap check or copy should have a consistent and
non-arbitrary process context.

I think something like your second solution would be workable to make
UIO_USERSPACE syncs work in non-interrupt kernel threads, but given all the
restrictions and extra cost of physcopy, I'm not sure how useful that would
be.

I do think busdma.9 could at least use a note that bus_dmamap_sync() is
only safe to call in the context of the owning process for user buffers.
Konstantin Belousov
2015-04-25 09:41:52 UTC
Post by Jason Harmening
--POSTWRITE and POSTREAD are only asynchronous if you call them from an
asynchronous context.
For a driver that's already performing DMA operations on usermode memory,
it seems likely that there's going to be *some* place where you can call
bus_dmamap_sync() and be guaranteed to be executing in the context of the
process that owns the memory. Then a regular bcopy will be safe and
inexpensive, assuming the pages have been properly vslock-ed/vm_map_wire-d.
That's usually whatever read/write/ioctl operation spawned the DMA transfer
in the first place. So, in those cases can you not just move the
POSTREAD/POSTWRITE sync from the "DMA-finished" interrupt to the
d_read/d_write/d_ioctl that waits on the "DMA-finished" interrupt?
--physcopyin/physcopyout aren't trivial. They go through uiomove_fromphys,
which often uses sfbufs to create temporary KVA mappings for the physical
pages. sf_buf_alloc() can sleep (unless SFB_NOWAIT is specified, which
means it can fail and which uiomove_fromphys does not specify for good
reason); that makes it unsafe for use in either a threaded interrupt or a
filter. Perhaps the physcopyout path could be changed to use pmap_qenter
directly in this case, but that can still be expensive in terms of TLB
shootdowns.
Checking against VM_MIN_KERNEL_ADDRESS seems sketchy; it eliminates the
chance to use a much-less-expensive bcopy in cases where the sync is
happening in correct process context.
Context-switching during bus_dmamap_sync() shouldn't be an issue. In a
filter interrupt, curproc will be completely arbitrary but none of this
stuff should be called in a filter anyway. Otherwise, if you're in a
kernel thread (including an ithread), curproc should be whatever proc was
supplied when the thread was created. That's usually proc0, which only has
kernel address space. IOW, even if a context-switch happens sometime
during bus_dmamap_sync, any pmap check or copy should have a consistent and
non-arbitrary process context.
I think something like your second solution would be workable to make
UIO_USERSPACE syncs work in non-interrupt kernel threads, but given all the
restrictions and extra cost of physcopy, I'm not sure how useful that would
be.
I do think busdma.9 could at least use a note that bus_dmamap_sync() is
only safe to call in the context of the owning process for user buffers.
UIO_USERSPACE for busdma is absolutely unsafe and cannot be used
without risking a kernel panic. Even if you wire the userspace
buffer, there is nothing to prevent another thread in the user
process, or another process sharing the same address space, from
calling munmap(2) on the range.

The only safe method to work with userspace regions is to
vm_fault_quick_hold() them to get a hold on the pages, and then
either pass the page array down or remap it into the KVA with
pmap_qenter().
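
A rough sketch of that approach (uaddr and len describe the user
buffer; error handling and the sub-page offset of the data are
glossed over):

-----------------------------
/* Sketch: wire the user pages and give them a temporary KVA mapping. */
vm_page_t ma[btoc(MAXPHYS) + 1];
vm_offset_t kva;
int npages;

npages = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
    (vm_offset_t)uaddr, len, VM_PROT_READ | VM_PROT_WRITE,
    ma, nitems(ma));
if (npages < 0)
        return (EFAULT);
kva = kva_alloc(ptoa(npages));
pmap_qenter(kva, ma, npages);

/* ... DMA and/or copies via kva + ((vm_offset_t)uaddr & PAGE_MASK) ... */

pmap_qremove(kva, npages);
kva_free(kva, ptoa(npages));
vm_page_unhold_pages(ma, npages);
-----------------------------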
Jason Harmening
2015-04-25 14:02:12 UTC
Post by Konstantin Belousov
UIO_USERSPACE for busdma is absolutely unsafe and cannot be used without
making kernel panicing. Even if you wire the userspace bufer, there is
nothing which could prevent other thread in the user process, or other
process sharing the same address space, to call munmap(2) on the range.
The only safe method to work with the userspace regions is to
vm_fault_quick_hold() them to get hold on the pages, and then either
pass pages array down, or remap them in the KVA with pmap_qenter().
I was under the impression that any attempt to free or munmap the
virtual range would block waiting for the underlying pages to be unwired.
That's certainly the behavior I've seen with free; is it not the case
with munmap? Either way, I haven't walked the code to see where the
blocking is implemented.

If UIO_USERSPACE truly is unsafe to use with busdma, then we need to
either make it safe or stop documenting it in the manpage.
Perhaps the bounce buffer logic could use copyin/copyout for userspace,
if in the right process? Or just always use physcopy for non-KVA as in
the first suggestion?
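
For the POSTREAD direction, that suggestion would amount to something
like the following sketch (assuming the sync is known to run in the
owning process and datavaddr is set for user buffers):

-----------------------------
/* Sketch: POSTREAD bounce copy back to a user VA, owning process. */
if (bpage->datavaddr >= VM_MIN_KERNEL_ADDRESS) {
        bcopy((void *)bpage->vaddr,
            (void *)bpage->datavaddr, bpage->datacount);
} else {
        /* copyout() handles faults on the user address gracefully. */
        error = copyout((void *)bpage->vaddr,
            (void *)bpage->datavaddr, bpage->datacount);
}
-----------------------------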

It seems like in general it is too hard for drivers using busdma to deal
with usermode memory in a way that's both safe and efficient:
--bus_dmamap_load_uio + UIO_USERSPACE is apparently really unsafe
--if they do things the other way and allocate in the kernel, then
they had better either be willing to do extra copying, or create and
refcount their own vm_objects and use d_mmap_single (I still haven't
seen a good example of that), or leak a bunch of memory (if they use
d_mmap), because the old device pager is also really unsafe.
Konstantin Belousov
2015-04-25 16:34:44 UTC
Post by Jason Harmening
It seems like in general it is too hard for drivers using busdma to deal
--bus_dmamap_load_uio + UIO_USERSPACE is apparently really unsafe
--if they do things the other way and allocate in the kernel, then then
they better either be willing to do extra copying, or create and
refcount their own vm_objects and use d_mmap_single (I still haven't
seen a good example of that), or leak a bunch of memory (if they use
d_mmap), because the old device pager is also really unsafe.
munmap(2) does not free the pages; it removes the mapping and
dereferences the backing vm object. If the region was wired, munmap
decrements the wiring count for the pages. So if kernel code wired
the region's pages, they are kept wired, but no longer mapped into
userspace. So bcopy() still does not work.

d_mmap_single() is used by GPU, definitely by GEM and TTM code, and possibly
by the proprietary nvidia driver.

I believe UIO_USERSPACE is almost unused, it might be there for some
obscure (and buggy) driver.
Jason Harmening
2015-04-25 17:07:29 UTC
Post by Konstantin Belousov
munmap(2) does not free the pages, it removes the mapping and dereferences
the backing vm object. If the region was wired, munmap would decrement
the wiring count for the pages. So if a kernel code wired the regions
pages, they are kept wired, but no longer mapped into the userspace.
So bcopy() still does not work.
Ok, my question wasn't whether munmap frees the pages, but whether it
accounts for wire count when attempting to remove the mapping. It
doesn't, so yes bcopy will be unsafe.
Post by Konstantin Belousov
d_mmap_single() is used by GPU, definitely by GEM and TTM code, and possibly
by the proprietary nvidia driver.
I believe UIO_USERSPACE is almost unused, it might be there for some
obscure (and buggy) driver.
It may be nearly unused, but we still document it in busdma.9, and we
still explicitly check for it when setting the pmap in
_bus_dmamap_load_uio. If it's not safe to use, then it's not OK for us
to do that.
We need to either a) remove support for it by adding a failure/KASSERT
on UIO_USERSPACE in _bus_dmamap_load_uio() and remove the paragraph on it
from busdma.9, or b) make it safe.

I'd be in favor of b), because I think it is still valid to support some
non-painful way of using DMA with userspace buffers. Right now, the
only safe way to do that seems to be:
1) vm_fault_quick_hold_pages
2) kva_alloc
3) pmap_qenter
4) bus_dmamap_load

That seems both unnecessarily complex and wasteful of KVA space.
Konstantin Belousov
2015-04-25 17:28:33 UTC
Post by Jason Harmening
Post by Konstantin Belousov
I believe UIO_USERSPACE is almost unused, it might be there for some
obscure (and buggy) driver.
It may be nearly unused, but we still document it in busdma.9, and we
still explicitly check for it when setting the pmap in
_bus_dmamap_load_uio. If it's not safe to use, then it's not OK for us
to do that.
We need to either a) remove support for it by adding a failure/KASSERT
on UIO_USERSPACE in _busdmamap_load_uio() and remove the paragraph on it
from busdma.9, or b) make it safe.
I'd be in favor of b), because I think it is still valid to support some
non-painful way of using DMA with userspace buffers. Right now, the
1) vm_fault_quick_hold_pages
2) kva_alloc
3) pmap_qenter
4) bus_dmamap_load
1. vm_fault_quick_hold
2. bus_dmamap_load_ma
Post by Jason Harmening
That seems both unnecessarily complex and wasteful of KVA space.
The above sequence does not need a KVA allocation.
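
A sketch of that two-step sequence (the load step goes through the
internal _bus_dmamap_load_ma() today; the argument list shown is only
an approximation of that internal interface, and dmat, map, uaddr and
len come from the surrounding driver code):

-----------------------------
/* Sketch: hold the user pages, then hand the page array to busdma. */
vm_page_t ma[btoc(MAXPHYS) + 1];
int error, npages, nsegs;

npages = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
    (vm_offset_t)uaddr, len, VM_PROT_READ | VM_PROT_WRITE,
    ma, nitems(ma));
if (npages < 0)
        return (EFAULT);

nsegs = -1;     /* in/out segment index, as the internal callers do */
error = _bus_dmamap_load_ma(dmat, map, ma, len,
    (vm_offset_t)uaddr & PAGE_MASK, BUS_DMA_NOWAIT, NULL, &nsegs);
/* No KVA mapping of the user pages is ever created. */
-----------------------------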
Jason Harmening
2015-04-25 17:55:13 UTC
Post by Konstantin Belousov
1. vm_fault_quick_hold
2. bus_dmamap_load_ma
Post by Jason Harmening
That seems both unnecessarily complex and wasteful of KVA space.
The above sequence does not need a KVA allocation.
Ah, that looks much better. A few things though:
1) _bus_dmamap_load_ma (note the underscore) is still part of the MI/MD
interface, which we tell drivers not to use. It looks like it's
implemented for every arch though. Should there be a public and
documented bus_dmamap_load_ma ?
2) There is a bus_dmamap_load_ma_triv that's part of the MI interface,
but it's not documented, and it seems like it would be suboptimal in
certain cases, such as when dmar is enabled.
3) Using bus_dmamap_load_ma would mean always using physcopy for bounce
buffers...seems like the sfbufs would slow things down ?
Konstantin Belousov
2015-04-25 18:18:46 UTC
Post by Jason Harmening
1) _bus_dmamap_load_ma (note the underscore) is still part of the MI/MD
interface, which we tell drivers not to use. It looks like it's
implemented for every arch though. Should there be a public and
documented bus_dmamap_load_ma ?
Might be yes. But at least one consumer of the KPI must appear before
the facility is introduced.
Post by Jason Harmening
2) There is a bus_dmamap_load_ma_triv that's part of the MI interface,
but it's not documented, and it seems like it would be suboptimal in
certain cases, such as when dmar is enabled.
When DMAR is enabled, bus_dmamap_load_ma_triv() should not be used.
It should not be used directly even when DMAR is not enabled. Drivers
should use bus_dmamap_load_ma(), and the implementation redirects to
_triv() if needed.

The _triv() helper exists to allow bus_dmamap_load_ma() to exist on
architectures which cannot implement, or have not yet implemented, a
proper page-array load operation.
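
Roughly, that fallback just walks the page array and feeds it to
_bus_dmamap_load_phys() one page at a time; a simplified sketch of
its shape (close to, but not verbatim, the in-tree helper):

-----------------------------
int
bus_dmamap_load_ma_triv(bus_dma_tag_t dmat, bus_dmamap_t map,
    struct vm_page **ma, bus_size_t tlen, int ma_offs, int flags,
    bus_dma_segment_t *segs, int *segp)
{
        bus_size_t len;
        int error, i;

        error = 0;
        for (i = 0; tlen > 0; i++, tlen -= len) {
                len = MIN(PAGE_SIZE - ma_offs, tlen);
                error = _bus_dmamap_load_phys(dmat, map,
                    VM_PAGE_TO_PHYS(ma[i]) + ma_offs, len, flags,
                    segs, segp);
                if (error != 0)
                        break;
                ma_offs = 0;
        }
        return (error);
}
-----------------------------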
Post by Jason Harmening
3) Using bus_dmamap_load_ma would mean always using physcopy for bounce
buffers...seems like the sfbufs would slow things down ?
For amd64, sfbufs are nop, due to the direct map. But, I doubt that
we can combine bounce buffers and performance in the single sentence.
Jason Harmening
2015-04-25 18:47:07 UTC
Post by Konstantin Belousov
Post by Jason Harmening
1) _bus_dmamap_load_ma (note the underscore) is still part of the MI/MD
interface, which we tell drivers not to use. It looks like it's
implemented for every arch though. Should there be a public and
documented bus_dmamap_load_ma ?
Might be yes. But at least one consumer of the KPI must appear before
the facility is introduced.
Could some of the GART/GTT code consume that?
Post by Konstantin Belousov
Post by Jason Harmening
2) There is a bus_dmamap_load_ma_triv that's part of the MI interface,
but it's not documented, and it seems like it would be suboptimal in
certain cases, such as when dmar is enabled.
When DMAR is enabled, bus_dmamap_load_triv() should not be used.
It should not be used directly even when not. Drivers should use
bus_dmamap_load_ma(), and implementation redirects to _triv() if
needed.
The _triv() is the helper to allow bus_dmamap_load_ma() to exists
on architectures which cannot implement, on not yet implemented,
proper page array load op.
Yes, I noticed the same thing. I'm not sure why _triv() is treated as
part of the public API and not prefixed with an underscore and a comment
not to use it in drivers.
Post by Konstantin Belousov
Post by Jason Harmening
3) Using bus_dmamap_load_ma would mean always using physcopy for bounce
buffers...seems like the sfbufs would slow things down ?
For amd64, sfbufs are nop, due to the direct map. But, I doubt that
we can combine bounce buffers and performance in the single sentence.
In fact the amd64 implementation of uiomove_fromphys doesn't use sfbufs
at all thanks to the direct map. sparc64 seems to avoid sfbufs as much
as possible too. I don't know what arm64/aarch64 will be able to use.
Those seem like the platforms where bounce buffering would be the most
likely, along with i386 + PAE. They might still be used on 32-bit
platforms for alignment or devices with < 32-bit address width, but then
those are likely to be old and slow anyway.

I'm still a bit worried about the slowness of waiting for an sfbuf if
one is needed, but in practice that might not be a big issue.
Konstantin Belousov
2015-04-25 20:14:10 UTC
Post by Jason Harmening
Post by Konstantin Belousov
Post by Jason Harmening
1) _bus_dmamap_load_ma (note the underscore) is still part of the MI/MD
interface, which we tell drivers not to use. It looks like it's
implemented for every arch though. Should there be a public and
documented bus_dmamap_load_ma ?
Might be yes. But at least one consumer of the KPI must appear before
the facility is introduced.
Could some of the GART/GTT code consume that?
Do you mean, by the GEM/GTT code? Indeed, this is an interesting and
probably workable suggestion. I thought that I would need to provide
a special interface from DMAR for GEM, but your proposal seems to
fit. Still, an issue is that the Linux code is structured
significantly differently, and this code, although isolated, has
diverged significantly from upstream.

The special DMAR interface is still needed for bhyve, I am slowly working
on it.
Post by Jason Harmening
Post by Konstantin Belousov
Post by Jason Harmening
2) There is a bus_dmamap_load_ma_triv that's part of the MI interface,
but it's not documented, and it seems like it would be suboptimal in
certain cases, such as when dmar is enabled.
When DMAR is enabled, bus_dmamap_load_triv() should not be used.
It should not be used directly even when not. Drivers should use
bus_dmamap_load_ma(), and implementation redirects to _triv() if
needed.
The _triv() is the helper to allow bus_dmamap_load_ma() to exists
on architectures which cannot implement, on not yet implemented,
proper page array load op.
Yes, I noticed the same thing. I'm not sure why _triv() is treated as
part of the public API and not prefixed with an underscore and a comment
not to use it in drivers.
It is not. We do not claim that a function not starting with '_' is
part of the driver KPI. A comment would be nice, indeed.
Jason Harmening
2015-04-26 18:04:00 UTC
Post by Konstantin Belousov
Post by Jason Harmening
Post by Konstantin Belousov
Post by Jason Harmening
1) _bus_dmamap_load_ma (note the underscore) is still part of the MI/MD
interface, which we tell drivers not to use. It looks like it's
implemented for every arch though. Should there be a public and
documented bus_dmamap_load_ma ?
Might be yes. But at least one consumer of the KPI must appear before
the facility is introduced.
Could some of the GART/GTT code consume that?
Do you mean, by GEM/GTT code ? Indeed, this is interesting and probably
workable suggestion. I thought that I would need to provide a special
interface from DMAR for the GEM, but your proposal seems to fit. Still,
an issue is that the Linux code is structured significantly different,
and this code, although isolated, is significant divergent from the
upstream.
Yes, GEM/GTT. I know it would be useful for i915, maybe other drm2
drivers too.
I noticed the following in vm_map_delete, which is called by sys_munmap:


2956          * Wait for wiring or unwiring of an entry to complete.
2957          * Also wait for any system wirings to disappear on
2958          * user maps.
2959          */
2960         if ((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0 ||
2961             (vm_map_pmap(map) != kernel_pmap &&
2962             vm_map_entry_system_wired_count(entry) != 0)) {
...
2970                 (void) vm_map_unlock_and_wait(map, 0);

It looks like munmap does wait on wired pages (well, system-wired pages, not mlock'ed pages).
The system-wire count on the map entry will be non-zero if vslock/vm_map_wire(...VM_MAP_WIRE_SYSTEM...) was called on it.
Does that mean UIO_USERSPACE dmamaps are actually safe from getting the UVA taken out from under them?
Obviously it doesn't make bcopy safe to do in the wrong process context, but that seems easily fixable.
Konstantin Belousov
2015-04-27 08:14:53 UTC
Post by Jason Harmening
2956 * Wait for wiring or unwiring of an entry to complete.
2957 * Also wait for any system wirings to disappear on
2958 * user maps.
2959 */
2960 if ((entry->eflags & MAP_ENTRY_IN_TRANSITION) != 0 ||
2961 (vm_map_pmap(map) != kernel_pmap &&
2962 vm_map_entry_system_wired_count(entry) != 0)) {
...
2970 (void) vm_map_unlock_and_wait(map, 0);
It looks like munmap does wait on wired pages (well, system-wired pages, not mlock'ed pages).
The system-wire count on the PTE will be non-zero if vslock/vm_map_wire(...VM_MAP_WIRE_SYSTEM...) was called on it.
Does that mean UIO_USERSPACE dmamaps are actually safe from getting the UVA taken out from under them?
Obviously it doesn't make bcopy safe to do in the wrong process context, but that seems easily fixable.
vslock() would indeed prevent the unmap, but it also causes very
serious user address space fragmentation. vslock() carves out a map
entry covering the specified region, which, for the typical
application use of malloced memory for buffers, could easily fragment
the bss into per-page map entries. That is not very important for the
current vslock() use by the sysctl code, since apps usually do a
bounded number of sysctls at startup, but it would definitely be an
issue if vslock() appeared on the I/O path.
Svatopluk Kraus
2015-04-27 14:21:35 UTC
On Mon, Apr 27, 2015 at 10:14 AM, Konstantin Belousov
Post by Konstantin Belousov
vslock() indeed would prevent the unmap, but it also causes very serious
user address space fragmentation. vslock() carves map entry covering the
specified region, which, for the typical application use of malloced
memory for buffers, could easily fragment the bss into per-page map
entries. It is not very important for the current vslock() use by sysctl
code, since apps usually do bounded number of sysctls at the startup,
but definitely it would be an issue if vslock() appears on the i/o path.
In the scope of this thread, there are two things which must be
ensured during DMA operations:

(1) The affected physical pages must be kept in the system at any
cost. That means no swapping and no freeing.

(2) The DMA sync must be doable. That means the physical pages must
be mapped somewhere if needed, even if only temporarily.

Point (1) must be ensured by the DMA client in whatever way is
suitable for it. It should not be a part of any DMA load or unload
method.

The subject of this thread was meant to be point (2). I have no
problem that it was extended to point (1) too; in fact, I welcome
that. But there are still two proposed solutions here for how to fix
bouncing for user space buffers.

The first solution is very simple: user space buffers are treated
like unmapped ones. If a mapping is needed, some temporary KVA is
used.

The second solution is simple too: if a mapping is needed and the
context is correct, the UVA is used; otherwise, some temporary KVA is
used. I prefer this solution because in the cache-non-coherent DMA
case, cache maintenance operations must be performed, and the buffer
must always have a valid mapping for them during the DMA sync.

I think that support for DMA from/to user space buffers is important
for graphics adapters, fast data grabbers, and whatever else needs
fast interaction between a user process and a device. IMHO, there is
no way we can drop support for it.

Thus some fix for bouncing must be done on all archs.
Svatopluk Kraus
2015-04-26 20:08:31 UTC
On Sat, Apr 25, 2015 at 7:28 PM, Konstantin Belousov
Post by Konstantin Belousov
1. vm_fault_quick_hold
2. bus_dmamap_load_ma
Post by Jason Harmening
That seems both unnecessarily complex and wasteful of KVA space.
The above sequence does not need a KVA allocation.
But if the buffer bounces, then some KVA must be allocated
temporarily for physcopyin/physcopyout.

FYI, we are in the following situation on ARM: (1) the DMA is not
cache coherent, and (2) cache maintenance operations are done on
virtual addresses. It means that cache maintenance must be done for
cached memory. Moreover, it must be done even for unmapped buffers,
and they must be mapped for that.

Thus it would help a lot if we could use the UVA for that when the
context is correct.
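
To illustrate (pmap_is_current() and temp_kva_for() are assumed
helpers, not existing KPI, and sync_vaddr/sync_paddr/sync_len stand
for the client buffer's VA, physical address and length):

-----------------------------
/*
 * Sketch for the non-coherent case: before the cache maintenance
 * that a PREWRITE sync needs, pick a VA that is really mapped in
 * this context.
 */
if (sync_vaddr != 0 && pmap_is_current(map->pmap))
        va = sync_vaddr;                /* the client UVA is usable here */
else
        va = temp_kva_for(sync_paddr);  /* temporary kernel mapping */
cpu_dcache_wb_range(va, sync_len);
cpu_l2cache_wb_range(va, sync_len);
-----------------------------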
John Baldwin
2015-04-28 13:40:33 UTC
Post by Konstantin Belousov
d_mmap_single() is used by GPU, definitely by GEM and TTM code, and possibly
by the proprietary nvidia driver.
Yes, the nvidia driver uses it. I've also used it for some proprietary
driver extensions.
Post by Konstantin Belousov
I believe UIO_USERSPACE is almost unused, it might be there for some
obscure (and buggy) driver.
I believe it was added (and only ever used) in crypto drivers, and that they
all did bus_dma operations in the context of the thread that passed in the
uio. I definitely think it is fragile and should be replaced with something
more reliable.
--
John Baldwin
Jason Harmening
2015-04-28 14:49:06 UTC
Post by John Baldwin
Post by Konstantin Belousov
d_mmap_single() is used by GPU, definitely by GEM and TTM code, and possibly
by the proprietary nvidia driver.
Yes, the nvidia driver uses it. I've also used it for some proprietary
driver extensions.
I've seen d_mmap_single() used in the GPU code, but I haven't seen it
used in conjunction with busdma (but maybe I'm not looking in the
right place).
Post by John Baldwin
Post by Konstantin Belousov
I believe UIO_USERSPACE is almost unused, it might be there for some
obscure (and buggy) driver.
I believe it was added (and only ever used) in crypto drivers, and that they
all did bus_dma operations in the context of the thread that passed in the
uio. I definitely think it is fragile and should be replaced with something
more reliable.
I think it's useful to make the bounce-buffering logic more robust in
cases where it's not executed in the owning process; it's also a really
simple set of changes. Of course doing vslock beforehand is still going
to be the only safe way to use that API, but that seems reasonable if
it's documented and done sparingly (which it is).
In the longer term, vm_fault_quick_hold_pages + _bus_dmamap_load_ma is
probably better for user buffers, at least for short transfers (which I
think is most of them). load_ma needs to at least be made a public and
documented KPI though. I'd like to try moving some of the drm2 code to
use it once I finally have a reasonably modern machine for testing -current.

Either _bus_dmamap_load_ma or out-of-context UIO_USERSPACE bounce
buffering could have issues with waiting on sfbufs on some arches,
including arm. That could be fixed by making each unmapped bounce
buffer set up a kva mapping for the data addr when it's created, but
that fix might be worse than the problem it's trying to solve.
Konstantin Belousov
2015-04-28 15:42:45 UTC
Post by Jason Harmening
Either _bus_dmamap_load_ma or out-of-context UIO_USERSPACE bounce
buffering could have issues with waiting on sfbufs on some arches,
including arm. That could be fixed by making each unmapped bounce
buffer set up a kva mapping for the data addr when it's created, but
that fix might be worse than the problem it's trying to solve.
I had an implementation of an sfbuf allocator which never sleeps. If
an sfbuf was not available without sleeping, a callback is called
later, when a reusable sf_buf is freed. It was written to allow
drivers like PIO ATA to take unmapped bios, but I never finished it;
at least, I did not convert a single driver.

I am not sure whether I can find the branch, or whether it is
reasonable to try to rebase it, but the basic idea may be useful for
the UIO_USERSPACE case as well.
Warner Losh
2015-04-28 16:19:20 UTC
Post by John Baldwin
Post by Konstantin Belousov
I believe UIO_USERSPACE is almost unused, it might be there for some
obscure (and buggy) driver.
I believe it was added (and only ever used) in crypto drivers, and that they
all did bus_dma operations in the context of the thread that passed in the
uio. I definitely think it is fragile and should be replaced with something
more reliable.
Fusion I/O’s SDK used this trick to allow mapping of userspace buffers down
into the block layer after doing the requisite locking / pinning / etc of the buffers
into memory. That’s if memory serves correctly (the SDK did these things, I can’t
easily check on that detail since I’m no longer at FIO).

Warner
Adrian Chadd
2015-04-28 19:10:43 UTC
Post by John Baldwin
Post by Konstantin Belousov
I believe UIO_USERSPACE is almost unused, it might be there for some
obscure (and buggy) driver.
I believe it was added (and only ever used) in crypto drivers, and that they
all did bus_dma operations in the context of the thread that passed in the
uio. I definitely think it is fragile and should be replaced with something
more reliable.
Fusion I/O’s SDK used this trick to allow mapping of userspace buffers down
into the block layer after doing the requisite locking / pinning / etc of the buffers
into memory. That’s if memory serves correctly (the SDK did these things, I can’t
easily check on that detail since I’m no longer at FIO).
This is a long-standing trick. physio() does it too,
aio_read/aio_write does it for direct block accesses. Now that pbufs
aren't involved anymore, it should scale rather well.

So I'd like to see more of it in the kernel and disk/net APIs and drivers.


-adrian
John Baldwin
2015-04-28 22:27:34 UTC
Post by Adrian Chadd
This is a long-standing trick. physio() does it too,
aio_read/aio_write does it for direct block accesses. Now that pbufs
aren't involved anymore, it should scale rather well.
So I'd like to see more of it in the kernel and disk/net APIs and drivers.
aio_read/write jump through gross hacks to create dedicated kthreads that
"borrow" the address space of the requester. The fact is that we want to
make unmapped I/O work in the general case and the same solutions for
temporary mappings for that can be reused to temporarily map the wired
pages backing a user request when needed. Reusing user mappings directly
in the kernel isn't really the way forward.
--
John Baldwin
Svatopluk Kraus
2015-04-29 10:22:19 UTC
Post by John Baldwin
aio_read/write jump through gross hacks to create dedicated kthreads that
"borrow" the address space of the requester. The fact is that we want to
make unmapped I/O work in the general case and the same solutions for
temporary mappings for that can be reused to temporarily map the wired
pages backing a user request when needed. Reusing user mappings directly
in the kernel isn't really the way forward.
If using unmapped buffers is the way we will go to handle user space
buffers, then:

(1) DMA clients which support DMA for user space buffers must use
some variant of _bus_dmamap_load_phys(). They must wire the physical
pages in the system anyway.
(2) Maybe some better way to temporarily allocate KVA for unmapped
buffers should be implemented.
(3) DMA clients which already use _bus_dmamap_load_uio() with
UIO_USERSPACE must be reimplemented or made obsolete.
(4) UIO_USERSPACE must be off limits in _bus_dmamap_load_uio(), and
the man page should be changed accordingly.
(5) The pmap can be deleted from struct bus_dmamap and from all
functions which take it as an argument. Only the kernel pmap will be
used in the DMA framework.

Did I miss anything?
Konstantin Belousov
2015-04-29 13:20:17 UTC
Post by Svatopluk Kraus
If using unmapped buffers is the way we will take to play with user
(1) DMA clients, which support DMA for user space buffers, must use
some variant of _bus_dmamap_load_phys(). They must wire physical pages
in system anyway.
No, vm_fault_quick_hold_pages() + bus_dmamap_load_ma().
Or yes, if you count bus_dmamap_load_ma() as a variant of _load_phys().
I do not.
Post by Svatopluk Kraus
(2) Maybe some better way how to temporarily allocate KVA for unmapped
buffers should be implemented.
See some other mail from me about non-blocking sfbuf allocator with
callback.
Post by Svatopluk Kraus
(3) DMA clients which already use _bus_dmamap_load_uio() with
UIO_USERSPACE must be reimplemented or made obsolete.
Yes.
Post by Svatopluk Kraus
(4) UIO_USERSPACE must be off limit in _bus_dmamap_load_uio() and man
page should be changed according to it.
Yes.
Post by Svatopluk Kraus
(5) And pmap can be deleted from struct bus_dmamap and all functions
which use it as argument. Only kernel pmap will be used in DMA
framework.
Probably yes.
Svatopluk Kraus
2015-04-29 15:09:18 UTC
On Wed, Apr 29, 2015 at 3:20 PM, Konstantin Belousov
Post by Konstantin Belousov
Post by Svatopluk Kraus
If using unmapped buffers is the way we will take to play with user
(1) DMA clients, which support DMA for user space buffers, must use
some variant of _bus_dmamap_load_phys(). They must wire physical pages
in system anyway.
No, vm_fault_quick_hold_pages() + bus_dmamap_load_ma().
Or yes, if you count bus_dmamap_load_ma() as a variant of _load_phys().
I do not.
There are only two basic functions in the MD implementations which
all other functions call: _bus_dmamap_load_phys() and
_bus_dmamap_load_buffer(), for unmapped and mapped buffers
respectively. Are you saying that bus_dmamap_load_ma() should be some
third kind?
Post by Konstantin Belousov
Post by Svatopluk Kraus
(2) Maybe some better way how to temporarily allocate KVA for unmapped
buffers should be implemented.
See some other mail from me about non-blocking sfbuf allocator with
callback.
This small list was meant as a summary. As I saw your emails in this
thread, I added this point. I did not gather that it was already in
the source tree.
Post by Konstantin Belousov
Post by Svatopluk Kraus
(3) DMA clients which already use _bus_dmamap_load_uio() with
UIO_USERSPACE must be reimplemented or made obsolete.
Yes.
Post by Svatopluk Kraus
(4) UIO_USERSPACE must be off limit in _bus_dmamap_load_uio() and man
page should be changed according to it.
Yes.
Hmm, I think that for a start, _bus_dmamap_load_uio() with
UIO_USERSPACE can be hacked to use bus_dmamap_load_ma(), maybe with
some warning to push the users of the old clients to reimplement
them.
Konstantin Belousov
2015-04-29 16:54:32 UTC
Permalink
Post by Svatopluk Kraus
On Wed, Apr 29, 2015 at 3:20 PM, Konstantin Belousov
Post by Konstantin Belousov
Post by Svatopluk Kraus
If using unmapped buffers is the way we will take to play with user
(1) DMA clients, which support DMA for user space buffers, must use
some variant of _bus_dmamap_load_phys(). They must wire physical pages
in system anyway.
No, vm_fault_quick_hold_pages() + bus_dmamap_load_ma().
Or yes, if you count bus_dmamap_load_ma() as a variant of _load_phys().
I do not.
There are only two basic functions in the MD implementations which all
other functions call: _bus_dmamap_load_phys() for unmapped buffers and
_bus_dmamap_load_buffer() for mapped ones. Are you saying that
bus_dmamap_load_ma() should be some third kind?
It is.

On the VT-d backed x86 busdma, load_ma() is the fundamental function,
which is called both by _load_buffer() and _load_phys(). This is not
completely true; the real backstage worker is called _load_something(),
but it differs from _load_ma() only by taking the casted tag and map.

On the other hand, the load_ma_triv() wrapper implements _load_ma()
using load_phys() on architectures which do not yet provide native
_load_ma(), or where native _load_ma() does not make sense.
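From memory, the trivial wrapper amounts to nothing more than walking
the page array and feeding each in-page chunk to _load_phys(); roughly
like this (a sketch, not the literal code from the tree):

-----------------------------
int
bus_dmamap_load_ma_triv(bus_dma_tag_t dmat, bus_dmamap_t map,
    struct vm_page **ma, bus_size_t tlen, int ma_offs, int flags,
    bus_dma_segment_t *segs, int *segp)
{
	vm_paddr_t paddr;
	bus_size_t len;
	int error, i;

	error = 0;
	for (i = 0; tlen > 0; i++, tlen -= len) {
		/* Clip the chunk to the current page. */
		len = MIN(PAGE_SIZE - ma_offs, tlen);
		paddr = VM_PAGE_TO_PHYS(ma[i]) + ma_offs;
		error = _bus_dmamap_load_phys(dmat, map, paddr, len,
		    flags, segs, segp);
		if (error != 0)
			break;
		ma_offs = 0;
	}
	return (error);
}
-----------------------------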
Post by Svatopluk Kraus
Post by Konstantin Belousov
Post by Svatopluk Kraus
(2) Maybe a better way to temporarily allocate KVA for unmapped
buffers should be implemented.
See some other mail from me about non-blocking sfbuf allocator with
callback.
This small list was meant as a summary. As I saw your emails in this
thread, I added this point. I did not realize that it's already in the
source tree.
No, it is not. I stopped working on it during the unmapped i/o work,
after I realized that there was not much interest from the device driver
authors. Nobody cared about drivers like ATA PIO.

Now, with the new possible use for the non-blocking sfbuf allocator,
it can be revived.
Post by Svatopluk Kraus
Post by Konstantin Belousov
Post by Svatopluk Kraus
(3) DMA clients which already use _bus_dmamap_load_uio() with
UIO_USERSPACE must be reimplemented or made obsolete.
Yes.
Post by Svatopluk Kraus
(4) UIO_USERSPACE must be put off limits in _bus_dmamap_load_uio(), and
the man page should be changed accordingly.
Yes.
Hmm, I think that, to begin with, _bus_dmamap_load_uio() for
UIO_USERSPACE can be hacked to use bus_dmamap_load_ma(), maybe with
some warning to force the users of the old clients to reimplement them.
Also it would be a good test for my claim that
vm_fault_quick_hold_pages() + bus_dmamap_load_ma() is all that is
needed.
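For a driver, that combination would look roughly like the sketch below.
It is untested; the function name and fixed-size page array are made up,
and the nsegs bookkeeping (mirroring what subr_bus_dma.c does for the
other loaders) should be cross-checked against the tree.

-----------------------------
static int
load_user_buf(bus_dma_tag_t dmat, bus_dmamap_t map, void *uaddr, size_t len)
{
	vm_page_t ma[16];	/* enough for a small example buffer */
	int cnt, error, nsegs;

	/* Fault in and wire the user pages of the current process. */
	cnt = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
	    (vm_offset_t)uaddr, len, VM_PROT_READ | VM_PROT_WRITE,
	    ma, nitems(ma));
	if (cnt == -1)
		return (EFAULT);
	nsegs = -1;
	error = _bus_dmamap_load_ma(dmat, map, ma, len,
	    (vm_offset_t)uaddr & PAGE_MASK, BUS_DMA_NOWAIT, NULL, &nsegs);
	nsegs++;	/* nsegs now holds the segment count */
	if (error != 0) {
		vm_page_unhold_pages(ma, cnt);
		return (error);
	}
	/* ... sync, start the DMA, wait, sync again, then unload ... */
	bus_dmamap_unload(dmat, map);
	vm_page_unhold_pages(ma, cnt);
	return (0);
}
-----------------------------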
Post by Svatopluk Kraus
Post by Konstantin Belousov
Post by Svatopluk Kraus
(5) And the pmap can be deleted from struct bus_dmamap and from all
functions which take it as an argument. Only the kernel pmap will be
used in the DMA framework.
Probably yes.
Jason Harmening
2015-04-29 18:04:46 UTC
Permalink
So, here's a patch that would add unmapped user bounce-buffer support for
existing UIO_USERSPACE cases. I've only made sure it builds (everywhere)
and given it a quick check on amd64.
Things to note:
--no changes to sparc64 and intel dmar, because they don't use bounce
buffers
--effectively adds UIO_USERSPACE support for mips, which was a KASSERT
before
--I am worried about the cache maintenance operations for arm and mips.
I'm not an expert in non-coherent architectures. In particular, I'm not
sure what (if any) allowances need to be made for user VAs that may be
present in VIPT caches on other cores of SMP systems.
--the above point about cache maintenance also makes me wonder how that
should be handled for drivers that would use vm_fault_quick_hold_pages() +
bus_dmamap_load_ma(). Presumably, some UVAs for the buffer could be
present in caches for the same or other core.
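The condition that gates the bcopy() calls throughout the patch is the
same everywhere; conceptually it is just the helper below (spelled out
only for illustration, the patch itself open-codes the test, and the
helper name is made up):

-----------------------------
/*
 * Can the client VAs stored in the map be dereferenced from the current
 * context?  True for kernel buffers, and for user buffers when we happen
 * to be running in the owning process.
 */
static __inline bool
dmamap_pmap_is_current(bus_dmamap_t map)
{

	return (map->pmap == kernel_pmap ||
	    map->pmap == vmspace_pmap(curproc->p_vmspace));
}
-----------------------------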


Index: sys/arm/arm/busdma_machdep-v6.c
===================================================================
--- sys/arm/arm/busdma_machdep-v6.c (revision 282208)
+++ sys/arm/arm/busdma_machdep-v6.c (working copy)
@@ -1309,15 +1309,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t
{
struct bounce_page *bpage;
struct sync_list *sl, *end;
- /*
- * If the buffer was from user space, it is possible that this is not
- * the same vm map, especially on a POST operation. It's not clear that
- * dma on userland buffers can work at all right now. To be safe, until
- * we're able to test direct userland dma, panic on a map mismatch.
- */
+
if ((bpage = STAILQ_FIRST(&map->bpages)) != NULL) {
- if (!pmap_dmap_iscurrent(map->pmap))
- panic("_bus_dmamap_sync: wrong user map for bounce sync.");

CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x op 0x%x "
"performing bounce", __func__, dmat, dmat->flags, op);
@@ -1328,14 +1321,10 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t
*/
if (op & BUS_DMASYNC_PREWRITE) {
while (bpage != NULL) {
- if (bpage->datavaddr != 0)
- bcopy((void *)bpage->datavaddr,
- (void *)bpage->vaddr,
- bpage->datacount);
+ if (bpage->datavaddr != 0 && pmap_dmap_iscurrent(map->pmap))
+ bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount);
else
- physcopyout(bpage->dataaddr,
- (void *)bpage->vaddr,
- bpage->datacount);
+ physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount);
cpu_dcache_wb_range((vm_offset_t)bpage->vaddr,
bpage->datacount);
l2cache_wb_range((vm_offset_t)bpage->vaddr,
@@ -1396,14 +1385,10 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t
arm_dcache_align;
l2cache_inv_range(startv, startp, len);
cpu_dcache_inv_range(startv, len);
- if (bpage->datavaddr != 0)
- bcopy((void *)bpage->vaddr,
- (void *)bpage->datavaddr,
- bpage->datacount);
+ if (bpage->datavaddr != 0 && pmap_dmap_iscurrent(map->pmap))
+ bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount);
else
- physcopyin((void *)bpage->vaddr,
- bpage->dataaddr,
- bpage->datacount);
+ physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount);
bpage = STAILQ_NEXT(bpage, links);
}
dmat->bounce_zone->total_bounced++;
@@ -1433,10 +1418,15 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t
* that the sequence is inner-to-outer for PREREAD invalidation and
* outer-to-inner for POSTREAD invalidation is not a mistake.
*/
+#ifndef ARM_L2_PIPT
+ /*
+ * If we don't have any physically-indexed caches, we don't need to do
+ * cache maintenance if we're not in the context that owns the VA.
+ */
+ if (!pmap_dmap_iscurrent(map->pmap))
+ return;
+#endif
if (map->sync_count != 0) {
- if (!pmap_dmap_iscurrent(map->pmap))
- panic("_bus_dmamap_sync: wrong user map for sync.");
-
sl = &map->slist[0];
end = &map->slist[map->sync_count];
CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x op 0x%x "
@@ -1446,7 +1436,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t
case BUS_DMASYNC_PREWRITE:
case BUS_DMASYNC_PREWRITE | BUS_DMASYNC_PREREAD:
while (sl != end) {
- cpu_dcache_wb_range(sl->vaddr, sl->datacount);
+ if (pmap_dmap_iscurrent(map->pmap))
+ cpu_dcache_wb_range(sl->vaddr, sl->datacount);
l2cache_wb_range(sl->vaddr, sl->busaddr,
sl->datacount);
sl++;
@@ -1472,7 +1463,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t
l2cache_wb_range(sl->vaddr,
sl->busaddr, 1);
}
- cpu_dcache_inv_range(sl->vaddr, sl->datacount);
+ if (pmap_dmap_iscurrent(map->pmap))
+ cpu_dcache_inv_range(sl->vaddr, sl->datacount);
l2cache_inv_range(sl->vaddr, sl->busaddr,
sl->datacount);
sl++;
@@ -1487,7 +1479,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t
while (sl != end) {
l2cache_inv_range(sl->vaddr, sl->busaddr,
sl->datacount);
- cpu_dcache_inv_range(sl->vaddr, sl->datacount);
+ if (pmap_dmap_iscurrent(map->pmap))
+ cpu_dcache_inv_range(sl->vaddr, sl->datacount);
sl++;
}
break;
Index: sys/arm/arm/busdma_machdep.c
===================================================================
--- sys/arm/arm/busdma_machdep.c (revision 282208)
+++ sys/arm/arm/busdma_machdep.c (working copy)
@@ -131,7 +131,6 @@ struct bounce_page {

struct sync_list {
vm_offset_t vaddr; /* kva of bounce buffer */
- bus_addr_t busaddr; /* Physical address */
bus_size_t datacount; /* client data count */
};

@@ -177,6 +176,7 @@ struct bus_dmamap {
STAILQ_ENTRY(bus_dmamap) links;
bus_dmamap_callback_t *callback;
void *callback_arg;
+ pmap_t pmap;
int sync_count;
struct sync_list *slist;
};
@@ -831,7 +831,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma
}

static void
-_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap,
+_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map,
void *buf, bus_size_t buflen, int flags)
{
vm_offset_t vaddr;
@@ -851,10 +851,10 @@ static void
vendaddr = (vm_offset_t)buf + buflen;

while (vaddr < vendaddr) {
- if (__predict_true(pmap == kernel_pmap))
+ if (__predict_true(map->pmap == kernel_pmap))
paddr = pmap_kextract(vaddr);
else
- paddr = pmap_extract(pmap, vaddr);
+ paddr = pmap_extract(map->pmap, vaddr);
if (run_filter(dmat, paddr) != 0)
map->pagesneeded++;
vaddr += PAGE_SIZE;
@@ -1009,7 +1009,7 @@ _bus_dmamap_load_ma(bus_dma_tag_t dmat, bus_dmamap
*/
int
_bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf,
- bus_size_t buflen, struct pmap *pmap, int flags, bus_dma_segment_t *segs,
+ bus_size_t buflen, pmap_t pmap, int flags, bus_dma_segment_t *segs,
int *segp)
{
bus_size_t sgsize;
@@ -1023,8 +1023,10 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm
if ((flags & BUS_DMA_LOAD_MBUF) != 0)
map->flags |= DMAMAP_CACHE_ALIGNED;

+ map->pmap = pmap;
+
if ((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) {
- _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags);
+ _bus_dmamap_count_pages(dmat, map, buf, buflen, flags);
if (map->pagesneeded != 0) {
error = _bus_dmamap_reserve_pages(dmat, map, flags);
if (error)
@@ -1042,6 +1044,8 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm
curaddr = pmap_kextract(vaddr);
} else {
curaddr = pmap_extract(pmap, vaddr);
+ if (curaddr == 0)
+ goto cleanup;
map->flags &= ~DMAMAP_COHERENT;
}

@@ -1067,7 +1071,6 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm
sl++;
sl->vaddr = vaddr;
sl->datacount = sgsize;
- sl->busaddr = curaddr;
} else
sl->datacount += sgsize;
}
@@ -1205,12 +1208,11 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap

STAILQ_FOREACH(bpage, &map->bpages, links) {
if (op & BUS_DMASYNC_PREWRITE) {
- if (bpage->datavaddr != 0)
- bcopy((void *)bpage->datavaddr,
- (void *)bpage->vaddr, bpage->datacount);
+ if (bpage->datavaddr != 0 &&
+ (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace)))
+ bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount);
else
- physcopyout(bpage->dataaddr,
- (void *)bpage->vaddr,bpage->datacount);
+ physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount);
cpu_dcache_wb_range(bpage->vaddr, bpage->datacount);
cpu_l2cache_wb_range(bpage->vaddr, bpage->datacount);
dmat->bounce_zone->total_bounced++;
@@ -1218,12 +1220,11 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap
if (op & BUS_DMASYNC_POSTREAD) {
cpu_dcache_inv_range(bpage->vaddr, bpage->datacount);
cpu_l2cache_inv_range(bpage->vaddr, bpage->datacount);
- if (bpage->datavaddr != 0)
- bcopy((void *)bpage->vaddr,
- (void *)bpage->datavaddr, bpage->datacount);
+ if (bpage->datavaddr != 0 &&
+ (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace)))
+ bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount);
else
- physcopyin((void *)bpage->vaddr,
- bpage->dataaddr, bpage->datacount);
+ physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount);
dmat->bounce_zone->total_bounced++;
}
}
@@ -1243,7 +1244,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t
_bus_dmamap_sync_bp(dmat, map, op);
CTR3(KTR_BUSDMA, "%s: op %x flags %x", __func__, op, map->flags);
bufaligned = (map->flags & DMAMAP_CACHE_ALIGNED);
- if (map->sync_count) {
+ if (map->sync_count != 0 &&
+ (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) {
end = &map->slist[map->sync_count];
for (sl = &map->slist[0]; sl != end; sl++)
bus_dmamap_sync_buf(sl->vaddr, sl->datacount, op,
Index: sys/mips/mips/busdma_machdep.c
===================================================================
--- sys/mips/mips/busdma_machdep.c (revision 282208)
+++ sys/mips/mips/busdma_machdep.c (working copy)
@@ -96,7 +96,6 @@ struct bounce_page {

struct sync_list {
vm_offset_t vaddr; /* kva of bounce buffer */
- bus_addr_t busaddr; /* Physical address */
bus_size_t datacount; /* client data count */
};

@@ -144,6 +143,7 @@ struct bus_dmamap {
void *allocbuffer;
TAILQ_ENTRY(bus_dmamap) freelist;
STAILQ_ENTRY(bus_dmamap) links;
+ pmap_t pmap;
bus_dmamap_callback_t *callback;
void *callback_arg;
int sync_count;
@@ -725,7 +725,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma
}

static void
-_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap,
+_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map,
void *buf, bus_size_t buflen, int flags)
{
vm_offset_t vaddr;
@@ -747,9 +747,11 @@ static void
while (vaddr < vendaddr) {
bus_size_t sg_len;

- KASSERT(kernel_pmap == pmap, ("pmap is not kernel pmap"));
sg_len = PAGE_SIZE - ((vm_offset_t)vaddr & PAGE_MASK);
- paddr = pmap_kextract(vaddr);
+ if (map->pmap == kernel_pmap)
+ paddr = pmap_kextract(vaddr);
+ else
+ paddr = pmap_extract(map->pmap, vaddr);
if (((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) &&
run_filter(dmat, paddr) != 0) {
sg_len = roundup2(sg_len, dmat->alignment);
@@ -895,7 +897,7 @@ _bus_dmamap_load_ma(bus_dma_tag_t dmat, bus_dmamap
*/
int
_bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf,
- bus_size_t buflen, struct pmap *pmap, int flags, bus_dma_segment_t *segs,
+ bus_size_t buflen, pmap_t pmap, int flags, bus_dma_segment_t *segs,
int *segp)
{
bus_size_t sgsize;
@@ -908,8 +910,10 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm
if (segs == NULL)
segs = dmat->segments;

+ map->pmap = pmap;
+
if ((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) {
- _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags);
+ _bus_dmamap_count_pages(dmat, map, buf, buflen, flags);
if (map->pagesneeded != 0) {
error = _bus_dmamap_reserve_pages(dmat, map, flags);
if (error)
@@ -922,12 +926,11 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm
while (buflen > 0) {
/*
* Get the physical address for this segment.
- *
- * XXX Don't support checking for coherent mappings
- * XXX in user address space.
*/
- KASSERT(kernel_pmap == pmap, ("pmap is not kernel pmap"));
- curaddr = pmap_kextract(vaddr);
+ if (pmap == kernel_pmap)
+ curaddr = pmap_kextract(vaddr);
+ else
+ curaddr = pmap_extract(pmap, vaddr);

/*
* Compute the segment size, and adjust counts.
@@ -951,7 +954,6 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dm
sl++;
sl->vaddr = vaddr;
sl->datacount = sgsize;
- sl->busaddr = curaddr;
} else
sl->datacount += sgsize;
}
@@ -1111,17 +1113,14 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap

STAILQ_FOREACH(bpage, &map->bpages, links) {
if (op & BUS_DMASYNC_PREWRITE) {
- if (bpage->datavaddr != 0)
+ if (bpage->datavaddr != 0 &&
+ (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace)))
bcopy((void *)bpage->datavaddr,
- (void *)(bpage->vaddr_nocache != 0 ?
- bpage->vaddr_nocache :
- bpage->vaddr),
+ (void *)(bpage->vaddr_nocache != 0 ? bpage->vaddr_nocache : bpage->vaddr),
bpage->datacount);
else
physcopyout(bpage->dataaddr,
- (void *)(bpage->vaddr_nocache != 0 ?
- bpage->vaddr_nocache :
- bpage->vaddr),
+ (void *)(bpage->vaddr_nocache != 0 ? bpage->vaddr_nocache : bpage->vaddr),
bpage->datacount);
if (bpage->vaddr_nocache == 0) {
mips_dcache_wb_range(bpage->vaddr,
@@ -1134,13 +1133,12 @@ _bus_dmamap_sync_bp(bus_dma_tag_t dmat, bus_dmamap
mips_dcache_inv_range(bpage->vaddr,
bpage->datacount);
}
- if (bpage->datavaddr != 0)
- bcopy((void *)(bpage->vaddr_nocache != 0 ?
- bpage->vaddr_nocache : bpage->vaddr),
+ if (bpage->datavaddr != 0 &&
+ (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace)))
+ bcopy((void *)(bpage->vaddr_nocache != 0 ? bpage->vaddr_nocache : bpage->vaddr),
(void *)bpage->datavaddr, bpage->datacount);
else
- physcopyin((void *)(bpage->vaddr_nocache != 0 ?
- bpage->vaddr_nocache : bpage->vaddr),
+ physcopyin((void *)(bpage->vaddr_nocache != 0 ? bpage->vaddr_nocache : bpage->vaddr),
bpage->dataaddr, bpage->datacount);
dmat->bounce_zone->total_bounced++;
}
@@ -1164,7 +1162,8 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t
return;

CTR3(KTR_BUSDMA, "%s: op %x flags %x", __func__, op, map->flags);
- if (map->sync_count) {
+ if (map->sync_count != 0 &&
+ (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace))) {
end = &map->slist[map->sync_count];
for (sl = &map->slist[0]; sl != end; sl++)
bus_dmamap_sync_buf(sl->vaddr, sl->datacount, op);
Index: sys/powerpc/powerpc/busdma_machdep.c
===================================================================
--- sys/powerpc/powerpc/busdma_machdep.c (revision 282208)
+++ sys/powerpc/powerpc/busdma_machdep.c (working copy)
@@ -131,6 +131,7 @@ struct bus_dmamap {
int nsegs;
bus_dmamap_callback_t *callback;
void *callback_arg;
+ pmap_t pmap;
STAILQ_ENTRY(bus_dmamap) links;
int contigalloc;
};
@@ -596,7 +597,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma
}

static void
-_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap,
+_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map,
void *buf, bus_size_t buflen, int flags)
{
vm_offset_t vaddr;
@@ -619,10 +620,10 @@ static void
bus_size_t sg_len;

sg_len = PAGE_SIZE - ((vm_offset_t)vaddr & PAGE_MASK);
- if (pmap == kernel_pmap)
+ if (map->pmap == kernel_pmap)
paddr = pmap_kextract(vaddr);
else
- paddr = pmap_extract(pmap, vaddr);
+ paddr = pmap_extract(map->pmap, vaddr);
if (run_filter(dmat, paddr) != 0) {
sg_len = roundup2(sg_len, dmat->alignment);
map->pagesneeded++;
@@ -785,8 +786,10 @@ _bus_dmamap_load_buffer(bus_dma_tag_t dmat,
if (segs == NULL)
segs = map->segments;

+ map->pmap = pmap;
+
if ((dmat->flags & BUS_DMA_COULD_BOUNCE) != 0) {
- _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags);
+ _bus_dmamap_count_pages(dmat, map, buf, buflen, flags);
if (map->pagesneeded != 0) {
error = _bus_dmamap_reserve_pages(dmat, map, flags);
if (error)
@@ -905,14 +908,11 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t

if (op & BUS_DMASYNC_PREWRITE) {
while (bpage != NULL) {
- if (bpage->datavaddr != 0)
- bcopy((void *)bpage->datavaddr,
- (void *)bpage->vaddr,
- bpage->datacount);
+ if (bpage->datavaddr != 0 &&
+ (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace)))
+ bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount);
else
- physcopyout(bpage->dataaddr,
- (void *)bpage->vaddr,
- bpage->datacount);
+ physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount);
bpage = STAILQ_NEXT(bpage, links);
}
dmat->bounce_zone->total_bounced++;
@@ -920,13 +920,11 @@ _bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t

if (op & BUS_DMASYNC_POSTREAD) {
while (bpage != NULL) {
- if (bpage->datavaddr != 0)
- bcopy((void *)bpage->vaddr,
- (void *)bpage->datavaddr,
- bpage->datacount);
+ if (bpage->datavaddr != 0 &&
+ (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace)))
+ bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount);
else
- physcopyin((void *)bpage->vaddr,
- bpage->dataaddr, bpage->datacount);
+ physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount);
bpage = STAILQ_NEXT(bpage, links);
}
dmat->bounce_zone->total_bounced++;
Index: sys/x86/x86/busdma_bounce.c
===================================================================
--- sys/x86/x86/busdma_bounce.c (revision 282208)
+++ sys/x86/x86/busdma_bounce.c (working copy)
@@ -121,6 +121,7 @@ struct bus_dmamap {
struct memdesc mem;
bus_dmamap_callback_t *callback;
void *callback_arg;
+ pmap_t pmap;
STAILQ_ENTRY(bus_dmamap) links;
};

@@ -139,7 +140,7 @@ static bus_addr_t add_bounce_page(bus_dma_tag_t dm
static void free_bounce_page(bus_dma_tag_t dmat, struct bounce_page *bpage);
int run_filter(bus_dma_tag_t dmat, bus_addr_t paddr);
static void _bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map,
- pmap_t pmap, void *buf, bus_size_t buflen,
+ void *buf, bus_size_t buflen,
int flags);
static void _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dmamap_t map,
vm_paddr_t buf, bus_size_t buflen,
@@ -491,7 +492,7 @@ _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dma
}

static void
-_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap,
+_bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map,
void *buf, bus_size_t buflen, int flags)
{
vm_offset_t vaddr;
@@ -515,10 +516,10 @@ static void

while (vaddr < vendaddr) {
sg_len = PAGE_SIZE - ((vm_offset_t)vaddr & PAGE_MASK);
- if (pmap == kernel_pmap)
+ if (map->pmap == kernel_pmap)
paddr = pmap_kextract(vaddr);
else
- paddr = pmap_extract(pmap, vaddr);
+ paddr = pmap_extract(map->pmap, vaddr);
if (bus_dma_run_filter(&dmat->common, paddr) != 0) {
sg_len = roundup2(sg_len,
dmat->common.alignment);
@@ -668,12 +669,14 @@ bounce_bus_dmamap_load_buffer(bus_dma_tag_t dmat,

if (map == NULL)
map = &nobounce_dmamap;
+ else
+ map->pmap = pmap;

if (segs == NULL)
segs = dmat->segments;

if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0) {
- _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags);
+ _bus_dmamap_count_pages(dmat, map, buf, buflen, flags);
if (map->pagesneeded != 0) {
error = _bus_dmamap_reserve_pages(dmat, map, flags);
if (error)
@@ -775,15 +778,11 @@ bounce_bus_dmamap_sync(bus_dma_tag_t dmat, bus_dma

if ((op & BUS_DMASYNC_PREWRITE) != 0) {
while (bpage != NULL) {
- if (bpage->datavaddr != 0) {
- bcopy((void *)bpage->datavaddr,
- (void *)bpage->vaddr,
- bpage->datacount);
- } else {
- physcopyout(bpage->dataaddr,
- (void *)bpage->vaddr,
- bpage->datacount);
- }
+ if (bpage->datavaddr != 0 &&
+ (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace)))
+ bcopy((void *)bpage->datavaddr, (void *)bpage->vaddr, bpage->datacount);
+ else
+ physcopyout(bpage->dataaddr, (void *)bpage->vaddr, bpage->datacount);
bpage = STAILQ_NEXT(bpage, links);
}
dmat->bounce_zone->total_bounced++;
@@ -791,15 +790,11 @@ bounce_bus_dmamap_sync(bus_dma_tag_t dmat, bus_dma

if ((op & BUS_DMASYNC_POSTREAD) != 0) {
while (bpage != NULL) {
- if (bpage->datavaddr != 0) {
- bcopy((void *)bpage->vaddr,
- (void *)bpage->datavaddr,
- bpage->datacount);
- } else {
- physcopyin((void *)bpage->vaddr,
- bpage->dataaddr,
- bpage->datacount);
- }
+ if (bpage->datavaddr != 0 &&
+ (map->pmap == kernel_pmap || map->pmap == vmspace_pmap(curproc->p_vmspace)))
+ bcopy((void *)bpage->vaddr, (void *)bpage->datavaddr, bpage->datacount);
+ else
+ physcopyin((void *)bpage->vaddr, bpage->dataaddr, bpage->datacount);
bpage = STAILQ_NEXT(bpage, links);
}
dmat->bounce_zone->total_bounced++;
Konstantin Belousov
2015-04-29 18:50:19 UTC
Permalink
Post by Jason Harmening
So, here's a patch that would add unmapped user bounce-buffer support for
existing UIO_USERSPACE cases. I've only made sure it builds (everywhere)
and given it a quick check on amd64.
--no changes to sparc64 and intel dmar, because they don't use bounce
buffers
--effectively adds UIO_USERSPACE support for mips, which was a KASSERT
before
--I am worried about the cache maintenance operations for arm and mips.
I'm not an expert in non-coherent architectures. In particular, I'm not
sure what (if any) allowances need to be made for user VAs that may be
present in VIPT caches on other cores of SMP systems.
--the above point about cache maintenance also makes me wonder how that
should be handled for drivers that would use vm_fault_quick_hold_pages() +
bus_dmamap_load_ma(). Presumably, some UVAs for the buffer could be
present in caches for the same or other core.
The spaces/tabs in your mail are damaged. It does not matter in the
text, but it makes the patch impossible to apply and hard to read.

I only read the x86/busdma_bounce.c part. It looks fine in the part
where you add the test for the current pmap being identical to the pmap
owning the user page mapping.

I do not understand the part of the diff for the bcopy/physcopyout lines;
I cannot find non-whitespace changes there, and a whitespace change would
make the line too long. Did I misread the patch ?

BTW, why not use physcopyout() unconditionally on x86 ? To avoid i386 sfbuf
allocation failures ?

For non-coherent arches, isn't the issue of CPUs having filled caches
for the DMA region present regardless of the vm_fault_quick_hold() use ?
DMASYNC_PREREAD/WRITE must ensure that the lines are written back and
invalidated even now, or always fall back to use bounce page.
Jason Harmening
2015-04-29 19:17:50 UTC
Permalink
Post by Konstantin Belousov
The spaces/tabs in your mail are damaged. It does not matter in the
text, but it makes the patch impossible to apply and hard to read.
Ugh. I'm at work right now and using the gmail web client. It seems like
every day I find a new way in which that thing is incredibly unfriendly for
use with mailing lists.
I will re-post the patch from a sane mail client later.
Post by Konstantin Belousov
I only read the x86/busdma_bounce.c part. It looks fine in the part
where you add the test for the current pmap being identical to the pmap
owning the user page mapping.
I do not understand the part of the diff for the bcopy/physcopyout lines;
I cannot find non-whitespace changes there, and a whitespace change would
make the line too long. Did I misread the patch ?
You probably misread it, since it is unreadable. There is a section in
bounce_bus_dmamap_sync() where I check for map->pmap being kernel_pmap or
curproc's pmap before doing bcopy.
Post by Konstantin Belousov
BTW, why not use physcopyout() unconditionally on x86 ? To avoid i386 sfbuf
allocation failures ?
Yes.
Post by Konstantin Belousov
For non-coherent arches, isn't the issue of CPUs having filled caches
for the DMA region present regardless of the vm_fault_quick_hold() use ?
DMASYNC_PREREAD/WRITE must ensure that the lines are written back and
invalidated even now, or always fall back to use bounce page.
Yes, that needs to be done regardless of how the pages are wired. The
particular problem here is that some caches on arm and mips are
virtually-indexed (usually virtually-indexed, physically-tagged (VIPT)).
That means the flush/invalidate instructions need virtual addresses, so
figuring out the correct UVA to use for those could be a challenge. As I
understand it, VIPT caches usually do have some hardware logic for finding
all the cachelines that correspond to a physical address, so they can
handle multiple VA mappings of the same PA. But it is unclear to me how
cross-processor cache maintenance is supposed to work with VIPT caches on
SMP systems.

If the caches were physically-indexed, then I don't think there would be an
issue. You'd just pass the PA to the flush/invalidate instruction, and
presumably a sane SMP implementation would propagate that to other cores
via IPI.
Konstantin Belousov
2015-04-29 19:33:37 UTC
Permalink
Post by Jason Harmening
Post by Konstantin Belousov
The spaces/tabs in your mail are damaged. It does not matter in the
text, but it makes the patch impossible to apply and hard to read.
Ugh. I'm at work right now and using the gmail web client. It seems like
every day I find a new way in which that thing is incredibly unfriendly for
use with mailing lists.
I will re-post the patch from a sane mail client later.
Post by Konstantin Belousov
I only read the x86/busdma_bounce.c part. It looks fine in the part
where you add the test for the current pmap being identical to the pmap
owning the user page mapping.
I do not understand the part of the diff for the bcopy/physcopyout lines;
I cannot find non-whitespace changes there, and a whitespace change would
make the line too long. Did I misread the patch ?
You probably misread it, since it is unreadable. There is a section in
bounce_bus_dmamap_sync() where I check for map->pmap being kernel_pmap or
curproc's pmap before doing bcopy.
See the paragraph in my mail before the one you answered.
I am asking about the bcopy()/physcopyout() lines in the diff, not about
the if () condition changes. The latter is definitely fine.
Post by Jason Harmening
Post by Konstantin Belousov
BTW, why not use physcopyout() unconditionally on x86 ? To avoid i386 sfbuf
allocation failures ?
Yes.
Post by Konstantin Belousov
For non-coherent arches, isn't the issue of CPUs having filled caches
for the DMA region present regardless of the vm_fault_quick_hold() use ?
DMASYNC_PREREAD/WRITE must ensure that the lines are written back and
invalidated even now, or always fall back to use bounce page.
Yes, that needs to be done regardless of how the pages are wired. The
particular problem here is that some caches on arm and mips are
virtually-indexed (usually virtually-indexed, physically-tagged (VIPT)).
That means the flush/invalidate instructions need virtual addresses, so
figuring out the correct UVA to use for those could be a challenge. As I
understand it, VIPT caches usually do have some hardware logic for finding
all the cachelines that correspond to a physical address, so they can
handle multiple VA mappings of the same PA. But it is unclear to me how
cross-processor cache maintenance is supposed to work with VIPT caches on
SMP systems.
If the caches were physically-indexed, then I don't think there would be an
issue. You'd just pass the PA to the flush/invalidate instruction, and
presumably a sane SMP implementation would propagate that to other cores
via IPI.
Even without SMP, a VIPT cache cannot hold two mappings of the same page.
As I understand it, sometimes it is more involved, e.g. if the mappings
have the correct color (e.g. on ultrasparcs), then the cache can deal
with aliasing. Otherwise the pmap has to map the page uncached for all
mappings.

I do not see what would make this case special for SMP after that.
Cache invalidation would be either not needed, or coherency domain
propagation of the virtual address does the right thing.
Jason Harmening
2015-04-29 19:59:02 UTC
Permalink
Post by Konstantin Belousov
See the paragraph in my mail before the one you answered.
I am asking about the bcopy()/physcopyout() lines in the diff, not about
the if () condition changes. The latter is definitely fine.
Oh, yes, sorry. There were a couple of whitespace changes there, but
nothing of consequence.
Post by Konstantin Belousov
Even without SMP, a VIPT cache cannot hold two mappings of the same page.
As I understand it, sometimes it is more involved, e.g. if the mappings
have the correct color (e.g. on ultrasparcs), then the cache can deal
with aliasing. Otherwise the pmap has to map the page uncached for all
mappings.
Yes, you are right. Regardless of whatever logic the cache uses (or
doesn't use), FreeBSD's page-coloring scheme should prevent that.
Post by Konstantin Belousov
I do not see what would make this case special for SMP after that.
Cache invalidation would be either not needed, or coherency domain
propagation of the virtual address does the right thing.
Since VIPT cache operations require a virtual address, I'm wondering about
the case where different processes are running on different cores, and the
same UVA corresponds to a completely different physical page for each of
those processes. If the d-cache for each core contains that UVA, then what
does it mean when one core issues a flush/invalidate instruction for that
UVA?

Admittedly, there's a lot I don't know about how that's supposed to work in
the arm/mips SMP world. For all I know, the SMP targets could be
fully-snooped and we don't need to worry about cache maintenance at all.
Ian Lepore
2015-04-29 22:23:24 UTC
Permalink
Post by Jason Harmening
Post by Konstantin Belousov
Even without SMP, a VIPT cache cannot hold two mappings of the same page.
As I understand it, sometimes it is more involved, e.g. if the mappings
have the correct color (e.g. on ultrasparcs), then the cache can deal
with aliasing. Otherwise the pmap has to map the page uncached for all
mappings.
Yes, you are right. Regardless of whatever logic the cache uses (or
doesn't use), FreeBSD's page-coloring scheme should prevent that.
Post by Konstantin Belousov
I do not see what would make this case special for SMP after that.
Cache invalidation would be either not needed, or coherency domain
propagation of the virtual address does the right thing.
Since VIPT cache operations require a virtual address, I'm wondering about
the case where different processes are running on different cores, and the
same UVA corresponds to a completely different physical page for each of
those processes. If the d-cache for each core contains that UVA, then what
does it mean when one core issues a flush/invalidate instruction for that
UVA?
Admittedly, there's a lot I don't know about how that's supposed to work in
the arm/mips SMP world. For all I know, the SMP targets could be
fully-snooped and we don't need to worry about cache maintenance at all.
For what we call armv6 (which is mostly armv7)...

The cache maintenance operations require virtual addresses, which means
it looks a lot like a VIPT cache. Under the hood the implementation
behaves as if it were a PIPT cache so even in the presence of multiple
mappings of the same physical page into different virtual addresses, the
SMP coherency hardware works correctly.

The ARM ARM says...

[Stuff about ARMv6 and page coloring when a cache way exceeds
4K.]

ARMv7 does not support page coloring, and requires that all data
and unified caches behave as Physically Indexed Physically
Tagged (PIPT) caches.

The only true armv6 chip we support isn't SMP and has a 16K/4-way cache
that neatly sidesteps the aliasing problem that requires page coloring
solutions. So on modern arm chips we get to act like we've got PIPT data
caches, but with the quirk that cache ops are initiated by virtual
address.

Basically, when you perform a cache maintenance operation, a
translation table walk is done on the core that issued the cache op,
then from that point on the physical address is used within the cache
hardware and that's what gets broadcast to the other cores by the snoop
control unit or cache coherency fabric (depending on the chip).

Not that it's germane to this discussion, but an ARM instruction cache
can really be VIPT with no "behave as if" restrictions in the spec.
That means when doing i-cache maintenance on a virtual address that
could be multiply-mapped, our only option is a rather expensive all-cores
"invalidate entire i-cache and branch predictor cache".

For the older armv4/v5 world which is VIVT, we have a restriction that a
page that is multiply-mapped cannot have cache enabled (it's handled in
pmap). That's also probably not very germane to this discussion,
because it doesn't seem likely that anyone is going to try to add
physical IO or userspace DMA support to that old code.

-- Ian
Jason Harmening
2015-04-29 23:10:18 UTC
Permalink
Post by Ian Lepore
For what we call armv6 (which is mostly armv7)...
The cache maintenance operations require virtual addresses, which means
it looks a lot like a VIPT cache. Under the hood the implementation
behaves as if it were a PIPT cache so even in the presence of multiple
mappings of the same physical page into different virtual addresses, the
SMP coherency hardware works correctly.
The ARM ARM says...
[Stuff about ARMv6 and page coloring when a cache way exceeds
4K.]
ARMv7 does not support page coloring, and requires that all data
and unified caches behave as Physically Indexed Physically
Tagged (PIPT) caches.
The only true armv6 chip we support isn't SMP and has a 16K/4-way cache
that neatly sidesteps the aliasing problem that requires page coloring
solutions. So on modern arm chips we get to act like we've got PIPT data
caches, but with the quirk that cache ops are initiated by virtual
address.
Cool, thanks for the explanation!
To satisfy my own curiosity, since it "looks like VIPT", does that mean we
still have to flush the cache on context switch?
Post by Ian Lepore
Basically, when you perform a cache maintenance operation, a
translation table walk is done on the core that issued the cache op,
then from that point on the physical address is used within the cache
hardware and that's what gets broadcast to the other cores by the snoop
control unit or cache coherency fabric (depending on the chip).
So, if we go back to the original problem of wanting to do
bus_dmamap_sync() on userspace buffers from some asynchronous context:

Say the process that owns the buffer is running on one core and prefetches
some data into a cacheline for the buffer, and bus_dmamap_sync(POSTREAD) is
done by a kernel thread running on another core. Since the core running
the kernel thread is responsible for the TLB lookup to get the physical
address, then since that core has no UVA the cache ops will be treated as
misses and the cacheline on the core that owns the UVA won't be
invalidated, correct?

That means the panic on !pmap_dmap_iscurrent() in busdma_machdep-v6.c
should stay?

Sort of the same problem would apply to drivers using
vm_fault_quick_hold_pages + bus_dmamap_load_ma...no cache maintenance,
since there are no VAs to operate on. Indeed, both the arm and mips
implementations of _bus_dmamap_load_phys don't do anything with the sync
list.
Jason Harmening
2015-04-30 04:13:45 UTC
Permalink
Post by Jason Harmening
Post by Ian Lepore
For what we call armv6 (which is mostly armv7)...
The cache maintenance operations require virtual addresses, which means
it looks a lot like a VIPT cache. Under the hood the implementation
behaves as if it were a PIPT cache so even in the presence of multiple
mappings of the same physical page into different virtual addresses, the
SMP coherency hardware works correctly.
The ARM ARM says...
[Stuff about ARMv6 and page coloring when a cache way exceeds
4K.]
ARMv7 does not support page coloring, and requires that all data
and unified caches behave as Physically Indexed Physically
Tagged (PIPT) caches.
The only true armv6 chip we support isn't SMP and has a 16K/4-way cache
that neatly sidesteps the aliasing problem that requires page coloring
solutions. So on modern arm chips we get to act like we've got PIPT data
caches, but with the quirk that cache ops are initiated by virtual
address.
Cool, thanks for the explanation!
To satisfy my own curiosity, since it "looks like VIPT", does that mean we
still have to flush the cache on context switch?
Post by Ian Lepore
Basically, when you perform a cache maintenance operation, a
translation table walk is done on the core that issued the cache op,
then from that point on the physical address is used within the cache
hardware and that's what gets broadcast to the other cores by the snoop
control unit or cache coherency fabric (depending on the chip).
So, if we go back to the original problem of wanting to do
bus_dmamap_sync() on userspace buffers from some asynchronous context:
Say the process that owns the buffer is running on one core and prefetches
some data into a cacheline for the buffer, and bus_dmamap_sync(POSTREAD) is
done by a kernel thread running on another core. Since the core running
the kernel thread is responsible for the TLB lookup to get the physical
address, then since that core has no UVA the cache ops will be treated as
misses and the cacheline on the core that owns the UVA won't be
invalidated, correct?
That means the panic on !pmap_dmap_iscurrent() in busdma_machdep-v6.c
should stay?
Sort of the same problem would apply to drivers using
vm_fault_quick_hold_pages + bus_dmamap_load_ma...no cache maintenance,
since there are no VAs to operate on. Indeed, both arm and mips
implementation of _bus_dmamap_load_phys don't do anything with the sync
list.
It occurs to me that one way to deal with both the blocking-sfbuf for
physcopy and VIPT cache maintenance might be to have a reserved per-CPU KVA
page. For arches that don't have a direct map, the idea would be to grab a
critical section, copy the bounce page or do cache maintenance on the
synclist entry, then drop the critical section. That brought up a dim
memory I had of Linux doing something similar, and in fact it seems to use
kmap_atomic for both cache ops and bounce buffers.
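Roughly what I'm picturing (just a sketch: busdma_percpu_kva[] and the
surrounding setup are made-up names here, not existing interfaces, and it
assumes the client data does not cross a page boundary):

-----------------------------
/*
 * Sketch: copy client data into a bounce page through a reserved
 * per-CPU KVA slot, so no sleepable sf_buf allocation is needed.
 * busdma_percpu_kva[] is an assumed array of per-CPU KVA pages
 * reserved at boot.
 */
static void
bounce_copyin_client(struct bounce_page *bpage)
{
	vm_offset_t kva, off;

	off = bpage->dataaddr & PAGE_MASK;
	critical_enter();		/* stay on this CPU, no preemption */
	kva = busdma_percpu_kva[curcpu];
	pmap_kenter(kva, trunc_page(bpage->dataaddr));	/* temporary map */
	bcopy((void *)(kva + off), (void *)bpage->vaddr, bpage->datacount);
	pmap_kremove(kva);		/* only a local TLB invalidation */
	critical_exit();
}
-----------------------------

The same per-CPU slot could presumably be used for the VA-based cache
maintenance on sync-list entries.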
Svatopluk Kraus
2015-04-30 09:53:06 UTC
Permalink
On Thu, Apr 30, 2015 at 1:10 AM, Jason Harmening
Post by Jason Harmening
Post by Ian Lepore
For what we call armv6 (which is mostly armv7)...
The cache maintenance operations require virtual addresses, which means
it looks a lot like a VIPT cache. Under the hood the implementation
behaves as if it were a PIPT cache so even in the presence of multiple
mappings of the same physical page into different virtual addresses, the
SMP coherency hardware works correctly.
The ARM ARM says...
[Stuff about ARMv6 and page coloring when a cache way exceeds
4K.]
ARMv7 does not support page coloring, and requires that all data
and unified caches behave as Physically Indexed Physically
Tagged (PIPT) caches.
The only true armv6 chip we support isn't SMP and has a 16K/4-way cache
that neatly sidesteps the aliasing problem that requires page coloring
solutions. So modern arm chips we get to act like we've got PIPT data
caches, but with the quirk that cache ops are initiated by virtual
address.
Cool, thanks for the explanation!
To satisfy my own curiosity, since it "looks like VIPT", does that mean we
still have to flush the cache on context switch?
No, in general there is no need to flush PIPT caches (even if they "look
like VIPT") on context switch. When it comes to cache maintenance, the
physical page is either already mapped in the correct context, or you have
to map it somewhere in the current context (KVA is used for that).
Post by Jason Harmening
Post by Ian Lepore
Basically, when you perform a cache maintenance operation, a
translation table walk is done on the core that issued the cache op,
then from that point on the physical address is used within the cache
hardware and that's what gets broadcast to the other cores by the snoop
control unit or cache coherency fabric (depending on the chip).
So, if we go back to the original problem of wanting to do
bus_dmamap_sync() on userspace buffers from some asynchronous context:
Say the process that owns the buffer is running on one core and prefetches
some data into a cacheline for the buffer, and bus_dmamap_sync(POSTREAD) is
done by a kernel thread running on another core. Since the core running the
kernel thread is responsible for the TLB lookup to get the physical address,
then since that core has no UVA the cache ops will be treated as misses and
the cacheline on the core that owns the UVA won't be invalidated, correct?
That means the panic on !pmap_dmap_iscurrent() in busdma_machdep-v6.c should
stay?
Not for unmapped buffers. For user space buffers, it's still an open
question how this will be resolved. Right now the direction seems to be to
not use UVAs for DMA buffers in the kernel at all. In any case, even if
UVAs end up being used, the panic won't be needed once a correct
implementation is in place.
Post by Jason Harmening
Sort of the same problem would apply to drivers using
vm_fault_quick_hold_pages + bus_dmamap_load_ma...no cache maintenance, since
there are no VAs to operate on. Indeed, both arm and mips implementation of
_bus_dmamap_load_phys don't do anything with the sync list.
I'm currently working on a _bus_dmamap_load_phys() implementation for
armv6. That means adding sync list entries for unmapped buffers (with the
virtual address set to zero) and implementing cache maintenance operations
that take a physical address as their argument. The given range is
temporarily mapped into the kernel (page by page) and the cache operation
is then called on the virtual address. It's the same scenario the i386
pmap uses.
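Roughly like this (only a sketch; busdma_tmp_kva and
cpu_dcache_wbinv_range() stand in for the real reserved KVA page and the
MD cache op, and protection of the temporary mapping against preemption is
left out):

-----------------------------
/*
 * Sketch: perform a cache write-back/invalidate on a physical range by
 * temporarily mapping it page by page and using the VA-based cache op.
 * busdma_tmp_kva and cpu_dcache_wbinv_range() are assumed names.
 */
static void
dma_cache_wbinv_phys(vm_paddr_t pa, vm_size_t size)
{
	vm_size_t chunk, off;

	while (size > 0) {
		off = pa & PAGE_MASK;
		chunk = MIN(PAGE_SIZE - off, size);
		pmap_kenter(busdma_tmp_kva, trunc_page(pa));
		cpu_dcache_wbinv_range(busdma_tmp_kva + off, chunk);
		pmap_kremove(busdma_tmp_kva);
		pa += chunk;
		size -= chunk;
	}
}
-----------------------------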

Konstantin Belousov
2015-04-30 08:38:32 UTC
Permalink
Post by Ian Lepore
For what we call armv6 (which is mostly armv7)...
The cache maintenance operations require virtual addresses, which means
it looks a lot like a VIPT cache. Under the hood the implementation
behaves as if it were a PIPT cache so even in the presence of multiple
mappings of the same physical page into different virtual addresses, the
SMP coherency hardware works correctly.
The ARM ARM says...
[Stuff about ARMv6 and page coloring when a cache way exceeds
4K.]
ARMv7 does not support page coloring, and requires that all data
and unified caches behave as Physically Indexed Physically
Tagged (PIPT) caches.
The only true armv6 chip we support isn't SMP and has a 16K/4-way cache
that neatly sidesteps the aliasing problem that requires page coloring
solutions. So modern arm chips we get to act like we've got PIPT data
caches, but with the quirk that cache ops are initiated by virtual
address.
Basically, when you perform a cache maintenance operation, a
translation table walk is done on the core that issued the cache op,
then from that point on the physical address is used within the cache
hardware and that's what gets broadcast to the other cores by the snoop
control unit or cache coherency fabric (depending on the chip).
This is the same as it is done on x86. There is a CLFLUSH instruction
which takes a virtual address and invalidates the cache line, maintaining
cache coherency within the coherency domain and possibly doing write-back;
it even sets the accessed bit in the page table entry.

My understanding is that the decision to operate on virtual addresses on
x86 was made to allow the instruction to work from user mode. Still, an
instruction to flush a cache line addressed by physical address would be
nice. The required circuitry is already there, since CPUs must react to
coherency requests from other CPUs.
On amd64, pmap_invalidate_cache_pages() uses the direct map, but on i386
the kernel has to use a specially allocated KVA page frame for a temporary
mapping (the per-CPU CMAP2); see i386/i386/pmap.c:pmap_flush_page().
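On amd64 the direct-map variant boils down to roughly this (a sketch only;
it leaves out the CLFLUSHOPT/self-snoop handling the real code has):

-----------------------------
/*
 * Sketch: flush all cache lines of one physical page through its
 * direct-map alias on amd64.  Simplified compared to the real code in
 * amd64/amd64/pmap.c.
 */
static void
flush_page_by_pa(vm_paddr_t pa)
{
	vm_offset_t va, eva;

	va = PHYS_TO_DMAP(trunc_page(pa));
	eva = va + PAGE_SIZE;
	mfence();			/* order earlier stores */
	for (; va < eva; va += cpu_clflush_line_size)
		clflush(va);
	mfence();			/* wait for the flushes to complete */
}
-----------------------------

The i386 CMAP2 path is the same idea, just with a per-CPU temporary
mapping in place of the direct map.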
Warner Losh
2015-04-29 20:05:42 UTC
Permalink
Yes, that needs to be done regardless of how the pages are wired. The particular problem here is that some caches on arm and mips are virtually-indexed (usually virtually-indexed, physically-tagged (VIPT)). That means the flush/invalidate instructions need virtual addresses, so figuring out the correct UVA to use for those could be a challenge. As I understand it, VIPT caches usually do have some hardware logic for finding all the cachelines that correspond to a physical address, so they can handle multiple VA mappings of the same PA. But it is unclear to me how cross-processor cache maintenance is supposed to work with VIPT caches on SMP systems.
If the caches were physically-indexed, then I don't think there would be an issue. You'd just pass the PA to the flush/invalidate instruction, and presumably a sane SMP implementation would propagate that to other cores via IPI.
I know that on MIPS you can never have more than one mapping to a page you are doing DMA to/from.

Warner
Svatopluk Kraus
2015-04-26 20:00:32 UTC
Permalink
On Sat, Apr 25, 2015 at 11:41 AM, Konstantin Belousov
Post by Konstantin Belousov
Post by Jason Harmening
--POSTWRITE and POSTREAD are only asynchronous if you call them from an
asynchronous context.
For a driver that's already performing DMA operations on usermode memory,
it seems likely that there's going to be *some* place where you can call
bus_dmamap_sync() and be guaranteed to be executing in the context of the
process that owns the memory. Then a regular bcopy will be safe and
inexpensive, assuming the pages have been properly vslock-ed/vm_map_wire-d.
That's usually whatever read/write/ioctl operation spawned the DMA transfer
in the first place. So, in those cases can you not just move the
POSTREAD/POSTWRITE sync from the "DMA-finished" interrupt to the
d_read/d_write/d_ioctl that waits on the "DMA-finished" interrupt?
--physcopyin/physcopyout aren't trivial. They go through uiomove_fromphys,
which often uses sfbufs to create temporary KVA mappings for the physical
pages. sf_buf_alloc() can sleep (unless SFB_NOWAIT is specified, which
means it can fail and which uiomove_fromphys does not specify for good
reason); that makes it unsafe for use in either a threaded interrupt or a
filter. Perhaps the physcopyout path could be changed to use pmap_qenter
directly in this case, but that can still be expensive in terms of TLB
shootdowns.
Checking against VM_MIN_KERNEL_ADDRESS seems sketchy; it eliminates the
chance to use a much-less-expensive bcopy in cases where the sync is
happening in correct process context.
Context-switching during bus_dmamap_sync() shouldn't be an issue. In a
filter interrupt, curproc will be completely arbitrary but none of this
stuff should be called in a filter anyway. Otherwise, if you're in a
kernel thread (including an ithread), curproc should be whatever proc was
supplied when the thread was created. That's usually proc0, which only has
kernel address space. IOW, even if a context-switch happens sometime
during bus_dmamap_sync, any pmap check or copy should have a consistent and
non-arbitrary process context.
I think something like your second solution would be workable to make
UIO_USERSPACE syncs work in non-interrupt kernel threads, but given all the
restrictions and extra cost of physcopy, I'm not sure how useful that would
be.
I do think busdma.9 could at least use a note that bus_dmamap_sync() is
only safe to call in the context of the owning process for user buffers.
UIO_USERSPACE for busdma is absolutely unsafe and cannot be used without
making the kernel panic. Even if you wire the userspace buffer, there is
nothing to prevent another thread in the user process, or another process
sharing the same address space, from calling munmap(2) on the range.
Using vslock() is the method proposed in the bus_dma man page. IMO, the
function looks complex and can be a big time eater. However, are you
saying that vslock() does not work for that? Then for what reason does
that function exist?
Post by Konstantin Belousov
The only safe method to work with the userspace regions is to
vm_fault_quick_hold() them to get hold on the pages, and then either
pass pages array down, or remap them in the KVA with pmap_qenter().
So even vm_fault_quick_hold() does not keep the user mapping valid?
Post by Konstantin Belousov
Post by Jason Harmening
Post by Svatopluk Kraus
DMA can be done on client buffer from user address space. For example,
thru bus_dmamap_load_uio() when uio->uio_segflg is UIO_USERSPACE. Such
client buffer can bounce and then, it must be copied to and from
bounce buffer in bus_dmamap_sync().
Current implementations (in all archs) do not take into account that
bus_dmamap_sync() is asynchronous for POSTWRITE and POSTREAD in
general. It can be asynchronous for PREWRITE and PREREAD too. For
example, in some driver implementations where DMA client buffers
operations are buffered. In those cases, simple bcopy() is not
correct.
-----------------------------
struct bounce_page {
vm_offset_t vaddr; /* kva of bounce buffer */
bus_addr_t busaddr; /* Physical address */
vm_offset_t datavaddr; /* kva of client data */
bus_addr_t dataaddr; /* client physical address */
bus_size_t datacount; /* client data count */
STAILQ_ENTRY(bounce_page) links;
};
if ((op & BUS_DMASYNC_PREWRITE) != 0) {
while (bpage != NULL) {
if (bpage->datavaddr != 0) {
bcopy((void *)bpage->datavaddr,
(void *)bpage->vaddr,
bpage->datacount);
} else {
physcopyout(bpage->dataaddr,
(void *)bpage->vaddr,
bpage->datacount);
}
bpage = STAILQ_NEXT(bpage, links);
}
dmat->bounce_zone->total_bounced++;
}
-----------------------------
(1) datavaddr is not always kva of client data, but sometimes it can
be uva of client data.
(2) bcopy() can be used only if datavaddr is kva or when map->pmap is
current pmap.
Note that there is an implication for bus_dmamap_load_uio() with
uio->uio_segflg set to UIO_USERSPACE that used physical pages are
in-core and wired. See "man bus_dma".
There is not public interface to check that map->pmap is current pmap.
if (bpage->datavaddr >= VM_MIN_KERNEL_ADDRESS) {
bcopy();
} else {
physcopy();
}
If there will be public pmap_is_current() then another solution is the
if (bpage->datavaddr != 0) && pmap_is_current(map->pmap)) {
bcopy();
} else {
physcopy();
}
The second solution implies that context switch must not happen during
bus_dmamap_sync() called from an interrupt routine. However, IMO, it's
granted.
Note that map->pmap should be always kernel_pmap for datavaddr >=
VM_MIN_KERNEL_ADDRESS.
Comments, different solutions, or objections?
Svatopluk Kraus
Jason Harmening
2015-04-27 14:46:52 UTC
Permalink
Post by Svatopluk Kraus
Using vslock() is the method proposed in the bus_dma man page. IMO, the
function looks complex and can be a big time eater. However, are you
saying that vslock() does not work for that? Then for what reason
does that function exist?
There's been some misunderstanding here, I think. If you use vslock (or
vm_map_wire, which vslock wraps), then the UVAs should be safe from
teardown and you should be able to use bcopy if you are in the correct
context. See the post elsewhere in this thread where I dig through the
sys_munmap path and find vm_map_delete waiting on system-wired PTEs.
Post by Svatopluk Kraus
Post by Konstantin Belousov
The only safe method to work with the userspace regions is to
vm_fault_quick_hold() them to get hold on the pages, and then either
pass pages array down, or remap them in the KVA with pmap_qenter().
So even vm_fault_quick_hold() does not keep the user mapping valid?
vm_fault_quick_hold_pages() doesn't do any bookkeeping on UVAs, only on
the underlying physical pages. That means the UVA region can still be
munmap'ed even while vm_fault_quick_hold_pages() holds the pages. So if
you use vm_fault_quick_hold_pages() instead of vslock(), you can't use
bus_dmamap_load_uio(UIO_USERSPACE), because that assumes valid UVA
mappings. You must instead deal only with the underlying vm_page_t's,
which means using _bus_dmamap_load_ma().
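For example, the hold part of that path looks roughly like this (a sketch;
the page-array sizing and the actual busdma load of the array are left to
the caller):

-----------------------------
/*
 * Sketch: wire the user buffer's pages without depending on its UVAs.
 * The caller later hands 'ma' down the _bus_dmamap_load_ma() path and
 * releases the pages with vm_page_unhold_pages(ma, count) after the
 * DMA completes.
 */
static int
hold_user_buffer(vm_offset_t uva, vm_size_t len, vm_prot_t prot,
    vm_page_t *ma, int maxpages, int *pcount)
{
	int count;

	count = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
	    uva, len, prot, ma, maxpages);
	if (count < 0)
		return (EFAULT);
	*pcount = count;
	return (0);
}
-----------------------------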

Here's my take on it:

vslock(), as you mention, is very complex. It not only keeps the physical
pages from being swapped out, but it also removes them from page queues (see
https://lists.freebsd.org/pipermail/freebsd-current/2015-March/054890.html)
and does a lot of bookkeeping on the UVA mappings for those pages. Part of
that involves simulating a pagefault, which as kib mentions can lead to a
lot of UVA fragmentation.

vm_fault_quick_hold_pages() is much cheaper and seems mostly intended for
short-term DMA operations.

So, you might use vslock() + bus_dmamap_load_uio() for long-duration DMA
transfers, like continuous streaming to a circular buffer that could last
minutes or longer. Then, the extra cost of the vslock will be amortized
over the long time of the transfer, and UVA fragmentation will be less of a
concern since you presumably will have a limited number of vslock() calls
over the lifetime of the process. Also, you will probably be keeping the
DMA map for a long duration anyway, so it should be OK to wait and call
bus_dmamap_sync() in the process context. Since vslock() removed the pages
from the page queues, there will also be less work for pagedaemon to do
during the long transfer.
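The setup for such a transfer might look roughly like this (a sketch;
struct my_softc, its dmat/dmamap fields and stream_cb are placeholders,
and it assumes a single iovec):

-----------------------------
/*
 * Sketch: wire a user buffer for a long-running transfer and load it
 * with UIO_USERSPACE.  my_softc and stream_cb are hypothetical.
 */
static int
start_stream(struct my_softc *sc, struct uio *uio)
{
	void *base = uio->uio_iov[0].iov_base;	/* assume one iovec */
	size_t len = uio->uio_iov[0].iov_len;
	int error;

	error = vslock(base, len);	/* wire the UVAs and the pages */
	if (error != 0)
		return (error);
	error = bus_dmamap_load_uio(sc->sc_dmat, sc->sc_dmamap, uio,
	    stream_cb, sc, BUS_DMA_NOWAIT);
	if (error != 0)
		vsunlock(base, len);
	return (error);
}
-----------------------------

Teardown would then be bus_dmamap_unload() followed by vsunlock(), again
in the owning process context.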

OTOH, vm_fault_quick_hold_pages() + _bus_dmamap_load_ma() seems much better
to do for frequent short transfers to widely-varying buffers, such as block
I/O. The extra pagedaemon work is inconsequential here, and since the DMA
operations are frequent and you may have many in-flight at once, the
reduced setup cost and fragmentation are much more important.
Svatopluk Kraus
2015-04-26 19:56:57 UTC
Permalink
On Sat, Apr 25, 2015 at 12:50 AM, Jason Harmening
Post by Jason Harmening
--POSTWRITE and POSTREAD are only asynchronous if you call them from an
asynchronous context.
For a driver that's already performing DMA operations on usermode memory, it
seems likely that there's going to be *some* place where you can call
bus_dmamap_sync() and be guaranteed to be executing in the context of the
process that owns the memory. Then a regular bcopy will be safe and
inexpensive, assuming the pages have been properly vslock-ed/vm_map_wire-d.
That's usually whatever read/write/ioctl operation spawned the DMA transfer
in the first place. So, in those cases can you not just move the
POSTREAD/POSTWRITE sync from the "DMA-finished" interrupt to the
d_read/d_write/d_ioctl that waits on the "DMA-finished" interrupt?
Yes, that could be possible in those cases. However, it implies that the
DMA unload must be moved as well, and, to make it symmetric, the DMA load
too. Then the DMA driver just programs the hardware and every client must
do the DMA load, sync, wait for completion, sync, and unload itself, as in
the sketch below. So (1) almost the same code will be spread across many
places, and (2) all the work done for the DMA load will stay pending in
the system much longer.
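Something like this in every client (just a sketch; my_softc, my_load_cb,
my_start_dma() and my_wait_dma() are made-up names, and wiring of the user
pages is not shown):

-----------------------------
/*
 * Sketch of the boilerplate each client would carry: load, sync, start
 * the hardware, wait, sync again and unload, all in d_read() context.
 */
static int
my_read(struct cdev *dev, struct uio *uio, int ioflag)
{
	struct my_softc *sc = dev->si_drv1;
	int error;

	error = bus_dmamap_load_uio(sc->sc_dmat, sc->sc_dmamap, uio,
	    my_load_cb, sc, BUS_DMA_NOWAIT);
	if (error != 0)
		return (error);
	bus_dmamap_sync(sc->sc_dmat, sc->sc_dmamap, BUS_DMASYNC_PREREAD);
	my_start_dma(sc);
	error = my_wait_dma(sc);	/* sleep until the interrupt fires */
	bus_dmamap_sync(sc->sc_dmat, sc->sc_dmamap, BUS_DMASYNC_POSTREAD);
	bus_dmamap_unload(sc->sc_dmat, sc->sc_dmamap);
	return (error);
}
-----------------------------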
Post by Jason Harmening
--physcopyin/physcopyout aren't trivial. They go through uiomove_fromphys,
which often uses sfbufs to create temporary KVA mappings for the physical
pages. sf_buf_alloc() can sleep (unless SFB_NOWAIT is specified, which
means it can fail and which uiomove_fromphys does not specify for good
reason); that makes it unsafe for use in either a threaded interrupt or a
filter. Perhaps the physcopyout path could be changed to use pmap_qenter
directly in this case, but that can still be expensive in terms of TLB
shootdowns.
I thought that unmapped buffers are used to save KVA space, and for such
buffers physcopyin/physcopyout must already be used. So if there is some
slowdown, it has already been accepted. And if it's good enough for
unmapped buffers, it should be good enough for user buffers as well.

I'm not so afraid of TLB shootdowns on ARM. On the contrary, the
architecture is not DMA cache coherent, so cache maintenance is the bigger
concern. It must always be done for cached memory, bouncing or not.
Post by Jason Harmening
Checking against VM_MIN_KERNEL_ADDRESS seems sketchy; it eliminates the
chance to use a much-less-expensive bcopy in cases where the sync is
happening in correct process context.
Right, but it's the simplest solution.
Post by Jason Harmening
Context-switching during bus_dmamap_sync() shouldn't be an issue. In a
filter interrupt, curproc will be completely arbitrary but none of this
stuff should be called in a filter anyway. Otherwise, if you're in a kernel
thread (including an ithread), curproc should be whatever proc was supplied
when the thread was created. That's usually proc0, which only has kernel
address space. IOW, even if a context-switch happens sometime during
bus_dmamap_sync, any pmap check or copy should have a consistent and
non-arbitrary process context.
It's a correct analysis given those presumptions. But why are you so sure
that this stuff should not be done in an interrupt filter?
Post by Jason Harmening
I think something like your second solution would be workable to make
UIO_USERSPACE syncs work in non-interrupt kernel threads, but given all the
restrictions and extra cost of physcopy, I'm not sure how useful that would
be.
That, or a KASSERT to catch a bad context. In fact, the second solution
does not close any doors. If it's called in the correct context, bcopy is
used anyway, and if it's called in a bad context, only some extra work is
done due to physcopyin/physcopyout.
Post by Jason Harmening
I do think busdma.9 could at least use a note that bus_dmamap_sync() is only
safe to call in the context of the owning process for user buffers.
At least for now. However, I would be unhappy if it remained that way forever.
Post by Jason Harmening
Post by Svatopluk Kraus
DMA can be done on client buffer from user address space. For example,
thru bus_dmamap_load_uio() when uio->uio_segflg is UIO_USERSPACE. Such
client buffer can bounce and then, it must be copied to and from
bounce buffer in bus_dmamap_sync().
Current implementations (in all archs) do not take into account that
bus_dmamap_sync() is asynchronous for POSTWRITE and POSTREAD in
general. It can be asynchronous for PREWRITE and PREREAD too. For
example, in some driver implementations where DMA client buffers
operations are buffered. In those cases, simple bcopy() is not
correct.
-----------------------------
struct bounce_page {
vm_offset_t vaddr; /* kva of bounce buffer */
bus_addr_t busaddr; /* Physical address */
vm_offset_t datavaddr; /* kva of client data */
bus_addr_t dataaddr; /* client physical address */
bus_size_t datacount; /* client data count */
STAILQ_ENTRY(bounce_page) links;
};
if ((op & BUS_DMASYNC_PREWRITE) != 0) {
while (bpage != NULL) {
if (bpage->datavaddr != 0) {
bcopy((void *)bpage->datavaddr,
(void *)bpage->vaddr,
bpage->datacount);
} else {
physcopyout(bpage->dataaddr,
(void *)bpage->vaddr,
bpage->datacount);
}
bpage = STAILQ_NEXT(bpage, links);
}
dmat->bounce_zone->total_bounced++;
}
-----------------------------
(1) datavaddr is not always kva of client data, but sometimes it can
be uva of client data.
(2) bcopy() can be used only if datavaddr is kva or when map->pmap is
current pmap.
Note that there is an implication for bus_dmamap_load_uio() with
uio->uio_segflg set to UIO_USERSPACE that used physical pages are
in-core and wired. See "man bus_dma".
There is not public interface to check that map->pmap is current pmap.
if (bpage->datavaddr >= VM_MIN_KERNEL_ADDRESS) {
bcopy();
} else {
physcopy();
}
If there will be public pmap_is_current() then another solution is the
if (bpage->datavaddr != 0) && pmap_is_current(map->pmap)) {
bcopy();
} else {
physcopy();
}
The second solution implies that context switch must not happen during
bus_dmamap_sync() called from an interrupt routine. However, IMO, it's
granted.
Note that map->pmap should be always kernel_pmap for datavaddr >=
VM_MIN_KERNEL_ADDRESS.
Comments, different solutions, or objections?
Svatopluk Kraus