Discussion:
Ranting about OCF / crypto(9)
John Baldwin
2018-01-11 00:18:52 UTC
Permalink
While working on hooking the ccr(4) driver into our in-kernel crypto
framework (along with some out-of-tree patches to extend OpenSSL's
/dev/crypto engine to support AES-CTR/XTS/GCM and some further changes to
do zero-copy), I've run into several bumps / oddities in OCF. I'm probably
going to miss several of them, but here's at least a start of a list of
things. In some cases I have some suggestions on improvements.

I will try to start with more broad / higher-level items first before
diving into minutiae:

- OCF is overly flexible and overly broad. Rather than supporting
arbitrary stacking of transforms (and arbitrary depths), I think we
should probably aim to support more specific workloads and support
them well. To my mind the classes of things we should support are
probably:

- Simple block cipher requests.
- Simple "hash a buffer" requests. (Both HMAC and non-HMAC)
- IPSec-style requests (combined auth and encryption using
"encrypt-then-mac" with an optional AAD region before the
ciphertext). Note that geli requests fall into this type.
- TLS-style requests (using TLS's different methods of
combining auth and encryption methods when those are
separate)
- Simple compression / decompression requests. While this isn't
"crypto", per se, I do think it is probably still simpler to
manage this via OCF than a completely separate interface.

In terms of algorithms, I suspect there are some older algorithms
we could drop. Modern hardware doesn't offload DES for example.
Both ccr(4) and aesni(4) only support AES for encryption. We
do need to keep algorithms required for IPSec in the kernel, but
we could probably drop some others?

- To better support OpenSSL's engine, the /dev/crypto hash interface
should not require monolithic buffers, but support requests for
large buffers that span multiple requests (so you can do something
akin to the 'Init' / 'Update' (N times) / 'Final' model existing
software hashing APIs use). In particular, the bigger win for
hashing in hardware is when you can offload the hashing of a large
thing rather than small requests.
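To make the Init/Update/Final model concrete, here is a toy sketch of
what "spanning multiple requests" means: intermediate state carried
across several small requests must produce the same result as one
monolithic request. A trivial rolling checksum stands in for SHA*; all
names are illustrative, not an existing API:

```c
#include <assert.h>
#include <stddef.h>

/* Toy hash state; a real driver would carry the SHA* midstate here. */
struct hash_state {
	unsigned int h;
};

static void
hash_init(struct hash_state *st)
{
	st->h = 5381;			/* djb2-style seed */
}

static void
hash_update(struct hash_state *st, const unsigned char *p, size_t len)
{
	size_t i;

	/* Each call is one "request"; state persists between calls. */
	for (i = 0; i < len; i++)
		st->h = st->h * 33 + p[i];
}

static unsigned int
hash_final(struct hash_state *st)
{
	return (st->h);
}
```

The point is that the /dev/crypto session would hold the equivalent of
struct hash_state between ioctls, so userland can feed a large buffer
in pieces.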

- To better support OpenSSL's engine, the /dev/crypto hash interface
should support "plain" hash algorithms such as SHA* without an
HMAC. By default OpenSSL's engine interface does the HMAC-specific
bits (generating pads, etc.) in software and only defers to the
engine for the raw hash (e.g. if you use the HMAC() function from
libcrypto it will only ask the engine interface for a raw hash,
not for an HMAC hash).

- To better support OpenSSL's engine, the /dev/crypto cipher
interface should also support non-monolithic buffers. The existing
engine does this now by copying the last block of the output data
out as a saved IV to use for a subsequent request, but it might be
nicer to be more formal here and return the IV to userland for
non-"final" cipher requests.

- The interface between the crypto layer and backend drivers should
_not_ use integer session IDs. This is ridiculously dumb and
inefficient. All the drivers have silly algorithms to try to manage
growable arrays that can be indexed by the returned session ID.
Instead, drivers should be able to return a 'void *' cookie when
creating a session and get that cookie pointer as an argument to
the 'process' and 'freesession' callbacks. Imagine if vnodes used
an i-node number rather than 'v_data' and you'd have the model OCF
uses. I don't mind if we have a kind of generic 'session' structure
that we export to drivers and pass in the callbacks and the drivers
get to use a 'foo_data' member of.
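A minimal sketch of the cookie model, with entirely hypothetical names
(struct crypto_session, cs_drv_data, and the mydrv_* callbacks are
illustrative, not existing OCF interfaces):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Framework-owned session object with an opaque driver cookie,
 * analogous to a vnode's v_data. */
struct crypto_session {
	void *cs_drv_data;		/* driver-private state */
};

struct mydrv_state {
	int keylen;			/* whatever the driver needs */
};

/* Driver "newsession": allocate private state, hang it off the session. */
static int
mydrv_newsession(struct crypto_session *cses, int keylen)
{
	struct mydrv_state *s = malloc(sizeof(*s));

	if (s == NULL)
		return (-1);
	s->keylen = keylen;
	cses->cs_drv_data = s;
	return (0);
}

/* Driver "process": no growable-array lookup, just use the cookie. */
static int
mydrv_process(struct crypto_session *cses)
{
	struct mydrv_state *s = cses->cs_drv_data;

	return (s->keylen);
}

static void
mydrv_freesession(struct crypto_session *cses)
{
	free(cses->cs_drv_data);
	cses->cs_drv_data = NULL;
}
```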

- The interface to describe crypto requests needs to move away from
arbitrary linked lists of descriptors. We should just have a
single "session" structure that assumes you have one cipher and
one auth with a "mode" member to indicate the particular direction
/ combination. Likewise, the description of a request needs to
have a similar assumption. The structures used by the /dev/crypto
ioctl's are a bit closer to what I think we should use compared to
the linked-list thing we have now. Related is that we should be
able to get rid of having the three separate "algorithms" for GCM
hashes. For AES-GCM one would just say they are using AES-GCM
and both the hash/tag and ciphertext would be valid inputs / outputs
with a single key.
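A rough sketch of what a flattened session description might look like;
the enum values and field names are hypothetical, chosen only to show
the "one cipher, one auth, explicit mode" shape replacing the cryptoini
linked list:

```c
#include <assert.h>

/* One mode per session, instead of inferring it from a list of
 * stacked descriptors. */
enum csp_mode {
	CSP_MODE_CIPHER,	/* cipher only */
	CSP_MODE_DIGEST,	/* hash/HMAC only */
	CSP_MODE_ETA,		/* encrypt-then-auth (IPSec-style) */
	CSP_MODE_AEAD		/* e.g. AES-GCM, one key, no separate
				   "GCM hash" algorithm */
};

struct session_params {
	enum csp_mode mode;
	int cipher_alg;		/* 0 if unused */
	int auth_alg;		/* 0 if unused */
};

/* Drivers can reject unsupported combinations at session creation. */
static int
drv_supports(const struct session_params *p)
{
	/* e.g. a driver that only does AEAD and plain ciphers */
	return (p->mode == CSP_MODE_AEAD || p->mode == CSP_MODE_CIPHER);
}
```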

- To support non-monolithic buffers from the OpenSSL engine, crypto
requests to drivers also have to support non-monolithic buffers.
This means having a notion of a buffer that may be at the start,
middle, or end of a larger transformation (e.g. for hash only the
start gets the IPAD, only the end gets the OPAD and returns a
valid hash, etc., whereas for ciphers any non-end requests would
return the IV to use for the next request).

For drivers that have buffer size limits, it would be nice to expose
those limits in the driver capabilities and depend on the upper layer
to "split" requests such as happens now for disk drivers.

- For hashing algorithms we should support a "verify" mode in addition
to the current "compute" mode. The verify mode would accept a block
of data to hash along with an expected mac and return a success
/ failure rather than a computed hash value. AES-GCM already works
this way for decryption, but this would extend that mode for other
hash algorithms (e.g. AES-CBC+SHA2-256-HMAC). Existing crypto
co-processors (e.g. ccr(4)) already support these types of requests.

Related is that we need to fix IPSec to treat EBADMSG errors from
decryption as an auth failure rather than an encryption failure (right
now AES-GCM auth failures are reported incorrectly in netstat -s
due to this).
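The verify completion path might look like the following sketch, where
the driver (or framework) compares the computed MAC against the
expected one in constant time and reports EBADMSG on mismatch, as GCM
decryption already does. mac_verify is a hypothetical name:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Constant-time MAC comparison: accumulate differences rather than
 * returning at the first mismatching byte, to avoid a timing oracle. */
static int
mac_verify(const unsigned char *computed, const unsigned char *expected,
    size_t len)
{
	unsigned char diff = 0;
	size_t i;

	for (i = 0; i < len; i++)
		diff |= computed[i] ^ expected[i];
	return (diff == 0 ? 0 : EBADMSG);
}
```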

- Sessions for a combined cipher + hash should also be tied to a
specific way of combining the algorithms. Right now you can
create a session for AES-CBC with a SHA hash and the driver has no
way to know if you are going to do encrypt-then-mac or one of the
other variants. We should include this in the session (so a given
session can only be used for one type which is normally true anyway),
and drivers can then only claim to support combinations they
support.

- The CRD_F_IV_PRESENT flag should be removed and replaced with
a CRD_F_IV_INJECT flag which means "inject the IV". Right now
the _lack_ of CRD_F_IV_PRESENT for encryption (but not decryption!)
requests means "inject the IV". It would be clearer to just have
a flag that is only set when you want the driver to take the
action.

- Speaking of IV handling, drivers have to do some extra handling for
IVs including possibly generating them. I think the idea is that
some co-processors might support generating IVs, but most of the
drivers I've looked at just end up duplicating the same block of
code to call arc4rand() for encryption requests without
CRD_F_IV_EXPLICIT. I don't believe Linux tries to support this and
instead always supplies an IV to the driver. I'd rather we do that
and only depend on a flag to indicate where the IV is (crd_iv vs
in the buffer).
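Both IV points amount to something like the sketch below: the framework
always fills in the IV before dispatch (so drivers never duplicate the
arc4rand() boilerplate), and the driver acts only on a single positive
flag. The flag name, struct fields, and stub RNG are all illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IV_LEN		16
#define CRD_F_IV_INJECT	0x01	/* driver must write IV into the buffer */

struct crypto_desc {
	int crd_flags;
	unsigned char crd_iv[IV_LEN];	/* always valid on dispatch */
};

/* Stand-in for the kernel's arc4rand(); deterministic for illustration. */
static void
stub_rng(unsigned char *p, size_t len)
{
	memset(p, 0xAB, len);
}

/* Framework side: the IV is always supplied to the driver. */
static void
framework_prepare_iv(struct crypto_desc *crd)
{
	stub_rng(crd->crd_iv, IV_LEN);
}

/* Driver side: same logic for encrypt and decrypt, acting only when
 * the flag is set, never on the absence of a flag. */
static void
driver_handle_iv(const struct crypto_desc *crd, unsigned char *buf)
{
	if (crd->crd_flags & CRD_F_IV_INJECT)
		memcpy(buf, crd->crd_iv, IV_LEN);
}
```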

- The API for copying data to/from crypto buffers is a bit obtuse and
limiting. Rather than accepting the crypto operation ('crp') as
a parameter to describe the crypto buffer, the crypto_copyback()
and crypto_copydata() functions accept various members of that
structure explicitly (e.g. crp_flags and crp_buf). However, in my
experiments with zero-copy AES-GCM via /dev/crypto and OpenSSL it
was convenient to store the AAD in a KVA buffer in the 'crp' and
the payload to transform in an array of VM pages. However, for
this model 'crp_buf' is useless. I ended up adding a wrapper API
'crypto_copyto' and 'crypto_copyfrom' which accept a 'crp' directly.
Linux's API actually passes something akin to sglist as the
description of the buffers in a crypto request.
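A sketch of the wrapper shape, where the request itself describes
where the data lives and the copy helper dispatches on that. The type
names and fields are hypothetical, and only the contiguous case is
fleshed out:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum crp_buf_type {
	CRP_BUF_CONTIG,		/* plain KVA buffer */
	CRP_BUF_MBUF,		/* mbuf chain (elided in this sketch) */
	CRP_BUF_VMPAGE		/* array of VM pages (elided) */
};

struct cryptop {
	enum crp_buf_type crp_buf_type;
	unsigned char *crp_buf;
	size_t crp_len;
};

/* Copy into the request's buffer; the caller never sees crp_flags or
 * crp_buf directly. */
static int
crypto_copyto(struct cryptop *crp, size_t off, size_t len, const void *src)
{
	switch (crp->crp_buf_type) {
	case CRP_BUF_CONTIG:
		if (off + len > crp->crp_len)
			return (-1);
		memcpy(crp->crp_buf + off, src, len);
		return (0);
	default:
		return (-1);	/* other buffer types elided */
	}
}
```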

- We need to not treat accelerated software (e.g. AES-NI) as a
hardware interface. Right now OCF's model of priorities when
trying to choose a backend driver for a session only has two
"levels", software vs. hardware, and aesni(4) (and the ARMv8 variant)
are lumped into the hardware bucket so that they have precedence
over the "dumb" software implementation. However, the accelerated
software algorithms do need some of the same support features of
the "dumb" software implementation (such as being scheduled on a
thread pool to use CPU cycles) that are not needed by other "hardware"
engines. OCF needs to understand this distinction.

- Somewhat related, we should try to use accelerated software when
possible. Right now a combined session (e.g. AES-CBC with SHA)
doesn't use AES-NI unless the CPU also supports accelerated SHA.
Ideally for this case we'd still use AES-NI for the AES portion
along with the software SHA implementation (and we'd do it in one
pass over the data rather than two when possible).

- Sometimes a crypto driver might need to defer certain requests to
software (e.g. ccr(4) has to do this for some GCM requests). In
addition, there are some other cases when we might want requests
from a single session to be sent to different backends (e.g. you
may want to use accelerated software for requests below a certain
size, and a crypto engine for larger requests. You might also want
to take NUMA into account when choosing which backend crypto engine
to dispatch a request to.) To that end, I think we want to have the
ability for a single OCF session to support multiple backend
sessions.

One use case is that if I as a driver can't handle a request I'd like
to be able to fail it with a special error code and have the crypto
layer fall back to software for me (and to use accelerated software if
possible). Right now ccr(4) duplicates the "dumb" software for GCM
requests it can't handle explicitly.

Another use case might be failover if a hardware engine experiences
a hardware failure. In theory it should be possible to fail over
to a different driver at that point including resubmitting pending
requests that weren't completed, and it should be possible (I think)
to manage this in the crypto framework rather than in consumers like
IPSec and GELI.

Load distribution among backends might be another case to consider
(e.g. GELI or ZFS encryption once that lands) if you have long-
running sessions that spawn lots of self-contained requests.

Note that if we want to spawn additional backend sessions on the fly
(e.g. only create a software fallback session on demand if a driver
fails a request with the "use software" magic error code), we will
have to keep per-session state such as keys around. We probably
already do that now, but this would definitely require doing that.
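The fallback case might dispatch roughly as sketched below. The
CRYPTO_FALLBACK code, the driver callbacks, and the request structure
are all hypothetical stand-ins for whatever the framework would
actually define:

```c
#include <assert.h>

#define CRYPTO_FALLBACK	1	/* hypothetical "retry in software" code */

struct request {
	int size;
};

/* Hardware driver punting on requests it can't handle, the way
 * ccr(4) must for some GCM requests. */
static int
hw_process(struct request *req)
{
	if (req->size > 1024)
		return (CRYPTO_FALLBACK);
	return (0);
}

/* Software backend, created on demand, handles anything. */
static int
sw_process(struct request *req)
{
	(void)req;
	return (0);
}

/* The framework, not the consumer (IPSec, GELI), owns the fallback. */
static int
crypto_dispatch(struct request *req)
{
	int error = hw_process(req);

	if (error == CRYPTO_FALLBACK)
		error = sw_process(req);
	return (error);
}
```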

One concern with some of these changes is that there are several drivers
in the tree for older hardware that I'm not sure is really used anymore.
That is an impediment to making changes to the crypto <-> driver interface
if we can't find folks willing to at least test changes to those drivers
if not maintain them.

This is all I could think of today. What do other folks think?
--
John Baldwin
Benjamin Kaduk
2018-01-11 05:56:21 UTC
Permalink
Replying mostly with my upstream OpenSSL hat on, not least to note
my regret that OpenSSL's historic poor API design choices reflect so
heavily herein...
Post by John Baldwin
While working on hooking the ccr(4) driver into our in-kernel crypto
framework (along with some out-of-tree patches to extend OpenSSL's
/dev/crypto engine to support AES-CTR/XTS/GCM and some further changes to
Note that on master/proto-1.1.1, the cryptodev engine got swapped
out for the one from https://github.com/openssl/openssl/pull/3744,
since the OpenBSD folks would not agree to relicense. (Apparently
the move to the Apache license is probably actually going to
happen.) But I'd be happy to help review patches to add additional
functionality.
Post by John Baldwin
do zero-copy), I've run into several bumps / oddities in OCF. I'm probably
going to miss several of them, but here's at least a start of a list of
things. In some cases I have some suggestions on improvements.
I will try to start with more broad / higher-level items first before
- OCF is overly flexible and overly broad. Rather than supporting
arbitrary stacking of transforms (and arbitrary depths), I think we
should probably aim to support more specific workloads and support
them well. To my mind the classes of things we should support are
- Simple block cipher requests.
- Simple "hash a buffer" requests. (Both HMAC and non-HMAC)
- IPSec-style requests (combined auth and encryption using
"encrypt-then-mac" with an optional AAD region before the
ciphertext). Note that geli requests fall into this type.
- TLS-style requests (using TLS's different methods of
combining auth and encryption methods when those are
separate)
My brain is trying to ask "is that really a good idea?" about
putting the old (bad/broken) TLS mac+encrypt schemes in the kernel,
from the vantage point of making it easier to do insecure things.
Post by John Baldwin
- Simple compression / decompression requests. While this isn't
"crypto", per se, I do think it is probably still simpler to
manage this via OCF than a completely separate interface.
Probably, though perhaps less so after the removal of arbitrary
stacking depths. And mixing compression with encryption has its own
risks, of course.
Post by John Baldwin
In terms of algorithms, I suspect there are some older algorithms
we could drop. Modern hardware doesn't offload DES for example.
Both ccr(4) and aesni(4) only support AES for encryption. We
do need to keep algorithms required for IPSec in the kernel, but
we could probably drop some others?
Yes, it's probably time for DES to go. Maybe others as well.
Post by John Baldwin
- To better support OpenSSL's engine, the /dev/crypto hash interface
should not require monolithic buffers, but support requests for
large buffers that span multiple requests (so you can do something
akin to the 'Init' / 'Update' (N times) / 'Final' model existing
software hashing APIs use). In particular, the bigger win for
hashing in hardware is when you can offload the hashing of a large
thing rather than small requests.
Something of an aside, but we are growing some support for
"one-shot" stuff as a result of Ed25519 support landing, but the
Init/Update/Final mindset is still pretty baked in, for now.
Post by John Baldwin
- To better support OpenSSL's engine, the /dev/crypto hash interface
should support "plain" hash algorithms such as SHA* without an
HMAC. By default OpenSSL's engine interface does the HMAC-specific
bits (generating pads, etc.) in software and only defers to the
engine for the raw hash (e.g. if you use the HMAC() function from
libcrypto it will only ask the engine interface for a raw hash,
not for an HMAC hash).
(https://github.com/openssl/openssl/issues/977; patches welcome for
improving the ENGINE interface)
Post by John Baldwin
- To better support OpenSSL's engine, the /dev/crypto cipher
interface should also support non-monolithic buffers. The existing
engine does this now by copying the last block of the output data
out as a saved IV to use for a subsequent request, but it might be
nicer to be more formal here and return the IV to userland for
non-"final" cipher requests.
- The interface between the crypto layer and backend drivers should
_not_ use integer session IDs. This is ridiculously dumb and
inefficient. All the drivers have silly algorithms to try to manage
growable arrays that can be indexed by the returned session ID.
Instead, drivers should be able to return a 'void *' cookie when
creating a session and get that cookie pointer as an argument to
the 'process' and 'freesession' callbacks. Imagine if vnodes used
an i-node number rather than 'v_data' and you'd have the model OCF
uses. I don't mind if we have a kind of generic 'session' structure
that we export to drivers and pass in the callbacks and the drivers
get to use a 'foo_data' member of.
- The interface to describe crypto requests needs to move away from
arbitrary linked lists of descriptors. We should just have a
single "session" structure that assumes you have one cipher and
one auth with a "mode" member to indicate the particular direction
/ combination. Likewise, the description of a request needs to
have a similar assumption. The structures used by the /dev/crypto
ioctl's are a bit closer to what I think we should use compared to
the linked-list thing we have now. Related is that we should be
able to get rid of having the three separate "algorithms" for GCM
hashes. For AES-GCM one would just say they are using AES-GCM
and both the hash/tag and ciphertext would be valid inputs / outputs
with a single key.
- To support non-monolithic buffers from the OpenSSL engine, crypto
requests to drivers also have to support non-monolithic buffers.
This means having a notion of a buffer that may be at the start,
middle, or end of a larger transformation (e.g. for hash only the
start gets the IPAD, only the end gets the OPAD and returns a
valid hash, etc., whereas for ciphers any non-end requests would
return the IV to use for the next request).
For drivers that have buffer size limits, it would be nice to expose
those limits in the driver capabilities and depend on the upper layer
to "split" requests such as happens now for disk drivers.
- For hashing algorithms we should support a "verify" mode in addition
to the current "compute" mode. The verify mode would accept a block
of data to hash along with an expected mac and return a success
/ failure rather than a computed hash value. AES-GCM already works
this way for decryption, but this would extend that mode for other
hash algorithms (e.g. AES-CBC+SHA2-256-HMAC). Existing crypto
co-processors (e.g. ccr(4)) already support these types of requests.
Related is that we need to fix IPSec to treat EBADMSG errors from
decryption as an auth failure rather than an encryption failure (right
now AES-GCM auth failures are reported incorrectly in netstat -s
due to this).
- Sessions for a combined cipher + hash should also be tied to a
specific way of combining the algorithms. Right now you can
create a session for AES-CBC with a SHA hash and the driver has no
way to know if you are going to do encrypt-then-mac or one of the
other variants. We should include this in the session (so a given
session can only be used for one type which is normally true anyway),
and drivers can then only claim to support combinations they
support.
- The CRD_F_IV_PRESENT flag should be removed and replaced with
a CRD_F_IV_INJECT flag which means "inject the IV". Right now
the _lack_ of CRD_F_IV_PRESENT for encryption (but not decryption!)
requests means "inject the IV". It would be clearer to just have
a flag that is only set when you want the driver to take the
action.
- Speaking of IV handling, drivers have to do some extra handling for
IVs including possibly generating them. I think the idea is that
some co-processors might support generating IVs, but most of the
drivers I've looked at just end up duplicating the same block of
code to call arc4rand() for encryption requests without
CRD_F_IV_EXPLICIT. I don't believe Linux tries to support this and
instead always supplies an IV to the driver. I'd rather we do that
and only depend on a flag to indicate where the IV is (crd_iv vs
in the buffer).
- The API for copying data to/from crypto buffers is a bit obtuse and
limiting. Rather than accepting the crypto operation ('crp') as
a parameter to describe the crypto buffer, the crypto_copyback()
and crypto_copydata() functions accept various members of that
structure explicitly (e.g. crp_flags and crp_buf). However, in my
experiments with zero-copy AES-GCM via /dev/crypto and OpenSSL it
was convenient to store the AAD in a KVA buffer in the 'crp' and
the payload to transform in an array of VM pages. However, for
this model 'crp_buf' is useless. I ended up adding a wrapper API
'crypto_copyto' and 'crypto_copyfrom' which accept a 'crp' directly.
Linux's API actually passes something akin to sglist as the
description of the buffers in a crypto request.
- We need to not treat accelerated software (e.g. AES-NI) as a
hardware interface. Right now OCF's model of priorities when
trying to choose a backend driver for a session only has two
"levels", software vs. hardware, and aesni(4) (and the ARMv8 variant)
are lumped into the hardware bucket so that they have precedence
over the "dumb" software implementation. However, the accelerated
software algorithms do need some of the same support features of
the "dumb" software implementation (such as being scheduled on a
thread pool to use CPU cycles) that are not needed by other "hardware"
engines. OCF needs to understand this distinction.
- Somewhat related, we should try to use accelerated software when
possible. Right now a combined session (e.g. AES-CBC with SHA)
doesn't use AES-NI unless the CPU also supports accelerated SHA.
Ideally for this case we'd still use AES-NI for the AES portion
along with the software SHA implementation (and we'd do it in one
pass over the data rather than two when possible).
- Sometimes a crypto driver might need to defer certain requests to
software (e.g. ccr(4) has to do this for some GCM requests). In
addition, there are some other cases when we might want requests
from a single session to be sent to different backends (e.g. you
may want to use accelerated software for requests below a certain
size, and a crypto engine for larger requests. You might also want
to take NUMA into account when choosing which backend crypto engine
to dispatch a request to.) To that end, I think we want to have the
ability for a single OCF session to support multiple backend
sessions.
One use case is that if I as a driver can't handle a request I'd like
to be able to fail it with a special error code and have the crypto
layer fall back to software for me (and to use accelerated software if
possible). Right now ccr(4) duplicates the "dumb" software for GCM
requests it can't handle explicitly.
Another use case might be failover if a hardware engine experiences
a hardware failure. In theory it should be possible to fail over
to a different driver at that point including resubmitting pending
requests that weren't completed, and it should be possible (I think)
to manage this in the crypto framework rather than in consumers like
IPSec and GELI.
Load distribution among backends might be another case to consider
(e.g. GELI or ZFS encryption once that lands) if you have long-
running sessions that spawn lots of self-contained requests.
Note that if we want to spawn additional backend sessions on the fly
(e.g. only create a software fallback session on demand if a driver
fails a request with the "use software" magic error code), we will
have to keep per-session state such as keys around. We probably
already do that now, but this would definitely require doing that.
One concern with some of these changes is that there are several drivers
in the tree for older hardware that I'm not sure is really used anymore.
That is an impediment to making changes to the crypto <-> driver interface
if we can't find folks willing to at least test changes to those drivers
if not maintain them.
That does seem like a relevant concern, as some of this stuff seems
pretty obscure now. I expect that some of it will have to go since
no one can be found to test it.
Post by John Baldwin
This is all I could think of today. What do other folks think?
This generally seems unobjectionable and nice to have.

-Ben
Bjoern A. Zeeb
2018-01-11 13:07:58 UTC
Permalink
Post by Benjamin Kaduk
Post by John Baldwin
In terms of algorithms, I suspect there are some older algorithms
we could drop. Modern hardware doesn't offload DES for example.
Both ccr(4) and aesni(4) only support AES for encryption. We
do need to keep algorithms required for IPSec in the kernel, but
we could probably drop some others?
Yes, it's probably time for DES to go. Maybe others as well.
There sadly is still a lot of commercial gear out there that
requires single-DES.
Post by Benjamin Kaduk
Post by John Baldwin
One concern with some of these changes is that there are several drivers
in the tree for older hardware that I'm not sure is really used anymore.
That is an impediment to making changes to the crypto <-> driver interface
if we can't find folks willing to at least test changes to those drivers
if not maintain them.
That does seem like a relevant concern, as some of this stuff seems
pretty obscure now. I expect that some of it will have to go since
no one can be found to test it.
I am sure I have old soekris boxes in use with a hifn(4) in them.

/bz
John-Mark Gurney
2018-01-15 00:06:38 UTC
Permalink
Post by Bjoern A. Zeeb
Post by Benjamin Kaduk
Post by John Baldwin
In terms of algorithms, I suspect there are some older algorithms
we could drop. Modern hardware doesn't offload DES for example.
Both ccr(4) and aesni(4) only support AES for encryption. We
do need to keep algorithms required for IPSec in the kernel, but
we could probably drop some others?
Yes, it's probably time for DES to go. Maybe others as well.
There sadly is still a lot of commercial gear out there that
requires single-DES.
Even 3DES is effectively broken:
https://sweet32.info/

and does it even make sense to support acceleration of DES? Does
anyone realistically depend upon hardware acceleration of it? I'd be
surprised if anyone does.
--
John-Mark Gurney Voice: +1 415 225 5579

"All that I will do, has been done, All that I have, has not."
John Baldwin
2018-01-11 17:41:15 UTC
Permalink
Post by Benjamin Kaduk
Replying mostly with my upstream OpenSSL hat on, not least to note
my regret that OpenSSL's historic poor API design choices reflect so
heavily herein...
Post by John Baldwin
While working on hooking the ccr(4) driver into our in-kernel crypto
framework (along with some out-of-tree patches to extend OpenSSL's
/dev/crypto engine to support AES-CTR/XTS/GCM and some further changes to
Note that on master/proto-1.1.1, the cryptodev engine got swapped
out for the one from https://github.com/openssl/openssl/pull/3744,
since the OpenBSD folks would not agree to relicense. (Apparently
the move to the Apache license is probably actually going to
happen.) But I'd be happy to help review patches to add additional
functionality.
Hmm, I tried mailing the person who worked on that new engine a while ago
to see if they were open to accepting upstream patches and never got a
reply. If you are able to work with upstream then I will talk to you more
offline about CTR/XTS/GCM support for 1.1.1.
Post by Benjamin Kaduk
Post by John Baldwin
- TLS-style requests (using TLS's different methods of
combining auth and encryption methods when those are
separate)
My brain is trying to ask "is that really a good idea?" about
putting the old (bad/broken) TLS mac+encrypt schemes in the kernel,
from the vantage point of making it easier to do insecure things.
It's more about being able to offload TLS encryption. Hardware engines
support combined auth+enc operations, so we need to have requests at
that granularity. When I last looked, OpenSSL 1.0.x at least didn't
really support CBC+SHA as a real NID (there's a special NID only used
by the userland aesni bits, but no engine implements it). Not all of
the TLS world is GCM yet, so offloading non-GCM TLS is still desirable.
My understanding is that TLS now supports IPSec-style EtM via some RFC
I can't recall, but trying to read the code in OpenSSL 1.0.x at least
I couldn't find anything that seemed to support that.
Post by Benjamin Kaduk
Post by John Baldwin
- Simple compression / decompression requests. While this isn't
"crypto", per se, I do think it is probably still simpler to
manage this via OCF than a completely separate interface.
Probably, though perhaps less so after the removal of arbitrary
stacking depths. And mixing compression with encryption has its own
risks, of course.
I think you probably wouldn't mix but would either do compression, auth,
hash, or auth+enc. NetBSD's /dev/crypto does support stacking
compression + auth + enc in a single ioctl, but it doesn't provide any
way to control the ordering so in practice I think it was just a way to
permit offloading compression alone.
Post by Benjamin Kaduk
Post by John Baldwin
In terms of algorithms, I suspect there are some older algorithms
we could drop. Modern hardware doesn't offload DES for example.
Both ccr(4) and aesni(4) only support AES for encryption. We
do need to keep algorithms required for IPSec in the kernel, but
we could probably drop some others?
Yes, it's probably time for DES to go. Maybe others as well.
Post by John Baldwin
- To better support OpenSSL's engine, the /dev/crypto hash interface
should not require monotonic buffers, but support requests for
large buffers that span multiple requests (so you can do something
akin to the 'Init' / 'Update' (N times) / 'Final' model existing
software hashing APIs use). In particular, the bigger win for
hashing in hardware is when you can offload the hashing of a large
thing rather than small requests.
Something of an aside, but we are growing some support for
"one-shot" stuff as a result of Ed25519 support landing, but the
Init/Update/Final mindset is still pretty baked in, for now.
Also, the new /dev/crypto engine for 1.1.x supports a Linux API for hashes
that assumes Init/Update/Final. My thinking is to implement the Linux API
(really just a new flag IIRC) for this so that hopefully the 1.1.x engine
doesn't need any changes to support this. It would be nice if OpenSSL
would use the HMAC NIDs for HMAC requests instead of the plain hashes
perhaps? (That seems relevant to the 977 issue you quoted and right now
FreeBSD's /dev/crypto is an existing implementation that would benefit
from HMAC using HMAC NIDs instead of digest NIDs.) I don't think it would
require a change in the engine interface, btw. I think it means that
HMAC() needs to first try to find an engine that supports the HMAC NID,
and if that fails it could then use the current code which generates PADs
in software and looks for an engine that implements the digest NID.
--
John Baldwin
John-Mark Gurney
2018-01-15 00:08:36 UTC
Permalink
Post by John Baldwin
Post by Benjamin Kaduk
Post by John Baldwin
- Simple compression / decompression requests. While this isn't
"crypto", per se, I do think it is probably still simpler to
manage this via OCF than a completely separate interface.
Probably, though perhaps less so after the removal of arbitrary
stacking depths. And mixing compression with encryption has its own
risks, of course.
I think you probably wouldn't mix but would either do compression, auth,
hash, or auth+enc. NetBSD's /dev/crypto does support stacking
compression + auth + enc in a single ioctl, but it doesn't provide any
way to control the ordering so in practice I think it was just a way to
permit offloading compression alone.
Never makes sense to do compression after enc, so it's really what order
auth and enc should happen in..
--
John-Mark Gurney Voice: +1 415 225 5579

"All that I will do, has been done, All that I have, has not."
Poul-Henning Kamp
2018-01-11 07:46:24 UTC
Permalink
--------
Post by John Baldwin
- OCF is over flexible and overly broad.
I would actually argue that it is neither, quite the contrary.

With the kernel-userland transition becoming more expensive, what
we need is a DSL where you can put entire processing steps into the
kernel, sort of like BPF but more general.

Ideally, you should be able to push something like this into
the kernel and have it executed in a single syscall:

h = hash:sha256()
b = file_buffer()
f = open("/tmp/foo", "r")
while f.read(b):
h.input(b)
return h.hex()

BPF is the existence proof that stuff like this is both
feasible and profitable, now we just need to take it to
the next level.

If we had a language like this, accept-filters wouldn't be
necessary.
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Conrad Meyer
2018-01-11 07:54:19 UTC
Permalink
Post by Poul-Henning Kamp
--------
Post by John Baldwin
- OCF is over flexible and overly broad.
I would actually argue that it is neither, quite the contrary.
With the kernel-userland transition becoming more expensive, what
we need is a DSL where you can put entire processing steps into the
kernel, sort of like BPF but more general.
Ideally, you should be able to push something like this into
the kernel and have it executed in a single syscall:
h = hash:sha256()
b = file_buffer()
f = open("/tmp/foo", "r")
while f.read(b):
    h.input(b)
return h.hex()
BPF is the existence proof that stuff like this is both
feasible and profitable, now we just need to take it to
the next level.
If we had a language like this, accept-filters wouldn't be
necessary.
Sure, that's a great idea (well, aside from introducing a large attack
surface that the Linux folks have repeatedly discovered with eBPF).
But, embedding lua or something like lua in the kernel is completely
tangential to the problem of providing a good generic interface for
crypto hardware. Please don't hijack this thread with that
discussion.

Conrad
Poul-Henning Kamp
2018-01-11 08:24:11 UTC
Permalink
--------
Post by Conrad Meyer
But, embedding lua or something like lua in the kernel is completely
tangential to the problem of providing a good generic interface for
crypto hardware. Please don't hijack this thread with that
discussion.
The problem is not the interface to the crypto hardware, but getting
the data to and from the crypto hardware in the first place.

You can either drown yourself in special cases (IPSEC, HTTPS, ...) with
a "good generic interface for crypto hardware", or you can solve the
actual problem, with a A Good Generic Interface For Data Streams.

But don't let me distract you with my experience here, I only spent
years on it...
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
John Baldwin
2018-01-11 17:44:47 UTC
Permalink
Post by Poul-Henning Kamp
--------
Post by John Baldwin
- OCF is over flexible and overly broad.
I would actually argue that it is neither, quite the contrary.
From a device driver's perspective it is overly broad. The linked-list of
descriptors in theory allows arbitrary data arrangements, but all of the
recent crypto engines I'm familiar with basically cater to the layout of
IPSec and TLS. They assume exactly one region of ciphertext with an
optional AAD region that is before (not after), etc. They don't support
arbitrary combinations of algorithms, and they make certain assumptions about
how combined auth+enc actually works.
Post by Poul-Henning Kamp
With the kernel-userland transition becoming more expensive, what
we need is a DSL where you can put entire processing steps into the
kernel, sort of like BPF but more general.
Ideally, you should be able to push something like this into
the kernel and have it executed in a single syscall:
h = hash:sha256()
b = file_buffer()
f = open("/tmp/foo", "r")
while f.read(b):
    h.input(b)
return h.hex()
BPF is the existence proof that stuff like this is both
feasible and profitable, now we just need to take it to
the next level.
If we had a language like this, accept-filters wouldn't be
necessary.
While I think this is not a bad idea, I don't think it has any bearing
on the crypto <-> driver interface which is where most of my beef lies,
but rather a different method to allow construction of in-kernel requests
to the crypto layer.
--
John Baldwin
John-Mark Gurney
2018-01-14 23:59:37 UTC
Permalink
Post by John Baldwin
While working on hooking the ccr(4) driver into our in-kernel crypto
framework (along with some out-of-tree patches to extend OpenSSL's
/dev/crypto engine to support AES-CTR/XTS/GCM and some further changes to
do zero-copy), I've run into several bumps / oddities in OCF. I'm probably
going to miss several of them, but here's at least a start of a list of
things. In some cases I have some suggestions on improvements.
I will try to start with more broad / higher-level items first before
- OCF is over flexible and overly broad. Rather than supporting
arbitrary stacking of transforms (and arbitrary depths), I think we
should probably aim to support more specific workloads and support
Many drivers don't fully support arbitrary stacking... In fact, they
will reorder to "make sense"...
Post by John Baldwin
them well. To my mind the classes of things we should support are
- Simple block cipher requests.
- Simple "hash a buffer" requests. (Both HMAC and non-HMAC)
- IPSec-style requests (combined auth and encryption using
"encrypt-then-mac" with an optional AAD region before the
ciphertext). Note that geli requests fall into this type.
- TLS-style requests (using TLS's different methods of
combining auth and encryption methods when those are
separate)
- Simple compression / decompression requests. While this isn't
"crypto", per se, I do think it is probably still simpler to
manage this via OCF than a completely separate interface.
We need to decide what we are using OCF for. Currently, due to how
slow most hardware acceleration is, it's IPsec and GELI in the kernel,
and then for embedded systems, OpenSSL for TLS acceleration...

IMO, making it 100% generic is a terrible idea, and we should only
support the above use cases... Most modern processors are faster than
most hardware acceleration, and I don't even know how many embedded
systems are using OCF from userland, as you have to configure the
system to use crypto...
Post by John Baldwin
In terms of algorithms, I suspect there are some older algorithms
we could drop. Modern hardware doesn't offload DES for example.
Both ccr(4) and aesni(4) only support AES for encryption. We
do need to keep algorithms required for IPSec in the kernel, but
we could probably drop some others?
I attempted to do this a few years back, and got significant push back...

Please see the archives...
Post by John Baldwin
- To better support OpenSSL's engine, the /dev/crypto hash interface
should not require monolithic buffers, but support requests for
large buffers that span multiple requests (so you can do something
akin to the 'Init' / 'Update' (N times) / 'Final' model existing
software hashing APIs use). In particular, the bigger win for
hashing in hardware is when you can offload the hashing of a large
thing rather than small requests.
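The Init/Update/Final model can be sketched roughly like this; a toy
FNV-1a hash stands in for a real digest, and all of the names here are
illustrative rather than an actual OCF or OpenSSL API:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Minimal sketch of a streaming (Init/Update/Final) hash.  The key
 * property is that state persists across Update calls, so a large
 * buffer can be hashed across several requests in arbitrary chunks.
 */
struct stream_hash {
	uint64_t state;
};

static void
stream_hash_init(struct stream_hash *sh)
{
	sh->state = 14695981039346656037ULL;	/* FNV-1a offset basis */
}

static void
stream_hash_update(struct stream_hash *sh, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len-- > 0) {
		sh->state ^= *p++;
		sh->state *= 1099511628211ULL;	/* FNV-1a prime */
	}
}

static uint64_t
stream_hash_final(struct stream_hash *sh)
{
	return (sh->state);
}
```

Hashing a buffer in one Update or in several must yield the same
digest; that equivalence is what lets the engine span one large buffer
over multiple /dev/crypto requests.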
- To better support OpenSSL's engine, the /dev/crypto hash interface
should support "plain" hash algorithms such as SHA* without an
HMAC. By default OpenSSL's engine interface does the HMAC-specific
bits (generating pads, etc.) in software and only defers to the
engine for the raw hash (e.g. if you use the HMAC() function from
libcrypto it will only ask the engine interface for a raw hash,
not for an HMAC hash).
Already does for MD5 and SHA1. We have not added support for SHA-2 or
SHA-3...

#define CRYPTO_MD5 13
#define CRYPTO_SHA1 14

Yes, I know crypto(7) is lacking documentation for additional modes,
but we didn't have any before I was working on it, so added what I
could...
Post by John Baldwin
- To better support OpenSSL's engine, the /dev/crypto cipher
interface should also support non-monolithic buffers. The existing
engine does this now by copying the last block of the output data
out as a saved IV to use for a subsequent request, but it might be
nicer to be more formal here and return the IV to userland for
non-"final" cipher requests.
- The interface between the crypto layer and backend drivers should
_not_ use integer session IDs. This is ridiculously dumb and
inefficient. All the drivers have silly algorithms to try to manage
growable arrays that can be indexed by the returned session ID.
Instead, drivers should be able to return a 'void *' cookie when
creating a session and get that cookie pointer as an argument to
the 'process' and 'freesession' callbacks. Imagine if vnodes used
an i-node number rather than 'v_data' and you'd have the model OCF
uses. I don't mind if we have a kind of generic 'session' structure
that we export to drivers and pass in the callbacks and the drivers
get to use a 'foo_data' member of.
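A rough sketch of what the cookie-based interface could look like; all
of the type and function names here are hypothetical, not the current
OCF API:

```c
#include <stdlib.h>

/*
 * The framework owns a generic session structure, and the driver
 * hangs its private state off a foo_data-style pointer, the way
 * vnodes use v_data.  No session-ID-indexed array needed.
 */
struct crypto_session {
	void *cs_driver_data;		/* driver's private cookie */
};

struct mydrv_session {
	int keylen;
	/* ... expanded key schedule, hardware queue handle, etc. ... */
};

static int
mydrv_newsession(struct crypto_session *cs, int keylen)
{
	struct mydrv_session *ms;

	ms = calloc(1, sizeof(*ms));
	if (ms == NULL)
		return (-1);
	ms->keylen = keylen;
	cs->cs_driver_data = ms;	/* cookie handed back to framework */
	return (0);
}

static void
mydrv_freesession(struct crypto_session *cs)
{
	free(cs->cs_driver_data);
	cs->cs_driver_data = NULL;
}
```

The 'process' callback would then get the same crypto_session pointer
back and recover its state with a single pointer dereference.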
- The interface to describe crypto requests needs to move away from
arbitrary linked lists of descriptors. We should just have a
single "session" structure that assumes you have one cipher and
one auth with a "mode" member to indicate the particular direction
/ combination. Likewise, the description of a request needs to
have a similar assumption. The structures used by the /dev/crypto
ioctl's are a bit closer to what I think we should use compared to
the linked-list thing we have now. Related is that we should be
able to get rid of having the three separate "algorithms" for GCM
hashes. For AES-GCM one would just say they are using AES-GCM
and both the hash/tag and ciphertext would be valid inputs / outputs
with a single key.
- To support non-monolithic buffers from the OpenSSL engine, crypto
requests to drivers also have to support non-monolithic buffers.
This means having a notion of a buffer that may be at the start,
middle, or end of a larger transformation (e.g. for hash only the
start gets the IPAD, only the end gets the OPAD and returns a
valid hash, etc., whereas for ciphers any non-end requests would
return the IV to use for the next request).
For drivers that have buffer size limits, it would be nice to expose
those limits in the driver capabilities and depend on the upper layer
to "split" requests such as happens now for disk drivers.
- For hashing algorithms we should support a "verify" mode in addition
to the current "compute" mode. The verify mode would accept a block
of data to hash along with an expected mac and return a success
/ failure rather than a computed hash value. AES-GCM already works
this way for decryption, but this would extend that mode for other
hash algorithms (e.g. AES-CBC+SHA2-256-HMAC). Existing crypto
co-processors (e.g. ccr(4)) already support these types of requests.
Please make sure that the compare is constant time for any verify modes.
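For reference, a constant-time compare in the style of
timingsafe_bcmp(9) looks roughly like this:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Constant-time MAC comparison: the loop always touches every byte,
 * so the run time does not leak how many leading bytes matched.
 * Returns 0 iff the buffers are equal, like timingsafe_bcmp(9).
 */
static int
ct_mac_compare(const void *a, const void *b, size_t len)
{
	const uint8_t *pa = a, *pb = b;
	uint8_t diff = 0;
	size_t i;

	for (i = 0; i < len; i++)
		diff |= pa[i] ^ pb[i];
	return (diff != 0);
}
```

An early-exit memcmp() here would let an attacker recover a valid tag
byte-by-byte from timing differences.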
Post by John Baldwin
Related is that we need to fix IPSec to treat EBADMSG errors from
decryption as auth failure rather than encryption failure (right
now AES-GCM auth failures are reported incorrectly in netstat -s
due to this).
- Sessions for a combined cipher + hash should also be tied to a
specific way of combining the algorithms. Right now you can
create a session for AES-CBC with a SHA hash and the driver has no
way to know if you are going to do encrypt-then-mac or one of the
other variants. We should include this in the session (so a given
session can only be used for one type which is normally true anyway),
and drivers can then only claim to support combinations they
support.
- The CRD_F_IV_PRESENT flag should be removed and replaced with
a CRD_F_IV_INJECT flag which means "inject the IV". Right now
the _lack_ of CRD_F_IV_PRESENT for encryption (but not decryption!)
requests means "inject the IV". It would be clearer to just have
a flag that is only set when you want the driver to take the
action.
- Speaking of IV handling, drivers have to do some extra handling for
IVs including possibly generating them. I think the idea is that
some co-processors might support generating IVs, but most of the
drivers I've looked at just end up duplicating the same block of
code to call arc4rand() for encryption requests without
CRD_F_IV_EXPLICIT. I don't believe Linux tries to support this and
instead always supplies an IV to the driver. I'd rather we do that
and only depend on a flag to indicate where the IV is (crd_iv vs
in the buffer).
- The API for copying data to/from crypto buffers is a bit obtuse and
limiting. Rather than accepting the crypto operation ('crp') as
a parameter to describe the crypto buffer, the crypto_copyback()
and crypto_copydata() functions accept various members of that
structure explicitly (e.g. crp_flags and crp_buf). However, in my
experiments with zero-copy AES-GCM via /dev/crypto and OpenSSL it
was convenient to store the AAD in a KVA buffer in the 'crp' and
the payload to transform in an array of VM pages, and for
that model 'crp_buf' is useless. I ended up adding a wrapper API
'crypto_copyto' and 'crypto_copyfrom' which accept a 'crp' directly.
Linux's API actually passes something akin to sglist as the
description of the buffers in a crypto request.
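A minimal sketch of what such a wrapper could look like; the
buffer-type enum and field names are invented for illustration and do
not match the real crp layout:

```c
#include <string.h>

/* Illustrative request that describes its own buffer layout. */
enum crp_buf_type { CRP_BUF_CONTIG, CRP_BUF_PAGES };

struct crypto_request {
	enum crp_buf_type crp_buf_type;
	void *crp_buf;		/* valid only for CRP_BUF_CONTIG */
	/* page array, sglist, etc. would back the other types */
};

/*
 * crypto_copyto-style wrapper: callers hand over the request itself,
 * so the layout dispatch lives in one place instead of every driver
 * re-deriving it from crp_flags and crp_buf.
 */
static int
crypto_copyto_sketch(struct crypto_request *crp, size_t off,
    const void *src, size_t len)
{
	switch (crp->crp_buf_type) {
	case CRP_BUF_CONTIG:
		memcpy((char *)crp->crp_buf + off, src, len);
		return (0);
	default:
		return (-1);	/* other layouts elided in this sketch */
	}
}
```

Adding a new backing layout (like the VM-page array above) then means
adding one case here rather than touching every consumer.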
- We need to not treat accelerated software (e.g. AES-NI) as a
hardware interface. Right now OCF's model of priorities when
trying to choose a backend driver for a session only has two
"levels" software vs hardware and aesni(4) (and the ARMv8 variant)
are lumped into the hardware bucket so that they have precedence
over the "dumb" software implementation. However, the accelerated
software algorithms do need some of the same support features of
the "dumb" software implementation (such as being scheduled on a
thread pool to use CPU cycles) that are not needed by other "hardware"
engines. OCF needs to understand this distinction.
- Somewhat related, we should try to use accelerated software when
possible. Currently a combined session (e.g. AES-CBC with SHA)
doesn't use AES-NI unless the CPU also supports accelerated SHA.
Ideally for this case we'd still use AES-NI for the AES portion
along with the software SHA implementation (and we'd do it in one
pass over the data rather than two when possible).
Intel has lots of assembly for combined modes, including a pipelined
mode for AES-CBC+SHA2 that allows 4 streams to be processed in
effectively the same time as one stream... Being able to make use of
these is cool, but IMO AES-GCM or AES-GCM-SIV is a better solution
than trying to shoehorn in old algorithms like this... If someone
really needs it, they can pay for it, but IMO, let's get the most bang
for the buck...
Post by John Baldwin
- Sometimes a crypto driver might need to defer certain requests to
software (e.g. ccr(4) has to do this for some GCM requests). In
addition, there are some other cases when we might want requests
from a single session to be sent to different backends (e.g. you
may want to use accelerated software for requests below a certain
size, and a crypto engine for larger requests. You might also want
to take NUMA into account when choosing which backend crypto engine
to dispatch a request to.) To that end, I think we want to have the
ability for a single OCF session to support multiple backend
sessions.
One use case is that if I as a driver can't handle a request I'd like
to be able to fail it with a special error code and have the crypto
layer fall back to software for me (and use accelerated software if
possible). Right now ccr(4) explicitly duplicates the "dumb" software
fallback for GCM requests it can't handle.
Another use case might be failover if a hardware engine experiences
a hardware failure. In theory it should be possible to fail over
to a different driver at that point including resubmitting pending
requests that weren't completed, and it should be possible (I think)
to manage this in the crypto framework rather than in consumers like
IPSec and GELI.
Load distribution among backends might be another case to consider
(e.g. GELI or ZFS encryption once that lands) if you have long-
running sessions that spawn lots of self-contained requests.
Note that if we want to spawn additional backend sessions on the fly
(e.g. only create a software fallback session on demand if a driver
fails a request with the "use software" magic error code), we will
have to keep per-session state such as keys around. We probably
already do that now, but this would definitely require doing that.
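A sketch of how that on-demand fallback dispatch could work; the error
code and every name here are made up for illustration, not proposed
API:

```c
#include <stddef.h>

/* Hypothetical "retry this request in software" error code. */
#define ERESTART_SW	1000

typedef int (*crypto_process_t)(void *session, const void *req);

struct ocf_session {
	void *hw_session;
	void *sw_session;	/* created lazily on first fallback */
	crypto_process_t hw_process;
	crypto_process_t sw_process;
};

/* Stand-in for real software session setup (needs the kept keys). */
static void *
sw_newsession(void)
{
	static int dummy;

	return (&dummy);
}

static int
ocf_dispatch(struct ocf_session *s, const void *req)
{
	int error;

	error = s->hw_process(s->hw_session, req);
	if (error != ERESTART_SW)
		return (error);
	/* Driver punted: create the software backend on demand. */
	if (s->sw_session == NULL)
		s->sw_session = sw_newsession();
	return (s->sw_process(s->sw_session, req));
}

/* Stub backends for illustration: hardware punts, software succeeds. */
static int
hw_punt(void *session, const void *req)
{
	(void)session; (void)req;
	return (ERESTART_SW);
}

static int
sw_ok(void *session, const void *req)
{
	(void)session; (void)req;
	return (0);
}
```

The same dispatch point is where size-based or NUMA-aware backend
selection could hook in, since it already sees every request.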
One concern with some of these changes is that there are several drivers
in the tree for older hardware that I'm not sure is really used anymore.
That is an impediment to making changes to the crypto <-> driver interface
if we can't find folks willing to at least test changes to those drivers
if not maintain them.
I have recently obtained a good amount of this hardware from various
donations... hifn, via padlock, and possibly ubsec, I'd have to check..

IMO, I'd like to see us deprecate most of these old drivers as they are
mostly too slow; even if you can find a system to put them in, it's
overall faster and more energy efficient to go with newer hardware and
run pure software... Again, please read the archives for more of this discussion...
Post by John Baldwin
This is all I could think of today. What do other folks think?
I'd like to see a full redesign of the system, but I also don't know
how many other third party utilities depend upon /dev/crypto that are not
in tree...

I'm willing to meet w/ people to discuss/design this...
--
John-Mark Gurney Voice: +1 415 225 5579

"All that I will do, has been done, All that I have, has not."
Emeric POUPON
2018-01-16 09:50:47 UTC
Permalink
Hello,
Post by John Baldwin
- We need to not treat accelerated software (e.g. AES-NI) as a
hardware interface. Right now OCF's model of priorities when
trying to choose a backend driver for a session only has two
"levels" software vs hardware and aesni(4) (and the ARMv8 variant)
are lumped into the hardware bucket so that they have precedence
over the "dumb" software implementation. However, the accelerated
software algorithms do need some of the same support features of
the "dumb" software implementation (such as being scheduled on a
thread pool to use CPU cycles) that are not needed by other "hardware"
engines. OCF needs to understand this distinction.
- Somewhat related, we should try to use accelerated software when
possible. Currently a combined session (e.g. AES-CBC with SHA)
doesn't use AES-NI unless the CPU also supports accelerated SHA.
Ideally for this case we'd still use AES-NI for the AES portion
along with the software SHA implementation (and we'd do it in one
pass over the data rather than two when possible).
Indeed it would make sense to extend the software driver to make use
of available software acceleration. For IPsec, this would make it
possible to accelerate the encryption part without accelerating the
authentication part, which is still a very common use case.
Actually, we have some patches to do that; maybe it would make sense
to try to distribute them? This would require quite a significant
amount of work though.
Post by John Baldwin
This is all I could think of today. What do other folks think?
Well, the batch mode and its queue is questionable. Indeed, when using
several hardware drivers, having a single process trying to dispatch the
crypto jobs to the drivers and calculating the CRYPTO_HINT_MORE flag
sounds inefficient. Maybe we would need a dedicated queue/thread per
driver if we really want the batch mode to be effective?
Furthermore, hardware drivers often already manage internal queues for
jobs. I guess the only benefit of the batch mode would be to allow a
lot of crypto requests to be queued in the framework and to keep
consumers from having to deal with the crypto requests they fail to
enqueue?
Emeric POUPON
2018-03-26 15:09:15 UTC
Permalink
Post by John Baldwin
- We need to not treat accelerated software (e.g. AES-NI) as a
hardware interface. Right now OCF's model of priorities when
trying to choose a backend driver for a session only has two
"levels" software vs hardware and aesni(4) (and the ARMv8 variant)
are lumped into the hardware bucket so that they have precedence
over the "dumb" software implementation. However, the accelerated
software algorithms do need some of the same support features of
the "dumb" software implementation (such as being scheduled on a
thread pool to use CPU cycles) that are not needed by other "hardware"
engines. OCF needs to understand this distinction.
Hello,

As you already mentioned before, we cannot benefit from aesni(4) with IPSec on Intel platforms, except when using AEAD algorithms like GCM.

Maybe we should not expose accelerated software (e.g. AESNI) as crypto drivers. Plugging them directly into cryptosoft brings some benefits:
- no duplicate code about crypto session handling,
- partially accelerated crypto (e.g. aesni for AES-CBC, software for SHA256; useful for IPSec),
- possible use of the 'async' mode to process crypto jobs using a thread pool.

Actually that's something we already did using straightforward internal
patches. Now I would like to know what you think about this idea, and
how you would suggest achieving it (using a new framework?).

Regards,

Emeric
