rfc2045 — RFC 2045 (MIME) parsing library
#include <rfc822.h>
#include <rfc2045.h>

g++ ... -lrfc2045 -lrfc822 -lcourier-unicode
The rfc2045 library parses MIME-formatted messages. It is used to:
1) Parse the structure of a MIME-formatted message.
2) Examine the contents of each MIME section.
3) Optionally rewrite and reformat the message.
#include <rfc2045.h>

rfc2045::entity entity;

std::istreambuf_iterator<char> b{input_stream}, e;

rfc2045::entity::line_iter<false> parser{b, e};

entity.parse(parser);
The rfc2045::entity object represents a MIME
object or entity. It's created from a message that's defined by a
beginning iterator and an ending iterator for the message's contents.
The iterators are used to create a
rfc2045::entity::line_iter template instance.
Its template parameter specifies whether the message uses
LF (false) or CRLF (true)
line endings. If these iterators are passed by reference to the
constructor, they must exist until the message is fully parsed.
After parse() returns, a successfully parsed
message leaves the beginning iterator advanced to the ending
iterator's value. If these iterators are passed by value to the
constructor, they must be copyable, and they are copied into the
parser object.
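The by-reference behavior described above can be illustrated without the library. The following is a minimal sketch that uses only the standard library. count_lf_lines and consumed_everything are hypothetical helpers invented for this illustration and are not part of rfc2045. The beginning iterator is taken by reference, so the caller can check afterwards that it reached the ending iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper, not part of rfc2045: consumes [b, e) and counts
// LF-terminated lines. A final line without a trailing LF is counted too.
// "b" is taken by reference, so the caller sees how far it advanced.
template<typename Iter>
std::size_t count_lf_lines(Iter &b, Iter e)
{
    std::size_t lines = 0;
    bool partial_line = false;

    for (; b != e; ++b)
    {
        if (*b == '\n')
        {
            ++lines;
            partial_line = false;
        }
        else
        {
            partial_line = true;
        }
    }
    return lines + (partial_line ? 1 : 0);
}

// Returns true if the beginning iterator reached the ending iterator,
// that is, if the entire stream was consumed.
inline bool consumed_everything(std::istream &input_stream, std::size_t &lines)
{
    std::istreambuf_iterator<char> b{input_stream}, e;

    lines = count_lf_lines(b, e);
    return b == e;
}
```

The same check (comparing the beginning iterator to the ending iterator after the call) applies to any consumer that takes its iterators by reference.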
rfc2045::entity_parser<false> parser;

parser.parse(std::istreambuf_iterator<char>{input_stream},
             std::istreambuf_iterator<char>{});

rfc2045::entity e=parser.parsed_entity();
rfc2045::entity_parser is an alternative,
push-based approach for creating an entity object. Its
parse() method gets repeatedly invoked with
pairs of beginning and ending iterator values that incrementally
define the contents of the MIME message. Afterwards,
parsed_entity() returns the parsed MIME entity
object, at which point the entity parser object is no longer usable
and can only be destroyed.
rfc2045::entity_parser uses a separate
execution thread for creating the new MIME entity object. Each
call to parse() copies the entire sequence of
characters from the beginning/ending iterator pair into an internal
buffer that the background execution thread digests. Use reasonably
sized sequences: while the main execution thread assembles the
next chunk, the background execution thread processes the previous one.
Although the iterator pair can be anything that meets the
definition of a beginning and an ending iterator,
several rfc2045::entity methods require the
std::streambuf from which the MIME entity
was constructed. In the case of a pair of
std::istreambuf_iterators, the
original input stream's rdbuf() meets this
criterion.
std::string charset=rfc2045::default_charset;
rfc2045::default_charset gives the default MIME
character set, initially set to “utf-8”.
The library uses default_charset when the MIME message does
not specify a character set. There's rarely a need to change
it.
rfc2045::entity entity;

for (auto &subentity:entity.subentities)
{
    // ...
    rfc2045::entity *ptr=subentity.get_parent_entity();
}

rfc2045::entity::errors_t code=entity.errors.code;

code=entity.all_errors();

std::vector<std::string> messages=entity.errors.describe();
The rfc2045::entity object has many members,
only some of which are publicly documented. An entity may have sub-entities.
get_parent_entity returns a sub-entity's
parent entity (or a nullptr in the case of a top
level MIME entity).
Various errors that occurred while parsing the MIME entity are collected
into an error code, which is a bitmask.
all_errors() returns a combined bitmask from
the MIME entity and all of its subentities.
See the rfc2045.h header file for a complete list
of parsing errors.
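The way a per-entity bitmask combines across a tree of sub-entities can be sketched as follows. This is an illustration only: the node type and the err_* names and values are made up for this example, and the real error codes are those listed in rfc2045.h.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only. The real error codes are
// listed in rfc2045.h; these names and values are made up.
enum node_errors : unsigned {
    err_none         = 0,
    err_bad_header   = 1u << 0,
    err_bad_boundary = 1u << 1,
    err_too_deep     = 1u << 2,
};

struct node {
    unsigned errors = err_none;     // errors found in this node only
    std::vector<node> children;     // sub-nodes, analogous to sub-entities

    // OR together this node's errors and those of every descendant.
    unsigned all_errors() const
    {
        unsigned combined = errors;

        for (const auto &child : children)
            combined |= child.all_errors();
        return combined;
    }
};
```

Testing an individual condition is then a bitwise AND against the combined value, for example all_errors() & err_too_deep.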
size_t startpos=entity.startpos,
    startbody=entity.startbody,
    endbody=entity.endbody,
    nlines=entity.nlines,
    nbodylines=entity.nbodylines;

rfc2231::header content_type=entity.content_type;

std::string mime_type=content_type.value;

for (auto &[paramname, paramvalue]:content_type.parameters)
{
    size_t index=paramvalue.index;
    std::string value=paramvalue.value,
        charset=paramvalue.charset,
        language=paramvalue.language;

    value=paramvalue.value_in_charset();
    value=paramvalue.value_in_charset("iso-8859-1");
}

std::string_view charset=entity.content_type_charset();

rfc2045::cte content_transfer_encoding=entity.content_transfer_encoding;
The following rfc2045::entity class members
define the position of each MIME entity in its character sequence:
startpos
This is the starting position of the MIME entity's headers. The top level MIME entity's starting position is always 0.
startbody
This is the starting position of the body portion of the MIME entity.
endbody
This is the one-past-the-end position of the body portion of the MIME entity.
nlines and nbodylines give
the number of lines in the entire MIME entity (headers and body) and
in just the MIME entity's body portion, respectively.
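Because startbody and endbody form a half-open range, the body can be read back from the original std::streambuf with a seek followed by a bounded read. The following is a minimal sketch under that assumption. read_range is a hypothetical helper, not a library function, and it works with any seekable streambuf.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical helper, not part of rfc2045: reads the half-open byte range
// [startbody, endbody) from a seekable streambuf. Returns an empty string
// if the seek fails, and a shorter string if the streambuf ends early.
inline std::string read_range(std::streambuf &buf,
                              std::size_t startbody,
                              std::size_t endbody)
{
    assert(startbody <= endbody);   // endbody is one past the end

    const std::streambuf::pos_type failed{std::streambuf::off_type(-1)};

    if (buf.pubseekpos(static_cast<std::streamoff>(startbody),
                       std::ios_base::in) == failed)
        return {};

    std::string body(endbody - startbody, '\0');
    std::streamsize got =
        buf.sgetn(body.data(), static_cast<std::streamsize>(body.size()));

    body.resize(static_cast<std::size_t>(got));
    return body;
}
```

With a parsed entity, the two offsets would come from its startbody and endbody members, and the streambuf would be the input stream's rdbuf().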
content_type is an object that has the parsed
contents of the “Content-Type” header. The
rfc2231::header class has two members:
value
The header's value (the part before the first semicolon).
parameters
The header's parameters (if any).
parameters is an associative container. The
key is the parameter name. The container's value has the following
members.
index
This member counts each parameter, in the order of its
appearance in the header. MIME parameters are parsed according
to the rules in RFC 2231, so a parameter that is split into
multiple parts gets combined into a single parameter and value.
It is unspecified which part's original index represents the
parameter.
value
This is the value of the parameter. If the parameter's value was split into parts using RFC 2231, this is the reassembled value.
charset
This is the value character set, as specified
in an RFC 2231-encoded parameter. charset
defaults to “utf-8” if unspecified.
language
This is the RFC 2231-encoded value's
language. language is an empty string if
it's unspecified or if the parameter wasn't encoded using
RFC 2231.
value_in_charset()
This method returns the value converted to
rfc2045::default_charset.
value_in_charset(charset)
This method returns the value converted to
the specified character set.
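For readers unfamiliar with RFC 2231, an encoded parameter value has the form charset'language'percent-encoded-text, which is where the charset and language members come from. The following is a simplified sketch of decoding a single such value. ext_value, percent_decode and parse_ext_value are hypothetical names for this illustration. The sketch does not handle continuations (name*0*, name*1*, ...) or character set conversion, and it is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type; the field names mirror the members described
// above, but this is not the library's class.
struct ext_value {
    std::string charset;
    std::string language;
    std::string value;      // percent-decoded bytes, still in "charset"
};

// Decode %XX escapes. Malformed escapes are copied through unchanged.
inline std::string percent_decode(std::string_view in)
{
    auto hex = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };

    std::string out;

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '%' && i + 2 < in.size())
        {
            int hi = hex(in[i + 1]), lo = hex(in[i + 2]);

            if (hi >= 0 && lo >= 0)
            {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}

// Parse the RFC 2231 form  charset'language'percent-encoded-text.
// Returns std::nullopt if the two apostrophes are missing.
inline std::optional<ext_value> parse_ext_value(std::string_view raw)
{
    auto q1 = raw.find('\'');
    if (q1 == raw.npos) return std::nullopt;

    auto q2 = raw.find('\'', q1 + 1);
    if (q2 == raw.npos) return std::nullopt;

    return ext_value{
        std::string{raw.substr(0, q1)},
        std::string{raw.substr(q1 + 1, q2 - q1 - 1)},
        percent_decode(raw.substr(q2 + 1))
    };
}
```

The example value from RFC 2231, us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A, decodes to the charset "us-ascii", the language "en-us", and the text "This is ***fun***".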
content_type_charset() is a convenient shortcut
for returning the charset parameter of the MIME entity's
“Content-Type” header.
content_transfer_encoding is one of the following
values that reflects the MIME entity's encoding:
rfc2045::cte::sevenbit
rfc2045::cte::eightbit
rfc2045::cte::qp
rfc2045::cte::base64
rfc2045::cte::eightbit is also used for the
rare “binary” encoding. A
rfc2045::cte::error value indicates an invalid
encoding.
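The mapping described above, including the folding of “binary” into the 8-bit value, can be sketched as a small classifier. The cte enumeration below is a local illustration that borrows the enumerator names from the text. It is not the library's rfc2045::cte, and classify_cte is a hypothetical function.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

// Local illustration only; mirrors the enumerators named above but is not
// the library's rfc2045::cte.
enum class cte { error, sevenbit, eightbit, qp, base64 };

// Hypothetical classifier for a Content-Transfer-Encoding header value.
// The comparison is case-insensitive, as MIME requires.
inline cte classify_cte(std::string_view header_value)
{
    std::string v{header_value};

    std::transform(v.begin(), v.end(), v.begin(),
                   [](unsigned char c)
                   { return static_cast<char>(std::tolower(c)); });

    if (v == "7bit") return cte::sevenbit;
    if (v == "8bit" || v == "binary") return cte::eightbit;
    if (v == "quoted-printable") return cte::qp;
    if (v == "base64") return cte::base64;
    return cte::error;      // anything else is an invalid encoding
}
```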
std::string content_id=entity.content_id;
std::string content_disposition=entity.content_disposition;
std::string content_description=entity.content_description;
std::string content_base=entity.content_base;
std::string content_location=entity.content_location;
std::string content_md5=entity.content_md5;
std::string content_language=entity.content_language;
These class members provide the contents of the corresponding MIME
headers, if they exist. Notably, content_disposition
can be used to instantiate an rfc2231::header
object in order to parse this header.
rfc2045::mime_decoder decoder{
    [] (const char *bytes, size_t bytecnt)
    {
    },
    *input_stream.rdbuf(),
    "utf-8"
};

rfc2045::mime_unicode_decoder decoder{
    [] (const char32_t *bytes, size_t bytecnt)
    {
    },
    *input_stream.rdbuf()
};

decoder.decode_header=true;
decoder.decode_body=true;
decoder.add_eol=false;
decoder.header_name_lc=true;
decoder.header_name_suppress=false;
decoder.decode_subentities=true;
decoder.headerfilter=[](std::string_view name, std::string_view content) -> bool
{
    return true;
};
decoder.headerdone=[](std::string_view name)
{
};

// ...

decoder.decode<false>(entity);
rfc2045::mime_decoder and
rfc2045::mime_unicode_decoder extract the
contents of the headers and/or the body portion of a MIME entity.
Extraction involves:
Decoding RFC 2047-encoded headers, and decoding IDN-encoded domain names.
Unfolding headers that are folded across multiple lines.
Decoding the MIME entity's body transfer encoding, and optionally
converting it to a specific character set
if the optional third parameter is passed to
rfc2045::mime_decoder's constructor.
If it is omitted, no character set mapping takes place.
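Of the steps listed above, header unfolding is simple enough to show in isolation. The following is a minimal sketch of the general rule from RFC 5322: a line break immediately followed by a space or a tab is removed, and the whitespace is kept. unfold is a hypothetical helper for this illustration and is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: unfold a header by removing each line break that is
// immediately followed by a space or a tab (RFC 5322, section 2.2.3).
// Handles both LF and CRLF line endings; the whitespace itself is kept.
inline std::string unfold(std::string_view header)
{
    std::string out;
    out.reserve(header.size());

    for (std::size_t i = 0; i < header.size(); ++i)
    {
        std::size_t eol = 0;    // length of the line break at i, if any

        if (header[i] == '\n')
            eol = 1;
        else if (header[i] == '\r' && i + 1 < header.size() &&
                 header[i + 1] == '\n')
            eol = 2;

        if (eol && i + eol < header.size() &&
            (header[i + eol] == ' ' || header[i + eol] == '\t'))
        {
            i += eol - 1;       // skip the line break, keep the whitespace
            continue;
        }
        out.push_back(header[i]);
    }
    return out;
}
```

A line break that is not followed by whitespace ends the header and is left in place.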
The first constructor parameter is a callable object, the output
sink. The output sink gets
repeatedly invoked from decode() with the contents
of the MIME entity's header and/or body, in the original or
mapped character set (rfc2045::mime_decoder), or
as Unicode characters
(rfc2045::mime_unicode_decoder).
decode()'s template parameter must match the
rfc2045::entity::line_iter template
parameter that was used to create the MIME entity object.
rfc2045::mime_decoder and
rfc2045::mime_unicode_decoder are actually
templates. Their template parameters are deduced from their
constructors' parameters.
The following class members are available to be set prior to
calling decode():
decode_header
(default: true) Whether the MIME entity's headers should be decoded.
decode_body
(default: true) Whether the MIME entity's body should be decoded.
decode_subentities
(default: true) Whether to decode, recursively, the MIME entity's subentities.
add_eol
(default: false) Whether to include an extra newline after decoding each MIME entity.
header_name_lc
(default: true) Whether to convert the name of each decoded header to lowercase.
header_name_suppress
(default: false) Whether to include only the contents of each header, omitting the header's name (possibly converted to lowercase) and the colon that separates the header's name from its contents.
headerfilter
(default: [](std::string_view, std::string_view){ return true; })
This is a callable object that's called before extracting each
header. Returning true includes the header in the decoded
content given to the output sink.
Together with header_name_suppress,
this provides a targeted means of extracting the decoded contents
of specific headers only.
headerdone
(default: [](std::string_view) {}) This is a callable object that's called after extracting each header.
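How headerfilter and header_name_suppress combine to extract a single header's contents can be modeled with the standard library alone. The sketch below is a hypothetical model, not the library's decoder. header_options and emit_headers are invented names, and two details are assumptions: that the filter sees the name after lowercasing, and that a header with its name is emitted as "name: content".

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <functional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical model of the options described above; not the library's
// decoder class.
struct header_options {
    bool header_name_lc = true;
    bool header_name_suppress = false;
    std::function<bool(std::string_view, std::string_view)> headerfilter =
        [](std::string_view, std::string_view) { return true; };
};

// Feed each accepted header to "sink", formatted according to "opts".
// Assumption: the filter receives the (possibly lowercased) name.
inline void emit_headers(
    const std::vector<std::pair<std::string, std::string>> &headers,
    const header_options &opts,
    const std::function<void(std::string_view)> &sink)
{
    for (const auto &[name, content] : headers)
    {
        std::string n = name;

        if (opts.header_name_lc)
            std::transform(n.begin(), n.end(), n.begin(),
                           [](unsigned char c)
                           { return static_cast<char>(std::tolower(c)); });

        if (!opts.headerfilter(n, content))
            continue;       // filtered out, nothing reaches the sink

        std::string line =
            opts.header_name_suppress ? content : n + ": " + content;
        sink(line);
    }
}
```

Setting header_name_suppress to true and returning true from the filter only for "subject" delivers just the contents of the Subject header to the sink.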
After parsing an rfc2045::entity, it can be
rewritten in order to convert 8-bit-encoded data to 7-bit encoding,
or to convert 7-bit encoded data to full 8-bit data, if possible.
if (entity.autoconvert_check(rfc2045::convert::standardize))
{
    rfc2045::entity::autoconvert_meta metadata;

    metadata.appid="courier";

    rfc2045::entity::line_iter<false>::autoconvert(
        entity,
        [] (const char *ptr, size_t l)
        {
            // ...
        },
        input_streambuf,
        metadata);
}
autoconvert_check returns
true
if the MIME entity must be rewritten in order to comply with the
requested format. A false return indicates that
the MIME entity already complies.
The subsequent call to autoconvert() effects the
rewrite (its template parameter must match the one that was used to
parse the original MIME entity). autoconvert()'s
parameters are: the MIME entity to rewrite, a callable object that
gets repeatedly invoked with the contents of the rewritten MIME
entity, a std::streambuf-like object that
corresponds to the current, parsed, MIME entity, and an
rfc2045::entity::autoconvert_meta object.
autoconvert() can only be called after a prior
autoconvert_check(), which defines
autoconvert()'s marching orders:
rfc2045::convert::standardize
Do not change the content encoding of the MIME object, only add default values for any missing “Content-Type” and “Content-Transfer-Encoding” headers.
rfc2045::convert::sevenbit
Reencode 8bit content with
“quoted-printable”. Also reencode any
7bit content that has excessively long lines.
rfc2045::convert::8bit
Replace “quoted-printable” with
8bit unless doing so will produce
excessively long lines.
rfc2045::convert::8bit_always
Always replace “quoted-printable” with
8bit even if doing so will produce
excessively long lines.
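The two conditions that drive these modes, the presence of 8-bit bytes and the presence of excessively long lines, can each be detected in a single pass over the content. The following is a minimal sketch. scan_content and needs_quoted_printable are hypothetical helpers, and the 998-character limit is an assumption taken from RFC 5322 because this page does not state the library's own threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Assumption: 998 is the maximum line length allowed by RFC 5322, not
// counting the line terminator. The library's own threshold may differ.
constexpr std::size_t max_line_length = 998;

struct content_scan {
    bool has_8bit = false;       // some byte has its high bit set
    bool has_long_line = false;  // some line exceeds the limit
};

// Hypothetical helper: scan LF-terminated content in one pass.
inline content_scan scan_content(std::string_view body,
                                 std::size_t limit = max_line_length)
{
    content_scan result;
    std::size_t line_length = 0;

    for (unsigned char c : body)
    {
        if (c == '\n')
        {
            line_length = 0;
            continue;
        }
        if (c & 0x80)
            result.has_8bit = true;
        if (++line_length > limit)
            result.has_long_line = true;
    }
    return result;
}

// Under the "sevenbit" rule described above, content must be reencoded
// if it is 8-bit or has an excessively long line.
inline bool needs_quoted_printable(const content_scan &scan)
{
    return scan.has_8bit || scan.has_long_line;
}
```

The same scan result could inform the opposite decision: decoded quoted-printable content whose has_long_line flag is set is the case where the plain 8bit mode leaves the encoding alone.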