PLCrashReporter Road Map

We're extremely excited about PLCrashReporter's future as a sustainable open source project. In planning our road map for the coming years, we've focused on a few goals that we believe are imperative to growing PLCrashReporter's value:

  • Increase the scope and depth of useful data gathered in our crash reports, while strictly maintaining user privacy guarantees.

  • Expand the user base of the library to ensure the continued health of the project.
  • Work with implementors of managed runtimes (such as Xamarin, Unity3d, RubyMotion, and RoboVM) to improve compatibility between PLCrashReporter and their managed runtimes (some of which already leverage PLCrashReporter).
  • Maintain our focus on reliability, introducing technical solutions to provide even stronger reliability guarantees as the scope and complexity of the library continue to grow.
  • Improve usage and integration documentation, targeted at both 3rd party integrators and application developers.

We firmly believe that complex, nuts and bolts development tools – like PLCrashReporter – benefit greatly from the open source, liberally licensed development model, and look forward to continuing to develop PLCrashReporter as an independent, self-sustaining open-source project. In that regard, we send our sincere thanks to the PLCrashReporter Consortium Members and Application Developer Support Members who make this possible.

While this document does provide our official roadmap, plans may change as we discuss priorities and implementation with the community, and the extent of work that may be accomplished depends on available sponsorship. As releases are more firmly scheduled, we'll add links into the bug tracker, where scheduling and project management are handled.

Additionally, we'll be adding more road map items to cover documentation efforts as we have time. If you're interested in contributing to PLCrashReporter's documentation, please say "Hello" on the project mailing list.

Supporting PLCrashReporter

This year's road map is ambitious, and we plan to further refine and extend it given feedback from the community. The scope (and speed) at which we can execute these tasks depends in large part on the contributions of the application developers and platform vendors that make up our community.

If you're an application developer that uses PLCrashReporter — or any service based on PLCrashReporter — we provide hands-on support via the Application Developer Annual Maintenance Subscription (or "ADAMS"). We can provide assistance with interpreting crash reports, including direct, hands-on source and assembly-level inspection and debugging of your applications (and system frameworks!). If you have a mystery bug that you absolutely cannot track down, or even reproduce locally, we can help.

Companies deploying a crash reporting platform or product with PLCrashReporter may wish to consider joining the PLCrashReporter Consortium as inexpensive insurance for the future health of the project. The Consortium membership fee is a fraction of the cost of devoting internal staff to support, maintain, and improve PLCrashReporter, and provides significant enterprise support benefits in exchange for your funding of the project's development.

Release Roadmap

PLCrashReporter 1.2.1 (Maintenance Release)

This is scheduled as a minor maintenance release, providing small improvements to reliability, and fixes for any bugs that may be reported in 1.2.

Concurrent Fault Safety

Add async-safe locking in the crash reporter path to allow only one crash reporter to run at a time. This will lay the groundwork for full concurrent fault reporting; the eventual aim is to support writing multiple reports should multiple threads fail concurrently. While unusual, concurrent failure is more likely in highly concurrent applications, and preserving all reports can provide additional debugging insight.

PLCR-521 - Concurrent Fault Safety Open
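
As a rough illustration of the single-reporter lock described above, here is a minimal sketch in C11. It assumes a try-lock is sufficient (a second faulting thread backs off instead of waiting); the names reporter_try_enter and reporter_leave are hypothetical and are not part of PLCrashReporter's API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical process-wide flag: set while a crash reporter is running. */
static atomic_flag reporter_busy = ATOMIC_FLAG_INIT;

/* Try to become the single active crash reporter.
 * atomic_flag is the one type C11 guarantees to be lock-free, and this call
 * neither blocks nor allocates, so it can be used on the crash path. */
static bool reporter_try_enter(void) {
    return !atomic_flag_test_and_set_explicit(&reporter_busy,
                                              memory_order_acquire);
}

/* Allow another reporter to run (only relevant if the process survives). */
static void reporter_leave(void) {
    atomic_flag_clear_explicit(&reporter_busy, memory_order_release);
}
```

A handler would call reporter_try_enter() first and skip report writing when it returns false. Preserving the second thread's report is the separate Concurrent Fault Reporting work.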

Fix for -Wshorten-64-to-32

There are a number of innocuous -Wshorten-64-to-32 warnings in 1.2 that appeared late in the development cycle due to improvements to clang's warnings. These have been investigated and a patch has been generated; the patch requires review, and the temporary addition of -Wno-shorten-64-to-32 should be removed.

PLCR-496 - Re-enable -Wno-shorten-64-to-32 for arm64 Open
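
For readers unfamiliar with the warning: -Wshorten-64-to-32 flags implicit conversions from a 64-bit integer to a 32-bit one. The sketch below shows one common way to make such a conversion explicit and checked. It is an illustration only, not the patch referred to above, and narrow_u64_to_u32 is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Narrow a 64-bit value to 32 bits, refusing values that would not fit.
 * The explicit cast documents the intent and satisfies the warning; the
 * range check keeps the conversion from silently truncating. */
static bool narrow_u64_to_u32(uint64_t value, uint32_t *out) {
    if (value > UINT32_MAX) {
        return false;
    }
    *out = (uint32_t)value;
    return true;
}
```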

PLCrashReporter 1.3 (Feature Release)

This release is targeted at reliability and groundwork necessary to achieve our longer-term 2014 goals. Specifically, we're focused on finishing outstanding functionality work started during the 1.3 development cycle, rolling out reliability improvements enabled by these new features, and restructuring the project to best support our targeted functionality.

Objective-C Exception Chaining

In PLCrashReporter 1.2, we implemented support for chaining of Mach exception handlers and BSD signal handlers. At crash time, after the report is written, the original signal or exception data is dispatched to any previously registered exception handler(s).

However, support for chaining of Objective-C exceptions was not implemented at that time; this is a small feature, but is necessary to support Multiple Crash Reporters.
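
The BSD signal handler chaining described above can be sketched with plain POSIX sigaction(). This is a simplified illustration and not PLCrashReporter's implementation: it assumes a single signal, uses SIGUSR1 so the demo survives, and every function name in it is hypothetical. Mach exception chaining works differently and is not shown.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static struct sigaction previous_action;          /* the handler we displaced */
static volatile sig_atomic_t report_written = 0;  /* stand-in for a written report */
static volatile sig_atomic_t legacy_called = 0;   /* set by the demo's older handler */

/* Forward the original signal data to whatever was registered before us. */
static void chain_to_previous(int signo, siginfo_t *info, void *uap) {
    if (previous_action.sa_flags & SA_SIGINFO) {
        if (previous_action.sa_sigaction != NULL)
            previous_action.sa_sigaction(signo, info, uap);
    } else if (previous_action.sa_handler != SIG_DFL &&
               previous_action.sa_handler != SIG_IGN) {
        previous_action.sa_handler(signo);
    }
}

static void crash_handler(int signo, siginfo_t *info, void *uap) {
    report_written = 1;                   /* 1. write the report (elided here) */
    chain_to_previous(signo, info, uap);  /* 2. dispatch to the older handler */
}

/* Install our handler and remember the one it replaces. */
static int install_crash_handler(int signo) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = crash_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, &previous_action);
}

/* Demo only: an older, plain (non-SA_SIGINFO) handler. */
static void legacy_handler(int signo) {
    (void)signo;
    legacy_called = 1;
}

/* Returns 1 if both our handler and the previously registered one ran. */
static int demo_chain(void) {
    struct sigaction legacy;
    memset(&legacy, 0, sizeof(legacy));
    legacy.sa_handler = legacy_handler;
    sigemptyset(&legacy.sa_mask);
    if (sigaction(SIGUSR1, &legacy, NULL) != 0) return 0;
    if (install_crash_handler(SIGUSR1) != 0) return 0;
    raise(SIGUSR1);
    return report_written && legacy_called;
}
```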

Multiple Crash Reporters

With support for Objective-C Exception Chaining, we can fully support the instantiation of multiple instances of PLCrashReporter within a single process. This is a necessary step towards supporting interoperability with other instances of PLCrashReporter, other crash reporting libraries, and perhaps more importantly, with managed runtimes such as Xamarin, Unity3d, RubyMotion, and RoboVM.

Double Fault Handling

In PLCrashReporter 1.2, we laid the initial groundwork for chaining of exception handlers. In 1.3, the goal is to leverage this work to implement support for fail-safe double-fault handling. In the event that the PLCrashReporter reporting process itself crashes, we will run a double-fault crash reporter in 'safe mode', providing the absolute minimum of reporting data to assist in diagnosing the failure while maintaining a low risk of re-triggering the failure.

In the case of a triple-fault, we will terminate and hand reporting over to the host system, attempting to maintain the original process failure state to the maximum degree possible.
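
One way to picture the escalation from normal reporting to 'safe mode' to hand-off is a fault-depth counter consulted on every handler entry. The sketch below assumes exactly that design; the enum and function names are hypothetical, and the real implementation may track state differently.

```c
#include <assert.h>
#include <stdatomic.h>

typedef enum {
    REPORT_FULL,       /* first fault: write the normal report */
    REPORT_SAFE_MODE,  /* the reporter itself crashed: minimal report only */
    REPORT_HAND_OFF    /* triple fault: stop and defer to the host system */
} report_mode_t;

/* Number of times the crash handler has been entered in this process. */
static atomic_int fault_depth = 0;

/* Called at the top of the crash handler; lock-free and allocation-free. */
static report_mode_t next_report_mode(void) {
    int depth = atomic_fetch_add(&fault_depth, 1);
    if (depth == 0) return REPORT_FULL;
    if (depth == 1) return REPORT_SAFE_MODE;
    return REPORT_HAND_OFF;
}
```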

Async-Safe Locked Allocator

Currently, PLCrashReporter performs all async-safe allocations on a malloc'd heap. As part of the 1.3 release, we plan to finish the implementation of our async-safe allocator, with support for:

  • Guard pages at the top and bottom of the page range.
  • Locking the entirety of the allocated pages during normal process operation.
  • Statistics tracking, to be used to analyze PLCrashReporter's memory requirements, and provide fixed bounds on the amount of heap used.
  • Possible support for growing the heap during async-safe runtime.
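
A minimal sketch of the allocator features listed above, under several assumptions: a POSIX system with mmap() and mprotect(), a simple bump allocator with no free(), single-threaded use, and "locking" interpreted as revoking access to the pages while the process runs normally. All names are hypothetical and the planned allocator may differ substantially.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

typedef struct {
    uint8_t *base;       /* start of the mapping, including the low guard page */
    size_t   page_size;
    size_t   usable;     /* bytes available between the two guard pages */
    size_t   used;       /* statistic: bytes handed out so far */
} guarded_heap_t;

/* Map `pages` usable pages surrounded by one inaccessible guard page each. */
static int heap_init(guarded_heap_t *h, size_t pages) {
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0 || pages == 0) return -1;
    size_t total = (pages + 2) * (size_t)ps;
    void *m = mmap(NULL, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED) return -1;
    h->base = m;
    h->page_size = (size_t)ps;
    h->usable = pages * (size_t)ps;
    h->used = 0;
    if (mprotect(h->base, h->page_size, PROT_NONE) != 0 ||
        mprotect(h->base + h->page_size + h->usable, h->page_size, PROT_NONE) != 0) {
        munmap(m, total);
        return -1;
    }
    return 0;
}

/* Bump allocation, 16-byte aligned; returns NULL when the fixed bound is hit. */
static void *heap_alloc(guarded_heap_t *h, size_t size) {
    size_t aligned = (size + 15u) & ~(size_t)15u;
    if (aligned < size || aligned > h->usable - h->used) return NULL;
    void *p = h->base + h->page_size + h->used;
    h->used += aligned;
    return p;
}

/* Lock (no access) or unlock (read/write) the usable pages. */
static int heap_set_locked(guarded_heap_t *h, int locked) {
    return mprotect(h->base + h->page_size, h->usable,
                    locked ? PROT_NONE : (PROT_READ | PROT_WRITE));
}
```

In this sketch the heap would stay locked during normal operation and be unlocked on entry to the crash handler, so a stray write from the host process faults instead of corrupting reporter state.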

This work provides a number of immediate reliability benefits:

  • Eliminate the likelihood of our state being corrupted during host process failure
  • Remove any risk of a stack overflow occurring in the report paths themselves due to heavy stack allocations, and
  • Provide more concrete insight into the reporter's crash time resource requirements.

In addition, this will lay the groundwork for moving to explicit reference counting of PLCrashCore allocations, which will:

  • Simplify APIs that have been modeled so as to avoid heap allocations
  • Remove the 'double-pass' report writing implementation, where we currently perform one pass to determine the output buffer size, and then a second pass to perform the actual write.
  • Significantly improve performance, as we can safely cache parsed DWARF, Mach-O, Obj-C, and other metadata.
  • Maintain reliability by avoiding complex ownership semantics that would be necessary without reference counting. 

Mac OS X - Out of Process Execution

On iOS, we're currently hindered by our inability to fork a subprocess; this has resulted in an enormous amount of work to maintain async-safety in an increasingly complex system. On Mac OS X, however, we can spawn a separate process for crash handling, which:

  • Increases reliability by running out-of-process
  • Allows us to display a crash reporting dialog immediately instead of on next launch
  • Prepares us for a possible future when fork()/posix_spawn() is permitted on iOS (we can dream).

As part of our work in PLCrashReporter 1.2, we adopted the use of fully process-space agnostic memory-mapped APIs; the only remaining work necessary for out-of-process execution is providing:

  • Support for a task_t reference to an embedded binary. 
  • A framework-embedded binary that will monitor the target process's Mach exceptions, collect a report, and display user-supplied crash UI on demand.
  • A refactor of the PLCrashReporter API into a class cluster composed of PLCrashReporterLocal and PLCrashReporterRemote concrete implementations.

Finish C++ Migration

The current PLCrashCore APIs are implemented in both C and C++. Our goal here is to unify style, naming, and implementation language. This clean-up is necessary to simplify future implementation work, including the refactors planned for Async-Safe Reference Counting, and more importantly, the enhanced process introspection work scheduled for PLCrashReporter 1.4.

Multiple Libraries

The 1.3 release will be the first to break the existing monolithic library out into smaller, distinct libraries; we will continue to support the existing CrashReporter.framework, vending the currently supported API/ABI indefinitely.

In doing so, we hope to improve the usability of the library for both new and existing use-cases. Specifically, our aims are to:

  • Encourage use in projects outside the realm of application-level crash reporting, where async-safe frame unwinding, exception handling, DWARF support, and other tools are often required. The advantages of extending the utility of our library include:
    • Shaking out any latent bugs by expanding use cases and total installed user base.
    • Building a broader community to keep PLCrashReporter development healthy and encourage additional contributions of both code and ideas.
  • Decrease the total library size for integrators that do not require all features (such as client-side report parsing). This translates to smaller application binaries, which is never a bad thing.
  • Lay the groundwork for platform portability of the core library by separating the highly platform-specific code from the cross-platform core (such as the DWARF APIs). Eventually, we hope to leverage portability to broaden our community base – and bring with it all the advantages of additional users and testers – by supporting additional mainstream platforms.

The current implementation plan calls for the introduction of the following new frameworks:

  • PLCrashCore.framework:
    • This library will contain core functionality of use in crash reporters, language runtimes, and other low-level systems development:
      • Thread state manipulation
      • Stack Unwinder
      • DWARF and Compact Unwind Parsers
      • Tracking of loaded binary images
      • Mach-O and Objective-C Parsers
      • High-level exception APIs
        • Mach Exceptions
        • BSD Signals
        • Objective-C Exceptions
        • C++ Exceptions
      • Async-safe Utility Library
        • Async-safe allocator
        • Async-safe data structures
    • C++, with an eye towards future cross-platform support.
    • The API and ABI of this library would be considered permanently unstable simply due to the level of access it provides to PLCrashReporter's internals; it will be the integrator's responsibility to track updates to this library as necessary.
    • Possibility of an API/ABI stable C library interface, for 1.3 or a future release.
  • PLCrashReport.framework:
    • This library will contain the Objective-C and internal C APIs to support client-side parsing of crash reports, eg:
      • PLCrashReport and the related model objects
    • Objective-C, with an eye towards providing a cross-platform parsing API.
    • The API and ABI of this library will continue to be stable/supported (these APIs are already vended publicly)
  • PLCrashReporter.framework:
    • Will serve as a replacement to the legacy CrashReporter.framework
    • This library will contain almost all of the supported high-level Objective-C APIs and types currently shipped in CrashReporter.framework:
      • PLCrashReporter: The high-level per-process crash reporter API
      • PLCrashReporterConfig: Crash reporter configuration API
      • PLCrashReporterCallbacks: Post-crash callback registration.
    • Additional internal APIs will be maintained in this library:
      • Process state introspection (eg, process name, start time, trace flags).
    • The following APIs will not be vended by this library:
      • PLCrashReport and all report-parsing APIs: Implementors requiring client-side report parsing may either link against PLCrashReport directly, or use the legacy CrashReporter.framework, which will continue to vend these APIs.
      • PLCrashProcessInfo and PLCrashHostInfoVersion: This process introspection support may be vended in PLCrashCore, but these will not be vended from PLCrashReporter.framework

PLCrashReporter 1.4 (Feature Release)

In the 1.4 release, our primary aim is to increase the scope, depth, and accuracy of the data gathered in our crash reports, without compromising our approach to using only async-safe, supported API and ABI.

Crash Report Format v2.0

In the 5 years since we first designed the protobuf-based crash report format, it has held up surprisingly well. We've been able to avoid any breaking changes and maintain backwards compatibility for the lifetime of the project. However, experience has provided a number of insights into ways we can improve the format, and additionally, we have a number of features slated that will require significant changes to the current v1 format.

Rather than incur the complexity (and necessary data duplication) of attempting to shoe-horn these features into the existing v1 format, we're making a clean break, designing a new format that should last for the next 5 years. Our current plan is to continue to use protobuf; however, we welcome community feedback on the serialization used. Bear in mind that the format must be easily generated from within async-safe code paths, that binary formats will be much more compact, and that the use of an IDL (such as with protobuf) allows us to easily generate parsing code for almost any language or runtime.

The following features are currently scheduled for inclusion in the V2 format:

  • Transactional report format to allow for best-effort collection and writing of crash report data.
    • Reporting tasks will be ordered by the importance of the data they generate.
    • Each task will write a single atomic entry to the report file, containing:
      • An initial length field
      • A checksum for the written data (eg, CRC32?)
      • The complete set of data generated by that task, in a format that can be used in isolation from the data written by any later task.
    • Failure will leave the remainder of the crash report unaffected.
    • Consumers must be able to read and validate any valid blocks, while skipping (and reporting on) invalid data blocks.
  • Support for inclusion of multiple reports. This is necessary for Concurrent Fault Reporting.
  • Multiple crash types and associated metadata; not all crashes provide BSD or Mach exception data:
    • A report generated due to a fatal signal or machine exception.
    • A report generated due to an unhandled language-level exception.
    • A report generated due to a process deadlock and/or watchdog event.
    • A report generated on a running process, where no crash occurred, based on user request.
  • Multiple encodings for exception type-specific metadata:
    • Mach exceptions, which do not map directly to POSIX signals
    • POSIX signals
    • Language-level exceptions (ObjC, C++)
    • Extensible to allow defining additional exception types later, without breaking parsers.
  • Improved register encodings
    • Compact and well-defined register type encodings for each supported architecture. We currently encode registers with string names, which increases the size of the resultant file, and requires consumers of the API to be aware of our string mappings. Instead, we should implement architecture-specific message types that encompass their machine state. This decreases the total size of the report, and eliminates a source of ambiguity for server-side processors when interpreting register data.
    • Support for inclusion of additional, interpreted data, such as the referenced Objective-C SEL.
  • Support for attaching arbitrary user data:
    • Opaque data blocks
    • Key/value pairs

This list will almost certainly change as work on this feature commences.
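
To make the transactional entry layout above more concrete, here is a small sketch of length-prefixed, checksummed records and a reader that counts valid and corrupt entries. The byte layout (little-endian 32-bit length, then CRC32, then payload) and all function names are assumptions for illustration; the V2 format itself has not been defined, and the list above only suggests CRC32 as one possible checksum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); no lookup table needed. */
static uint32_t crc32_bytes(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static void put_u32(uint8_t *out, uint32_t v) {
    out[0] = (uint8_t)v;
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}

static uint32_t get_u32(const uint8_t *in) {
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
           ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}

/* Append one entry at `off`; returns bytes written, or 0 if it does not fit. */
static size_t record_append(uint8_t *buf, size_t cap, size_t off,
                            const void *payload, uint32_t len) {
    if (off > cap || cap - off < 8 || cap - off - 8 < len) return 0;
    put_u32(buf + off, len);
    put_u32(buf + off + 4, crc32_bytes((const uint8_t *)payload, len));
    memcpy(buf + off + 8, payload, len);
    return 8u + (size_t)len;
}

/* Walk all entries: count those whose checksum matches and those that are
 * corrupt, skipping over the latter. Stops at a truncated trailing entry. */
static void record_scan(const uint8_t *buf, size_t size,
                        size_t *valid, size_t *invalid) {
    size_t off = 0;
    *valid = 0;
    *invalid = 0;
    while (size - off >= 8) {
        uint32_t len = get_u32(buf + off);
        if (len > size - off - 8) break;
        if (crc32_bytes(buf + off + 8, len) == get_u32(buf + off + 4))
            (*valid)++;
        else
            (*invalid)++;
        off += 8u + (size_t)len;
    }
}
```

Note that skipping a corrupt entry relies on its length field being intact. A production format would need some way to resynchronize when the length itself is damaged.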

Async-Safe Refcounting

Leveraging the async-safe allocator work done in PLCrashReporter 1.3, we'll migrate the code base to use reference counting for tracking object ownership. This will significantly accelerate the additional refactors to core components that are listed below.
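
For illustration, intrusive reference counting with C11 atomics might look like the sketch below. It assumes objects embed their own count and destructor; the type and function names are hypothetical, and in practice the destructor would return memory to the async-safe allocator and not to malloc.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct refcounted {
    atomic_uint refs;                          /* number of owners */
    void (*destroy)(struct refcounted *self);  /* run when the last owner lets go */
} refcounted_t;

/* A new object starts with a single owner. */
static void ref_init(refcounted_t *obj, void (*destroy)(refcounted_t *)) {
    atomic_init(&obj->refs, 1u);
    obj->destroy = destroy;
}

static refcounted_t *ref_retain(refcounted_t *obj) {
    atomic_fetch_add_explicit(&obj->refs, 1u, memory_order_relaxed);
    return obj;
}

/* Drop one reference; the previous value 1 means we were the last owner. */
static void ref_release(refcounted_t *obj) {
    if (atomic_fetch_sub_explicit(&obj->refs, 1u, memory_order_acq_rel) == 1u &&
        obj->destroy != NULL) {
        obj->destroy(obj);
    }
}

/* Demo only: count how many times an object was destroyed. */
static int destroyed_count = 0;
static void count_destroy(refcounted_t *obj) {
    (void)obj;
    destroyed_count++;
}
```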

Crash Report V2 Writer

The migration to the V2 report format will require:

  • Refactor of our PLCrashLogWriter API
  • A new PLCrashReport API (implemented along-side the previous API).

Specifically, in PLCrashLogWriter, we'll need to:

  • Leverage the async-safe allocator work done in PLCrashReporter 1.3, as well as the refcounting work scheduled for this release, to implement single-pass atomic writing of the crash report file.
  • Investigate replacing protobuf-c with a simplified async-safe runtime, to allow us to automatically generate serialization code (which will reduce the human-written LoC, and simplify later additions to the format).
  • Update the crash data models to represent the newly defined meta-data; exception types, registers, arbitrary user data, etc.
  • Break the process into multiple prioritized tasks, each atomically committing an update to the transactional report format.
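
The last item, prioritized tasks that each commit independently, can be sketched as follows. This assumes a fixed task table sorted once ahead of time and a convention that a task returns 0 when its entry was committed; all names are hypothetical, and the demo tasks append a single character where a real task would write a report entry.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*report_task_fn)(void *ctx);  /* returns 0 if the entry was committed */

typedef struct {
    int            priority;  /* lower value = more important data */
    report_task_fn run;
} report_task_t;

/* Insertion sort by priority: no allocation, suitable for small fixed tables. */
static void tasks_sort(report_task_t *tasks, size_t count) {
    for (size_t i = 1; i < count; i++) {
        report_task_t key = tasks[i];
        size_t j = i;
        while (j > 0 && tasks[j - 1].priority > key.priority) {
            tasks[j] = tasks[j - 1];
            j--;
        }
        tasks[j] = key;
    }
}

/* Run every task in order; one task failing does not stop the later ones. */
static size_t tasks_run(const report_task_t *tasks, size_t count, void *ctx) {
    size_t committed = 0;
    for (size_t i = 0; i < count; i++) {
        if (tasks[i].run(ctx) == 0) committed++;
    }
    return committed;
}

/* Demo only: a tiny "report" that records which tasks committed. */
typedef struct { char log[8]; size_t len; } demo_report_t;

static int demo_append(demo_report_t *r, char tag) {
    if (r->len + 1 >= sizeof(r->log)) return -1;
    r->log[r->len++] = tag;
    r->log[r->len] = '\0';
    return 0;
}

static int task_signal_info(void *ctx)  { return demo_append(ctx, 'S'); }
static int task_thread_state(void *ctx) { (void)ctx; return -1; /* simulated failure */ }
static int task_image_list(void *ctx)   { return demo_append(ctx, 'I'); }
```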

The new PLCrashReport parser API will leverage the library split implemented in the PLCrashReporter 1.3 release, with the PLCrashReport API additions provided in the new PLCrashReport.framework. Depending on user demand, we may add support for converting a V2 crash report to a V1-compatible format. This conversion will necessarily be lossy, but it will provide integrators with more time to adopt the new format.

Concurrent Fault Reporting

PLCR-522 - Concurrent Fault Reporting Open

Leveraging the Concurrent Fault Safety work and the new Crash Report V2 Writer, this feature will add support for preserving all crash reports that are generated concurrently at crash time.

PLCrashAsyncLog.framework

Implemented as an optional, standalone framework, this API will provide high-performance fully async-safe logging, with integration into the crash reporter itself via the support for inclusion of arbitrary user data. This solves two problems, one for users, and one for us:

  • It is too difficult for users to safely include logging data in crash reports. None of the general logging APIs that we are aware of are async-safe, which is an absolute requirement for safe inclusion in the report.
  • This will allow us to log PLCF_DEBUG() log messages in crash reports, which may provide additional insight into in-the-field integration issues that have hitherto been opaque to us.
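
As a sketch of what async-safe logging involves, the example below appends messages to a fixed, statically allocated buffer, using an atomic compare-and-swap loop to reserve space. It uses no malloc, locks, or stdio. It assumes messages are dropped once the buffer is full and that atomic size_t operations are lock-free on the target; the names are hypothetical and this is not the planned PLCrashAsyncLog API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define LOG_CAPACITY 4096u

static char log_buf[LOG_CAPACITY];  /* static storage: nothing to allocate */
static atomic_size_t log_tail;      /* number of bytes reserved so far */

/* Append `msg` plus a newline. Returns 0 on success, -1 if the log is full. */
static int async_log(const char *msg) {
    size_t len = 0;
    while (msg[len] != '\0') len++;  /* hand-rolled length: no libc calls */
    size_t total = len + 1;

    /* Reserve [start, start + total) atomically so writers never overlap. */
    size_t start = atomic_load_explicit(&log_tail, memory_order_relaxed);
    do {
        if (total > LOG_CAPACITY - start) return -1;
    } while (!atomic_compare_exchange_weak_explicit(
                 &log_tail, &start, start + total,
                 memory_order_acq_rel, memory_order_relaxed));

    for (size_t i = 0; i < len; i++) log_buf[start + i] = msg[i];
    log_buf[start + len] = '\n';
    return 0;
}

/* Expose the reserved bytes, e.g. for attachment to a report as user data. */
static size_t async_log_snapshot(const char **out) {
    *out = log_buf;
    return atomic_load_explicit(&log_tail, memory_order_acquire);
}
```

A writer interrupted between reserving space and copying its bytes leaves a partially filled region, so a snapshot taken at crash time should be treated as best-effort data.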

Register State: All Threads, All Frames

With the introduction of DWARF and compact unwinding support, we have the ability to fetch and preserve not just the current register state for all threads, but also the non-volatile register state for all frames, providing unparalleled introspection into process state on platforms that make significant use of non-volatile registers.

When coupled with the V2 report format's efficient register encoding, it becomes viable to record and preserve this data; the support already exists in the stack frame unwinder, and we'll merely need to wire this into the new PLCrashLogWriter API.
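
To illustrate why a positional, architecture-specific register encoding is smaller than name/value pairs, the sketch below compares the two for a hypothetical five-register subset of x86-64. The name-based size uses a simplified stand-in (one length byte, the name, an 8-byte value) and is not the actual v1 protobuf wire format; all identifiers are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of x86-64 registers, identified purely by position. */
enum { X64_RAX, X64_RBX, X64_RSP, X64_RBP, X64_RIP, X64_REG_COUNT };

static const char *const x64_names[X64_REG_COUNT] = {
    "rax", "rbx", "rsp", "rbp", "rip"
};

/* Name-based stand-in: length byte + name + 8-byte value per register. */
static size_t encoded_size_named(void) {
    size_t total = 0;
    for (size_t r = 0; r < (size_t)X64_REG_COUNT; r++)
        total += 1 + strlen(x64_names[r]) + sizeof(uint64_t);
    return total;
}

/* Positional encoding: values only, little-endian, in enum order. */
static size_t encode_fixed(const uint64_t regs[X64_REG_COUNT], uint8_t *out) {
    size_t off = 0;
    for (size_t r = 0; r < (size_t)X64_REG_COUNT; r++)
        for (unsigned shift = 0; shift < 64; shift += 8)
            out[off++] = (uint8_t)(regs[r] >> shift);
    return off;
}

/* The consumer needs no string table: the index alone identifies a register. */
static uint64_t decode_fixed(const uint8_t *in, size_t reg) {
    uint64_t value = 0;
    for (unsigned i = 0; i < 8; i++)
        value |= (uint64_t)in[reg * 8 + i] << (8 * i);
    return value;
}
```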

Register State Introspection

Implementation of (optional) register state introspection, subject to the user privacy requirements. Leveraging our pre-existing async-safe Mach-O and Objective-C metadata parsers, as well as the new V2 report format's support for including additional register data, this can provide:

  • Visibility into Mach-O/Objective-C data, including:
    • SEL values
    • Symbol names
    • ???

The full range of what we can safely gather will require further analysis.

PLCrashReporter 2.0 (Feature Release - Tentative)

The feature plan for PLCrashReporter 2.0 depends a great deal on the accuracy of the proposed schedule, what we learn during the design and deployment of earlier releases, as well as ongoing feedback from the community.

A tentative feature-list includes:

  • First-class integration/compatibility with major managed runtimes.
    • This will require coordination with the projects in question; the crash reporter must know whether an exception can be (or was) handled by the managed runtime, as writing a report may not be necessary (eg, managed runtimes leverage signals or Mach exceptions to handle non-fatal errors, such as a safe NULL dereference which is then translated to a NullPointerException).
    • We may need to drive towards some agreed-upon standards on the interactions between managed runtimes and crash reporting.
  • Async-safe disassemblers and heuristics
    • Architecture-specific unwinding via instruction heuristics for cases where no frame pointer, DWARF, or compact unwind data is available.
    • Provide better crash data by performing basic analysis of the faulting instruction. For example, we could provide the actual faulting address for a GP fault on x86-64/i386.
  • Plugin support to allow for safe handling of more complex/custom client-side data introspection and state modification. This would require:
    • Exposing the PLCrashLogWriter APIs for registering logging tasks.
    • Prioritizing the tasks appropriately (eg, after the known-good internal tasks).
    • Clients would need access to the non-API-stable PLCrashCore APIs.
    • Well-defined callback hooks for vending crashed process state to the plugins.
  • Initial work on portability, with the aim of expanding our target audience to additional mobile, desktop, and server platforms:
    • A portable PLCrashCore (POSIX-ish systems such as Android and Linux, Mac OS X, etc)
      • Replace direct task_t and thread_t types with a generic thread/task API.
      • Implement build system support for the target systems.
    • Experimental first-class Android NDK Support
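
The managed runtime coordination mentioned in the first item above could take the shape of a hook that the runtime registers and the crash reporter consults before writing a report. The sketch below is purely hypothetical: no such interface exists or has been agreed upon, and the demo hook's policy (claiming SIGSEGV below address 4096 as a NULL dereference) is an invented example.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook: returns true if the runtime will handle this fault itself. */
typedef bool (*runtime_fault_hook_fn)(int signo, uintptr_t fault_addr);

typedef enum { FAULT_LEAVE_TO_RUNTIME, FAULT_WRITE_REPORT } fault_action_t;

static runtime_fault_hook_fn runtime_hook = NULL;

/* Called once by the managed runtime during start-up. */
static void register_runtime_hook(runtime_fault_hook_fn hook) {
    runtime_hook = hook;
}

/* Called by the crash reporter before it commits to writing a report. */
static fault_action_t decide_fault_action(int signo, uintptr_t fault_addr) {
    if (runtime_hook != NULL && runtime_hook(signo, fault_addr))
        return FAULT_LEAVE_TO_RUNTIME;
    return FAULT_WRITE_REPORT;
}

/* Demo only: a runtime that turns NULL-page SIGSEGVs into managed exceptions. */
static bool demo_runtime_hook(int signo, uintptr_t fault_addr) {
    return signo == SIGSEGV && fault_addr < 4096u;
}
```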
