Experimental ScanCode Integration

Starting with version 1.4.0, Solicitor can be integrated with ScanCode to include detailed information gathered from ScanCode's "deep license scan". This information includes detected licenses, copyrights and NOTICE files.

The current integration with ScanCode is experimental: the ScanCode parameters used, the interfacing and curation logic, and all parts of the data persistence are experimental and might therefore produce results of insufficient quality. The current workflow and implementation are subject to change in future versions without further notice.

General workflow

The general workflow when integrating with ScanCode consists of the following 3 steps:

  1. Execute Solicitor in the "classic" way, i.e. based only on the data provided via the Readers as described in Reading License Information with Readers. Besides the normal reports/documents, this run also generates scripts for downloading the required OSS source code and running ScanCode.

  2. Download the source code and run ScanCode by executing the generated scripts. The downloaded sources and the ScanCode results are saved to a directory tree in the local filesystem.

  3. Execute Solicitor a second time. For all ApplicationComponents for which ScanCode information is available (stored in the local directory tree), the license data obtained from the Readers is replaced by this information. The data model is enriched with the copyright and notice file information that was found. Reports (see Reporting and Creating output documents) are now based on the ScanCode data (where available).

Prerequisites

Bash

The scripts generated by Solicitor to download sources and run ScanCode use Bash syntax. Either run them on a system that provides Bash natively (e.g. Linux) or, on Windows, install an appropriate environment (e.g. Git Bash).

ScanCode

Download and install ScanCode from https://github.com/nexB/scancode-toolkit/releases. Make sure that the executable is included in the search PATH for executables.
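
Before starting a scan that may run for hours, it can be worth confirming the PATH requirement. The following is a minimal sketch, assuming the executable is named scancode; the helper name find_executable is hypothetical and chosen only for this illustration.

```python
import shutil
from typing import Optional


def find_executable(name: str = "scancode") -> Optional[str]:
    """Return the path of `name` if it is found on the search PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("scancode was not found on the PATH; the generated scripts would fail")
    else:
        print(f"scancode found at {path}")
```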

Activate feature

As the ScanCode integration is still experimental, it is currently deactivated by default. To enable it, set the system property solicitor.feature-flag.scancode=true. (See Built in Default Properties for information on how to do so.) If this feature flag is not activated, Solicitor will not attempt to read ScanCode information from the local file system.
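
Java system properties can be passed as -D options on the JVM command line, and such options have to appear before -jar. The sketch below only assembles a command line of that shape. The jar name solicitor.jar, the -c option and the config location file:solicitor.cfg are assumptions made for illustration and must be adapted to your installation; only the property name is taken from this guide.

```python
from typing import List

FEATURE_FLAG = "-Dsolicitor.feature-flag.scancode=true"


def build_solicitor_command(enable_scancode: bool,
                            jar: str = "solicitor.jar",          # assumed jar name
                            config: str = "file:solicitor.cfg",  # assumed config location
                            ) -> List[str]:
    """Assemble a JVM command line; -D options are placed before -jar."""
    cmd = ["java"]
    if enable_scancode:
        cmd.append(FEATURE_FLAG)
    cmd += ["-jar", jar, "-c", config]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_solicitor_command(enable_scancode=True)))
```

The resulting list could be handed to subprocess.run(..., check=True) to start the run from a script.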

Detailed workflow

Solicitor 1st run

Execute Solicitor in the classic way. As part of the report creation step, this generates two scripts:

  • output/scancode_PROJECTNAME.sh (for downloading the sources, also calls scancodeScan.sh)

  • output/scancodeScan.sh (for running ScanCode on the downloaded sources)

The scripts include all ApplicationComponents except those whose normalizedLicenseType is set to COMMERCIAL.

Download Sources and run ScanCode

Change to the directory output and execute sh scancode_PROJECTNAME.sh. This downloads all sources and processes them with ScanCode, which might take several hours to complete. Results are stored in subdirectory Source of the directory output and are organized in a tree structure given by the PackageURL of the ApplicationComponents.
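
To find the results for a single component, it helps to see how a PackageURL might translate into a path in this tree. The sketch below is an assumption inferred from the example path pkg/npm/@somescope/somepackage/1.2.3 that appears in the curations file section; the function name purl_to_tree_path and the mapping rules are illustrative and not taken from Solicitor's implementation, which may treat other package types differently.

```python
from urllib.parse import unquote


def purl_to_tree_path(purl: str) -> str:
    """Map a PackageURL to a relative directory path (illustrative mapping only)."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a PackageURL: {purl!r}")
    # Drop an optional subpath ('#...') and optional qualifiers ('?...').
    body = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0]
    # The version follows the last '@'. An '@' that starts a path segment
    # (an unencoded npm scope) is not a version separator.
    name_part, sep, version = body.rpartition("@")
    if not sep or name_part.endswith("/"):
        name_part, version = body, ""
    segments = [unquote(s) for s in name_part.split("/") if s]
    if version:
        segments.append(unquote(version))
    return "/".join(["pkg"] + segments)


print(purl_to_tree_path("pkg:npm/%40somescope/somepackage@1.2.3"))
# prints: pkg/npm/@somescope/somepackage/1.2.3
```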

Solicitor 2nd run

Execute Solicitor a second time. After reading the component/license information from the Readers (but before starting the rule engine), Solicitor will try to look up ScanCode information from the directory tree in output/Sources for all processed ApplicationComponents. If information is found for an ApplicationComponent, the following is done:

  • License information (including URL of license text) as obtained from the Readers is replaced by the license info found by ScanCode

  • Copyrights are taken from ScanCode results

  • Info on NOTICE file is taken from the ScanCode results

  • If the ScanCode results contain information about a project URL then this is stored as ossHomepage
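
The replacement steps listed above can be pictured roughly as follows. This is a simplified sketch and not Solicitor's actual data model: the classes Component and ScanResult and all of their field names are hypothetical stand-ins for an ApplicationComponent and the ScanCode findings stored for it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScanResult:  # hypothetical container for the ScanCode findings of one package
    licenses: List[str]
    copyrights: List[str]
    notice_file: Optional[str] = None
    project_url: Optional[str] = None


@dataclass
class Component:  # hypothetical stand-in for an ApplicationComponent
    name: str
    licenses: List[str]  # initially filled from the Readers
    copyrights: List[str] = field(default_factory=list)
    notice_file: Optional[str] = None
    oss_homepage: Optional[str] = None


def apply_scan_result(component: Component, scan: Optional[ScanResult]) -> Component:
    """Overlay ScanCode findings on a component; without findings nothing changes."""
    if scan is None:
        return component  # no ScanCode data found: keep the Reader data
    component.licenses = list(scan.licenses)      # replaces the Reader licenses
    component.copyrights = list(scan.copyrights)
    component.notice_file = scan.notice_file
    if scan.project_url:                          # only set when ScanCode reports a URL
        component.oss_homepage = scan.project_url
    return component
```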

Output

The main target of the additional information obtained from ScanCode is currently the new report Attributions_PROJECTNAME.html, which lists

  • all ApplicationComponents (excluding those which are not OSS licensed)

  • with all found copyrights

  • and all licenses

  • including all different license texts

  • and contents of all found NOTICE files

Correcting data

The data obtained from ScanCode might be affected by false positives (a license or copyright was wrongly detected) or false negatives (a license or copyright was not detected). There are two mechanisms to compensate for such defects: applying curation information from a "curations" file, or changing the license information via the decision table rules.

Curations file

To define curations, you can create a file output/curations.yaml with the following structure:

artifacts:
  - name: pkg/npm/@somescope/somepackage/1.2.3                  (1)
    url: https://github.com/foo/bar                             (2)
    licenses:                                                   (3)
      - license: MIT                                            (4)
        url: https://raw.githubusercontent.com/foo/bar/LICENSE  (5)
    copyrights:                                                 (6)
      - (c) 2021 Donald Duck                                    (7)
      - "(c) 2019 Mickey Mouse <http://mickey.mouse>"           (8)
  - name: pkg/npm/@anotherscope/anotherpackage/4.5.6            (9)
.
.
.
1 Path of the package information as used in the file tree. Derived from the PackageURL.
2 URL of the project, will be stored as ossHomepage. (Optional: no change if not existing.)
3 Licenses to set. Optional. If defined then all found licenses will be replaced by the list of licenses given here.
4 SPDX identifier of license.
5 URL pointing to license text.
6 Copyrights to set. Optional. If defined then all found copyrights will be replaced by the list of copyrights given here.
7 A single copyright.
8 Another copyright. Note that due to YAML syntax any string containing : needs to be enclosed in quotation marks.
9 Further packages to follow.

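
The replacement semantics described in the callouts can be sketched as follows. This is an illustration under stated assumptions, not Solicitor's implementation: the function apply_curation is hypothetical, the curation entry is given as a plain dictionary mirroring one artifacts entry of the YAML file, and the component data is represented as a dictionary with the assumed keys ossHomepage, licenses and copyrights.

```python
from typing import Any, Dict


def apply_curation(data: Dict[str, Any], curation: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `data` with the optional parts of `curation` applied."""
    result = dict(data)
    if "url" in curation:          # optional: no change if not present
        result["ossHomepage"] = curation["url"]
    if "licenses" in curation:     # if defined, replaces ALL found licenses
        result["licenses"] = [dict(entry) for entry in curation["licenses"]]
    if "copyrights" in curation:   # if defined, replaces ALL found copyrights
        result["copyrights"] = list(curation["copyrights"])
    return result


scanned = {
    "ossHomepage": None,
    "licenses": [{"license": "GPL-2.0-only", "url": None}],  # assumed false positive
    "copyrights": ["(c) 2021 Donald Duck"],
}
curation = {
    "name": "pkg/npm/@somescope/somepackage/1.2.3",
    "licenses": [{"license": "MIT",
                  "url": "https://raw.githubusercontent.com/foo/bar/LICENSE"}],
}
curated = apply_curation(scanned, curation)
print(curated["licenses"][0]["license"])  # prints: MIT
print(curated["copyrights"])              # unchanged, as no copyrights were curated
```
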
Decision table rules

As with license information obtained from the Readers, the license information from ScanCode can also be altered using decision table rules. A new attribute origin was introduced in the RawLicense entity, along with a corresponding condition field in the decision table LicenseAssignmentV2*.xls/csv. The origin attribute in RawLicense contains either the string scancode, if the license information came from ScanCode, or the (lowercase) class name of the Reader used.

Using the Extended comparison syntax it is possible to qualify whether a rule should apply for licenses found by ScanCode or not:

Value of condition Origin   Rule applies for …
scancode                    … licenses obtained from ScanCode information
NOT:scancode                … licenses obtained from normal Readers
(empty)                     … in both cases
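
The three cases in the table can be expressed as a small predicate. This is a minimal sketch that covers only the forms shown here (a literal value, the NOT: prefix and an empty condition); it does not model the rest of the Extended comparison syntax, and the function name origin_rule_applies is hypothetical.

```python
def origin_rule_applies(condition: str, origin: str) -> bool:
    """Decide whether a rule's Origin condition matches a RawLicense origin."""
    condition = condition.strip()
    if condition == "":
        return True                                   # empty: applies in both cases
    if condition.startswith("NOT:"):
        return origin != condition[len("NOT:"):]      # e.g. NOT:scancode
    return origin == condition                        # e.g. scancode


# "somereader" is a made-up placeholder for a lowercase Reader class name.
for cond in ("scancode", "NOT:scancode", ""):
    print(repr(cond), origin_rule_applies(cond, "scancode"),
          origin_rule_applies(cond, "somereader"))
```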

Last updated 2022-11-30 15:22:16 UTC