Experimental Scancode Integration

Starting from version 1.4.0 Solicitor can be integrated with the tool ScanCode to include detailed information gathered from the "deep license scan" performed by ScanCode. This includes detected Licenses, Copyrights and Notice-Files.

The current integration with ScanCode is experimental: The used ScanCode parameters, interfacing and curations logic and all parts of the data persistence are experimental and thus might result in insufficient quality of results. The current workflow and implementation is subject to change in future versions without further notice.

General workflow

The general workflow when integrating with ScanCode consists of the following 3 steps:

  1. Execute Solicitor in a "classic" way i.e. just based on the data provided via the Readers as described in Reading License Information with Readers. Besides the normal reports/documents generated this will also create scripts for downloading the needed OSS source codes and run Scancode.

  2. Download source codes and run ScanCode by executing the generated scripts. The downloaded sources and ScanCode results will be saved to a directory tree in the local filesystem.

  3. Execute Solicitor a second time. For all ApplicationComponents where ScanCode information is available (stored in the local directory tree) the license data as obtained from the Readers is replaced by this information. The data model is enriched with the found copyright and notice file information. Reports (see Reporting and Creating output documents) are now based on the ScanCode data (where available).

Prerequisites

Bash

The scripts generated by Solicitor to download sources and run ScanCode are in Bash syntax. So either run it on a system using natively Bash (linux) or install an appropriate environment (e.g. Git Bash) if you are using a windows environment.

ScanCode

Download and install ScanCode from https://github.com/nexB/scancode-toolkit/releases. Make sure that the executable is included in the search PATH for executables.

Activate feature

As the ScanCode integration is still experimental it is currently deactivated by default. To enable it set system property solicitor.feature-flag.scancode=true. (See Built in Default Properties for information how to do so.) If this feature flag is not activated then Solicitor will not try to attempt to read ScanCode information from the local file system.

Detailed workflow

Solicitor 1st run

Execute Solicitor in a classic way. As part of the report creation step this will generate two scripts:

  • output/scancode_PROJECTNAME.sh (for downloading the sources, also calls scancodeScan.sh)

  • output/scancodeScan.sh (for running ScanCode on the downloaded sources)

Scripts will include all ApplicationComponents with exception of those where normalizedLicenseType was set to COMMERCIAL.

Download Sources and run Scancode

Change to directory output and execute sh scancode_PROJECTNAME.sh. This will download all sources and process them via ScanCode. This might take several hours to complete. Results are stored in subdirectory Source of the directory output and is organized in a tree structure given by the PackageURL of the ApplicationComponents.

Origin file

The Scancode integration scripts try to download ApplicationComponent sources from default URLs derived from the PackageUrl (e.g. Maven Central). In cases where the sources are not available at these locations, the download will fail (and the subsequent source scan will be skipped). In this case it is possible to manually download the sources from some other location and store it in the directory structure. Restarting the Scancode integration script might then perform the source scan.

To be able to document the (non default) origin of the ApplicationComponent sources a file origin.yaml is created in the components directory in the file system. If the failed source download has been performed manually it is possible to edit this file and correct the data given in this file.

# This file contains metadata about the orgin of the package and the sources.
# This file was automatically created but might manually be edited if the contained data is not correct
sourceDownloadUrl: https://url/pointing/to/the/source/archive.jar  (1)
packageDownloadUrl: https://url/pointing/to/the/binary/archive.jar (2)
# note: to add comments: write them here and remove the hash at the beginning of the line (not yet processed by Solicitor)
1 URL for downloading the sources - will be available as property ApplicationComponent.sourceDownloadUrl in the Solicitor data model.
2 URL for downloading the binaries - will be available as property ApplicationComponent.packageDownloadUrl in the Solicitor data model.

The content of the file origin.yaml currently just affects the above given two properties, it does not affect the downloading of sources by the scripts.

Solicitor 2nd run

Execute Solicitor a second time. After reading the component/license information from the Readers (but before starting the rule engine) Solicitor will try to look up ScanCode information from the directory tree in output/Sources for all processed ApplicationComponents. If information is found for an ApplicationComponent the following is done:

  • License information (including URL of license text) as obtained from the Readers is replaced by the license info found by ScanCode

  • Copyrights are taken from ScanCode results

  • Info on NOTICE file is taken from the ScanCode results

  • If the ScanCode results contain information about project URLs this is stored as sourceRepoUrl and/or ossHomepage

  • sourceDownloadUrl and packageDownloadUrl are set to the values given in file origin.yaml

Output

Main target of the additional information obtained from ScanCode is currently the new report Attributions_PROJECTNAME.html which lists

  • all ApplicationComponents (excluding those which are not OSS licensed)

  • with all found copyrights

  • and all licenses

  • including all different license texts

  • and contents of all found NOTICE files

Correcting data

The data obtained from ScanCode might be affected by false positives (wrongly detected a license or copyright) or false negatives (missed to detect a license or copyright). To compensate such defects there are two mechanisms: Applying Curation information from a "curations" file or changing the License information via the decision table rules.

Curations file

To define curations you might create a file output/curations.yaml containing the following structure:

artifacts:
  - name: pkg/npm/@somescope/somepackage/1.2.3                  (1)
    url: https://github.com/foo/bar                             (2)
    licenses:                                                   (3)
      - license: MIT                                            (4)
        url: https://raw.githubusercontent.com/foo/bar/LICENSE  (5)
    copyrights:                                                 (6)
      - (c) 2021 Donald Duck                                    (7)
      - "(c) 2019 Mickey Mouse <http://mickey.mouse>"           (8)
    excludedPaths:                                              (9)
    - "sources/src"                                             (10)
  - name: pkg/npm/@anotherscope/anotherpackage/4.5.6            (11)
.
.
.
1 Path of the package information as used in the file tree. Derived from the PackageURL.
2 URL of the project, will be stored as sourceRepoUrl. (Optional: no change if not existing.)
3 Licenses to set. Optional. If defined then all found licenses will be replaced by the list of licenses given here.
4 SPDX identifier of license.
5 URL pointing to license text.
6 Copyrights to set. Optional. If defined then all found copyrights will be replaced by the list of copyrights given here.
7 A single copyright.
8 Another copyright. Note that due to YAML syntax any string containing : needs to be enclosed with parentheses
9 Excluded paths to be set. Optional. If defined then all scanned files, whose path prefix contain any given string here, are excluded from the ScanCode information.
10 A single path prefix. All scanned files starting with this path prefix are excluded from the Scancode information.
11 Further packages to follow.
Decision table rules

As for license information obtained from the Readers the license information from ScanCode can also be altered using decision table rules. A new attribute origin was introduced in the RawLicense entity as well as condition field in decision table LicenseAssignmentV2*.xls/csv. The origin attribute in Rawlicense either contains the string scancode if the license information came from ScanCode or it contains the (lowercase) class name of the used Reader.

Using the Extended comparison syntax it is possible to qualify whether a rule should apply for licenses found by ScanCode or not:

Value of condition Origin rule applies for …​

scancode

…​ licenses obtained from ScanCode information

NOT:scancode

…​ licenses obtained from normal Readers

(empty)

…​ in both cases

Last updated 2023-11-20 10:37:01 UTC