Experimental ScanCode Integration
Starting from version 1.4.0 Solicitor can be integrated with the tool ScanCode to include detailed information gathered from the "deep license scan" performed by ScanCode. This includes detected Licenses, Copyrights and Notice-Files.
The current integration with ScanCode is experimental: the ScanCode parameters used, the interfacing and curation logic, and all parts of the data persistence are experimental and might therefore yield results of insufficient quality. The current workflow and implementation are subject to change in future versions without further notice.
The general workflow when integrating with ScanCode consists of the following 3 steps:
Execute Solicitor in a "classic" way, i.e. based only on the data provided via the Readers as described in Reading License Information with Readers. Besides generating the normal reports/documents, this also creates scripts for downloading the needed OSS source code and running ScanCode.
Download the source code and run ScanCode by executing the generated scripts. The downloaded sources and ScanCode results are saved to a directory tree in the local filesystem.
Execute Solicitor a second time. For all ApplicationComponents where ScanCode information is available (stored in the local directory tree), the license data obtained from the Readers is replaced by this information. The data model is enriched with the copyright and notice file information found. Reports (see Reporting and Creating output documents) are now based on the ScanCode data (where available).
The scripts generated by Solicitor to download sources and run ScanCode are written in Bash syntax. Either run them on a system that provides Bash natively (Linux) or, if you are using Windows, install an appropriate environment (e.g. Git Bash).
Download and install ScanCode from https://github.com/nexB/scancode-toolkit/releases. Make sure that the executable is included in the search PATH for executables.
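Before executing the generated scripts it can be helpful to confirm that both prerequisites are reachable via the search PATH. The following is a minimal sketch, not part of Solicitor; the function name `missing_tools` is hypothetical, and the executable names `bash` and `scancode` are assumed to be the ones installed on your system.

```python
import shutil


def missing_tools(tools=("bash", "scancode")):
    """Return the names from `tools` that cannot be found on the search PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("bash and scancode are both available.")
```

If the list is not empty, fix the PATH (or the installation) before running the scripts.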
As the ScanCode integration is still experimental it is currently deactivated by default.
To enable it set system property
(See Built in Default Properties for information on how to do so.)
If this feature flag is not activated, Solicitor will not attempt to read ScanCode information from the local file system.
Solicitor 1st run
Execute Solicitor in a classic way. As part of the report creation step this will generate two scripts:
output/scancode_PROJECTNAME.sh (for downloading the sources, also calls
output/scancodeScan.sh (for running ScanCode on the downloaded sources)
Scripts will include all ApplicationComponents with exception of those where
normalizedLicenseType was set to
Download Sources and run ScanCode
Change to directory
output and execute
This will download all sources and process them via ScanCode.
This might take several hours to complete.
Results are stored in the subdirectory Source of the directory output and are organized in a tree structure given by the PackageURL of the ApplicationComponents.
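To illustrate how a PackageURL could translate into a location in this tree, here is a hedged sketch. The curations example later in this section uses the path `pkg/npm/@somescope/somepackage/1.2.3`; the function below reproduces that shape for npm-style PackageURLs. The function name `purl_to_tree_path` and the exact mapping rules are assumptions made for illustration. Solicitor's actual mapping may differ, especially for other package types.

```python
from urllib.parse import unquote


def purl_to_tree_path(purl: str) -> str:
    """Map a PackageURL to a relative tree path (hypothetical mapping).

    Assumption: 'pkg:<type>/<namespace>/<name>@<version>' becomes
    'pkg/<type>/<namespace>/<name>/<version>'. Qualifiers ('?...') and
    subpaths ('#...') are ignored in this sketch.
    """
    if not purl.startswith("pkg:"):
        raise ValueError("not a PackageURL: " + purl)
    remainder = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0]
    # The version follows the last '@'; an npm scope '@' comes earlier.
    name_part, sep, version = remainder.rpartition("@")
    if not sep or not name_part.strip("/") or not version or "/" in version:
        raise ValueError("PackageURL without version: " + purl)
    return "/".join(["pkg", unquote(name_part).strip("/"), unquote(version)])


if __name__ == "__main__":
    print(purl_to_tree_path("pkg:npm/%40somescope/somepackage@1.2.3"))
```

The scope marker `@` is percent-encoded as `%40` in canonical PackageURLs, which is why the sketch decodes the name part before building the path.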
The ScanCode integration scripts try to download ApplicationComponent sources from default URLs derived from the PackageURL (e.g. Maven Central). If the sources are not available at these locations, the download fails and the subsequent source scan is skipped. In this case it is possible to manually download the sources from some other location and store them in the directory structure. Restarting the ScanCode integration script might then perform the source scan.
To document the (non-default) origin of the ApplicationComponent sources, a file
origin.yaml is created in the component's directory in the file system. If the sources were downloaded manually after a failed download, this file can be edited to correct the data it contains.
# This file contains metadata about the origin of the package and the sources.
# This file was automatically created but might manually be edited if the contained data is not correct
sourceDownloadUrl: https://url/pointing/to/the/source/archive.jar (1)
packageDownloadUrl: https://url/pointing/to/the/binary/archive.jar (2)
# note: to add comments: write them here and remove the hash at the beginning of the line (not yet processed by Solicitor)
(1) URL for downloading the sources. Available as property ApplicationComponent.sourceDownloadUrl in the Solicitor data model.
(2) URL for downloading the binaries. Available as property ApplicationComponent.packageDownloadUrl in the Solicitor data model.
The content of the file
origin.yaml currently affects only the two properties given above; it does not affect the downloading of sources by the scripts.
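As an illustration of how small this file is, the two values can be extracted with a few lines of code. This is a sketch under assumptions, not Solicitor's own parser: the function name `read_origin` is hypothetical, and the code handles only flat `key: value` lines and `#` comments rather than full YAML.

```python
ORIGIN_KEYS = ("sourceDownloadUrl", "packageDownloadUrl")


def read_origin(text: str) -> dict:
    """Extract the two documented keys from the text of an origin.yaml file."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split at the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() in ORIGIN_KEYS:
            result[key.strip()] = value.strip()
    return result


if __name__ == "__main__":
    sample = """# metadata about the origin of the package
sourceDownloadUrl: https://url/pointing/to/the/source/archive.jar
packageDownloadUrl: https://url/pointing/to/the/binary/archive.jar
"""
    print(read_origin(sample))
```

For anything beyond these two flat keys a real YAML parser would be the safer choice.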
Solicitor 2nd run
Execute Solicitor a second time.
After reading the component/license information from the Readers (but before starting the rule engine)
Solicitor will try to look up ScanCode information from the directory tree in
output/Sources for all processed ApplicationComponents. If information is found for an ApplicationComponent the following is done:
License information (including URL of license text) as obtained from the Readers is replaced by the license info found by ScanCode
Copyrights are taken from ScanCode results
Info on NOTICE file is taken from the ScanCode results
If the ScanCode results contain information about project URLs this is stored as
packageDownloadUrl are set to the values given in file
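The replacement behaviour described in this list can be sketched as follows. All class and field names here (`Component`, `ScanResult`, `enrich`, `license_origin`) are hypothetical stand-ins chosen for illustration and do not mirror Solicitor's actual data model. Only the origin value `scancode` is taken from this document; `somereader` is a placeholder for the Reader's name.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass
class ScanResult:
    """Hypothetical container for what was found in the directory tree."""
    licenses: List[str]
    copyrights: List[str]
    notice_file: Optional[str] = None


@dataclass
class Component:
    """Hypothetical, simplified stand-in for an ApplicationComponent."""
    name: str
    licenses: List[str]
    copyrights: List[str] = field(default_factory=list)
    notice_file: Optional[str] = None
    license_origin: str = "somereader"  # placeholder Reader origin


def enrich(component: Component, scan: Optional[ScanResult]) -> Component:
    """Return the component unchanged if no scan data exists; otherwise
    replace the Reader licenses and add copyrights and NOTICE info."""
    if scan is None:
        return component
    return replace(
        component,
        licenses=list(scan.licenses),
        copyrights=list(scan.copyrights),
        notice_file=scan.notice_file,
        license_origin="scancode",
    )
```

Components without scan data keep the license information obtained from the Readers, which matches the "where available" behaviour described above.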
The main target of the additional information obtained from ScanCode is currently the new report
Attributions_PROJECTNAME.html which lists
all ApplicationComponents (excluding those which are not OSS licensed)
with all found copyrights
and all licenses
including all different license texts
and contents of all found NOTICE files
The data obtained from ScanCode might be affected by false positives (a license or copyright detected wrongly) or false negatives (a license or copyright not detected). There are two mechanisms to compensate for such defects: applying curation information from a "curations" file, or changing the license information via the decision table rules.
To define curations, create a file
output/curations.yaml containing the following structure:
- name: pkg/npm/@somescope/somepackage/1.2.3 (1)
url: https://github.com/foo/bar (2)
- license: MIT (4)
url: https://raw.githubusercontent.com/foo/bar/LICENSE (5)
- (c) 2021 Donald Duck (7)
- "(c) 2019 Mickey Mouse <http://mickey.mouse>" (8)
- "sources/src" (10)
- name: pkg/npm/@anotherscope/anotherpackage/4.5.6 (11)
(1) Path of the package information as used in the file tree. Derived from the PackageURL.
(2) URL of the project; will be stored as sourceRepoUrl. (Optional: no change if not present.)
(3) Licenses to set. Optional. If defined, all found licenses are replaced by the list of licenses given here.
(4) SPDX identifier of the license.
(5) URL pointing to the license text.
(6) Copyrights to set. Optional. If defined, all found copyrights are replaced by the list of copyrights given here.
(7) A single copyright.
(8) Another copyright. Note that due to YAML syntax any string containing : needs to be enclosed in quotes.
(9) Excluded paths to set. Optional. If defined, all scanned files whose path starts with any of the prefixes given here are excluded from the ScanCode information.
(10) A single path prefix. All scanned files starting with this path prefix are excluded from the ScanCode information.
(11) Further packages to follow.
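The replace-if-defined semantics and the path-prefix exclusion described above can be sketched in a few lines. This is an illustration under assumptions, not Solicitor's implementation: the function name `apply_curation`, the shape of `scanned_files`, and the dictionary keys `licenses`, `copyrights` and `excludedPaths` are hypothetical. Applying the exclusion before the replacement is also an assumption.

```python
def apply_curation(scanned_files, curation):
    """Combine per-file scan findings and apply one curation entry.

    scanned_files: list of dicts with 'path', 'licenses', 'copyrights'
                   (hypothetical shape of the scan findings).
    curation:      dict that may contain 'excludedPaths', 'licenses'
                   and 'copyrights' (hypothetical key names).
    """
    prefixes = curation.get("excludedPaths", [])
    # Drop every file whose path starts with one of the excluded prefixes.
    kept = [
        f for f in scanned_files
        if not any(f["path"].startswith(p) for p in prefixes)
    ]
    licenses = sorted({lic for f in kept for lic in f["licenses"]})
    copyrights = sorted({c for f in kept for c in f["copyrights"]})
    # If defined, curated lists replace everything that was found.
    if "licenses" in curation:
        licenses = [entry["license"] for entry in curation["licenses"]]
    if "copyrights" in curation:
        copyrights = list(curation["copyrights"])
    return {"licenses": licenses, "copyrights": copyrights}
```

A curation that defines neither list leaves the (possibly filtered) findings untouched, which corresponds to the "Optional" notes in the callouts above.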
Decision table rules
As with license information obtained from the Readers, the license information from ScanCode can also be altered using decision table rules. A new attribute
origin was introduced in the
RawLicense entity, together with a corresponding condition field in the decision table. The
origin attribute in
RawLicense contains either the string
scancode, if the license information came from ScanCode, or the (lowercase) class name of the Reader used.
Using the Extended comparison syntax it is possible to qualify whether a rule should apply for licenses found by ScanCode or not:
|Value of condition Origin
|rule applies for …
… licenses obtained from ScanCode information
… licenses obtained from normal Readers
… in both cases