Guessing of license URLs

Fetching the license content (NormalizedLicense.effectiveNormalizedLicenseContent) from the URL in NormalizedLicense.effectiveNormalizedLicenseUrl often yields HTML rather than plain text, which is not rendered properly when included in reports. Sometimes the URL does not even point to the license text itself but only to the homepage of the project. In general, this can be corrected manually by editing the downloaded and cached content as described in the previous section, but that approach might require a lot of manual work. Solicitor therefore includes a mechanism named license URL guessing, which tries to guess an alternative license URL that points to a representation of the content better suited for rendering.

Currently, license URL guessing is based solely on the URL given in NormalizedLicense.effectiveNormalizedLicenseUrl. It tries the following approaches:

  • If the original URL is a GitHub URL and matches a pattern known to return HTML-formatted content, the URL is rewritten to point to a raw version of the content.

  • If the original URL points to a GitHub project page (not to a file), the algorithm tries different typical locations (e.g. a file named LICENSE). If a license file is found, its URL is returned as the result.

  • If no "better" URL can be guessed, the original URL is returned.
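
The three approaches above can be illustrated with a minimal sketch. This is not Solicitor's actual implementation: the function name guess_license_url, the exists callback, the regular expressions, the candidate list and the "URL not changed" audit text are all assumptions made for illustration. Only the blob-to-raw rewrite mirrors the example shown in the caching section of this document.

```python
import re
from typing import Callable, Tuple

# GitHub "blob" URLs return an HTML page wrapped around the file.
_BLOB = re.compile(r"^https?://github\.com/([^/]+)/([^/]+)/blob/(.+)$")
# A bare GitHub project page: owner and repository, nothing after it.
_PROJECT = re.compile(r"^https?://github\.com/([^/]+)/([^/]+?)/?$")

# Hypothetical candidate locations; the real list is not given in this text.
CANDIDATES = ("master/LICENSE", "master/LICENSE.txt", "main/LICENSE")


def guess_license_url(url: str, exists: Callable[[str], bool]) -> Tuple[str, str]:
    """Return (guessed_url, audit_info) for the given original URL.

    `exists` is a caller-supplied check that reports whether a URL is
    available (in Solicitor this would go through the cached download).
    """
    match = _BLOB.match(url)
    if match:
        owner, repo, rest = match.groups()
        guessed = f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"
        return guessed, f"URL changed from {url} to {guessed}"

    match = _PROJECT.match(url)
    if match:
        owner, repo = match.groups()
        for candidate in CANDIDATES:
            guessed = f"https://raw.githubusercontent.com/{owner}/{repo}/{candidate}"
            if exists(guessed):
                return guessed, f"URL changed from {url} to {guessed}"

    # Nothing better found: fall back to the original URL.
    return url, "URL not changed"
```

Passing the availability check in as a callback keeps the sketch free of network access, so the rewriting rules can be tried out in isolation.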

The result of the license URL guessing is available via three attributes:

  • NormalizedLicense.guessedLicenseUrl: The (possibly) improved URL pointing to the license text.

  • NormalizedLicense.guessedLicenseUrlAuditInfo: A text describing how the guessed URL was determined (available for auditing purposes).

  • NormalizedLicense.guessedLicenseContent: The content downloaded from the guessed URL.

Downloading the license content, including checking whether a resource exists when trying different possible filenames, uses the same caching mechanism as downloading the content for other URLs; see the previous section.

Caching of guessed URLs

The guessed URL for a given original URL, together with the audit info on the guessing process, is cached using a mechanism that is largely identical to the caching of downloaded content. The files containing the cached data are stored in the directory licenseurls (instead of licenses, which is used for the content itself).

The file content looks as follows:

https://raw.githubusercontent.com/some/project/master/LICENSE (1)
-------------------------                                     (2)
URL changed from https://github.com/some/project/blob/master/LICENSE to https://raw.githubusercontent.com/some/project/master/LICENSE (3)
1 the guessed URL
2 a line of dashes as separator
3 the audit info (might be multiple lines)
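
To make the layout concrete, the following sketch splits such a file into its two parts. It is only an illustration: the function name parse_cached_guess is hypothetical, and the sketch assumes that the callout markers (1) to (3) shown above are documentation annotations and not part of the actual file content.

```python
from typing import Tuple


def parse_cached_guess(text: str) -> Tuple[str, str]:
    """Split cached file content into (guessed_url, audit_info)."""
    lines = text.splitlines()
    # The second line must consist of dashes only; it separates URL and audit info.
    if len(lines) < 2 or set(lines[1].strip()) != {"-"}:
        raise ValueError("expected a URL line followed by a line of dashes")
    guessed_url = lines[0].strip()
    # Everything after the separator is audit info and may span several lines.
    audit_info = "\n".join(lines[2:]).strip()
    return guessed_url, audit_info
```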

This cached information can be changed manually to correct it, similar to manually correcting the license text as described above.

License URL guessing is a new feature as of Solicitor 1.3.0. The guessing algorithm might be modified in future versions without further notice, which might result in different guessed URLs.
Last updated 2023-11-20 10:37:01 UTC