Utils Docs

Public API of methods and functions to handle and verify the JSON schemas.

class inspire_schemas.utils.LocalRefResolver(base_uri, referrer, store=(), cache_remote=True, handlers=(), urljoin_cache=None, remote_cache=None)[source]

Bases: jsonschema.validators.RefResolver

Simple resolver to handle non-uri relative paths.


Resolve a uri or relative path to a schema.

inspire_schemas.utils.author_id_normalize_and_schema(uid, schema=None)[source]

Detect and normalize an author UID schema.

  • uid (string) – a UID string
  • schema (string) – try to resolve to schema

Returns:a tuple (uid, schema) where:

  • uid – the UID normalized to comply with the id.json schema
  • schema – a schema of the UID, or None if not recognised

Return type:

Tuple[string, string]

Raises:

  • UnknownUIDSchema – if the UID provides too little information to definitively guess the schema
  • SchemaUIDConflict – if the specified schema does not match the given UID

inspire_schemas.utils.build_pubnote(title, volume, page_start=None, page_end=None, artid=None)[source]

Build pubnote string from parts (reverse of split_pubnote).
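
As a rough illustration of what assembling a pubnote from its parts can look like, here is a minimal sketch. The comma-separated layout and the helper name build_pubnote_sketch are assumptions made for this example; they are not taken from the library's source.

```python
def build_pubnote_sketch(title, volume, page_start=None, page_end=None, artid=None):
    """Join journal parts into one string (assumed comma-separated layout)."""
    if not (title and volume):
        # Assumption: without a title and a volume there is nothing to build.
        return None

    parts = [title, volume]
    if page_start and page_end:
        parts.append("{}-{}".format(page_start, page_end))
    elif page_start:
        parts.append(str(page_start))
    if artid and artid != page_start:
        # Only add the article ID when it is not already used as the page.
        parts.append(str(artid))
    return ",".join(parts)


print(build_pubnote_sketch("Phys.Rev.D", "43", page_start="124", page_end="126"))
```

The optional parts are appended only when they are present, so the same helper covers journals that use page ranges and journals that use article IDs.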


Normalize value to an Inspire category.

Parameters:value (str) – an Inspire category to properly case, or an arXiv category to translate to the corresponding Inspire category.
Returns:None if value is not a non-empty string, otherwise the corresponding Inspire category.
Return type:str

Convert back a publication_info value from the new format to the old.

Does the inverse transformation of convert_old_publication_info_to_new(), to be used whenever we are sending back records from Labs to Legacy.

Parameters:publication_infos – a publication_info in the new format.
Returns:a publication_info in the old format.
Return type:list(dict)

Convert a publication_info value from the old format to the new.

On Legacy, different series of the same journal were modeled by adding the letter part of the name to the journal volume. For example, a paper published in Physical Review D contained:

    'publication_info': [
        {
            'journal_title': 'Phys.Rev.',
            'journal_volume': 'D43',
        },
    ]

On Labs we instead represent each series with a different journal record. As a consequence, the above example becomes:

    'publication_info': [
        {
            'journal_title': 'Phys.Rev.D',
            'journal_volume': '43',
        },
    ]

This function handles the translation from the old format to the new. See the tests for the various edge cases that it also handles.

Parameters:publication_infos – a publication_info in the old format.
Returns:a publication_info in the new format.
Return type:list(dict)
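
The simple case described above (moving the series letter from the volume to the title) can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name old_to_new_sketch and the rule "one leading capital letter followed by digits" are assumptions, and none of the edge cases covered by the real function are handled.

```python
import re

# Assumption: a series volume is one leading capital letter plus digits, e.g. 'D43'.
_SERIES_VOLUME = re.compile(r"^(?P<letter>[A-Z])(?P<number>\d+)$")


def old_to_new_sketch(publication_infos):
    """Move a leading series letter from journal_volume to journal_title."""
    result = []
    for info in publication_infos:
        new_info = dict(info)  # shallow copy so the input is left untouched
        match = _SERIES_VOLUME.match(info.get("journal_volume") or "")
        if match and "journal_title" in info:
            new_info["journal_title"] = info["journal_title"] + match.group("letter")
            new_info["journal_volume"] = match.group("number")
        result.append(new_info)
    return result


old = [{"journal_title": "Phys.Rev.", "journal_volume": "D43"}]
print(old_to_new_sketch(old))
```

Entries whose volume does not match the assumed pattern are passed through unchanged.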

The country’s name for the given code.

Parameters:code – needs to be an alpha_2 country code.

The country’s code for the given name.

Parameters:name – needs to be an ISO 3166-1 or ISO 3166-3 country name.

Decorator that filters out empty parameters.

Parameters:func (function) – the function to wrap
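
To show the general idea of such a decorator, here is a minimal sketch that drops keyword arguments with empty values before calling the wrapped function. The name drop_empty_kwargs and the set of values treated as empty are assumptions for this example; the library's decorator may apply different rules.

```python
from functools import wraps

# Assumption: these values count as "empty" for the purpose of this sketch.
EMPTY_VALUES = (None, "", [], {})


def drop_empty_kwargs(func):
    """Call func without the keyword arguments whose value is empty."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        kept = {k: v for k, v in kwargs.items() if v not in EMPTY_VALUES}
        return func(*args, **kept)
    return wrapper


@drop_empty_kwargs
def describe(**kwargs):
    # Report which keyword arguments actually reached the function.
    return sorted(kwargs)


print(describe(title="A paper", subtitle="", year=None))
```

Values such as 0 or False are kept, because they compare unequal to every entry in EMPTY_VALUES.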

Parse an incorrect URL and try to fix it by correcting the most common errors. If the fixed URL is still incorrect, it returns None.

Returns:String containing the fixed url or the original one if it could not be fixed.

Add the leading http to a URL that is missing it.


A common error in URLs is that every / has been replaced by |; this function reverses that substitution.
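
The two simple repairs described above, restoring a missing http prefix and turning | back into /, can be sketched as follows. The function names and the choice of "http://" as the default scheme are assumptions for this example, not the library's helpers.

```python
def add_missing_http(url):
    """Prefix the URL with 'http://' when it has no http(s) scheme."""
    if url.startswith(("http://", "https://")):
        return url
    return "http://" + url


def pipes_to_slashes(url):
    """Undo the corruption where every '/' became '|'."""
    return url.replace("|", "/")


broken = "www.example.org|record|42"
print(add_missing_http(pipes_to_slashes(broken)))
```

Running the pipe repair first matters for inputs such as "http:||example.org", where the scheme only becomes recognizable once the slashes are restored.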


Replace Unicode characters with their working equivalents.


Get the license abbreviation from a URL.

Parameters:url (str) – canonical url of the license.
Returns:the corresponding license abbreviation.
Return type:str
Raises:ValueError – when the url is not recognized
inspire_schemas.utils.get_paths(schema, previous_node=None)[source]

For every schema, return the path and index name of every referenced record.

Returns:index and path to the referenced record.
Return type:dict(list(tuple))

inspire_schemas.utils.get_schema_path(schema, resolved=False)[source]

Retrieve the installed path for the given schema.

  • schema (str) – relative or absolute url of the schema to validate, for example, ‘records/authors.json’ or ‘jobs.json’, or just the name of the schema, like ‘jobs’.
  • resolved (bool) – if True, the returned path points to a fully resolved schema, that is to the schema with all $ref replaced by their targets.

Returns:path to the given schema name.

Return type:



Raises:SchemaNotFound – if no schema could be found.

inspire_schemas.utils.get_validation_errors(data, schema=None)[source]

Validation errors for a given record.

  • data (dict) – record to validate.
  • schema (Union[dict, str]) – schema to validate against. If it is a string, it is interpreted as the name of the schema to load (e.g. authors or jobs). If it is None, the schema is taken from data['$schema']. If it is a dictionary, it is used directly.

jsonschema.exceptions.ValidationError – validation errors.

  • SchemaNotFound – if the given schema was not found.
  • SchemaKeyNotFound – if schema is None and no $schema key was found in data.
  • jsonschema.SchemaError – if the schema is invalid.
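
The three accepted forms of the schema argument (string, None, dictionary) can be illustrated with a small dispatch sketch. Everything here is hypothetical: pick_schema and its load_by_name callable are stand-ins invented for the example, and KeyError is used in place of the documented SchemaKeyNotFound so that the snippet runs without the library.

```python
def pick_schema(data, schema=None, load_by_name=None):
    """Return the schema dict to validate ``data`` against.

    ``load_by_name`` is a hypothetical callable standing in for a schema
    loader; it is only needed when the schema is given by name.
    """
    if schema is None:
        if "$schema" not in data:
            # Stand-in for the documented SchemaKeyNotFound error.
            raise KeyError("no schema given and no '$schema' key in data")
        schema = data["$schema"]
    if isinstance(schema, str):
        return load_by_name(schema)  # a name: load the schema it refers to
    return schema  # already a dictionary: used directly


fake_store = {"authors": {"type": "object"}}
print(pick_schema({"$schema": "authors"}, load_by_name=fake_store.__getitem__))
print(pick_schema({}, schema={"type": "array"}))
```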

Return True if obj contains an arXiv identifier.

This is a modified version of the idutils library’s is_arxiv function: it uses two regular expressions instead of three and adds a check that accepts only valid arXiv categories.

inspire_schemas.utils.load_schema(schema_name, resolved=False, _cache={})[source]

Load the given schema from wherever it’s installed.

  • schema_name (str) – Name of the schema to load, for example ‘authors’.
  • resolved (bool) – If True will return the resolved schema, that is with all the $refs replaced by their targets.
  • _cache (dict) – Private argument used for memoization.

the schema with the given name.

Return type:
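
The memoization role of a private dictionary argument such as _cache can be shown with a generic sketch. The function load_json_cached below is a made-up example that caches parsed JSON text; it only demonstrates the pattern of a default dictionary that persists between calls and is not the library's loader.

```python
import json


def load_json_cached(text, _cache={}):
    """Parse ``text`` as JSON once and reuse the result on later calls.

    The default ``_cache`` dict is created once, when the function is
    defined, so it persists between calls and acts as a memo table.
    """
    if text not in _cache:
        _cache[text] = json.loads(text)
    return _cache[text]


first = load_json_cached('{"type": "string"}')
second = load_json_cached('{"type": "string"}')
print(first is second)
```

Because the cached object itself is returned, callers that mutate the result also change what later calls receive; copying before modifying avoids that.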



Return a normalized arXiv identifier from obj.


Normalize arXiv category to be schema compliant.

This properly capitalizes the category and replaces the dash with a dot if needed. If the category is obsolete, it is also converted to its current equivalent.


>>> from inspire_schemas.utils import normalize_arxiv_category
>>> normalize_arxiv_category('funct-an')  # doctest: +SKIP

Normalize collaboration string.

Parameters:collaboration – a string containing collaboration(s) or None
Returns:List of extracted and normalized collaborations
Return type:list


>>> from inspire_schemas.utils import normalize_collaboration
>>> normalize_collaboration('for the CMS and ATLAS Collaborations')
['CMS', 'ATLAS']

Normalize an ISBN in order to be schema-compliant.


Sanitize HTML for use inside records fields.

This strips most of the tags and attributes, only allowing a safe whitelisted subset.


Split page_artid into page_start/end and artid.


Split pubnote into journal information.


List of all arXiv categories that ever existed.


>>> from inspire_schemas.utils import valid_arxiv_categories
>>> 'funct-an' in valid_arxiv_categories()
inspire_schemas.utils.validate(data, schema=None)[source]

Validate the given dictionary against the given schema.

  • data (dict) – record to validate.
  • schema (Union[dict, str]) – schema to validate against. If it is a string, it is interpreted as the name of the schema to load (e.g. authors or jobs). If it is None, the schema is taken from data['$schema']. If it is a dictionary, it is used directly.

Raises:

  • SchemaNotFound – if the given schema was not found.
  • SchemaKeyNotFound – if schema is None and no $schema key was found in data.
  • jsonschema.SchemaError – if the schema is invalid.
  • jsonschema.ValidationError – if the data is invalid.