Usage

The library aims to offer tools for two main operations:

Reading data from the usual formats (ROOT, text files, etc.) into python lists.
Writing data from python lists to the HEPData YAML-based format.

All of this happens in a user-friendly python interface. The Reading data section is useful if you need help getting your data into a python list. If your data is already accessible in python, great! Skip right ahead to Writing data.

The following sections contain code snippets that illustrate each step.

HEPData and its data format

The HEPData data model revolves around Tables and Variables. At its core, a Variable is a one-dimensional array of numbers with some additional (meta-)data, such as uncertainties, units, etc. assigned to it. A Table is simply a set of multiple Variables. This definition will immediately make sense to you when you think of a general table, which has multiple columns representing different variables.

Reading data

Reading from plain text

If you save your data in a text file, a simple-to-use tool is the numpy.loadtxt function, which loads column-wise data from plain-text files and returns it as a numpy array.

import numpy as np
my_array = np.loadtxt("some_file.txt")

A detailed example is available here. For documentation on the loadtxt function, please refer to the numpy documentation.
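If your file holds several columns, you will usually want one python list per column. The following is a minimal sketch of that step. It uses an in-memory io.StringIO buffer as a stand-in for a file on disk, and the column names and numbers are made up for illustration.

```python
import io

import numpy as np

# Stand-in for a text file on disk: two whitespace-separated columns.
# Lines starting with '#' are treated as comments by np.loadtxt.
text = io.StringIO(
    "# mass  limit\n"
    "1.0  10.0\n"
    "2.0   5.0\n"
    "3.0   2.0\n"
)

# unpack=True transposes the result, so each column becomes its own array.
mass_col, limit_col = np.loadtxt(text, unpack=True)

# Convert the numpy arrays to plain python lists.
mass_values = mass_col.tolist()
limit_values = limit_col.tolist()
```

For a real file, pass the file path to np.loadtxt instead of the buffer.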

Reading from ROOT files

In many cases, data in the experiments is available as one of various ROOT data types, such as TGraph, TH1, TH2, TEfficiency, etc., which are saved in *.root files.

To facilitate reading these objects, the RootFileReader class is provided. The reader is instantiated by passing a path to the ROOT file to read from:

from hepdata_lib import RootFileReader
reader = RootFileReader("/path/to/myfile.root")

After initialization, individual methods are provided for access to different types of objects stored in the file.

Reading TGraph, TGraphErrors, TGraphAsymmErrors: RootFileReader.read_graph
Reading TEfficiency: RootFileReader.read_teff
Reading TH1: RootFileReader.read_hist_1d
Reading TH2: RootFileReader.read_hist_2d

While the details of each function are adapted to their respective use cases, they follow a common input/output logic. The methods are called by providing the path to the object inside the ROOT file. They return a dictionary containing lists of all relevant numbers that can be extracted from the object, such as x values, y values, uncertainties, etc.

As an example, if a TGraph is saved with the name mygraph in the directory topdir/subdir inside the ROOT file, it can be retrieved as:

data = reader.read_graph("topdir/subdir/mygraph")

Since a graph is simply a set of (x,y) pairs for each point, the data dictionary will have two key/value pairs:

key “x” -> list of x values.
key “y” -> list of y values.

More complex information will be returned for TGraphErrors, etc, which can also be read in this manner. For detailed descriptions of the extraction logic and returned data, please refer to the documentation of the individual methods.
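To show how the returned dictionary can be used, here is a small sketch that does not need ROOT. The data dictionary below is a hand-written stand-in for the output of read_graph on a plain TGraph, and the numbers are made up.

```python
# Stand-in for the dictionary returned by reader.read_graph(...) for a
# plain TGraph; the numbers are invented for illustration.
data = {"x": [1.0, 2.0, 3.0], "y": [10.0, 5.0, 2.0]}

x_values = data["x"]
y_values = data["y"]

# Every point has one x and one y, so the two lists have equal length.
assert len(x_values) == len(y_values)

# Pair the values up point by point, e.g. for a quick printout.
points = list(zip(x_values, y_values))
for x, y in points:
    print(f"x = {x}, y = {y}")
```

The two lists can then be assigned directly to the values of Variable objects, as described in Writing data.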

An example notebook shows how to read histograms from a ROOT file.

Writing data

Following the HEPData data model, hepdata_lib implements four main classes for writing data:

Submission
Table
Variable
Uncertainty

The Submission object

The Submission object is the central object where all threads come together. It represents the whole HEPData entry and thus carries the top-level metadata that is equally valid for all the tables and variables you may want to enter. The object is also used to create the physical submission files you will upload to the HEPData web interface.

When using hepdata_lib to make an entry, you always need to create a Submission object. The most bare-bone submission consists of only a Submission object with no data in it:

from hepdata_lib import Submission
sub = Submission()
outdir="./output"
sub.create_files(outdir)

The create_files function writes all the YAML output files you need and packs them up in a tar.gz file ready to be uploaded.

Please note: creating the output files also creates a submission folder containing the individual files going into the tarball. This folder exists merely for convenience, to make it easy to inspect each individual file. Manually managing or editing the files in the folder is not recommended, and there is no guarantee that hepdata_lib will handle any of the changes you make gracefully. As far as we are aware, there is no use case where manual editing of the files is necessary. If you have such a use case, please report it in a GitHub issue.

Adding resource links or files

Additional resources, hosted either externally or locally, can be linked with the add_additional_resource function of the Submission object.

sub.add_additional_resource("Web page with auxiliary material", "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/STDM-2012-02/")
sub.add_additional_resource("Some file", "root_file.root", copy_file=True)
sub.add_additional_resource("Some file", "root_file.root", copy_file=True, resource_license={"name": "CC BY 4.0", "url": "https://creativecommons.org/licenses/by/4.0/", "description": "This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator."})
sub.add_additional_resource("Archive of full likelihoods in the HistFactory JSON format", "Likelihoods.tar.gz", copy_file=True, file_type="HistFactory")
sub.add_additional_resource("Likelihood in HS3 format", "likelihood.json", copy_file=True, file_type="HS3")
sub.add_additional_resource("SimpleAnalysis code snippet", "analysis.cxx", copy_file=True, file_type="SimpleAnalysis")
sub.add_additional_resource("Selection and projection function examples", "analysis.cxx", copy_file=True, file_type="ProSelecta")

The first argument is a description and the second is the location of the external link or local resource file. The remaining arguments are optional:

copy_file=True (default value of False) will copy a local file into the output directory.
resource_license can be used to define a data license for an additional resource. The resource_license is in the form of a dictionary with mandatory string name and url values, and an optional description.
file_type="HistFactory" (default value of None) can be used to identify statistical models provided in the HistFactory JSON format (see pyhf section of submission documentation).
file_type="HS3" can be used to identify statistical models provided in the HEP Statistics Serialization Standard (HS3) format (see HS3 section of submission documentation).
file_type="SimpleAnalysis" can be used to identify C++ code snippets in the Simplified ATLAS SUSY analysis (SimpleAnalysis) framework (see SimpleAnalysis section of submission documentation).
file_type="ProSelecta" can be used to identify C++ snippets in the ProSelecta format for use with the NUISANCE framework for event generators in neutrino physics (see NUISANCE section of submission documentation).

Please note: The default license applied to all data uploaded to HEPData is CC0. You do not need to specify a license for a resource file unless it differs from CC0.
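If you attach the same license to several resources, it can help to build the resource_license dictionary in one place. The sketch below is only an illustration: make_resource_license is a hypothetical helper, not part of hepdata_lib. It simply enforces the rule stated above that name and url are mandatory strings and description is optional.

```python
def make_resource_license(name, url, description=None):
    """Build a license dictionary with mandatory name and url strings
    and an optional description (hypothetical helper, not in hepdata_lib)."""
    if not isinstance(name, str) or not isinstance(url, str):
        raise TypeError("name and url must be strings")
    license_dict = {"name": name, "url": url}
    # Only include the description key if a description was given.
    if description is not None:
        license_dict["description"] = description
    return license_dict


cc_by = make_resource_license(
    "CC BY 4.0",
    "https://creativecommons.org/licenses/by/4.0/",
)
```

The resulting dictionary can be passed as the resource_license argument of add_additional_resource.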

The add_link function can alternatively be used to add a link to an external resource:

sub.add_link("Web page with auxiliary material", "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/STDM-2012-02/")

Again, the first argument is a description and the second is the location of the external link.

Tables and Variables

The real data is stored in Variables and Tables. Variables come in two flavors: independent and dependent. Whether a variable is independent or dependent may change with context, but the general idea is that the independent variable is what you put in, and the dependent variable is what comes out. Example: if you calculate a cross-section limit as a function of the mass of a hypothetical new particle, the mass would be independent and the limit dependent. The number of variables of either type is not limited, so if you have a scenario where you give N results as a function of M model parameters, you can have N dependent and M independent variables. All the variables are then bundled up and added to a Table object.

Let’s see what this looks like in code:

from hepdata_lib import Table, Variable

mass = Variable("Graviton mass",
                is_independent=True,
                is_binned=False,
                units="GeV")
mass.values = [ 1, 2, 3 ]

limit = Variable("Cross-section limit",
                is_independent=False,
                is_binned=False,
                units="fb")
limit.values = [ 10, 5, 2 ]

table = Table("Graviton limits")
table.add_variable(mass)
table.add_variable(limit)

That’s it! We have successfully created the Table and Variables and stored our results in them. The only task left is to tell the Submission object about our new Table:

sub.add_table(table)

After we have done this, the table will be included in the output files the Submission.create_files function writes (see The Submission object).

Binned Variables

The above example uses unbinned Variables, which means that every point is simply a single number reflecting a localized value. In many cases, it is useful to use binned Variables, e.g. to represent the x axis of a histogram. In this case, everything works the same way as in the unbinned case, except that we have to specify is_binned=True in the Variable constructor, and change how we format the list of values:

mass_binned = Variable("Same mass as before, but this time it's binned",
                       is_binned=True,
                       is_independent=True)
mass_binned.values = [ (0.5, 1.5), (1.5, 2.5), (2.5, 3.5) ]

The list of values has an entry for each bin of the Variable. The entry is a tuple, where the first entry represents the lower edge of the bin, while the second entry represents the upper edge of the bin. You can simply plug this definition into the code snippet of the unbinned case above to go from an unbinned mass to a binned value. Note that binning a Variable only really makes sense for independent variables.
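If your binning is stored as a flat list of bin edges, as is common for histograms, the tuples can be built in one step. The following is a minimal sketch; edges_to_bins is a hypothetical helper name, not a hepdata_lib function.

```python
def edges_to_bins(edges):
    """Turn N+1 ascending bin edges into N (lower, upper) tuples."""
    if len(edges) < 2:
        raise ValueError("need at least two edges to define a bin")
    # Pair each edge with its right-hand neighbour.
    return list(zip(edges[:-1], edges[1:]))


bin_edges = [0.5, 1.5, 2.5, 3.5]
binned_values = edges_to_bins(bin_edges)
```

The resulting list has the same form as the values assigned to mass_binned above.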

Two-dimensional plots

In some cases, you may want to define information based on multiple parameters, e.g. in the case of a two-dimensional histogram (TH2 in ROOT). This can be easily accomplished by defining two independent Variables in the same Table:

table = Table("Two-dimensional example")

x = Variable("Variable on the x axis",
             is_independent=True,
             is_binned=True)
# x.values = [ ... ]

y = Variable("Variable on the y axis",
             is_independent=True,
             is_binned=True)
# y.values = [ ... ]

v1 = Variable("A variable depending on x and y",
              is_independent=False,
              is_binned=False)
# v1.values = [ ... ]

v2 = Variable("Another variable depending on x and y",
              is_independent=False,
              is_binned=False)
# v2.values = [ ... ]

table.add_variable(x)
table.add_variable(y)
table.add_variable(v1)
table.add_variable(v2)

Note that you can add as many dependent Variables as you would like, and that you can also make the independent variables unbinned.
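Since each Variable is a one-dimensional list, a two-dimensional grid has to be flattened before the values are assigned. The sketch below shows one way to do this, assuming that every Variable in the Table gets one entry per (x, y) cell and that all lists are ordered identically. The bin edges and contents are made up for illustration.

```python
x_bins = [(0.0, 1.0), (1.0, 2.0)]
y_bins = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]

# content[i][j] is the value in x bin i and y bin j (invented numbers).
content = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

# Build three aligned lists with one entry per (x, y) cell.
x_values, y_values, v_values = [], [], []
for i, x_bin in enumerate(x_bins):
    for j, y_bin in enumerate(y_bins):
        x_values.append(x_bin)
        y_values.append(y_bin)
        v_values.append(content[i][j])
```

The three lists can then be assigned to x.values, y.values and v1.values in the snippet above.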

One common use case with more than one independent Variable is that of correlation matrices. A detailed example implementation of this case is available here.

Adding a plot thumbnail to a table

HEPData supports the addition of thumbnail images to each table. This makes it easier for the consumer of your entry to find what they are looking for, since they can simply look for the table that has the thumbnail of the plot they are interested in. If you have the full-size plot available on your drive, you can add it to your entry very easily:

table.add_image("path/to/image.pdf")

The library code then takes care of all the necessary steps, like converting the image to the right format and size, and copying it into your submission folder. The conversion relies on the ImageMagick library, and will only work if the convert command is available on your machine.

Adding resource links or files

In the same way as for the Submission object, additional resources, hosted either externally or locally, can be linked with the add_additional_resource function of the Table object.

table.add_additional_resource("Web page with auxiliary material", "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/STDM-2012-02/")
table.add_additional_resource("Some file", "root_file.root", copy_file=True)

For a description of the arguments, see Adding resource links or files for the Submission object. A possible use case is to attach the data for the table in its original format before it was transformed into the HEPData YAML format. Note that additional resources intended to be highlighted as Analyses (HistFactory, HS3, SimpleAnalysis, ProSelecta) should be attached to a Submission object and not to a Table object.

Adding keywords to a table

To make HEPData entries more searchable, keywords should be used to define what information is shown in a table. HEPData keeps track of keywords separately from the rest of the information in an entry, and provides dedicated functionalities to search for and filter by a given set of keywords. If a user is e.g. interested in finding all tables relevant to graviton production, they can do so quite easily if the tables are labelled properly. This procedure becomes much harder, or even impossible, if no keywords are used. It is therefore considered good practice to add a number of sensible keywords to your tables.

The keywords are stored as a simple dictionary for each table:

table.keywords["observables"] = ["ACC", "EFF"]
table.keywords["reactions"] = ["P P --> GRAVITON --> W+ W-", "P P --> WPRIME --> W+/W- Z0"]

In this example, we specify that the observables shown in a table are acceptance (“ACC”) and efficiency (“EFF”). We also specify the reaction we are talking about, in this case graviton or W’ production with decays to SM gauge bosons. This code snippet is taken from one of our examples.

Lists of recognized keywords are available from the HEPData documentation for Observables, Phrases, and Particles.

Adding a data license

You can add data license information to a table using the add_data_license function of the Table class. This function takes mandatory name and url string arguments, as well as an optional description.

Please note: The default license applied to all data uploaded to HEPData is CC0. You do not need to specify a license for a data table unless it differs from CC0.

table.add_data_license("CC BY 4.0", "https://creativecommons.org/licenses/by/4.0/")
table.add_data_license("CC BY 4.0", "https://creativecommons.org/licenses/by/4.0/", "This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.")

Uncertainties

In many cases, you will want to give uncertainties on the central values provided in the Variable objects. Uncertainties can be symmetric or asymmetric. For symmetric uncertainties, the values of the uncertainties are stored as a one-dimensional list of positive values, which are applied as equal-magnitude positive and negative changes to the value.

For asymmetric uncertainties, the values are expressed as signed two-component iterables (e.g. tuple or list). In general, each pair represents the value changes in response to downward and upward moves of a nuisance parameter, so it is possible for both the “up” and “down” variations to have the same sign (if the effect of the nuisance is one-sided). Both components should therefore be computed as variation_value - nominal_value, so that negative variations correctly acquire a minus sign. Asymmetric statistical errors are represented using the same scheme and should also ensure that the “down” uncertainty has a negative sign.

from hepdata_lib import Uncertainty
unc1 = Uncertainty("A symmetric uncertainty", is_symmetric=True)
unc1.values = [ 0.1, 0.3, 0.5]

unc2 = Uncertainty("An asymmetric uncertainty", is_symmetric=False)
unc2.values = [ (-0.08, +0.15), (-0.13, +0.20), (-0.18,+0.27) ]
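The variation_value - nominal_value rule for asymmetric uncertainties can be sketched in plain python as follows. All numbers are made up for illustration; the last bin shows a one-sided effect where both components come out positive.

```python
nominal = [10.0, 5.0, 2.0]
varied_down = [9.5, 4.75, 2.125]  # results with the nuisance moved down
varied_up = [10.5, 5.5, 2.25]     # results with the nuisance moved up

# Each entry is (down - nominal, up - nominal), so the signs follow
# directly from the direction in which the value actually moved.
asymmetric_values = [
    (down - nom, up - nom)
    for nom, down, up in zip(nominal, varied_down, varied_up)
]
```

A list built this way can be assigned to the values of an Uncertainty created with is_symmetric=False.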

Note that the sizes of the uncertainties define a natural scale for the precision to which the central value should be represented (and in an asymmetric pair, the larger component may naturally set the precision of reporting for the smaller). In HEPData, any numerical values will be displayed at full floating-point precision, so it is often desirable to manually round the values and uncertainties in the submission, to achieve a more readable final display. The hepdata_lib.helpers functions relative_round, round_multiple, round_value_and_uncertainty_arrs, round_value_and_multiple_uncertainties_arrs, round_value_and_uncertainty, round_value_to_decimals and round_value_and_uncertainty_to_decimals can be used to manipulate arrays and dicts of numerical data before attachment to the Variable and Uncertainty objects.
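To illustrate the idea of letting the uncertainty set the precision, here is a plain-python sketch. It is not the implementation of the hepdata_lib.helpers functions, and round_to_uncertainty is a hypothetical name; it only shows the underlying arithmetic, assuming a positive uncertainty and a chosen number of significant digits.

```python
import math


def round_to_uncertainty(value, uncertainty, sig_digits=2):
    """Round value and uncertainty to the decimal place given by the
    first sig_digits significant digits of the uncertainty."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Power of ten of the leading digit of the uncertainty,
    # e.g. 0.0123 -> -2 and 12.0 -> 1.
    exponent = math.floor(math.log10(uncertainty))
    decimals = sig_digits - 1 - exponent
    return round(value, decimals), round(uncertainty, decimals)


pairs = [(1.23456, 0.0123), (152.7, 12.0)]
rounded = [round_to_uncertainty(v, u) for v, u in pairs]
```

For real submissions, the helper functions listed above are the better choice, since they also handle arrays and dicts.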

After creating the Uncertainty objects, the only additional step is to attach them to the Variable:

variable.add_uncertainty(unc1)
variable.add_uncertainty(unc2)

See Uncertainties for more guidance. In particular, note that hepdata_lib will omit the errors key from the YAML output if all uncertainties are zero for a particular bin, printing a warning message “Note that bins with zero content should preferably be omitted completely from the HEPData table”. A legitimate use case is where there are multiple dependent variables and a (different) subset of the bins has missing content for some dependent variables. In this case the uncertainties should be set to zero for the missing bins with a non-numeric central value like '-'. The warning message can be suppressed by passing an optional argument zero_uncertainties_warning=False when defining an instance of the Variable class. Furthermore, note that None can be used to suppress the uncertainty for individual bins in cases where the uncertainty components may only apply to a subset of the values.
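When a bin is empty for every variable in the table, the simplest way to follow the recommendation above is to drop it before filling the Variable and Uncertainty objects. The following is a minimal sketch with made-up numbers for a single dependent variable.

```python
bins = [(0.5, 1.5), (1.5, 2.5), (2.5, 3.5)]
yields = [12.0, 0.0, 7.0]
errors = [3.5, 0.0, 2.6]

# Keep only bins that have non-zero content or a non-zero uncertainty.
kept = [
    (b, y, e)
    for b, y, e in zip(bins, yields, errors)
    if not (y == 0 and e == 0)
]

# Split the filtered rows back into aligned lists.
kept_bins = [b for b, _, _ in kept]
kept_yields = [y for _, y, _ in kept]
kept_errors = [e for _, _, e in kept]
```

The filtered lists stay aligned, so they can be assigned to the independent Variable, the dependent Variable and its Uncertainty respectively.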
