| Northern Lighthouse |
© Northern Lighthouse Ltd - Last updated 4 Apr 2008
| Introduction to BUFR |
The reader is referred to WMO official documentation as the source of definitions of terms related to BUFR and CREX. The information about BUFR and CREX in this file is based on our interpretation of WMO documents, and in case of discrepancies, WMO documentation should be considered the valid one.
The aim of this introduction is to give a short description, using simple words, for those of you who are not familiar with BUFR. We would like to be able to show that, whatever you have been told, BUFR is understandable. If you like our introduction, or if you do not like it, or if you have suggestions about how it could be improved, please send us your comments.
Use our Glossary if you hit an unfamiliar term.
| 1. BUFR background |
BUFR stands for Binary Universal Form for the Representation of meteorological data.
The meteorological world is full of information exchange. All kinds of weather messages are continuously created, transmitted and stored all over the globe. These messages have very different contents: manual observations on the land and on the oceans, upper air soundings, satellite and radar observations, observations from aeroplanes, drifting buoys, constant level balloons, automatic weather stations etc.
All these different observation reports contain different kinds of data, and traditionally there has been a huge variety of different ways to code them. Too many different ways!
BUFR was developed to unify the way of coding weather information for transmission and/or storing. That is what the middle letters stand for: Universal Form! A single form to handle all kinds of meteorological data! What a blessing for programmers working in modern computerised meteorological centres: no need to code and maintain dozens of different coding packages, a single software package can handle all kinds of meteorological messages.
Although BUFR was initially defined for use in meteorology, it can also be used to code e.g. oceanographic or hydrological data. There is nothing in BUFR that limits its use to meteorological data.
The first letter B stands for Binary. It means that BUFR messages cannot be read or written by humans, and we need a computer program for encoding and decoding our data. Today this is not a major problem as most communication between meteorological centres goes through communication computers. But in some parts of the WWW (World Weather Watch) network it is still a problem and a new coding method, called CREX, has been defined for that reason.
On the other hand, most BUFR-coding software packages provide tools for turning any BUFR message into human readable listings, which are far more readable than any of the messages coded according to old character-based codes. One had to be an expert on meteorological codes to know what all the old character coded messages meant.
BUFR code is table driven, i.e. a lot of BUFR coding logic is stored in external tables and only a part of the required logic is hardcoded into computer programs.
There is no need for a general BUFR-coding library to know about the data contents of messages, whether it is temperatures or winds or something else. BUFR messages contain metadata that tell what the actual data are and how they should be interpreted.
For the kernel coding library, a BUFR message is just a long stream of bits with a certain predefined, but not constant, structure. Using the metadata and external tables the coding library is able to find out what parameters are found in the message and how to decode their values.
Of course, application programs that use or create BUFR messages need to know about the meteorological contents. But an application program does not need to know about the stream of bits. It just uses the API (Application Program Interface) of a BUFR library, and lets the library do all the difficult work.
The mechanism of using external tables makes BUFR very versatile. Although BUFR was designed for meteorological data, a universal BUFR coding library can be used for other sciences, or disciplines, as they are called in BUFR terminology, without any changes in the library itself. Of course, in order to use BUFR in another discipline one would need tables appropriate for that discipline, and application programs would be different.
The main BUFR table, called Table B, contains descriptors for all parameters used within a field of science, originally within meteorology. A descriptor describes the name and the unit of a parameter and also how its value is to be packed into a bit representation (see examples in 2.4).
These descriptors are keys to the parameter values within a BUFR message. As a user you do not need to know anything about bits nor bytes, but you must know the descriptor numbers for the parameters you are interested in. Those numbers can be found in Table B.
| 2. BUFR coding principles |
In this chapter we discuss the basics of the BUFR coding mechanism.
In a BUFR message, data values are stored as a continuous bit stream of coded non-negative integer values. Each data value occupies an optimum number of bits, depending on the precision and range of the data parameter.
The number of bits each data value occupies, i.e. its width, and the way to interpret those bits is defined in coding rules stored in Table B.
The location of the bits occupied by an individual data value in this stream depends on the widths of all preceding data values.
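The way each value's position depends on the widths before it can be sketched in a few lines of Python. The widths and coded integers below are purely illustrative, and the helper names `pack_values` and `unpack_values` are hypothetical, not part of any BUFR library.

```python
def pack_values(coded_values, widths):
    """Concatenate non-negative integers into one bit string, most significant bit first."""
    bits = ""
    for value, width in zip(coded_values, widths):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        bits += format(value, f"0{width}b")
    return bits


def unpack_values(bits, widths):
    """Walk the bit string; each field starts where the previous one ended."""
    values = []
    offset = 0
    for width in widths:
        values.append(int(bits[offset:offset + width], 2))
        offset += width  # the next field begins right after this one
    return values


widths = [12, 14, 16]           # illustrative data widths in bits
coded = [2931, 10132, 20494]    # illustrative coded integers
stream = pack_values(coded, widths)
decoded = unpack_values(stream, widths)
```

A real decoder works on packed octets rather than on a string of "0" and "1" characters; the string form is used here only because it makes the offsets easy to see.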
Most coding rules are stored in external BUFR tables (files). To encode or decode BUFR messages the coding software needs to have access to the table files.
- Table A is used to classify message types.
- Table B contains information on how different meteorological parameters are encoded into non-negative integers in a message (and vice versa for decoding), plus their names and units.
- Table C contains special coding rules which need to be hardcoded into BUFR software.
- Table D contains shorthand notations to shorten the description of long message structures.
- Flag & Code Table describes translations of coded numeric values into original non-numeric values.
The descriptors are the key to all information stored in BUFR Tables B, C and D.
In addition to the bitstream of data values, a BUFR message also contains a sequence of descriptors needed to interpret the contents of the data bitstream.
Descriptor values (or "names") are composed of 3 numbers F, XX and YYY. Within Cipher documentation, descriptors are written either as a 6-digit number FXXYYY or, for readability, as an 8-character string F'XX'YYY.
As an example, element descriptor 0'12'004 describes 2 meter temperature and sequence descriptor 3'01'011 is a shorthand notation for date (year + month + day).
Descriptors are explained more in depth in the next chapter.
BUFR Table B (in the format used in Cipher documentation) contains the following entry for descriptor 0'12'004 :
    012004 Dry-bulb temperature at 2m
    K
    1 0 12
The first number is the descriptor as an integer value, followed by the element name. The second line contains the unit of the element (K for Kelvin).
The third line contains the coding rules. The first value in this line is the scale (1 in this example), the second one is the reference value (=0) and the third one is the data width (=12 bits). The relation between the coded value and the actual value is given by the formula
    coded_value = original_value * 10^scale - reference_value
Scale is used to multiply decimal numbers into integer values (scale > 0) or to reduce precision of large values (scale < 0). Thus, in a way, scale tells the length of the decimal fraction used.
Reference value is used to ensure that the encoded value is never negative.
Reference value (together with scale) defines the smallest possible value for the parameter, while data width (together with scale and reference value) defines the largest possible value.
Thus 2m temperature is stored with one decimal fraction, the minimum value is 0.0K and the maximum value is 409.4K. Note that the highest value 409.5K is not available because a field with all bits as 1's is reserved to code a missing value.
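The formula and the missing-value rule can be tried out with a small Python sketch. The class `TableBEntry` and the functions `encode` and `decode` are hypothetical names chosen for this illustration; they are not the API of Cipher or of any other BUFR library, and rounding to the nearest coded integer is our own assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableBEntry:
    name: str
    unit: str
    scale: int
    reference: int
    width: int  # data width in bits

    @property
    def missing(self):
        """A field with all bits set to 1 is reserved for a missing value."""
        return (1 << self.width) - 1


def encode(entry, value):
    """Return the coded integer for a value, or the all-ones pattern for None."""
    if value is None:
        return entry.missing
    # coded_value = original_value * 10^scale - reference_value
    coded = round(value * 10 ** entry.scale) - entry.reference
    if not 0 <= coded < entry.missing:
        raise ValueError(f"{value} {entry.unit} is outside the codable range")
    return coded


def decode(entry, coded):
    """Invert encode(); the all-ones pattern decodes to None."""
    if coded == entry.missing:
        return None
    return (coded + entry.reference) / 10 ** entry.scale


temperature = TableBEntry("Dry-bulb temperature at 2m", "K", 1, 0, 12)
coded = encode(temperature, 293.1)
```

With this entry, 293.1 K is stored as the integer 2931, and a missing temperature is stored as 4095 (twelve 1-bits).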
BUFR Table B contains the following entry for descriptor 0'10'051:
    010051 Pressure reduced to mean sea level
    Pa
    -1 0 14
i.e. the unit is Pascal (Pa), the stored value discards the last digit (scale -1), and the smallest possible value is 0. Because 14 bits are used in the data bit stream, the maximum value is 163820 Pa (i.e. 1638.2 hPa).
Descriptor 0'06'002 describes longitude:
    006002 Longitude (coarse accuracy)
    Degree
    2 -18000 16
i.e. longitude is given in degrees, the stored value retains 2 decimals (scale 2), the smallest possible value is -180.00, and 16 bits are used in the data bit stream, setting the maximum value to 475.34. This allows us to use the range -180.00 to +180.00, set by WMO rules.
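The minimum and maximum values quoted for the three entries can be checked with a short calculation. The function name `value_range` is hypothetical; `Decimal` is used so that the results are exact rather than subject to floating-point rounding.

```python
from decimal import Decimal


def value_range(scale, reference, width):
    """Return (minimum, maximum) codable values as exact decimals."""
    factor = Decimal(10) ** scale
    largest_coded = (1 << width) - 2   # all ones (2**width - 1) means "missing"
    minimum = Decimal(reference) / factor
    maximum = Decimal(largest_coded + reference) / factor
    return minimum, maximum


# (scale, reference value, data width) as quoted in the Table B entries above
entries = {
    "012004": (1, 0, 12),
    "010051": (-1, 0, 14),
    "006002": (2, -18000, 16),
}
ranges = {key: value_range(*rule) for key, rule in entries.items()}
```

The results agree with the text: 0 to 409.4 K, 0 to 163820 Pa, and -180 to 475.34 degrees.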
This is the description part of a simplified message which contains only temperature, pressure, location and time:
    005002 Latitude (coarse accuracy)
    006002 Longitude (coarse accuracy)
    004003 Day
    004004 Hour
    004005 Minute
    012004 2 meter temperature
    010051 Pressure reduced to msl
Note that in real messages we would do better by using sequence descriptors for location and time (see 3.3).
| 3. BUFR descriptors |
In this chapter we discuss BUFR descriptors more thoroughly.
Within Cipher documentation descriptors are written as 8 character strings F'XX'YYY (containing 1+2+3=6 digits), where:
- the digit called F indicates the type of the descriptor (it occupies 2 bits in a real message, so it can take the values 0 to 3);
- the digits called XX indicate a class within a type (it occupies 6 bits in a real message, so it can take the values 0 to 63);
- the digits called YYY indicate an entry within a class (it occupies 8 bits in a real message, so it can take the values 0 to 255).
The digit F divides descriptors into 4 types:
- element descriptors (F=0)
- replication descriptors (F=1)
- operator descriptors (F=2)
- sequence descriptors (F=3)
Classes are used for grouping similar parameters together. For instance, XX=12 is reserved for different temperature parameters.
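As a small illustration of the F, XX, YYY split and of the 2+6+8 bit layout, here is a Python sketch. The function names are hypothetical and the code only mirrors the rules stated above; it is not taken from any BUFR library.

```python
DESCRIPTOR_TYPES = {0: "element", 1: "replication", 2: "operator", 3: "sequence"}


def split_descriptor(fxxyyy):
    """Split '012004' or "0'12'004" into the integers (F, XX, YYY)."""
    text = fxxyyy.replace("'", "")
    if len(text) != 6 or not text.isdigit():
        raise ValueError(f"not a 6-digit descriptor: {fxxyyy!r}")
    f, xx, yyy = int(text[0]), int(text[1:3]), int(text[3:])
    if f > 3 or xx > 63 or yyy > 255:
        raise ValueError(f"field out of range in {fxxyyy!r}")
    return f, xx, yyy


def to_16_bits(f, xx, yyy):
    """Pack F (2 bits), XX (6 bits) and YYY (8 bits) into one 16-bit integer."""
    return (f << 14) | (xx << 8) | yyy


def from_16_bits(packed):
    """Recover (F, XX, YYY) from the 16-bit form."""
    return packed >> 14, (packed >> 8) & 0x3F, packed & 0xFF


f, xx, yyy = split_descriptor("0'12'004")
kind = DESCRIPTOR_TYPES[f]
packed = to_16_bits(3, 1, 11)   # sequence descriptor 3'01'011
```

Note that the 6-digit decimal notation and the 16-bit packed form are different numbers: 3'01'011 is written 301011 but packs to 49419.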
Element descriptors (F=0) are the basic keys linking data values to coded values. Element descriptor decoding rules are stored in BUFR Table B.
Sequence descriptors (F=3) are shorthand notations for a sequence of descriptors. BUFR Table D contains rules that describe how a single sequence descriptor is expanded into a sequence of descriptors, which again may contain other sequence descriptors which must be expanded into a sequence of descriptors... This continues until the final sequence contains only element and operator descriptors.
As an example, we could describe the same structure of the example in 2.5 by using sequence descriptor 3'01'025. According to Table D, sequence descriptor 3'01'025 finally expands into the following element descriptors:
    005002 Latitude (coarse accuracy)
    006002 Longitude (coarse accuracy)
    004003 Day
    004004 Hour
    004005 Minute
So we could describe the structure of the example in 2.5 by using:
    301025 (lat + lon + day + hour + minute)
    012004 2 meter temperature
    010051 pressure reduced to msl
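The repeated expansion of sequence descriptors is naturally recursive, as this Python sketch shows. The toy `TABLE_D` below is an assumption made for illustration: its intermediate entries are chosen so that 301025 expands into the five element descriptors listed above, and they should be checked against the official Table D before being relied on. The function name `expand` is hypothetical.

```python
# Toy Table D (illustrative only): sequence descriptor -> list of descriptors
TABLE_D = {
    "301025": ["301023", "004003", "301012"],
    "301023": ["005002", "006002"],
    "301012": ["004004", "004005"],
}


def expand(descriptors, table_d=TABLE_D):
    """Replace every sequence descriptor (F=3) by its expansion, recursively."""
    expanded = []
    for descriptor in descriptors:
        if descriptor.startswith("3"):
            # A sequence may itself contain sequences, so expand again.
            expanded.extend(expand(table_d[descriptor], table_d))
        else:
            expanded.append(descriptor)
    return expanded


message = ["301025", "012004", "010051"]
flat = expand(message)
```

The short three-descriptor form expands into the same seven element descriptors as the long form of the simplified message.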
Replication descriptors (F=1) are shorthand notations for replicating a sequence of descriptors. Descriptor 1'XX'YYY indicates that the next XX descriptors will be repeated YYY times in the expanded descriptor sequence. YYY is called the replication count.
If YYY=0 then the replication count is coded into the data section of the message. This is called delayed replication.
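Fixed replication can be sketched as follows. This is a minimal illustration under our own assumptions: the function name `expand_replication` is hypothetical, the input is a flat list of 6-digit descriptor strings, and delayed replication (YYY=0) is deliberately rejected because its count can only be found by reading the data section.

```python
def expand_replication(descriptors):
    """Expand fixed replication descriptors 1XXYYY (YYY > 0) in a descriptor list."""
    expanded = []
    i = 0
    while i < len(descriptors):
        descriptor = descriptors[i]
        if descriptor.startswith("1"):
            span, count = int(descriptor[1:3]), int(descriptor[3:])
            if count == 0:
                raise ValueError("delayed replication: the count is stored in the data section")
            block = descriptors[i + 1:i + 1 + span]
            if len(block) < span:
                raise ValueError(f"{descriptor} needs {span} following descriptors")
            # Repeat the next `span` descriptors `count` times.
            expanded.extend(expand_replication(block) * count)
            i += 1 + span
        else:
            expanded.append(descriptor)
            i += 1
    return expanded


# 1'02'003: repeat the next 2 descriptors 3 times
result = expand_replication(["004004", "102003", "012004", "010051"])
```

Here the hour is followed by three temperature and pressure pairs in the expanded sequence.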
Operator descriptors (F=2) are special descriptors used to alter normal coding rules or to add some extra information, e.g. quality control information.
| 4. BUFR message structure |
In this chapter we discuss the physical structure of a BUFR message.
You can use the Cipher BUFRtool utility program (downloadable from the Northern Lighthouse website) to study the structures of BUFR messages.
A BUFR message starts with characters "BUFR", ends with characters "7777" and is binary, i.e. non-human-readable, in between.
For a computer program this binary part is most readable: it has a predefined structure of separate sections. Each section starts with length information, making it possible to find the start of the next section which starts with length information, making it possible to find... I think you got the idea ;-)
- Section 0, the indicator section, is always 8 octets long, and consists of:
  - characters "BUFR",
  - length of the whole message in octets, and
  - BUFR edition number.
  (Note: in editions 0 and 1 the section 0 was only 4 octets long: characters "BUFR".)
- Section 1 is the identification section.
- Section 2 is an optional local section. It can be missing (more about local extensions in chapter 5).
- Section 3, the data description section, contains the descriptors, i.e. it gives the structure of the observation report.
- Section 4, the data section, contains coded data values, to be interpreted according to the structure defined in section 3.
- Section 5, the end section, is always 7777. It provides a checkpoint for the decoding program.
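Reading section 0 and checking the end section takes only a few lines of Python. This sketch assumes edition 2 or later and, as we understand the WMO layout, that the message length occupies octets 5 to 7 and the edition number octet 8. The function name `read_indicator_section` is hypothetical, and the sample message is a dummy with an empty, meaningless body.

```python
def read_indicator_section(message):
    """Return (total_length, edition) from section 0 and verify the end marker."""
    if message[:4] != b"BUFR":
        raise ValueError("message does not start with 'BUFR'")
    total_length = int.from_bytes(message[4:7], "big")   # octets 5-7
    edition = message[7]                                  # octet 8
    if len(message) != total_length:
        raise ValueError("length in section 0 does not match the message size")
    if message[-4:] != b"7777":
        raise ValueError("message does not end with '7777'")
    return total_length, edition


# Dummy message: section 0 + 10 placeholder octets + section 5
body = bytes(10)
total = 8 + len(body) + 4
sample = b"BUFR" + total.to_bytes(3, "big") + bytes([4]) + body + b"7777"
length, edition = read_indicator_section(sample)
```

A real decoder would continue from octet 9, reading the length at the start of each following section to find the next one.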
You can use the utility program Cipher BUFRtool (task msgexam) to print out the different sections of a BUFR message.
If several observation reports share the same structure, i.e. they have the same descriptors in section 3, then it is possible to build a message that has the descriptors given only once but with a data section containing several observation reports. In this case each report is said to be a subset of the whole message.
If all the observation reports share exactly the same message structure (i.e. no delayed replication) then it is possible to compress them, i.e. use another type of coding rules. Compression saves even more space and makes the decoding more efficient.
| 5. Local BUFR extensions |
Data processing centres can expand BUFR capabilities by defining their own local extensions. In this chapter we discuss these extensions.
Section 2 is optional, and if it exists, it is always locally defined.
Data processing centres can define their own section of metadata as section 2, for local use. For instance, a centre may want to include metadata relevant for its archive system, e.g. database keys that help to make the data retrieval more efficient.
Section 2 is meant for local use; if messages are intended for external distribution, they should either not have a section 2, or if they do, the section 2 should contain only information for local use at the originating centre.
Section 1 can be extended for local use. Up to BUFR edition 3, the first 17 octets are defined by WMO, but octets 18 onwards can be used to store local extra metadata. In BUFR edition 4, the first 22 octets are defined by WMO, and octets from 23 onwards can be used for local metadata if needed.
The descriptors defined by WMO cater for a large range of parameters. However, it may happen that a centre needs to encode data for which no suitable descriptors have been defined by WMO yet. Centres can define their own descriptors for their own needs. Local element descriptors are stored in a local Table B and local sequence descriptors in a local Table D.
Element and sequence descriptor classes XX=0,...,47 and entries YYY=0,...,191 in those classes are reserved for WMO definition. The rest can be freely defined and used locally.
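The rule for telling WMO descriptors from local ones is easy to express in code. The function name `is_local` is hypothetical; the thresholds are the ones stated above.

```python
def is_local(f, xx, yyy):
    """True if an element (F=0) or sequence (F=3) descriptor is outside the WMO-reserved range."""
    if f not in (0, 3):
        raise ValueError("the rule applies to element and sequence descriptors only")
    # WMO reserves classes 0..47 and, within them, entries 0..191.
    return xx > 47 or yyy > 191


standard = is_local(0, 12, 4)      # 0'12'004, a WMO descriptor
local = is_local(0, 12, 199)       # 0'12'199, a local entry in a WMO class
```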
If messages containing local descriptors are distributed externally, recipients need to have access to the local tables used to create those messages, in order to be able to decode them. For this reason, when messages are intended for external distribution, the use of local element descriptors is not recommended, unless there is no standard alternative. The section below describes a method that allows the decoding of the standard part of a message containing non-standard element descriptors. However, the use of local sequence descriptors prevents the use of that method, and therefore the use of local sequence descriptors in general is not recommended.
As noted in 5.3, messages intended for external distribution should not contain local descriptors, if there is a standard alternative. However, sometimes there is no standard alternative. In these cases, if recipient centres do not have access to the local tables used to produce the messages, they cannot successfully decode the whole message. But there is a way, described below, to allow the recipients to decode at least the standard part of the message, even if they do not have the local tables used to produce the message.
It is possible to safeguard the non-local part of the message by preceding each local descriptor with a special operator descriptor 2'06'YYY. This operator indicates that the data value corresponding to the following local descriptor occupies YYY bits in the data section. This information allows the decoding program to skip the correct number of bits, and then continue with the rest of the message.
Here is a simplified and unusual ;-) example taken from file observer.bfr in Cipher BUFRtool. The last parameter 0'12'004 can always be decoded even if a site does not have the required local Table B to decode shoe size, mood and body temperature:
    3'01'011 expands into date
    2'06'004 local field of 4 bits follows
    0'02'202 Shoe size
    2'06'003 local field of 3 bits follows
    0'20'202 Mood
    2'06'006 local field of 6 bits follows
    0'12'199 Body temperature
    0'12'004 Dry-bulb temperature at 2 m
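To see why the 2'06'YYY operator is enough for a recipient without the local Table B, consider this Python sketch. It is our own illustration, not code from Cipher BUFRtool: the function name `locate_fields` is hypothetical, and the date sequence 3'01'011 is left out to keep the example short, so bit offsets are counted from the first local field.

```python
def locate_fields(descriptors, known_widths):
    """Return (descriptor, bit_offset, width, decodable) for each data field."""
    fields = []
    offset = 0
    i = 0
    while i < len(descriptors):
        descriptor = descriptors[i]
        if descriptor.startswith("206"):
            # 2'06'YYY: the next (local) descriptor occupies YYY bits.
            width = int(descriptor[3:])
            local = descriptors[i + 1]
            fields.append((local, offset, width, local in known_widths))
            i += 2
        else:
            # Raises KeyError for an unknown descriptor not announced by 2'06'YYY.
            width = known_widths[descriptor]
            fields.append((descriptor, offset, width, True))
            i += 1
        offset += width
    return fields


descriptors = ["206004", "002202", "206003", "020202", "206006", "012199", "012004"]
standard_table_b = {"012004": 12}   # the recipient knows only the WMO entry
fields = locate_fields(descriptors, standard_table_b)
```

The recipient cannot interpret the three local fields, but it knows they occupy 4 + 3 + 6 = 13 bits, so it finds the 12-bit temperature starting at bit offset 13.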
BUFR files observer.bfr and sauna.bfr contain examples of local descriptor usage. Use the Cipher BUFRtool utility program to examine these demo files. The local Table B used in these examples is called B_v1d0s0.255 and can be found in the Cipher BUFRtool subdirectory tables.