A survey of biodiversity informatics: Concepts, practices, and challenges

The unprecedented size of the human population, along with its associated economic activities, has an ever‐increasing impact on global environments. Across the world, countries are concerned about the growing resource consumption and the capacity of ecosystems to provide resources. To effectively conserve biodiversity, it is essential to make indicators and knowledge openly available to decision‐makers in ways that they can effectively use them. The development and deployment of tools and techniques to generate these indicators require having access to trustworthy data from biological collections, field surveys and automated sensors, molecular data, and historic academic literature. The transformation of these raw data into synthesized information that is fit for use requires going through many refinement steps. The methodologies and techniques applied to manage and analyze these data constitute an area usually called biodiversity informatics. Biodiversity data follow a life cycle consisting of planning, collection, certification, description, preservation, discovery, integration, and analysis. Researchers, whether producers or consumers of biodiversity data, will likely perform activities related to at least one of these steps. This article explores each stage of the life cycle of biodiversity data, discussing its methodologies, tools, and challenges.


| INTRODUCTION
services (Chapin et al., 2000; Hooper et al., 2012). Chapin et al. (2000) observe that biodiversity variables, such as the number of species present, the number of individuals of each species, and which species are present, along with the interactions (e.g., trophic, competitive, and symbiotic) taking place between these species, determine the species traits that affect ecosystem processes. These traits can be defined as characteristics or attributes of species that are expressed by genes or affected by the environment. Chapin et al. (2000) also observe that global changes, often triggered by humans, such as invasive species, increased atmospheric carbon dioxide, and land-use change, can significantly alter these biodiversity variables and, consequently, the expression of species traits. This, in turn, affects ecosystem processes and their resulting services, which can have negative impacts on human development. This relationship between global changes and biodiversity is illustrated in Figure 2 (adapted from Chapin et al., 2000). Changes in these ecosystem services due to changes in biodiversity can sometimes be nonlinear and stochastic, which can pose a significant risk to humans. Similar conclusions have been reached in other studies on the relationship between biodiversity, ecosystem functioning, and ecosystem services (Hooper et al., 2012). Cardinale et al. (2012) observe that after a species becomes extinct, the resulting changes to ecological processes strongly depend on which traits were eliminated. Hooper et al. (2012) observe that biodiversity loss is as significant to ecosystem change as the direct effects of global changes, such as elevated carbon dioxide in the atmosphere and ozone depletion. This, in turn, affects critical ecosystem services for the local population, such as food production, air quality, and freshwater.
A major effort to address the problem was started in 1992 during the Earth Summit in Rio de Janeiro with the signature of the Convention on Biological Diversity (CBD, 1992), a legally binding international treaty. Its main objectives are the conservation of biodiversity, including ecosystems, species, and genetic resources, and their sustainable and fair use. Countries are required to elaborate and execute a strategy for biodiversity conservation, known as a National Biodiversity Strategy and Action Plan (NBSAP), and to put in place mechanisms to monitor and assess the implementation of this strategy. They should periodically report their progress on implementing their NBSAPs. The Strategic Plan for Biodiversity 2011-2020 defines actions to be taken by countries to achieve a set of 20 targets by 2020, known as the Aichi Biodiversity Targets. It should be observed that the United Nations General Assembly declared 2011-2020 the United Nations Decade on Biodiversity. In 2012, the Intergovernmental Platform on Biodiversity and Ecosystem Services (IPBES) was created to allow for closer cooperation between scientists and policymakers on assessing the status of biodiversity and ecosystem services and their relationship.

FIGURE 1 Biodiversity informatics life cycle
Throughout this work, we use definitions from the International Code of Nomenclature for algae, fungi, and plants (ICN; McNeill, 2012). This document outlines a set of rules and guidelines for scientifically naming and grouping plants, fungi, and algae, constituting a universally adopted reference for the botanical scientific community. Nomenclature best practices for other groups of organisms are governed by other (though similar) documents. Organism samples collected by biologists are evidence of the existence of a particular organism at some place and time and should be properly deposited in a biological collection to be preserved as a reference. A specimen is defined as one such piece of evidence and refers to a particular observation of a single kind of organism. Organisms are classified according to their shared characteristics and grouped at distinct levels of specificity (or taxonomic ranks) using a hierarchical system, in which more specific groups are nested within broader ones. The taxonomic resolution of a biological sample is the rank of the most specific taxonomic determination that has been assigned to it. A taxon is a taxonomic group of organisms at the level of any rank. Species is one of the taxonomic ranks in which organisms can be classified, being regarded as a basic unit of taxonomic classification. The name of a species is composed using a binomial nomenclature system, consisting of the name of the genus followed by a specific epithet, for example, Caryocar brasiliense. After being properly deposited in a biological collection, each record receives a taxonomic identification that assigns the individual to a taxon. Physical specimens stored in biological collections (also referred to as vouchers) are often associated with complementary information, including the date, time, and geographic location where the specimen was collected.
The taxonomic identity of a specimen includes not only the taxon name assigned to the sample, but also its nomenclatural status and authorship, as well as the name of the person who provided the identification. Vouchered specimens, together with their associated data, are what scientifically attests to a particular observation of a species by a collector, at some location and time, and are thus referred to as a species occurrence, constituting a pillar of the information used in biodiversity analysis and synthesis.

| DATA MANAGEMENT
From planning and collecting biodiversity data to making them fit for use, many steps need to be followed, including data planning and collection, data quality and fitness for use, data description, data preservation and publication, and data discovery and integration. These steps comprise a biodiversity data management life cycle, shown in Figure 1 (top right), which we present in this section.

FIGURE 2 Relationship between global changes and biodiversity

| Data planning and collection
The various steps of the biodiversity life cycle usually comprise a Data Management Plan (DMP; Michener & Jones, 2012) of biodiversity research activities. Many research funding agencies, in countries such as the United States, require that research proposals include a DMP. A DMP usually specifies: which data will be collected; which formats or standards will be used for these data; which metadata will be provided and in which standard or format; what the policies for data usage and sharing are; how data will be stored and preserved in the long term; and how data management will be funded. The DMP Tool (Strasser, Abrams, & Cruse, 2014), for instance, is an online tool that supports designing and implementing a DMP. The Data Stewardship Wizard (Pergl, Hooft, Suchánek, Knaisl, & Slifka, 2019) is a web application that supports the creation of DMPs by presenting hierarchical questionnaires to data stewards, researchers, and data scientists. These questionnaires leverage knowledge collected from the research data management community on best practices for elaborating DMPs.
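The components of a DMP listed above can also be captured in a machine-readable form, which makes a plan easier to version and check alongside the project. A minimal sketch in Python (the field names are illustrative, not taken from the DMP Tool or any DMP standard):

```python
import json

# Illustrative machine-readable Data Management Plan covering the
# components listed above. Field names and values are hypothetical.
dmp = {
    "data_collected": "species occurrence records from field expeditions",
    "formats_standards": ["Darwin Core", "CSV"],
    "metadata_standard": "EML",
    "usage_and_sharing_policy": "CC-BY 4.0, 12-month embargo",
    "storage_and_preservation": "institutional repository, published to GBIF",
    "funding": "covered by the project grant",
}

# Serializing the plan allows it to be stored and versioned as a file.
dmp_json = json.dumps(dmp, indent=2)
```
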
Biodiversity is concerned with the variety of living organisms, which can be measured in many ways and scales, from a record of an organism observed in a geographical location at a particular date (a species occurrence; Yesson et al., 2007) to the relative abundance of species in a water sample collected at a long-term ecological research site (Michener, Porter, Servilla, & Vanderbilt, 2011). Omics also present many opportunities for exploring biodiversity: for instance, molecular data from environmental samples (Robbins et al., 2012; Wooley, Godzik, & Friedberg, 2010) can be analyzed in metagenomics studies to identify functional traits and the taxonomic classification of the organisms present in them. Biodiversity data can be collected in various ways: biosensor networks, field expeditions, and observations made by citizen scientists, among others. In the collection process, it is important to use unique identifiers for the project, sampling event, sampling area, and protocol used (Stocks, Stout, & Shank, 2016). These identifiers will later allow the data collected to be stored in biodiversity databases consistently. Whenever possible, the terms should follow a controlled vocabulary or ontology, such as the Biodiversity Collections Ontology (BCO; Walls et al., 2014). Next, we list common sources of data that are used to describe and analyze biodiversity:
• Species occurrences. Species occurrences are one of the most frequently available types of biodiversity data. The main attributes of a species occurrence are a taxon, a location, and a date of occurrence. Species occurrence records originate from different sources. To facilitate the management and improve the accessibility of such information, most institutions currently keep it organized in digital spreadsheets or in relational database systems, while also keeping references to the physical specimens the records refer to.
Some institutions are even deploying efforts toward digitizing the physical specimens. Hardisty et al. (2013) observe that, at the time, only about 10% of natural history collections were digitized and that tools are required to accelerate the process. Besides specimens from biological collections, human observations are another source of species occurrence records. These observations take place, for instance, during field expeditions or through citizen science initiatives (eBird (Sullivan et al., 2014), iNaturalist (Heberling & Isaac, 2018)). In some cases, species are maintained as cultures of living organisms, as in various collections of fungi and other organisms.
• Species checklists. Surveys are often performed within a geographic region, such as a continent (Ulloa et al., 2017), a country, or a national park, to determine which species are present in it. These surveys usually result in a list of taxon names called a species checklist. They might also be restricted to a particular kingdom or biome. Forzza et al. (2012), for instance, describe how the Brazilian Flora List, published in 2010, was assembled, which involved aggregating information about species vouchers from herbarium information systems and having taxonomists review it. The Catalog of Life 1 aggregates over 100 species checklists and contained information on about 1.8 million species in 2020.
• Sample-based and observational data. Sample-based data are collected during events, which may be one-time or periodical, typically involve environmental data, and have a wide range and diversity of measurements. They may involve, for instance, abiotic measurements and population surveys at different temporal and spatial scales in transects, grids, and plots (Magnusson et al., 2013). They are typically collected by Long-Term Ecological Research (LTER) projects (Michener et al., 2011).
Because of the heterogeneity of ecological data (Reichman, Jones, & Schildhauer, 2011), there is no controlled vocabulary that is widely used. Some initiatives in this direction include ontologies such as ENVO, OBOE, and BCO (Walls et al., 2014). The most common tools for publishing ecological data rely on metadata to describe the tabular datasets that comprise them. Such metadata allows general information, such as dataset owner identification and geographic, temporal, and taxonomic coverages, to be recorded, facilitating interpretation by users. Metadata also allows textually describing the meaning of each column of a tabular dataset. Later in this article, the Ecological Metadata Language (Fegraus, Andelman, Jones, & Schildhauer, 2005), a metadata standard for ecological datasets, will be described.
• Molecular data. The analysis of DNA, RNA, and proteins has various applications in the study of biodiversity. The genomic sequences obtained directly from environmental samples containing communities of microorganisms, that is, metagenomes (Robbins et al., 2012), for instance, provide important information for analyzing their taxonomic and functional characteristics. Biological sequences can also support taxonomists (Tautz, Arctander, Minelli, Thomas, & Vogler, 2003) in identifying species. Taxonomists can also use small genomic or gene regions to assess biological diversity across all domains of life. The Barcode of Life project (Ratnasingham & Hebert, 2007; Stockle & Hebert, 2008), for example, analyzes and standardizes small gene regions to help in identifying species. Some systems, such as VoSeq (Peña & Malm, 2012), allow for connecting vouchers present in biological collections to DNA sequences present in genomic databases. R. Guralnick and Hill (2009) observe that diversity can be measured more precisely, when compared to simply counting the number of species, by how species are phylogenetically related.
As examples, they assess the conservation priority of North American birds using their phylogenetic distinctness and extinction risk, and analyze the dispersal of the influenza A virus, also using phylogenetic analysis.
• Academic literature. A vast amount of information about biodiversity is available in the academic literature. Syntheses of field expeditions are often available only in scientific papers. Data related to sampling, collection, and their analysis have often not been propagated to biodiversity databases. Some initiatives, such as the Biodiversity Heritage Library (BHL; Gwinn & Rinaldo, 2009), are dedicated to the digitization of historic biodiversity literature. Coupled with text extraction techniques, such as optical character recognition (OCR), one could potentially extract information from scientific articles, such as taxonomic names (Koning, Sarkar, & Moritz, 2005), and make it available in public databases.
• Images and videos. Field expeditions to conduct sampling often involve the production of images and videos that support the analysis of the studied sites. In the following sections, we describe Audubon Core (R. A. Morris et al., 2013), a controlled vocabulary for describing multimedia resources associated with sampling and species occurrence data.
• Remote sensing. According to Turner et al. (2003), most remote-sensing instruments do not have enough resolution to gather information about individual organisms, but there have been advances that enable some aspects of biodiversity to be observed, such as differentiating species assemblages and tree species (Clark, Roberts, & Clark, 2005). They also argue that, when instrument resolution is insufficient for direct observation, indirect methods can be applied to estimate species distributions and richness. Pettorelli et al. (2016) observe that many essential biodiversity variables (EBVs) could be derived from satellite remote sensing, which can provide global-scale regular monitoring.
Some of these potential EBVs include, for instance, vegetation height and leaf area index. It is also observed that raw satellite data could be processed by scientific workflows, including tasks such as statistical analysis and classification algorithms, to generate EBVs. More recently, Fernández, Ferrier, Navarro, and Pereira (2020) describe the integration of on-site observations and remote sensing through biodiversity modeling for EBV estimation.
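Several of the data sources above are exposed through public web APIs. As an illustration, the sketch below builds a query against GBIF's public occurrence search endpoint and reduces a response to the core occurrence attributes (taxon, location, date). Error handling and paging are omitted, and the helper names are our own:

```python
from urllib.parse import urlencode

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"

def occurrence_search_url(scientific_name, limit=20):
    """Build a GBIF occurrence-search URL for a taxon name."""
    params = {"scientificName": scientific_name, "limit": limit}
    return GBIF_OCCURRENCE_SEARCH + "?" + urlencode(params)

def extract_occurrences(response):
    """Reduce a search response to the core attributes of an
    occurrence: taxon, geographic location, and date."""
    return [
        (r.get("scientificName"), r.get("decimalLatitude"),
         r.get("decimalLongitude"), r.get("eventDate"))
        for r in response.get("results", [])
    ]

url = occurrence_search_url("Caryocar brasiliense")
# The URL can then be fetched with any HTTP client, for example:
#   import json, urllib.request
#   response = json.load(urllib.request.urlopen(url))
#   records = extract_occurrences(response)
```
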
Herbarium specimens, for instance, are an important resource for documenting and analyzing biodiversity, especially its spatial and temporal patterns. However, such biological collections need to undergo a process called digitization, in which the information they provide is converted to electronic format. In this process, a specimen can be digitally photographed and the information from its labels extracted and exported to a database. This is a challenging task since there are hundreds of millions (Soltis, 2017) of specimens deposited in herbaria worldwide and processing each of them requires considerable effort. Haston, Cubey, Pullan, Atkins, and Harris (2012) describe a digitization workflow applied at the Royal Botanic Garden Edinburgh divided into three phases. The first phase involves specimen preparation, which includes, for instance, specimen selection, movement, and taxonomic verification. In the second phase, essential data about the specimen contained in labels, curatorial records, and supplementary sources are extracted and digitized. Some of this information might also be extracted with OCR applied to label images. The third and final phase consists of capturing high-resolution images of the specimens, which can be used for further information extraction with OCR, quality assessment and control, and online publication for interactive exploration by users. High-throughput digitization is possible through the use of technologies such as conveyor belts and digitization stations (Borsch et al., 2020). It is also critical to keep globally unique identifiers for each digitized specimen (Güntsch et al., 2017), facilitating consistent access to information and data gathering by biodiversity portals and aggregators (Berendsohn et al., 2011). Some examples of digitization efforts include iDigBio (Paul, Mast, Riccardi, & Nelson, 2013), the Botanic Garden of Rio de Janeiro (Lanna et al., 2018), and German herbaria (Borsch et al., 2020).
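The globally unique identifiers mentioned above can be minted in many ways; a minimal sketch, assuming UUID-based HTTP URIs under a hypothetical institutional base URL (real deployments follow community recommendations such as those discussed by Güntsch et al., 2017):

```python
import uuid

# Hypothetical resolvable base URL for a collection's specimen records.
BASE = "https://specimens.example.org/id/"

def mint_specimen_id():
    """Mint a globally unique HTTP URI for a newly digitized specimen.

    UUID4 gives uniqueness; serving the URI over HTTP makes the
    identifier resolvable by portals and aggregators.
    """
    return BASE + str(uuid.uuid4())

# Minting many identifiers never produces a collision in practice.
ids = {mint_specimen_id() for _ in range(1000)}
```
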
Some trends and future directions involve the use of artificial intelligence to automate parts of the digitization workflow, such as the automated identification of herbarium specimens using deep learning methods (Carranza-Rojas, Goeau, Bonnet, Mata-Montero, & Joly, 2017).

| Data quality and fitness for use
Although biodiversity scientists have undoubtedly benefited from open access to massive volumes of species occurrence data from many biological collections, there are some caveats that must be accounted for before using such data for modeling. Data are not always adequate for investigating every aspect of natural systems: using inadequate data to study specific aspects of biological diversity can lead to erroneous or misleading results (Chapman, 2005), so investigators must be aware of the inherent limitations of their data before formulating their questions. The availability of detailed information is still very scarce for most known organisms. This scenario, referred to as the Wallacean Shortfall (Lomolino, 2004), is even more critical in megadiverse countries, which remain largely unexplored for many regions and taxonomic groups (J. Soberón & Peterson, 2004). The lack of sufficient data for threatened species is even more concerning, as designing effective programs for their conservation requires knowledge of their geographic distribution and ecological requirements. This shortage of data, combined with nonsystematic sampling and insufficient quality, limits the use of data from biological collections for many intended applications, many of which require an intensive amount of data to be available (Guisan et al., 2007). Failing to account for the inherent limitations of such data while posing and investigating their hypotheses, researchers may obtain erroneous or misleading results, eventually impacting the success of management policies that rely on such information (Chapman, 2005).
A definition of data quality based on fitness for the intended use was first proposed in the context of geographical information systems (Chrisman, 1984) and became widely adopted by the biodiversity informatics community. According to this definition, quality is not an absolute attribute of a dataset but is rather given by its potential to provide users with valuable information in specific contexts. Assessing the quality attributes of data is a fundamental step for any application that might use them, and requires that users previously delimit the purpose, scope, and requirements of their investigation. Data are considered to be of high quality if they are suitable for supporting a given investigation. Depending on the application, users might need to improve the fitness of the data they have in hand, which is part of the data quality management process. Loss of quality in biodiversity data can occur during multiple stages of its life cycle (Chapman, 2005), including the moment of the recording event, the preparation of the specimen before it is incorporated in the collection, its documentation, digitization, and storage. J. Soberón and Peterson (2004) list common issues regarding biodiversity data quality. Specimens in biological collections, from which a considerable amount of species occurrence data is extracted, may have incorrect or outdated taxonomic identifications. Biological taxonomy is constantly changing to accommodate new knowledge about species. Georeferencing errors are also possible due to annotation errors or instrument inaccuracy. In old records, due to the unavailability of mechanisms for accurately assessing location, it is common to find only textual descriptions of where a specimen was collected.
It is recommended that biodiversity databases follow controlled vocabularies and naming standards as much as possible in order to maintain internal consistency (Chapman, 2005). Geographical coordinates, for instance, may not match the textual location description (e.g., county, state, or country name), leading to inconsistent records. When integrating data from different biodiversity databases, external inconsistencies are also a potential problem (Chapman, 2005). These happen, for instance, when names in the different databases come from lists maintained by different authorities. This may lead to missing existing links between data or even linking the data incorrectly.
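An internal consistency check of the kind described above, comparing coordinates against the stated country, can be sketched as follows. The bounding boxes are coarse, illustrative approximations; a real check would test against actual country polygons:

```python
# Approximate country bounding boxes, for illustration only:
# country -> (min_lat, max_lat, min_lon, max_lon)
COUNTRY_BBOX = {
    "Brazil": (-34.0, 5.3, -74.0, -28.8),
    "Chile": (-56.0, -17.5, -76.0, -66.0),
}

def coordinates_consistent(record):
    """Return True if the coordinates are plausible for the country.

    A bounding box is a coarse test: it can flag clear mismatches
    but cannot confirm that a point truly lies inside the country.
    """
    bbox = COUNTRY_BBOX.get(record["country"])
    if bbox is None:
        return True  # no reference data available; do not flag
    min_lat, max_lat, min_lon, max_lon = bbox
    return (min_lat <= record["decimalLatitude"] <= max_lat
            and min_lon <= record["decimalLongitude"] <= max_lon)
```
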
One of the objectives of CBD is to establish a global knowledge network on taxonomy (A. Hardisty et al., 2013). Taxonomic concepts (Berendsohn, 1995) are often incorrectly modeled in biodiversity databases. Berendsohn (1997) developed a conceptual database model for the International Organization for Plant Information covering the different aspects and concepts that are present in taxonomy. Several tools can be used to reduce or eliminate species misidentification. For instance, official species catalogs are available online for taxon querying, such as the Catalog of Life, the World Register of Marine Species, 2 and the Brazilian Flora Species List (Forzza et al., 2012). These can be used to support taxonomic data quality assessment of occurrence records. Most of these catalogs are also accessible via application programming interfaces (APIs) available via the web, allowing the automation of this type of assessment with scripts or applications. It is important to observe that matching a name present in a biodiversity database to names in taxonomic lists does not guarantee correctness. The names may be correctly spelled according to a taxonomic list but the identification of the specimen can be erroneous. In these cases, one still needs taxonomists to check the identifications or tools that support automated identification (Carranza-Rojas et al., 2017). Dalcin (2005) investigated data quality in taxonomic databases, proposing quality metrics and techniques for error prevention, detection, and correction using phonetic algorithms, such as Soundex (D. Holmes & McCabe, 2002), and string similarity algorithms, such as Levenshtein distance (Levenshtein, 1966). More recently, Rees (2014) observes that taxonomic names can contain errors due to misspelling which can lead to failure in retrieving data. He proposes Taxamatch, a method for approximate matching of taxonomic names. 
It uses a modified version of the Damerau-Levenshtein distance (Wagner & Lowrance, 1975) algorithm for genus and species name matching and a phonetic algorithm for authority matching. Experiments showed that the method is able to identify close to 100% of errors in taxon scientific names. A. Hardisty et al. (2013) observe that there are studies about biodiversity that do not require naming organisms. For instance, metagenomic studies concentrate on analyzing samples to classify them according to functional traits identified through sequence alignment with genomic databases. For collections that have digital images of their specimens available, a promising approach is to use deep learning techniques (Schmidhuber, 2015) to automate species identification.
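The string similarity techniques mentioned above can be illustrated with a plain Levenshtein distance used to match a possibly misspelled taxon name against a checklist. This is a didactic sketch, not the Taxamatch algorithm itself:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance counting
    insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def closest_name(query, names, max_distance=2):
    """Return the checklist name closest to `query`, or None if no
    name is within `max_distance` edits."""
    best = min(names, key=lambda n: levenshtein(query, n))
    return best if levenshtein(query, best) <= max_distance else None
```

A small edit-distance threshold keeps the matcher from "correcting" a name into an entirely different taxon; production systems such as Taxamatch add phonetic matching and rank-aware rules on top of this idea.
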
Regarding georeferencing problems, R. Guralnick and Hill (2009) mention the importance of determining the georeferencing uncertainty of occurrence records and its impact on the scale at which studies can be performed. Tools like BioGeomancer (R. P. Guralnick, Wieczorek, Beaman, & Hijmans, 2006) and Geolocate 3 try to infer the geographic coordinates of a species occurrence from a textual location description. Otegui and Guralnick (2016) propose a web API that performs simple consistency checks on occurrence records, such as coordinates with zero values, disagreement between coordinates and country identification, and inverted coordinates. Veiga et al. (2017) propose a framework for biodiversity data quality assessment and management that allows users to define their data quality requirements, and when a particular dataset is fit for use, in a standardized manner. Data quality assessment is given by the evaluation of the fitness for use of a dataset for some application. Data quality management is defined as the process of improving the fitness for use of a dataset. The framework has three main components: DQ Needs, DQ Solutions, and DQ Report. DQ Needs supports the definition of the intended use for a dataset, the respective data quality dimensions, acceptable criteria for data quality measurements in these dimensions, and activities to improve data quality. DQ Solutions describes mechanisms that support meeting the requirements defined in the DQ Needs component, such as tools that implement techniques to improve data quality measurements in some dimension. The DQ Report component describes the dataset that is being assessed and managed by the framework, along with assertions on this dataset describing measurements or amendments applied to it as specified in the other components. The authors envision a Fitness for Use Backbone that would implement these components and where participants could share their data quality requirements and tools. More recently, P. J. Morris et al.
(2018) have extended Kurator (Dou et al., 2012), a library of data curation scientific workflows, to report data quality in terms of the data quality framework proposed by Veiga et al. (2017).
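The distinction the framework draws between assessment and management can be sketched with a validation (a measurement of fitness for use) and an amendment (a proposed improvement), combined into a simple report. The checks and names below are illustrative, not part of the Veiga et al. (2017) specification:

```python
def validate_has_coordinates(record):
    """DQ measurement: does the record have usable, nonzero coordinates?"""
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    return lat is not None and lon is not None and (lat, lon) != (0.0, 0.0)

def amend_swapped_coordinates(record):
    """DQ amendment: swap latitude/longitude when the latitude is out
    of range but would be valid if the two values were exchanged."""
    lat, lon = record["decimalLatitude"], record["decimalLongitude"]
    if abs(lat) > 90 and abs(lon) <= 90:
        record = dict(record, decimalLatitude=lon, decimalLongitude=lat)
    return record

def dq_report(records):
    """DQ report: one assertion per record, evaluated after amendments."""
    amended = [amend_swapped_coordinates(r) for r in records]
    return [{"record": r, "hasCoordinates": validate_has_coordinates(r)}
            for r in amended]
```
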

| Data description
In the description step, metadata is produced to describe biodiversity data. This metadata is essential for users to interpret datasets they download. In this section, we describe the standards, practices, and recommendations for documenting and describing biodiversity data.

| Ecological metadata language
The Ecological Metadata Language (EML; Fegraus et al., 2005) is a metadata standard originally developed for the description of ecological data. It is currently also used to describe datasets of species observations. The standard has several profiles, with their respective fields, that can be used to define the attributes of a dataset. The scientific description profile contains fields such as the creator, geographic coverage (geographicCoverage), temporal coverage (temporalCoverage), taxonomic coverage (taxonomicCoverage), and sampling protocol (sampling) used. This profile is used to define attributes of the dataset as a whole.
The data representation profile, through the dataTable entity, allows for describing the attributes of a tabular dataset. One can define the data types of such attributes, such as dates and numerical values, as well as their constraints, such as minimum and maximum values. Used together, the scientific description and data representation profiles can provide good-quality documentation for a dataset, supporting its meaningful interpretation. EML metadata is expressed in XML, as illustrated in Figure 3.
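A fragment resembling the scientific description profile can be produced programmatically; a minimal sketch using Python's standard XML library (only a few illustrative fields are shown, and a real EML document must follow the full schema, including namespaces and a packageId):

```python
import xml.etree.ElementTree as ET

# Build a minimal EML-like dataset description. Field values are
# illustrative; element names mirror EML fields mentioned in the text.
dataset = ET.Element("dataset")
ET.SubElement(dataset, "title").text = "Bird counts at site X"
creator = ET.SubElement(dataset, "creator")
ET.SubElement(creator, "organizationName").text = "Example Institute"
coverage = ET.SubElement(dataset, "coverage")
geo = ET.SubElement(coverage, "geographicCoverage")
ET.SubElement(geo, "geographicDescription").text = "Serra do Cipó, Brazil"

# Serialize the tree to an XML string.
xml_text = ET.tostring(dataset, encoding="unicode")
```
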
Normally, biodiversity databases provide tools for editing and producing metadata in the EML standard in a more user-friendly way through a graphical interface. The DataONE repository, for example, allows users to provide metadata through a graphical tool called Morpho (Higgins, Berkley, & Jones, 2002). The same repository also has a web interface called Metacat (Berkley, Jones, Bojilova, & Higgins, 2001), which allows for loading free-format tabular ecological data documented with the EML standard. The EML standard is also used to describe datasets on species occurrences and sampling events, as will be described in the following section.

| Data preservation and publication
In the preservation stage, biodiversity datasets are published in a database, such as DataONE or GBIF, where they will be available to the scientific community. These databases adopt data curation and management practices aimed at long-term preservation and availability. There are several possible procedures for publication; in this section, standards and procedures for loading a dataset into a biodiversity database will be described, along with the publication workflow of the main current repositories.
To better manage biological collections (Schindel & Cook, 2018), several systems have been developed in the last decades. Among the common features in this category of software are the management of specimens, control of determination history, taxonomy, images associated with specimens, bibliographic references, curatorial management activities, user management, reports tracking the evolution of collections, printing labels in varied sizes, and data quality. Among the main implementations are BRAHMS (Filer, 2013), used in more than 80 countries and supporting work with botanical collections; Specify, 4 which has been used in more than 500 institutions worldwide for more than 30 years, managing collections of flora and fauna; the proprietary Emu 5 software for managing collections, including botanical ones; and BG-Base, 6 also with more than three decades of use, widely adopted in botanical gardens and arboretums. Jabot (da Silva et al., 2017) has been in use since 2005 at the Rio de Janeiro Botanical Garden and is now shared, in a cloud computing model, with 50 herbaria in Brazil.

FIGURE 3 Part of EML metadata in XML

| Darwin core
Darwin Core (DwC; Wieczorek et al., 2012) is a standard for representing and sharing biodiversity data. The standard consists of a list of terms related to biodiversity and their definitions. DwC discussions, evolution, and maintenance are conducted by TDWG (Biodiversity Information Standards), an association for the development and promotion of standards for recording and exchanging biodiversity data. DwC emerged in 1998 as a term profile in the Species Analyst system, developed by the University of Kansas for the management of biological collections. In 2002, it was adopted for the exchange of information in the Mammal Networked Information System (MaNIS), a distributed system composed of several institutions that maintain biological collections of mammals. In 2009, the DwC standardization process was started, and the standard was ratified in October of the same year at the TDWG annual meeting. DwC is based on the Dublin Core 7 standard, taking advantage of its terms for resource description, such as type, modified, and license, and complementing them with biodiversity-specific terms, such as catalogNumber and scientificName.
The DwC vocabulary terms are organized as follows. The classes indicate the categories or entities defined in the standard; examples of classes are Event, Location, and Taxon. Each class has a set of properties, which are its attributes. For example, the Location class has attributes such as country and decimalLatitude. Finally, values can be assigned to properties, such as "Chile" and −33.61 for the country and decimalLatitude properties, respectively. It is recommended that, whenever possible, the values come from a controlled vocabulary, in the case of textual values, or follow a formatting standard, in the case of numerical or temporal values. For example, species names should come from a recognized list of species, such as the Catalogue of Life. Table 1 illustrates the representation of species occurrence data with DwC. These records come from a dataset published through the Brazilian Marine Biodiversity Database (BaMBa; Meirelles et al., 2015) in GBIF. Normally, a dataset in the DwC format is accompanied by metadata, which is defined in the EML standard (Fegraus et al., 2005). EML covers fields such as the title, authors, and geographic and temporal coverage of the dataset, which help users interpret datasets formatted in the DwC standard.
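The class/property/value organization above can be sketched in plain Python, using DwC term names as dictionary keys. The record values and the validation rules below are illustrative, not part of the standard itself:

```python
# Illustrative sketch: a species occurrence record keyed by Darwin Core
# term names. The values are hypothetical, not taken from a real dataset.

def validate_occurrence(record):
    """Check a few basic constraints on a DwC-style occurrence record."""
    errors = []
    if not record.get("scientificName"):
        errors.append("scientificName is required")
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is not None and not -90 <= float(lat) <= 90:
        errors.append("decimalLatitude out of range")
    if lon is not None and not -180 <= float(lon) <= 180:
        errors.append("decimalLongitude out of range")
    return errors

occurrence = {
    "occurrenceID": "urn:catalog:EXAMPLE:1234",  # hypothetical identifier
    "scientificName": "Abudefduf saxatilis",
    "country": "Brazil",
    "decimalLatitude": -22.97,
    "decimalLongitude": -43.18,
    "basisOfRecord": "HumanObservation",
}

print(validate_occurrence(occurrence))  # → []
```

In practice, such checks would also verify values against controlled vocabularies, for example, matching scientificName against a recognized species list.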
Like relational databases, datasets that follow the DwC format can contain multiple tables that are related through attributes common among them. Such an organization allows, for example, sample data to also be expressed in this standard. Tables 2 and 3 illustrate this type of data organization for representing species sampling. Table 2 contains the sampling events, four in total. The eventId column contains an identifier for each event; the other columns describe the event date, latitude, and longitude, respectively. Table 3 contains counts of organisms for each event. The eventId column describes which event in Table 2 the counts refer to. For example, the first two rows in the table refer to the event that has identifier 1, which is associated with a sampling performed on March 18, 2009. Biodiversity datasets formatted with DwC can be published in global-scale biodiversity databases such as GBIF (Edwards, 2000). GBIF acts as a central registry and aggregator for datasets published by its national and organizational nodes using the Integrated Publishing Toolkit (IPT; Robertson et al., 2014). The publishing workflow includes the following steps: (i) mapping the internal representation of biodiversity data to DwC and extracting them; (ii) adding EML metadata describing the biodiversity data; (iii) packaging both the EML metadata and the DwC-formatted data into a Darwin Core Archive (DwC-A); and (iv) GBIF and national biodiversity aggregators, such as the Brazilian Biodiversity Information System (SiBBr; Gadelha et al., 2014), harvesting the DwC-A files and ingesting them into their databases. This process is illustrated in Figure 4.
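The event/occurrence organization of Tables 2 and 3 can be sketched as two in-memory tables joined on the eventId key. The rows below are illustrative stand-ins, not the published data:

```python
# Minimal sketch of the star-schema organization: occurrence counts
# reference sampling events through a shared eventId key, mirroring
# Tables 2 and 3 (the rows below are hypothetical).

events = {
    "1": {"eventDate": "2009-03-18", "decimalLatitude": -23.0, "decimalLongitude": -45.1},
    "2": {"eventDate": "2009-03-19", "decimalLatitude": -23.1, "decimalLongitude": -45.2},
}

occurrences = [
    {"eventId": "1", "scientificName": "Sparisoma axillare", "individualCount": 4},
    {"eventId": "1", "scientificName": "Acanthurus chirurgus", "individualCount": 2},
    {"eventId": "2", "scientificName": "Sparisoma axillare", "individualCount": 7},
]

def join_occurrences(events, occurrences):
    """Attach the event attributes to each occurrence row (an inner join)."""
    return [{**occ, **events[occ["eventId"]]}
            for occ in occurrences if occ["eventId"] in events]

joined = join_occurrences(events, occurrences)
print(joined[0]["eventDate"])  # → 2009-03-18
```

In a DwC-A, the same relationship is expressed by a core table and extension tables whose rows point back to the core through this shared identifier.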

| Other data publication workflows
Many research groups have tabular biodiversity data stored in various formats and do not have the resources to format them according to DwC. Different approaches in ecology, coupled with distinct research traditions, both in its subdisciplines and in related fields, lead to the production of highly heterogeneous data. Such data can be, among others, counts of individuals, measurements of environmental variables, or representations of ecological processes. The terminologies used also vary according to the research line, as does the way data are structured digitally (Jones, Schildhauer, Reichman, & Bowers, 2006). The EML metadata standard has also been adopted for describing ecological datasets. Due to this heterogeneity, the datasets themselves are published in their original format, as spreadsheets or text files with comma-separated values. In these cases, Metacat (Berkley et al., 2001) can support data publication and preservation. It is responsible for receiving, storing, and disseminating datasets of, for example, Long-Term Ecological Research (LTER; Michener et al., 2011). The Brazilian Marine Biodiversity Database (BaMBa; Meirelles et al., 2015), for instance, was developed to store large datasets from integrated holistic studies, including physicochemical, microbiological, benthic, and fish parameters. BaMBa is linked to SiBBr and has instances of both IPT and Metacat, making it possible to publish data using the workflows previously described. The publication of data, and its consequent preservation, is a contribution to the scientific community as a whole. The data can be reused by other scientists, who can explore them from other points of view. Benefits can also be observed for the publisher of the data: a recent study (Piwowar & Vision, 2013) shows that articles that provide the data used in their analyses in public repositories tend to receive more citations.
Data publication and preservation can also be helpful when biological collections are lost due to disasters. On September 2, 2018, a catastrophic fire at the Brazilian National Museum destroyed the vast majority of the approximately 20 million items in its collections, spanning areas such as archeology, anthropology, zoology, and botany. Many destroyed items belonged to biological collections, including one on invertebrates. Through data publication on GBIF, the museum was able to preserve information about many specimens: 269,660 records were available on September 20, 2018, many of them containing images.

| Data discovery and integration
The search for data to perform biodiversity analysis and synthesis research is still a challenging task. The most recent developments have occurred with the emergence of databases that aggregate datasets at global and national scales, such as GBIF (Edwards, 2000), DataONE, SiBBr (Gadelha et al., 2014), and the speciesLink network (Canhos et al., 2015). The use of metadata and data publishing standards allows institutions to map the internal representations of this information to a format that is clearly specified and can be consumed and processed automatically by machines. Biodiversity aggregation databases allow datasets to be searched geographically, taxonomically, and temporally. Languages and data analysis environments such as R and Python already have packages and libraries integrated with the repositories and aggregators of biodiversity data. rgbif, for example, is an R package that allows searching and retrieving records directly from GBIF, with pygbif being its analog for Python.
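A sketch of how such a client retrieves records: GBIF's occurrence search API returns results in pages described by `results` and `endOfRecords` fields. With pygbif the page fetch would be `occurrences.search(scientificName=name, offset=offset, limit=300)`; here the fetch is injected as a callable so the paging logic can be shown without a live network connection (the response keys follow the GBIF API format, but treat the exact call as an assumption to verify against the library's documentation):

```python
# Sketch of paging through GBIF-style occurrence search results. The page
# dicts use the GBIF API's "results" and "endOfRecords" keys; `fetch` stands
# in for a real client call such as pygbif's occurrences.search.

def fetch_all(fetch, max_records=1000):
    """Accumulate result pages until the API reports the end of records."""
    records, offset = [], 0
    while len(records) < max_records:
        page = fetch(offset)
        records.extend(page["results"])
        if page.get("endOfRecords", True):
            break
        offset += len(page["results"])
    return records[:max_records]

# Hypothetical two-page response, standing in for live API pages.
pages = {0: {"results": ["rec1", "rec2"], "endOfRecords": False},
         2: {"results": ["rec3"], "endOfRecords": True}}

print(fetch_all(lambda offset: pages[offset]))  # → ['rec1', 'rec2', 'rec3']
```

The `max_records` cap matters in practice: occurrence searches for common species can return millions of rows, and most aggregator APIs also enforce their own paging limits.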
Often scientists need to combine data from different sources in integrative research. For example, physicochemical data can be combined with metagenomic data to try to establish correlations that explain some ecosystem processes and their implications for the effectiveness of marine protected areas (Bruce et al., 2012; Meirelles et al., 2015). The activity of combining data from different sources is called data integration and is one of the most active areas of research in scientific data management (Ailamaki, Kantere, & Dash, 2010; König et al., 2019; Miller, 2018). Existing biodiversity databases have advanced by establishing standards for metadata, such as EML (Fegraus et al., 2005), and for data, such as DwC. However, these are limited to defining controlled vocabularies, consisting of standardized terms in each of the themes. A more sophisticated approach involves not only the definition of terms but also the relationships between them and rules of inference; such structures are called ontologies and are the subject of the Semantic Web research area. Initiatives in this direction in biodiversity and ecology include ontologies such as the Environment Ontology (ENVO) and the Biological Collections Ontology (BCO; Walls et al., 2014). Ontologies allow the cross-referencing of different domains (Linked Data) and semantic queries, providing a considerably more powerful data integration tool than the current ones.

| DATA ANALYSIS AND SYNTHESIS
In this section, we present some examples where biodiversity data are analyzed, along with the computational methods used. These examples relate to ecological niche modeling (ENM), network science, biodiversity genomics, wildlife health monitoring, and biodiversity data mining. We also explore methods for interconnecting computational tasks, that is, biodiversity workflow management, and for keeping track of data derivation in these workflows, with the purpose of enabling reproducibility, which is essential in managing biodiversity analysis and synthesis activities.

| Ecological niche modeling
ENM is used to predict the potential geographic distribution of a given species based on environmental factors. A niche-based model represents an approximation of the fundamental ecological niche of a species in the environmental dimensions analyzed (Phillips, Anderson, & Schapire, 2006). Such a model is built using a family of statistical tools to analyze the environmental information associated with occurrence points (geographic coordinates), generating maps that indicate geographic areas with environmental suitability for the modeled species (Elith & Leathwick, 2009; Gomes et al., 2018). ENM has been applied with both ecological and evolutionary objectives and is increasingly incorporated into decision-making, for example, to predict the potential distribution of invasive species, with an indication of vulnerable areas (e.g., Peterson & Robins, 2003), the distribution of species under climate change scenarios (e.g., M. Araújo, Nogués-Bravo, Reginster, Rounsevell, & Whittaker, 2008; M. B. Araújo & Peterson, 2012; Pearson, Thuiller, et al., 2006; Thomas et al., 2004; Wiens, Stralberg, Jongsomjit, Howell, & Snyder, 2009), and the dissemination of infectious diseases (e.g., Costa, Peterson, & Beard, 2002), and to select conservation areas (e.g., M. B. Araújo & Williams, 2000; Y. Chen, 2009; Engler, Guisan, & Rechsteiner, 2004; Pearson, 2010). The concept of niche is defined by Chase and Leibold (2003) as the environmental conditions that meet the minimum requirements of a species so that its birth rate is higher than its mortality rate. Three main factors determine the niche of a species: abiotic (environmental) conditions, biotic conditions, such as species interactions, and dispersal capacity (J. M. Soberón, 2010). These are illustrated by the BAM diagram, which depicts the biotic factors (B), the abiotic factors (A), and the mobility (M).
ENM involves many steps. A thorough and recent checklist of these steps was proposed by Feng et al. (2019). For summary purposes, they can be grouped into three general steps: (1) preprocessing, (2) modeling, and (3) postprocessing.
In the preprocessing stage, the acquisition and pretreatment of species occurrence records and the selection of predictor variables take place. Databases such as speciesLink (Canhos et al., 2015) and GBIF (Edwards, 2004) provide these records. Pretreatment is a decisive step for obtaining a good model: despite the large amount of data available, it is extremely important to apply techniques to clean and check the quality of the data, applying geographic and taxonomic filters, as described above. Abiotic variables can also be downloaded, for example, from climatology databases such as WorldClim (Hijmans, Cameron, Parra, Jones, & Jarvis, 2005) and Bio-ORACLE (terrestrial and marine, respectively; Tyberghein et al., 2012). The environmental layers are downloaded and converted so that they can be used as input to modeling algorithms along with the occurrence points. These datasets are usually in raster format, that is, a grid of two-dimensional cells, where each cell has a value; the resolution of the dataset is given by the cell size, and the smaller the cell, the higher the resolution. In this step, sampling bias is also addressed: spatial filters remove points that are very close geographically, selecting points with a minimum geographical distance between them (Boria, Olson, Goodman, & Anderson, 2014; Naimi, Skidmore, Groen, & Hamm, 2011; Varela, Anderson, García-Valdés, & Fernández-González, 2014), aiming to minimize the effects of spatial autocorrelation (Dormann et al., 2007), and techniques are applied to remove cross-correlation between the predictive environmental variables, known as multicollinearity (Dormann et al., 2013). These procedures aim to reduce spatial bias in the data, which can increase the uncertainty of the generated models.
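The spatial-filtering step described above can be sketched as a greedy thinning pass that keeps only points separated by a minimum great-circle distance. The threshold and the occurrence points below are illustrative:

```python
import math

# Sketch of spatial thinning: greedily keep occurrence points that are at
# least `min_km` apart, discarding near-duplicates that inflate spatial
# autocorrelation. Points and threshold are hypothetical.

def haversine_km(p, q):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def thin_points(points, min_km=10.0):
    """Keep each point only if it is at least min_km from every kept point."""
    kept = []
    for p in points:
        if all(haversine_km(p, q) >= min_km for q in kept):
            kept.append(p)
    return kept

pts = [(-22.90, -43.20), (-22.91, -43.21), (-23.50, -46.60)]
print(thin_points(pts, min_km=10.0))  # the second point is dropped
```

Note that the result of greedy thinning depends on the input order; tools such as spThin instead sample many random thinning solutions and keep the one retaining the most points.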
Moreover, the use of clean data generates models with greater predictive power (Aiello-Lammens, Boria, Radosavljevic, Vilela, & Anderson, 2015; Calabrese, Certain, Kraan, & Dormann, 2014; Lahoz-Monfort, Guillera-Arroita, & Wintle, 2014).
The modeling step consists of applying algorithms to the data obtained in the preprocessing phase to create the models. Various algorithms are used in ENM, based on machine learning, statistical inference, distances, or environmental envelopes. Machine learning algorithms include, for example, Maxent (Phillips et al., 2006) and Boosted Regression Trees (BRTs; Elith, Leathwick, & Hastie, 2008). This step comprises many aspects related to the parameterization of the algorithms used, such as features, regularization, and learning rates. Recording and reporting the parameterization used is very important both for fine-tuning the model and for reproducibility (Qiao, Soberón, & Peterson, 2015).
Postprocessing consists of evaluating the performance of the generated model, to increase reliability or to reduce the uncertainty of models generated by different algorithms. The evaluation is often made by comparing the generated results against distribution data of the species not used in the modeling process. The indices used to measure model performance can be threshold-independent (not based on a specific threshold), like the area under the receiver operating characteristic curve (ROC-AUC), or threshold-dependent, applied in order to scale, optimize, balance, or equalize one type of error or the other in the evaluation process, that is, the omission or commission errors associated with the dataset. Threshold-dependent indices such as Cohen's Kappa statistic and the True Skill Statistic (TSS) are computed from the rates of a confusion matrix based on presence and absence data (Allouche, Tsoar, & Kadmon, 2006; Elith et al., 2006; Pearson, Raxworthy, et al., 2006; C. Liu, White, & Newell, 2011). However, the use of absence-based assessments is highly criticized, given that absences are not usually collected and available; therefore, different methods are used to generate them, which can introduce considerable noise into the assessments performed (Barve et al., 2011; Lobo, Jiménez-Valverde, & Real, 2008; Peterson, Papeş, & Soberón, 2008). Furthermore, more complex strategies combining maps, considering, for example, geographic barriers and deforested areas, can be applied (Anderson, Lew, & Peterson, 2003; Diniz-Filho et al., 2009). A consensus model can be generated by combining projections, where high-suitability areas coincide in most of the models generated for a given species (Araújo & New, 2007).
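The threshold-dependent indices mentioned above can be computed directly from the confusion matrix cells. A minimal sketch, with an illustrative confusion matrix: TSS is sensitivity plus specificity minus one (Allouche et al., 2006), and Cohen's Kappa corrects observed agreement for agreement expected by chance.

```python
# Sketch of threshold-dependent evaluation from a confusion matrix: tp/fp/
# fn/tn are true positives, false positives, false negatives, true negatives.

def tss(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)   # fraction of presences predicted present
    specificity = tn / (tn + fp)   # fraction of absences predicted absent
    return sensitivity + specificity - 1

def cohens_kappa(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    observed = (tp + tn) / n                      # observed agreement
    expected = ((tp + fp) * (tp + fn)             # chance agreement on presence
                + (fn + tn) * (fp + tn)) / n**2   # chance agreement on absence
    return (observed - expected) / (1 - expected)

print(tss(tp=40, fp=10, fn=10, tn=40))  # → 0.6
```

Both indices range up to 1 (perfect discrimination), with 0 indicating performance no better than chance; unlike raw accuracy, TSS is insensitive to prevalence, which is one reason it is favored over Kappa in the cited literature.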
Projections onto different regions or time periods require measures that guarantee the transferability of the generated model, especially when extrapolations are expected, that is, environmental values outside the domain of values used to fit the model (Feng et al., 2019; Owens et al., 2013).
All the steps above, illustrated in Figure 5, are fundamental for the reproducibility of the generated models and processes. To ensure that these processes can be reproduced, all this information must be maintained and made available in an appropriate format.
Currently, several packages available in R can help produce data useful for reproducing ENM experiments (Cobos, Peterson, Barve, & Osorio-Olvera, 2019; de Andrade, Velazco, & De Marco Júnior, 2020; Golding et al., 2018; Kass et al., 2018; Qiao et al., 2016; Sánchez-Tapia et al., 2018). Model-R (Sánchez-Tapia et al., 2018), a framework for scalable and reproducible ENM, was developed with the objective of unifying and automating the preprocessing, modeling, and postprocessing steps, as well as maintaining all this information for reproducibility purposes. This tool includes packages for retrieving and cleaning data, multi-projection tools that can be applied to different temporal and spatial datasets, and postprocessing tools linked to the generated models. The entire modeling process can be parameterized using command-line tools, a local graphical user interface, or the web.
So far, ENM has relied mostly on abiotic variables. A remaining challenge is to incorporate biotic information as well, such as species interactions. These are hard to incorporate because they are dynamic, such that one would need moment-specific and context-specific summaries of the biotic, or interactive, variables. One experiment using biotic information was performed by Heikkinen, Luoto, Virkkala, Pearson, and Körber (2007), who incorporated mutualism information to model four bird species, improving the accuracy of predictions considerably.

FIGURE 5 Typical ENM steps comprising (1) preprocessing (occurrence data selection and retrieval, abiotic data selection and retrieval, and abiotic data correlation analysis and filtering), (2) modeling (algorithm configuration and execution), and (3) postprocessing (model projection and evaluation)

| Biodiversity data mining
The exponential growth of data in recent years has led to an increasing number of discussions around the need for research into new methods of accessing, analyzing, and managing biological data (Howe et al., 2008). Interest in knowledge discovery in databases began, as historically recorded, in 1989 with the Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro & Frawley, 1989) and has evolved greatly in recent decades. As a branch of artificial intelligence, data mining and knowledge discovery aim to automatically discover statistical rules and models from data. The difficulty of discovering patterns in large databases (Han, Kamber, & Pei, 2011), such as GBIF, demands other methods to access and manage biological data (Howe et al., 2008). As biodiversity databases have seen a substantial increase in data, knowledge extraction from them has become a challenge (Drew, 2011). In this sense, Hochachka et al. (2007) argue that for ecological analyses where there is little prior knowledge and hypotheses are not clearly developed, exploratory analyses with data mining techniques (Liao, Chu, & Hsiao, 2012) are more appropriate than confirmatory analyses, which are designed to test hypotheses or estimate model parameters. In ecology, research using data mining was conducted by Spehn and Korner (2009); Pino-Mejías et al. (2010) used classification algorithms to predict the potential habitat of species; a decision tree algorithm was used for forest growing stock modeling (Debeljak, Poljanec, & Ženko, 2014); Kumar, Mills, Hoffman, and Hargrove (2011) used cluster analysis to identify regions with similar ecological conditions; and Flügge, Olhede, and Murrell (2014) used multivariate spatial associations to group species into disjunct sets with similar co-association values. Another possibility of using data mining was investigated by L. A. E. Silva, Siqueira, et al. (2016), who developed a methodology applying association analysis to extract patterns of species co-occurrence from a dataset of the 50 ha Forest Dynamics Project on Barro Colorado Island, finding patterns of positive and negative correlation. To do this, association analysis was applied with the Apriori algorithm (Agrawal & Srikant, 1994). Ciarleglio, Wesley Barnes, and Sarkar (2009) proposed ConsNet, a software package for designing conservation area networks using tabu search (Glover, 1986) with multi-criteria objectives. Knowledge discovery from data has been successfully used in several traditional areas, such as marketing, medicine, economics, engineering, business administration, and geography. In ecology, some research using data mining techniques has also been observed, including works using classification algorithms (Cutler et al., 2007; Dlamini, 2011; Hochachka et al., 2007; Lorena et al., 2011; Pino-Mejías et al., 2010) and cluster analysis (Brandao et al., 2009; Kumar et al., 2011), but very little compared to other areas (Inman-Narahari, Giardina, Ostertag, Cordell, & Sack, 2010). Next, two categories of data mining algorithms that are frequently applied to biodiversity analysis are described.
Association analysis is one of the most popular unsupervised data mining methods for finding frequent item sets from database-logged transactions, by extracting association rules between items present in transactions, without regard to implications of causality (Agrawal & Srikant, 1994; Han et al., 2011; Tan, Kumar, & Srivastava, 2002; Wu et al., 2008). Association analysis aims to reveal rules that are often not obvious. One example is the application of association rules to the identification of species co-occurrence patterns (G. G. Z. Silva, Green, et al., 2016). The algorithms used in cluster analysis, or simply clustering, are intended to partition a set of records into groups such that records within a group are similar to each other, while records belonging to two different groups have different characteristics. The Expectation Maximization (EM) algorithm, for example, uses a probability distribution to represent each cluster, typically Gaussian densities grounded in density estimation theory. The algorithm makes an initial estimate of the parameters and then improves it iteratively. It generates clusters of similar size and spherical shape, easily perceived by human eyes. Brandao et al. (2009) used EM to analyze a database of bromeliads, identifying altitudinal patterns at different spatial scales. Based on the results found, the use of the algorithm was recommended for the conservation of threatened species.
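The first pass of the Apriori algorithm, counting the support of item pairs across transactions, can be sketched for species co-occurrence data. Each "transaction" below is the hypothetical set of genera recorded in one survey plot:

```python
from itertools import combinations
from collections import Counter

# Sketch of the first Apriori pass over co-occurrence data: each transaction
# is the set of taxa recorded in one plot (hypothetical data), and we count
# the support of taxon pairs to find the frequent ones.

def frequent_pairs(transactions, min_support):
    """Return taxon pairs appearing in at least min_support transactions."""
    counts = Counter()
    for taxa in transactions:
        for pair in combinations(sorted(taxa), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

plots = [
    {"Ocotea", "Euterpe", "Cecropia"},
    {"Ocotea", "Euterpe"},
    {"Cecropia", "Euterpe"},
]

print(frequent_pairs(plots, min_support=2))
# ("Cecropia", "Euterpe") and ("Euterpe", "Ocotea") each appear twice
```

The full Apriori algorithm repeats this counting for progressively larger item sets, pruning any candidate whose subsets are not themselves frequent, and then derives association rules from the surviving sets.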

| Wildlife health monitoring
A comprehensive approach to wildlife health monitoring involves many challenges that must be addressed in order to result in a globally effective mechanism for disease prevention: the difficulty of reaching wild, mostly uninhabited areas, and limited access to them; how to cope with the high diversity and complexity of parasites, vectors, hosts, and disease ecology; the methodology and infrastructure for properly collecting, storing, and managing georeferenced high-quality data; how to integrate specialists from different areas to handle data, species, and distinct socioenvironmental contexts; research on knowledge extraction from data-driven models to understand, identify, and predict risks and ultimately convey relevant information to society; and, finally, how to sensitize decision-makers to the importance of monitoring and engage the population as committed citizen scientists. The challenges are even more acute in megadiverse countries which, in addition to biodiversity richness, usually also have to cope with vast territorial distances and sociocultural diversity. He et al. (2016) present the eMammal framework for wildlife monitoring supported by citizen scientists. Animal images collected with camera traps are sent to its database, where visual animal recognition techniques are applied. The species identification recommendations generated are reviewed by citizen scientists and, subsequently, by experts. The resulting validated records are made available to wildlife and ecological researchers. eBird (Sullivan et al., 2014) also leverages the capability of citizen scientists to gather bird observation records; automated data quality filters are used to support the species identifications performed by citizen scientists. The Brazilian Wildlife Health Information System (SISS-Geo; Chame et al., 2019), for example, is a platform for collaborative monitoring that intends to overcome the challenges in wildlife health.
It aims at the integration and participation of various segments of society, encompassing the registration of occurrences by citizen scientists; reliable diagnosis of pathogens by laboratory and expert networks; and computational and mathematical challenges in analytical and predictive systems, knowledge extraction, data integration and visualization, and geographic information systems. It has been successfully applied to support decision-making on recent wildlife health events, such as a recent yellow fever epizootic (Couto-Lima et al., 2017; Moreira-Soto et al., 2018).
By automating the search for occurrence patterns, information reaches citizens nationwide more efficiently, from the general population to experts, and provides the opportunity to acquire knowledge about the possible patterns and parameters that contribute to the occurrence of diseases. In the medium and long term, it also builds the capacity of researchers to develop complex models in disease ecology that can exploit geographic information to improve accuracy. Moreover, occurrence patterns yield data that can assist national policies on health and on biodiversity conservation.
Machine learning has been used for image analysis in wildlife monitoring, such as for automated species classification. As mentioned earlier, Z. He et al. (2016) describe how species recognition is tackled within the eMammal cyber-infrastructure from camera-trap digital images. Once an animal (or group of animals) crosses the motion sensor and triggers the camera, the resulting sequence of captured images is processed in order to detect and segment the animal from the natural scene. Detection is the task of identifying the bounding box within which the animals lie in the image, whereas segmentation separates the animals (foreground) from the scene (background). Both tasks are challenging in this context, sometimes even for humans, as the animals are often camouflaged by the abundance of natural elements in the wild. The approach developed to tackle this object-cutting problem (Ren, Han, & He, 2013) takes into consideration multiple image frames from the captured sequence: a standard background-foreground classifier is applied to each frame, but the locally obtained information is fused across all frames collaboratively, and this process is repeated iteratively until the refinement converges. According to the authors, this technique led to an improvement in average segmentation precision of nearly 15% over the state-of-the-art algorithm. After the background-foreground image segmentation, an even more challenging task takes place: animal species recognition. Here, the segmented image patches of animals are fed into previously trained machine-learning models, built from existing labeled images, in order to classify them by species. Since these patches usually contain some elements from the background scene, that is, the segmentation is not perfect, the recognition model has to cope with a good deal of noise; to make things harder, the model also needs to recognize animals in different poses.
Which machine-learning algorithm is the most adequate to train such recognition models depends mainly on the amount of available training data (G. Chen, Han, He, Kays, & Forrester, 2014). In summary, conventional supervised classification algorithms are recommended for small training datasets whereas deep-learning algorithms best fit the case of an abundance of labeled data. In (G. Chen et al., 2014), using a training dataset of 14,346 images of 20 animal species, a deep convolutional neural network (DCNN) achieved an accuracy of 38%, while a Bag-of-Words model (BoW) achieved 33%. Regardless of the learning algorithm, these species recognition results are still disappointing and of limited use. From an optimistic point of view, though, it is expected that DCNN will perform better with larger training datasets as this architecture is known for its high learning capacity (G. Chen et al., 2014). For instance, in (Gomez Villa, Salazar, & Vargas, 2017) the authors report an accuracy of 88.9% on a large dataset of near 1 million images containing 26 wild animal species. Moreover, there is evidence that increasing the deepness of the neural network architecture leads to higher performance even with no additional training data (Gomez Villa et al., 2017).
Automatically identifying animal species from images seems to be the current trend in wildlife monitoring. Another notable work in this vein was recently presented by Norouzzadeh et al. (2018), in which the authors, motivated by the need to eliminate the burden of manual labeling by specialists and volunteers, proposed a deep neural network (DNN) not only to identify animals, but also to count them and describe their behavior. As discussed earlier, large datasets are required in order to harness the full potential of DNNs. By using the Snapshot Serengeti (SS) dataset, which contains a total of 10.8 million classified images of 48 species, to train the deep-learning model, an impressive accuracy of 93.8% was obtained. The recognition task was divided into two stages: in the first stage, a model was trained exclusively to separate images containing at least one animal from images without animals. Then, a second DNN model was trained to take the resulting images with animals (only a quarter of the total) and extract information, that is, identify the species, the number of animals, and their characteristics. Instead of resorting to different models for each task of the second stage, the authors opted to train a single model. The reasoning behind this choice is that (i) learning related tasks simultaneously is more efficient and (ii) a single model has the advantage of fewer parameters and therefore lower total complexity. In the article, nine DNN architectures were tested, as well as an ensemble of all of them. For the task of detecting images that contain animals, the best accuracy, 96.8%, was achieved by the Very Deep Convolutional Network architecture (VGG; Simonyan & Zisserman, 2014). Regarding species identification, the ensemble of DNNs obtained the highest score, 94.9%, followed by the Deep Residual Learning architecture (ResNet; Z. He et al., 2016), with 93.8%.
Interestingly, when considering the relaxed definition of accuracy as having the correct answer among the top-5 guesses of the DNN model, which would also greatly save human labor in manual labeling, these scores further improve to 99.1% and 98.8%, respectively. Counting the number of animals in an image proved to be a hard task for DNNs. Again, the ensemble of DNNs obtained the best score, correctly predicting the number of animals 63.1% of the time. If allowed to count approximately, with an error of up to ±1, the accuracy climbs to 84.7%. In this task, the best individual architecture was again ResNet, achieving scores of 62.8% and 83.6% for the exact and approximate accuracies, respectively. Finally, the task of describing the characteristics of the animals aimed to detect the following attributes: standing, resting, moving, eating, and whether young animals are present. Note that this is a nonexclusive multilabel classification task, since one or more animals may exhibit multiple attributes. In this task, the ensemble of DNNs obtained 76.2% accuracy, and the second best, ResNet, got 75.6%.
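The relaxed "top-5" accuracy used above is straightforward to compute: a prediction counts as correct whenever the true label is among the model's k highest-scoring classes. A minimal sketch, with hypothetical class-score vectors standing in for model outputs:

```python
# Sketch of top-k accuracy: a prediction is correct when the true label is
# among the k highest-scoring classes. The score dicts are hypothetical
# model outputs, not results from any cited experiment.

def top_k_accuracy(scores, labels, k=5):
    correct = 0
    for class_scores, true_label in zip(scores, labels):
        top_k = sorted(class_scores, key=class_scores.get, reverse=True)[:k]
        correct += true_label in top_k
    return correct / len(labels)

scores = [
    {"zebra": 0.7, "wildebeest": 0.2, "gazelle": 0.1},
    {"zebra": 0.1, "wildebeest": 0.3, "gazelle": 0.6},
]
labels = ["wildebeest", "gazelle"]

print(top_k_accuracy(scores, labels, k=1))  # → 0.5
print(top_k_accuracy(scores, labels, k=2))  # → 1.0
```

This illustrates why the top-5 scores reported above are so much higher than the top-1 scores: the metric forgives confusions among the model's few strongest candidates, which is acceptable when a human reviewer picks the final label from a short list.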
All these encouraging results establish deep neural networks as the current state of the art for pattern recognition in wildlife monitoring. This translates into a significant reduction of human labor and also opens up many new applications in wildlife monitoring, especially for small projects that cannot resort to volunteers and experts for labeling. We are approaching the stage where machine-learning models are as good as humans at species identification, and they may eventually even surpass humans at that task. Figure 6 summarizes the wildlife health monitoring workflow, from data acquisition through computational models and knowledge extraction. An observation of a wild animal is registered into the process through either passive or active monitoring. Camera traps and the like are forms of passive monitoring, whereas fieldwork observation by citizen scientists or experts is called active monitoring. Once the observation is registered, a preprocessing step may take place, such as an image detection/segmentation algorithm or a species identification model, and then the data are stored in a database. To ensure data quality and include missing information, the records may be peer-reviewed and edited by collaborators, with the data expanded if necessary and verified in terms of consistency and correctness. Depending on the purpose of the monitoring, additional information may also be aggregated; in particular, diagnostic tests from collected animal samples could be included, with a possible indication of ongoing epizootics. Finally, the set of acquired data, optionally coupled with extra variables (e.g., socio-environmental layers), can be used to train computational models, such as alert, predictive, and forecast models, and then potentially used to extract knowledge about the subject being investigated (Chame et al., 2019).

FIGURE 6 General workflow of wildlife health monitoring

| Network science applied to biodiversity
Network Science refers to a relatively new domain of scientific investigation that aims to describe emergent properties and patterns of complex systems of interacting entities. Such relational systems are naturally represented as networks, in which interactions are represented as pairwise connections (links) between entities (nodes) and assume particular semantics depending on the nature of the modeled phenomenon. The rise of this field is strongly associated with recent advances in information technology, which provided scientists with novel tools for collecting, storing, and processing data from many knowledge domains more efficiently and at larger scales. Although a variety of networked systems in many disciplines had been studied long before that, technological advances have allowed us to model real-world systems in much more detail, from large volumes of data that are often public or easily accessible to the investigator. Network science (Barabási, 2016; Newman, 2010) has been applied to model networked systems in a variety of knowledge domains, including the Internet, scientific collaboration networks, and ecological networks, to cite just a few (Albert & Barabási, 2002). Given their relational structure, network models are formally represented as graphs. Network modeling has been widely adopted in the context of biodiversity research, especially for investigating ecological and evolutionary aspects of ecosystems and natural communities. Efforts toward this goal have led to the creation of the field of network ecology, which has undergone noticeable growth over the last few years (Bascompte, 2007; Borrett, Moody, & Edelmann, 2014).
Network ecology has traditionally focused on describing general aspects of the entangled networks through which organisms interact. As ecological interactions are regarded as key processes shaping ecosystem functioning and structure, unraveling their architecture and dynamics is essential for understanding a variety of ecosystem features, such as stability and energy flow. Interaction networks can be broadly classified as food webs, host-parasitoid webs, or mutualistic webs (Ings et al., 2009), food webs being the first described in the literature, starting with two classical papers by Lindeman (1942) and Odum (1956). Besides ecological interactions, network thinking has also been applied to model other aspects of natural systems. Patterns of animal movement can be investigated in a structured way, for instance, by means of movement networks (Jacoby & Freeman, 2016). These networks represent geographical space as a set of discrete and interconnected locations, forming a mesh of possible routes through which animals (or groups of animals) travel. Links between each pair of locations are weighted according to their geographical connectivity. Animal movement is thus regarded as a dynamic process composed of sequences of discrete movement steps running through the network structure. As the spatial feature is key in this type of network, they are also referred to as spatial networks (Bascompte, 2007).
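As a toy illustration of a movement network, the sketch below derives weighted directed links from hypothetical animal trajectories, with link weights counting observed transitions between discrete locations (the site names and trajectories are invented):

```python
from collections import Counter

# Hypothetical tracking data: each trajectory is a sequence of visited sites.
trajectories = [
    ["A", "B", "C", "D"],
    ["A", "B", "D"],
    ["B", "C", "D"],
]

# Weight each directed link by how often an animal was observed moving
# between the two sites; heavier links indicate more frequently used routes.
edge_weights = Counter()
for path in trajectories:
    for src, dst in zip(path, path[1:]):
        edge_weights[(src, dst)] += 1

for (src, dst), weight in sorted(edge_weights.items()):
    print(f"{src} -> {dst}: {weight}")
```

In practice, link weights would instead reflect geographical connectivity or movement probabilities estimated from telemetry data; the counting scheme above is only the simplest possible choice.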
Others have applied network science to investigate biogeographical patterns, such as species co-occurrence. C. R. Stephens et al. (2009), for instance, used biotic interaction networks to analyze biodiversity and predict emerging diseases. The so-called co-occurrence networks model species associations in terms of their geographical distributions, such that species that are often observed occurring together in the same set of localities are considered strongly connected to each other. As in other networked systems, co-occurrence networks are composed of a majority of species holding co-occurrence links to very few others, while only a few species are connected to many others (M. B. Araújo, Rozenfeld, Rahbek, & Marquet, 2011). Co-occurrence network analysis has been used for many applications in biodiversity studies, such as selecting subsets of species to be used as surrogates for the characterization of biological communities (Tulloch et al., 2016); assessing the resilience of biotic communities toward climate change (M. B. Araújo et al., 2011); and identifying modularity (clusters of overlapping species ranges) in biological communities from animal-location bipartite networks (Thébault, 2013).
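A minimal sketch of how such a co-occurrence network could be assembled from occurrence data, with link weights given by the number of shared localities (species and locality names are hypothetical):

```python
from itertools import combinations

# Hypothetical occurrence records: species -> set of localities where recorded.
occurrences = {
    "sp1": {"loc1", "loc2", "loc3"},
    "sp2": {"loc2", "loc3"},
    "sp3": {"loc3", "loc4"},
}

# Link two species with a weight equal to the number of localities they share.
cooccurrence = {}
for a, b in combinations(sorted(occurrences), 2):
    shared = len(occurrences[a] & occurrences[b])
    if shared:
        cooccurrence[(a, b)] = shared

print(cooccurrence)
```

Real analyses would additionally normalize weights (e.g., by range sizes) and test links against null models, but the set-intersection structure is the same.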
The social network analytics framework, a particular application of network theory to represent and analyze social interactions in many distinct knowledge domains, has also been applied in some biodiversity studies, though in most cases for modeling animal social behavior (Faust, 2011). An alternative perspective is to look at communities of biodiversity data producers and consumers, in order to better understand the myriad of contexts in which data are collected, shared, and used. Mapping data flow within the community of biodiversity informatics initiatives, for instance, could help to prioritize and improve the coordination of collaborative actions, leading to more effective biodiversity data-based policies (Bingham et al., 2017). Furthermore, patterns of scientific community formation can be identified and characterized by exploring collaborative paper authoring networks and scientific topic networks (Borrett et al., 2014). Analogously, a recent work (de Siracusa, Gadelha, & Ziviani, 2020) has shown that species occurrence records can be used to identify communities of field collectors, in terms of their mutual collaborations during fieldwork, or even in terms of their taxonomic interests. Figure 7 illustrates such networks; the data can be visualized from three perspectives: (a) the unprojected network, where collectors (green nodes) are linked to the species (red nodes) they have recorded, the total number of records of a given species by a collector being reflected in the strength of their link; (b) the SCN projection onto the species set, where species are linked together if they have been collected by common collectors, the strength of the link between two species being proportional to the number of collectors they share; and (c) the SCN projection onto the set of collectors, where collectors are linked together if they have recorded species in common, the strength of the link between two collectors being proportional to the number of species they share.
Link strength in both projections is graphically displayed as edge thickness. The sizes of collector and species nodes reflect their degrees in each perspective.
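The bipartite network and its two projections described above can be sketched as follows; the collectors, species, and record counts are invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical unprojected network: collector -> {species: number of records}.
records = {
    "collector1": {"spA": 3, "spB": 1},
    "collector2": {"spA": 2, "spC": 4},
    "collector3": {"spB": 2, "spC": 1},
}

# Projection onto species: link species recorded by common collectors;
# each shared collector adds 1 to the link strength.
species_links = defaultdict(int)
for species_seen in records.values():
    for a, b in combinations(sorted(species_seen), 2):
        species_links[(a, b)] += 1

# Projection onto collectors: link collectors that recorded species in common;
# each shared species adds 1 to the link strength.
species_to_collectors = defaultdict(set)
for collector, species_seen in records.items():
    for sp in species_seen:
        species_to_collectors[sp].add(collector)

collector_links = defaultdict(int)
for collectors in species_to_collectors.values():
    for a, b in combinations(sorted(collectors), 2):
        collector_links[(a, b)] += 1

print(dict(species_links))
print(dict(collector_links))
```

The two projections are the one-mode views of the same bipartite data, which is why link strengths in each are bounded by the number of nodes on the opposite side.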
As biological collections result from multiple contributions of individual collectors over time, understanding the evolution of such communities can be an invaluable surrogate for understanding the assembly process of the biological collections themselves. Such an approach opens many new perspectives for the use of museum data, including characterizing sampling biases inherent to a collection. A similar example of a collaboration network in biodiversity has been presented by Groom, O'Reilly, and Humphrey (2014), where a correspondence network of 19th-20th century botanists was structured from digitized data from the British Herbaria. Botanists composing this network corresponded with each other by exchanging specimens, a practice that led to the formation of exchange clubs. Many aspects regarding the particular ways these botanists worked, as well as the roles they assumed, can be investigated with the aid of exchange networks.
Finally, a better understanding of the factors and processes influencing the composition of species occurrence datasets would be invaluable for improving data usability, especially for species distribution modeling (Daru et al., 2017). As biological collections are typically composed of an ensemble of opportunistic species occurrence records, each gathered in a particular context by a different collecting team, their datasets do not necessarily reflect the biological diversity of the areas in which the collections are physically located. Rather, they best reflect the interests of their most active and relevant collectors, that is, those who have contributed most extensively to the collection.

| Biodiversity genomics
The complete DNA sequence of an organism defines its genome, from simple species to more complex organisms such as vertebrates. With the advances in next-generation sequencing (NGS), previously unknown genomes have been sequenced, assembled, and deposited in public repositories of molecular data. These data are growing fast because of the decreasing cost of NGS and the increasing capacity of computational infrastructures (Z. D. Stephens et al., 2015; Lee & Amaro, 2018). Advances in the production of genomic information yield results in many application areas of benefit to society, including the production of valuable bioproducts at industrial scale (biofuels, bioenergy, cellulose fibers, gum chemicals, oils, and resins), the biomonitoring of species (e.g., viruses in epidemiological surveillance [E. C. Holmes, 2008]), and new drugs (vaccine design [Y. He, Preece, Hammock, Butler, & Pauw, 2015] and protein therapeutics [Leader, Baca, & Golan, 2008]).
Furthermore, molecular approaches are becoming one of the most relevant tools to support the taxonomist in species identification (Hebert, Cywinska, Ball, & DeWaard, 2003). The community has been searching for genes or genomic regions suitable as DNA barcode candidates, which include cytochrome c oxidase I (COI), used to identify animals (mammals, insects, fishes); the internal transcribed spacer (ITS) for fungi; the 16S ribosomal RNA (rRNA) subunit for identifying bacteria; and maturase K (matK) and ribulose-bisphosphate carboxylase (rbcL) for identifying plants. In this direction, the Barcode of Life Data System (BOLD; Ratnasingham & Hebert, 2007) provides an integrated bioinformatics platform that assists in the acquisition, storage, analysis, and publication of DNA barcode records. It is developed and hosted by the International Barcode of Life project (iBOL), one of the largest biodiversity genomics initiatives ever executed. Hundreds of biodiversity scientists, bioinformaticians, and technologists from 25 nations are working together to construct a richly parameterized DNA barcode reference library that will be the foundation for a DNA-based identification system for all multicellular life (Page, 2008).
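At its simplest, barcode-based identification assigns a query sequence to the species with the most similar reference barcode. The sketch below uses an alignment-free k-mer (Jaccard) similarity, which is only one of several possible measures; the reference library and sequences are invented:

```python
def kmers(seq: str, k: int = 4) -> set:
    """All length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)

# Hypothetical COI reference library (species -> barcode fragment).
reference = {
    "Species X": "ATGCGTACGTTAGCATCGGA",
    "Species Y": "ATGCGTTTTTTAGCATAAAA",
}

query = "ATGCGTACGTTAGCATCGGT"  # differs from Species X by one base
best = max(reference, key=lambda sp: similarity(query, reference[sp]))
print(best)
```

Production systems such as BOLD instead rely on curated alignments and distance thresholds, but the nearest-reference principle is the same.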
In general, macroscopic species are identified and cataloged by morphological aspects and receive a voucher identifier, which carries information about the specimen and metadata (geographical location, collector, collection date, etc.). For example, the Global Genome Biodiversity Network (GGBN) data portal (Droege et al., 2014, 2016) stores information about vouchered collections of DNA or tissue samples. Museums and institutions are joining efforts to link their biological collections with genetic data, such as nucleotide sequences from particular genes. GenBank (from the National Center for Biotechnology Information, NCBI), DDBJ (DNA Data Bank of Japan), and ENA (European Nucleotide Archive), which together form the International Nucleotide Sequence Database Collaboration (INSDC), are the most popular repositories for nucleotide sequences. Although the voucher tag has been present in GenBank since 1998, it remained poorly used (Schoch et al., 2014). Recently, the BioCollections database from NCBI connected specimen vouchers to sequence records in GenBank (Sharma et al., 2018).
The recovery of DNA from extinct species, called ancient DNA (aDNA), which provides resources to understand evolutionary processes, is another promising area of biodiversity genomics. The reconstruction of aDNA involves material derived from archeological specimens, mummified tissues, preserved plant remains, and other environments, such as permafrost and sediments (Burrell, Disotell, & Bergey, 2015). Until now, most of the extinct species sequenced belong to the mammalian megafauna and ancient humans (Campbell & Hofreiter, 2012). The experimental difficulties and challenges in this area relate to the quality of the material, which degrades over time and is often contaminated.
For microscopic organisms, NGS opens a new world of capabilities by reducing the need for culture isolation of microscopic species of the domains Bacteria and Archaea. This field is known as metagenomics, defined as the analysis of sequences taken from environmental samples, which are called metagenomes (Wooley et al., 2010). Sequencing these metagenomes produces sequence fragments, that is, sequence reads, of the organisms present in the environmental samples, which may belong to multiple species and are considered extremely challenging to analyze from a computational perspective. These sequences are usually filtered to exclude those belonging to taxa that are not of interest. The resulting datasets are described using metadata standards, such as MIxS (Yilmaz et al., 2011), to support data discovery and mining. Next, the overlapping sequence reads are used to obtain longer sequences called contigs in a process known as assembly. SPAdes (Bankevich et al., 2012) is one example of a tool used for assembly. Several important biological discoveries have been based on complete and near-complete genomes assembled from metagenomes. For instance, the possible ancestor of mitochondria and the possible ancestor of the first eukaryotic cells were proposed from phylogenies reconstructed from such genomes (Eme, Spang, Lombard, Stairs, & Ettema, 2017; Martijn, Vosseberg, Guy, Offre, & Ettema, 2018; Zaremba-Niedzwiedzka et al., 2017). After assembly, the resulting sequences can be analyzed to identify genes using either sequence alignment, for genes with homologs present in public databases, or ab initio gene prediction using, for instance, hidden Markov models. Other types of analyses involving metagenomes include the evaluation of species diversity and functional annotation (Wooley et al., 2010). Various tools compose these different analyses into metagenomic workflows.
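The overlap-and-merge idea behind assembly can be illustrated with a naive greedy sketch (real assemblers such as SPAdes use far more sophisticated graph-based algorithms; the reads below are invented):

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that matches a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads: list) -> list:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return reads  # remaining sequences are the contigs
        merged = reads[i] + reads[j][size:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)

reads = ["ATGGCGT", "GCGTACG", "TACGTTT"]
print(greedy_assemble(reads))
```

Here the three overlapping reads collapse into a single contig; with real data, sequencing errors, repeats, and reads from multiple species make the problem vastly harder, which is why metagenome assembly remains computationally challenging.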
An example is MG-RAST (Meyer et al., 2008; Wilke et al., 2016), a web portal that provides metagenomic dataset analysis workflows containing activities such as quality control, similarity-based annotation, and functional and taxonomic profiling. SUPER-FOCUS (G. G. Z. Silva, Green, et al., 2016) also produces functional and taxonomic profiles from metagenomic datasets; however, its organism identification is based on the alignment-free techniques used by the FOCUS (G. G. Z. Silva, Cuevas, Dutilh, & Edwards, 2014) tool. Metagenomics can support various environmental studies, such as the analysis of coral diseases. Garcia et al. (2013) identified taxonomic groups that were more abundant in Mussismilia braziliensis corals affected by white plague disease than in healthy corals of the same species. Integrating data from metagenomics with data from other aspects of biodiversity, such as species populations and environmental monitoring, is still a challenging task. More recently, there have been efforts to integrate the corresponding metadata standards (O'Tuama et al., 2012). Ongoing efforts to improve data integration in bioinformatics and biodiversity use semantic web techniques (Walls et al., 2014). These efforts are essential to supporting integrative ecosystem studies, such as that of Meirelles et al. (2015), in which different attributes of the ecosystem found in the mesophotic reefs of the Vitória-Trindade seamount chain were correlated to infer its properties.
Among the computational challenges in this area, we can point to questions related to the storage, recovery, and integration of information, as well as conceptual modeling, ontologies, and semantic representation of the molecular domain. Furthermore, bioinformatics analyses usually comprise multiple computational activities, including filtering, normalization, and annotation. Efforts to ensure the reproducibility (Cohen-Boulakia et al., 2017) of these analyses involve (but are not limited to) task composition tools (scripts [Babuji et al., 2019], pipelines, scientific workflows [Liew et al., 2016], and software containers [Boettiger, 2015]), web-based software platforms such as Galaxy (Bedoya-Reina et al., 2013), commonly used applications, and source code made available in repositories such as GitHub. We explore these issues in more detail in Section 4.6.

| Biodiversity workflows and reproducibility
Scientific data are being produced at an exponential growth rate by increasingly available scientific sensors. This, coupled with sophisticated computational models that process these data, has demanded new techniques (Hey, Tansley, & Tolle, 2009) for managing computational scientific experiments in a scalable and reproducible way. Wilson et al. (2014) propose best practices for managing scientific computations. These include recording the datasets, programs, libraries, and parameters used, including their respective identifiers or versions, to enable better reproducibility; and using high-level languages for programming, moving to lower-level languages only when performance improvement is necessary. These experiments are often specified as scientific workflows (Deelman, Gannon, Shields, & Taylor, 2009; Liew et al., 2016; Shade & Teal, 2015), which are given by a composition of computational tasks that exchange data through production and consumption relationships. A scientific workflow management system (SWMS) provides features such as fault tolerance, scalable execution, scalable data management, data dependency tracking, and provenance recording, which greatly reduce the complexity of managing the life cycle of these experiments (Mattoso et al., 2010). Scientific workflows are often provided through research data portals; Chard et al. (2018) present a design pattern for such portals for data-intensive scientific problems. Reproducibility (Peng, 2011) is an essential property in science. In computational research, it can be challenging, since one might need vast amounts of data or supercomputing resources to reproduce a result. However, the reproducibility of experiments allows the verification and validation of results by others and may increase the chances that they can be reused.
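A minimal sketch of the workflow idea: tasks composed through production/consumption relationships, executed in dependency order, with a simple provenance record kept per task (the three tasks are placeholders, not a real analysis):

```python
from datetime import datetime, timezone
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each task consumes the outputs of its parents.
def fetch():
    return [1, 2, 3]

def clean(data):
    return [x for x in data if x > 1]

def analyze(data):
    return sum(data)

tasks = {"fetch": (fetch, []), "clean": (clean, ["fetch"]),
         "analyze": (analyze, ["clean"])}
deps = {name: set(parents) for name, (_, parents) in tasks.items()}

results, provenance = {}, []
for name in TopologicalSorter(deps).static_order():
    func, parents = tasks[name]
    results[name] = func(*(results[p] for p in parents))
    # Record what ran, when, and which upstream outputs it consumed.
    provenance.append({"task": name, "inputs": parents,
                       "time": datetime.now(timezone.utc).isoformat()})

print(results["analyze"])
for entry in provenance:
    print(entry["task"], "<-", entry["inputs"])
```

An SWMS adds scheduling, fault tolerance, and scalable data handling on top of exactly this dependency-ordered execution model.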
This is especially relevant given the demand from journals in different domains for submissions of reproducible computational research, and also because of recent initiatives that encourage greater accessibility and transparency in scientific research (Stodden, Guo, & Ma, 2013; Vicente-Saez & Martinez-Fuentes, 2018). Sandve, Nekrutenko, Taylor, and Hovig (2013) propose rules that can be followed to better support reproducibility, including recording the steps that were executed to obtain a result, archiving the programs that were used in a computational experiment, and versioning the scripts and workflows used. Meng et al. (2015) propose a framework that tackles reproducibility by providing features for sandboxing and preserving computational environments; a combination of containers (Boettiger, 2015; Hale, Li, Richardson, & Wells, 2017) and tools for intercepting system calls is used to achieve preservation. Many computational experiments keep a detailed record of their execution, such as the datasets and computational tasks used, which enables easier verification of results. These records describe the provenance (Carata et al., 2014; Freire, Koop, Santos, & Silva, 2008) of the computational experiment, which can support the reproducibility and validation of e-Science experiments. Miles et al. (2007), for instance, propose an architecture for the validation of e-Science experiments based on both provenance assertions and ontologies. DataONE has also included support for tracking the provenance of its datasets (Cao et al., 2016). To improve the reuse of research data, Wilkinson et al. (2016) propose a set of guidelines for scientific data or digital assets to be findable, accessible, interoperable, and reusable, also known as the FAIR principles.
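The core of the W3C PROV model relates entities, activities, and agents through assertions such as "used" and "wasGeneratedBy". A minimal sketch of such a record and a query over it, with invented file and agent names, loosely following PROV's core relations:

```python
# Minimal PROV-style record (entities, activities, agents), loosely
# following the core relations of the W3C PROV data model.
prov = {
    "entity": {"occurrences.csv": {}, "model_output.tif": {}},
    "activity": {"run_enm": {"startedAtTime": "2020-01-15T10:00:00Z"}},
    "agent": {"researcher_1": {}},
    "used": [("run_enm", "occurrences.csv")],
    "wasGeneratedBy": [("model_output.tif", "run_enm")],
    "wasAssociatedWith": [("run_enm", "researcher_1")],
}

def trace_inputs(entity: str) -> list:
    """Trace an output entity back to the inputs its generating activity used."""
    activities = [a for e, a in prov["wasGeneratedBy"] if e == entity]
    return [e for a, e in prov["used"] if a in activities]

print(trace_inputs("model_output.tif"))
```

Queries like this, chained recursively, are what allow a reviewer to verify which raw data and programs produced a given result.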
The idea behind these principles is that the process of data recovery can become more automated, with minimal user intervention, since many experiments in different domains rely on computational support for the manipulation and analysis of data. One of the motivations to follow these principles is that good data management improves the quality of publications. The technologies and tools used to achieve each of the principles can vary. In biodiversity, for example, different commonly used tools can be combined to make experiments FAIR (Harjes, Link, Weibulat, Triebel, & Rambold, 2020).
Biodiversity follows the same trend of rapidly increasing data production found in other areas of science. Currently, biodiversity data are being integrated at a global scale through initiatives such as GBIF (Edwards, 2000). Techniques for the analysis and synthesis of biodiversity data, such as ENM (Elith & Leathwick, 2009; Peterson et al., 2011), are widely used. These analyses typically employ several different applications executed in a loosely coupled manner, making them a typical use case for scientific workflow management tools (J. Liu, Pacitti, Valduriez, & Mattoso, 2015). Next, we list some works related to scientific workflows and reproducibility in biodiversity. Pennington et al. (2007) describe the implementation of species distribution modeling (SDM) scientific workflows using Kepler (Ludäscher et al., 2006). Their approach allows for easy management of structural aspects of the scientific workflow, such as easily replacing application components. They also developed application components for data transformation and preprocessing, geospatial processing, and semantic annotation of processes. These experiments use occurrence data from the Mammal Networked Information System (MaNIS) and future climatological scenarios from the IPCC to predict the climate-change impact on more than 2,000 species. Morisette et al. (2013) present the Software for Assisted Habitat Modeling (SAHM), which allows for managing the various steps of SDM, including pre- and postprocessing activities. The implementation is coupled with the VisTrails (Freire et al., 2006) scientific workflow management system, which supports provenance management. Talbert, Talbert, Morisette, and Koop (2013) also describe SAHM and analyze the data management challenges of SDM using scientific workflows. Amaral et al. (2015) present the EUBrazilOpenBio Hybrid Data Infrastructure, which implements cloud services for the biodiversity domain, such as taxonomic mapping and resolution and SDM.
Scientific workflows are supported both with DAGMan (Couvares, Kosar, Roy, Weber, & Wenger, 2007) and EasyGrid AMS (Boeres & Rebello, 2004). The authors evaluate the execution of SDM on cloud computing resources, showing good performance. Candela, Castelli, Coro, Pagano, and Sinibaldi (2016) give a detailed description of an integrated cloud-based environment for SDM within the EUBrazilOpenBio Hybrid Data Infrastructure, which includes components for retrieving species occurrences and environmental layers and for executing various models to predict species distributions. Some SDM applications and workflows are available through web portals, such as the Biodiversity Virtual e-Laboratory (BioVeL; A. R. Hardisty et al., 2016), which offers a web-based environment for managing scientific workflows for biodiversity. Various predefined activities are available in its interface: geographical and temporal selection of occurrences (BioSTIF), data cleaning, taxonomic name resolution, ENM algorithms, population modeling, ecosystem modeling, and metagenomics and phylogenetics applications (Vicario, Balech, Donvito, Notarangelo, & Pesole, 2012).
iPlant (Goff et al., 2011) is a computational research infrastructure, or cyberinfrastructure, for plant science. Its applications include the Tree of Life, to produce phylogenetic trees of all green plant species, and Genotype to Phenotype, to predict plant phenotypes from their genetic data. Kurator (Dou et al., 2012) is a software package for the Kepler (Ludäscher et al., 2006) scientific workflow management system that supports composing various data curation activities into scientific workflows. Prebuilt activities include georeferencing, scientific name, and flowering time validators. Provenance is recorded to document all the transformations that data underwent through the various data curation activities. Nguyen et al. (2017) developed scientific workflows for assessing ecosystem risk based on IUCN guidelines (Keith et al., 2013) that use five rule-based criteria to assign one of eight risk categories, ranging from least concern (LC) to collapsed (CO). The assessment is performed in two phases. First, a stochastic ecosystem model is executed for the Meso-American Reef ecosystem risk assessment, predicting future reef properties under diverse scenarios. This step was implemented in both Nimrod/G (Abramson, Giddy, & Kotler, 2000) and Spark (Zaharia et al., 2016), for comparative purposes; the Spark version had considerably better performance in terms of computing time. Next, a workflow was implemented in the Kepler scientific workflow management system (Ludäscher et al., 2006) to execute the IUCN ecosystem risk assessment methodology, using the results of the stochastic ecosystem model execution and applying its five rule-based criteria. Reproducibility is an important property in this process, since risk assessment is often re-executed and its results need to be discussed by experts and decision-makers (Guru et al., 2016).
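A rule-based assignment of risk categories can be sketched as a cascade of thresholds. Note that the decline fractions and the use of a single criterion below are invented for illustration and do not reproduce the actual IUCN protocol, which evaluates five criteria and takes the highest resulting category:

```python
# Illustrative only: thresholds and the single "decline" criterion are
# hypothetical; the real IUCN assessment combines five rule-based criteria.
def assess_decline(decline_fraction: float) -> str:
    """Map a projected decline in ecosystem extent to a risk category label."""
    if decline_fraction >= 1.0:
        return "CO"  # collapsed
    if decline_fraction >= 0.8:
        return "CR"  # critically endangered
    if decline_fraction >= 0.5:
        return "EN"  # endangered
    if decline_fraction >= 0.3:
        return "VU"  # vulnerable
    if decline_fraction >= 0.2:
        return "NT"  # near threatened
    return "LC"      # least concern

print(assess_decline(0.55))
```

The point of encoding such rules in a workflow, as Nguyen et al. (2017) did, is that the same assessment can be re-executed deterministically whenever the underlying ecosystem model is updated.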
Borregaard and Hart (2016) discuss the importance of enabling reproducibility in ecology experiments, especially given the change in how they have been specified. Currently, scripting languages such as R and Python are increasingly adopted in the data analysis process. In this context, one of the challenges is to provide a means for users who do not have programming skills to replicate the experiments. Golding et al. (2018) present zoon, an R package that allows SDM to be reproducible and shareable through the specification of a workflow whose result is an R object containing the data, code, and results used in the analysis. The resulting object can be published to a data repository so that others can access it, and it can be loaded back into the R environment together with the package, allowing for reproducibility of the analysis. Cohen-Boulakia et al. (2017) explore the use of scientific workflows for the reproducibility of computational experiments in the life sciences. They analyze scientific workflow techniques and systems and evaluate to what extent they support reproducibility requirements in life science applications. Plant phenotyping, which evaluates how plants respond to different environmental conditions by monitoring their traits, was one of the use cases; for instance, keeping track of the several different tool versions used in a workflow and their respective compatibility is one of its reproducibility requirements. They define different levels of reproducibility in the workflow context. Consider two scientific workflows A and B, and assume A has already been executed.
When B is executed: repeatability is achieved when B contains exactly the same components as A; replicability is obtained when B uses similar (Starlinger, Brancotte, Cohen-Boulakia, & Leser, 2014) input components to A and both executions reach the same conclusion; reproducibility happens when both executions lead to the same scientific conclusion; and reusability is observed when the specification of B contains the specification of A. The authors analyze three workflow aspects from the reproducibility perspective: workflow specification, workflow execution, and workflow context and runtime environment. Workflow specifications can support better reusability through common specification languages, such as the Common Workflow Language (CWL), and annotations. Assessing workflow similarity is critical for reuse, but progress is still needed on this problem. Recording and analyzing workflow execution details can be supported by provenance information. Freire and Chirigati (2018) discuss other reproducibility levels that can be achieved and how they relate to provenance data covering the platform, implementation, and data used by an experiment. Depending on the type of provenance collected for each of these aspects, the experiment may be repeatable, re-runnable, portable, extendable, or modifiable. While most systems support the PROV (Moreau, Groth, Cheney, Lebo, & Miles, 2015) standard, visualizing and analyzing large provenance datasets is still challenging. Also, preserving the runtime environment remains a challenge, which is being addressed with virtualization technologies (Hale et al., 2017). WholeTale (Brinckman et al., 2019), for instance, is a computational environment with reproducibility features. It has components for data collection, identity management, and data publication, and interfaces to analytical tools, called frontends. These frontends manipulate data and can be given, for instance, by interactive notebooks such as Jupyter.
The system is integrated with DataONE, from which users can search and retrieve datasets. Frontends are packaged as Docker containers (Boettiger, 2015) that can be executed on high-performance computing resources. The interaction between the datasets and analytical tools is documented and recorded in a metadata management system. This allows for reproducing the entire computational research performed, from data retrieval to data analysis and its outputs, including the computational environments used. Feng et al. (2019) present a checklist for maximizing the reproducibility of ENM, describing in detail each step of the process and what should be preserved at each step to enable better reproducibility. Mondelli, Townsend Peterson, and Gadelha (2019) present a conceptual model and framework for supporting reproducibility and the FAIR principles in computational experiments, evaluated with an ENM case study.

| BIODIVERSITY INFORMATICS CHALLENGES AND CONCLUDING REMARKS
The acceleration of global changes requires constant assessment of their impacts on biodiversity and, consequently, on the ecosystem services that are essential to humans. Some areas of the globe, for example, the South Atlantic Ocean, remain highly understudied, and therefore their biodiversity is underestimated. A better understanding of marine biodiversity could be achieved with the help of biodiversity informatics, leveraging surveys to uncover novel species and systems, such as the Great Amazon Reef (Francini-Filho et al., 2018). To address this problem, biodiversity data must be systematically collected and analyzed. In this context, biodiversity informatics offers an essential collection of methodologies, tools, and techniques to achieve this goal. EBVs (Pereira et al., 2013) were proposed as a set of indicators that would allow for systematic monitoring of biodiversity. However, the production of these indicators is still a challenge (Peterson & Soberón, 2017), in particular regarding information gaps that can prevent global-scale inferences on the state of biodiversity. These inferences provide essential input to decision-makers in devising governmental policies toward meeting global targets on biodiversity conservation, such as the Aichi Biodiversity Targets. The Bari Manifesto (A. R. Hardisty et al., 2019) was proposed as a set of guidelines for biodiversity informatics infrastructures to enable the implementation of scientific workflows for measuring or estimating EBVs, gathering data from potentially multiple infrastructures and countries. In this section, we describe existing challenges for biodiversity informatics to become a systematic and global-scale tool for monitoring and making inferences about biodiversity. In Table 4, we summarize the tools and databases surveyed, along with the steps of the biodiversity informatics life cycle they address.
As described in Section 3.2, detailed information about most organisms is still very scarce (Peterson, 2006). This hinders the usage of these data in many biodiversity data analysis applications, such as ENM. Furthermore, we described other issues with biodiversity data, such as biases and frequent taxonomic and georeferencing errors. Therefore, one of the challenges of biodiversity informatics is not only to increase the amount of available data, filling some of the existing gaps, but also to reduce its bias and improve its quality. Some promising work addressing these issues is listed next:
• Heidorn (2008) observes that data from smaller scientific projects are rarely available to other scientists, even though their aggregated size and value for research are considerable. This phenomenon is termed the long tail of science. In biodiversity and ecology, some progress has been achieved through projects such as GBIF and DataONE, which receive a considerable share of their datasets from small research groups. One obstacle to making these datasets available is the effort required to map the concepts present in them to the standard vocabulary terms used in major biodiversity databases. Entity resolution techniques (Köpcke, Thor, & Rahm, 2010) have the potential to assist and speed up these record linkage routines.
• Data collection could be substantially intensified by applying artificial intelligence methods to automate specimen identification. Preliminary work in this direction includes the application of deep learning techniques for species identification in herbarium sheets (Carranza-Rojas et al., 2017; Carranza-Rojas, Joly, Goëau, Mata-Montero, & Bonnet, 2018).
• Remote sensing provides the opportunity to observe the Earth regularly and could therefore benefit biodiversity monitoring by increasing the amount of data collected. It can also be a valuable tool for observing areas that are difficult to access through field expeditions. One of the pioneering works in this area, by Holden and Ledrew (1999), used hyperspectral remote sensing to monitor coral reefs. Clark et al. (2005) identified tree species using remote sensing images. Fretwell et al. (2012) used satellite-based remote sensing to survey the Emperor Penguin on a global scale. Fernández et al. (2020) have suggested combining remote sensing and on-site observations to estimate EBVs using biodiversity modeling.
• As observed in Section 3.2, assessing the quality of a dataset is a critical step for any subsequent analysis and synthesis activity that might use it. Users should establish the intended use of datasets in their research. Determining whether a dataset is fit for a particular use is still a challenge in biodiversity informatics, since records available in public databases contain various types of errors. A promising approach was proposed by Veiga et al. (2017): a framework for biodiversity data quality assessment and management that allows users to define their data quality requirements, and to determine when a particular dataset is fit for use, in a standardized manner. P. J. Morris et al. (2018) made further progress by implementing a library of small data quality assessment routines that can be composed into more complex workflows, reporting data quality in terms of the framework proposed by Veiga et al. (2017).
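The idea of composing small, single-purpose quality checks into a fitness-for-use report, in the spirit of the Veiga et al. (2017) framework and the Morris et al. (2018) library, can be sketched as follows. The check names, field names, and COMPLIANT/NOT_COMPLIANT labels are simplified illustrations, not the actual API of those tools:

```python
def check_coordinates(record):
    """Flag records whose coordinates are missing or out of valid ranges."""
    try:
        lat = float(record.get("decimalLatitude", ""))
        lon = float(record.get("decimalLongitude", ""))
    except ValueError:
        return "NOT_COMPLIANT: coordinates missing or non-numeric"
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return "COMPLIANT"
    return "NOT_COMPLIANT: coordinates out of range"

def check_scientific_name(record):
    """A coarse check: binomial names should have at least two words."""
    name = record.get("scientificName", "").strip()
    return "COMPLIANT" if len(name.split()) >= 2 else "NOT_COMPLIANT: name incomplete"

def assess(record, checks):
    """Run a composed workflow of checks and report fitness-for-use."""
    report = {c.__name__: c(record) for c in checks}
    report["fit_for_use"] = all(v == "COMPLIANT" for v in report.values())
    return report
```

Because each check is an independent function, users can compose exactly the checks that match their intended use of the data, which is the central point of the framework.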
In the GBIO report (Hobern et al., 2013), produced by leading biodiversity informatics researchers, a number of areas of biodiversity informatics with limited or minimal progress were identified; these can be considered research challenges. Biological systems modeling was considered an area with minimal progress. Advances in this area could involve computational (in silico) models or simulations ranging from single organisms to entire ecosystems. Current temporal and spatial modeling in biodiversity, such as the ENM approaches described in Section 4.1, relies on species occurrence data. More fine-grained modeling would require incorporating species trait data (Schneider et al., 2019). Cardinale et al. (2012), for instance, advocated the development of new predictive models that take species interactions into account to predict the impact of biodiversity on ecosystem processes based on species traits. Areas of limited progress identified by the GBIO included:
• Automated remote-sensed observation, which has the potential to enable observation of biodiversity in large and remote areas. More recently, preliminary work was conducted on defining biodiversity indicators that could be derived from images collected by satellite remote sensing (Pettorelli et al., 2016) and processed using statistical analysis and classification algorithms.
• Identifying trends and making predictions about biodiversity, which could determine future trends in biodiversity under different global change scenarios. Ongoing research in this area includes predicting how climate change will affect species distributions (de Siqueira & Peterson, 2003; Thomas et al., 2004; M. Araújo et al., 2008; Wiens et al., 2009; M. B. Araújo & Peterson, 2012) and zoonotic diseases (Estrada-Peña, Ostfeld, Peterson, Poulin, & de la Fuente, 2014).
• Providing access to aggregate species trait data, consisting of data on species characteristics and their interactions.

TABLE 4 A selection of biodiversity informatics databases and tools classified according to target life-cycle step: data planning and collection (DC), data quality and fitness-for-use (DQ), data description (DD), data preservation and publication (DP), data discovery and integration (DI), and computational modeling and data analysis (CM).
Data are incomplete: little information is available about the relative abundances of species, their traits, and how they interact. This information is needed for building better models of ecosystem processes, and it can enable more complex biological systems modeling, such as evolutionary inference. MorphoBank (O'Leary & Kaufman, 2011), for instance, allows scientists to upload images of organism morphology together with associated data.
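To illustrate why interaction data matters for the predictive models Cardinale et al. (2012) call for, the sketch below uses a toy food web (the species and links are hypothetical, chosen only for illustration) to reason about secondary extinctions: which species lose all their food sources when another goes extinct.

```python
# Directed trophic interactions for a toy food web: predator -> set of prey.
# Species names and links are hypothetical, for illustration only.
eats = {
    "zooplankton": {"phytoplankton"},
    "anchoveta": {"zooplankton"},
    "sea_lion": {"anchoveta"},
}

def secondary_losses(extinct, eats):
    """Species left with no remaining prey after a cascade of extinctions."""
    gone = set(extinct)
    changed = True
    while changed:
        changed = False
        for predator, prey in eats.items():
            # A predator is lost when every one of its prey species is gone.
            if predator not in gone and prey and prey <= gone:
                gone.add(predator)
                changed = True
    return gone - set(extinct)
```

Even this trivial model shows that the consequence of an extinction depends on the interaction structure, not just on the species count; real models would additionally weight links by trait data such as body mass and diet breadth.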
Scientists often need to combine multiple sources of data (Fujioka et al., 2014; Jones et al., 2006) in biodiversity analysis and synthesis activities. Although there are many gaps in biodiversity data, such as the reduced availability of species trait data, there are many machine-readable and freely available, that is, open, datasets (Reichman et al., 2011) from areas such as remote sensing, socioeconomics, and climatology that can be integrated into biodiversity studies. Open data are widely available online, including data provided by many governments. However, these data are highly heterogeneous, dispersed across multiple sources, and may lack metadata or a schema. Metadata, described in Section 3.3, helps in discovering datasets and in integrating them when dataset attribute definitions are provided, as is possible with EML (Fegraus et al., 2005). Semantic web technologies (Walls et al., 2014), described in Section 3.5, can also support data integration through existing ontologies for biodiversity and other domains. However, their usefulness depends on the widespread adoption of ontologies and metadata standards by data providers, a process that is still underway. A promising approach to overcoming these limitations is to use machine learning techniques to support open data integration activities (Dong & Rekatsinas, 2018; Miller, 2018), such as entity matching (Mudgal et al., 2018; Nargesian, Zhu, Pu, & Miller, 2018). These recently proposed techniques could be leveraged and extended for integrating biodiversity and other related datasets.
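A minimal sketch of one such integration step, schema matching, is shown below: local column names from a heterogeneous dataset are mapped to Darwin Core terms by string similarity. The column names and threshold are illustrative; the learned matchers cited above replace this hand-rolled similarity with trained models, but the structure of the task is the same.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_fields(source_fields, dwc_terms, threshold=0.6):
    """Map each local column name to the closest Darwin Core term,
    keeping only matches above the confidence threshold."""
    mapping = {}
    for field in source_fields:
        best = max(dwc_terms, key=lambda t: similarity(field, t))
        if similarity(field, best) >= threshold:
            mapping[field] = best
    return mapping
```

Unmatched columns (those below the threshold) would be left for a human curator to resolve, which is exactly the bottleneck that learned entity matching aims to shrink.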
In Table 5, we describe the computational techniques used in each step of the biodiversity informatics life cycle that were explored in this work.
Trends, indicators, and facts derived from biodiversity data analysis and synthesis activities may guide governmental decision making in critical areas such as conservation area planning, impact assessment of large construction projects, and zoonotic disease prevention. Since such decisions can have large-scale impacts on society, being able to trace back the processes involved in reaching them is essential. Therefore, it is important that these activities use methodologies and techniques that are reproducible (Cohen-Boulakia et al., 2017; Ivie & Thain, 2018; Peng, 2011; Sandve et al., 2013). Some initial advances were achieved in projects such as DataONE (Cao et al., 2016) and WholeTale (Brinckman et al., 2019), which record the provenance of biodiversity datasets and of their analysis and synthesis. However, reproducibility in computational science in general is still a challenge. Providing reproducible frameworks for biodiversity analysis and synthesis activities would enable better decision traceability and validation of the trends and indicators they produce.
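One simple realization of such traceability is an append-only provenance log in which each analysis step is chained to its predecessor by a hash, so a published indicator can be walked back to the raw data it was derived from. The sketch below (activity names, inputs, and parameters are hypothetical) illustrates the idea:

```python
import hashlib
import json

class ProvenanceLog:
    """Append-only lineage: each step records its activity, inputs, and
    parameters, plus a hash that chains it to the previous step."""

    def __init__(self):
        self.steps = []

    def record(self, activity, inputs, params):
        prev = self.steps[-1]["hash"] if self.steps else "root"
        payload = json.dumps(
            {"activity": activity, "inputs": inputs,
             "params": params, "prev": prev},
            sort_keys=True)
        entry = {"activity": activity, "inputs": inputs, "params": params,
                 "prev": prev,
                 "hash": hashlib.sha256(payload.encode()).hexdigest()}
        self.steps.append(entry)
        return entry["hash"]

    def trace(self):
        """Walk the chain backwards, from final indicator to raw data."""
        return [s["activity"] for s in reversed(self.steps)]
```

Because each entry's hash covers its predecessor's hash, tampering with or reordering an intermediate step is detectable, which is the property decision traceability needs.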

ACKNOWLEDGMENTS
The work is partially supported by CAPES (

CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.