BioThings Standalone¶
This step-by-step guide shows how to use BioThings standalone instances. Standalone instances are based on Docker containers and provide a fully pre-configured, ready-to-use BioThings API that can easily be maintained and kept up-to-date. The idea is to let any user run their own APIs locally and fulfill different needs:
keep all API requests private and local to your own server
enrich existing and publicly available data found on our APIs with some private data
run API on your own architecture to perform heavy queries that would sometimes be throttled out from online services
Quick Links¶
If you already know how to run a BioThings standalone instance, you can download the latest available Docker images from the following tables.
Warning
Some data sources, managed and served by standalone instances (production, demo and old), have restricted licenses. Therefore these standalone instances must only be used for non-profit purposes. By clicking on the following links, you agree to and understand these restrictions. If you’re a for-profit company and would like to run a standalone instance, please contact us.
Note
Images don’t contain data, but are ready to download data and keep it up-to-date by running simple commands through the hub.
List of standalone instances¶
¶
Production and old data require at least 30GiB disk space.

Production | Demo | Old
---|---|---
¶
Production and old data require at least 2TiB disk space.

Production | Demo | Old
---|---|---
¶
Production and old data require at least 150GiB disk space.

Production | Demo | Old
---|---|---
Soon! | |
Prerequisites¶
Using standalone instances requires a Docker server up and running, and some basic knowledge
of commands to run and use containers. Images have been tested on Docker >=17. On the AWS cloud,
you can use our public AMI biothings_demo_docker (ami-44865e3c in the Oregon region) with Docker pre-configured
and ready for standalone demo instance deployment. We recommend an instance type with at least
8GiB RAM, such as t2.large. The AMI comes with an extra 30GiB EBS volume, which should be enough to
deploy any demo instance.
Alternately, you can install your own Docker server (on recent Ubuntu systems, sudo apt-get install docker.io
is usually enough). You may need to point the Docker images directory to a specific hard drive to get enough space,
using the -g option:
# /mnt/docker points to a hard drive with enough disk space
sudo echo 'DOCKER_OPTS="-g /mnt/docker"' >> /etc/default/docker
# restart to make this change active
sudo service docker restart
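On recent Docker releases, the -g daemon flag has been superseded by the data-root setting in daemon.json. A sketch of the equivalent configuration (the /mnt/docker path is the same assumption as above; note this overwrites any existing daemon.json):

```shell
# Recent Docker versions configure the storage location via daemon.json
# instead of the -g flag; /mnt/docker is an example path, as above.
sudo mkdir -p /etc/docker
# Warning: this overwrites any existing daemon.json
echo '{"data-root": "/mnt/docker"}' | sudo tee /etc/docker/daemon.json
sudo service docker restart
```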
Demo instances use very little disk space, as only a small subset of data is available. For instance, the myvariant demo only requires ~10GiB to run with up-to-date demo data, including the whole Linux system and all other dependencies. Demo instances provide a quick and easy way to set up a running API, without having to deal with advanced system configuration.
For deployment with production or old data, you may need a large amount of disk space (refer to the Quick Links section for details), bigger instance types, and possibly a full cluster architecture deployment. We’ll soon provide guidelines and deployment scripts for this purpose.
What you’ll learn¶
Through this guide, you’ll learn:
how to obtain a Docker image to run your favorite API
how to run that image inside a Docker container and how to access the web API
how to connect to the hub, a service running inside the container, used to interact with the API systems
how to use that hub, through specific commands, to update and keep data up-to-date
Data found in standalone instances¶
All BioThings APIs (mygene.info, myvariant.info, …) provide data releases in different flavors:
Production data, the actual data found on the live APIs that we, the BioThings team at SuLab, run and keep up-to-date on a regular basis. Please contact us if you’re interested in obtaining this type of data.
Demo data, a small subset of production data, publicly available
Old production data, a full production dataset that is at least one year old, publicly available
The following guide applies to demo data only, though the process would be very similar for other types of data flavors.
Downloading and running a standalone instance¶
Standalone instances are available as Docker images. For the purpose of this guide, we’ll set up an instance running the mygene API,
containing demo data. Links to standalone demo Docker images can be found in Quick Links at the beginning of this guide.
Use one of those links, or this direct link to mygene’s demo instance, and download the Docker image file using your favorite browser or wget:
$ wget http://biothings-containers.s3-website-us-west-2.amazonaws.com/demo_mygene/demo_mygene.docker
You must have a running Docker server in order to use that image. Typing docker ps should return all running containers, or
at least an empty list as in the following example. Depending on the system and configuration, you may have to prefix commands with sudo
to access the Docker server.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
Once downloaded, the image can be loaded into the server:
$ docker image load < demo_mygene.docker
$ docker image list
REPOSITORY TAG IMAGE ID CREATED SIZE
demo_mygene latest 15d6395e780c 6 weeks ago 1.78GB
The image is now loaded; its size is ~1.78GiB, and it contains no data (yet). A Docker container can now be instantiated from that image, to create a BioThings standalone instance, ready to be used.
A standalone instance is a pre-configured system containing several parts. The BioThings hub is the system used to interact with the BioThings backend and perform operations such as downloading data and creating/updating ElasticSearch indices. Those indices are used by the actual BioThings web API system to serve data to end-users. The hub can be accessed through a standard SSH connection or through REST API calls. In this guide, we’ll use the SSH server.
A BioThings instance exposes several services on different ports:
80: BioThings web API port
7022: BioThings hub SSH port
7080: BioThings hub REST API port
9200: ElasticSearch port
We will map and expose those ports to the host server using the -p option, so we can access BioThings services without
having to enter the container (e.g. the hub SSH port will be accessible on host port 19022).
$ docker run --name demo_mygene -p 19080:80 -p 19200:9200 -p 19022:7022 -p 19090:7080 -d demo_mygene
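Rather than polling by hand, a small helper can wait until the mapped ports start accepting connections. This is a sketch, not part of the BioThings tooling; the host and port values are the mappings from the docker run command above.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0):
    """Poll until a TCP port accepts connections, or give up after `timeout` seconds.

    Useful to know when the container's services are ready, e.g. ES on
    19200, the web API on 19080 or the hub SSH on 19022 with the port
    mappings shown above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.5)
    return False

# e.g. wait_for_port("localhost", 19200) before querying ElasticSearch
```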
Note
The instance stores ElasticSearch data in the /var/lib/elasticsearch/ directory, and downloaded data and logs
in the /data/ directory. Those two locations may require extra disk space; if needed, the Docker option -v
can be used to mount a directory from the host inside the container. Please refer to the Docker documentation.
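For example, a hypothetical run mounting both locations from the host (the host paths /mnt/es_data and /mnt/biothings_data are illustrative; you may also need to adjust ownership so the container services can write there):

```shell
# Illustrative only: host paths are assumptions, adjust to your setup
docker run --name demo_mygene \
  -v /mnt/es_data:/var/lib/elasticsearch \
  -v /mnt/biothings_data:/data \
  -p 19080:80 -p 19200:9200 -p 19022:7022 -p 19090:7080 \
  -d demo_mygene
```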
Let’s enter the container to check everything is running fine. Services may take a while (up to 1 min) to fully start. If some services are missing, the troubleshooting section may help.
$ docker exec -ti demo_mygene /bin/bash
root@a6a6812e2969:/tmp# netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:7080 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:7022 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 25/nginx
tcp 0 0 127.0.0.1:8881 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8882 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8883 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8884 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8885 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8886 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8887 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8888 0.0.0.0:* LISTEN -
tcp6 0 0 :::7080 :::* LISTEN -
tcp6 0 0 :::7022 :::* LISTEN -
tcp6 0 0 :::9200 :::* LISTEN -
tcp6 0 0 :::9300 :::* LISTEN -
We can see the different BioThings services’ ports: 80, 7022 and 7080. All 888x ports correspond to Tornado instances running behind Nginx on port 80; they shouldn’t be accessed directly. Ports 9200 and 9300 are standard ElasticSearch ports (9200 can be used to perform queries directly on ES, if needed).
At this point, the standalone instance is up and running. No data has been downloaded yet, let’s see how to populate the BioThings API using the hub.
Updating data using the BioThings hub¶
If the standalone instance has been freshly started, there’s no data to be queried by the API. If we make an API call, such as fetching metadata, we’ll get an error:
# from Docker host
$ curl -v http://localhost:19080/metadata
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 19080 (#0)
> GET /metadata HTTP/1.1
> Host: localhost:19080
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Date: Tue, 28 Nov 2017 18:19:23 GMT
< Content-Type: text/html; charset=UTF-8
< Content-Length: 93
< Connection: keep-alive
< Server: TornadoServer/4.5.2
<
* Connection #0 to host localhost left intact
This 500 error reflects a missing index (ElasticSearch index, the backend used by BioThings web API). We can have a look at existing indices in ElasticSearch:
# from Docker host
$ curl http://localhost:19200/_cat/indices
yellow open hubdb 5 1 0 0 795b 795b
There’s only one index, hubdb, which is an internal index used by the hub. There is no index containing actual biological data yet…
The BioThings hub is a service running inside the instance; it can be accessed through an SSH connection, or using REST API calls.
For the purpose of this guide, we’ll use SSH. Let’s connect to the hub (type yes to accept the key on first connection):
# from Docker host
$ ssh guest@localhost -p 19022
The authenticity of host '[localhost]:19022 ([127.0.0.1]:19022)' can't be established.
RSA key fingerprint is SHA256:j63IEgXc3yJqgv0F4wa35aGliH5YQux84xxABew5AS0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[localhost]:19022' (RSA) to the list of known hosts.
Welcome to Auto-hub, guest!
hub>
We’re now connected to the hub, inside a python shell where the application is actually running. Let’s see what commands are available:
Warning
the hub console, though accessed through SSH, is not a Linux shell (such as bash), it’s a python interpreter shell.
hub> help()
Available commands:
versions
check
info
download
apply
step_update
update
help
Type: 'help(command)' for more
versions() will display all available data build versions we can download to populate the API
check() will return whether a more recent version is available online
info() will display the current local API version, and information about the latest one available online
download() will download the data compatible with the current local version (but without populating the ElasticSearch index)
apply() will use local data previously downloaded to populate the index
step_update() will bring the data release to the next one (one step in versions), compatible with the current local version
update() will bring data to the latest available online (using a combination of download and apply calls)
Note
update() is the fastest, easiest and preferred way to update the API. download, apply and step_update are available
when it’s necessary to bring the API data to a specific version (not the latest one); they are considered more advanced,
and won’t be covered in this guide.
Note
Because the hub console is actually a python interpreter, we call the commands using parenthesis, just like functions or methods. We can also pass arguments when necessary, just like standard python (remember: it is python…)
Note
After each command is typed, we need to press “enter” to get either its status (still running) or its result.
Let’s explore some more.
hub> info()
[2] RUN {0.0s} info()
hub>
[2] OK info(): finished
>>> Current local version: 'None'
>>> Release note for remote version 'latest':
Build version: '20171126'
=========================
Previous build version: '20171119'
Generated on: 2017-11-26 at 03:11:51
+---------------------------+---------------+-------------+-----------------+---------------+
| Updated datasource | prev. release | new release | prev. # of docs | new # of docs |
+---------------------------+---------------+-------------+-----------------+---------------+
| entrez.entrez_gene | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_refseq | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_unigene | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_go | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_genomic_pos | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_retired | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_accession | 20171118 | 20171125 | 10,003 | 10,003 |
| generif | 20171118 | 20171125 | 10,003 | 10,003 |
| uniprot | 20171025 | 20171122 | 10,003 | 10,003 |
+---------------------------+---------------+-------------+-----------------+---------------+
Overall, 9,917 documents in this release
0 document(s) added, 0 document(s) deleted, 130 document(s) updated
We can see here that we don’t have any local data release (Current local version: 'None'), whereas the latest online version (at that time) is from
November 26th 2017. We can also see the release note with the different changes involved in the release (whether a datasource has a new version, and the number
of documents that changed).
hub> versions()
[1] RUN {0.0s} versions()
hub>
[1] OK versions(): finished
version=20171003 date=2017-10-05T09:47:59.413191 type=full
version=20171009 date=2017-10-09T14:47:10.800140 type=full
version=20171009.20171015 date=2017-10-19T11:44:47.961731 type=incremental
version=20171015.20171022 date=2017-10-25T13:33:16.154788 type=incremental
version=20171022.20171029 date=2017-11-14T10:34:39.445168 type=incremental
version=20171029.20171105 date=2017-11-06T10:55:08.829598 type=incremental
version=20171105.20171112 date=2017-11-14T10:35:04.832871 type=incremental
version=20171112.20171119 date=2017-11-20T07:44:47.399302 type=incremental
version=20171119.20171126 date=2017-11-27T10:38:03.593699 type=incremental
Data comes in two distinct types:
full: this is a full data release, corresponding to an ElasticSearch snapshot, containing all the data
incremental: this is a differential/incremental release, produced by computing the differences between two consecutive versions. The diff data is then used to patch an existing, compatible data release to bring it to the next version.
So, in order to obtain the latest version, the hub will first find a compatible release to start from. Since the instance is currently empty (no data), it will
use the most recent full release, 20171009, and then apply the incremental updates sequentially (20171009.20171015, then 20171015.20171022,
then 20171022.20171029, etc., up to 20171119.20171126).
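The path resolution described above can be sketched in a few lines of Python. This is an illustration of the logic, not the hub’s actual implementation; the version list is the one returned by versions() above.

```python
def resolve_update_path(versions, local=None):
    """Resolve the ordered list of releases to apply to reach the latest.

    versions: list of (version, type) tuples, where incremental versions
    are 'from.to' strings as shown by the versions() hub command.
    """
    fulls = [v for v, t in versions if t == "full"]
    # index incremental releases by the version they apply on top of
    incs = {v.split(".")[0]: v for v, t in versions if t == "incremental"}
    path = []
    if local is None:
        # empty instance: start from the most recent full release
        local = max(fulls)
        path.append(local)
    # then chain incremental releases as long as one matches
    while local in incs:
        step = incs[local]
        path.append(step)
        local = step.split(".")[1]
    return path

versions = [
    ("20171003", "full"),
    ("20171009", "full"),
    ("20171009.20171015", "incremental"),
    ("20171015.20171022", "incremental"),
    ("20171022.20171029", "incremental"),
    ("20171029.20171105", "incremental"),
    ("20171105.20171112", "incremental"),
    ("20171112.20171119", "incremental"),
    ("20171119.20171126", "incremental"),
]
print(resolve_update_path(versions))
```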
Let’s update the API:
hub> update()
[3] RUN {0.0s} update()
hub>
[3] RUN {1.3s} update()
hub>
[3] RUN {2.07s} update()
After a while, the API is up-to-date, and we can run the info() command again (it can also be used to track update progress):
hub> info()
[4] RUN {0.0s} info()
hub>
[4] OK info(): finished
>>> Current local version: '20171126'
>>> Release note for remote version 'latest':
Build version: '20171126'
=========================
Previous build version: '20171119'
Generated on: 2017-11-26 at 03:11:51
+---------------------------+---------------+-------------+-----------------+---------------+
| Updated datasource | prev. release | new release | prev. # of docs | new # of docs |
+---------------------------+---------------+-------------+-----------------+---------------+
| entrez.entrez_gene | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_refseq | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_unigene | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_go | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_genomic_pos | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_retired | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_accession | 20171118 | 20171125 | 10,003 | 10,003 |
| generif | 20171118 | 20171125 | 10,003 | 10,003 |
| uniprot | 20171025 | 20171122 | 10,003 | 10,003 |
+---------------------------+---------------+-------------+-----------------+---------------+
Overall, 9,917 documents in this release
0 document(s) added, 0 document(s) deleted, 130 document(s) updated
The local version is 20171126 and the remote version is 20171126: we’re up-to-date. We can also use check():
hub> check()
[5] RUN {0.0s} check()
hub>
[5] OK check(): finished
Nothing to dump
Nothing to dump means there’s no available remote version that can be downloaded. It would otherwise return a version number, meaning
we would be able to update the API again using the update() command.
Press Control-D to exit from the hub console.
Querying ElasticSearch, we can see a new index, named biothings_current, has been created and populated:
$ curl http://localhost:19200/_cat/indices
green open biothings_current 1 0 14903 0 10.3mb 10.3mb
yellow open hubdb 5 1 2 0 11.8kb 11.8kb
We now have a populated API we can query:
# from Docker host
# get metadata (note the build_version field)
$ curl http://localhost:19080/metadata
{
"app_revision": "672d55f2deab4c7c0e9b7249d22ccca58340a884",
"available_fields": "http://mygene.info/metadata/fields",
"build_date": "2017-11-26T02:58:49.156184",
"build_version": "20171126",
"genome_assembly": {
"rat": "rn4",
"nematode": "ce10",
"fruitfly": "dm3",
"pig": "susScr2",
"mouse": "mm10",
"zebrafish": "zv9",
"frog": "xenTro3",
"human": "hg38"
},
# annotation endpoint
$ curl http://localhost:19080/v3/gene/1017?fields=alias,ec
{
"_id": "1017",
"_score": 9.268311,
"alias": [
"CDKN2",
"p33(CDK2)"
],
"ec": "2.7.11.22",
"name": "cyclin dependent kinase 2"
}
# query endpoint
$ curl http://localhost:19080/v3/query?q=cdk2
{
"max_score": 310.69254,
"took": 37,
"total": 10,
"hits": [
{
"_id": "1017",
"_score": 310.69254,
"entrezgene": 1017,
"name": "cyclin dependent kinase 2",
"symbol": "CDK2",
"taxid": 9606
},
{
"_id": "12566",
"_score": 260.58084,
"entrezgene": 12566,
"name": "cyclin-dependent kinase 2",
"symbol": "Cdk2",
"taxid": 10090
},
...
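From a script, the same endpoints can be queried with the standard library. A minimal sketch (top_symbols is a hypothetical helper, and the response shape is the one shown in the curl output above):

```python
def top_symbols(data, n=3):
    """Extract the top-n gene symbols from a /v3/query JSON response
    (response shape as shown in the curl output above)."""
    return [hit.get("symbol") for hit in data.get("hits", [])[:n]]

# Against the local instance (port 19080 as mapped above), this would be:
#   import json
#   from urllib.request import urlopen
#   with urlopen("http://localhost:19080/v3/query?q=cdk2") as resp:
#       print(top_symbols(json.load(resp)))
```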
BioThings API with multiple indices¶
Some APIs use more than one ElasticSearch index. For instance, myvariant.info uses one index for the hg19 assembly, and one index
for the hg38 assembly. With such APIs, the available commands carry a suffix showing which index (thus, which data release) they relate to.
Here’s the output of help() from myvariant’s standalone instance:
hub> help()
Available commands:
versions_hg19
check_hg19
info_hg19
download_hg19
apply_hg19
step_update_hg19
update_hg19
versions_hg38
check_hg38
info_hg38
download_hg38
apply_hg38
step_update_hg38
update_hg38
help
For instance, the update() command is now available as update_hg19() and update_hg38(), depending on the assembly.
Troubleshooting¶
We test, as much as we can, that standalone images are up-to-date and that the hub runs properly for each data release. But things can still go wrong…
First make sure all services are running. Enter the container and type netstat -tnlp; you should see
services listening on the expected ports (see the list above). If the services on ports 7080 or 7022 aren’t running,
the hub has not started. If you just started the instance, wait a little longer, as services may take a while
to fully start.
If after ~1 min you still don’t see the hub running, log in as user biothings and check the starting sequence.
Note
Hub is running in a tmux session, under user biothings
# sudo su - biothings
$ tmux a # recall tmux session
(pyenv) biothings@a6a6812e2969:~/mygene.info/src$ python -m biothings.bin.autohub
INFO:root:Hub DB backend: {'module': 'biothings.utils.es', 'host': 'localhost:9200'}
INFO:root:Hub database: hubdb
DEBUG:asyncio:Using selector: EpollSelector
start
You should see output like the above. If not, you should see the actual error, and depending on the error, you may be able to
fix it (not enough disk space, etc.). The hub can be started again with python -m biothings.bin.autohub from within the application
directory (in our case, /home/biothings/mygene.info/src/).
Note
Press Control-B then D to detach the tmux session and leave the hub running in the background.
Logs are available in /data/mygene.info/logs/. You can have a look at:
dump_*.log files for logs about data download
upload_*.log files for logs about index update in general (full/incremental)
sync_*.log files for logs about incremental update only
hub_*.log files for general logs about the hub process
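To quickly scan those logs for problems, a small helper along these lines can be used. This is a sketch; the ERROR/Traceback pattern is an assumption about what the hub writes, so adjust it as needed.

```python
import re
from pathlib import Path

def find_errors(log_dir, pattern=r"ERROR|Traceback"):
    """Scan hub log files for suspicious lines.

    Returns a {filename: [matching lines]} dict. The default pattern is an
    assumption; adapt it to the messages your hub actually logs.
    """
    hits = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        lines = [line.rstrip("\n")
                 for line in path.open(errors="replace")
                 if re.search(pattern, line)]
        if lines:
            hits[path.name] = lines
    return hits

# e.g. find_errors("/data/mygene.info/logs")
```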
Finally, you can report issues and request help by joining the BioThings Google Group (https://groups.google.com/forum/#!forum/biothings)