# Improving performance
This page contains tips that may help to increase indexing speed.
## Configure table indexes
Postgres indexes are auxiliary data structures that Postgres can use to speed up data lookups. A database index acts like a pointer to data in a table, much like the index in a printed book: if you look in the index first, you will find the data much faster than by searching the whole book (or, in this case, the database).
You should add indexes on columns that often appear in `WHERE` clauses of your GraphQL queries and subscriptions.
DipDup ORM uses BTree indexes by default. To set an index on a field, add `db_index=True` to the field definition:
```python
from dipdup import fields
from dipdup.models import Model


class Trade(Model):
    id = fields.BigIntField(primary_key=True)
    amount = fields.BigIntField()
    level = fields.BigIntField(db_index=True)
    timestamp = fields.DatetimeField()
```
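If your queries filter on several columns at once, a single composite index can serve them better than separate single-column indexes. A minimal sketch, assuming the `indexes` option of the Tortoise-style `Meta` class is supported by your DipDup version (the field combination here is illustrative):

```python
from dipdup import fields
from dipdup.models import Model


class Trade(Model):
    id = fields.BigIntField(primary_key=True)
    amount = fields.BigIntField()
    level = fields.BigIntField()
    timestamp = fields.DatetimeField()

    class Meta:
        # Composite index for queries filtering on both columns,
        # e.g. WHERE level = ... AND timestamp > ...
        indexes = (('level', 'timestamp'),)
```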
## Perform heavy computations in a separate process
For most deferred calculations, you can use the built-in job scheduler. However, DipDup jobs run in the same asyncio event loop as the rest of the framework, so heavy jobs can degrade indexing performance.
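If a computation is CPU-bound but too small to justify a separate service, you can keep the event loop responsive by offloading it to a worker process. A minimal sketch using only the standard library; `crunch_numbers` and its argument are hypothetical placeholders for your own workload:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def crunch_numbers(n: int) -> int:
    # Hypothetical CPU-bound task; replace with your own computation
    return sum(i * i for i in range(n))


async def main() -> None:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # The heavy call runs in a child process; the event loop stays free
        result = await loop.run_in_executor(pool, crunch_numbers, 10_000_000)
    print(result)


if __name__ == '__main__':
    asyncio.run(main())
```

Keep in mind that arguments and results cross the process boundary via pickling, so pass plain data rather than ORM model instances.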
If you decide to implement a separate service to perform heavy computations, you can add a custom DipDup CLI command to run it. That way you can reuse the same config and environment variables. Create a new file `cli.py` in the project root directory:
```python
import click
from dipdup.cli import cli, _cli_wrapper
from dipdup.config import DipDupConfig
from dipdup.utils.database import tortoise_wrapper


@cli.command(help='Run heavy calculations')
@click.option('-k', '--key', help='Command option')
@click.pass_context
@_cli_wrapper
async def heavy_stuff(ctx, key: str) -> None:
    # Config is already parsed by the parent `cli` group
    config: DipDupConfig = ctx.obj.config
    url = config.database.connection_string
    models = f'{config.package}.models'

    async with tortoise_wrapper(url, models):
        ...  # Your heavy computations here


if __name__ == '__main__':
    cli(prog_name='dipdup', standalone_mode=True)
```
Then use `python -m dipdup_indexer.cli` instead of `dipdup` as an entrypoint. Now you can call `heavy-stuff` like any other command. The `dipdup.cli:cli` group handles argument and config parsing, graceful shutdown, and other boilerplate. Keep in mind that DipDup ORM is not thread-safe.
```shell
python -m dipdup_indexer.cli -c dipdup.yaml heavy-stuff --key value
```
Or in a Dockerfile:

```dockerfile
ENTRYPOINT ["python", "-m", "dipdup_indexer.cli"]
CMD ["-c", "dipdup.yaml", "heavy-stuff", "--key", "value"]
```
## Reducing disk I/O
Indexing produces a lot of disk I/O. During development, you can store the database in RAM. By default, DipDup uses an in-memory SQLite database that is dropped on exit. Using tmpfs instead allows you to persist the database between process restarts, until the system is rebooted. By default, tmpfs is mounted on `/tmp` with a size of 50% of RAM. The following spells are for Linux, but on macOS the process should be similar.
```shell
# Make sure tmpfs is mounted
$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            16G  1.0G   16G   7% /tmp

# You can change the size of tmpfs without unmounting it
$ sudo mount -o remount,size=64G,noatime /tmp

# But make sure that you have enough swap for this
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            30Gi        16Gi       3,1Gi       1,3Gi        11Gi        12Gi
Swap:           31Gi       6,0Mi        31Gi

# Update database config to use tmpfs
$ grep database -A2 dipdup.yaml
database:
  kind: sqlite
  path: /tmp/uniswap.sqlite

# After you're done indexing, move the database from RAM to disk
$ mv /tmp/uniswap.sqlite ~/uniswap.sqlite
```
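To switch between the tmpfs path and a persistent one without editing the config, you can parameterize the path with an environment variable. A sketch assuming DipDup's `${VAR:-default}` substitution syntax in config files; `SQLITE_PATH` is a hypothetical variable name:

```yaml
database:
  kind: sqlite
  path: ${SQLITE_PATH:-/tmp/uniswap.sqlite}
```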