Importing a Quarter Trillion Points

Managing and using a terabyte-scale point cloud dataset becomes painful with traditional file-based tools and methods. LumiDB might be able to help you here! In this case study, we look at how such a dataset is imported into LumiDB.

At LumiDB, we’re building a database for reality capture data that will make handling big point cloud datasets easy, accessible and fun. Request access to the alpha version to try our API.
Once a point cloud dataset reaches billions of points, traditional file-based workflows break down: the data no longer fits into memory, so it cannot be visualized without additional preprocessing, and even basic file management becomes difficult when the dataset is split into thousands of separate files.
The Gisborne LiDAR dataset offers a nice case study of this, and a good way to demonstrate how LumiDB approaches these problems. The dataset covers the entire Gisborne region of New Zealand, and its file and point counts represent the kind of real-world, large-scale point cloud data that organizations increasingly need to manage and analyze.
📊 Dataset statistics:
  • Total number of points: 262,397,192,609 pts
  • Area: 8,672.77 km²
  • Point density: 30.25 pts/m²
  • File count: 25,190
  • Total size: 2,236 GB (LAZ, compressed)
  • License: CC BY 4.0
The challenge? Importing and organizing 262 billion points split into twenty-five thousand files, while maintaining easy access, efficient rendering and extensive query capabilities.
With LumiDB, there are multiple ways of importing a dataset of this kind:
  1. Using our managed service and the Import API.
  2. Through an integration with your existing software or processing pipeline.
  3. Running the import manually on your own hardware.
In this blog post, we’ll explore the manual approach to better explain the underlying process. With our managed service, all of this happens automatically behind the scenes.
To run an import locally, we first need to create an import manifest file and then run the import itself. The manifest is a basic JSON document that lists all the files included in the full dataset.
Our example dataset is hosted by OpenTopography and sits behind a custom S3-compatible endpoint. To list the files, we can use the AWS CLI:
$ aws s3 ls s3://pc-bulk/NZ23_Gisborne/ --recursive --endpoint-url https://opentopography.s3.sdsc.edu --no-sign-request

2024-05-01 02:17:02      37732 NZ23_Gisborne/Addendum1/CL2_BD43_2023_1000_1742.lax
2024-05-01 04:18:36  141652393 NZ23_Gisborne/Addendum1/CL2_BD43_2023_1000_1742.laz
2024-05-01 04:28:26      37732 NZ23_Gisborne/Addendum1/CL2_BD43_2023_1000_1743.lax
2024-05-01 02:01:34  108064295 NZ23_Gisborne/Addendum1/CL2_BD43_2023_1000_1743.laz
2024-05-01 05:34:49      39280 NZ23_Gisborne/Addendum1/CL2_BD43_2023_1000_1744.lax
2024-05-01 05:34:16  103315555 NZ23_Gisborne/Addendum1/CL2_BD43_2023_1000_1744.laz
# ... and 50000 lines more ...
LumiDB supports importing data directly from S3-compatible remotes. We can turn the file listing into an import manifest file using a simple LumiDB command:
$ lumidb create-manifest --endpoint https://opentopography.s3.sdsc.edu --prefix s3://pc-bulk/NZ23_Gisborne/ --crs EPSG:2193 --date 2023 gisborne.json
The created import manifest looks like this:
{
  "inputs": [
    {
      "src": "s3://pc-bulk/NZ23_Gisborne/CL2_BF40_2023_1000_3750.laz",
      "crs": "EPSG:2193",
      "date": "2023"
    },
    {
      "src": "s3://pc-bulk/NZ23_Gisborne/CL2_BF40_2023_1000_3847.laz",
      "crs": "EPSG:2193",
      "date": "2023"
    },
    ...
  ]
}
The manifest can also contain customer-defined per-file metadata, which can be used to build more advanced queries.
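For illustration, an input entry carrying such metadata could look roughly like the sketch below; the metadata field name and its contents are hypothetical examples, not the exact LumiDB schema:
{
  "src": "s3://pc-bulk/NZ23_Gisborne/Addendum1/CL2_BD43_2023_1000_1742.laz",
  "crs": "EPSG:2193",
  "date": "2023",
  "metadata": {
    "project": "NZ23_Gisborne",
    "addendum": "Addendum1"
  }
}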
Once the manifest is created, we can start the import. With a self-hosted LumiDB, the import is done using the import-table command:
$ lumidb import-table gisborne gisborne.json EPSG:2193
The import reads the files, automatically splits them into chunks that are processed in parallel, and writes a database table in the LumiDB format, which is efficient to read and query.
Import times vary based on hardware: this dataset took about six hours on a MacBook M3 Max with an external SSD. Performance depends mainly on network speed, disk I/O, and CPU capacity. We're actively optimizing the import process and expect improved speeds in future releases.
Once the import is done, we can visualize and query the dataset.
LumiDB viewer showing a section of the dataset with a point budget of 2M points.
LumiDB keeps a reference from each point to its original file. Here we color the points by the source file.
To run queries programmatically, you can use our JS SDK:
const response = await lumidb.query({
  table: "gisborne",
  queryBoundary: {
    Aabb: {
      min: [19840554, -4657548, -10000],
      max: [19843506, -4654271, 10000]
    }
  },
  queryCRS: "EPSG:3857",
  maxPoints: 5_000_000,
  maxDensity: 10.0,
  sourceFileFilter: null,
  classFilter: null,
  outputFormat: "threejs", // or "laz" for exporting to other software
});
Check out our minimal example of how to integrate LumiDB query capabilities into a three.js-based viewer.
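As a rough sketch of what such an integration can look like, the snippet below wraps a query result into a three.js object. It assumes, purely for illustration, that the response exposes point positions as a flat Float32Array under a positions field; that field name is an assumption rather than the documented response format, so refer to the linked example for the real API.
import * as THREE from "three";

// Sketch only: `response.positions` (a flat Float32Array of x, y, z triplets)
// is an assumed field name, not the documented LumiDB response shape.
function responseToPoints(response) {
  const geometry = new THREE.BufferGeometry();
  geometry.setAttribute(
    "position",
    new THREE.BufferAttribute(response.positions, 3)
  );
  const material = new THREE.PointsMaterial({ size: 0.5 });
  return new THREE.Points(geometry, material);
}

// Usage: add the queried section to an existing three.js scene.
scene.add(responseToPoints(response));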
Importing the data is only the first step: it unlocks multiple ways of using the data efficiently, including advanced querying, efficient visualization, integrations with other software, exports, access control, and processing. Make sure to follow our blog for future updates!
Interested in testing LumiDB? Request access to our closed alpha program and join the ranks of companies testing LumiDB today.
 

Written by

Alex Lagerstedt

Lead Developer