Large Datasets & Big Data


Considerations of Large Datasets

Parallelization
Row64's data structures and software are designed for massive parallelization. Many people are familiar with the AMD Threadripper, which can parallelize work across up to 64 cores.

Row64 maximizes the use of CPU cores and also of GPU cores. For example, an NVIDIA RTX 3090 has 10,496 CUDA cores. Row64 therefore has a massive amount of infrastructure and specialized engineering to take advantage of this new era of parallelization.
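
To make the idea of spreading work across cores concrete, here is a minimal, generic Python sketch. It is not Row64's implementation; the function names (split_ranges, partial_sum, parallel_sum) are made up for illustration, and the "column" is just the row index standing in for real data. Each worker process sums one contiguous chunk, and the partial results are combined at the end.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_ranges(n_rows, n_chunks):
    """Split range(n_rows) into n_chunks contiguous (start, stop) pairs."""
    step, extra = divmod(n_rows, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row each.
        stop = start + step + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def partial_sum(bounds):
    """Sum a stand-in column (the row index itself) over one chunk."""
    start, stop = bounds
    return sum(range(start, stop))


def parallel_sum(n_rows, workers=None):
    """Sum the stand-in column using one process per chunk."""
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, split_ranges(n_rows, workers)))
```

When run as a script, call parallel_sum(10_000_000) from inside an `if __name__ == "__main__":` guard, which process pools require on some platforms. A GPU applies the same split-and-combine pattern, just across thousands of cores instead of dozens.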

Casting Data

With large datasets, Row64 offers advanced features for casting data. To work with data the way you want and to optimize for performance, you’ll often want to set the exact data type. This is especially important when you start working with datasets over 100 million records.
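
As a rough illustration of why the exact type matters at this scale, the sketch below estimates how much memory a single column of 100 million values needs under different storage types. It uses Python's standard array module only as a convenient way to look up element sizes; it says nothing about how Row64 stores data internally, and the column_bytes helper is hypothetical.

```python
from array import array

ROWS = 100_000_000  # the record count mentioned above

# Type codes from Python's array module. Element sizes are looked up at
# runtime because C type sizes can differ between platforms.
CANDIDATE_TYPES = {
    "signed char": "b",
    "float": "f",
    "double": "d",
}


def column_bytes(type_code, rows=ROWS):
    """Bytes needed to store `rows` values of the given array type code."""
    return array(type_code).itemsize * rows


for label, code in CANDIDATE_TYPES.items():
    print(f"{label:>12}: {column_bytes(code) / 1e6:,.0f} MB")
```

On a platform where a double is 8 bytes and a signed char is 1 byte, the same column costs 800 MB in one type and 100 MB in the other, which is the kind of difference casting lets you control.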

To cast data in your import process, use “Detailed Import” in Row64.

In Detailed Import, you can review the data without loading it. Scroll up and down to inspect the raw data and remove any errors. You can also right-click any row to remove it from the import/ingest process.

Once you are satisfied with the data, click “Run Diagnostic.”

The Diagnostic property page gives you a ton of control over your import. In particular, you can control how data is cast in the “Selected Type” combo boxes. Once you’ve adjusted your Detailed Import settings, hit OK.

Next, hit “Complete Import” to bring your data with your detailed settings into Row64.


What are the Problems with Large Datasets?

Working with big data in a single file can cause a multitude of problems.

An example of this is the HDF5 file structure.

Row64 was designed to avoid all the problems detailed in the critique linked below, which are also common in other software that handles large datasets. The article, by Cyrille Rossant, outlines the challenges we set as targets to fix:

Cyrille Rossant - Moving away from HDF5

Main HDF5 Problems:

Any spreadsheet program would have similar problems if it wasn’t structured for big data. If you’re curious, you can find this kind of legacy organization in Excel. Here’s how to see the details:

  • Save out an Excel file for testing
  • Rename the file extension from .xlsx to .zip
  • Right-click on the file and choose “Extract All”
  • This will extract the contents of the Excel “World In A File” file system
  • You can open any of the individual sub-files in any text editor
  • What you will find is that Excel is similar to the HDF5 format, but with an old-school XML structure in each file
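
The same inspection can be scripted without renaming anything, since an .xlsx file is a ZIP archive. The sketch below uses Python's standard zipfile module; list_parts and read_part are hypothetical helper names. So that it runs without a real workbook, it builds a tiny stand-in archive in memory whose single part name mirrors Excel's layout but whose content is a placeholder.

```python
import io
import zipfile


def list_parts(xlsx_file):
    """Return (name, uncompressed_size) for every file inside the archive."""
    with zipfile.ZipFile(xlsx_file) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def read_part(xlsx_file, part_name):
    """Return one internal file decoded as text (the parts are XML)."""
    with zipfile.ZipFile(xlsx_file) as archive:
        return archive.read(part_name).decode("utf-8")


# Stand-in archive built in memory; a real workbook contains many more parts.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("xl/worksheets/sheet1.xml", "<worksheet><sheetData/></worksheet>")

for name, size in list_parts(demo):
    print(name, size)
```

To try it on the test file you saved, pass its path instead of the in-memory stand-in, for example list_parts("test.xlsx") (the file name here is only an example).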

If you’re interested in a quick summary of the problems with this type of approach, here’s a link: XML is really, really slow

Big picture, there have been massive steps forward in fast file access and organization in the last 30 years. Row64 is built on the “lessons learned” over that period about breaking speed records and organizing data science projects.

Row64 file I/O was rewritten over and over for about 2 years while we worked out the most compact, low-level byte format that could get in and out of the GPU as quickly as possible.
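
To give a feel for the difference between a text format and a compact byte format, here is a small generic sketch. It is not Row64's file format, which this article does not describe; the `<v>` tag and the little-endian 64-bit float layout are assumptions chosen purely for illustration. It stores the same numbers once as XML text and once as packed bytes, then reads the packed form back in a single call with no text parsing.

```python
import struct

values = [float(i) for i in range(1000)]

# Text encoding: one XML element per value (illustrative tag name).
xml_bytes = "".join(f"<v>{v!r}</v>" for v in values).encode("utf-8")

# Binary encoding: little-endian 64-bit floats packed back to back.
packed = struct.pack(f"<{len(values)}d", *values)

# The binary form decodes in one call, with a fixed 8 bytes per value.
decoded = list(struct.unpack(f"<{len(packed) // 8}d", packed))

print(len(xml_bytes), "bytes as XML")
print(len(packed), "bytes packed")
```

A fixed-width layout like this also makes it possible to jump straight to any value by offset, which a text format cannot offer without scanning.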

Our hope is that you run some hard-hitting tests of Row64 vs. the competitors with real data and get the word out. If you have any test results or questions, please share them in the comments below.


Are Row64 Multi-File Projects Designed for Big Data?

Yes. Our aim is to take various files and resources and work with them as a cohesive unit. Row64 achieves this by managing projects through Workspaces.

Workspaces are very helpful with complex projects as they act as a type of environment where a data scientist or a business analyst can work, separated from outside interference for the duration of their task.


What Exactly is a Workspace?

Similar to VSCode, a Row64 Workspace is the collection of folders that are open in a Row64 window or instance.

Generally, you have a single project folder open at a time in a Row64 workspace. However, if you open Row64 multiple times, you can work on multiple projects at the same time.