How to handle big data files

Sinan Artun
3 min read · Feb 18, 2022

While big data files are already problematic in many ways, the XML format takes these problems to a whole new level. Last night, I got a call from a group of data scientist friends who said they had big data in an unsuitable structure and asked if I could help them.

Yes, I said

The more valuable data is, the harder it is to find. Discogs data is excellent real-world data. It's in XML format and big enough to hit the limits of your computer.
If you can't beat them, join them


All I planned was to divide the big file into smaller pieces. This method works (it really does). Let me explain the worst-case scenario for an XML file.

typical structure of an XML file.

In this story, I will handle the http://discogs-data.s3-us-west-2.amazonaws.com/data/2022/discogs_20220201_releases.xml.gz file, which is a ~70 GB XML file once unzipped.
We intended to cut this big data file into smaller pieces. Our data scientist team works in Python, and the library they find most successful for XML files is xml.etree.ElementTree. Every XML file needs a single root tag; otherwise, the library cannot parse it. You can wrap the <release> tags in a <root> tag, or in a <releases> tag as in the original data.
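
If disk space is tight, the dump does not have to be unzipped before it is read. The following is a minimal sketch, assuming the downloaded .gz file is streamed through java.util.zip.GZIPInputStream. The class name GzLineReader and the helpers open, countLines, and writeSample are made up for illustration and are not part of the program further down.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzLineReader {

    // Opens a gzip-compressed text file as a buffered UTF-8 reader.
    static BufferedReader open(Path gzFile) throws IOException {
        return new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8));
    }

    // Counts lines while holding only one line in memory at a time.
    static long countLines(Path gzFile) throws IOException {
        long count = 0;
        try (BufferedReader br = open(gzFile)) {
            while (br.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    // Demo helper (hypothetical): writes a tiny gzip file to a temp location.
    static Path writeSample(String text) throws IOException {
        Path tmp = Files.createTempFile("sample", ".xml.gz");
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(tmp)), StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Path gz = writeSample("<releases>\n<release id=\"1\"></release>\n</releases>\n");
        System.out.println(countLines(gz) + " lines");
        Files.delete(gz);
    }
}
```

The reader returned by open can be used in place of a plain file reader, at the cost of decompressing on the fly.
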
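
The single-root rule is not specific to Python: any conforming XML parser rejects a document with more than one top-level element. Here is a small illustration with the JDK's own DOM parser. The class name RootTagDemo and the sample <release> snippets are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class RootTagDemo {

    // Returns true if the text parses as a well-formed XML document.
    static boolean isWellFormed(String xml) throws ParserConfigurationException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String records = "<release id=\"1\"/>\n<release id=\"2\"/>";
        // Two top-level elements: not a valid document.
        System.out.println(isWellFormed(records));
        // The same records wrapped in one root element.
        System.out.println(isWellFormed("<root>\n" + records + "\n</root>"));
    }
}
```

The first call reports false and the second true, which is why every small file has to get its own wrapper tag.
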

Steps for victory

  1. Read the data line by line with a buffered reader.
  2. Get rid of the <releases> and </releases> tags on the first and last lines.
  3. Divide the data into files of ~100 MB.
  4. Make sure every file ends on a line that closes with a </release> tag.
  5. The data is yours.

Here is the code I wrote for the solution:

package com.company;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

public class Main {

//Edit START
static String bigFilePath = "/Users/synan/Downloads/discogs_20220201_releases.xml";
static String smallFilesFolder = "/Users/synan/Downloads/data/";

//Edit END


static long start = System.currentTimeMillis();
public static void main(String[] args) {
parser();
}

public static void parser() {
int dd = 0;
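// Read as UTF-8 explicitly so the result does not depend on the platform default charset
// (the FileReader(String, Charset) constructor needs Java 11 or later).
try (BufferedReader br = new BufferedReader(new FileReader(bigFilePath, StandardCharsets.UTF_8))) {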
StringBuilder cline = new StringBuilder("<root>").append("\n");
int file_number = 0;
int cc = 0;
for (String line; (line = br.readLine()) != null; ) {
if (dd == 0) {
dd++;
continue;
}
int linel = line.length();
if (linel < 10) {
cline.append(line).append(" ");
continue;
}
boolean release_end = false;
// A record is complete when its line ends with </release>.
if (!line.endsWith("</release>")) {
cline.append(line).append(" ");
} else {
cline.append(line).append("\n");
release_end = true;
}
cc++;
// Start a new file every 100,000 counted lines, but only right after a line that closes a record.
if (cc % 100000 == 0) {

if (!release_end) {
cc--;
continue;
}

file_number++;
String path = smallFilesFolder + file_number + ".xml";
// try-with-resources closes the writer even if writing fails.
try (PrintWriter dataWriter = new PrintWriter(path, StandardCharsets.UTF_8)) {
dataWriter.print(cline.append("</root>").toString().trim());
cline = new StringBuilder("<root>").append("\n");
} catch (IOException e) {
e.printStackTrace();
}
}
}
file_number++;
String path = smallFilesFolder + file_number + ".xml";
// Write whatever is left as the last file, dropping the original closing </releases> tag.
try (PrintWriter lastWriter = new PrintWriter(path, StandardCharsets.UTF_8)) {
cline = new StringBuilder(cline.toString().replace("</releases>","").trim());
lastWriter.print(cline.append("\n").append("</root>"));
} catch (IOException e) {
e.printStackTrace();
}
long end = System.currentTimeMillis();
System.out.println("completed in " + (end - start) / 1000 + " seconds");
// line is not visible here.
} catch (IOException e) {
e.printStackTrace();
}
}
}
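
The program above starts a new file every 100,000 lines, which only approximates the ~100 MB target from step 3. A variant that cuts on size instead could look like the sketch below. It is not the code from this story: the class SizeSplitter, the method split, and the maxChars threshold are hypothetical, it counts characters rather than bytes, and it assumes that the <releases> and </releases> tags sit on lines of their own and that every record ends with a line ending in </release>.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class SizeSplitter {

    // Streams lines from `in`, wraps groups of complete records in <root>,
    // and hands each chunk to `sink` once it holds at least maxChars characters.
    // Returns the number of chunks produced.
    static int split(BufferedReader in, long maxChars, Consumer<String> sink) throws IOException {
        StringBuilder chunk = new StringBuilder("<root>\n");
        boolean hasRecords = false;
        int chunks = 0;
        for (String line; (line = in.readLine()) != null; ) {
            String trimmed = line.trim();
            // Step 2: drop the original wrapper tags.
            if (trimmed.equals("<releases>") || trimmed.equals("</releases>")) {
                continue;
            }
            chunk.append(line).append('\n');
            hasRecords = true;
            // Steps 3 and 4: cut only right after a record closes.
            if (trimmed.endsWith("</release>") && chunk.length() >= maxChars) {
                sink.accept(chunk.append("</root>").toString());
                chunk = new StringBuilder("<root>\n");
                hasRecords = false;
                chunks++;
            }
        }
        // Flush the remainder, if any, as the last chunk.
        if (hasRecords) {
            sink.accept(chunk.append("</root>").toString());
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<releases>\n<release id=\"1\">\n<title>A</title>\n</release>\n"
                + "<release id=\"2\">\n<title>B</title>\n</release>\n</releases>\n";
        List<String> out = new ArrayList<>();
        // A tiny threshold so the two sample records land in separate chunks.
        split(new BufferedReader(new StringReader(xml)), 20, out::add);
        out.forEach(System.out::println);
    }
}
```

For the real dump, the sink would write each chunk to its own numbered file, and maxChars would be set to roughly 100 million.
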

Github

This is a single-file Java solution. It's a little complicated because, for performance reasons, I had to merge many functions into one. Just install IDEA, create a new project, and paste these lines into the main Java file.

Edit bigFilePath and smallFilesFolder and run.

This is the result.
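
One way to sanity-check the output is to stream each small file through a SAX parser and count its records. This is a rough sketch rather than part of the original solution: the class ChunkCheck and the method countReleases are hypothetical names, and the check counts every element named release at any depth.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ChunkCheck {

    // Streams through one chunk and counts its <release> elements.
    // Throws SAXException if the chunk is not well-formed XML.
    static int countReleases(InputStream chunk)
            throws ParserConfigurationException, SAXException, IOException {
        int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(chunk, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("release".equals(qName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // For a real chunk, pass an input stream opened on one of the small files instead.
        String sample = "<root>\n<release id=\"1\"><title>A</title></release>\n"
                + "<release id=\"2\"><title>B</title></release>\n</root>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(countReleases(in) + " releases");
    }
}
```

Because SAX reads the file as a stream, the check does not need to load a whole chunk into memory.
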

Don't hesitate to ask any questions. Happy coding!
