Parsing files with Processing…

In the following basic example, I’m using Processing to parse a 2GB textfile.

The textfile I’m using (content.rdf.u8) is from the DMOZ.org project. You can download the compressed file here.

The uncompressed file, content.rdf.u8, is approximately 2GB. For a sample of the contents and structure of the file, follow this link. For a file of this size, it’s important to use BufferedReader, which lets you read the file one line at a time instead of loading all of it into memory. Read here for more on BufferedReader.
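To make the line-at-a-time idea concrete, here is a minimal sketch in plain Java (the language Processing is built on). The class name `LineScan` and the method `countMatching` are hypothetical names made up for this illustration, and a small in-memory string stands in for the real 2GB file so the example runs anywhere.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, for illustration only.
class LineScan {
    // Counts the lines that contain `marker`, keeping only one line in memory at a time.
    static long countMatching(BufferedReader reader, String marker) {
        long count = 0;
        try {
            String line;
            // readLine() returns null once the end of the input is reached
            while ((line = reader.readLine()) != null) {
                if (line.contains(marker)) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed sample input; the real file would be opened with a file-based reader instead.
        String sample = "<ExternalPage about=\"http://example.com/\">\n"
                      + "  <d:Title>Example</d:Title>\n"
                      + "</ExternalPage>\n";
        BufferedReader reader = new BufferedReader(new StringReader(sample));
        System.out.println(countMatching(reader, "<d:Title>")); // prints 1
    }
}
```

Because only the current line is held at any moment, memory use should stay roughly flat no matter how large the input is.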

The following piece of code reads the text file and converts it to another text file, one record per line, in the format URL|Title|Description|Category:

import java.util.regex.*;
 
// Absolute paths can be used directly; dataPath() is only needed for files in the sketch's data folder
String fileName = "C:\\Users\\Arkadian\\Downloads\\content.rdf.u8";
String outputFile = "D:\\dmoz\\data.txt";
 
String sep = "|";
String outputLine = "";
String L = "";
String t = "";
 
PrintWriter output;
output = createWriter(outputFile);
 
try
{
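  // content.rdf.u8 is UTF-8, so decode it explicitly instead of relying on the platform default
  BufferedReader file = new BufferedReader(new InputStreamReader(new FileInputStream(fileName), "UTF-8"));
 
  // readLine() returns null at the end of the file, which is a more reliable stop condition than ready()
  while ((L = file.readLine()) != null) {
    //println(L); // for debugging only...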
 
    String[] m1 = match(L, "<ExternalPage about=");
    String[] m2 = match(L, "<d:Title>");
    String[] m3 = match(L, "<d:Description>");
    String[] m4 = match(L, "<topic>");
    String[] m5 = match(L, "</ExternalPage>");
 
    // Start of an external page & URL
    if(m1 != null){
      t = L.replaceAll("<ExternalPage about=\"","");
      t = t.replaceAll("\">","");
      outputLine = t;
    }
 
    // Title
    if(m2 != null){
      t = L.replaceAll("  <d:Title>","");
      t = t.replaceAll("</d:Title>","");
      outputLine = outputLine + sep + t;
    }   
 
    // Description
    if(m3 != null){
      t = L.replaceAll("  <d:Description>","");
      t = t.replaceAll("</d:Description>","");
      outputLine = outputLine + sep + t;
    }   
 
    // Topic
    if(m4 != null){
      t = L.replaceAll("  <topic>","");
      t = t.replaceAll("</topic>","");
      outputLine = outputLine + sep + t;
    }    
 
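    // End of external page: write the finished record
    if(m5 != null){
      //println(outputLine); // echoing every record to the console is slow on a file this size
      output.println(outputLine);
      // no flush per record: output.close() flushes the buffer at the end
    }    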
  }
  file.close(); // release the input file once everything has been read
}
 
catch (Exception e){
  println("Error: " + e);
}
 
output.close();
exit();
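The per-record logic can also be pulled out into a small function, which makes it easy to try on a few lines before running it against the whole file. The following is a sketch in plain Java under some assumptions: `DmozRecord`, `between` and `toPipeLine` are hypothetical names, the sample record is made up to mirror the tags the code above looks for, and substring extraction is swapped in for the replaceAll() calls.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class DmozRecord {
    // Returns the text between `open` and the next `close`, or "" if `open` is absent.
    static String between(String line, String open, String close) {
        int start = line.indexOf(open);
        if (start < 0) {
            return "";
        }
        start += open.length();
        int end = line.indexOf(close, start);
        return end < 0 ? line.substring(start) : line.substring(start, end);
    }

    // Builds one URL|Title|Description|Category line from the lines of a single record.
    static String toPipeLine(List<String> recordLines) {
        String url = "", title = "", description = "", topic = "";
        for (String line : recordLines) {
            if (line.contains("<ExternalPage about=")) {
                url = between(line, "about=\"", "\"");
            } else if (line.contains("<d:Title>")) {
                title = between(line, "<d:Title>", "</d:Title>");
            } else if (line.contains("<d:Description>")) {
                description = between(line, "<d:Description>", "</d:Description>");
            } else if (line.contains("<topic>")) {
                topic = between(line, "<topic>", "</topic>");
            }
        }
        return String.join("|", url, title, description, topic);
    }

    public static void main(String[] args) {
        // Made-up record for illustration; not taken from the real file.
        List<String> sample = Arrays.asList(
            "<ExternalPage about=\"http://example.com/\">",
            "  <d:Title>Example Site</d:Title>",
            "  <d:Description>An illustrative entry.</d:Description>",
            "  <topic>Top/Reference/Examples</topic>",
            "</ExternalPage>");
        // prints http://example.com/|Example Site|An illustrative entry.|Top/Reference/Examples
        System.out.println(toPipeLine(sample));
    }
}
```

Unlike the sketch above it, this version always emits four fields, leaving a field empty when its tag is missing from the record.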

Pretty simple, yet effective…
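One thing to watch when reading the output file back: String.split() takes a regular expression, and a bare "|" means alternation there, so the separator has to be escaped. A minimal sketch in plain Java, where `PipeFields` and `fields` are hypothetical names and the sample line is made up:

```java
// Hypothetical helper, for illustration only.
class PipeFields {
    // Splits one URL|Title|Description|Category line into its fields.
    static String[] fields(String line) {
        // "\\|" matches a literal pipe; the -1 limit keeps trailing empty fields
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        String[] parts = fields("http://example.com/|Example Site|An illustrative entry.|Top/Reference/Examples");
        System.out.println(parts.length); // prints 4
        System.out.println(parts[1]);     // prints Example Site
    }
}
```

This assumes no title or description contains a "|" character itself; a record that does would split into more than four fields.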