In the following basic example, I’m using Processing to parse a 2GB textfile.
The textfile I’m using (content.rdf.u8) is from the DMOZ.org project. You can download the compressed file here.
The text file, content.rdf.u8, is approximately 2GB. For a sample of the contents and structure of the file, follow this link. For a file of this size, it’s important to use BufferedReader, which reads the file one line at a time instead of loading the whole thing into memory (as Processing’s loadStrings() would). Read here for more on BufferedReader.
The following piece of code reads the text file and converts it to another text file in the format URL|Title|Description|Category:
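To make the line-by-line idea concrete, here is a minimal sketch in plain Java (so it can be run outside the Processing IDE). It counts the lines that contain a given marker while holding only one line in memory at a time. The class name LineStreamDemo, the method countMatches, and passing the file path as the first command-line argument are my own assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LineStreamDemo {

    // Counts the lines containing `marker`, holding only one line in memory at a time.
    static long countMatches(BufferedReader reader, String marker) {
        long count = 0;
        String line;
        try {
            // readLine() returns null at the end of the input.
            while ((line = reader.readLine()) != null) {
                if (line.contains(marker)) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: pass the path to content.rdf.u8 as the first argument.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
            System.out.println(countMatches(reader, "<ExternalPage about="));
        }
    }
}
```

Because the loop never keeps more than the current line, memory use should stay flat regardless of how large the input file is.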
import java.io.*;

String fileName = dataPath("C:\\Users\\Arkadian\\Downloads\\content.rdf.u8");
String outputFile = dataPath("D:\\dmoz\\data.txt");
String sep = "|";
String outputLine = "";
String L = "";
String t = "";

PrintWriter output = createWriter(outputFile);

try {
  // Read the input as UTF-8, one line at a time.
  BufferedReader file = new BufferedReader(
    new InputStreamReader(new FileInputStream(fileName), "UTF-8"));

  // readLine() returns null at the end of the file.
  while ((L = file.readLine()) != null) {
    //println(L); // for debugging only...
    L = L.trim(); // drop the indentation in front of the tags

    // Start of an external page & URL
    if (L.contains("<ExternalPage about=")) {
      t = L.replace("<ExternalPage about=\"", "");
      t = t.replace("\">", "");
      outputLine = t;
    }

    // Title
    if (L.contains("<d:Title>")) {
      t = L.replace("<d:Title>", "");
      t = t.replace("</d:Title>", "");
      outputLine = outputLine + sep + t;
    }

    // Description
    if (L.contains("<d:Description>")) {
      t = L.replace("<d:Description>", "");
      t = t.replace("</d:Description>", "");
      outputLine = outputLine + sep + t;
    }

    // Topic
    if (L.contains("<topic>")) {
      t = L.replace("<topic>", "");
      t = t.replace("</topic>", "");
      outputLine = outputLine + sep + t;
    }

    // End of external page: write the finished record
    if (L.contains("</ExternalPage>")) {
      println(outputLine); // echoing every record slows things down; comment out for a full run
      output.println(outputLine);
    }
  }
  file.close();
}
catch (Exception e) {
  println("Error: " + e);
}

// Flush once at the end instead of after every record.
output.flush();
output.close();
exit();
Pretty simple, yet effective…
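To read the converted file back, each line can be split on the separator. Below is a minimal sketch in plain Java, assuming every record has exactly four fields and that no title or description itself contains a | character (such a record would produce extra fields). The class name PipeRowDemo, the method parseRow, and the sample row are made up for illustration.

```java
public class PipeRowDemo {

    // Splits one output line into its fields: URL, title, description, category.
    // The pipe is escaped because split() takes a regular expression, and the
    // limit of -1 keeps trailing empty fields (for example, a missing category).
    static String[] parseRow(String row) {
        return row.split("\\|", -1);
    }

    public static void main(String[] args) {
        // Made-up sample row, not taken from the real data.
        String row = "http://www.example.com/|Example Site|A made-up entry.|Top/Example";
        String[] fields = parseRow(row);
        System.out.println(fields.length + " fields, category = " + fields[3]);
        // prints "4 fields, category = Top/Example"
    }
}
```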