Best language to parse extremely large Excel 2007 files [closed]
My boss has a habit of running queries on our databases that return tens of thousands of rows and saving the results into Excel files. As the intern, I constantly have to write scripts that work with the data in these files. So far I've tried VBScript and PowerShell for my scripting needs. Both can take several minutes to perform even the simplest of tasks, which means a finished script would take most of an 8-hour day to run.
My current workaround is a PowerShell script that removes all commas and newline characters from the .xlsx files and saves them as .csv. A Java program then handles the data gathering and output, and the script cleans up the .csv files when everything is finished. This runs in a matter of seconds for my current project, but I can't help but wonder if there's a more elegant alternative for my next one. Any suggestions?
I kept getting all kinds of weird errors when working with .xlsx files.
Here's a simple example of using Apache POI to traverse an .xlsx file. See also Upgrading to POI 3.5, including converting existing HSSF Usermodel code to SS Usermodel (for XSSF and HSSF).
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.usermodel.FormulaEvaluator;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XlsxReader {

    public static void main(String[] args) throws IOException {
        InputStream myxls = new FileInputStream("test.xlsx");
        try {
            // XSSFWorkbook reads the whole .xlsx workbook into memory.
            Workbook book = new XSSFWorkbook(myxls);
            FormulaEvaluator eval =
                book.getCreationHelper().createFormulaEvaluator();
            Sheet sheet = book.getSheetAt(0);
            // Sheet and Row are Iterable, so for-each visits each defined row and cell.
            for (Row row : sheet) {
                for (Cell cell : row) {
                    printCell(cell, eval);
                    System.out.print("; ");
                }
                System.out.println();
            }
        } finally {
            // Close the stream even if reading the workbook fails.
            myxls.close();
        }
    }

    // Prints a single cell according to its type.
    private static void printCell(Cell cell, FormulaEvaluator eval) {
        switch (cell.getCellType()) {
            case Cell.CELL_TYPE_BLANK:
                System.out.print("EMPTY");
                break;
            case Cell.CELL_TYPE_STRING:
                System.out.print(cell.getStringCellValue());
                break;
            case Cell.CELL_TYPE_NUMERIC:
                // Dates are stored as numbers; the cell format tells them apart.
                if (DateUtil.isCellDateFormatted(cell)) {
                    System.out.print(cell.getDateCellValue());
                } else {
                    System.out.print(cell.getNumericCellValue());
                }
                break;
            case Cell.CELL_TYPE_BOOLEAN:
                System.out.print(cell.getBooleanCellValue());
                break;
            case Cell.CELL_TYPE_FORMULA:
                // Prints the formula text; eval is available if the computed value is wanted.
                System.out.print(cell.getCellFormula());
                break;
            default:
                System.out.print("DEFAULT");
        }
    }
}
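The usermodel example above loads the entire workbook into memory, which may become a problem for very large files. POI also has an event-based (SAX) API for that case. As a rough, dependency-free illustration of the same streaming idea, the sketch below treats the .xlsx as the ZIP archive it is and counts the rows of one worksheet with the JDK's SAX parser. It is only a sketch under assumptions: the class and method names are made up here, the first worksheet is assumed to live at the conventional path xl/worksheets/sheet1.xml, and the sheet XML is assumed to use unprefixed element names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI: streams worksheet XML instead of loading it.
final class StreamingRowCounter {
    private StreamingRowCounter() {}

    // Counts <row> elements in one worksheet XML stream without building a tree.
    static int countRows(InputStream sheetXml) {
        final int[] rows = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Assumption: element names are unprefixed, so qName is "row".
                if ("row".equals(qName)) {
                    rows[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(sheetXml, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse worksheet XML", e);
        }
        return rows[0];
    }

    // Opens the .xlsx as a ZIP archive and streams the first worksheet.
    static int countRowsInXlsx(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            // Assumption: conventional location of the first sheet.
            ZipEntry entry = zip.getEntry("xl/worksheets/sheet1.xml");
            if (entry == null) {
                throw new IOException("xl/worksheets/sheet1.xml not found in " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return countRows(in);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StreamingRowCounter <file.xlsx>");
            return;
        }
        System.out.println(countRowsInXlsx(args[0]));
    }
}
```

Reading actual cell values this way takes more work, since strings are stored separately in xl/sharedStrings.xml, so for anything beyond simple scans a library is probably the better choice.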
Your goal is to do "data transformation" on your Excel files.
To solve this, I would use a dedicated ETL tool (Extract Transform Load), such as Talend Open Studio.
You just have to add an "Excel Input" component, a "data transform" component, and a "CSV output" component. Talend will convert this functional description of your problem into Java code. Finally, you just have to execute that program.
I personally would use Python for this. I have found that it runs fast enough to not be a noticeable problem.
If you don't want to worry about a new language, why not just use Java for the entire thing? Removing commas and newlines is pretty trivial in Java and it would save you a step.
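A minimal sketch of that cleanup step, assuming each cell value is already available as a plain String. The class and method names (CsvSanitizer, clean, toCsvLine) are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: strips the characters that would break a naive CSV line.
final class CsvSanitizer {
    private CsvSanitizer() {}

    // Replaces each run of commas and line breaks (plus surrounding whitespace)
    // with a single space; null becomes an empty string.
    static String clean(String value) {
        if (value == null) {
            return "";
        }
        return value.replaceAll("\\s*[,\\r\\n]+\\s*", " ").trim();
    }

    // Joins the cleaned cell values into one comma-separated line.
    static String toCsvLine(List<String> cells) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(clean(cells.get(i)));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // prints: Smith John,line one line two,42
        System.out.println(toCsvLine(
            Arrays.asList("Smith, John", "line one\nline two", "42")));
    }
}
```

Note that this throws the commas away, just like the PowerShell step does. If they matter, quoting the fields is the usual alternative.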
You should always think about the future of your code...
Who will maintain your script in the future? Does your company have any other developers that are familiar with PowerShell/VBScript?
I would have to say that you should stick to one language that fits your (and your company's) needs. As Nathan suggested, Python would be a great choice for creating fast scripts.
And one more thing: if you can control the SQL statements your boss runs, you can have him produce output that makes your parsers easier to write and much simpler.
Good luck!
Tal.