Gene Golovchinsky, a Senior Research Scientist at FujiXerox, approached us with an interesting task: can the Pingar API scale to process a rather large dataset of approximately 1.7 million documents retrieved from CiteSeer (a repository of research publications)? As scientific publications, these documents typically average 6 pages of text in a small font. After extracting the text from all of them, the dataset adds up to over 110GB of uncompressed text. This post examines the methodology we used and the results of our experiments.