Processing 100+ GB of data using Pingar API on Amazon - 9 July, 2012
Gene Golovchinsky, a Senior Research Scientist at FujiXerox, approached us with an interesting task: can Pingar API scale to process a rather large dataset, approximately 1.7 million documents retrieved from CiteSeer (a repository of research publications)? Being scientific publications, these documents typically average around six pages of text in a small font, so after extracting the text from all of them, the dataset adds up to over 110GB of uncompressed text. This post examines the methodology we used and the results we found during our experimentation.
Specifically, Gene was interested in identifying all valid phrases in each document. This was easy to do. We simply used Pingar keyword extraction (available via the GetEntities method) with a number of keywords threshold set to 1000. The challenge was to run this method over all documents in the large input set.
First of all, we needed an estimate of how long it would take to process such a dataset on a regular machine. Gene supplied us with a sample set of 100 documents, or approximately 5MB of text. To get a baseline estimate, I ran them against the following machines:
- My development machine (4-core CPU, 8GB RAM) - 2 min 10 seconds
- A Rackspace server (4-core CPU, 8GB RAM, Virtual Machine) - 3 min 20 seconds
Some of these documents were quite large (up to 600KB), so they took a while to process. Extrapolating these numbers out to ~100GB of documents, we were looking at approximately 722 hours (30 days) to process on a development box, or 1111 hours (46 days) to process on a Rackspace server.
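The extrapolation above is simple arithmetic; here is the back-of-envelope version, using decimal units (1 GB = 1000 MB) to match the figures in the text:

```python
# Extrapolate from the 100-document (~5 MB) sample timings to the full
# ~100 GB dataset, assuming processing time scales linearly with data size.
SAMPLE_MB = 5
DATASET_MB = 100 * 1000  # ~100 GB

def hours_for_dataset(sample_seconds):
    batches = DATASET_MB / SAMPLE_MB
    return batches * sample_seconds / 3600

dev_hours = hours_for_dataset(130)        # 2 min 10 s per sample -> ~722 h
rackspace_hours = hours_for_dataset(200)  # 3 min 20 s per sample -> ~1111 h
```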
A new method was needed. The first and easiest thing to do was to multi-thread my client code. Running 10 threads with 10 docs each wouldn't give a very accurate estimate, so I decided on a base level of 1000 documents, i.e. running my 100 document sample set 10 times. Still sending 10 documents in each thread, I managed to process 1000 documents (10x100 in the sample set) in 22 minutes using 2 threads, 21 minutes using 4 threads, and a cool 17 minutes with 8 threads. OK - that's 566 hours (24 days) running locally. Certainly an improvement, but still a much larger time frame than I had hoped for.
Obviously, we were in need of more processing power. We have been using Amazon Elastic MapReduce to process our datasets for a while now, so naturally this was the next place to look. Elastic Compute offers a number of different machine configurations as on-demand instances.
Before deciding if this was the right direction to be heading, I had to take a baseline measurement. First of all I selected a "Large" instance type (7.5GB memory, 4 EC2 Compute Units [2 virtual cores]) and started a new instance, using the Windows Server 2008 R2 machine image. After logging into the virtual machine via RDP, it was simply a matter of logging into our fileshare and copying the appropriate files over. Once everything was tested and working, I went back to the Amazon console and saved a private Amazon Machine Image. This means that any time I want to run a Pingar API instance on an Amazon Elastic Compute machine, I can simply launch it from this image in the console (as opposed to using the generic/blank Windows Server image). It's an easy way to start up a cloud server (or servers) on demand.
So now I have a running instance of the Pingar API hosted on Amazon - time to test processing against it! I started up a "Small" instance, and copied my testing application over. After some teething issues (I forgot to allow access from one instance to another in the security group settings) I successfully ran my test against this instance. The results weren't exactly impressive - 4 threads took approximately 30 minutes to process 1000 documents. Increasing the number of threads didn't do much to decrease the time, either.
My next step took a different approach. Because the dataset contained research publications, in which keywords typically appear in the abstract, introduction and conclusions, we could safely assume that chopping out the middle part of the document would have minimal effect on the results. I proceeded to "trim" the longer documents, by only taking the first 30kb and the last 20kb of the text file. This brought down the too-long processing time of the larger documents, but still didn't reduce the total processing to a reasonable time frame.
So I figured it was time to bring out the big guns. The Amazon High-CPU Extra Large instance has 20 EC2 compute units (8 virtual cores) and 7GB of RAM. Once I started processing, it was immediately obvious that this one was faster - and the more I increased the number of 'client' threads running against it, the faster it got. Increasing the thread count raised individual response times (i.e. a single 10 document request took longer to process), but with more requests in flight we were making more progress overall. In fact, running 24 threads against this High-CPU Extra Large instance brought the estimated time for processing 1000 documents down to 6 minutes. This was a massive improvement over the 17 minutes I'd managed to reach on my development machine.
This works out to 200 hours, or a little over 8 days. By adding further machines to the cluster, I could further reduce this time. Running 4 machines meant I could do 4 times the processing at once - bringing it down to approximately 50 hours to do the whole dataset. And 50 hours meant I could let it run over the weekend.
As a comparison, I would've had to run 20 'Large' (recall 7.5GB memory, 4 EC2 Compute Units [2 virtual cores]) machines to process the dataset in the same time as the 4 'High-CPU Extra Large' machines. This would've cost 20 x $0.46 = $9.20 an hour. The 'High-CPU Extra Large' machines cost 4 x $1.14 = $4.56 an hour. The fact that the nominally more expensive High-CPU instance type worked out at half the total cost for processing was a bit of a surprise, and it shows that it is certainly worth investigating which is the right technology for the job at hand.
The actual process of running the 4 machines was surprisingly simple. I simply selected 4 instances when starting up from the previously saved machine image and then started a load balancer. I did have some trouble figuring out why the load balancer was repeatedly reporting the instances as 'Unhealthy' - turned out it was a security group issue, again. I had to add amazon-elb/amazon-elb-sg (Amazon Elastic Load Balancer) to the security group - which conveniently auto-completed when typing into the control console. Once I did this, everything worked as expected. I pointed my client code at the load balancer, ran my test code and (after logging in via RDP) saw the CPU usage on each instance sitting happily around 90%. Excellent.
The final step now was the actual processing of the provided data. Gene had uploaded the data to Amazon S3 for us, so it was easy enough to download onto my 'client' instance - but here I struck a problem. The compute instance I'd been using for running my client test code only had a small main drive! The promised 410GB of storage was nowhere to be seen. So, back to Google I went and found a useful post on the Amazon Web Services forums which explained the process for adding ephemeral storage. Sure enough the instance that started up had the additional storage available and ready to go.
Overall my experience of running a multi-machine solution on Amazon Elastic Compute services was a positive one. Most of the troubles I had were fairly easily solved, and generally could have been prevented by closer examination of the documentation instead of my ad-hoc approach. We were able to successfully process our largest dataset yet and learned a lot in the process.
New features in Pingar API 3.1! - 27 June, 2012
We recently released the Pingar API version 3.1 to the public. This has been in the works for some months and contains a number of exciting new features. While you can continue to use the old endpoint for a while, it is being deprecated and we recommend moving to the new endpoint as soon as possible. This blog post explains the main new features of Pingar API 3.1.
Pingar API now runs on WCF
The largest change was the move to WCF-based services. This means that instead of pointing your code at http://api3.pingar.com/PingarAPIService.asmx you now point it at http://api3.pingar.com/soap.svc. The benefit of running a WCF SOAP service is that we can now add changes in a non-breaking way (for most clients). So, for example, if we decide to add a new method called "GetMagicDocument" in 3.2, your code compiled against 3.1 shouldn't break. Similarly, if we were to add a new property "MagicDust" to an Entity object, you should be able to keep using your old code. This should be the case if you use C# or VB with a Visual Studio generated reference, or Java with an Axis2 reference (like the code supplied in the Pingar SDK files), and possibly other languages too.
Pingar API now is available via a REST endpoint
Another benefit of moving to WCF is that we now supply a REST endpoint. It is not strictly RESTful in the traditional sense: given the number of options we provide for each method, a fully resource-oriented design would have been a nightmare to develop and support. Instead we use the existing object structures (PingarAPIRequest, PingarAPIResponse, etc.), formatted as JSON and passed in the request/response body. Rather than splitting parameters between the request URI and the document contents in the body, this keeps everything in one place. We have provided a REST API guide here which should be useful in developing against this new endpoint.
New handling of entity types
When we released the 3.0 version of the Pingar API, one of the benefits over the 2.X versions was that EntityTypes, DocumentFormat, etc, used enumerations. We added these because we were encountering some confusion over the spelling and capitalisation of types, even internally. For example, is a bank account EntityType spelled "BankAccount" or "Bankaccount"? Using enumerations worked great - in SOAP. Unfortunately there is no way to support them in REST, and modifying or adding to an enumeration is a breaking change, even with WCF. For this reason we have switched back to using strings to define our EntityTypes, SubEntityTypes, DocumentFormat, LanguageCode and SentenceTypes. To solve the original problem, we've created a 'StringConstants' file which contains classes with static values, which we recommend you use when dealing with these types. This is included in the new SDKs, and is designed in such a way that in most cases (except method arguments) the only code change you need to make is the inclusion of 'using PingarAPIExamples.StringConstants;'.
New languages: German, French and Spanish
Another new feature is the addition of new languages! The API 3.0 release was a re-architecting of our code to enable the easy addition of new languages, and this release is an example of how straightforward it is to do that. We now support German, French and Spanish, on top of the English, Chinese and Japanese we already had support for. We've also added two new regions to English (en-us for the United States and en-au for Australia) allowing us to provide region-specific features, primarily with Entity extraction.
Improved keyword extraction
Apart from several bug fixes, we have introduced several changes that make Pingar API's keyword extraction more meaningful and easier to use. Keyword scores are now returned on a relative scale, from 1 (most relevant keyword in that document) to 0 (no relevance), so you can now meaningfully compare the importance of a given keyword across different documents. Also, each keyword returned for a given document now carries occurrence location information, encoded in the Entity.Occurrences property. This is a feature that was missing in earlier versions, and it certainly makes highlighting a keyword in text a lot easier.
Finally, keywords now have their Frequency value set (indicating how many times the keyword occurred in the text). This is not as simple as EntityOccurrences.Count, as some occurrences may actually be subphrases of other occurrences. For example, "real estate" may be a subphrase of "real estate industry", and both would be listed in the occurrences for the "real estate" keyword. We've done the calculations for you, and provide an accurate measure of the number of times a keyword occurs in a document in the Frequency value of the Entity object.
Other notable improvements
That covers most of the big features and fixes. Some of the other smaller but notable things are:
- GetAvailableLanguages method for finding out what languages are available for a given Pingar API installation
- 'GSTNumber' and 'GSTAmount' EntityTypes (for GetInvoiceEntities method) are now the more universal 'TaxNumber' and 'TaxAmount'
- A default Taxonomy is now specified for every PingarAPI installation - if you do not specify a Taxonomy when calling GetEntities or GetTaxonomyTerms, it will use this default, as opposed to all taxonomies
- ReplacedEntities property is now available for Redact/Sanitize methods, so you can find out what exactly has been replaced
- TaxonomyCategory added to Entity objects, which specifies the parent category in the Taxonomy (if the EntityType is TaxonomyTerm)
- InvoiceLineItems EntityType for GetInvoiceEntities (currently in beta)
- The NumberOfResults, NumberOfKeywords and SummaryLength parameters available for the GetAutocompleteSuggestions, GetDocumentPreview, GetSummary and GetEntities methods are now nullable ints. This means if you do not set a value for the parameter, the method will return a default sized set of results. You may now set 0 as a valid number of results to be returned.
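The nullable behaviour in the last item can be sketched as follows; the default of 20 is a hypothetical stand-in, since the real default is applied server-side:

```python
from typing import Optional

DEFAULT_RESULTS = 20  # hypothetical; the API's actual default is server-side

def resolve_number_of_results(requested: Optional[int]) -> int:
    """None means 'use the server default'; 0 is now a valid, explicit
    request for zero results; any other value is used as-is."""
    return DEFAULT_RESULTS if requested is None else requested
```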
If you have any comments or questions, please do ask on our forums. We would love to hear from you.