Amazon Textract-Based Document Redaction Proof of Concept

A Story of Owning Change

About King County of Washington

King County is home to Seattle and is the most populous county in the state and the 12th most populous county in the United States. King County delivers vital services for more than 2 million residents and is one of the state’s largest employers with 16,000 dedicated employees.



The prototype is delivering business value to the County. With 1Strategy’s help creating a prototype in just four weeks, the data pipeline has reduced the time it takes to search for and redact PII from 30 minutes to just five seconds per application.


—Tanya Hannah, Chief Information Officer, King County


The Challenge

When senior citizens in King County seek property tax relief, they can submit forms electronically or on paper. Paper submissions are scanned and then manually reviewed for Personally Identifiable Information (PII), which is redacted by hand. The same manual work is needed when unredacted documents are submitted online. A senior exemption specialist then assesses the forms to determine eligibility based on age, income and health care expenditures. King County wanted to automate the process to speed it up, reduce the manual labor involved and better protect seniors’ personal information.

Why Amazon Web Services

King County was already running workloads on AWS. Based on positive experiences with the platform, King County wanted to explore the machine learning capabilities available within AWS. On AWS’s recommendation, King County reached out to 1Strategy for support on this project. In the introductory meeting, 1Strategy listened to understand the project’s objective and demonstrated its experience with similar machine learning projects, which gave King County the confidence to move forward.

“AWS has leading-edge technology that allows King County to innovate and build solutions that solve complex business problems,” said Tanya Hannah, King County’s chief information officer. “With 1Strategy’s partnership and technical guidance, the team is well-prepared to extend intelligent processing to similar use cases and beyond.”

The Solution

1Strategy worked with King County to create a document redaction prototype in four weeks. The project leveraged two powerful machine learning managed services for text and image recognition: Amazon Textract and Amazon Rekognition. The AWS Software Development Kit (SDK) was used to link custom code to these services in an Amazon SageMaker Jupyter notebook. This allowed the team to build customized pieces of the application rather than having to implement machine learning algorithms from scratch. The project produced a proof-of-concept data pipeline to automate and speed up redaction of incoming documents to hide sensitive PII and establish a baseline for using Amazon’s AI services. Working with 1Strategy provided King County the education needed to understand the product and iterate upon it.

“The capability of Textract to read these documents and identify the different fields within them is pretty impressive,” said Eric Maia, solution architect at King County.

One of King County’s primary goals was to speed up the process of redacting PII from the seniors’ document submissions. To do so, they worked with 1Strategy to design and build an AWS data pipeline (see Figure 1), which used machine learning to identify the type of document and then read the document and identify where redactions of PII should be performed.


Figure 1: AWS data pipeline architecture diagram

Documents, or data, entered the pipeline and were stored in Amazon Simple Storage Service (S3). From there, documents in three formats (JPG, PNG, PDF) were imported to the Amazon SageMaker Jupyter Notebook service, where custom code standardized the documents into a format appropriate for further processing. This custom code deskewed documents, separated multipage documents into individual pages and converted all pages into a standardized image format.

Once each page was in a standard format, the application had to determine what type of document it was. As in any machine learning application, the designer’s objective was to train a machine to do what people were currently doing. Similar to how a novice exemption specialist would first need to learn the types of forms they were working with, the machine needed to be trained to do this task. For this purpose, the team chose Amazon Rekognition’s custom image classifier. The team supplied samples of various document types to Rekognition, along with expert guidance identifying each document’s type. From this training, Rekognition learned to do the task on its own. Because Rekognition is an easy-to-implement managed service, this design and training process took only a few hours for an initial prototype, ready to be used by the data processing pipeline.
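The classification step can be pictured with a short sketch. It assumes the classifier was trained with Amazon Rekognition Custom Labels and is queried through the `detect_custom_labels` call of the AWS SDK for Python (boto3); the case study does not show the team's actual code, so the function name `classify_page`, the confidence threshold and the label names are all hypothetical.

```python
from typing import Any, Optional


def classify_page(
    rekognition: Any,
    project_version_arn: str,
    bucket: str,
    key: str,
    min_confidence: float = 80.0,  # hypothetical threshold, not from the case study
) -> Optional[str]:
    """Return the most confident document-type label for a page image in S3.

    `rekognition` is expected to be a boto3 Rekognition client, for example
    boto3.client("rekognition"). Returns None when no label clears the threshold.
    """
    response = rekognition.detect_custom_labels(
        ProjectVersionArn=project_version_arn,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    labels = response.get("CustomLabels", [])
    if not labels:
        return None
    # Each label carries a "Name" and a "Confidence" score in percent.
    best = max(labels, key=lambda label: label["Confidence"])
    return best["Name"]
```

Passing the client in as an argument keeps the function easy to exercise with a stub. A page for which `classify_page` returns `None` would be a natural candidate for human review.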

After a document was identified by Rekognition, it would be read by Amazon’s machine learning text recognition service, Amazon Textract. Textract could look for places on the form that paired a prompt with a response, such as “Social Security Number” and “123-45-6789.” Only a subset of these responses needed to be redacted. Figure 2 illustrates the results of redacting only certain responses on a simulated sample tax document. Most of the responses were left alone, but seven responses had either both a blue and a red box or just a blue box superimposed over the information. The red boxes were successful redactions where Textract was able to find a desired prompt and then remove its response. The blue boxes used locational data to remove areas on the page where a redaction was expected to be needed.
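As an illustration of the prompt/response pairing, the sketch below walks the `Blocks` list that Textract returns when a document is analyzed with the `FORMS` feature, and collects the bounding boxes of responses whose prompts are on a redaction list. It is a minimal sketch, not the team's code: the helper names and the prompt-matching rule (case-insensitive, trailing colon ignored) are assumptions.

```python
from typing import Any, Dict, Iterable, List, Tuple

Block = Dict[str, Any]
Pair = Tuple[str, str, Dict[str, float]]


def _text_of(block: Block, by_id: Dict[str, Block]) -> str:
    """Join the WORD children of a KEY or VALUE block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(blocks: List[Block]) -> List[Pair]:
    """Return (prompt text, response text, response bounding box) triples."""
    by_id = {block["Id"]: block for block in blocks}
    pairs = []
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        for rel in block.get("Relationships", []):
            if rel["Type"] != "VALUE":
                continue
            for value_id in rel["Ids"]:
                value = by_id[value_id]
                pairs.append((
                    _text_of(block, by_id),
                    _text_of(value, by_id),
                    value["Geometry"]["BoundingBox"],
                ))
    return pairs


def _normalize(prompt: str) -> str:
    # Assumed matching rule: ignore case, surrounding spaces and a trailing colon.
    return prompt.strip().rstrip(":").strip().lower()


def select_redactions(pairs: List[Pair],
                      prompts_to_redact: Iterable[str]) -> List[Dict[str, float]]:
    """Return the response boxes whose prompt is on the redaction list."""
    wanted = {_normalize(prompt) for prompt in prompts_to_redact}
    return [box for key, _value, box in pairs if _normalize(key) in wanted]
```

With boto3, the input would come from something like `boto3.client("textract").analyze_document(Document={"S3Object": {...}}, FeatureTypes=["FORMS"])["Blocks"]`. The returned boxes use Textract's page-relative coordinates (`Left`, `Top`, `Width`, `Height` as fractions of the page size), so they must be scaled to pixels before drawing a redaction box.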

To set up this red/blue redaction process, the team used a separate custom application, implemented in an Amazon SageMaker notebook, to record all the prompts and their locations on the page of an exemplar document of a specific type, along with an expert’s specification of whether to redact the corresponding response. The pipeline then applied this expert and locational data to a new document of the same type to determine which responses to redact (red box) and where those boxes should be on the new document (blue box). The locational data used a linear regression to map from the coordinate system of the exemplar document to the coordinate system of the new document, as determined by the positional data Textract gleaned from the prompt/response pairs it found on the new document. Once a document was successfully redacted, it was stored in Amazon S3.
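The coordinate mapping can be sketched as two independent least-squares line fits, one per axis, using prompts found on both the exemplar and the new document as matched points. The case study says only that a linear regression was used, so this per-axis form (which captures scale and shift but not rotation) is an assumption, as are the function names.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Line = Tuple[float, float]                # (slope, intercept)
Box = Tuple[float, float, float, float]   # (left, top, width, height)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Line:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matched points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("exemplar points must not share one coordinate")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_page_mapping(exemplar: List[Point], new: List[Point]) -> Tuple[Line, Line]:
    """Fit exemplar-to-new mappings for the x axis and the y axis separately."""
    ex_x, ex_y = zip(*exemplar)
    new_x, new_y = zip(*new)
    return fit_line(ex_x, new_x), fit_line(ex_y, new_y)


def map_box(box: Box, mapping: Tuple[Line, Line]) -> Box:
    """Project an exemplar redaction box onto the new document."""
    (ax, bx), (ay, by) = mapping
    left, top, width, height = box
    # Positions get scale and shift; sizes get scale only.
    return (ax * left + bx, ay * top + by, ax * width, ay * height)
```

In this sketch, `exemplar` and `new` hold the positions of the same prompts on the two documents, in matching order. Once the mapping is fitted, each box recorded on the exemplar can be projected with `map_box` to obtain a location-based (blue) box on the new page, even where Textract missed the prompt itself.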

The algorithm rejected documents that would be difficult for Textract to read by requiring a minimum number of detected prompt/response pairs and by checking for rotation angles too large for Textract to handle. Rejected documents were moved to a different S3 location and made available for human review.
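A rejection rule of this kind might look like the sketch below. The thresholds and S3 prefixes are placeholders chosen for illustration; the case study does not state the values or locations the team used.

```python
# Hypothetical thresholds and prefixes; the real values are not given in the source.
MIN_PAIRS = 10
MAX_ROTATION_DEGREES = 5.0


def needs_human_review(
    pair_count: int,
    rotation_degrees: float,
    min_pairs: int = MIN_PAIRS,
    max_rotation_degrees: float = MAX_ROTATION_DEGREES,
) -> bool:
    """Flag a page when too few pairs were found or it is rotated too far."""
    too_few_pairs = pair_count < min_pairs
    too_rotated = abs(rotation_degrees) > max_rotation_degrees
    return too_few_pairs or too_rotated


def destination_key(key: str, review: bool) -> str:
    """Pick the S3 key prefix for a processed page (prefix names are assumptions)."""
    prefix = "needs-review/" if review else "redacted/"
    return prefix + key
```

For example, a page with only three detected pairs would be routed to `needs-review/` regardless of its rotation.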

For the purposes of this proof of concept, King County used a simulated sample with a single document type. Of that sample, 84% of documents were marked as successful redactions, and of those, 95% were indeed successful. The remaining 16% of documents were marked for human review, and 87% of those did need it. In the future, the team will extend the project to include additional document types using additional exemplar documents and more extensive training of the Rekognition classifier to broaden the types of documents the pipeline can handle.
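Combining these percentages gives the share of the whole sample in each outcome. The short calculation below assumes the reading that 95% of the documents marked successful were fully redacted and that 87% of the flagged documents truly needed a person; under that reading, roughly 79.8% of all documents were redacted correctly without human involvement, about 4.2% were marked successful but were not, about 13.9% were correctly sent to review and about 2.1% were sent to review unnecessarily.

```python
from typing import Dict


def outcome_shares(auto_rate: float, auto_precision: float,
                   review_precision: float) -> Dict[str, float]:
    """Split the sample into four outcomes as fractions of all documents."""
    review_rate = 1.0 - auto_rate
    return {
        "auto_correct": auto_rate * auto_precision,
        "auto_missed": auto_rate * (1.0 - auto_precision),
        "review_needed": review_rate * review_precision,
        "review_unneeded": review_rate * (1.0 - review_precision),
    }


# Figures reported for the proof of concept: 84%, 95% and 87%.
shares = outcome_shares(0.84, 0.95, 0.87)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```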

The ease of implementation of AWS managed machine learning services provided an opportunity to create a proof of concept in four weeks and establish a baseline for King County’s prospects for using machines rather than people to read and redact documents.

“The 1Strategy team really focused on making sure that we understood the product we were putting together, how it works and how to extend and apply it,” said Maia.

Additionally, Hannah said, “The prototype is delivering business value to the County. With 1Strategy’s help creating a prototype in just four weeks, the data pipeline has reduced the time it takes to search for and redact PII from 30 minutes to just five seconds per application. This automation is helping the County’s Assessors staff keep up with the 8,000 new applications received annually and clear the backlog of 4,000 unprocessed applications with 100,000 pages of accompanying documents from the 2021 tax year.”


The work described in this engagement was originally completed by 1Strategy, a TEKsystems Global Services company acquired in 2019. As of June 2023, 1Strategy has fully integrated with TEKsystems Global Services to continue to deliver AWS expertise to customers.
