In 2017, the US government released over 34,000 pages of documents relating to the CIA investigation of the JFK assassination. These files contain a huge volume of unstructured data - typed and handwritten notes, photos and other data that standard search solutions are unable to parse.
This lab will show you how you can leverage AI and Cognitive Search to extract meaning from this data. You can watch a demo of the lab contents in action in a short online video or explore the JFK files yourself with our online demo.
Cognitive Search is an Azure service that ingests your data from almost any data source, enriches it using a set of cognitive skills, and finally enables you to explore the data using Azure Search.
The JFK files example leverages the built-in Cognitive Skills inside of Azure Search and combines them with custom skills using extensibility. The architecture below showcases how the new Cognitive Search capabilities of Azure enable you to easily create structure from almost any data source.
Note: The visuals in this diagram are inspired by the CIA's 1997 JFK document management system, which is included in the JFK files.
This project includes the following capabilities for you to build your own version of the JFK files.
This lab requires an Azure subscription. If you delete the resources at the end of the session, total charges will be less than $1 so we strongly recommend using an existing subscription if available.
If you need a new Azure subscription, then there are a couple of options to get a free subscription:
This lab uses Postman to interact with the Azure Search REST API. Similar tools such as Fiddler or Charles can be used instead. The free Postman desktop app can be downloaded and installed for most operating systems. See the Azure Search documentation for more information on using Postman.
Download the sample code provided for this lab. It includes the following:
To Download:
In this module, you'll learn how to create your own Cognitive Search pipeline. You'll use the 'OCR cognitive skill' to perform character recognition on the source documents.
In order to speed up the lab we provide an Azure Template to deploy the required resources for this lab. The template will provision the following services:
Follow the next steps to create the resources:
Click the following link to deploy the template: // TODO: Replace template url
You will be redirected to Azure. Provide your credentials to log in.
Provide the required information:
ai-labs-<your initials>.
South Central US.
[!NOTE] At the time of writing, Azure Search with Cognitive Services is only available in the South Central US region.
Select the checkbox for "I agree to the terms and conditions stated above".
Click Purchase. This step might take a few seconds.
Once the deployment is complete, you will see a Deployment succeeded notification.
Save the configuration values required to use these services:
jfk-labs-<your initials>.
[!NOTE] These settings are required to connect to the services later in the lab.
We'll need a set of documents for testing the capabilities of cognitive search. Let's upload the sample files.
Create the blob container:
Go to All Resources in the left pane and search for the storage: jfkstorage.
Click on the resource.
Click on the Blob option.
Click on +Container and provide the required information:
jfkfiles.
Container.
Click on the newly created container.
Click the Upload button.
Select the following files from the lab materials: resources\documents.
Click Upload and wait for the process to complete.
A data source is the mechanism by which Azure Search indexers ingest data. You can pull data from supported Azure data sources using indexers and schedule data refreshes of a target index.
Setup Postman:
Open Postman from the Start Menu.
Click on Import from the toolbar.
Click on Import files and select the collection file at resources\AI Labs - Azure Search.postman_collection.json.
Click the gear icon in the upper right corner and select "Manage Environments".
Click the Import button at the bottom of the modal and select the file resources\AI Labs - Azure Search.postman_environment.json.
Click on the environment name to reveal a key-value editor to add, edit, and delete the environment variables.
Set the value of the following variables:
Click on Update and close the modal window.
Select AI Labs - Azure Search from the environment list in the top right corner.
Make the request:
Click the Create Data Source request from the collection.
Click on Body and specify the following values:
jfklabds.
azureblob.
Click on Send and wait for the response. If the request is successful you should see a status code 201 Created.
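The Create Data Source request that Postman sends can also be sketched in Python. The block below only builds the request and does not send it. The API version, the service name, and both credential placeholders are assumptions for illustration; the data source name, type, and container name come from the steps above.

```python
import json

# Assumption: the preview API version in use when this lab was written.
API_VERSION = "2017-11-11-Preview"


def build_create_data_source_request(service_name, api_key, connection_string,
                                     data_source_name="jfklabds",
                                     container_name="jfkfiles"):
    """Return (method, url, headers, body) for a Create Data Source call."""
    url = (f"https://{service_name}.search.windows.net/datasources"
           f"?api-version={API_VERSION}")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    body = {
        "name": data_source_name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container_name},
    }
    return "POST", url, headers, json.dumps(body)


# Placeholder values; substitute the ones saved from the deployment output.
method, url, headers, body = build_create_data_source_request(
    "my-search-service", "<admin-key>", "<storage-connection-string>")
print(method, url)
print(body)
```

Passing the four returned values to any HTTP client should produce the same 201 Created response described above, provided the key and connection string are valid.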
A Skillset is a set of Cognitive Skills that are used to extract and enrich data from the source documents to make it searchable in Azure Search. The service provides a set of predefined cognitive skills to extract data using techniques like entity recognition, language detection, key phrase extraction, text manipulation, and image detection. For more information, follow this link.
Click the Create Skillset request from the collection in Postman.
Review the request url and replace the value [skillset_name] with jfklabskillset.
Click on Body and check the value; it contains two predefined skills:
Click on Send and wait for the response. For a successful request, you should see status code 201 Created.
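As a rough illustration of what a skillset body looks like, the sketch below builds a definition containing only the OCR skill. It is not the body shipped in the Postman collection, which contains two skills; the description text, the API version, and the service name are assumptions.

```python
import json

# Assumption: the preview API version in use when this lab was written.
API_VERSION = "2017-11-11-Preview"


def build_ocr_skillset():
    """Return a minimal skillset body with a single OCR skill."""
    ocr_skill = {
        "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
        # Run once per image that the indexer extracted from a document.
        "context": "/document/normalized_images/*",
        "inputs": [
            {"name": "image", "source": "/document/normalized_images/*"}
        ],
        "outputs": [
            {"name": "text", "targetName": "text"},
            {"name": "layoutText", "targetName": "layoutText"},
        ],
    }
    return {"description": "OCR-only skillset (illustrative)",
            "skills": [ocr_skill]}


def skillset_url(service_name, skillset_name="jfklabskillset"):
    """URL for a PUT request that creates or updates the skillset."""
    return (f"https://{service_name}.search.windows.net/skillsets/"
            f"{skillset_name}?api-version={API_VERSION}")


skillset = build_ocr_skillset()
print(skillset_url("my-search-service"))
print(json.dumps(skillset, indent=2))
```

The `layoutText` output is the one that later steps read word by word through the path `/document/normalized_images/*/layoutText/words/*/text`.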
Creating an index specifies the schema and metadata that are required to store documents in Azure Search.
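To make the idea of a schema concrete, here is a minimal sketch of an index definition in Python. The field names are hypothetical and chosen only for illustration; the real schema is the one in the body of the Create Index request in the Postman collection.

```python
import json


def build_index_definition(name="jfklabindex"):
    """Return an illustrative index body; field names are hypothetical."""
    return {
        "name": name,
        "fields": [
            # Every index needs exactly one key field of type Edm.String.
            {"name": "id", "type": "Edm.String", "key": True,
             "searchable": False},
            {"name": "content", "type": "Edm.String", "searchable": True},
            {"name": "cryptonyms", "type": "Collection(Edm.String)",
             "searchable": True, "filterable": True},
        ],
    }


def key_fields(index):
    """Return the names of all fields marked as the document key."""
    return [field["name"] for field in index["fields"] if field.get("key")]


index = build_index_definition()
print(json.dumps(index, indent=2))
print("key fields:", key_fields(index))
```

Checking `key_fields` before sending the request is a cheap way to catch a schema with no key or with more than one.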
jfklabindex.
[!NOTE] The body of the request contains the schema definition, which includes the list of data fields within documents that will be fed into this index.
201 Created.
Indexers are specific to Azure data storage; they are used for crawling data in the data source and populating the Azure Search index.
Click the Create Indexer request from the collection in Postman.
Review the request url and replace the value [indexer_name] with jfklabindexer.
Click on Body and replace the [IndexerName] with jfklabindexer.
Replace the following values:
jfklabindex.
jfklabds.
jfklabskillset.
Click on Send and wait for the response. For a successful request, you should see status code 201 Created.
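The indexer body ties the three resources created so far together. The sketch below shows one plausible shape for it in Python; the `parameters` block is an assumption based on what image-based skills such as OCR generally require, and may differ from the body in the Postman collection.

```python
import json


def build_indexer_definition(indexer_name="jfklabindexer",
                             data_source_name="jfklabds",
                             index_name="jfklabindex",
                             skillset_name="jfklabskillset"):
    """Return an illustrative indexer body linking source, index and skills."""
    return {
        "name": indexer_name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        "skillsetName": skillset_name,
        "parameters": {
            "configuration": {
                # Assumption: extract text plus metadata, and produce
                # normalized images so image skills have input to work on.
                "dataToExtract": "contentAndMetadata",
                "imageAction": "generateNormalizedImages",
            }
        },
        "outputFieldMappings": [],
    }


indexer = build_indexer_definition()
print(json.dumps(indexer, indent=2))
```

The empty `outputFieldMappings` list is where enriched values are later mapped onto index fields.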
jfk-lab-search-service.
$top=10&$count=true.
JFK&$count=true.
[!NOTE] Use CTRL+F to search through the document content if you can't find the content that matches your query.
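The query strings used above are plain URL parameters on the index's `docs` endpoint. A minimal sketch of how such a URL could be assembled in Python follows; the API version is an assumption, and the service name is used here as a literal placeholder.

```python
from urllib.parse import urlencode

# Assumption: the preview API version in use when this lab was written.
API_VERSION = "2017-11-11-Preview"


def build_search_url(service_name, index_name, search_text=None,
                     top=None, count=True):
    """Build a GET URL for querying documents in an index."""
    params = {"api-version": API_VERSION}
    if search_text is not None:
        params["search"] = search_text
    if top is not None:
        params["$top"] = top
    if count:
        params["$count"] = "true"
    # Keep "$" and "*" readable instead of percent-encoding them.
    query = urlencode(params, safe="$*")
    return (f"https://{service_name}.search.windows.net/indexes/"
            f"{index_name}/docs?{query}")


browse_url = build_search_url("jfk-lab-search-service", "jfklabindex", top=10)
search_url = build_search_url("jfk-lab-search-service", "jfklabindex", "JFK")
print(browse_url)
print(search_url)
```

The request also needs an `api-key` header carrying an admin or query key.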
In this module, you'll learn how to create a custom skill that can be plugged into the Cognitive Search pipeline.
Building a custom skill gives you a way to insert specific transformations to your content and apply whatever enrichment process you require.
In this example we will create a custom skill that annotates documents that contain CIA "Cryptonym" code words. For example, the CIA assigned the cryptonym GPFLOOR to Lee Harvey Oswald, so any document containing that cryptonym will be linked with Oswald.
The only requirement for a skill is the ability to accept inputs and emit outputs. Currently, the only mechanism for interacting with a custom skill is through a Web API interface. Although this example uses an Azure Function to host a web API, it is not required as long as you meet the interface requirements for a cognitive skill. Click here for more information.
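The input/output contract can be illustrated without Azure at all. The Python sketch below mimics what a cryptonym-linking skill does with a batch of records: it receives a `values` array, and returns one output record per input `recordId`. The single-entry lookup table is a stand-in; the lab's function loads its table from a JSON file.

```python
# Stand-in lookup table with the one cryptonym mentioned in this lab.
CRYPTONYMS = {"GPFLOOR": "Lee Harvey Oswald"}


def link_cryptonyms(request_body):
    """Process a custom-skill request and return the response payload."""
    response = {"values": []}
    for record in request_body.get("values", []):
        word = record.get("data", {}).get("word")
        data = {}
        # Only all-uppercase words that appear in the table are annotated.
        if (isinstance(word, str)
                and all(ch.isupper() for ch in word)
                and word in CRYPTONYMS):
            data["cryptonym"] = {"value": word,
                                 "description": CRYPTONYMS[word]}
        response["values"].append({
            "recordId": record["recordId"],  # must echo the input id
            "data": data,
            "errors": [],
            "warnings": [],
        })
    return response


sample_request = {"values": [
    {"recordId": "1", "data": {"word": "GPFLOOR"}},
    {"recordId": "2", "data": {"word": "hello"}},
]}
result = link_cryptonyms(sample_request)
print(result)
```

Each output record keeps the `recordId` of its input so the pipeline can match enrichments back to the right node in the document.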
JfkWebApiSkills\JfkWebApiSkills.sln.
JfkWebApiSkills.
[FunctionName("link-cryptonyms")]
public static IActionResult RunCryptonymLinker(
    [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req,
    TraceWriter log,
    ExecutionContext executionContext)
{
}
string skillName = executionContext.FunctionName;

// Get the batch of input records from the request
var requestRecords = WebApiSkillHelpers.GetRequestRecords(req);
if (requestRecords == null)
{
    return new BadRequestObjectResult($"{skillName} - Invalid request record array.");
}

// Process each record and set the cryptonym on the output if found
WebApiSkillResponse response = WebApiSkillHelpers.ProcessRequestRecords(skillName, requestRecords,
    (inRecord, outRecord) => {
        // The cast yields null for a non-string value, so check before calling All()
        string word = inRecord.Data["word"] as string;
        if (!string.IsNullOrEmpty(word) && word.All(Char.IsUpper) && cryptonymLinker.Cryptonyms.TryGetValue(word, out string description))
        {
            outRecord.Data["cryptonym"] = new { value = word, description };
        }
        return outRecord;
    });

return (ActionResult)new OkObjectResult(response);
[!NOTE] The ProcessRequestRecords method sets the description of each cryptonym; it reads the values from the JSON file CryptonymLinker\cia-cryptonyms.json. Open this file to see the list of available cryptonyms.
For the purposes of our demo, we'll be deploying directly from Visual Studio.
[!ALERT] Ensure you are signed in with the same credentials you used to sign in to Azure. This will connect Visual Studio to your Azure subscription.
Release.
jfk-lab-function-app.
[!NOTE] If you are prompted to update the Functions Version on Azure, click Yes.
jfk-lab-function-app.
[!NOTE] All Azure Functions created after June 30th, 2018 have TLS 1.0 disabled, and this configuration is not currently compatible with custom skills.
[!ALERT] TLS 1.2 functions are not yet supported as custom skills.
Click on the Overview option.
Click on Configuration.
Click on + New Application Settings. Verify and add the following setting:
MSDEPLOY_RENAME_LOCKED_FILES.
1.
Scroll to the top and click the Save button.
Return to Postman.
Click the Test Custom Skill request from the collection in Postman.
Review the request url and replace the following values:
AZURE FUNCTION SITE NAME value from the initial Deployment Output.
Click on Body and check the content; we'll be sending 2 cryptonyms to the function.
Click on Send and wait for the response. For a successful request, you should see status code 200 OK.
Check the descriptions for each cryptonym in the response body.
jfklabskillset.
{
    "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
    "description": "Cryptonym linker",
    "uri": "https://[function_app_name].azurewebsites.net/api/link-cryptonyms?code=[default_host_key]",
    "context": "/document/normalized_images/*/layoutText/words/*/text",
    "inputs": [
        {
            "name": "word",
            "source": "/document/normalized_images/*/layoutText/words/*/text"
        }
    ],
    "outputs": [
        {
            "name": "cryptonym",
            "targetName": "cryptonym"
        }
    ]
}
[!NOTE] Replace the [function_app_name] and [default_host_key] with the values used in the previous section.
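If you prefer to generate the skill definition rather than edit the placeholders by hand, a small Python helper can fill them in. This is only a sketch: the function app name and key below are made-up values, and URL-encoding the key is a precaution for keys containing characters such as `=` or `/`.

```python
import json
from urllib.parse import quote

WORD_PATH = "/document/normalized_images/*/layoutText/words/*/text"


def build_cryptonym_skill(function_app_name, host_key):
    """Return the custom skill definition with placeholders filled in."""
    uri = (f"https://{function_app_name}.azurewebsites.net"
           f"/api/link-cryptonyms?code={quote(host_key, safe='')}")
    return {
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "description": "Cryptonym linker",
        "uri": uri,
        "context": WORD_PATH,
        "inputs": [{"name": "word", "source": WORD_PATH}],
        "outputs": [{"name": "cryptonym", "targetName": "cryptonym"}],
    }


# Hypothetical values for illustration only.
skill = build_cryptonym_skill("jfk-lab-function-app", "abc123==")
print(json.dumps(skill, indent=2))
```

Because the skill's `context` is the individual word, the function is invoked with one record per recognized word.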
204 No Content.
jfklabskillset.
Include the cryptonyms field in the Indexer:
outputFieldMappings section:
{
    "sourceFieldName": "/document/normalized_images/*/layoutText/words/*/text/cryptonym/value",
    "targetFieldName": "cryptonyms"
}
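The same change can be expressed as a small transformation on the indexer definition. The Python sketch below adds the mapping only if it is not already present, so running it twice does not create duplicates; the starting indexer body is a trimmed, hypothetical example.

```python
def add_output_field_mapping(indexer, source_field, target_field):
    """Add a mapping to the indexer definition unless it already exists."""
    mapping = {"sourceFieldName": source_field,
               "targetFieldName": target_field}
    mappings = indexer.setdefault("outputFieldMappings", [])
    if mapping not in mappings:
        mappings.append(mapping)
    return indexer


CRYPTONYM_SOURCE = ("/document/normalized_images/*/layoutText/words/*/text"
                    "/cryptonym/value")

# Trimmed, hypothetical indexer body for illustration.
indexer = {"name": "jfklabindexer", "outputFieldMappings": []}
add_output_field_mapping(indexer, CRYPTONYM_SOURCE, "cryptonyms")
add_output_field_mapping(indexer, CRYPTONYM_SOURCE, "cryptonyms")  # no-op
print(indexer["outputFieldMappings"])
```

Note how the source path extends the skill's context with the `cryptonym` output name and then selects its `value` property.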
We have to re-run the indexer to apply the new skill to the source documents.
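For reference, re-running an indexer over documents it has already processed usually involves a reset followed by a run, after which the status endpoint can be polled. The Python sketch below only lists those REST calls; the API version and the service name are assumptions.

```python
# Assumption: the preview API version in use when this lab was written.
API_VERSION = "2017-11-11-Preview"


def indexer_rerun_requests(service_name, indexer_name="jfklabindexer"):
    """Return the (method, url) pairs needed to reprocess all documents."""
    base = (f"https://{service_name}.search.windows.net"
            f"/indexers/{indexer_name}")
    suffix = f"?api-version={API_VERSION}"
    return [
        ("POST", f"{base}/reset{suffix}"),  # forget change-tracking state
        ("POST", f"{base}/run{suffix}"),    # start a new indexing run
        ("GET", f"{base}/status{suffix}"),  # poll until the run completes
    ]


requests_to_send = indexer_rerun_requests("my-search-service")
for method, url in requests_to_send:
    print(method, url)
```

Without the reset, an indexer typically skips blobs that have not changed since the last run, so the new skill would not be applied to them.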
jfk-lab-search-service.
jfk-lab-search-service.
WILSON&$count=true.
cryptonyms field, you will see the AM cryptonym in the results.
In this module, you'll create a more advanced pipeline, using the custom skill from the previous module and including more cognitive skills in the pipeline.
Let's take a look at our source data and see which data is not being extracted:
resources\documents\104-10013-10234.pdf.
FBI.
resources\documents\photo_oswald.jpeg.
oswald.
We'll use a query key to query the Search service from the front-end. Query keys grant read-only access to indexes and documents, and are typically distributed to client applications that issue search requests. In the next steps we'll create a new query key. You can create up to 50 query keys per service.
Follow this link to create the query key.
Log in with the same credentials you used to sign in to Azure.
Review the request url and provide the following parameters:
my_query_key.
Click on Run.
Check the response body, copy the key value as you will need it later.
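For orientation, the request behind that page is a call to the Azure management API. The Python sketch below shows how its URL is likely composed; the management API version and all identifiers passed in are assumptions, so treat the linked page as the authority.

```python
from urllib.parse import quote

# Assumption: management API version for Microsoft.Search at the time.
MANAGEMENT_API_VERSION = "2015-08-19"


def build_create_query_key_url(subscription_id, resource_group,
                               service_name, key_name="my_query_key"):
    """Build the POST URL that creates a named query key."""
    return (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Search/searchServices/{service_name}"
        f"/createQueryKey/{quote(key_name, safe='')}"
        f"?api-version={MANAGEMENT_API_VERSION}"
    )


def query_headers(query_key):
    """Headers a read-only client would send with search requests."""
    return {"api-key": query_key, "Accept": "application/json"}


# Hypothetical identifiers for illustration only.
url = build_create_query_key_url(
    "00000000-0000-0000-0000-000000000000", "ai-labs-xy", "my-search-service")
print(url)
print(query_headers("<query-key>"))
```

Unlike admin keys, a query key sent this way can only read documents, which is why it is safe to embed in the front-end.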
In order to speed up the lab process we'll use a console app to recreate the different components in the pipeline. This will include more cognitive skills like handwriting recognition and image analysis for categories, faces, image type, adult content and others.
Return to Visual Studio.
Open the file App.config from the JfkInitializer project.
Add the configuration values, using the values obtained in previous steps.
Open the Program.cs file.
Go to the CreateAdvancedPipelineAsync method.
Check the list of components being re-created.
Let's implement the synonyms map. Go to the method CreateSynonyms.
Add the following code snippet where indicated:
try
{
    // Each line of the Solr-format string is one group of equivalent terms
    SynonymMap synonyms = new SynonymMap(SynonymMapName, SynonymMapFormat.Solr,
@"GPFLOOR,oswold,ozwald,ozwold,oswald
silvia, sylvia
sever, SERVE, SERVR, SERVER
novenko, nosenko, novenco, nosenko");
    await _searchClient.SynonymMaps.CreateAsync(synonyms);
}
catch (Exception ex)
{
    Console.WriteLine("Error creating synonym map: {0}", ex.Message);
    return false;
}
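To see what those rules mean, the following Python sketch parses the same rule text into groups of equivalent terms and looks up one misspelling. It is a simplified model, not the service's implementation: it lowercases every term and handles only comma-separated equivalence rules, not explicit `=>` mappings.

```python
def parse_solr_synonyms(rules):
    """Map each lowercased term to the set of terms equivalent to it."""
    lookup = {}
    for line in rules.splitlines():
        terms = {term.strip().lower()
                 for term in line.split(",") if term.strip()}
        for term in terms:
            lookup.setdefault(term, set()).update(terms)
    return lookup


RULES = """GPFLOOR,oswold,ozwald,ozwold,oswald
silvia, sylvia
sever, SERVE, SERVR, SERVER
novenko, nosenko, novenco, nosenko"""

lookup = parse_solr_synonyms(RULES)
# A query for the misspelling expands to every term in its group.
print(sorted(lookup["ozwold"]))
```

Because all terms on a line are treated as equivalent, a search for any one of them can match documents containing any of the others, including the cryptonym itself.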
JfkInitializer project and click Debug > Start new instance.
[!NOTE] Click Continue Debugging (Don't ask again) if prompted.
Open the file explorer in the lab materials directory and replace the following files:
resources\advanced-pipeline\JfkWebApiSkills.cs to src\JfkWebApiSkills\JfkWebApiSkills.
resources\advanced-pipeline\cia-cryptonyms.json to src\JfkWebApiSkills\JfkWebApiSkills\CryptonymLinker.
Replace the file and open it in Visual Studio. The file contains the following functions:
Right-click the JfkWebApiSkills project.
Click Rebuild and wait for it to finish.
Right-click the JfkWebApiSkills project.
Click Publish.
Click Publish again.
Review the deployment result in the Output window and wait for the operation to complete.
[!NOTE] If your deployment fails due to ERROR_FILE_IN_USE, return to Azure and open your function app resource, click Restart from the Overview page and try the deployment again.
Go to the Main method.
Replace the following line:
bool result = CreateAdvancedPipelineAsync().GetAwaiter().GetResult();
with
bool result = DeployFrontEndAsync().GetAwaiter().GetResult();
Right click on the JfkInitializer project and click Debug > Start new instance.
A Console App will open. Wait until it asks you to build the Website.
Open a Terminal and navigate to the frontend directory in your lab folder.
Execute the following commands:
npm install
npm run build:prod
[!NOTE] Check that a dist folder was created in your frontend directory.
Return to the Console App and press any key to continue.
Wait for the deployment to complete. This might take a few seconds.
Copy the Website url and press any key to finish.
ozwold.
[!NOTE] Notice how the synonym map allows the misspelled word ozwold to match oswald.
FBI.