10.10.2022 | Blog
Find hidden data treasures with file share analysis
Whether manuals, data sheets or presentations: in many companies, more and more data accumulates on the servers. The metadata quality of these documents is not always optimal, and there are often breaks in the access rights to the data. On file shares in particular, duplicates and outdated documents often clutter the storage. Typically, employees still need to know the exact name of a particular document and where it is stored. If the file has been moved to a different location, or is not correctly named or tagged with appropriate metadata, it may not be found at all. As a result, valuable knowledge is lost, or the wheel is reinvented because employees are not aware that information on a certain topic already exists.
An additional challenge arises when data must be migrated to a different IT architecture. The IT administrator often has to migrate only the data that is actually required, or that may be migrated under data protection guidelines. This is particularly important when data is migrated from corporate networks to a cloud-based data lake. Duplicates have to be removed in the process so that files are not moved twice unnecessarily. At the same time, existing access rights and roles must be carefully preserved on the target platform, and personal or otherwise sensitive content must be identified.
File share analysis means improving data quality
To figure out which data should be migrated and which can be deleted, IT first needs to know which data exists and where it is stored. A deep analysis of the file shares captures all data and provides three key insights in particular:
1. Breaks in the access rights concept. Cutting and pasting folders with many attached files, a quite common means of migrating departmental data, often results in breaks in access rights on the file share. The problem is that folders with missing permissions are not displayed in File Explorer, while documents stored inside these folders often do not have the same restrictions (i.e., the user has access rights to them). A challenge in migration and enterprise search projects is to display these documents correctly in the hit list based on the actual file permissions, even when the folder in which the file is stored is not visible in File Explorer due to missing access rights.
2. Metadata analysis. This shows, among other things, how often data is accessed and how frequently each file format occurs in each area. The age of the content can also be analyzed.
3. Content analysis. The objective is to extract essential content using content analytics methods. Directory paths are analyzed for top topics and important metadata. Within the documents, important keywords, product information, employee or other personal names, locations, and company names are recognized, collected, and displayed on the analysis platform.
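To make the metadata analysis from point 2 more concrete, here is a minimal sketch in Python. It is not how iFinder works internally. It only assumes that a crawl has already produced (path, last-modified) pairs, treats the first path segment as the "area", and uses arbitrarily chosen age buckets. The function name `summarize` and the sample paths are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from pathlib import PurePosixPath


def summarize(files, now):
    """Count files per (area, extension) and per age bucket.

    `files` is an iterable of (path, last_modified) pairs. Paths use "/"
    separators; the first path segment is treated as the area (assumption).
    """
    by_area_and_type = Counter()
    by_age = Counter()
    for path, modified in files:
        p = PurePosixPath(path)
        area = p.parts[0] if len(p.parts) > 1 else "(root)"
        ext = p.suffix.lower() or "(none)"
        by_area_and_type[(area, ext)] += 1

        # Age buckets are illustrative; real retention rules will differ.
        age = now - modified
        if age <= timedelta(days=365):
            by_age["up to 1 year"] += 1
        elif age <= timedelta(days=5 * 365):
            by_age["1 to 5 years"] += 1
        else:
            by_age["over 5 years"] += 1
    return by_area_and_type, by_age


sample = [
    ("sales/offer.pptx", datetime(2022, 9, 1)),
    ("sales/old/offer_2015.PPTX", datetime(2015, 3, 1)),
    ("hr/policy.pdf", datetime(2020, 1, 15)),
]
types, ages = summarize(sample, now=datetime(2022, 10, 10))
```

Counts like these are the kind of input a dashboard can chart to show which areas hold which formats and how much of the content is old.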
The benefit for companies
With the detailed analysis of the file share data described above, a security manager can create trust by migrating the data cleanly and in compliance with security requirements such as integrated rights concepts.
Equally important is the content analysis. Which documents must be removed from the storage systems in accordance with legal or company-specific deletion deadlines and compliance guidelines? Which documents need to be considered separately because there are defined workflows or target systems in place in the company?
The company further benefits from file share analysis in the following ways:
- More security: Missing rights concepts are recognized and can be added later.
- Better data quality: Duplicates are detected and can be cleaned up. Data that is outdated or has not been used for a long time can be deleted in a separate process.
- More knowledge: Documents are tagged with metadata so that users know what content is in the data. Hidden knowledge is uncovered and made available to a large, authorized group of users.
- Lower costs: Finally, data cleansing can save storage space and costs for additional servers. The volume of data to be migrated can be drastically reduced because of the knowledge gained.
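The duplicate detection mentioned in the list above can be illustrated with a small Python sketch. It assumes that "duplicate" means byte-identical content and groups files by their SHA-256 hash. This is a common generic technique, not a description of iFinder's implementation; the function name `find_duplicates` and the sample data are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicates(documents):
    """Group document paths that share byte-identical content.

    `documents` maps a path to the file content as bytes. Returns a sorted
    list of path groups; only groups with more than one path are included.
    """
    groups = defaultdict(list)
    for path, content in documents.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(path)
    return sorted(sorted(paths) for paths in groups.values() if len(paths) > 1)


docs = {
    "a/manual.pdf": b"manual v1",
    "b/copy of manual.pdf": b"manual v1",
    "c/datasheet.pdf": b"datasheet",
}
duplicates = find_duplicates(docs)
```

A hash comparison only finds exact copies. Documents with the same text but a different file format or minor edits would need a content-based similarity measure instead.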
File share analysis with iFinder and Kibana
Once all data sources to be analyzed have been captured, the next step is to index them with iFinder and make them searchable. iFinder is an enterprise search solution for company-wide searches in structured and unstructured data. Kibana is a visualization platform from Elastic that complements the iFinder search interface with flexible reporting functions. In addition, iFinder has numerous content analytics features. During indexing, it extracts metadata such as author, date, or title from documents and shows them in the hit list. If a document contains keywords, products, people, organizations, or locations, iFinder recognizes them automatically. This information is indexed, stored as metadata, and made searchable for analysts.
Search queries can be filtered quickly and easily. For example, searching for the term "confidential" can return a huge list of matching documents. Applying search filters such as file type "PowerPoint", author "IntraFind" and timeline "last month" reduces the hit list from a large database to a manageable number of documents. With only three clicks, all PowerPoint presentations by IntraFind from the past month that contain the term "confidential" are found.
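For readers who think in queries rather than clicks, the filtered search above roughly corresponds to a full-text match combined with exact filters. The following Python sketch builds such a request in the style of the Elasticsearch Query DSL (a `bool` query with `must` and `filter` clauses). The field names `content`, `file_type`, `author` and `modified` are assumptions for illustration, and the article does not say what iFinder's own query interface looks like.

```python
import json


def build_query(term, file_type, author, since):
    """Build an Elasticsearch-style bool query: full-text match plus filters.

    All field names are hypothetical; a real index may use different ones.
    """
    return {
        "query": {
            "bool": {
                # Full-text search for the term in the document body.
                "must": [{"match": {"content": term}}],
                # Exact filters narrow the hit list without affecting scoring.
                "filter": [
                    {"term": {"file_type": file_type}},
                    {"term": {"author": author}},
                    {"range": {"modified": {"gte": since}}},
                ],
            }
        }
    }


# "now-1M/d" is Elasticsearch date math for "one month ago, rounded to the day".
query = build_query("confidential", "PowerPoint", "IntraFind", "now-1M/d")
print(json.dumps(query, indent=2))
```

Each of the three clicks in the user interface maps to one entry in the `filter` list of this sketch.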
In another view of the analysis platform, the gathered metadata is summarized in reports using Kibana-based dashboards. Reports can be redefined quickly and easily and reused by a group of analysts.
This is how iFinder supports file share analysis:
- Autocomplete: with typo correction and "did you mean" suggestions.
- Preview: Preview function for more than 600 file formats and highlighting of the search term matches in the large preview of the iFinder user interface.
- Metadata generation: Using automatic keywording, a standard iFinder functionality, missing metadata can be generated from the unstructured full texts, either automatically or in a quality-assured process, to enrich the content. The system generates the following metadata: top keywords, proper names of persons and companies, company-specific entities such as product names or department names, and topic affiliations. Relationships between entities can also be detected and added as metadata.
- Identify rights violations: During indexing, the documents that have different rights than the folder in which they are stored are provided with corresponding metadata flags. In the hit list, these are then color-coded accordingly and can also be filtered. In this way, every user is also made aware of possible rights violations.
- Duplicate check: Filtering for documents with the same content or for identical documents.
- Find similar documents: Based on the keywords and top terms in a found document, documents with similar content are found in the database.
- Storage of complete search queries, support of collaboration functionalities
- Role concepts: iFinder uses rights and role concepts to control different views of document content. For example, an employee from the legal department can see any content regardless of the document rights. Business administrators can be granted access to their "own data" only, and IT administrators can conceivably see the metadata of all documents that the system captures.
- Scalability: iFinder is based on Elasticsearch technology and sets no limit on the amount of data that can be captured. Several billion data records are no hurdle for iFinder.
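The rights check from the list above, which flags documents whose rights differ from those of their folder, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: permissions are modeled as plain sets of group names with read access, and the function name `flag_rights_mismatches` and the flag fields are hypothetical. Real file share ACLs (inheritance, deny entries, nested groups) are considerably more complex.

```python
from pathlib import PurePosixPath


def flag_rights_mismatches(folder_acls, document_acls):
    """Return metadata flags for documents whose readers differ from their folder's.

    Both arguments map a "/"-separated path to a set of principals with
    read access. Documents whose folder is unknown are skipped.
    """
    flagged = {}
    for path, allowed in document_acls.items():
        folder = str(PurePosixPath(path).parent)
        folder_allowed = folder_acls.get(folder)
        if folder_allowed is None:
            continue
        if set(allowed) != set(folder_allowed):
            flagged[path] = {
                "rights_mismatch": True,
                # Principals who can read the file but not see the folder.
                "extra_readers": sorted(set(allowed) - set(folder_allowed)),
            }
    return flagged


folders = {"hr/salaries": {"hr-team"}}
documents = {
    "hr/salaries/2021.xlsx": {"hr-team"},
    "hr/salaries/2022.xlsx": {"hr-team", "all-staff"},
}
flags = flag_rights_mismatches(folders, documents)
```

In this example only `hr/salaries/2022.xlsx` is flagged, because `all-staff` can read the file although the folder itself is restricted to `hr-team`.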
Content analysis is just as useful for data in any other source system. Across data sources, too, it is an important tool for gaining reliable insight into data content and data structure in many application scenarios.
Learn more about AI-based document analysis here.
Ask us about successful project examples at our customers!