
MCP Server
BillChan226
public
ShieldAgent
一个基于MCP的工具,用于从PDF、HTML和TXT文档中提取策略并生成结构化风险分类。
Repository Info
1
Stars
0
Forks
1
Watchers
1
Issues
C
Language
MIT License
License
About This Server
一个基于MCP的工具,用于从PDF、HTML和TXT文档中提取策略并生成结构化风险分类。
Model Context Protocol (MCP) - This server can be integrated with AI applications to provide additional context and capabilities, enabling enhanced AI interactions and functionality.
Documentation
# ShieldAgent
## Automatic Policy Extraction Pipeline
Based on MCP, this agent extracts policies and rules from **PDF**, **HTML**, and **TXT** documents and produce structured risk categories governed by corresponding rule definitions.
> `deep policy exploration` feature: Designed for automatically exploring comprehensive policies among lengthy PDF documents or distributed HTML websites via a priority-based search algorithm that explores document subsections in-depth.
## ✨ Features
- 📚 Supports multiple document types (PDF, HTML, TXT)
- 🔍 Deep exploration of document sections and links via tree-search
- ⚙️ Prioritizes sections based on likelihood of containing target policies
- 🔄 Parallel processing support for faster extraction
- 📊 Comprehensive output including document trees and visualizations
- 🏷️ Automatic rules extraction and risk categorization
## 🔧 Installation
```bash
# Clone the repository
git clone https://github.com/BillChan226/ShieldAgent.git
cd ShieldAgent
# Install required dependencies
pip install -r requirements.txt
```
## 📋 Usage
```bash
python policy_extractor_async.py [OPTIONS]
```
### Required Arguments
| Argument | Short | Description |
|----------|-------|-------------|
| `--document-path` | `-d` | Path to document file (PDF, TXT) or URL (HTML) |
| `--organization` | `-org` | Name of the organization whose policies are being extracted |
| `--input-type` | `-t` | `pdf` | Type of input document (`pdf`, `html`, or `txt`) |
### Optional Arguments
| Argument | Short | Default | Description |
|----------|-------|---------|-------------|
| `--organization-description` | `-org-desc` | `""` | Description of the organization |
| `--target-subject` | `-ts` | `User` | Target subject of the policies (e.g., "User", "Customer") |
| `--initial-page-range` | `-ipr` | `1-5` | Initial page range to extract from PDF |
| `--deep-policy` | `-dp` | `False` | Flag to automatically explore policy documents in-depth |
| `--output-dir` | `-o` | `./output/deep_policy` | Directory to save output files |
| `--debug` | | `False` | Enable debug mode |
| `--user-request` | `-u` | `""` | Specific user request to guide the policy extraction |
| `--async-num` | `-a` | `1` | Number of policy extraction tasks to run in parallel (1-3) |
| `--extract-rules` | `-er` | `False` | Extract concrete rules from policies after extraction |
## 📄 Input Types
The tool supports three input types:
### 📑 PDF Files (`-t pdf`)
- Supports page ranges with `-ipr` (e.g., "1-5" for pages 1 through 5)
- Automatically extracts text from specified pages
- Useful for official policy documents, terms of service, etc.
### 🌐 HTML/Web Pages (`-t html`)
- Extracts content from web pages
- Can follow links for deep policy extraction when `-dp` is specified
- Ideal for online policy repositories and terms of service pages
### 📝 Text Files (`-t txt`)
- Directly processes plain text files
- Simple and fastest processing option
- Good for pre-extracted content or manually curated policy text
## 🚀 Example Use Cases
### 1️⃣ Extract Policies from a Website (with Deep Exploration)
This example extracts policies from Reddit's rules page, follows links to related pages, and extracts rules targeting users:
```bash
python policy_extractor_async.py \
-d https://redditinc.com/policies/reddit-rules \
-t html \
-org "Reddit" \
-u "Focus on rules that target users and customers" \
-dp \
-er
```
### 2️⃣ Extract Policies from a PDF (Specific Pages)
This example extracts policies from the EU AI Act PDF, focusing on pages 2-8:
```bash
python policy_extractor_async.py \
-d ./policy_docs/eu_ai_act_art5.pdf \
-t pdf \
-org "EU AI ACT" \
-ipr "2-8" \
-er
```
### 3️⃣ Extract Policies from a Text File
This example extracts policies from a text file containing EU AI Act rules:
```bash
python policy_extractor_async.py \
-d ./eu_ai_act_rules/art_5.txt \
-t txt \
-org "EU AI ACT" \
-er
```
### 4️⃣ Process Multiple Sections in Parallel
For large documents, you can enable parallel processing of sections:
```bash
python policy_extractor_async.py \
-d large_policy_document.pdf \
-t pdf \
-org "Large Organization" \
-a 3 \
-er
```
## 📂 Output Files
The tool generates several output files in the specified output directory:
| File | Description |
|------|-------------|
| `{organization}_all_extracted_policies.json` | All extracted policies |
| `{organization}_all_extracted_rules.json` | Concrete rules extracted from policies |
| `{organization}_risk_categories.json` | Categorized rules with risk categories |
| `{organization}_document_tree.json` | Hierarchical representation of document sections with policies |
| `{organization}_extraction_report.json` | Detailed report of the extraction process |
| `{organization}_policy_rule_mapping.json` | Mapping between policies and rules |
## 🔍 Advanced Usage
### Custom Organization Description
Provide a detailed description of the organization to improve extraction quality:
```bash
python risk_extraction_doc/policy_extractor_async.py \
-d ./company_policies.pdf \
-org "Acme Corp" \
-org-desc "A multinational technology company specializing in AI services" \
-er
```
### Targeted Subject Extraction
Focus extraction on specific subjects within policies:
```bash
python policy_extractor_async.py \
-d platform_guidelines.pdf \
-org "Social Platform" \
-ts "Content Creator" \
-u "Focus on monetization policies" \
-er
```
## ⚠️ Troubleshooting
- If you encounter errors with PDF processing, ensure you have all dependencies installed
- For HTML extraction issues, check if the website allows scraping
- For large documents, consider using page ranges (`-ipr`) to process specific sections
- Enable debug mode (`--debug`) for detailed logging information
## 🔄 Processing Flow
1. **Document Loading**: The tool loads the document based on input type
2. **Content Extraction**: Text is extracted from the document
3. **Section Analysis**: The system analyzes sections for policy content
4. **Policy Extraction**: Policies are identified and extracted
5. **Deep Exploration**: If enabled, the system explores other related links or subsections
6. **Rules Extraction**: Concrete rules are extracted from policies
7. **Risk Categorization**: Rules are categorized by risk type
8. **Output Generation**: Results are saved to JSON files
## 📊 Example Output Structure
Policy structure:
```json
{
"policy_id": "p123",
"organization": "Example Org",
"policy_text": "Users must not share personal information of others without consent.",
"policy_source": "Privacy Policy, Section 3.2",
"extraction_timestamp": "2023-10-25T14:30:00"
}
```
Rule structure:
```json
{
"rule_id": "r456",
"source_policy_ids": ["p123"],
"rule_text": "Do not share another user's personal information without their explicit consent",
"risk_category": "Privacy Violation",
"severity": "High"
}
```
## 📚 Resources
- [EU AI Act](https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence)
- [Policy Extraction Techniques Paper](https://example.org/paper)
- [Related Work on Policy Analysis](https://example.org/work)
## 📄 License
[Your license information here]
Quick Start
1
Clone the repository
git clone https://github.com/BillChan226/ShieldAgent2
Install dependencies
cd ShieldAgent
npm install3
Follow the documentation
Check the repository's README.md file for specific installation and usage instructions.
Repository Details
OwnerBillChan226
RepoShieldAgent
Language
C
LicenseMIT License
Last fetched8/8/2025
Recommended MCP Servers
💬
Discord MCP
Enable AI assistants to seamlessly interact with Discord servers, channels, and messages.
integrationsdiscordchat
🔗
Knit MCP
Connect AI agents to 200+ SaaS applications and automate workflows.
integrationsautomationsaas
🕷️
Apify MCP Server
Deploy and interact with Apify actors for web scraping and data extraction.
apifycrawlerdata
🌐
BrowserStack MCP
BrowserStack MCP Server for automated testing across multiple browsers.
testingqabrowsers
⚡
Zapier MCP
A Zapier server that provides automation capabilities for various apps.
zapierautomation