
Completed
Posted
Paid on delivery
I have roughly 5,000 DEF 14A proxy statements in HTML format and I need the key compensation details for each named executive pulled out and placed into a clean, structured file. The fields I must end up with are: base salary, stock options and awards, bonuses / incentive pay, plus any other compensation figures that appear in the summary or grants tables.

Because the data are scattered in both narrative text blocks and embedded HTML tables, a purely scripted scrape misses too much, while a purely manual effort would be too slow. I’m therefore looking for a balanced workflow that blends solid Python-based parsing (BeautifulSoup, pandas, regex, maybe an LLM call for tricky passages) with targeted human review to catch formatting quirks and footnotes.

Deliverables
• A single CSV or Excel file where each row is a firm-year filing and each column holds one of the compensation items above, clearly labeled.
• A short read-me describing the extraction logic, any LLM prompts used, and the quality-control steps you applied.
• A reproducible script or notebook so I can rerun the pipeline on future filings.

Acceptance criteria
• ≥ 95% of filings processed; missing cases flagged with reasons.
• Random audit of 50 filings must show ≤ 5% field-level error rate.
• Output passes numeric sanity checks (e.g., no negative salaries, totals match table footings when provided).

If you have experience parsing SEC filings or have already built hybrid scraping/LLM solutions, that will help you move quickly. Let me know how you plan to split automation versus manual review, which tools or models you prefer, and your estimated turnaround time for the full 5,000-file set.
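The numeric sanity checks named in the acceptance criteria could look roughly like the sketch below. The column names (`salary`, `bonus`, `stock_awards`, `option_awards`, `other_comp`, `total`) and the one-dollar rounding tolerance are assumptions for illustration; the posting does not fix a schema, and real Summary Compensation Tables often carry additional columns that would need to be added before the footing check is meaningful.

```python
# Hypothetical column names; the posting does not define a schema.
COMPONENTS = ["salary", "bonus", "stock_awards", "option_awards", "other_comp"]


def sanity_flags(row: dict, tolerance: float = 1.0) -> list:
    """Return human-readable flags for one extracted row (empty list = clean)."""
    flags = []
    for name in COMPONENTS:
        value = row.get(name)
        if value is None:
            flags.append(f"{name}: missing")
        elif value < 0:
            flags.append(f"{name}: negative value {value}")

    # Footing check: only run when the table reported a total and every
    # component was extracted, otherwise the comparison is not meaningful.
    reported_total = row.get("total")
    present = [row[name] for name in COMPONENTS if row.get(name) is not None]
    if reported_total is not None and len(present) == len(COMPONENTS):
        computed = sum(present)
        if abs(computed - reported_total) > tolerance:
            flags.append(
                f"total: components sum to {computed}, table reports {reported_total}"
            )
    return flags
```

Rows that come back with a non-empty list would be candidates for the targeted human review rather than being dropped, which keeps the "missing cases flagged with reasons" requirement visible in the output.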
Project ID: 40297759
143 proposals
Remote project
Active 1 mo ago

- CIK-based acquisition of the `DEF 14A` available for each fiscal year
- SEC-aware HTML parsing into typed text, heading, and table blocks
- XBRL-aware compensation table handling and wide-header detection
- deterministic Summary Compensation Table extraction before any LLM fallback
- clean split of executive name vs title from combined cells
- role assignment using the most recent fiscal year for each executive
- Create the compensation table by CIK for each fiscal year for the companies specified
- Do manual QA on the filing extraction
- Incorporate any needed LLM logic for complicated passages to ensure a comprehensive extraction
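The "clean split of executive name vs title from combined cells" step in the list above might be approached with a keyword heuristic like the following. This is a sketch under assumptions: the function name `split_name_title` and the title keyword list are illustrative and not taken from the bid, and a real pipeline would need a longer list plus a review queue for cells where no keyword is found.

```python
import re

# Illustrative, case-sensitive list of words that usually start a title.
TITLE_START = re.compile(
    r"\b(Chief|President|Chairman|Chairwoman|Chair|Executive|Senior|Vice|"
    r"EVP|SVP|CEO|CFO|COO|General Counsel|Treasurer|Secretary|Former)\b"
)


def split_name_title(cell: str):
    """Split 'Jane Doe Chief Executive Officer' into (name, title).

    Returns (text, None) when no title keyword is found, so the caller
    can route the cell to manual review instead of guessing.
    """
    # Collapse line breaks, non-breaking spaces and runs of whitespace.
    text = re.sub(r"\s+", " ", cell.replace("\xa0", " ")).strip()
    match = TITLE_START.search(text)
    if match is None or match.start() == 0:
        return text, None
    name = text[: match.start()].rstrip(" ,;:-(")
    title = text[match.start():].strip()
    return name, title
```

For example, `split_name_title("John Smith, Jr., EVP and CFO")` separates the name (including its suffix) from the title at the first keyword.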
€2,072 EUR in 15 days
0.0
143 freelancers are bidding on average €1,070 EUR for this job

Hello, I understand you need a hybrid scraping workflow for 5,000 DEF 14A statements that reliably extracts base salary, stock options and awards, bonuses/incentives, and any other compensation figures from both narrative text and HTML tables. I will build a Python-based pipeline using BeautifulSoup, pandas, and regex to pull structured data, with a scoped LLM pass for tricky passages. A targeted human-review layer will catch formatting quirks and footnotes, ensuring accuracy before final delivery. Deliverables include a single CSV/Excel where each row is a firm-year filing and each column is a field, a concise read-me with the extraction logic and any prompts used, and a reproducible notebook or script to rerun the pipeline on future filings. The workflow will balance automation and manual checks to meet the acceptance criteria, plus clear flags for any missing cases. I’ll outline a reproducible setup, from environment to run commands, so you can reproduce results on the full set. What is your preferred balance between automation and human review for edge cases, and which environments should I support (Windows/Linux/macOS)? Which file naming and storage convention do you want for the output and artifacts? Do you have any preferred LLM provider or on-premise setup for the tricky passages? Are there any known file-format quirks in your 5,000 filings that I should anticipate? Best regards,
€1,500 EUR in 26 days
9.2

Hello, As a dedicated team at Live Experts, my colleagues and I can provide the nuanced balance you require for your project. Our skills in Python programming, the BeautifulSoup and pandas libraries, and regex, combined with our ability to handle complex passages through LLM calls, make us stand out for this task. Having previously dealt with similar projects such as parsing SEC filings, we know how important it is to blend automation with human review to tackle any formatting quirks and footnotes. Moreover, our proficiency in data analysis ensures that we can handle large data sets, execute numeric sanity checks, and deliver high-quality structured files. The various tools we're skilled in will enable us to ensure your data processing demands are met efficiently. Additionally, our substantial experience in creating reproducible scripts will ensure you are able to repeat the process seamlessly, an important requirement for long-term efficiency. Choosing us guarantees not only technical expertise but also commitment towards ensuring client satisfaction in every project we undertake. With the talented professionals at Live Experts working together on your project, expect an efficient approach with automated elements and requisite manual review aimed at maintaining quality. We are confident in exceeding your expectations with punctuality and precision by employing the most effective models and tools for your needs. Thanks!
€1,500 EUR in 4 days
8.3

Hello I have several years of experience with Python and BeautifulSoup, pandas, regex and HTML parsing. I have completed a lot of similar projects. I
€750 EUR in 8 days
8.1

Hi Please review these past projects on SEC filings: 1) https://www.freelancer.com/projects/web-scraping/text-search-https-www-sec/details 2) https://www.freelancer.com/projects/web-scraping/SEC-Edgar-Data-Scraper-Email/details 3) https://www.freelancer.com/projects/python/Edgar-data-collection/details I have experience in parsing data from SEC filings, and can develop a Python script (integrated with LLM prompts) to extract compensation details from 5,000 DEF 14A filings into structured CSV/Excel format. I'm available to discuss details in chat and can start right away. Abdul H.
€750 EUR in 2 days
7.8

⭐⭐⭐⭐⭐ Extract Key Compensation Details from SEC Proxy Statements ❇️ Hi My Friend, I hope you are doing well. I just reviewed your project requirements and see you are looking for help with extracting key compensation details from proxy statements. You don't need to look any further because Zohaib is here to assist you! My team has successfully completed over 50 similar projects for data extraction and analysis. I will use a balanced approach, combining Python-based parsing with human review to ensure accuracy and efficiency. ➡️ Why Me? I can easily extract the required compensation details from your 5,000 DEF 14A proxy statements as I have 5 years of experience in data extraction, web scraping, and Python programming. My expertise includes working with BeautifulSoup, pandas, and regex for effective data parsing. Additionally, I have a strong grip on quality control processes to ensure the output meets your standards. ➡️ Let's have a quick chat to discuss your project in detail and I can showcase samples of my previous work. Looking forward to it! ➡️ Skills & Experience: ✅ Python Programming ✅ Data Extraction ✅ Web Scraping ✅ BeautifulSoup ✅ Pandas ✅ Regex ✅ Data Analysis ✅ Quality Control ✅ LLM Integration ✅ CSV/Excel Processing ✅ Automation ✅ Human Review Waiting for your response! Best Regards, Zohaib
€900 EUR in 2 days
8.1

I possess extensive experience in web and mobile development, including expertise in parsing and extracting data efficiently. I understand the challenge you face with the hybrid scraper needed for DEF 14A statements, requiring a balanced approach of Python-based parsing and targeted human review. In past projects, I have successfully worked on parsing complex data sets, including SEC filings, using tools like BeautifulSoup and pandas. My solutions have always focused on accuracy and efficiency, aligning with the requirements you have outlined for this project. I am confident in my ability to deliver a structured file with the key compensation details you need, meeting the acceptance criteria set forth. My approach will involve a strategic blend of automated scraping and manual review to ensure high-quality output. Your project aligns perfectly with my skill set, and I am eager to discuss the specifics further. Feel free to reach out to me so we can begin working on this project together and deliver exceptional results within your budget and timeframe.
€1,200 EUR in 20 days
7.4

⭐⭐⭐⭐⭐ Review all 5,000 DEF 14A HTML files and map common compensation table structures and narrative patterns to design a robust extraction plan. Build a Python pipeline using BeautifulSoup, pandas, and regex to parse tables and detect compensation fields (base salary, bonuses, stock awards, options, other compensation). Implement rule-based normalization and numeric validation to standardize extracted values across varying formats. Use targeted LLM-assisted parsing for complex narrative sections and footnotes where scripted extraction is unreliable. Create automated checks to flag missing values, anomalies, and totals mismatches for manual review. Perform structured human validation on flagged records and random batches to maintain ≤5% error rate. Deliver a clean CSV/Excel dataset with firm-year rows and clearly labeled compensation fields. Provide a documented Python notebook detailing parsing logic, LLM prompts, QC workflow, and reproducible steps for future filings. CnELIndia and Raman Ladhani ensure scalable automation, structured QA review, and reliable delivery meeting ≥95% processing and audit requirements.
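The "rule-based normalization and numeric validation" step mentioned in this bid could start from a cell parser along these lines. It is a minimal sketch, not the bidder's method: the function name `parse_amount` is hypothetical, and treating dashes and "N/A" as "not reported" (returning `None`) rather than zero is an assumption that should be agreed with the client.

```python
import re
from typing import Optional

# A trailing footnote marker such as "(1)" or "[a]" directly after a digit.
FOOTNOTE = re.compile(r"(?<=\d)\s*(\(\w{1,2}\)|\[\w{1,2}\])$")


def parse_amount(cell: str) -> Optional[float]:
    """Convert a table cell like '$ 1,250,000 (1)' to a float.

    Parenthesised values such as '(1,234)' are read as negatives.
    Anything that is not a plain number (dashes, 'N/A', empty cells)
    returns None so it can be flagged instead of silently zeroed.
    """
    text = cell.replace("\xa0", " ").strip()
    text = FOOTNOTE.sub("", text).strip()
    negative = text.startswith("(") and text.endswith(")")
    digits = re.sub(r"[$,()\s]", "", text)
    match = re.fullmatch(r"(-?)(\d+(?:\.\d+)?)", digits)
    if match is None:
        return None
    value = float(match.group(2))
    if negative or match.group(1) == "-":
        return -value
    return value
```

Negative results are kept rather than rejected here, so that a later validation pass can decide whether a negative figure is an extraction error or a genuine entry.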
€1,125 EUR in 7 days
7.6

Hello, I have thoroughly reviewed the project requirements for developing a Hybrid Scraper for DEF 14A Statements. I understand the need to extract key compensation details from 5,000 DEF 14A proxy statements in HTML format and organize them into a structured file. Let's chat and discuss it further. To handle your project, I will start with a balanced approach using Python-based parsing tools such as BeautifulSoup, pandas, regex, and potentially LLM calls for complex sections. I will combine automated scraping with targeted human review to ensure accurate extraction of compensation data from narrative text blocks and HTML tables. The deliverables will include a CSV or Excel file with labeled compensation columns, a read-me outlining extraction logic and quality control steps, and a reproducible script for future filings. Before signing-off my bid, I would like to ask a question, i.e., how would you prefer to handle footnotes and formatting quirks in the extraction process? Best Regards, Aneesa.
€750 EUR in 1 day
6.8

Dear Client, With my extensive experience in full-stack development and a keen eye for detail, I believe I'm the perfect fit for your Hybrid Scraper project. Although my main area of expertise lies in building applications and software, I've also constantly kept myself updated in the domain of data extraction and manipulation. Over the years, leveraging tools like Beautiful Soup, pandas, and regex has become second nature to me and I can adeptly integrate LLM calls to tackle even the trickiest of passages. One of my greatest virtues is the ability to strike a balance between automation and manual review. I understand that while automation ensures efficiency, there is a certain level of complexity in your project that requires targeted human intervention. This is where I would bring in my meticulousness and proficiency to ensure no formatting quirk or footnote slips through without being captured accurately. Lastly, ethical protocols matter to me as much as technical excellence does. In this project particularly, where accuracy is paramount and quality control steps are essential, you can expect nothing short of an output that complies with your acceptance criteria. Moreover, given the sheer scale of the task at hand, turnaround time is critical. Rest assured that I recognize this urgency and can deliver promptly without compromising on quality or accuracy. Trust me to build a robust pipeline with clear extraction logic and ample room for reusability! Thank you!
€1,200 EUR in 7 days
6.8

We have over 18 years of experience in data mining, web scraping, scraping bots, and Chrome/Opera extensions; we have done it all. Tell us your source and we will put it in Excel for you, or we can give you filtered results as per your requirement, in the format you want. You can also ask for data in a particular format: Excel, JSON, MySQL, databases, XML, you name it. We can further help you integrate it with your databases and create JSON outputs. We are good not only at scraping but also with the tools you may need afterwards. We can help you build software around the data. We have 99% data accuracy and a duplicate finder. We can help with statistics on the data, create APIs in front of the data, create software to manage that data, and build sites around the data.
€866 EUR in 7 days
6.9

Hi, You need to extract key compensation details from 5,000 DEF 14A proxy statements in HTML format. I will use Python with BeautifulSoup and pandas to parse the data. I will also do some manual review to catch any tricky parts. I plan to automate most of the extraction and review the results for accuracy. I can deliver a CSV file with all the details you need and a short read-me for the process. Can you tell me if there are any specific formats you need for the output? Burhan
€1,120 EUR in 14 days
7.0

Hi there,

I understand that extracting key compensation details from 5,000 DEF 14A proxy statements presents a unique challenge. With my extensive background as a top freelancer from California, where I've successfully completed various data extraction projects with 5-star reviews, I’m confident I can provide the perfect solution for you.

To tackle the intricacies of your project, I propose a hybrid approach that combines Python-based parsing using BeautifulSoup and pandas, along with regex for handling nuanced data. This will be complemented by targeted human review to ensure we capture any formatting quirks or footnotes. The deliverables would be a polished CSV file, a comprehensive read-me file detailing the extraction methodology, and a reproducible script for future filings. I'm ready to start immediately and can keep you updated throughout the process. What specific challenges do you foresee during the parsing process, and how would you like to address those?

Thanks,
€1,375 EUR in 15 days
6.4

Hi, I can build a reliable workflow to extract the executive compensation data from your DEF 14A HTML filings and organize it into a clean, structured dataset. The approach would combine Python-based parsing with targeted validation to ensure both speed and accuracy across the 5,000 filings. Using tools like BeautifulSoup, pandas, and regex, I’ll automate the extraction of compensation details from both narrative sections and embedded tables. For more complex formatting or footnotes, I can incorporate an LLM-assisted step and follow it with manual verification to catch edge cases and ensure data quality. You’ll receive a structured CSV or Excel file with clearly labeled fields for salary, stock awards, bonuses, and other compensation figures, along with a short read-me explaining the extraction logic, prompts used, and quality-control checks. I’ll also provide a reproducible script or notebook so you can rerun the pipeline on future filings. My focus will be on achieving high coverage, flagging any missing cases, and validating the data through consistency checks and sample audits. Best regards,
€1,125 EUR in 7 days
6.8

Hi there, I hope you’re doing well. I reviewed your project and see you need someone to extract executive compensation data from 5,000 HTML proxy statements. Look no further, Suryansh is here to help you! I have scraped over 1,000 websites with similar complexities including SEC filings, HTML tables, and unstructured text parsing. I understand your challenge perfectly. Pure automation misses nuances while manual work is too slow. I will build a hybrid pipeline using BeautifulSoup and pandas for structured tables, regex for narrative blocks, and targeted LLM calls for tricky footnotes. My 230-website automated scraping system runs daily without human intervention, so I know how to build reliable pipelines. My approach will be to first parse all HTML tables systematically, then use pattern matching for text blocks, flag ambiguous cases for quick human review, and run sanity checks on all numeric fields. I will deliver a clean CSV with proper column headers, a documented script you can rerun on future filings, and a readme explaining the logic and QC steps. Skills & Experience: ✅ Web Scraping ✅ BeautifulSoup & Pandas ✅ HTML Table Parsing ✅ Regex & Text Extraction ✅ LLM Integration ✅ Data Cleaning & Validation ✅ SEC Filings Experience ✅ CSV & Excel Export ✅ Python Automation ✅ Quality Control Systems Waiting for your response! Best Regards, Suryansh
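The "regex for narrative blocks" idea in this bid might be sketched as below. The pattern and the function name `narrative_salaries` are assumptions for illustration only; proxy statements phrase salary disclosures in many ways, so every hit would still need the human review the bid describes, and the surrounding snippet is returned for exactly that purpose.

```python
import re

# Matches phrases such as "base salary of $750,000" or "base salary was $ 410,500".
SALARY_SENTENCE = re.compile(
    r"base salary\s+(?:of|was|to|at)\s+\$\s?(\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE,
)


def narrative_salaries(text: str) -> list:
    """Return (amount, snippet) pairs for each base-salary mention found."""
    hits = []
    for match in SALARY_SENTENCE.finditer(text):
        amount = float(match.group(1).replace(",", ""))
        # Keep some leading context so a reviewer can see whose salary it is.
        start = max(0, match.start() - 60)
        snippet = text[start:match.end()].strip()
        hits.append((amount, snippet))
    return hits
```

The pattern deliberately does not try to attach the amount to an executive; linking a figure to a person is the kind of ambiguous case that would be flagged for review or passed to an LLM step.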
€1,000 EUR in 7 days
6.7

Hi there, I’ve reviewed your project and understand you need to extract executive compensation data from about 5,000 DEF 14A HTML proxy statements and structure the results into a clean dataset. Because compensation details appear across narrative text and embedded tables, the solution requires a hybrid approach combining automated parsing with careful validation to ensure high accuracy. I can build a Python based extraction pipeline using BeautifulSoup, pandas, and regex to locate summary compensation and grants tables, while also parsing narrative sections where compensation details appear. For complex formatting or ambiguous text, I can integrate LLM assisted extraction to improve coverage. The workflow will include validation checks and flagged cases for manual review so we reach the ≥95 percent processing requirement while maintaining strong field level accuracy. You’ll receive a structured CSV or Excel dataset with clearly labeled compensation fields per firm year filing, along with a reproducible script or notebook and a short README explaining the extraction logic, prompts, and quality control process. The pipeline will also log missing or uncertain cases and run numeric sanity checks to ensure reliable outputs. Best regards, Muhammad Adil Portfolio: https://www.freelancer.com/u/webmasters486
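Locating the summary compensation table, as this bid proposes, could be done by scoring each table's header rows against expected column names. The sketch below uses Python's standard-library `html.parser` instead of BeautifulSoup purely to stay dependency-free; the keyword list, the three-keyword threshold, and the names `TableCollector` and `find_summary_table` are assumptions, and nested or malformed tables are not handled robustly.

```python
from html.parser import HTMLParser

# Illustrative header keywords; real filings vary in wording.
HEADER_KEYWORDS = ("salary", "bonus", "stock awards", "option awards", "total")


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._stack = []   # tables currently open
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([])
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse whitespace (including &nbsp;).
            text = " ".join("".join(self._cell).split())
            self._stack[-1][-1].append(text)
            self._cell = None
        elif tag == "table" and self._stack:
            self.tables.append(self._stack.pop())

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def find_summary_table(html: str):
    """Return the table whose first rows mention the most keywords, or None."""
    collector = TableCollector()
    collector.feed(html)
    best, best_score = None, 0
    for table in collector.tables:
        header_text = " ".join(" ".join(row) for row in table[:3]).lower()
        score = sum(keyword in header_text for keyword in HEADER_KEYWORDS)
        if score > best_score:
            best, best_score = table, score
    return best if best_score >= 3 else None
```

Returning `None` when no table clears the threshold gives the pipeline an explicit "table not found" case to log, which fits the bid's plan to record missing or uncertain filings.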
€1,100 EUR in 10 days
6.1

As a seasoned Data Analyst with over 16 years of experience, I am confident in my ability to provide you with a well-rounded solution for scraping and parsing the DEF 14A statements. My extensive knowledge and expertise in Python programming paired with data analysis make me the ideal candidate for this task. I can leverage BeautifulSoup, pandas, and regex to effectively extract the compensation details you need while ensuring seamless integration and a structured format. Understanding the unique challenges posed by different data formats, I propose a balanced workflow that incorporates both automated parsing and meticulous human review. This strategy maximizes the accuracy and comprehensiveness of information extraction. Additionally, I assure you of a reproducible script or notebook for future use. Having successfully completed several complex data extraction projects, including parsing SEC filings, my approach is both thorough and efficient. Appreciating your need for a quick turnaround time, I am confident in delivering within your timelines while upholding industry-standard quality control measures.
€3,333 EUR in 99 days
6.2

As a highly experienced software engineer with a strong background in data analysis and extraction, I am confident in my ability to complete this project to the highest quality. I understand the challenges you are facing with your DEF 14A statements dataset and the need for a robust hybrid solution. My range of skills, including expertise in Python, web scraping, and software architecture, uniquely position me to create an efficient scraping system blending automated processing with targeted human review. Drawing from my well-honed skillset which includes the use of BeautifulSoup, pandas, regex, as well as extensive cloud technology knowledge, I am adept at extracting complex elements from various sources. Moreover, my experience in working with large datasets ensures that not only will ≥ 95% of filings be processed but also that missing cases will be identified and logically flagged with reasons.
€1,125 EUR in 7 days
6.4

Hello, I’m a Python data engineer with strong experience in large-scale web scraping, SEC filing parsing, and hybrid automation workflows combining BeautifulSoup, pandas, regex, and LLM-assisted extraction. I can build a reliable pipeline to extract executive compensation data from both narrative sections and HTML tables, with structured output in CSV/Excel format. The solution will include validation checks, missing-case flagging, and quality control to meet your ≥95% coverage and low error-rate acceptance criteria. I will also provide a clean, reproducible script/notebook and a concise README explaining extraction logic and review steps. I’m comfortable designing balanced automation + targeted manual review processes for messy financial disclosures. Available to start immediately and can share a realistic timeline after reviewing sample filings. Looking forward to working with you.
€750 EUR in 5 days
6.0

Hi there, Thanks for posting this exciting project. I checked your project carefully, and I think I can complete it within your needed timeline. I am highly proficient in Python, Data Processing, Web Scraping, Software Architecture, Data Extraction, and Data Analysis. Please ping me, I am always online here. Thanks, Efanntyo
€750 EUR in 14 days
5.9

Hello client, I’ve carefully reviewed your job description and have strong experience in these Data Extraction, Data Processing, Web Scraping, Software Architecture, Data Analysis and Python. I can build a reliable web scraping solution tailored specifically to your needs. Whether using Node.js with Puppeteer/Cheerio or Python with Selenium/BeautifulSoup, I will extract, clean, and organize your data efficiently. I also handle anti-bot protections, pagination, and full automation as required. As you can see from my profile, my web scraping reviews are excellent, reflecting my commitment to quality work. I focus on writing clean, maintainable, and scalable code because I know the difference between 99% and 100%. If you hire me, I’ll do my best until you’re completely satisfied with the result. Let’s discuss your target website and preferred data format. Thanks, Denis
€750 EUR in 10 days
5.7

Segovia, Spain
Payment method verified
Member since Oct 17, 2019