How to get metadata from pdf's?

Resolved Knut23
(@knut23)

9 years, 4 months ago

Hello!

First of all: Thanks for the work you put in on this plugin and for making it available to us php/wordpress disabled people! And for bearing with our endless questions!

I would like to extract the document title, keywords and description from my PDFs. I was hoping these would populate the metadata fields in WordPress:
Document title (PDF) –> Attachment title (WP)
Description (PDF) –> Attachment Description field (WP)
and keywords in a custom field. Is this an erroneous assumption? Does everything end up in custom fields?

I have spent several hours trying to figure out how to get hold of the metadata in PDFs. I am a little bit ashamed to admit that I have gotten basically nowhere.
I created a custom field called “MLA_keywords” (on the IPTC/EXIF tab/Add a new Field and Mapping Rule) and filled in pdf:ALL_PDF in the field for “EXIF/Template Value” (according to this post). Ran the rule and got keywords for an IMAGE! Nothing for the PDFs.

I conclude that I must be doing something wrong. I can't even find the metadata I am after in the drop-down menus (except for “keywords”). Could someone please provide me with a write-up on how to get metadata from PDFs?

Hope my questions are understandable…

PS Using Version 1.95 of MLA DS

Thanks in advance!
/Richard

https://wordpress.org/plugins/media-library-assistant/

Viewing 3 replies - 1 through 3 (of 3 total)

Plugin Author David Lingren
(@dglingren)

9 years, 4 months ago

Thank you for your kind words; positive feedback is a great motivator to keep working on the plugin and supporting its users.

I have found that PDF meta data is more uneven than that found in images. A lot depends on the application that created the document.

It would be very helpful if you could post a link to one or a few of your documents, so I can examine them and see what’s available for you to work with. If you would rather send them by e-mail you can go to the Contact Us page and give me your contact information. I will send you my e-mail address:

Fair Trade Judaica/Contact Us

As you’ve found, the ALL_PDF pseudo variable can be helpful. An easy way to use it for inspecting your documents is to create a post or page and add this [mla_gallery] shortcode to display the information:
```
<h3>ALL_PDF</h3>
[mla_gallery post_mime_type='application/pdf' post_parent=all size=none mla_caption='{+base_file+}<br>{+pdf:ALL_PDF+}' columns=2]
```
With this approach you can avoid the effort of creating a custom field and mapping rule. Let me know if that helps, and consider posting or sending me some of the results.

I am confident we can get this working for you, if your documents contain the information you want. I look forward to hearing from you here or by e-mail.

Thread Starter Knut23
(@knut23)

9 years, 4 months ago

Hi David!

Thanks!!! I have sent you an email with links to the pdf’s.

Regards,
Richard

Plugin Author David Lingren
(@dglingren)

9 years, 3 months ago

Thank you for following up with your contact information and a link to the documents you are working with. Here is a summary of my e-mail response to you:

The MLA parsing code tries to populate the following fields from a variety of sources:
```
/*
 * Try to populate all the PDF-standard keys (except Trapped)
 * Title - The document's title
 * Author - The name of the person who created the document
 * Subject - The subject of the document
 * Keywords - Keywords associated with the document
 * Creator - the name of the conforming product that created the original document
 * Producer - the name of the conforming product that converted it to PDF
 * CreationDate - The date and time the document was created
 * ModDate - The date and time the document was most recently modified
 */
```
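For readers curious what these keys look like inside a file, here is a minimal Python sketch. It is not MLA's parsing code; it only assumes the simplest case, where the document information dictionary is stored uncompressed with plain literal strings such as `/Title (Annual Report)`. The function names `read_info_entries` and `parse_pdf_date` and the sample bytes are made up for illustration. Real documents often use hex or UTF-16 strings, compressed object streams, or XMP packets, which this sketch ignores.

```python
import re
from datetime import datetime, timedelta, timezone

# The standard Info dictionary keys from the list above (Trapped is skipped).
INFO_KEYS = ("Title", "Author", "Subject", "Keywords",
             "Creator", "Producer", "CreationDate", "ModDate")


def read_info_entries(pdf_bytes):
    """Return {key: text} for simple entries such as /Title (My title).

    Simplification: only uncompressed literal strings without escapes or
    nested parentheses are recognized; anything else is silently skipped.
    """
    found = {}
    for key in INFO_KEYS:
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(([^()\\]*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


def parse_pdf_date(value):
    """Parse a PDF date such as D:20150312143000+01'00'; return None on failure."""
    m = re.fullmatch(
        r"(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
        r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?",
        value.strip(),
    )
    if not m:
        return None
    # Everything after the year is optional; fall back to the spec defaults.
    defaults = (0, 1, 1, 0, 0, 0)
    parts = [int(p) if p else d for p, d in zip(m.groups()[:6], defaults)]
    tz = None
    if m.group(7):
        tz = timezone.utc
    elif m.group(8):
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
        tz = timezone(-offset if m.group(8) == "-" else offset)
    try:
        return datetime(*parts, tzinfo=tz)
    except ValueError:  # e.g. month 13
        return None


# Hypothetical fragment of a PDF file, for illustration only.
sample = (b"1 0 obj\n<< /Title (Annual Report) /Keywords (budget, 2015)\n"
          b"/CreationDate (D:20150312143000+01'00') >>\nendobj\n")
info = read_info_entries(sample)
print(info)
print(parse_pdf_date(info["CreationDate"]))
```

Keys that are absent from the file, such as Author in the sample, are simply missing from the result. That is the same situation the mapping rules have to cope with when a document lacks a field.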
You can find more information about this in the “Metadata in PDF documents” section of the Settings/Media Library Assistant Documentation tab. I suggest you use those fields as a starting point for mapping the data. Here are the IPTC/EXIF Mapping Rules I came up with:

Title: template:([+pdf:Title+])
Caption: template:([+pdf:Subject+])
Att. Categories: template:([+pdf:Keywords,array+])

The three rules have a similar structure:
- “template:” (goes in the text box below “EXIF/Template Value”) is used to access the pdf: values instead of the EXIF values.
- The values are surrounded by parentheses “(” and “)” so they will return an empty string for documents without meta data in the field and for other items such as images.
- I have selected “Replace” to overwrite the existing text, because a default Title was assigned to the items when they were uploaded. You can change this to “Keep” if you already have values in one or more of the fields that you want to retain.

The taxonomy rule also has the “,array” option to return multiple keywords as individual array elements that can be converted to taxonomy terms.
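To make the behaviour of the parentheses and the “,array” option easier to picture, here is a rough Python model of the rules described above. It is only a sketch, not how MLA implements templates: `render_conditional` and `keywords_to_terms` are hypothetical names, and splitting keywords on commas and semicolons is my assumption, since the thread does not say which delimiters are recognized.

```python
import re

# Matches substitution parameters written as [+name+].
PLACEHOLDER = re.compile(r"\[\+([^+\]]+)\+\]")


def render_conditional(template, values):
    """Model of a parenthesized template such as ([+pdf:Title+]).

    If any parameter inside the parentheses has no value, the whole
    section yields an empty string instead of leftover markup.
    """
    inner = template.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    names = PLACEHOLDER.findall(inner)
    if any(not values.get(name) for name in names):
        return ""
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), inner)


def keywords_to_terms(raw):
    """Split a keyword string into individual terms.

    ASSUMPTION: commas and semicolons act as separators; the actual
    delimiter handling is not stated in the thread.
    """
    return [term.strip() for term in re.split(r"[;,]", raw or "") if term.strip()]


# Hypothetical items: a PDF with a title but no subject, and an image.
pdf_item = {"pdf:Title": "Annual Report", "pdf:Keywords": "budget, 2015"}
image_item = {}

print(repr(render_conditional("([+pdf:Title+])", pdf_item)))
print(repr(render_conditional("([+pdf:Subject+])", pdf_item)))
print(repr(render_conditional("([+pdf:Title+])", image_item)))
print(keywords_to_terms(pdf_item.get("pdf:Keywords")))
```

The point of the model is that a missing field produces an empty value rather than the literal text of the template, and that one keyword string becomes several separate terms.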

I have also checked the “Enable IPTC/EXIF Mapping when adding new media” box so the rules are automatically applied when new items are uploaded.

After you enter and save the rules you can test them out on single documents by clicking the “Map IPTC/EXIF Metadata” link in the “Save” meta box on the Media/Edit Media screen. You can also use the Media/Assistant Bulk Edit area to experiment on several items at once. Of course, you can also use the “Map All … ” buttons in the IPTC/EXIF tab if you are feeling brave/lucky.

I hope that my response got you the results you wanted. I am marking this topic resolved, but please update it if you have any problems or further questions regarding the mapping of PDF meta data to WordPress fields and taxonomies. Thank you for your interest in the plugin.


The topic ‘How to get metadata from pdf's?’ is closed to new replies.