Cracking E-books

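E-books downloaded from the web usually come in formats such as chm, pdf, and exe, but they often refuse to open or block copying of the text. I recently ran into one e-book in exe format and one in pdf format.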

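In the EXE e-book, the text could be selected but not copied. I first tried CyberArticle (I remembered that it has a decompile feature), but this exe e-book was not made with CA, so it could not be decompiled. (CA can only decompile exe e-books that it created itself, turning them back into its book format.) I then found unWebcompiler online (most exe e-books are built with Webcompiler). It is a small tool that needs no installation and works well: it unpacks an exe book into several folders sorted by type (images, text, and so on), with the text mostly converted to html, which can be copied. Problem solved; recommended.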


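I also came across a PDF e-book in which all the text exists as images inside text boxes, so it cannot be selected, let alone copied; even Adobe Acrobat Professional cannot do it. In the past, when exporting Chinese documents, the usual difficulty was that the text could be selected but came out as garbage when copied. I suspect incompatible font styles or wrong character-set codes, so that the exported file shows only garbage when opened in Word or WordPad. When only part of the text is garbled (punctuation and symbols), manually replacing the character-set encoding fixes it. When the whole document is garbled there is nothing to be done, and I still have not solved that problem. The cause is said to be that the PDF file uses a custom font, so the converted file cannot be displayed correctly. (A PDF file can carry its own fonts in two ways: carrying a complete font, called font embedding; or carrying only the characters of a font that are actually used, called font subsetting.) There are related articles on codeproject that provide source code for converting PDF to plain text.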
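As a small illustration of font subsetting: the PDF specification names subset fonts with a tag of six uppercase letters followed by a plus sign (for example ABCDEF+SimSun), so spotting such names is one rough way to tell that a file carries only part of a font. The sketch below checks a name for that pattern. The function name is_subset_font_name is made up for this example, and finding such a name does not by itself prove that copied text will come out garbled.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `name` looks like a subset font name,
   i.e. six uppercase ASCII letters, a plus sign, then at least one
   more character (for example "ABCDEF+SimSun"). */
bool is_subset_font_name(const char *name)
{
    if (strlen(name) < 8) return false;      /* tag (6) + '+' + 1 char */
    for (int i = 0; i < 6; i++) {
        if (name[i] < 'A' || name[i] > 'Z') return false;
    }
    return name[6] == '+';
}
```

Calling it on the font names found in a PDF's font dictionaries would give a quick hint about which fonts are subsets.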


I found this one:
Introduction
PDF documents are commonly used and their content is usually compressed. This article shows a simple C code that can be used to extract plain text from the PDF file.

Why?
Adobe does allow you to submit PDF files and will extract the text or HTML and mail it back to you. But there are times when you need to extract the text yourself or do it inside an application. You may also want to apply special formatting (e.g., add tabs) so that the text can be easily imported into Excel, for example when your PDF document mostly contains tables that you need to port to Excel (which is how this code got developed).

There are several projects on "The Code Project" that show how to create PDF documents, but none that provide free code that shows how to extract text without using a commercial library. In the reader comments, a need was expressed for code just like what is being supplied here.

There are several libraries out there that read or create PDF files, but you have to register them for commercial use or sign various agreements. The code supplied here is very simple and basic, but it is entirely free. It only uses the ZLIB library, which is also free.

Basics
You can download documents such as PDFReference15_v5.pdf from here that explain some of the internals of PDF files. In short, each PDF file contains a number of objects. Each object may require one or more filters to decompress it and may also provide a stream of data. Text streams are usually compressed using the FlateDecode filter and may be uncompressed using code from the ZLIB (http://www.zlib.org/) library.

The data for each object can be found between the "stream" and "endstream" keywords. Once inflated, the data needs to be processed to extract the text. The data usually contains one or more text objects (starting with BT and ending with ET) with formatting instructions inside. You can learn a lot about the structure of a PDF file by stepping through this application.
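As a rough sketch of how the filter could be checked, the helper below looks for the name /FlateDecode in the object text just before a "stream" keyword. The function name names_flate_decode and the backwards scan to the nearest "obj" keyword are assumptions made for this illustration. Real PDFs can also list several filters in an array, in which case finding /FlateDecode alone does not mean inflating is enough.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if "/FlateDecode" appears in the text between
   the nearest preceding "obj" keyword (or the start of the buffer) and
   the "stream" keyword located at offset `stream_pos`. */
bool names_flate_decode(const char *buf, size_t stream_pos)
{
    static const char key[] = "/FlateDecode";
    const size_t klen = sizeof(key) - 1;

    /* Walk backwards until the three bytes before `start` spell "obj". */
    size_t start = stream_pos;
    while (start >= 3 && memcmp(buf + start - 3, "obj", 3) != 0) start--;
    if (start < 3) start = 0;

    /* Binary-safe search for the filter name inside that range. */
    for (size_t i = start; i + klen <= stream_pos; i++) {
        if (memcmp(buf + i, key, klen) == 0) return true;
    }
    return false;
}
```

A caller could skip streams for which this returns false instead of trying to inflate them.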

About Code
This single source code file contains very simple, very basic C code. It initially reads the entire PDF file into one buffer and then repeatedly scans for "stream" and "endstream" sections. It does not check which filter should be applied and always assumes FlateDecode. (If it gets it wrong, usually no output is generated for that section of the file, so it is not a big issue.) Once the data stream is inflated (uncompressed), it is processed. During the processing, the code searches for the BT and ET tokens that signify text objects. The contents of each text object are processed to extract the text, and a guess is made as to whether tabs or new line characters are needed.
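The first step, reading the whole file into one buffer, might look roughly like the sketch below. The function name read_whole_file and its error handling are assumptions for illustration, not taken from pdf.c. The file is opened in binary mode because PDF data contains bytes that text mode could alter on Windows.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the file at `path` into a malloc'ed buffer.
   On success returns the buffer and stores its size in *out_len;
   on failure returns NULL. The caller frees the buffer. */
char *read_whole_file(const char *path, size_t *out_len)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    char *buf = NULL;
    if (fseek(f, 0, SEEK_END) == 0) {
        long size = ftell(f);
        if (size >= 0 && fseek(f, 0, SEEK_SET) == 0) {
            buf = malloc((size_t)size + 1);
            if (buf && fread(buf, 1, (size_t)size, f) == (size_t)size) {
                buf[size] = '\0';          /* convenience terminator only */
                *out_len = (size_t)size;
            } else {
                free(buf);                 /* free(NULL) is harmless */
                buf = NULL;
            }
        }
    }
    fclose(f);
    return buf;
}
```

The returned length, not the terminator, should be used when scanning, since the buffer can contain zero bytes.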

The code is far from complete or being any sort of general utility class, but it does demonstrate how you can extract the text yourself. It is enough to show you how and get you going.

The code is however fully functional, so when it is applied to a PDF document, it generally does a fair job of extracting the text. It has been tested on several PDF files.

This code is supplied as is, no warranties. Use at your own risk.

Using The Code
The download contains one C file. To use it, create a simple Windows 32 Console project and add the pdf.c file to the project. You also need to go here (bless them!) and download the free "zlib compiled DLL" zip file. Extract zdll.lib to your project directory and add it as a project dependency (link against it). Also put zlib1.dll in your project directory. Also put zconf.h and zlib.h in your project directory and add them to the project.

Now, step through the application and note that the input PDF and output text file names are hardwired at the start of the main method.

Future Enhancements
If there is enough interest, the author may consider uploading a release version with a Windows interface. The code is quite good for extracting data from tables in a form that can be readily imported into Excel, with the columns preserved (because of the tabs that get added).

Code Snippets
Stream sections are initially located using:

size_t streamstart = FindStringInBuffer (buffer, "stream", filelen);
size_t streamend = FindStringInBuffer (buffer, "endstream", filelen);

And then once the data portion is identified, it is inflated as follows:

z_stream zstrm;
ZeroMemory(&zstrm, sizeof(zstrm));
zstrm.avail_in = streamend - streamstart + 1;
zstrm.avail_out = outsize;
zstrm.next_in = (Bytef*)(buffer + streamstart);
zstrm.next_out = (Bytef*)output;
int rsti = inflateInit(&zstrm);
if (rsti == Z_OK)
{
    int rst2 = inflate (&zstrm, Z_FINISH);
    if (rst2 >= 0)
    {
        //Ok, got something, extract the text:
        size_t totout = zstrm.total_out;
        ProcessOutput(fileo, output, totout);
    }
    //Release zlib's internal state for this stream:
    inflateEnd(&zstrm);
}
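FindStringInBuffer itself is not shown in these snippets. A minimal sketch of a search helper in the same spirit is given below, under a different name (find_in_buffer) to make clear it is a guess and not the article's code. It uses memcmp because strstr would stop at the first zero byte in the PDF data, and returning len when nothing is found is a convention chosen for this example. The second function, stream_data_start, is also hypothetical; it skips the "stream" keyword and the CR LF or LF that follows it, which is where the compressed bytes begin.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical search helper: offset of the first occurrence of `needle`
   in buf[0..len), or len if it does not occur. Binary-safe. */
size_t find_in_buffer(const char *buf, const char *needle, size_t len)
{
    size_t n = strlen(needle);
    if (n == 0 || n > len) return len;
    for (size_t i = 0; i + n <= len; i++) {
        if (memcmp(buf + i, needle, n) == 0) return i;
    }
    return len;
}

/* Hypothetical helper: given the offset of a "stream" keyword, returns the
   offset of the first data byte after the keyword and its end-of-line. */
size_t stream_data_start(const char *buf, size_t keyword_pos, size_t len)
{
    size_t p = keyword_pos + 6;              /* strlen("stream") */
    if (p < len && buf[p] == '\r') p++;
    if (p < len && buf[p] == '\n') p++;
    return p;
}
```

Note that a plain search for "stream" also matches the tail of "endstream", so a scanner that walks the whole file has to resume searching after each "endstream" it consumes.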


The main work gets done in the ProcessOutput method which processes the uncompressed stream to extract text portion of any text object. It looks as follows:

void ProcessOutput(FILE* file, char* output, size_t len)
{
    //Are we currently inside a text object?
    bool intextobject = false;

    //Is the next character literal
    //(e.g. \\ to get a \ character or \( to get ( ):
    bool nextliteral = false;

    //() Bracket nesting level. Text appears inside ()
    int rbdepth = 0;

    //Keep previous chars to extract numbers etc.:
    char oc[oldchar];
    int j=0;
    for (j=0; j<oldchar; j++) oc[j]=' ';

    for (size_t i=0; i<len; i++)
    {
        char c = output[i];
        if (intextobject)
        {
            if (rbdepth==0 && seen2("TD", oc))
            {
                //Positioning.
                //See if a new line has to start or just a tab:
                float num = ExtractNumber(oc,oldchar-5);
                if (num>1.0)
                {
                    fputc(0x0d, file);
                    fputc(0x0a, file);
                }
                if (num<1.0)
                {
                    fputc('\t', file);
                }
            }
            if (rbdepth==0 && seen2("ET", oc))
            {
                //End of a text object, also go to a new line.
                intextobject = false;
                fputc(0x0d, file);
                fputc(0x0a, file);
            }
            else if (c=='(' && rbdepth==0 && !nextliteral)
            {
                //Start outputting text!
                rbdepth=1;
                //See if a space or tab (>1000) is called for by looking
                //at the number in front of (
                int num = ExtractNumber(oc,oldchar-1);
                if (num>0)
                {
                    if (num>1000.0)
                    {
                        fputc('\t', file);
                    }
                    else if (num>100.0)
                    {
                        fputc(' ', file);
                    }
                }
            }
            else if (c==')' && rbdepth==1 && !nextliteral)
            {
                //Stop outputting text
                rbdepth=0;
            }
            else if (rbdepth==1)
            {
                //Just a normal text character:
                if (c=='\\' && !nextliteral)
                {
                    //Only print out next character
                    //no matter what. Do not interpret.
                    nextliteral = true;
                }
                else
                {
                    nextliteral = false;
                    if ( ((c>=' ') && (c<='~')) || ((c>=128) && (c<255)) )
                    {
                        fputc(c, file);
                    }
                }
            }
        }
        //Store the recent characters for
        //when we have to go back for a number:
        for (j=0; j<oldchar-1; j++) oc[j]=oc[j+1];
        oc[oldchar-1]=c;
        if (!intextobject)
        {
            if (seen2("BT", oc))
            {
                //Start of a text object:
                intextobject = true;
            }
        }
    }
}
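ProcessOutput calls seen2 and ExtractNumber and uses the constant oldchar, none of which appear in the snippets above. The sketch below is a guess at what they might look like, inferred only from how they are called: seen2 is assumed to test whether a two-letter operator, surrounded by whitespace, sits at the end of the history window, and ExtractNumber is assumed to parse the unsigned number that ends at a given index. The value 15 for oldchar and the helper is_pdf_space are likewise assumptions.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Assumed size of the history window; the real value is not shown above. */
#define oldchar 15

static bool is_pdf_space(char c)
{
    return c == ' ' || c == '\r' || c == '\n' || c == '\t';
}

/* Guess: true when the two-letter operator `op` occupies the two slots
   before the newest character of `recent`, with whitespace on both sides. */
bool seen2(const char *op, const char *recent)
{
    return is_pdf_space(recent[oldchar - 4])
        && recent[oldchar - 3] == op[0]
        && recent[oldchar - 2] == op[1]
        && is_pdf_space(recent[oldchar - 1]);
}

/* Guess: parse the unsigned decimal number ending at index `last` of
   `recent` (trailing spaces are skipped); returns -1 if none is found. */
float ExtractNumber(const char *recent, int last)
{
    int end = last;
    while (end > 0 && recent[end] == ' ') end--;

    int start = end;
    while (start >= 0 &&
           (isdigit((unsigned char)recent[start]) || recent[start] == '.'))
        start--;

    int n = end - start;                     /* digits live in (start, end] */
    if (n <= 0) return -1.0f;

    char buf[oldchar + 1];
    memcpy(buf, recent + start + 1, (size_t)n);
    buf[n] = '\0';
    return (float)atof(buf);
}
```

With these guesses, the offset oldchar-5 used after a TD operator points at the last digit of the number just before " TD ", and oldchar-1 points at the character just before an opening bracket. The sign of the number is ignored, which is a simplification of this sketch.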


I am still learning this technique.

Special thanks also go to Ma Jian for collecting and organizing the material on cracking e-books.