Seeking guidance on preserving formatting

Hello everyone! I hope everything is fine.

I am currently developing a web app that translates an input file and generates the target output file in .sdlxliff format.

The problem I am facing is that text is extracted from the input file as plain text, so the formatting is not preserved. I want the extracted text to keep the formatting of the input file, and I want the same formatting to be applied to the final .sdlxliff output file. I am sharing my text extraction and .sdlxliff generation logic here; it is written in Python.

I would appreciate guidance on the right approach to text extraction and .sdlxliff creation.

My current logic for text extraction is:

from typing import List

from docx import Document          # python-docx
from openpyxl import load_workbook
from pypdf import PdfReader        # or PyPDF2, depending on what is installed


def extract_text(file_path: str, extension: str) -> List[str]:
    """Return the non-empty text segments of a .docx, .pdf or .xlsx file as plain strings."""
    segments = []
    if extension == '.docx':
        doc = Document(file_path)
        # para.text joins all runs of a paragraph, so bold/italic/etc. is lost here
        segments = [para.text.strip() for para in doc.paragraphs if para.text.strip()]
    elif extension == '.pdf':
        reader = PdfReader(file_path)
        for page in reader.pages:
            page_text = page.extract_text()
            if page_text:
                # Split by newlines, filter out empty lines
                segments.extend(line.strip() for line in page_text.split('\n') if line.strip())
    elif extension == '.xlsx':
        # data_only=True returns cached formula results instead of the formulas
        wb = load_workbook(file_path, data_only=True)
        for sheet_name in wb.sheetnames:
            ws = wb[sheet_name]
            # values_only=True yields bare cell values, without any cell styling
            for row in ws.iter_rows(values_only=True):
                row_text = " ".join(str(cell) if cell is not None else "" for cell in row).strip()
                if row_text:
                    segments.append(row_text)
    return segments
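
For the .docx branch, one direction that might work is to read formatting at the run level instead of using para.text, since python-docx exposes bold and italic per run. The following is only a minimal sketch under assumptions: FormattedRun, runs_from_docx and to_tagged_text are hypothetical names, the <b>/<i> markers are placeholder inline tags rather than anything SDLXLIFF-specific, and formatting inherited from a paragraph or character style (where run.bold is None) is treated as "not set".

```python
from dataclasses import dataclass
from typing import List
from xml.sax.saxutils import escape


@dataclass
class FormattedRun:
    """Hypothetical container: one stretch of text with uniform formatting."""
    text: str
    bold: bool = False
    italic: bool = False


def runs_from_docx(file_path: str) -> List[List[FormattedRun]]:
    """Read every non-empty paragraph as a list of formatted runs."""
    # Imported lazily so the rest of the sketch works without python-docx.
    from docx import Document

    paragraphs = []
    for para in Document(file_path).paragraphs:
        # run.bold / run.italic are True, False or None (None = inherited);
        # bool() maps None to False, so inherited formatting is NOT detected.
        runs = [
            FormattedRun(run.text, bool(run.bold), bool(run.italic))
            for run in para.runs
            if run.text
        ]
        if any(r.text.strip() for r in runs):
            paragraphs.append(runs)
    return paragraphs


def to_tagged_text(runs: List[FormattedRun]) -> str:
    """Render runs as one segment string with placeholder <b>/<i> inline tags."""
    parts = []
    for run in runs:
        text = escape(run.text)  # escape &, < and > in the actual content
        if run.italic:
            text = f"<i>{text}</i>"
        if run.bold:
            text = f"<b>{text}</b>"
        parts.append(text)
    return "".join(parts)
```

With this, a paragraph made of the runs "Hello ", a bold "world" and "!" would become a single segment in which only "world" is wrapped in the bold marker, so the formatting boundaries survive until the output file is written. The .pdf and .xlsx branches would need their own handling, which this sketch does not attempt.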



Everything apart from formatting preservation works as expected.
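
On the output side, my understanding is that .sdlxliff builds on XLIFF 1.2 with SDL-specific extensions, and that XLIFF 1.2 carries inline formatting inside a segment with elements such as <g>. The sketch below only illustrates that generic XLIFF 1.2 idea with the standard library; build_trans_unit and the (text, ctype) pair layout are hypothetical, the fragment omits the XLIFF namespace and file header, and the SDL-specific tag definitions that a real .sdlxliff file needs are not covered.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# (text, ctype) pairs; ctype is None for unformatted text, otherwise an
# XLIFF 1.2 ctype value such as "bold" or "italic".
Piece = Tuple[str, Optional[str]]


def build_trans_unit(unit_id: str, pieces: List[Piece]) -> ET.Element:
    """Build a <trans-unit> whose <source> wraps formatted text in <g> elements."""
    unit = ET.Element("trans-unit", {"id": unit_id})
    source = ET.SubElement(unit, "source")
    last_g = None  # most recent <g>; plain text after it goes into its tail
    tag_id = 0
    for text, ctype in pieces:
        if ctype is None:
            if last_g is None:
                source.text = (source.text or "") + text
            else:
                last_g.tail = (last_g.tail or "") + text
        else:
            tag_id += 1
            last_g = ET.SubElement(source, "g", {"id": str(tag_id), "ctype": ctype})
            last_g.text = text
    return unit


if __name__ == "__main__":
    unit = build_trans_unit("1", [("Hello ", None), ("world", "bold"), ("!", None)])
    print(ET.tostring(unit, encoding="unicode"))
```

ElementTree escapes the text content itself, so the segment text should be passed in unescaped. Whether a translation tool accepts such a fragment depends on the rest of the file, which is exactly the part I am unsure about.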


Moved to code block.
[edited by: Paul at 5:10 PM (GMT 0) on 14 Mar 2025]